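1. Institute of Big Data Technology and Applications, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060
2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107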
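[ "Kaijing LUO (1997- ), male, is a master's student at the College of Computer Science and Software Engineering, Shenzhen University. His main research interests are parallel and distributed computing technologies for big data and big data system computing technology." ]
[ "Yuming ZHANG (1998- ), male, is a master's student at the College of Computer Science and Software Engineering, Shenzhen University. His main research interests are data mining and big data system computing technology." ]
[ "Yulin HE (1982- ), male, Ph.D., is an associate researcher at the Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen). His main research interests include big data intelligent computing, multi-sample statistical analysis, and data mining and machine learning algorithms and their applications." ]
[ "Zhexue HUANG (1959- ), male, Ph.D., is a Distinguished Professor at Shenzhen University, director of the Institute of Big Data Technology and Applications, and deputy director of the National Engineering Laboratory for Big Data System Computing Technology. He received his Ph.D. from the Royal Institute of Technology, Sweden, in 1993. His main research interests include data mining, distributed machine learning systems, parallel processing and analysis of big data, and big data system computing technology." ]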
Published online: 2024-05
Published in print: 2024-05-15
LUO Kaijing, ZHANG Yuming, HE Yulin, et al. Bootstrap sample big data model and distributed ensemble learning method[J]. Big Data Research, 2024, 10(3): 93-108. DOI: 10.11959/j.issn.2096-0271.2024002. (in Chinese)
Kaijing LUO, Yuming ZHANG, Yulin HE, et al. Bootstrap sample partition data model and distributed ensemble learning[J]. Big Data Research, 2024, 10(3): 93-108. DOI: 10.11959/j.issn.2096-0271.2024002.
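Traditional Bootstrap sampling and Bagging ensemble learning are usually implemented serially. They are computationally inefficient, their samples cannot be reused, and they scale poorly, which makes them unsuitable for efficient large-scale Bagging ensemble learning. Starting from the perspective of distributed big data computing, this paper proposes a new Bootstrap sample partition (BSP) big data model and a distributed ensemble learning method. Using a distributed generation algorithm, the BSP data model represents the training data as a collection of distributed Bootstrap sample sets stored as HDFS distributed data files, providing data support for subsequent distributed ensemble learning. The distributed ensemble learning method randomly selects multiple BSP data blocks from the BSP data model, reads them into the virtual machines on the cluster nodes, and applies a serial algorithm to the selected blocks in parallel to compute statistics or train models. All sub-results are then returned to the master node to produce the final ensemble result; a quality-based selection of sub-results can be added at this stage to further improve prediction performance. Both BSP data model generation and distributed ensemble learning use a non-MapReduce computing paradigm in which each data block is processed independently, reducing data communication overhead among compute nodes. The proposed algorithms are implemented as new operators in the open-source Spark system and can be called by Spark applications. Experiments show that the new method efficiently generates the BSP data model of the training data and improves the reusability of data samples. In large-scale Bagging ensemble learning experiments built on supervised machine learning algorithms, computational efficiency improves by more than 50%, and prediction accuracy improves by about 2%.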
A sequential implementation of Bootstrap sampling and Bagging ensemble learning is computationally inefficient and does not scale to large Bagging ensemble models with many component models. Inspired by distributed big data computing, a new Bootstrap sample partition (BSP) big data model and a distributed ensemble learning method for large-scale distributed ensemble learning were proposed. The BSP data model extended a dataset into a set of Bootstrap samples stored in the Hadoop distributed file system. The distributed ensemble learning method randomly selected a subset of samples from the BSP data model and read them into the Java virtual machines of the cluster. Following this, a serial algorithm was executed in each virtual machine to process its sample and build a machine learning model on it, independently of and in parallel with the other virtual machines. Eventually, all sub-results were collected and processed in the master node to produce the ensemble result, optionally adding a sample preference strategy for the BSP data blocks. The BSP data model generation and the component model building were computed using a non-MapReduce computing paradigm. All component models were computed in parallel without data communication among the nodes. The algorithms proposed in this paper were implemented in Spark as internal operators that can be used in Spark applications. Experiments demonstrated that the BSP data model of a dataset can be generated efficiently through the new distributed algorithm. It improves the reusability of data samples and increases computational efficiency by over 50% in large-scale Bagging ensemble learning, while also increasing prediction accuracy by approximately 2%.
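The workflow summarized in the abstract (generate Bootstrap sample blocks once, randomly select some of them, train one model per block independently, then combine the sub-results with an optional quality selection) can be illustrated with a minimal single-machine sketch. This is not the authors' Spark operator implementation. The function names `generate_bsp_blocks`, `train_stump`, `accuracy`, and `bsp_ensemble`, the `keep_ratio` parameter, the one-dimensional threshold classifier used as the base learner, and the validation-accuracy ranking used for quality selection are all hypothetical choices made for illustration. An in-memory list stands in for the HDFS files, and a thread pool stands in for the cluster nodes.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def generate_bsp_blocks(data, num_blocks, seed=0):
    """Build num_blocks Bootstrap samples, each drawn with replacement from data.

    Assumption: every block has the same size as the original dataset.
    """
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(num_blocks)]


def train_stump(block):
    """Illustrative serial base learner: a threshold t that predicts 1 when x > t."""
    candidates = sorted({x for x, _ in block})
    # Pick the candidate threshold with the fewest training errors on this block.
    return min(candidates, key=lambda t: sum((x > t) != (y == 1) for x, y in block))


def accuracy(threshold, records):
    """Fraction of records classified correctly by the threshold model."""
    return sum((x > threshold) == (y == 1) for x, y in records) / len(records)


def bsp_ensemble(blocks, num_selected, validation, keep_ratio=1.0, seed=0):
    """Randomly select blocks, train one model per block, and combine by voting."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    # Each block is processed independently; no data is exchanged between workers.
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(train_stump, selected))
    # Hypothetical quality selection: keep the best-scoring fraction of the models.
    models.sort(key=lambda t: accuracy(t, validation), reverse=True)
    kept = models[: max(1, int(len(models) * keep_ratio))]

    def predict(x):
        votes = Counter(int(x > t) for t in kept)
        return votes.most_common(1)[0][0]

    return predict


# Toy data (hypothetical): label is 1 when x >= 20.
data = [(x, int(x >= 20)) for x in range(40)]
blocks = generate_bsp_blocks(data, num_blocks=20)
predict = bsp_ensemble(blocks, num_selected=9, validation=data, keep_ratio=0.6)
print(predict(5), predict(35))
```

Because the blocks are generated once and stored, later calls to `bsp_ensemble` with a different `seed` or `num_selected` reuse the same samples instead of resampling, which is the reusability property the abstract emphasizes.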