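1. China Mobile Information Technology Center, Beijing 100033
2. Beihang University, Beijing 100191
3. China Mobile Information Technology Co., Ltd., Shenzhen, Guangdong 518048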
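Jibin WANG (王冀彬, 1980- ), male, is a senior engineer and general manager of the Big Data Business Group at China Mobile Information Technology Center. His main research interests are big data and data analysis.
Hailong YANG (杨海龙, 1985- ), male, Ph.D., is a professor at Beihang University. His main research interests are high-performance computing, distributed and parallel computing, computer architecture, and compiler optimization for deep learning.
Kai FENG (冯凯, 1977- ), male, is a senior engineer and project director at China Mobile Information Technology Co., Ltd. His main research interests are big data operations and maintenance and data analysis.
Xin SUN (孙欣, 2000- ), female, is a master's student at Beihang University. Her main research interests are computer architecture and performance analysis tools.
Minda ZHANG (张敏达, 1998- ), female, is a project manager at China Mobile Information Technology Co., Ltd. Her main research interests are big data operations and maintenance and data analysis.
Kelun LEI (雷克伦, 2000- ), male, is a Ph.D. student at Beihang University. His main research interests are high-performance computing, performance analysis tools, and compiler optimization.
Zhiwen XIAO (肖智文, 1993- ), male, Ph.D., is an intermediate engineer and project manager at China Mobile Information Technology Center. His main research interests are big data, machine learning, and cloud computing.
Yifei ZHANG (张逸飞, 1996- ), male, Ph.D., is an intermediate engineer and project manager at China Mobile Information Technology Center. His main research interests are big data, machine learning, and cloud computing.
Jiaxi WU (吴佳熙, 1993- ), male, Ph.D., is an intermediate engineer and project manager at China Mobile Information Technology Center. His main research interests are big data and cloud computing.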
Published online: 2024-07
Published in print: 2024-07-15
Jibin WANG, Hailong YANG, Kai FENG, et al. System performance optimization practice for big data scenarios[J]. Big Data Research, 2024, 10(4): 21-33. DOI: 10.11959/j.issn.2096-0271.2024049.
In existing large-scale distributed environments, there is still considerable room to improve the performance and computational efficiency of big data applications. However, performance analysis and optimization in such environments demand substantial effort from domain experts. To address performance optimization in big data applications, this paper proposes a general process for detecting and optimizing inefficient query statements, summarizes four types of inefficient behavior that significantly affect the performance of big data applications, and proposes a specific optimization strategy for each type. Finally, experimental evaluation verifies the effectiveness of the proposed optimizations on a real large-scale cluster.