A novel quality of service-oriented Spark job scheduler

HE Yulin; MO Peiheng; Philippe Fournier-Viger; HUANG Zhexue

doi:10.11959/j.issn.2096-0271.2025048

Research | Updated: 2025-06-27

- A novel quality of service-oriented Spark job scheduler
- Big Data Research, 2025, Vol. 11, No. 4, pp. 154-177
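- Author affiliations:
  
  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, Guangdong, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China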
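- Author biographies:
  
  HE Yulin (1982- ), male, PhD, research fellow at the Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen). His research interests include big data system computing technologies and the design and application of data mining and machine learning algorithms for big data.
  MO Peiheng (2000- ), male, master's student at the College of Computer Science and Software Engineering, Shenzhen University. His research interests include distributed big data computing, Spark performance optimization, and the design of high-performance data mining and machine learning algorithms.
  Philippe Fournier-Viger (1980- ), male, PhD, distinguished professor at the Institute of Big Data Technology and Application, Shenzhen University. His research interests include data mining, artificial intelligence, knowledge representation and reasoning, and cognitive model construction.
  HUANG Zhexue (1959- ), male, PhD, distinguished professor and director of the Institute of Big Data Technology and Application, Shenzhen University. His research interests include data mining, machine learning, big data processing and analysis, and big data system computing technologies.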
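- Funding:
  
  Natural Science Foundation of Guangdong Province (2023A1515011667); Guangdong Basic and Applied Basic Research Foundation (2023B1515120020); Shenzhen Science and Technology Major Project (KJZD20230923114809020)
- DOI: 10.11959/j.issn.2096-0271.2025048
  CLC number: TP391.9
- Received: 2024-07-02
  
  Print publication date: 2025-07-15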
HE Yulin, MO Peiheng, Philippe Fournier-Viger, et al. A novel quality of service-oriented Spark job scheduler[J]. Big Data Research, 2025, 11(4): 154-177. DOI: 10.11959/j.issn.2096-0271.2025048.

Abstract

Spark is a widely used big data computing framework for processing and analyzing explosively growing data. The cloud provides on-demand, pay-as-you-go computing resources to satisfy users' requests. Many organizations now deploy big data computing clusters in the cloud, and these clusters must handle Spark job scheduling efficiently to meet users' varied QoS requirements, such as reducing the cost of resource usage and shortening job response time. However, most existing methods do not consider the requirements of multiple users jointly and ignore the characteristics of Spark cluster environments and workloads, which wastes resources and leaves QoS requirements unmet. To address this, the job scheduling problem of Spark clusters deployed in the cloud was modeled, and a new Spark job scheduler based on deep reinforcement learning (DRL) was designed to satisfy multiple QoS requirements. A DRL cluster simulation environment was built to train the DRL agent at the core of the scheduler. In this scheduling environment, two training methods were implemented, one based on the absolute deep Q-network and one combining proximal policy optimization with generalized advantage estimation, so that the DRL agent can adaptively learn the characteristics of different job types and of dynamic, bursty cluster environments. The agent can then schedule Spark jobs sensibly, reducing the total usage cost of the cluster and shortening the average job response time. Tests on a benchmark suite show that, compared with other existing Spark job scheduling solutions, the proposed DRL agent job scheduler performs significantly better in total cluster usage cost, average job response time, and QoS achievement rate, confirming the feasibility and effectiveness of the scheduler.
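As a rough illustration of the generalized advantage estimation step named in the abstract, the sketch below computes advantages for one rollout in plain Python. It is a textbook formulation, not the authors' implementation; the function name `gae_advantages`, its parameters, and the default `gamma` and `lam` values are assumptions chosen for illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,  # discount factor (assumed default)
    lam: float = 0.95,    # GAE smoothing factor (assumed default)
) -> List[float]:
    """Generalized advantage estimation over a single rollout.

    rewards[t] and values[t] are the reward and the critic's value
    estimate at step t; last_value is the value estimate of the state
    reached after the final step; dones[t] marks an episode end at t.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk the rollout backwards so each step reuses the later estimate.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of later TD errors.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

In a scheduler setting, the per-step reward would encode whatever QoS objective is being optimized (for instance, a penalty on cost or response time); that choice is left open here because the abstract does not specify it.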
