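1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
2. National-Local Joint Engineering Research Center for Big Data Technology and System, Key Laboratory of Services Computing Technology and System (Ministry of Education), Huazhong University of Science and Technology, Wuhan 430074, Hubei, China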
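HUANG Weixin (born 2000), male, is a master's student at the School of Computer Science and Technology, Huazhong University of Science and Technology. His main research interest is the optimization of deep learning systems.
HU Weifang (born 1995), male, is a doctoral student at the School of Computer Science and Technology, Huazhong University of Science and Technology. His main research interest is distributed deep learning system platforms.
CAO Xuejiao (born 1998), female, is a master's student at the School of Computer Science and Technology, Huazhong University of Science and Technology. Her main research interest is model selection under edge-device collaboration.
SHI Xuanhua (born 1978), male, Ph.D., is a professor at the School of Computer Science and Technology, Huazhong University of Science and Technology. His main research interests include parallel and distributed computing, cloud computing, and big data processing.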
Published online: 2024-07
Published in print: 2024-07-15
HUANG Weixin, HU Weifang, CAO Xuejiao, et al. LSTM training system based on heterogeneous hardware[J]. Big Data Research, 2024, 10(4): 172-188. DOI: 10.11959/j.issn.2096-0271.2024053.
In the era of big data, deep neural network models represented by LSTM can process massive amounts of data and perform well in language processing, speech recognition, and time series prediction. However, as model complexity grows, training cost rises significantly. Existing LSTM training systems use acceleration techniques such as operator fusion and multi-stream execution, but they neglect the parallelism available inside a single training operator, which leads to low utilization of computing resources and long training times. This paper therefore designs TurboLSTM, an LSTM training system based on fine-grained model partitioning and multi-stream parallel scheduling. New low-level training operators built for two heterogeneous hardware platforms, the NVIDIA GPU and the domestic Ascend NPU, let tasks make reasonable use of computing resources. Compared with existing training systems, TurboLSTM on the GPU reduces single-operator training time by about 23% and overall model training time by about 17%; on the NPU it reduces single-operator training time by about 15%. Utilization of computing resources also increases significantly. These results indicate that the proposed acceleration scheme is efficient and generalizes well.
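To make the notion of parallelism inside a single operator concrete, the following is a minimal sketch using the standard LSTM cell equations. It is not TurboLSTM's implementation. Within one time step, the four gate pre-activations depend only on the current input and the previous hidden state, so they can be computed independently before the element-wise state update. The names `gate_preact`, `lstm_step`, and `make_params` are hypothetical, and a thread pool stands in for hardware streams purely to show the dependency structure; it is not expected to speed anything up in Python.

```python
import math
from concurrent.futures import ThreadPoolExecutor

GATES = ("i", "f", "g", "o")  # input, forget, candidate, output


def matvec(W, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_preact(W, U, b, x, h):
    """Pre-activation of one gate: W x + U h + b.

    It reads only x and h, so the four gates are mutually independent.
    """
    wx, uh = matvec(W, x), matvec(U, h)
    return [a + c + d for a, c, d in zip(wx, uh, b)]


def lstm_step(params, x, h, c, pool=None):
    """One LSTM time step; params maps a gate name to (W, U, b).

    With a pool, the four gate pre-activations are submitted as separate
    tasks (a stand-in for separate streams); otherwise they run in order.
    """
    if pool is None:
        pre = {n: gate_preact(*params[n], x, h) for n in GATES}
    else:
        futures = {n: pool.submit(gate_preact, *params[n], x, h) for n in GATES}
        pre = {n: fut.result() for n, fut in futures.items()}

    i = [sigmoid(z) for z in pre["i"]]
    f = [sigmoid(z) for z in pre["f"]]
    g = [math.tanh(z) for z in pre["g"]]
    o = [sigmoid(z) for z in pre["o"]]

    # The element-wise update needs all four gates, so it is a sync point.
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def make_params(hidden, inputs):
    """Deterministic toy weights (illustrative values only)."""
    params = {}
    for gi, name in enumerate(GATES):
        W = [[0.01 * (gi + r + k + 1) for k in range(inputs)] for r in range(hidden)]
        U = [[0.02 * (gi - r + k) for k in range(hidden)] for r in range(hidden)]
        b = [0.1 * gi for _ in range(hidden)]
        params[name] = (W, U, b)
    return params


if __name__ == "__main__":
    p = make_params(hidden=3, inputs=2)
    x, h0, c0 = [0.5, -1.0], [0.1, 0.2, 0.3], [1.0, -0.5, 0.25]
    sequential = lstm_step(p, x, h0, c0)
    with ThreadPoolExecutor(max_workers=4) as pool:
        parallel = lstm_step(p, x, h0, c0, pool)
    print("results match:", sequential == parallel)
```

The only synchronization point within the step is the element-wise update of the cell and hidden state. The dependency between consecutive time steps remains, which is why parallelism must be sought inside the operator rather than across steps.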