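ZHANG Kai (1981- ), male, is a senior technical expert at Alibaba Technology (Beijing) Co., Ltd. His main research interests include cloud computing, containers, deep learning, and distributed systems.
CHE Yang (1982- ), male, is a senior technical expert at Alibaba Technology (Beijing) Co., Ltd. His main research interests include cloud computing, containers, distributed caching, and machine learning systems.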
Published online: 2021-09
Published in print: 2021-09-15
Kai ZHANG, Yang CHE. Method of accelerating deep learning with optimized distributed cache in containers[J]. Big Data Research, 2021, 7(5): 2021054. DOI: 10.11959/j.issn.2096-0271.2021054.
When GPUs are used to run containerized deep learning training tasks, performance is often limited by the efficiency of data loading and preprocessing, and many GPU computing resources are wasted waiting for data to be read from remote storage services. Firstly, methods for accelerating deep learning training with containers and distributed cache were introduced, together with the architecture and initial optimizations of a training system implemented with Alluxio and Kubernetes. Secondly, task and data co-located scheduling (TDCS) and its co-scheduling policy, in which training tasks and cached data are aware of each other, were elaborated. Thirdly, TDCS was implemented in a Kubernetes container cluster, which improved the scalability of distributed cache acceleration for large-scale deep learning training. Finally, a ResNet50 image classification training task was used for performance verification. The experimental results show that, compared with reading data directly from the remote storage service, TDCS achieves a 2 to 3 times speedup for distributed training tasks running on 128 NVIDIA V100 GPU devices.
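The idea of scheduling training tasks with awareness of where data is already cached can be illustrated with a toy placement function. This is a minimal sketch under stated assumptions, not the TDCS policy described in the paper: the `Node` class, the block identifiers, and the scoring rule (fraction of the dataset's blocks already cached on a node) are all hypothetical and chosen only to show the general shape of cache-aware placement.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node; every field here is illustrative."""
    name: str
    free_gpus: int
    cached_blocks: set = field(default_factory=set)


def locality_score(node, dataset_blocks):
    """Fraction of the task's dataset blocks already cached on this node."""
    if not dataset_blocks:
        return 0.0
    return len(node.cached_blocks & dataset_blocks) / len(dataset_blocks)


def place_task(nodes, dataset_blocks, gpus_needed):
    """Pick the node with enough free GPUs and the highest cache locality.

    Returns None when no node has enough free GPUs.
    """
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not candidates:
        return None
    # Ties on locality are broken in favour of the node with more free GPUs.
    return max(
        candidates,
        key=lambda n: (locality_score(n, dataset_blocks), n.free_gpus),
    )


# Assumed example cluster: node-c caches the full dataset but lacks GPUs.
nodes = [
    Node("node-a", free_gpus=8, cached_blocks={"b0", "b1"}),
    Node("node-b", free_gpus=8, cached_blocks={"b0", "b1", "b2", "b3"}),
    Node("node-c", free_gpus=2, cached_blocks={"b0", "b1", "b2", "b3"}),
]
chosen = place_task(nodes, {"b0", "b1", "b2", "b3"}, gpus_needed=4)
print(chosen.name)
```

In this sketch the GPU requirement acts as a hard filter and cache locality as a soft preference, which is one common way to combine resource constraints with data placement; a real scheduler would also have to account for cache capacity, eviction, and tasks that span several nodes.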