Method of accelerating deep learning with optimized distributed cache in containers

Kai ZHANG; Yang CHE

doi:10.11959/j.issn.2096-0271.2021054

Research | Updated: 2024-06-03

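- Journal: Big Data Research, 2021, Vol. 7, No. 5, Article 2021054
- Author biographies:
  Kai ZHANG (1981- ), male, senior technical expert at Alibaba Technology (Beijing) Co., Ltd.; main research interests: cloud computing, containers, deep learning, and distributed systems.
  Yang CHE (1982- ), male, senior technical expert at Alibaba Technology (Beijing) Co., Ltd.; main research interests: cloud computing, containers, distributed caching, and machine learning systems.
- DOI: 10.11959/j.issn.2096-0271.2021054
- CLC number: TP311
- Published online: 2021-09; print publication: 2021-09-15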
Kai ZHANG, Yang CHE. Method of accelerating deep learning with optimized distributed cache in containers[J]. Big Data Research, 2021, 7(5): 2021054. DOI: 10.11959/j.issn.2096-0271.2021054.

Abstract

When GPUs are used to run containerized deep learning training tasks, performance is often limited by the efficiency of data loading and preprocessing, and many GPU computing resources are wasted waiting for data to be read from remote storage services. Firstly, methods of accelerating deep learning training with containers and distributed caching were introduced, together with the architecture and initial optimizations of a training system implemented with Alluxio and Kubernetes. Secondly, task and data co-located scheduling (TDCS) and its scheduling policy, in which training tasks and cached data are mutually aware of each other, were elaborated. Thirdly, TDCS was implemented in a Kubernetes container cluster, which improved the scalability of distributed cache acceleration for large-scale deep learning training. Finally, performance was verified with a ResNet50 image classification training task. The experimental results show that, compared with reading data directly from remote storage services, TDCS achieves a 2 to 3 times speedup for distributed training tasks running on 128 NVIDIA V100 GPU devices.
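The abstract does not spell out how TDCS scores nodes, so the following is only an illustrative sketch of the general idea of cache-aware task placement, not the authors' algorithm. It assumes a scheduler can see which dataset blocks each node has cached and how many GPUs are free; the names `Node`, `cache_locality`, and `place_workers` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node as seen by a scheduler."""
    name: str
    free_gpus: int
    cached_blocks: set = field(default_factory=set)


def cache_locality(node, dataset_blocks):
    """Fraction of the dataset's blocks already cached on this node."""
    if not dataset_blocks:
        return 0.0
    return len(node.cached_blocks & dataset_blocks) / len(dataset_blocks)


def place_workers(nodes, dataset_blocks, gpus_per_worker, num_workers):
    """Greedily place workers on the nodes holding the most cached data.

    Assumption: locality is the only ranking criterion; ties are broken
    by node name so the result is deterministic.
    """
    ranked = sorted(
        nodes, key=lambda n: (-cache_locality(n, dataset_blocks), n.name)
    )
    free = {n.name: n.free_gpus for n in nodes}  # do not mutate the inputs
    placement = []
    for _ in range(num_workers):
        for n in ranked:
            if free[n.name] >= gpus_per_worker:
                free[n.name] -= gpus_per_worker
                placement.append(n.name)
                break
        else:
            raise RuntimeError("not enough free GPUs for all workers")
    return placement


nodes = [
    Node("a", free_gpus=8, cached_blocks={1, 2, 3}),
    Node("b", free_gpus=8, cached_blocks={1}),
    Node("c", free_gpus=8),
]
dataset = {1, 2, 3, 4}
placement = place_workers(nodes, dataset, gpus_per_worker=8, num_workers=2)
```

In this toy setup the two workers land on the nodes with the highest cache locality. A real scheduler would also have to weigh GPU topology, cache capacity, and the reverse direction (placing cache data near already scheduled tasks), none of which is modeled here.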
