Large Message Broadcast Design for Distributed Machine Learning
Cite this article: XIN Yi-Jie, XIE Bin, LI Zhen-Xing. Large Message Broadcast Design for Distributed Machine Learning[J]. Computer Systems & Applications, 2020, 29(1): 1-13
Authors: XIN Yi-Jie, XIE Bin, LI Zhen-Xing
Affiliation: East China Institute of Computing Technology, Shanghai 201808, China
Abstract: MPI (Message Passing Interface) was designed for large-scale compute clusters with many nodes. However, with the advent of MPI+CUDA (Compute Unified Device Architecture) applications and clusters whose compute nodes are equipped with GPUs, traditional MPI-style communication libraries can no longer meet the demand. The machine learning field faces the same challenge: in deep learning frameworks such as Caffe and CNTK (Microsoft Cognitive Toolkit), GPUs buffer very large volumes of data during training, and most of the optimization algorithms used in training are iterative, so inter-GPU communication is both large in volume and high in frequency. This has become one of the main factors limiting further gains in deep learning training performance. Collective communication libraries such as NCCL (NVIDIA Collective multi-GPU Communication Library) have been introduced to address deep learning communication, but they have drawbacks such as incompatibility with MPI. It is therefore especially important to design a more efficient communication acceleration mechanism that matches these new trends. To address the above challenges, this paper proposes two new broadcast mechanisms: (1) a Pipelined Chain (PC) design based on MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a Topology-Aware PC (TA-PC) design that fully exploits the PCIe links available within a multi-GPU node.
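For illustration only, the following minimal C sketch conveys the general pipelined-chain idea described above, not the paper's actual MPI_Bcast implementation: the large message is split into chunks, and each rank forwards a chunk to its chain successor as soon as it has received it, so successive chunks are in flight on different links at the same time. It assumes a CUDA-aware MPI build (device pointers can be passed directly to MPI calls), that rank 0 is the chain head holding the data, and that communicator ranks already follow the desired chain order; the chunk size and function name are illustrative.

/* Minimal sketch of a pipelined-chain broadcast of a large GPU buffer.
 * Assumptions (not from the paper): CUDA-aware MPI, rank 0 is the chain
 * head, and communicator ranks already follow the desired chain order. */
#include <mpi.h>
#include <stddef.h>

#define CHUNK_BYTES (4 * 1024 * 1024)   /* illustrative pipeline chunk size */

static void pipelined_chain_bcast(void *d_buf, size_t total_bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *p   = (char *)d_buf;          /* device pointer (CUDA-aware MPI)   */
    int  prev = rank - 1;               /* upstream neighbour in the chain   */
    int  next = rank + 1;               /* downstream neighbour in the chain */

    for (size_t off = 0; off < total_bytes; off += CHUNK_BYTES) {
        size_t left = total_bytes - off;
        int n = (int)(left < CHUNK_BYTES ? left : CHUNK_BYTES);

        /* Every rank except the chain head first receives the chunk ...    */
        if (rank != 0)
            MPI_Recv(p + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
        /* ... then forwards it immediately, so later chunks can already be
         * travelling on earlier links while this one moves down the chain. */
        if (next < size)
            MPI_Send(p + off, n, MPI_BYTE, next, 0, comm);
    }
}

A production design would additionally use non-blocking transfers and staging (e.g. GPUDirect RDMA) to overlap chunks more aggressively; the sketch only shows the chunking-and-forwarding pattern that gives the chain its pipelining.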

Keywords: deep learning; NCCL; MPI_Bcast; pipelined chain communication; topology-aware; PCIe links
Received: 2019-06-17
Revised: 2019-07-12

Large Message Broadcast Design for Distributed Machine Learning
XIN Yi-Jie, XIE Bin, LI Zhen-Xing. Large Message Broadcast Design for Distributed Machine Learning[J]. Computer Systems & Applications, 2020, 29(1): 1-13
Authors: XIN Yi-Jie, XIE Bin, LI Zhen-Xing
Affiliation: East China Institute of Computing Technology, Shanghai 201808, China
Abstract: Traditionally, Message Passing Interface (MPI) runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively small number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with the new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this study, we address these new challenges for MPI runtimes and propose two new designs: (1) a Pipelined Chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a Topology-Aware PC (TA-PC) design that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of the proposed designs, we present a performance evaluation on three GPU clusters with diverse characteristics: RX1, a dense multi-GPU system; RX2, with a single K80 GPU card per node; and RX3, with a single P100 GPU per node. The proposed designs offer up to 14× and 16.6× better performance than MPI+NCCL1 based solutions for intra- and inter-node broadcast latency. We have also enhanced the performance results by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design, and report up to 10× better performance for small and medium message sizes and comparable performance for large message sizes. We also observed that the TA-PC design is up to 50% better than the PC design for MPI_Bcast on 64 GPUs. The results clearly highlight the strengths of the proposed solution in terms of both portability and performance.
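To make the topology-aware idea concrete, here is a small, illustrative C sketch (not the paper's TA-PC algorithm): it uses NVML's nvmlDeviceGetTopologyCommonAncestor to measure how far apart each pair of GPUs sits in the PCIe hierarchy, then greedily appends the closest remaining GPU to the tail of the chain, so that neighbouring ranks in the pipelined chain communicate over short PCIe paths. The greedy ordering, the MAX_GPUS limit, and the omitted error handling are simplifying assumptions.

/* Sketch of topology-aware chain ordering (illustrative only): query how
 * close each pair of GPUs is in the PCIe hierarchy via NVML, then greedily
 * grow a chain that always appends the closest remaining GPU to the tail.
 * Link with -lnvidia-ml. */
#include <stdio.h>
#include <nvml.h>

#define MAX_GPUS 16

int main(void)
{
    nvmlDevice_t dev[MAX_GPUS];
    unsigned int n = 0;

    nvmlInit();
    nvmlDeviceGetCount(&n);
    if (n > MAX_GPUS) n = MAX_GPUS;
    for (unsigned int i = 0; i < n; i++)
        nvmlDeviceGetHandleByIndex(i, &dev[i]);

    int used[MAX_GPUS] = {0};
    int chain[MAX_GPUS];
    chain[0] = 0;                       /* start the chain at GPU 0 (arbitrary) */
    used[0] = 1;

    for (unsigned int k = 1; k < n; k++) {
        int tail = chain[k - 1], best = -1;
        nvmlGpuTopologyLevel_t best_lvl = NVML_TOPOLOGY_SYSTEM;
        for (unsigned int j = 0; j < n; j++) {
            if (used[j]) continue;
            nvmlGpuTopologyLevel_t lvl;
            /* Lower level = nearer common ancestor (e.g. same PCIe switch). */
            nvmlDeviceGetTopologyCommonAncestor(dev[tail], dev[j], &lvl);
            if (best < 0 || lvl < best_lvl) { best = (int)j; best_lvl = lvl; }
        }
        chain[k] = best;
        used[best] = 1;
    }

    printf("topology-aware chain order:");
    for (unsigned int k = 0; k < n; k++) printf(" GPU%d", chain[k]);
    printf("\n");

    nvmlShutdown();
    return 0;
}

The resulting device order would then be used to map chain neighbours onto well-connected PCIe paths before running a pipelined-chain broadcast such as the one sketched earlier.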
Keywords: deep learning; NCCL; MPI_Bcast; pipelined chain design; topology-aware; PCIe links