Partially synchronous communication strategy based on Local SGD
Citation: Wei YeMing, Zheng MeiGuang. Partially synchronous communication strategy based on Local SGD[J]. Application Research of Computers, 2023, 40(12).
Authors: Wei YeMing, Zheng MeiGuang
Affiliation: Central South University
Funding: National Natural Science Foundation of China (62172442); Hunan Provincial Natural Science Foundation Youth Program (2020JJ5775)
Abstract: Local SGD is used in distributed machine learning to alleviate the communication bottleneck, but its multiple local iterations per round widen the gap in computation time between heterogeneous cluster workers, causing long synchronization waits and stale parameters. To address these problems, this paper proposes a dynamic partial synchronization communication strategy (LPSP) based on Local SGD, which exploits the advantage of local iteration through a two-layer decision. After each worker finishes a round of local computation, it judges whether to communicate based on its local training situation; the parameter server then partitions the workers into synchronization sets so as to minimize synchronization wait time. Together, these two decisions reduce the communication overhead of Local SGD and effectively limit the negative impact of stragglers. Experiments show that LPSP achieves speedups of up to 0.75 to 1.26 times without loss of training accuracy, along with accuracy improvements of up to 5.14%, and thus effectively accelerates training convergence.
Keywords: distributed machine learning  stochastic gradient descent  parameter server  partial synchronization
Received: 2023-04-03
Revised: 2023-06-08
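The abstract only describes the two-layer decision at a high level. The toy Python simulation below is not the paper's implementation: the drift-based readiness test, the per-worker step counts, and all function and variable names are illustrative assumptions standing in for the paper's communication-possibility check and synchronization-set partitioning.

```python
# Minimal single-process sketch of partial synchronization on top of Local SGD,
# assuming a simple least-squares model and simulated heterogeneous workers.
# All names and the readiness criterion are illustrative, not the paper's API.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data, split into one shard per worker.
X, true_w = rng.normal(size=(512, 10)), rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=512)
shards = np.array_split(np.arange(512), 4)

def local_sgd(w, idx, steps, lr=0.01, batch=32):
    """Run `steps` local SGD steps on one worker's data shard."""
    for _ in range(steps):
        b = rng.choice(idx, size=batch, replace=False)
        grad = X[b].T @ (X[b] @ w - y[b]) / batch   # least-squares gradient
        w = w - lr * grad
    return w

global_w = np.zeros(10)
local_ws = [global_w.copy() for _ in shards]
# Simulated heterogeneity: slower workers complete fewer local steps per round.
steps_per_round = [8, 8, 4, 2]

for rnd in range(50):
    # Layer 1: each worker trains locally, then a simple drift-based test
    # (a placeholder for the paper's local criterion) decides readiness to sync.
    for i, idx in enumerate(shards):
        local_ws[i] = local_sgd(local_ws[i], idx, steps_per_round[i])
    drift = [np.linalg.norm(w - global_w) for w in local_ws]
    ready = [i for i, d in enumerate(drift) if d >= np.median(drift)]

    # Layer 2: the "parameter server" averages only the synchronization set,
    # so fast workers are not forced to wait for every straggler.
    if ready:
        global_w = np.mean([local_ws[i] for i in ready], axis=0)
        for i in ready:
            local_ws[i] = global_w.copy()

print("final parameter error:", np.linalg.norm(global_w - true_w))
```

Averaging only the workers in the synchronization set is what keeps fast workers from waiting on every straggler; in the paper the placeholder readiness test above corresponds to the per-worker communication decision, and the set partitioning is chosen to minimize synchronization wait time.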