Funding: National Key Research and Development Program of China (2017YFC1502306); National Natural Science Foundation of China (61602205)
Received: 2021-08-05
Revised: 2021-10-14

Performance interference analysis and prediction for distributed machine learning jobs
Citation: Hongliang LI, Nong ZHANG, Ting SUN, Xiang LI. Performance interference analysis and prediction for distributed machine learning jobs[J]. Journal of Computer Applications, 2022, 42(6): 1649-1655.
Authors: Hongliang LI, Nong ZHANG, Ting SUN, Xiang LI
Affiliation: College of Computer Science and Technology, Jilin University, Changchun Jilin 130012, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education (Jilin University), Changchun Jilin 130012, China
Abstract: An analysis of job performance interference in distributed machine learning shows that the interference arises from uneven allocation of GPU resources, such as memory overload and bandwidth contention. To address this, a mechanism for quickly predicting performance interference between jobs was designed and implemented; it adaptively predicts the degree of job interference from the given GPU parameters and job types. First, the GPU parameters and interference rates of running distributed machine learning jobs were obtained through experiments, and the influence of each type of parameter on performance interference was analyzed. Second, GPU parameter-interference rate models were built with several prediction techniques, and their interference rate prediction errors were analyzed. Finally, an adaptive job interference rate prediction algorithm was proposed that automatically selects the prediction model with the smallest error for a given device environment and job set, so that job interference rates are predicted quickly and accurately. Experiments were designed with five commonly used neural network jobs on two types of GPU devices, and the results were analyzed. The results show that the proposed Adaptive Interference Prediction (AIP) mechanism completes prediction model selection and performance interference prediction quickly without any prior assumptions: it takes less than 300 s, and the interference rate prediction error is between 2% and 13%. The mechanism can be applied to scenarios such as job scheduling and load balancing.
Keywords: distributed machine learning; performance interference; cluster scheduling; resource sharing; interference prediction
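
The adaptive selection step described in the abstract (fit several prediction models, then keep the one with the smallest error for the given device and job set) can be illustrated with a minimal sketch. The abstract does not name the candidate models, the GPU parameters, the error metric, or how the interference rate is computed, so everything below is an assumption made for illustration: the three toy models, the single feature (a co-located job's GPU memory utilization), mean absolute error as the metric, the definition of interference rate as relative slowdown, and the synthetic data.

```python
from statistics import mean


def interference_rate(solo_time, shared_time):
    """Assumed definition: relative slowdown of a job when co-located."""
    return (shared_time - solo_time) / solo_time


class MeanModel:
    """Baseline: always predicts the mean training interference rate."""
    def fit(self, xs, ys):
        self.mu = mean(ys)
        return self

    def predict(self, x):
        return self.mu


class LinearModel:
    """Ordinary least-squares line on a single GPU parameter."""
    def fit(self, xs, ys):
        mx, my = mean(xs), mean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = my - self.slope * mx
        return self

    def predict(self, x):
        return self.intercept + self.slope * x


class NearestNeighbourModel:
    """Predicts the rate of the closest training sample."""
    def fit(self, xs, ys):
        self.data = list(zip(xs, ys))
        return self

    def predict(self, x):
        return min(self.data, key=lambda p: abs(p[0] - x))[1]


def mean_abs_error(model, xs, ys):
    return mean(abs(model.predict(x) - y) for x, y in zip(xs, ys))


def select_model(candidates, train, valid):
    """Fit every candidate and return (name, model, error) for the
    candidate with the smallest validation error."""
    best = None
    for name, factory in candidates.items():
        model = factory().fit(*train)
        err = mean_abs_error(model, *valid)
        if best is None or err < best[2]:
            best = (name, model, err)
    return best


# Synthetic, noise-free data (hypothetical): memory utilization -> rate.
xs = [i / 20 for i in range(20)]
ys = [0.05 + 0.4 * x for x in xs]
train = (xs[0::2], ys[0::2])   # even-indexed samples for fitting
valid = (xs[1::2], ys[1::2])   # odd-indexed samples for error estimation

candidates = {
    "mean": MeanModel,
    "linear": LinearModel,
    "nearest": NearestNeighbourModel,
}
name, model, err = select_model(candidates, train, valid)
print(f"selected model: {name}, validation MAE: {err:.4f}")
```

On this synthetic data the relationship is exactly linear, so the linear candidate is selected. With measurements from a different device or job mix, another candidate could win, which is the point of selecting the model per environment rather than fixing it in advance.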