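A shuffle optimization mechanism for Spark clusters
Cite this article: XIONG Anping, XIA Yuchong, YANG Fangfang. A shuffle optimization mechanism for Spark clusters[J]. Computer Engineering and Applications, 2018, 54(4): 72-76.
Authors: XIONG Anping  XIA Yuchong  YANG Fangfang
Affiliations: 1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065 2. Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065
Abstract: Spark is an in-memory distributed data processing framework. During its shuffle process, large amounts of data must be transferred over the network, which has become one of Spark's main bottlenecks. To address the problem that uneven data distribution during shuffle causes unbalanced network I/O load across nodes, a restart strategy based on task locality level is designed, and a balanced scheduling strategy is further proposed to balance the network I/O load of each node. Finally, experiments verify that the optimization mechanism reduces the execution time of computing tasks and improves the efficiency of the whole shuffle process.

Keywords: Spark cluster  shuffle process  data transfer  locality  scheduling strategy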

Shuffle optimization for Spark cluster
XIONG Anping, XIA Yuchong, YANG Fangfang. Shuffle optimization for Spark cluster[J]. Computer Engineering and Applications, 2018, 54(4): 72-76.
Authors: XIONG Anping  XIA Yuchong  YANG Fangfang
Affiliation: 1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China 2. Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065, China
Abstract: Spark is an in-memory distributed processing framework. Its shuffle process transfers large amounts of data over the network, which has become one of the main bottlenecks of Spark performance. To address the unbalanced data distribution that causes network I/O load imbalance across nodes, a restart policy based on task locality level is designed, and a balanced scheduling strategy is further proposed to even out the network I/O load of each node. Finally, experiments verify that the optimization mechanism reduces task execution time and improves the efficiency of the shuffle process.
Keywords: Spark cluster  shuffle process  data transfer  locality  scheduling strategy