Shuffle optimization for Spark cluster

Citation:	XIONG Anping, XIA Yuchong, YANG Fangfang. Shuffle optimization for Spark cluster[J]. Computer Engineering and Applications, 2018, 54(4): 72-76.

Authors:	XIONG Anping, XIA Yuchong, YANG Fangfang

Affiliation:	1. School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 2. Chongqing Engineering Research Center of Mobile Internet Data Application, Chongqing 400065, China

Abstract:	Spark is an in-memory distributed data processing framework. During the shuffle process, large amounts of data must be transferred over the network, which has become one of Spark's main performance bottlenecks. To address the uneven data distribution during shuffle, which leads to unbalanced network I/O load across nodes, a restart policy based on task locality level is designed, and a balanced scheduling strategy is further proposed to even out the network I/O load of each node. Experiments show that the optimization mechanism reduces the execution time of computing tasks and improves the efficiency of the whole shuffle process.

Keywords:	Spark cluster; shuffle process; data transfer; locality; scheduling strategy
