
Method Research to Solve Shuffle Data Skew Based on Broadcast

Cite this article:	WU En-Ci. Method Research to Solve Shuffle Data Skew Based on Broadcast[J]. Computer Systems & Applications, 2019, 28(6): 189-197.

Authors:	WU En-Ci

Affiliation:	Shanghai Qiyu Information Technology Co. Ltd., Shanghai 200120, China

Abstract:	On the Spark computing platform, data skew often forces some nodes to bear more network traffic and computing pressure than others, which places a heavy burden on the cluster's CPU, memory, disk, and network and degrades the computing performance of the entire cluster. By studying the design and algorithmic implementation of Spark Shuffle, this paper analyzes in depth the root causes of data skew in large-scale distributed environments. It proposes a method that uses the broadcast mechanism to avoid data skew during the shuffle process, analyzes the distribution logic of broadcast variables, and presents both a performance-advantage analysis of broadcast variables and an algorithmic implementation of the method. A Broadcast Join experiment verifies that the method delivers a stable performance improvement.

Keywords:	data skew; partition strategy; shuffle algorithm; broadcast mechanism
Received:	2018/12/19
Revised:	2019/1/15
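
The idea summarized in the abstract can be illustrated with a small sketch. This is not the paper's implementation and does not use Spark. It is a plain-Python simulation, under assumed data sizes, of why hash-partitioning a skewed key overloads one partition, and why joining each input partition against a local copy of the small table (the role a broadcast variable plays) leaves the load where it already was. All names (`shuffle_partition_sizes`, `broadcast_join`, the tables, and the partition count) are hypothetical.

```python
from collections import Counter

NUM_PARTITIONS = 4  # assumed partition count, for illustration only


def shuffle_partition_sizes(records, num_partitions):
    """Count the records each partition receives when rows are
    redistributed by hash(key) % num_partitions, as in a shuffle join."""
    sizes = Counter()
    for key, _ in records:
        sizes[hash(key) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]


def broadcast_join(partitions, small_table):
    """Map-side join: each partition joins against a full local copy of
    the small table, so no row of the large table changes partition."""
    lookup = dict(small_table)  # stands in for the broadcast variable
    return [
        [(key, value, lookup[key]) for key, value in part if key in lookup]
        for part in partitions
    ]


# Skewed large table: key 0 dominates (9000 rows); keys 1..3 have 100 rows each.
large_table = [(0, i) for i in range(9000)]
large_table += [(k, i) for k in (1, 2, 3) for i in range(100)]

# Small table that is cheap to copy to every partition.
small_table = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]

# Input partitions as they might arrive from storage: evenly sized.
input_partitions = [large_table[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

# Shuffle-based join: all rows with the hot key land in a single partition.
shuffle_sizes = shuffle_partition_sizes(large_table, NUM_PARTITIONS)

# Broadcast join: output partition sizes mirror the balanced input.
joined = broadcast_join(input_partitions, small_table)
broadcast_sizes = [len(part) for part in joined]

print("shuffle partition sizes:  ", shuffle_sizes)
print("broadcast partition sizes:", broadcast_sizes)
```

In Spark itself, the analogous step would be to ship the small table to the executors, for example with `SparkContext.broadcast` or the `broadcast` join hint in Spark SQL. Whether that is feasible depends on the small table fitting in each executor's memory, which this sketch simply assumes.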
This article is indexed in databases including VIP and Wanfang Data.