Hash Join in MapReduce Distributed Environment Based on Column-store

Cite this article:	ZHANG Bin, LE Jia-jin. Hash Join in MapReduce Distributed Environment Based on Column-store[J]. Computer Science (计算机科学), 2018, 45(Z6): 471-475, 505

Authors:	ZHANG Bin, LE Jia-jin

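Affiliations:	Zhejiang University of Finance & Economics, Hangzhou 310018; School of Computer Science and Technology, Donghua University, Shanghai 201620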

Funding:	This work was supported by the Zhejiang Provincial Philosophy and Social Sciences Planning Project (17NDJC179YB)

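Abstract:	Big data is characterized by large scale, great depth, great breadth, short processing time, commodity hardware, and open-source software. Traditional relational databases suffer from severe performance degradation, limited gains in computational efficiency, and poor scalability when operating on big data. This paper therefore introduces the MapReduce parallel computing model and proposes a column-store-based distributed Hash join algorithm in MapReduce for big data. First, a distributed computing model for big data is designed; building on the designed partition-aggregation parallel join, Hash join and a dynamic probing method are used to optimize the efficiency of parallel join processing. Then, a Hadoop-based prototype system is developed for the algorithm. Experiments show that, in big data analysis and processing, the proposed algorithm performs well in both execution time and load capacity, and also provides good scalability.
Keywords:	Big data; Column-store; Hash join; MapReduce; Parallel computing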
Hash Join in MapReduce Distributed Environment Based on Column-store

ZHANG Bin and LE Jia-jin. Hash Join in MapReduce Distributed Environment Based on Column-store[J]. Computer Science, 2018, 45(Z6): 471-475, 505

Authors:	ZHANG Bin and LE Jia-jin

Affiliation:	Zhejiang University of Finance & Economics, Hangzhou 310018, China and School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Abstract:	The characteristics of big data are volume, variety, value, velocity, commodity hardware, and open-source software. To address the system inefficiency and limited scalability of traditional relational databases in big data analysis, this paper presents a column-store-based Hash join algorithm for the MapReduce distributed environment, introducing the MapReduce computing model. First, it proposes the design of a distributed computing model oriented toward big data. Then, it proposes partition aggregation and a heuristic optimization strategy to implement the Hash join algorithm. Lastly, experiments evaluate execution time and load capacity. The results show that the proposed method is effective and can provide good scalability in big data analysis.

Keywords:	Big data; Column-store; Hash join; MapReduce; Parallel computing
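
For readers unfamiliar with the general technique named in the abstract, the following is a minimal in-process sketch of a generic repartition Hash join in the map/reduce style over column-oriented tables. It is not the authors' algorithm: the partition-aggregation design, the dynamic probing and the heuristic optimization described in the paper are not reproduced here. All names (`scan`, `map_phase`, `reduce_phase`, `hash_join`) and the sample tables are hypothetical, and a plain Python loop stands in for Hadoop.

```python
from collections import defaultdict


def scan(table, columns):
    """Yield rows from a column-oriented table, reading only the listed columns.

    Assumption: a table is a dict mapping column name -> list of values.
    """
    for i in range(len(table[columns[0]])):
        yield {c: table[c][i] for c in columns}


def map_phase(tag, rows, key, num_partitions):
    """Map step: tag each row with its source and route it by hash of the join key."""
    for row in rows:
        yield hash(row[key]) % num_partitions, (tag, row)


def reduce_phase(records, key):
    """Reduce step: build a hash table on the 'build' rows, then probe with the rest."""
    build = defaultdict(list)
    probe = []
    for tag, row in records:
        if tag == "build":
            build[row[key]].append(row)
        else:
            probe.append(row)
    for row in probe:
        for match in build.get(row[key], []):
            yield {**match, **row}


def hash_join(build_table, build_cols, probe_table, probe_cols, key, num_partitions=4):
    """Simulate the shuffle in memory and run one reduce per partition."""
    partitions = defaultdict(list)
    inputs = (("build", build_table, build_cols), ("probe", probe_table, probe_cols))
    for tag, table, cols in inputs:
        for pid, record in map_phase(tag, scan(table, cols), key, num_partitions):
            partitions[pid].append(record)
    result = []
    for pid in sorted(partitions):
        result.extend(reduce_phase(partitions[pid], key))
    return result


# Hypothetical sample data; the "note" column is never read by the join.
customers = {"cust_id": [1, 2], "name": ["Ann", "Bob"]}
orders = {"cust_id": [1, 2, 1, 3], "amount": [10, 20, 30, 40], "note": ["a", "b", "c", "d"]}

joined = hash_join(customers, ["cust_id", "name"], orders, ["cust_id", "amount"], "cust_id")
print(joined)
```

Because rows with equal join keys always hash to the same partition, each reduce call can join its partition independently of the others. The `scan` helper illustrates why a column layout helps here: only the join key and the projected columns are read.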
