
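MapReduce Based Similarity Self-join Algorithm for Big Dataset
Cite this article: SUN De-cai, WANG Xiao-xia. MapReduce Based Similarity Self-join Algorithm for Big Dataset[J]. Computer Science, 2017, 44(5): 20-25, 32.
Authors: SUN De-cai, WANG Xiao-xia
Affiliations: College of Information Science and Technology, Bohai University, Jinzhou 121013; Research and Teaching Institute of College Basics, Bohai University, Jinzhou 121013
Funding: This work was supported by the Youth Fund for Humanities and Social Sciences Research of the Ministry of Education (15YJC870021, 5YJC870028), the Doctoral Scientific Research Startup Fund Program of Liaoning Province (20141138), the Scientific Research Project of the Education Department of Liaoning Province (L2015010, L2014451), the Natural Science Foundation of Liaoning Province (2015020009), and the Young Scientists Fund of the National Natural Science Foundation of China (61602056).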
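Abstract: How to quickly find duplicate or similar records in a dataset is a fundamental problem in big data processing. Similarity join is an effective method for finding similar data, and MapReduce-based similarity join algorithms have received wide attention for their strong ability to process big datasets. By analyzing the problems that current similarity join algorithms exhibit when performing a self-join, such as self-join redundancy and the complexity of reading the original strings, an improved MapReduce-based self-join algorithm is proposed on the basis of the Massjoin algorithm. In the filtration stage, the improved algorithm adds a filtering condition that eliminates self-redundancy. In the verification stage, it adopts redundancy-removal techniques such as generating forward and backward candidate pairs and combining ids, and it needs to read the dataset only once to retrieve the original string contents. Experimental data show that the improved algorithm reduces CPU time in both the filtration stage and the verification stage, indicating that the proposed improvement strategies are effective.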

Keywords: similarity join; big data; MapReduce; data cleaning
Received: 2016-08-22
Revised: 2016-11-12

MapReduce Based Similarity Self-join Algorithm for Big Dataset
SUN De-cai and WANG Xiao-xia. MapReduce Based Similarity Self-join Algorithm for Big Dataset[J]. Computer Science, 2017, 44(5): 20-25, 32.
Authors: SUN De-cai and WANG Xiao-xia
Affiliation: College of Information Science and Technology, Bohai University, Jinzhou 121013, China and Research and Teaching Institute of College Basics, Bohai University, Jinzhou 121013, China
Abstract: Finding duplicate or similar records in a dataset quickly is a key issue in big data processing. Similarity join is an effective operation for finding similar records, and MapReduce-based similarity join algorithms have attracted wide attention for their ability to process big datasets. In this paper, similarity self-join algorithms are studied and several factors that slow down self-joins are identified. To accelerate self-joins, an improved similarity self-join algorithm based on Massjoin is proposed. In the filtration stage, a new filtration criterion is added to eliminate redundant self-join pairs. In the verification stage, backward-forward pairs and combined ids are adopted to eliminate more redundant self-join candidate pairs, and the dataset is scanned only once when reading the original strings. The experimental results demonstrate that the new algorithm reduces CPU time in both the filtration stage and the verification stage. As a result, the proposed revision strategies increase the efficiency of the similarity self-join algorithm.
Keywords: Similarity join; Big data; MapReduce; Data cleaning
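
To make the notion of self-join redundancy concrete, the following is a minimal single-machine sketch in Python. It is not the authors' algorithm and not Massjoin, and it does not use MapReduce. It assumes edit distance as the similarity measure and a hypothetical threshold parameter `tau`; the function names are illustrative only. It shows why a self-join should emit each unordered pair once (id_a < id_b) rather than both (a, b) and (b, a) or the trivial (a, a), and how a cheap length filter can discard pairs before the costly verification step.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_self_join(records, tau):
    """Return id pairs (i, j), i < j, whose strings are within edit distance tau.

    Assumption: tau is a non-negative integer edit-distance threshold.
    """
    results = []
    # combinations() yields each unordered pair exactly once with i < j,
    # so mirrored pairs (j, i) and self pairs (i, i) are never generated.
    for (id_a, a), (id_b, b) in combinations(enumerate(records), 2):
        # Filtration: strings whose lengths differ by more than tau
        # cannot be within edit distance tau.
        if abs(len(a) - len(b)) > tau:
            continue
        # Verification: compute the exact distance only for survivors.
        if edit_distance(a, b) <= tau:
            results.append((id_a, id_b))
    return results


if __name__ == "__main__":
    data = ["mapreduce", "mapreduse", "hadoop", "hadop", "spark"]
    print(similarity_self_join(data, tau=1))  # [(0, 1), (2, 3)]
```

The sketch compares all surviving pairs directly, which is quadratic in the number of records; a distributed algorithm of the kind the abstract describes would instead generate candidate pairs from shared signatures before verification.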