
A MapReduce-Based Local Similarity Self-Join Algorithm (一种基于MapReduce的局部相似自连接算法)
Citation: WANG Xiao-xia, SUN De-cai. A MapReduce-Based Local Similarity Self-Join Algorithm[J]. Computer Technology and Development (计算机技术与发展), 2020(2): 88-93.
Authors: WANG Xiao-xia (王晓霞), SUN De-cai (孙德才)
Affiliation: School of Information Science and Technology, Bohai University, Jinzhou 121013, China
Funding: Youth Fund Project of Humanities and Social Sciences Research, Ministry of Education (15YJC870021); Young Scientists Fund of the National Natural Science Foundation of China (61602056); Natural Science Foundation of Liaoning Province (20170540015); Social Science Foundation of Liaoning Province (L18AXW001); Educational Science Research Project of Liaoning Province (L2015010)
Abstract: Local similarity self-join quickly finds all record pairs in a single given dataset that satisfy a similarity requirement, and it is widely used in data cleaning, gene sequence alignment, plagiarism detection, and related areas. To study parallel self-join over a single string set, a self-join algorithm based on the MapReduce framework is proposed that solves the locating problem of local similarity self-join. The algorithm follows a two-stage filter-verify scheme. In the filter stage, irrelevant-pair filtering and redundant-pair filtering discard a large number of invalid string pairs. In the verify stage, reserved terms carrying the content of the smaller-ID string are generated so that string IDs can be matched quickly with string contents. Experiments show that on large datasets the algorithm is consistently faster than LS-Join, a strong existing algorithm, and that it is well suited to local similarity self-join when the edit-distance parameter changes dynamically. The experiments also show that the techniques introduced in the algorithm effectively speed up local similarity self-join.

Keywords: similarity join; self-join; MapReduce; data cleaning; big data
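
The filter-verify pattern described in the abstract can be illustrated with a small single-process sketch. It is not the authors' algorithm. As stated assumptions, it simulates the map and reduce steps in plain Python, uses shared q-grams as a stand-in for the paper's irrelevant-pair filter, emits each pair once with the smaller ID first as a stand-in for redundant-pair filtering, and verifies with whole-string edit distance rather than local similarity. All function names and the parameters `q` and `tau` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def map_phase(records, q):
    """Map step: emit (q-gram, string ID) once per distinct q-gram."""
    for sid, text in records.items():
        grams = {text[i:i + q] for i in range(len(text) - q + 1)}
        for gram in grams:
            yield gram, sid


def reduce_phase(emitted):
    """Reduce step: IDs that share a q-gram become candidate pairs."""
    groups = defaultdict(set)
    for gram, sid in emitted:
        groups[gram].add(sid)
    candidates = set()
    for ids in groups.values():
        # Smaller ID first, stored in a set, so each pair appears only once.
        for small, large in combinations(sorted(ids), 2):
            candidates.add((small, large))
    return candidates


def self_join(records, q, tau):
    """Filter by shared q-grams, then verify with edit distance <= tau.

    Assumption: this filter can miss similar pairs when strings are too
    short to be guaranteed a shared q-gram; it is only an illustration.
    """
    candidates = reduce_phase(map_phase(records, q))
    return sorted(pair for pair in candidates
                  if edit_distance(records[pair[0]], records[pair[1]]) <= tau)


records = {1: "mapreduce", 2: "mapreduse", 3: "hadoop", 4: "hadop"}
result = self_join(records, q=3, tau=1)
print(result)
```

Because the threshold `tau` is only used in the verify step here, the candidate set could in principle be reused across different thresholds, although whether that mirrors the paper's handling of a dynamic edit-distance parameter is not stated in the abstract.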
This article is indexed in VIP (维普) and other databases.