首页 | 本学科首页   官方微博 | 高级检索  
     

双向过滤的字符串相似连接验证方法
引用本文:黄 樱,宋春花,牛保宁. 双向过滤的字符串相似连接验证方法[J]. 计算机工程与应用, 2017, 53(9): 72-79. DOI: 10.3778/j.issn.1002-8331.1512-0309
作者姓名:黄 樱  宋春花  牛保宁
作者单位:太原理工大学 计算机学院,太原 030024
摘    要:字符串相似连接是指在字符串集合中找出相似的字符串对,是许多应用的关键操作,寻找高效的字符串相似连接算法已成为研究热点。基于划分的过滤-验证方法(Pass-Join)与其他方法相比具有较高的效率。它按照字符串长度递增的顺序访问字符串集合,通过查找一个字符串的划分块是否存在于另一个字符串中,快速筛选出可能相似的字符串对(候选集),然后利用编辑距离进行相似性验证。研究发现,按照字符串长度递减的顺序进行过滤(长度递减过滤)的效果优于按照长度递增的顺序过滤(长度递增过滤)的效果,基于此,提出双向过滤-验证机制:在过滤阶段对长度递减过滤的结果再进行一次长度递增过滤,进一步减小候选集大小;在验证阶段利用双向过滤产生的两对划分块和其匹配子串分隔字符串对,从而减小需要验证的字符串的长度,加速验证过程。实验证明,双向过滤-验证算法在真实数据集上优于原算法。

关 键 词:字符串相似连接  双向过滤-验证机制  过滤-验证框架  

Verification method for string similarity joins based on bi-directional filtering
HUANG Ying,SONG Chunhua,NIU Baoning. Verification method for string similarity joins based on bi-directional filtering[J]. Computer Engineering and Applications, 2017, 53(9): 72-79. DOI: 10.3778/j.issn.1002-8331.1512-0309
Authors:HUANG Ying  SONG Chunhua  NIU Baoning
Affiliation:School of Computer Science and Technology, Taiyuan University of Technology, Taiyuan 030024, China
Abstract:A string similarity join finds similar string pairs from two sets of strings. It plays an important role in many real-world applications. Various algorithms have been proposed to address its efficiency issues. Partition-based filter-verification methods, such as Pass-Join, are promising, which quickly screens out possible similar string pairs(candidate set)by searching partitioned parts of a string in another string, in order of increasing length, and then performs similarity verification based on edit-distance. Motivated by the fact that the effect produced by filtering in the descending order of string length is better than in the ascending order, a novel bi-directional filtering-verification mechanism is proposed. At the filtering stage, it pipelines the results from length descending filtering to length ascending filtering to further reduce the size of the candidate set. At the verification stage, it makes use of the two pairs of matched substrings from the bi-directional filtering to partition the target string pairs into several short substring pairs to accelerate the verification process. Experimental results show that the proposed bi-directional filtering-verification algorithm outperforms the origin algorithm on real-world datasets.
Keywords:string similarity joins  bi-directional filtering-verification mechanism  filter-verification framework  
点击此处可从《计算机工程与应用》浏览原始摘要信息
点击此处可从《计算机工程与应用》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号