首页 | 本学科首页   官方微博 | 高级检索  
     

基于参考集索引的高效序列相似性查找算法
引用本文:戴东波,熊赟,朱扬勇.基于参考集索引的高效序列相似性查找算法[J].软件学报,2010,21(4):718-731.
作者姓名:戴东波  熊赟  朱扬勇
作者单位:复旦大学,计算机科学技术学院,上海,200433
基金项目:Supported by the National Natural Science Foundation of China under Grant No.0573093 (国家自然科学基金); the National High-Tech Research and Development Plan of China under Grant No.2006AA02Z329 (国家高技术研究发展计划(863))
摘    要:序列数据在文本、Web访问日志文件、生物数据库中普遍存在,对其进行相似性查找是一种重要的获取和分析知识的手段.基于参考集索引技术是一类解决序列相似性查找的有效方法,主要思想是找到序列数据库中的少数序列作为参考集,通过参考集过滤掉数据库中与查询序列不相关的数据,从而高效地回答查询.在现有基于参考集索引技术的基础上,提出一种过滤能力更强的序列相似性查询算法IRI(improved reference indexing).首先,充分利用了先前的查询结果集来加速当前的查询,其次考虑了基于序列特征的上界和下界,使得应用参考集进行过滤的上下界更紧,过滤能力进一步加强.最后,为了避免候选集中费时的编辑距离计算,则只计算前缀序列间的编辑距离,从而进一步加速算法运行.实验采用真实的DNA序列和蛋白质序列数据,结果表明,算法IRI在查询性能上明显优于现有的基于参考集索引方法RI(reference indexing).

关 键 词:序列相似性查找  参考集索引  编辑距离
收稿时间:2008/9/29 0:00:00
修稿时间:2009/3/31 0:00:00

Efficient Algorithm for Sequence Similarity Search Based on Reference Indexing
DAI Dong-Bo,XIONG Yun and ZHU Yang-Yong.Efficient Algorithm for Sequence Similarity Search Based on Reference Indexing[J].Journal of Software,2010,21(4):718-731.
Authors:DAI Dong-Bo  XIONG Yun and ZHU Yang-Yong
Affiliation:DAI Dong-Bo,XIONG Yun,ZHU Yang-Yong (School of Computer Science , Technology,Fudan University,Shanghai 200433,China)
Abstract:Sequence data are ubiquitous in many domains such as text, Web access log and biological database. Similarity search in this kind of data is very important for knowledge acquisition and analysis. An indexing technique based on reference is an effective method for sequence similarity search, the main idea of which is to assign some sequences in database as reference sets, then filter out those sequences unrelated to query sequence and finally get the answer efficiently. This paper presents a similarity search algorithm IRI (improved reference indexing) which is based on current indexing technique using reference set and is more powerful in terms of filtration. First, previous query results are used to accelerate the current query. Then, the upper bound and lower bound based on sequence characteristic are proposed to make the bound tighter and improve the filtration capability. Finally, to avoid the time-consuming edit distance computing, only partial edit distance between prefix sequences need to compute, which makes the algorithm run faster. Real data including DNA and protein sequence data are used in the experiment. Comprehensive experimental results show that IRI is more efficient than the current reference-based indexing algorithm RI (reference indexing).
Keywords:sequence similarity search  reference indexing  edit distance
本文献已被 CNKI 万方数据 等数据库收录!
点击此处可从《软件学报》浏览原始摘要信息
点击此处可从《软件学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号