首页 | 本学科首页   官方微博 | 高级检索  
     


A sublinear algorithm for approximate keyword searching
Authors:E W Myers
Affiliation:(1) Department of Computer Science, University of Arizona, 85721 TucsonAZ, USA
Abstract:Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the ldquodatabaserdquoA is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N c for somec<1. The sequenceA must be overa. finite alphabet sgr. More precisely, our algorithm requires 0(DN pow(epsiv) logN) expected-time where epsiv=D/P is the maximum number of differences as a percentage of query length, and pow(epsiv) is an increasing and concave function that is 0 when epsiv=0. Thus the algorithm is superior to current O(DN) algorithms when epsiv is small enough to guarantee that pow(epsiv) < 1. As seen in the paper, this is true for a wide range of epsiv, e.g., epsiv. up to 33% for DNA sequences (¦Barwed¦=4) and 56% for proteins sequences (¦Barwed¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics.
Keywords:Approximate match  Dynamic programming  Index  word neighborhood
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号