A sublinear algorithm for approximate keyword searching |
| |
Authors: | E W Myers |
| |
Affiliation: | (1) Department of Computer Science, University of Arizona, 85721 TucsonAZ, USA |
| |
Abstract: | Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the database A is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N
c
for somec<1. The sequenceA must be overa. finite alphabet . More precisely, our algorithm requires 0(DN
pow( )
logN) expected-time where =D/P is the maximum number of differences as a percentage of query length, and pow( ) is an increasing and concave function that is 0 when =0. Thus the algorithm is superior to current O(DN) algorithms when is small enough to guarantee that pow( ) < 1. As seen in the paper, this is true for a wide range of , e.g., . up to 33% for DNA sequences (¦ ¦=4) and 56% for proteins sequences (¦ ¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics. |
| |
Keywords: | Approximate match Dynamic programming Index word neighborhood |
本文献已被 SpringerLink 等数据库收录! |
|