A sublinear algorithm for approximate keyword searching期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

A sublinear algorithm for approximate keyword searching

Authors:	E W Myers

Affiliation:	(1) Department of Computer Science, University of Arizona, 85721 TucsonAZ, USA

Abstract:	Given a relatively short query stringW of lengthP, a long subject stringA of lengthN, and a thresholdD, theapproximate keyword search problem is to find all substrings ofA that align withW with not more than D insertions, deletions, and mismatches. In typical applications, such as searching a DNA sequence database, the size of the databaseA is much larger than that of the queryW, e.g.,N is on the order of millions or billions andP is a hundred to a thousand. In this paper we present an algorithm that given a precomputedindex of the databaseA, finds rare matches in time that issublinear inN, i.e.,N ^c for somec<1. The sequenceA must be overa. finite alphabet . More precisely, our algorithm requires 0(DN^pow() logN) expected-time where =D/P is the maximum number of differences as a percentage of query length, and pow() is an increasing and concave function that is 0 when =0. Thus the algorithm is superior to current O(DN) algorithms when is small enough to guarantee that pow() < 1. As seen in the paper, this is true for a wide range of , e.g., . up to 33% for DNA sequences (¦¦=4) and 56% for proteins sequences (¦¦=20). In preliminary practical experiments, the approach gives a 50-to 500-fold improvement over previous algorithms for prolems of interest in molecular biology.This work was supported in part by the National Institutes of Health under Grant R01 LM04960-01 and the Aspen Center for Physics.

Keywords:	Approximate match Dynamic programming Index word neighborhood
本文献已被 SpringerLink 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏