The approximate string matching problem is, given a text string, a pattern string, and an integer k, to find in the text all approximate occurrences of the pattern. An approximate occurrence is a substring of the text with edit distance at most k from the pattern. We give a new O(kn) algorithm for this problem, where n is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.
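To make the problem statement concrete, here is a minimal Python sketch of the classical column-by-column dynamic programming solution, which runs in O(mn) time for a pattern of length m. It is not the O(kn) suffix-automaton algorithm the abstract describes; the function name `approx_occurrences` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_occurrences(text: str, pattern: str, k: int) -> list:
    """Return the 1-based end positions j such that some substring of
    `text` ending at j has edit distance at most k from `pattern`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value of row i-1 in the previous column
        col[0] = 0           # an occurrence may start at any text position
        for i in range(1, m + 1):
            cur = col[i]     # value of row i in the previous column
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match / substitution
                         cur + 1,                            # extra text character
                         col[i - 1] + 1)                     # skipped pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_occurrences("abcdefg", "cde", 1))
```

Only one column of the edit distance table is kept at a time, so the sketch uses O(m) space regardless of the text length.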

We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) (1999) 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (m − k)(k + 2) bits of space to represent an NFA with (m − k)(k + 1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (m − k)(k + 1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over and extends the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.

Multiple filtration and approximate pattern matching
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching. This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institute of Health under Grant No. GM-36230.
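The two-stage idea (a cheap filter that preselects candidate pairs, then an exact comparison) can be sketched in Python as follows. The filter used here is the simple pigeonhole partition into k + 1 blocks, which is a swapped-in stand-in and not the multiple-hashing filter of the paper; the names `hamming` and `k_mismatch_tuples` are hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_mismatch_tuples(text: str, query: str, m: int, k: int) -> list:
    """All pairs (i, j) with hamming(text[i:i+m], query[j:j+m]) <= k."""
    # Stage 1 (filter): cut every m-tuple into k + 1 blocks. Two tuples with
    # at most k mismatches must agree exactly on at least one block.
    bounds = [(b * m // (k + 1), (b + 1) * m // (k + 1)) for b in range(k + 1)]
    index = defaultdict(list)
    for j in range(len(query) - m + 1):
        for b, (lo, hi) in enumerate(bounds):
            index[(b, query[j + lo:j + hi])].append(j)
    hits = set()
    for i in range(len(text) - m + 1):
        for b, (lo, hi) in enumerate(bounds):
            for j in index.get((b, text[i + lo:i + hi]), ()):
                # Stage 2 (verification): exact comparison of the candidates.
                if hamming(text[i:i + m], query[j:j + m]) <= k:
                    hits.add((i, j))
    return sorted(hits)


print(k_mismatch_tuples("ACGTACGT", "ACCT", 4, 1))
```

Positions are 0-based in this sketch. The filter never discards a true pair, so the verified output equals what a brute-force comparison of all pairs would return.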

In molecular biology, it is said that two biological sequences tend to have similar properties if they have similar three-dimensional structures. Hence, it is very important to find not only similar sequences in the string sense, but also structurally similar sequences from databases. In this paper we propose a new data structure that is a generalization of a parameterized suffix tree (p-suffix tree for short) introduced by Baker. We call it the structural suffix tree or s-suffix tree for short. The s-suffix tree can be used for finding structurally related patterns of RNA or single-stranded DNA. Furthermore, we propose an O(n(log|Σ| + log|Π|)) on-line algorithm for constructing it, where n is the sequence length, |Σ| is the size of the normal alphabet, and |Π| is that of the alphabet called parameter, which is related to the structure of the sequence. Our algorithm achieves linear time when it is used to analyze RNA and DNA sequences. Furthermore, as an algorithm for constructing the p-suffix tree, it is the first on-line algorithm, though the computing bound of our algorithm is the same as that of Kosaraju's best-known algorithm. The results of computational experiments using actual RNA and DNA sequences are also given to demonstrate our algorithm's practicality.

A subquadratic algorithm for approximate limited expression matching
In this paper we present an efficient subquadratic-time algorithm for matching strings and limited expressions in large texts. Limited expressions are a subset of regular expressions that appear often in practice. The generalization from simple strings to limited expressions has a negligible effect on the speed of our algorithm, yet allows much more flexibility. Our algorithm is similar in spirit to that of Masek and Paterson [MP], but it is much faster in practice. Our experiments show a factor of four to five speedup against the algorithms of Sellers [Se] and Ukkonen [Uk1] independent of the sizes of the input strings. Experiments also reveal our algorithm to be faster, in most cases, than a recent improvement by Chang and Lampe [CL2], especially for small alphabet sizes for which it is two to three times faster. The research of U. Manber was supported in part by a Presidential Young Investigator Award DCR-8451397, with matching funds from AT&T, and by NSF Grant CCR-9001619. G. Myers' research was supported in part by NIH Grant LM04960, NSF Grant CCR-9001619, and the Aspen Center for Physics.

Alden H. Wright 《Software》1994,24(4):337-362
Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64-bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.

There is no known algorithm that solves the general case of the approximate edit distance problem, where the edit operations are insertion, deletion, mismatch, and swap, in time o(nm), where n is the length of the text and m is the length of the pattern. In the effort to study this problem, the edit operations have been analyzed independently. Karloff [10] showed an algorithm that approximates the edit distance problem with only the mismatch operation in time . Amir et al. [4] showed that if the only edit operations allowed are swap and mismatch, then the exact edit distance problem can be solved in time . In this paper, we discuss the problem of approximate edit distance with swap and mismatch. We show a randomized time algorithm for the problem. The algorithm guarantees an approximation factor of (1+ε) with probability of at least .

Ken-Chih Liu 《Software》1986,16(6):541-548
This paper presents an extension of Pascal with string pattern matching. Pattern definitions are built using six basic operations: alternation, concatenation, immediate value assignment, intersection, difference and complement. The last three have not been previously implemented and they increase the expressive power beyond context-free languages. The pattern matching actions are augmented with three options: trace, prefix and suffix. Comparisons with a SNOBOL4 implementation are also presented. This experiment demonstrates that Pascal with pattern matching is a useful tool for string processing applications.

Given a textstring x of length n, the Minimal Augmented Suffix Tree T(x) of x is a digital-search index that returns, for any query string w and in a number of comparisons bounded by the length of w, the maximum number of nonoverlapping occurrences of w in x. It is shown that, denoting the length of x by n, T(x) can be built in time O(n log² n) and space O(n log n), off-line on a RAM. This research was supported in part, through the Leonardo Fibonacci Institute, by the Istituto Trentino di Cultura, Trento, Italy. Additional support was provided by NSF Grants CCR-8900305 and CCR-9201078, by NATO Grant CRG 900293, by the National Research Council of Italy, and by the ESPRIT III Basic Research Programme of the EC under Contract No. 9072 (Project GEPPCOM). Additional support was provided by NSF Grant CCR-91-96176 and ONR Contract N 00014-91-J-4052, ARPA Order 2225.
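The quantity that the index returns can be illustrated without any index at all. The Python sketch below computes the maximum number of nonoverlapping occurrences of w in x by a greedy left-to-right scan, which takes time that grows with the length of x per query, whereas the tree answers in time bounded by the length of w. The function name `max_nonoverlapping` is an assumption for illustration.

```python
def max_nonoverlapping(x: str, w: str) -> int:
    """Maximum number of pairwise nonoverlapping occurrences of w in x."""
    if not w:
        raise ValueError("the query string must be non-empty")
    count = 0
    pos = x.find(w)
    while pos != -1:
        count += 1
        # Taking the leftmost occurrence and resuming right after it is
        # optimal because all occurrences have the same length.
        pos = x.find(w, pos + len(w))
    return count


print(max_nonoverlapping("abababa", "aba"))
```

In "abababa" the string "aba" occurs three times when overlaps are allowed, but at most two of those occurrences are pairwise disjoint.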

Suffix arrays are a key data structure for solving a run of problems on texts and sequences, from data compression and information retrieval to biological sequence analysis and pattern discovery. In their simplest version, they can just be seen as a permutation of the elements in {1,2,…,n}, encoding the sorted sequence of suffixes from a given text of length n, under the lexicographic order. Yet, they are on a par with ubiquitous and sophisticated suffix trees. Over the years, many interesting combinatorial properties have been devised for this special class of permutations: for instance, they can implicitly encode extra information, and they are a well characterized subset of the n! permutations. This paper gives a short tutorial on suffix arrays and their compressed version to explore and review some of their algorithmic features, discussing the space issues related to their usage in text indexing, combinatorial pattern matching, and data compression.
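As a small illustration of the "sorted permutation of suffixes" view, the Python sketch below builds a suffix array by directly sorting the suffixes and then locates a pattern with two binary searches. The construction is deliberately naive (far from the efficient or compressed constructions a tutorial would cover), positions are 0-based rather than the 1-based {1,2,…,n} used in the abstract, and the function names are hypothetical.

```python
def suffix_array(text: str) -> list:
    """Start positions of all suffixes of `text`, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list, pattern: str) -> list:
    """All start positions of `pattern` in `text`, found via the suffix array."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sa = suffix_array("banana")
print(sa)
print(find_occurrences("banana", sa, "ana"))
```

All suffixes that begin with the pattern occupy one contiguous range of the array, which is why two binary searches are enough to find every occurrence.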

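Existing approximate pattern matching algorithms can handle only the substitution operation when solving the problem of Approximate Pattern Matching with Wildcards and Length constraints (APMWL). To address this, a method for constructing an Edit Distance Matrix (EDM) based on dynamic programming is proposed, and an EDM-based approximate pattern matching algorithm, APM, is designed that handles all three edit operations in approximate matching: insertion, substitution, and deletion. In addition, according to the constraint on whether characters in the text may be reused, the APM-OF algorithm is designed. Experimental results show that APM and APM-OF have significant advantages over comparable algorithms: compared with the Sail_Approx matching algorithm, the average growth rates in the number of solutions obtained reach 8.34% and 12.37%, respectively; when APM-OF is applied to pattern mining, the number of frequent approximate patterns mined is 2.07 times that of the OneoffMining algorithm.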

Given a set W of k sequences and a tree T with k leaves labeled with a unique sequence in W, a label tree is used to assign a sequence label to each internal node of the tree T. The cost of a tree edge is defined as the distance, such as the Hamming distance or the Levenshtein (edit) distance, between the sequence labels of a pair of nodes in the edge. The bottleneck edge of a label tree is the edge with the maximum cost in the label tree. The bottleneck tree alignment problem is concerned with the determination of a label tree with minimum cost of the bottleneck edge. A lifted label tree is a label tree in which the sequence label of each internal node is equal to the sequence label of some child of the node. The bottleneck lifted tree alignment problem involves the minimization of cost of the bottleneck edge among all possible lifted label trees of the tree T. This paper shows that when the distance function is a metric, the bottleneck tree alignment problem is NP-complete even when the tree structure resembles a binary tree. For special cases, an exact algorithm is used to solve the bottleneck lifted tree alignment problem in polynomial time. A 2(h-1)-approximation algorithm is used to solve the bottleneck tree alignment problem, where h is the height of the tree T. It is observed that the bound is existentially tight. Finally, this paper shows that any lifted label tree is an optimal solution to the bottleneck tree alignment problem if the distance function is an ultrametric.
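The definitions above (edge cost, bottleneck edge, lifted label tree) can be made concrete with a short Python sketch. It only evaluates a given labeling under the Hamming distance; it does not implement the paper's exact or approximation algorithms. The tree, the node names, and the helper functions are all made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def bottleneck_cost(edges, labels, dist=hamming) -> int:
    """Cost of the most expensive edge of a label tree.

    edges  -- list of (parent, child) node pairs
    labels -- dict mapping every node to its sequence label
    """
    return max(dist(labels[u], labels[v]) for u, v in edges)


def is_lifted(edges, labels) -> bool:
    """True if every internal node carries the label of one of its children."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    return all(any(labels[p] == labels[c] for c in kids)
               for p, kids in children.items())


# Hypothetical tree: root r has children a and leaf l3; a has leaves l1, l2.
edges = [("r", "a"), ("r", "l3"), ("a", "l1"), ("a", "l2")]
leaves = {"l1": "ACGT", "l2": "ACGA", "l3": "TCGA"}
lift_l1 = {**leaves, "a": "ACGT", "r": "ACGT"}   # a and r lifted from l1
lift_l2 = {**leaves, "a": "ACGA", "r": "ACGA"}   # a and r lifted from l2
print(bottleneck_cost(edges, lift_l1), bottleneck_cost(edges, lift_l2))
```

Both labelings are lifted, yet they differ in bottleneck cost, which is the quantity the bottleneck lifted tree alignment problem minimizes over all lifted labelings.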

By extending the previously proposed geometric branch-and-bound algorithm with bounded alignment for point pattern matching, the paper presents the development and evaluation of a new and fast algorithm for image registration based on line segments. Using synthetically generated data sets with randomly distributed line segments and hard test cases with highly symmetric line patterns, as well as real remote sensing images, the developed algorithm is shown to be computationally fast, highly robust, and capable of handling severely corrupted data sets with considerable line segment position errors as well as significantly fragmented and spurious line segments in the images to be matched.

Y. Liu   《Knowledge》2006,19(8):719-729
Similarity measuring is a substantial part of self-organizing maps (SOM) because of its direct influence on the mapping results. The commonly used similarity measure, Euclidean distance, cannot always express the exact similarity. In this paper, a novel adaptive self-organized map based on bidirectional approximate reasoning (ASOMBAR) is proposed to improve the competitive and cooperative process based on the similarity measuring. Unlike the SOM, the proposed ASOMBAR employs the novel fuzzy similarity distance and fuzzy matching criterion to replace the Euclidean distance and original matching criterion, respectively. The fuzzy similarity distance describes the similarity relation more precisely than the Euclidean distance does. The fuzzy matching criterion pays more attention to the large weighted elements and places less emphasis on the small weighted elements. Moreover, the new compatible topological neighborhood is also modified based on the new fuzzy similarity distance and fuzzy matching criterion. Since the ASOMBAR network is self-organizing, the weights of the networks change adaptively according to the input changes. Compared with the well-known growing neural gas (GNG) and SOM, ASOMBAR (when λ  0.5) converges quicker to a smaller distortion error. An information filtering example is used to show the effectiveness of ASOMBAR.

Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements, and reverses their order. For this problem, we develop two algorithms: a greedy approximation algorithm, that finds a solution provably close to optimal in O(n²) time and O(n) space for n-element permutations, and a branch-and-bound exact algorithm, that finds an optimal solution in O(mL(n, n)) time and O(n²) space, where m is the size of the branch-and-bound search tree, and L(n, n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch-and-bound algorithm are a novel application of maximum-weight matchings, shortest paths, and linear programming. In a series of experiments, we study the performance of an implementation on random permutations, and permutations generated by random reversals. For permutations differing by k random reversals, we find that the average upper bound on reversal distance estimates k to within one reversal for k < (1/2)n and n < 100. For the difficult case of random permutations, we find that the average difference between the upper and lower bounds is less than three reversals for n < 50. Due to the tightness of these bounds, we can solve, to optimality, problems on 30 elements in a few minutes of computer time. This approaches the scale of mitochondrial genomes. This research was supported by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada, and the Fonds pour la formation de chercheurs et l'aide à la recherche (Québec). The author is a fellow of the Canadian Institute for Advanced Research.
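To illustrate what a reversal and the reversal distance are, the Python sketch below applies reversals to tuples, counts breakpoints, and finds the exact distance to the identity by breadth-first search. This is neither the greedy approximation nor the branch-and-bound algorithm of the abstract: the search is exponential and only usable for very small n. The function names, and the assumption that a permutation is a tuple of the integers 1..n, are choices made for this illustration.

```python
from collections import deque


def reverse(perm: tuple, i: int, j: int) -> tuple:
    """Reverse the segment perm[i..j] (inclusive, 0-based)."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def breakpoints(perm: tuple) -> int:
    """Adjacent pairs of 0, perm, n+1 whose values are not consecutive."""
    ext = (0,) + tuple(perm) + (len(perm) + 1,)
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))


def reversal_distance(perm) -> int:
    """Minimum number of reversals that sort perm (brute-force BFS)."""
    start, goal = tuple(perm), tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p == goal:
            return dist[p]
        for i in range(len(p)):
            for j in range(i + 1, len(p)):
                q = reverse(p, i, j)
                if q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)
    raise AssertionError("unreachable: every permutation can be sorted")


p = (3, 1, 2)
print(breakpoints(p), reversal_distance(p))
```

A single reversal can remove at most two breakpoints, so half the breakpoint count (rounded up) is a simple lower bound on the distance; for (3, 1, 2) that bound is met exactly.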

