20 similar documents found.
1.
There is no known algorithm that solves the general case of the approximate edit distance problem, where the edit operations are insertion, deletion, mismatch, and swap, in time o(nm), where n is the length of the text and m is the length of the pattern. In the effort to study this problem, the edit operations have been analyzed independently. Karloff [10] showed an algorithm that approximates the edit distance problem with only the mismatch operation in time . Amir et al. [4] showed that if the only edit operations allowed are swap and mismatch, then the exact edit distance problem can be solved in time . In this paper, we discuss the problem of approximate edit distance with swap and mismatch. We show a randomized time algorithm for the problem. The algorithm guarantees an approximation factor of (1+ε) with probability of at least .
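The exact variant of this swap-and-mismatch distance can be computed naively with the restricted Damerau-Levenshtein dynamic program. A minimal O(nm) sketch under unit-cost assumptions — not the paper's faster randomized algorithm:

```python
def swap_mismatch_edit_distance(s, t):
    """Edit distance with insertion, deletion, mismatch (substitution),
    and swap of two adjacent symbols, each at unit cost (restricted
    Damerau-Levenshtein dynamic program, O(nm) time and space)."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(m + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / mismatch
            # swap of two adjacent symbols
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

For example, `swap_mismatch_edit_distance("ab", "ba")` is 1 (one swap), where plain Levenshtein distance would be 2.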
2.
Graph matching and graph edit distance have become important tools in structural pattern recognition. The graph edit distance concept allows us to measure the structural similarity of attributed graphs in an error-tolerant way. The key idea is to model graph variations by structural distortion operations. As one of its main constraints, however, the edit distance requires the adequate definition of edit cost functions, which eventually determine which graphs are considered similar. In the past, these cost functions were usually defined in a manual fashion, which is highly prone to errors. The present paper proposes a method to automatically learn cost functions from a labeled sample set of graphs. To this end, we formulate the graph edit process in a stochastic context and perform a maximum likelihood parameter estimation of the distribution of edit operations. The underlying distortion model is learned using an Expectation Maximization algorithm. From this model we finally derive the desired cost functions. In a series of experiments we demonstrate the learning effect of the proposed method and provide a performance comparison to other models.
3.
Kaizhong Zhang 《Algorithmica》1996,15(3):205-222
This paper considers the problem of computing a constrained edit distance between unordered labeled trees. The problem of approximate unordered tree matching is also considered. We present dynamic programming algorithms solving these problems in sequential time O(|T1| × |T2| × (deg(T1) + deg(T2)) × log2(deg(T1) + deg(T2))). Our previous result shows that computing the edit distance between unordered labeled trees is NP-complete. This research was supported by the Natural Sciences and Engineering Research Council of Canada under Grant No. OGP0046373.
4.
Tatsuya Akutsu 《Information Processing Letters》2006,100(3):105-109
We consider a relationship between the unit cost edit distance for two rooted ordered trees and the unit cost edit distance for the corresponding Euler strings. We show that the edit distance between trees is at least half of the edit distance between the Euler strings and is at most 2h+1 times the edit distance between the Euler strings, where h is the minimum height of two trees. The result can be extended for more general cost functions.
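For illustration, the Euler string of a rooted ordered tree can be produced by a depth-first traversal that emits an opening symbol when a node is entered and a matching closing symbol when it is left. A minimal sketch, with trees given as (label, children) pairs and the closing symbol written as the uppercase label — a notational assumption for readability, not the paper's exact alphabet:

```python
def euler_string(tree):
    """Euler string of a rooted ordered tree: depth-first traversal that
    emits the node label on entry and a matching 'close' symbol on exit
    (here: the uppercase label), giving a balanced, parenthesis-like
    string encoding of the tree."""
    label, children = tree
    out = [label]                     # enter the node
    for child in children:
        out.extend(euler_string(child))
    out.append(label.upper())         # leave the node
    return out
```

For the tree with root `a` and children `b`, `c`, this yields `['a', 'b', 'B', 'c', 'C', 'A']`; comparing two trees' Euler strings with ordinary string edit distance then approximates their tree edit distance within the stated factors.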
5.
The greedy algorithm for edit distance with moves
6.
Minghui Jiang 《Theory of Computing Systems》2009,44(3):349-355
The Hamming distance with shifts was introduced by Bookstein et al. as a generalization of the traditional Hamming distance to allow a tunable degree of fuzziness when comparing two binary sequences of the same length. We present a linear-time algorithm for computing this distance. The previous best time bound was quadratic.
7.
This paper describes a new method for quantifying the regularity of contours and comparing them (when encoded by Freeman chain codes) in terms of a similarity criterion that relies on information gathered from Levenshtein edit distance computation. The criterion allows subsequences to be extracted from the minimal-cost edit sequence that specify alignments of similar contour segments. Two external parameters adjust the similarity criterion. The information about each similar part is encoded by strings that represent an average contour region. How to construct a prototype based on the identified regularities is also explained. The reliability of the prototypes is evaluated by replacing contour groups (samples) with new prototypes used as the training set in a classification task. In this way, the size of the data set can be reduced without noticeably affecting its representational power for classification purposes. Experimental results show that this scheme achieves a reduction of about 80% in the size of the training data set while the classification error increases by only 0.45% in one of the three data sets studied.
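A Freeman chain code represents a contour as the sequence of directions between consecutive 8-connected grid points; edit distance is then computed on these direction strings. A minimal encoder sketch, using the common convention 0 = east, numbering counter-clockwise (an assumption — papers differ on the numbering):

```python
# 8-directional Freeman codes: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE
FREEMAN = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
           (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def freeman_chain_code(points):
    """Encode a contour, given as a list of (x, y) grid points where
    consecutive points are 8-connected, as its Freeman chain code
    (one direction code per step between consecutive points)."""
    return [FREEMAN[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
```

A closed unit square traversed counter-clockwise, `[(0,0), (1,0), (1,1), (0,1), (0,0)]`, encodes as `[0, 2, 4, 6]`; two contours can then be compared by running Levenshtein edit distance on such code sequences.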
8.
Dekel Tsur 《Information Processing Letters》2008,108(4):251-254
The guided tree edit distance problem is to find a minimum cost series of edit operations that transforms two input forests F and G into isomorphic forests F′ and G′ such that a third input forest H is included in F′ (and G′). The edit operations are relabeling a vertex and deleting a vertex. We show efficient algorithms for this problem that are faster than the previous algorithm for this problem of Peng and Ting [Z. Peng, H. Ting, Guided forest edit distance: Better structure comparisons by using domain-knowledge, in: Proc. 18th Symposium on Combinatorial Pattern Matching (CPM), 2007, pp. 28-39].
9.
Finding a sequence of edit operations that transforms one string of symbols into another with the minimum cost is a well-known problem. The minimum cost, or edit distance, is a widely used measure of the similarity of two strings. An important parameter of this problem is the cost function, which specifies the cost of each insertion, deletion, and substitution. We show that cost functions having the same ratio of the sum of the insertion and deletion costs divided by the substitution cost yield the same minimum cost sequences of edit operations. This leads to a partitioning of the universe of cost functions into equivalence classes. Also, we show the relationship between a particular set of cost functions and the longest common subsequence of the input strings. This work was supported in part by the U.S. Department of Defense and the U.S. Department of Energy.
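A cost-parameterized edit distance makes the claim concrete: with insertion cost ci, deletion cost cd, and substitution cost cs (matches free), only the ratio (ci+cd)/cs determines which edit sequences are optimal, and once cs ≥ ci+cd a substitution can always be replaced by a deletion plus an insertion, linking the distance to the longest common subsequence via dist = n + m − 2·LCS when ci = cd = 1, cs ≥ 2. A minimal DP sketch of the parameterized distance (not the paper's equivalence-class construction):

```python
def weighted_edit_distance(s, t, ci, cd, cs):
    """Minimum-cost edit sequence transforming s into t, with insertion
    cost ci, deletion cost cd, and substitution cost cs; matching
    symbols cost 0. Standard O(nm) dynamic program."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cd           # delete prefix of s
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ci           # insert prefix of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else cs)
            d[i][j] = min(d[i - 1][j] + cd,  # deletion
                          d[i][j - 1] + ci,  # insertion
                          sub)               # match / substitution
    return d[n][m]
```

With unit costs this is the ordinary edit distance; with cs = 2 it reduces to the indel distance n + m − 2·LCS, illustrating the LCS connection mentioned above.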
10.
Silvia García-Díez, François Fouss 《Pattern recognition》2011,44(6):1172-1182
This paper introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences, and extends related previous work from bioinformatics to the case of graphs with cycles. Each alignment ℘, with a total cost C(℘), is assigned a probability of occurrence P(℘)=exp[−θC(℘)]/Z, where Z is a normalization factor. Therefore, good alignments (having a low cost) are favored over bad alignments (having a high cost). The expected cost ∑℘∈P C(℘)exp[−θC(℘)]/Z computed over all possible alignments ℘∈P defines the SoP edit distance. When θ→∞, only the best alignments matter and the measure reduces to the standard edit distance. The rationale behind this definition is the following: for some applications, two sequences sharing many good alignments should be considered more similar than two sequences having only one single good, optimal, alignment in common. In other words, sub-optimal alignments can also be taken into account. Forward/backward recurrences for efficiently computing the expected cost are developed. Virtually any Viterbi-like sequence comparison algorithm computed on a lattice can be generalized in the same way; for instance, a SoP longest common subsequence is also developed. Pattern classification tasks performed on five data sets show that the new measures usually outperform the standard ones and, in any case, never perform significantly worse, at the expense of tuning the parameter θ.
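The softmin behaviour can be sketched with the forward recurrence alone: summing exp[−θC(℘)] over all alignment paths in the edit lattice gives a partition function Z, and −(1/θ) ln Z lower-bounds the standard edit distance and converges to it as θ→∞. The sketch below computes this log-partition "free energy" rather than the paper's expected cost, under unit-cost assumptions:

```python
import math

def sop_free_energy(s, t, theta, ci=1.0, cd=1.0, cs=1.0):
    """Compute -(1/theta) * ln Z, where Z sums exp(-theta * C(path)) over
    all alignment paths of s and t in the edit lattice (insertion cost ci,
    deletion cost cd, substitution cost cs, matches free). This softmin
    is at most the standard edit distance and tends to it as theta grows."""
    n, m = len(s), len(t)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0                         # the empty alignment
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            z = 0.0
            if i > 0:                     # path arriving via a deletion
                z += Z[i - 1][j] * math.exp(-theta * cd)
            if j > 0:                     # path arriving via an insertion
                z += Z[i][j - 1] * math.exp(-theta * ci)
            if i > 0 and j > 0:           # path arriving via match/substitution
                c = 0.0 if s[i - 1] == t[j - 1] else cs
                z += Z[i - 1][j - 1] * math.exp(-theta * c)
            Z[i][j] = z
    return -math.log(Z[n][m]) / theta
```

For "kitten"/"sitting" (edit distance 3), the value stays below 3 and approaches 3 as θ grows, since the low-cost alignments dominate the sum.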
11.
In this paper we present an efficient subquadratic-time algorithm for matching strings and limited expressions in large texts. Limited expressions are a subset of regular expressions that appear often in practice. The generalization from simple strings to limited expressions has a negligible effect on the speed of our algorithm, yet allows much more flexibility. Our algorithm is similar in spirit to that of Masek and Paterson [MP], but it is much faster in practice. Our experiments show a factor of four to five speedup over the algorithms of Sellers [Se] and Ukkonen [Uk1] independent of the sizes of the input strings. Experiments also reveal our algorithm to be faster, in most cases, than a recent improvement by Chang and Lampe [CL2], especially for small alphabet sizes, for which it is two to three times faster. The research of U. Manber was supported in part by a Presidential Young Investigator Award DCR-8451397, with matching funds from AT&T, and by NSF Grant CCR-9001619. G. Myers's research was supported in part by NIH Grant LM04960, NSF Grant CCR-9001619, and the Aspen Center for Physics.
12.
Heikki Hyyrö 《Information Processing Letters》2008,108(5):313-319
We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) (1999) 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings—Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (m−k)(k+2) bits of space to represent an NFA with (m−k)(k+1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (m−k)(k+1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.
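The k = 0 special case of such a bit-parallel NFA is the classic Shift-And algorithm, where a single NFA row fits in one machine word with exactly one bit per state. A minimal sketch for exact matching, with Python integers standing in for machine words (an illustration of the bit-parallel idea, not the BPD variant itself):

```python
def shift_and_search(text, pattern):
    """Bit-parallel Shift-And exact string matching: one bit per NFA
    state; bit i of D is set iff pattern[:i+1] matches a suffix of the
    text read so far. Returns 0-based end positions of occurrences."""
    m = len(pattern)
    # Precompute one bitmask per character: bit i set iff pattern[i] == c.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    D, hits = 0, []
    for pos, c in enumerate(text):
        # Shift in a fresh initial-state bit, then keep only states whose
        # pattern character matches the current text character.
        D = ((D << 1) | 1) & mask.get(c, 0)
        if D & (1 << (m - 1)):           # final NFA state reached
            hits.append(pos)
    return hits
```

For example, searching "abra" in "abracadabra" reports end positions 3 and 10. The approximate-matching NFAs discussed above extend this single row to k+1 rows, one per allowed error count.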
13.
An Efficient Technique for Computing the Edit Distance and the Edit Path
邹旭楷 (Zou Xukai) 《小型微型计算机系统》(Mini-Micro Systems)1996,17(7):72-76
Given strings T and P, a sequence of insertions, deletions, and substitutions that transforms T into P is called an edit path from T to P; the total number of insertions, deletions, and substitutions on a shortest edit path is the edit distance from T to P. This paper proposes an efficient technique for computing the edit distance and the edit path: the edit distance is first computed by an efficient approximate string matching algorithm, and the edit path is then recovered correctly using only simple binary bitwise operations.
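The textbook way to recover an edit path is to backtrack through the edit-distance DP table. A minimal sketch of that baseline (plain O(|T||P|) DP with backtracking, not the paper's bitwise technique; the operation tuples and 0-based positions are illustrative conventions):

```python
def edit_path(T, P):
    """Return (edit distance, one minimal edit path) from T to P.
    The path is a list of ('sub', i, c), ('del', i), ('ins', i, c)
    operations, with i a 0-based index into T."""
    n, m = len(T), len(P)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if T[i - 1] == P[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # Backtrack from (n, m) to (0, 0), recording non-match operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if T[i - 1] == P[j - 1] else 1):
            if T[i - 1] != P[j - 1]:
                ops.append(('sub', i - 1, P[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(('del', i - 1))
            i -= 1
        else:
            ops.append(('ins', i, P[j - 1]))
            j -= 1
    return d[n][m], ops[::-1]
```

The number of recorded operations always equals the edit distance, since matches cost nothing and every other step costs one.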
14.
It has long been known that pattern matching in the Hamming distance metric can be done in time, where n is the length of the text, m is the length of the pattern, and Σ is the alphabet. The classic algorithm for this is due to Abrahamson and Kosaraju. This paper considers the following generalization, motivated by the situation where the entries in the text and pattern are analog, or distorted by additive noise, or imprecisely given for some other reason: in any alignment of the pattern with the text, two aligned symbols a and b contribute +1 to the similarity score if they differ by no more than a given threshold θ, otherwise they contribute zero. We give an time algorithm for this more general version of the problem; the classic Hamming distance matching problem is the special case of θ=0.
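The problem statement can be made concrete with a naive O(nm) scorer, which the paper's algorithm improves on; a minimal sketch over numeric sequences (the list-of-numbers input format is an assumption for illustration):

```python
def threshold_scores(text, pattern, theta):
    """For every alignment of pattern against text, count the aligned
    pairs whose values differ by at most theta. Naive O(nm) version;
    theta = 0 recovers Hamming-distance matching (score = m - mismatches)."""
    n, m = len(text), len(pattern)
    return [sum(1 for j in range(m) if abs(text[i + j] - pattern[j]) <= theta)
            for i in range(n - m + 1)]
```

For example, `threshold_scores([1, 5, 3, 4], [1, 4], 1)` gives `[2, 1, 1]`: at offset 0 both pairs are within 1, at offsets 1 and 2 only one pair is.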
15.
We study a recent algorithm for fast on-line approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious.
The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work.
This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.
Received March 31, 1998; revised November 18, 1998.
16.
Let X and Y be two run-length encoded strings, of encoded lengths k and l, respectively. We present a simple O(|X|l+|Y|k) time algorithm that computes their edit distance.
17.
Alden H. Wright 《Software》1994,24(4):337-362
Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
18.
This paper presents a novel pattern matching technique that is robust to illumination changes and occlusion. The technique is based on matching gradient orientations in place of traditional image features such as intensities or gradients. Gradient orientations depend on the texture in an image and are known to be insensitive to changes of image intensities that are often caused by time-varying illumination or the auto-gain control (AGC) function of the camera. Moreover, the proposed method employs a voting strategy in the process of matching gradient orientations. The method works remarkably well even when a large part of the pattern is occluded by a foreign object. Consequently, the proposed method is robust to both irregular lighting conditions and occlusion.
19.
20.
The approximate string matching problem is, given a text string, a pattern string, and an integer k, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at most k from the pattern. We give a new O(kn) algorithm for this problem, where n is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.
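For contrast with the O(kn) result, the textbook baseline is Sellers' O(mn) column dynamic program, which already reports all approximate-occurrence end positions; a minimal sketch under unit-cost assumptions (not the suffix-automaton algorithm of the paper):

```python
def k_differences(text, pattern, k):
    """Sellers' algorithm: report 0-based end positions in text where
    some substring has edit distance at most k from pattern. One DP
    column per text character; col[0] stays 0 because an occurrence may
    start anywhere in the text."""
    m = len(pattern)
    col = list(range(m + 1))
    hits = []
    for pos, c in enumerate(text):
        prev, col[0] = col[0], 0           # prev holds the old diagonal cell
        for i in range(1, m + 1):
            cur = min(col[i] + 1,          # deletion from pattern
                      col[i - 1] + 1,      # insertion into pattern
                      prev + (0 if pattern[i - 1] == c else 1))  # match/sub
            prev, col[i] = col[i], cur
        if col[m] <= k:
            hits.append(pos)
    return hits
```

The O(kn) algorithm above exploits the diagonalwise monotonicity of exactly this table to avoid computing most of each column.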