Similar Documents (20 results)
1.
During the past few years, several approaches have been proposed for deriving string kernels from probability distributions. For instance, the Fisher kernel uses a generative model M (e.g. a hidden Markov model) and compares two strings according to how they are generated by M. On the other hand, marginalized kernels allow the computation of the joint similarity between two instances by summing conditional probabilities. In this paper, we adapt this approach to edit-distance-based conditional distributions and present a way to learn a new string edit kernel. We show that the practical computation of such a kernel between two strings x and x′ built from an alphabet Σ requires (i) learning edit probabilities in the form of the parameters of a stochastic state machine and (ii) calculating an infinite sum over Σ* by resorting to the intersection of probabilistic automata, as done for rational kernels. We show on a handwritten character recognition task that our new kernel outperforms not only state-of-the-art string kernels and string edit kernels but also the standard edit distance used by a neighborhood-based classifier.
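A naive illustration only: the paper's kernel is learned from edit probabilities and involves an infinite sum over Σ*, which is not reproduced here. The sketch below merely shows the much cruder, commonly used trick of turning an edit distance d into a similarity exp(-γ·d), which, unlike the paper's kernel, is not guaranteed to be positive semi-definite.

```python
# Not the learned string edit kernel of the paper: a crude edit-distance-based
# similarity, shown only to make the "kernel from an edit distance" idea concrete.
import math
from functools import lru_cache

def edit_kernel(x: str, y: str, gamma: float = 1.0) -> float:
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:           # unit-cost edit distance of x[:i] and y[:j]
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,                          # delete x[i-1]
                   d(i, j - 1) + 1,                          # insert y[j-1]
                   d(i - 1, j - 1) + (x[i - 1] != y[j - 1])) # substitute or match
    return math.exp(-gamma * d(len(x), len(y)))

print(edit_kernel("abc", "abd"))            # exp(-1) ≈ 0.3679
```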

2.
We consider a relationship between the unit-cost edit distance for two rooted ordered trees and the unit-cost edit distance for the corresponding Euler strings. We show that the edit distance between the trees is at least half of the edit distance between the Euler strings and is at most 2h+1 times the edit distance between the Euler strings, where h is the minimum height of the two trees. The result can be extended to more general cost functions.
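For readers unfamiliar with the construction, here is a minimal sketch of the Euler string of a rooted, ordered tree. It uses node labels and a simple nested-tuple representation, both of which are illustrative choices rather than the paper's exact conventions: each label is emitted once on the way down and once more, as a marked closing copy, on the way back up.

```python
# A minimal sketch of building the Euler string of a rooted, ordered, labeled tree.
# Tree representation (illustrative): node = (label, [children]).

def euler_string(node):
    """Return the Euler string of the tree rooted at node, as a list of symbols."""
    label, children = node
    out = [label]                      # entering the node
    for child in children:
        out.extend(euler_string(child))
    out.append(label + "'")            # leaving the node: marked closing copy
    return out

# Example: root 'a' with children 'b' and 'c', where 'b' has a single child 'd'.
tree = ("a", [("b", [("d", [])]), ("c", [])])
print(euler_string(tree))   # ['a', 'b', 'd', "d'", "b'", 'c', "c'", "a'"]
```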

3.
4.
The edit distance problem is a classical and fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N^2) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ family, run-length encoding, byte-pair encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions. Our new result, while providing a speed-up for compressible strings, does not surpass the quadratic time bound even in the worst-case scenario.
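For reference, a minimal sketch of the standard quadratic dynamic program mentioned above (plain unit-cost edit distance over uncompressed strings, not one of the compressed-string algorithms):

```python
# Standard O(N^2) dynamic program for the unit-cost edit distance.

def edit_distance(x: str, y: str) -> int:
    n, m = len(x), len(y)
    prev = list(range(m + 1))                  # row i = 0: j insertions into the empty prefix
    for i in range(1, n + 1):
        curr = [i] + [0] * m                   # column j = 0: i deletions from x[:i]
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete x[i-1]
                          curr[j - 1] + 1,     # insert y[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[m]

assert edit_distance("kitten", "sitting") == 3
```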

5.
We present an algorithm to approximate the edit distance between two ordered and rooted trees of bounded degree. In this algorithm, each input tree is transformed into a string by computing its Euler string, where the labels of some edges in the input trees are modified so that the structures of small subtrees are reflected in the labels. We show that the edit distance between the trees is at least 1/6 of, and at most O(n^(3/4)) times, the edit distance between the transformed strings, where n is the maximum size of the two input trees and unit-cost edit operations are assumed for both trees and strings. The algorithm works in O(n^2) time, since the computation of the edit distance and the reconstruction of the tree mapping from the string alignment take O(n^2) time, although the transformation itself can be done in O(n) time.

6.
An algorithm for the computation of the edit distance of run-length coded strings is given. In run-length coding, not all individual symbols in a string are explicitly listed. Instead, one run of identical consecutive symbols is coded by giving one representative symbol together with its multiplicity. The algorithm determines the minimum-cost sequence of edit operations transforming one string into another. In the worst case, the algorithm has a time complexity of O(n·m), where n and m give the lengths of the strings to be compared. In the best case, the time complexity is O(k·l), where k and l are the numbers of runs of identical symbols in the two strings under comparison.
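For concreteness, a minimal sketch of the run-length coding assumed here (and in several of the entries below), using Python's standard itertools.groupby; this is an illustrative implementation, not the paper's:

```python
# Run-length coding: consecutive identical symbols become one (symbol, multiplicity) pair.
from itertools import groupby

def run_length_encode(s: str):
    return [(symbol, sum(1 for _ in run)) for symbol, run in groupby(s)]

def run_length_decode(runs):
    return "".join(symbol * count for symbol, count in runs)

runs = run_length_encode("aaabbbbcaa")
print(runs)                              # [('a', 3), ('b', 4), ('c', 1), ('a', 2)]
assert run_length_decode(runs) == "aaabbbbcaa"
```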

7.
A recent trend in stringology explores the possibility of utilizing text compression to speed up similarity computation between strings. In this line of investigation, run-length encoding is one of the earliest studied compression schemes. Despite its simple coding nature, the only positive result before this work is the computation of the in-del distance (the dual of the longest common subsequence), which requires O(mn log mn) time, where m and n denote the numbers of runs of the input strings. The worst-case time complexity of computing the edit distance between two run-length encoded strings still depends on the uncompressed string lengths. In this paper, we break the foundational gap by providing its first “fully compressed” algorithm whose running time depends solely on the compressed string lengths. Specifically, given two strings compressed into m and n runs, m ≤ n, we present an O(mn^2)-time algorithm for computing the edit distance of the strings. Our approach also yields the first fully compressed solution to approximate matching of a pattern of m runs in a text of n runs in O(mn^2) time.

8.
We consider the problem of privacy enforcement for dynamic systems using the technique of obfuscation. Our approach captures the trade-off between privacy and utility in a formal reactive framework. Specifically, we model a dynamic system as an automaton or labeled transition system with predefined secret behaviors. The system generates event strings for some useful computation (utility). At the same time, it must hide its secret behaviors from any outside observer of its behavior (privacy). We formally capture both privacy and utility specifications within the model of the system. As an obfuscation mechanism for privacy enforcement, we propose the use of edit functions that suitably alter the output behavior of the system by inserting, deleting, or replacing events in its output strings. The edit function must hide secret behaviors by making them observationally equivalent to non-secret behaviors, while at the same time satisfying the utility requirement on the output strings. We develop algorithmic procedures that synthesize a correct-by-construction edit function satisfying both the privacy and utility specifications. The synthesis procedure is based on the solution of a game in which the edit function must react to the system's moves by suitable output editing. After presenting an explicit algorithm for computing the winning strategies of the game, we present two complementary symbolic implementations to address the scalability of our methodology. The first symbolic implementation uses a direct encoding of the explicit algorithm with binary decision diagrams (BDDs). The second reframes the synthesis of edit functions as a supervisory control problem and then applies a recently developed tool for solving supervisory control problems using BDDs. Experimental results comparing the two symbolic implementations are provided.

9.
We propose a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string. For an uncompressed string of length n and a compressed string with M runs, the algorithm computes their edit distance in time O(Mn). This result directly implies an O(min{mN,Mn}) time algorithm for strings of lengths m and n with M and N runs, respectively. It improves the previous best known time bound O(mN+Mn).

10.
Pattern Recognition Letters, 2002, 23(1-3): 203-213
An algorithm to compute the mean shape, when the shape is represented by a string, is presented as a modification of the well-known string edit algorithm. Given N strings of symbols, a string edit sequence defines a mapping between their corresponding symbols. We transform these sets of mapped symbols (edges) into piecewise linear functions and compute their mean. To transform them into functions, we use the equation of the line defining their edges and the percentage of their length, in order to obtain a common parameterization. The algorithm has been experimentally tested on the computation of a representative among a class of shapes in a clustering procedure, in the domain of a graphics recognition application.

11.
The edit distance between two given strings X and Y is the minimum number of edit operations that transform X into Y without performing multiple operations involving the same position. Ordinarily, string editing is based on character insert, delete, and substitute operations. Motivated by the facts that substring reversals are observed in genomic sequences, and that it is not always possible to transform a given sequence X into a given sequence Y by reversals alone (e.g., X is all 0's and Y is all 1's), Muthukrishnan and Sahinalp [S. Muthukrishnan, S.C. Sahinalp, Approximate nearest neighbors and sequence comparison with block operations, in: Proc. ACM Symposium on Theory of Computing (STOC), 2000, pp. 416-424; S. Muthukrishnan, S.C. Sahinalp, An improved algorithm for sequence comparison with block reversals, Theoretical Computer Science 321 (1) (2004) 95-101] considered a “simple” well-defined edit distance model in which the edit operations are: replace a character, and reverse and replace a substring. A substring of X can only be reversed if the reversal results in a match at the same position in Y. The cost of each character replacement and substring reversal is 1. The distance in this case is defined only when |X|=|Y|=n. There is an algorithm for computing the distance in this model with worst-case time complexity O(n log^2 n) [S. Muthukrishnan, S.C. Sahinalp, An improved algorithm for sequence comparison with block reversals, Theoretical Computer Science 321 (1) (2004) 95-101]. We present a dynamic programming algorithm with worst-case time complexity O(n^2) but expected running time O(n). In our dynamic programming solution, the weights of edit operations can vary for different substitutions, and the costs of reversals can be a function of the reversal length.

12.
Let X and Y be two run-length encoded strings, of encoded lengths k and l, respectively. We present a simple O(|X|l+|Y|k) time algorithm that computes their edit distance.

13.
Windowed pq-grams for approximate joins of data-centric XML
In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly differently in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution is windowed pq-grams, which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that appear in only one of the two tree decompositions. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real-world data confirm the analytic results and show the effectiveness and efficiency of our technique.
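A minimal sketch of the final distance computation only, assuming the windowed pq-gram profiles of the two trees have already been extracted as bags of label tuples; the sorting, dummy-node extension, and decomposition steps are not shown, and the example profiles are hypothetical:

```python
# Distance between two pq-gram profiles: the number of pq-grams that occur in
# only one of the two tree decompositions (multiset symmetric difference).
from collections import Counter

def pq_gram_distance(profile1, profile2):
    bag1, bag2 = Counter(profile1), Counter(profile2)
    common = bag1 & bag2                      # multiset intersection (minimum counts)
    return sum(bag1.values()) + sum(bag2.values()) - 2 * sum(common.values())

# Hypothetical profiles for two small trees (tuples of node labels, '*' = dummy):
p1 = [("a", "b", "*"), ("a", "c", "*"), ("a", "*", "b")]
p2 = [("a", "b", "*"), ("a", "d", "*"), ("a", "*", "b")]
print(pq_gram_distance(p1, p2))               # 2
```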

14.
The edit distance between two strings is the minimum number of basic operations required to transform one string into the other. Edit distance is widely used in approximate string matching, string similarity joins, and related areas. The dynamic programming approach computes the edit distance of two strings via an edit-distance matrix and must evaluate every element of the matrix, so its time efficiency is low. Improved methods change the order in which matrix elements are computed and reduce the number of elements that must be examined, but they still examine more than half of the elements, so there is still room for improvement. This paper proposes a sequential-verification method for edit distance based on sequences of basic operations. First, the enumerability of basic-operation sequences is analyzed and a method for enumerating them is given. Then the basic-operation sequences are verified in order of increasing number of operations until one sequence passes verification, which yields the edit distance. Experiments on approximate string search with a threshold of 2 show that the proposed method is more efficient than dynamic-programming-based methods.
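The operation-sequence enumeration described above is not reproduced here. For context on the small-threshold setting it targets, the following is a minimal sketch of the classical banded dynamic program (a standard technique, not this paper's method) that decides whether the edit distance of two strings is at most a small threshold tau:

```python
# Banded DP: only cells with |i - j| <= tau are computed; values above tau are
# capped at tau + 1, which preserves the "distance <= tau?" answer.

def within_threshold(x: str, y: str, tau: int) -> bool:
    n, m = len(x), len(y)
    if abs(n - m) > tau:                      # length difference alone exceeds tau
        return False
    INF = tau + 1
    prev = [j if j <= tau else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        lo, hi = max(1, i - tau), min(m, i + tau)
        curr = [INF] * (m + 1)
        curr[0] = i if i <= tau else INF
        for j in range(lo, hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # delete x[i-1]
                          curr[j - 1] + 1,    # insert y[j-1]
                          prev[j - 1] + cost) # substitute or match
        if min(curr[lo:hi + 1]) > tau:        # the whole band exceeds tau: give up early
            return False
        prev = curr
    return prev[m] <= tau

assert not within_threshold("kitten", "sitting", 2)
assert within_threshold("kitten", "sitting", 3)
```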

15.
A new algorithm for string edit distance computation is given. The algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted, in an off-line phase, into a deterministic finite-state automaton. Given an input string and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time that is only linear in the length of the input string and is independent of the length of the dictionary word. Given not only one but N different dictionary words, their corresponding automata can be combined into a single deterministic finite-state automaton. Thus the computation of the edit distance between the input word and each dictionary entry, and the determination of the nearest neighbor in the dictionary, need time that is only linear in the length of the input string. However, the number of states of the automaton is exponential.
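The automaton construction itself is not reproduced here. As a point of comparison, the following is a minimal sketch of a common alternative in the same spirit (precompute a structure over the dictionary, then answer each query in one pass): store the dictionary in a trie and run the edit-distance DP row by row while walking the trie. The dictionary and query below are illustrative.

```python
# Trie-based nearest-neighbor search under edit distance (a common alternative,
# not the paper's deterministic-automaton construction).

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w                       # end-of-word marker, remembers the word
    return root

def nearest(trie, query):
    """Return (best_distance, best_word) over the dictionary stored in the trie."""
    best = [float("inf"), None]
    first_row = list(range(len(query) + 1))   # distance of "" to each query prefix

    def walk(node, row):
        if "$" in node and row[-1] < best[0]:
            best[0], best[1] = row[-1], node["$"]
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + cost))
            if min(new_row) <= best[0]:     # prune branches that cannot do better
                walk(child, new_row)

    walk(trie, first_row)
    return best[0], best[1]

trie = build_trie(["kitten", "mitten", "sitting"])
print(nearest(trie, "fitting"))             # (1, 'sitting')
```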

16.
Recognition of noisy subsequences using constrained edit distances
Let X* be any unknown word from a finite dictionary H. Let U be any arbitrary subsequence of X*. We consider the problem of estimating X* by processing Y, which is a noisy version of U. We do this by defining the constrained edit distance between X ∈ H and Y subject to any arbitrary edit constraint involving the number and type of edit operations to be performed. An algorithm to compute this constrained edit distance is presented. Although in general the algorithm has cubic time complexity, within the framework of our solution it possesses quadratic time complexity. Recognition using the constrained edit distance as a criterion demonstrates remarkable accuracy. Experimental results which involve strings of lengths between 40 and 80 and which contain an average of 26.547 errors per string demonstrate that the scheme has about 99.5 percent accuracy.

17.
Mäkinen, Ukkonen, Navarro. Algorithmica, 2008, 35(4): 347-369
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of lengths m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n + n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements to a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected-case complexity. Experimental results are provided to support the conjecture.

18.
A classical measure of similarity between strings is the length of the longest common subsequence (LCS) of the two given strings. The search for efficient algorithms for finding the LCS has been going on for more than three decades. To date, all known algorithms may take quadratic time (shaved by logarithmic factors) to find a large LCS. In this paper, the problem of approximating the LCS is studied, focusing on the hard inputs for this problem, namely, approximating an LCS of near-linear size in strings over a relatively large alphabet (of size at least n^ε for some constant ε>0, where n is the length of the string). We show that any given string over a relatively large alphabet can be embedded into a locally non-repetitive string. This embedding has a negligible additive distortion for strings that are not too dissimilar in terms of the edit distance. We also show that the LCS can be efficiently approximated in locally non-repetitive strings. Our new method (the embedding together with the approximation algorithm) gives a strictly sub-quadratic time algorithm (i.e., of complexity O(n^(2-ε)) for some constant ε) which can find common subsequences of linear (and near-linear) size that cannot be detected efficiently by the existing tools.
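For reference, a minimal sketch of the classical quadratic LCS dynamic program that this line of work tries to beat (the embedding and the sub-quadratic approximation themselves are not shown):

```python
# Standard O(nm) dynamic program for the length of the longest common subsequence.

def lcs_length(x: str, y: str) -> int:
    n, m = len(x), len(y)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1] + 1          # extend a common subsequence
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[m]

assert lcs_length("ABCBDAB", "BDCABA") == 4        # e.g. "BCAB"
```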

19.
G. Tremblay, F. Champagne. Software, 2007, 37(2): 207-230
Musical dictations for ear training and training in music writing form a key practice of basic musical training. Marking students' dictation exercises for large groups of students can require a lot of work. In this paper, we present a tool, called CADiM, that can help automate the marking of such musical dictations. The edit distance, which measures the similarity between any two strings, has been used in various areas such as string/text analysis, protein/genome matching in bio-computing, and musical applications, for example music retrieval or musicological analysis. CADiM's marking algorithm is based on an earlier edit distance proposed for musical sequences, but adapted to reflect the marking heuristic used by a domain expert's specific approach to musical training. Computing an edit distance on musical scores requires an appropriate representation; more precisely, given our specific context, a symbolic representation is required. We use MusicXML, an XML application for standard Western music notation. Given a Document Type Definition for MusicXML, existing Java tools can generate a MusicXML parser. Such a parser, given appropriate input files, generates an intermediate form (a DOM object) on which analyses and transformations are performed in order to compute the edit distance. In turn, the edit distance is used to give a mark as well as to identify the key errors. CADiM has been applied to a number of test cases and the results compared with those obtained by a domain expert. Overall, the results are promising: there is only a 3% difference between the domain expert's marks and those produced by CADiM.
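A minimal illustrative sketch, not the CADiM tool: it extracts note sequences from MusicXML with Python's standard ElementTree and compares them with a plain unit-cost edit distance, whereas CADiM uses Java-generated parsers and a weighted, heuristic-adapted distance. The element names follow standard MusicXML (note/pitch/step/octave/duration); the file names in the usage comment are hypothetical.

```python
# Read note sequences from MusicXML and compare them with a unit-cost edit distance.
import xml.etree.ElementTree as ET

def note_sequence(path):
    """Return a list of (step, octave, duration) triples from a MusicXML file."""
    root = ET.parse(path).getroot()
    notes = []
    for note in root.iter("note"):
        pitch = note.find("pitch")
        if pitch is None:                # rests have no <pitch> element; skip them
            continue
        notes.append((pitch.findtext("step"),
                      pitch.findtext("octave"),
                      note.findtext("duration")))
    return notes

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

# Hypothetical usage: count the errors in a student's dictation against the reference.
# errors = edit_distance(note_sequence("reference.xml"), note_sequence("student.xml"))
```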

20.
Hierarchical data are often modelled as trees. An interesting query identifies pairs of similar trees. The standard approach to tree similarity is the tree edit distance, which has successfully been applied in a wide range of applications. In terms of runtime, the state-of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast independent of the tree shape. Unfortunately, this algorithm requires up to twice the memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for tree edit distance computation. In this paper we present a new, memory-efficient algorithm for the tree edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without sacrificing memory efficiency. This is achieved by releasing memory early during the first step of the algorithm, which computes a decomposition strategy for the actual distance computation. We show the correctness of our approach and prove an upper bound on the memory usage. The strategy computed by AP-TED is optimal in the class of all-path strategies, which subsumes the class of LRH strategies used in RTED. We further present the AP-TED+ algorithm, which requires less computational effort for very small subtrees and improves the runtime of the distance computation. Our experimental evaluation confirms the low memory requirements and the runtime efficiency of our approach.
