首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 218 毫秒
1.
We define the unrestricted modified edit distance based on the modified edit distance defined by Galil and Giancarlo (1989) where the cost of substring deletions and insertions are contextsensitive and the cost of character substitutions are context-free. The modified edit distance is the minimum cost of converting a string X to a string Y where the sequence of edit operations has the property that all substring deletions precede all character substitutions and all character substitutions precede all substring insertions. Note that the modified edit distance does not satisfy the triangle inequality. We show that the problem of finding the unrestricted modified edit distance which is the minimum cost over all edit sequences (without these constraints) of converting X to Y is undecidable.  相似文献   

2.
We give two parallel algorithms for sequence comparison on the Connection Machine 2 (CM-2). The specific comparison measure we compute is the edit distance: given a finite alphabet ∑ and two input sequences X ϵ ∑+ and Y ϵ ∑+ the edit distance d(X,Y) is the minimum cost of transforming X into Y via a series of weighted insertions, deletions and substitutions of characters. The edit distance comparison measure is equivalent to or subsumes a broad range of well known sequence comparison measures. The CM-2 is very fast at performing parallel prefix operations. Our contribution consists of casting the problem in terms of these operations. Our first algorithm computes d(X,Y) using N processors and O(M S) time units, where M = min(|X|,||Y|) + 1, N = max(|X|,|Y|) + 1 and S is the time required for a parallel prefix operation. The second algorithm computes d(X,Y) using NM processors and O((log N log M)(S + R)) time units, where R is the time for a ‘router’ communication step—one in which each processor is able to read data, in parallel, from the memory of any other processor. Our algorithms can also be applied to several variants of the problem, such as subsequence comparisons, and one—many and many-many comparisons on 'sequence databases'.  相似文献   

3.
Let X and Y be two run-length encoded strings, of encoded lengths k and l, respectively. We present a simple O(|X|l+|Y|k) time algorithm that computes their edit distance.  相似文献   

4.
Theapproximate string matching problem is, given a text string, a pattern string, and an integerk, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at mostk from the pattern. We give a newO(kn) algorithm for this problem, wheren is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.  相似文献   

5.
In this paper, we consider a generalized longest common subsequence problem, the string-excluding constrained LCS problem. For the two input sequences X and Y of lengths n and m, and a constraint string P of length r, the problem is to find a common subsequence Z of X and Y excluding P as a substring and the length of Z is maximized. The problem and its solution were first proposed by Chen and Chao (2011) [1], but we found that their algorithm cannot solve the problem correctly. A new dynamic programming solution for the STR-EC-LCS problem is then presented in this paper. The correctness of the new algorithm is proved. The time complexity of the new algorithm is O(nmr).  相似文献   

6.
An algorithm for the computation of the edit distance of run-length coded strings is given. In run-length coding, not all individual symbols in a string are explicitly listed. Instead, one run of identical consecutive symbols is coded by giving one representative symbol together with its multiplicity. The algorithm determines the minimum cost sequence of edit operations transforming one string into another. In the worst case, the algorithm has a time complexity ofO(n·m), wheren andm give the lengths of the strings to be compared. In the best case, the time complexity isO(k·l), wherek andl are the numbers of runs of identical symbols in the two strings under comparison.  相似文献   

7.
Given two strings X=a1an and P=b1bm over an alphabet Σ, the problem of testing whether P occurs as a subsequence of X is trivially solved in linear time. It is also known that a simple O(n log |Σ|) time preprocessing of X makes it easy to decide subsequently, for any P and in at most |P| log |Σ| character comparisons, whether P is a subsequence of X. These problems become more complicated if one asks instead whether P occurs as a subsequence of some substring Y of X of bounded length. This paper presents an automaton built on the textstring X and capable of identifying all distinct minimal substrings Y of X having P as a subsequence. By a substring Y being minimal with respect to P, it is meant that P is not a subsequence of any proper substring of Y. For every minimal substring Y, the automaton recognizes the occurrence of P having the lexicographically smallest sequence of symbol positions in Y. It is not difficult to realize such an automaton in time and space O(n2) for a text of n characters. One result of this paper consists of bringing those bounds down to linear or O(n log n), respectively, depending on whether the alphabet is bounded or of arbitrary size, thereby matching the corresponding complexities of automata constructions for offline exact string searching. Having built the automaton, the search for all lexicographically earliest occurrences of P in X is carried out in time O(∑i=1mrocci·i) or O(n+∑i=1mrocci·i· log n), depending on whether the alphabet is fixed or arbitrary, where rocci is the number of distinct minimal substrings of X having b1bi as a subsequence (note that each such substring may occur many times in X but is counted only once in the bound). All log factors appearing in the above bounds can be further reduced to log log by resorting to known integer-handling data structures.  相似文献   

8.
A recent trend in stringology explores the possibility of utilizing text compression to speed up similarity computation between strings. In this line of investigation, run-length encoding is one of the earliest studied compression schemes. Despite its simple coding nature, the only positive result before this work is the computation of the in-del distance (dual of longest common subsequence), which requires O(mnlogmn) time, where m and n denote the number of runs of the input strings. The worst-case time complexity of computing the edit distance between two run-length encoded strings still depends on the uncompressed string lengths. In this paper, we break the foundational gap by providing its first “fully compressed” algorithm whose running time depends solely on the compressed string lengths. Specifically, given two strings, compressed into m and n runs, mn, we present an O(mn 2)-time algorithm for computing the edit distance of the strings. Our approach also yields the first fully compressed solution to approximate matching of a pattern of m runs in a text of n runs in O(mn 2) time.  相似文献   

9.
The “Common Substring Alignment” problem is defined as follows. The input consists of a set of strings S1,S2…,Sc, with a common substring appearing at least once in each of them, and a target string T. The goal is to compute similarity of all strings Si with T, without computing the part of the common substring over and over again.In this paper we consider the Common Substring Alignment problem for the LCS (Longest Common Subsequence) similarity metric. Our algorithm gains its efficiency by exploiting the sparsity inherent to the LCS problem. Let Y be the common substring, n be the size of the compared sequences, Ly be the length of the LCS of T and Y, denoted |LCS[T,Y]|, and L be max{|LCS[T,Si]|}. Our algorithm consists of an O(nLy) time encoding stage that is executed once per common substring, and an O(L) time alignment stage that is executed once for each appearance of the common substring in each source string. The additional running time depends only on the length of the parts of the strings that are not in any common substring.  相似文献   

10.
We present an algorithm to approximate edit distance between two ordered and rooted trees of bounded degree. In this algorithm, each input tree is transformed into a string by computing the Euler string, where labels of some edges in the input trees are modified so that structures of small subtrees are reflected to the labels. We show that the edit distance between trees is at least 1/6 and at most O(n 3/4) of the edit distance between the transformed strings, where n is the maximum size of two input trees and we assume unit cost edit operations for both trees and strings. The algorithm works in O(n 2) time since computation of edit distance and reconstruction of tree mapping from string alignment takes O(n 2) time though transformation itself can be done in O(n) time.  相似文献   

11.
There is no known algorithm that solves the general case of the approximate edit distance problem, where the edit operations are insertion, deletion, mismatch, and swap, in time o(nm), where n is the length of the text and m is the length of the pattern.In the effort to study this problem, the edit operations have been analyzed independently. Karloff [10] showed an algorithm that approximates the edit distance problem with only the mismatch operation in time . Amir et al. [4] showed that if the only edit operations allowed are swap and mismatch, then the exact edit distance problem can be solved in time .In this paper, we discuss the problem of approximate edit distance with swap and mismatch. We show a randomized time algorithm for the problem. The algorithm guarantees an approximation factor of (1+?) with probability of at least .  相似文献   

12.
We describe a new algorithm for the problem of perfect sorting a signed permutation by reversals. The worst-case time complexity of this algorithm is parameterized by the maximum prime degree d of the strong interval tree, i.e., f(d).nO(1). This improves the best known algorithm which complexity was based on a parameter always larger than or equal to d.  相似文献   

13.
Due to a large number of applications, bicliques of graphs have been widely considered in the literature. This paper focuses on non-induced bicliques. Given a graph G=(V,E) on n vertices, a pair (X,Y), with X,YV, XY=∅, is a non-induced biclique if {x,y}∈E for any xX and yY. The NP-complete problem of finding a non-induced (k1,k2)-biclique asks to decide whether G contains a non-induced biclique (X,Y) such that |X|=k1 and |Y|=k2. In this paper, we design a polynomial-space O(n1.6914)-time algorithm for this problem. It is based on an algorithm for bipartite graphs that runs in time O(n1.30052). In deriving this algorithm, we also exhibit a relation to the spare allocation problem known from memory chip fabrication. As a byproduct, we show that the constraint bipartite vertex cover problem can be solved in time O(n1.30052).  相似文献   

14.
This paper deals with the problem of estimating a transmitted string X * by processing the corresponding string Y, which is a noisy version of X *. We assume that Y contains substitution, insertion, and deletion errors, and that X * is an element of a finite (but possibly, large) dictionary, H. The best estimate X + of X *, is defined as that element of H which minimizes the generalized Levenshtein distance D(X, Y) between X and Y such that the total number of errors is not more than K, for all XH. The trie is a data structure that offers search costs that are independent of the document size. Tries also combine prefixes together, and so by using tries in approximate string matching we can utilize the information obtained in the process of evaluating any one D(X i , Y), to compute any other D(X j , Y), where X i and X j share a common prefix. In the artificial intelligence (AI) domain, branch and bound (BB) schemes are used when we want to prune paths that have costs above a certain threshold. These techniques have been applied to prune, for example, game trees. In this paper, we present a new BB pruning strategy that can be applied to dictionary-based approximate string matching when the dictionary is stored as a trie. The new strategy attempts to look ahead at each node, c, before moving further, by merely evaluating a certain local criterion at c. The search algorithm according to this pruning strategy will not traverse inside the subtrie(c) unless there is a “hope” of determining a suitable string in it. In other words, as opposed to the reported trie-based methods (Kashyap and Oommen in Inf Sci 23(2):123–142, 1981; Shang and Merrettal in IEEE Trans Knowledge Data Eng 8(4):540–547, 1996), the pruning is done a priori before even embarking on the edit distance computations. The new strategy depends highly on the variance of the lengths of the strings in H. It combines the advantages of partitioning the dictionary according to the string lengths, and the advantages gleaned by representing H using the trie data structure. The results demonstrate a marked improvement (up to 30% when costs are of a 0/1 form, and up to 47% when costs are general) with respect to the number of operations needed on three benchmark dictionaries.  相似文献   

15.
We propose a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string. For an uncompressed string of length n and a compressed string with M runs, the algorithm computes their edit distance in time O(Mn). This result directly implies an O(min{mN,Mn}) time algorithm for strings of lengths m and n with M and N runs, respectively. It improves the previous best known time bound O(mN+Mn).  相似文献   

16.
The guided tree edit distance problem is to find a minimum cost series of edit operations that transforms two input forests F and G into isomorphic forests F and G such that a third input forest H is included in F (and G). The edit operations are relabeling a vertex and deleting a vertex. We show efficient algorithms for this problem that are faster than the previous algorithm for this problem of Peng and Ting [Z. Peng, H. Ting, Guided forest edit distance: Better structure comparisons by using domain-knowledge, in: Proc. 18th Symposium on Combinatorial Pattern Matching (CPM), 2007, pp. 28-39].  相似文献   

17.
Some upper and lower bounds are obtained for the maximum of the absolute value of the difference between the mutual information |I(X; Y) ? I(X′; Y′)| of two pairs of discrete random variables (X, Y) and (X′, Y′) via the variational distance between the probability distributions of these pairs. In particular, the upper bound obtained here substantially generalizes and improves the upper bound of [1]. In some special cases, our upper and lower bounds coincide or are rather close. It is also proved that the lower bound is asymptotically tight in the case where the variational distance between (X, Y) and (XY′) tends to zero.  相似文献   

18.
This paper deals with the problem of estimating a transmitted string Xs from the corresponding received string Y, which is a noisy version of Xs. We assume that Y contains any number of substitution, insertion, and deletion errors, and that no two consecutive symbols of Xs were deleted in transmission. We have shown that for channels which cause independent errors, and whose error probabilities exceed those of noisy strings studied in the literature [12], at least 99.5% of the erroneous strings will not contain two consecutive deletion errors. The best estimate X+ of Xs is defined as that element of H which minimizes the generalized Levenshtein distance D(XY) between X and Y. Using dynamic programming principles, an algorithm is presented which yields X+ without computing individually the distances between every word of H and Y. Though this algorithm requires more memory, it can be shown that it is, in general, computationally less complex than all other existing algorithms which perform the same task.  相似文献   

19.
Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and areversal takes an arbitrary substring of elements, and reverses their order.For this problem, we develop two algorithms: a greedy approximation algorithm, that finds a solution provably close to optimal inO(n 2) time and0(n) space forn-element permutations, and a branch- and-bound exact algorithm, that finds an optimal solution in0(mL(n, n)) time and0(n 2) space, wherem is the size of the branch- and-bound search tree, andL(n, n) is the time to solve a linear program ofn variables andn constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch- and-bound algorithm are a novel application of maximum-weight matchings, shortest paths, and linear programming.In a series of experiments, we study the performance of an implementation on random permutations, and permutations generated by random reversals. For permutations differing byk random reversals, we find that the average upper bound on reversal distance estimatesk to within one reversal fork<1/2n andn<100. For the difficult case of random permutations, we find that the average difference between the upper and lower bounds is less than three reversals forn<50. Due to the tightness of these bounds, we can solve, to optimality, problems on 30 elements in a few minutes of computer time. This approaches the scale of mitochondrial genomes.This research was supported by a postdoctoral fellowship from the Program in Mathematics and Molecular Biology of the University of California at Berkeley under National Science Foundation Grant DMS-8720208, and by a fellowship from the Centre de recherches mathématiques of the Université de Montréal.This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada, and the Fonds pour la formation de chercheurs et l'aide à la recherche (Québec). The author is a fellow of the Canadian Institute for Advanced Research.  相似文献   

20.
Given an unlabeled, unweighted, and undirected graph with n vertices and small (but not necessarily constant) treewidth k, we consider the problem of preprocessing the graph to build space-efficient encodings (oracles) to perform various queries efficiently. We assume the word RAM model where the size of a word is Ω(logn) bits. The first oracle, we present, is the navigation oracle which facilitates primitive navigation operations of adjacency, neighborhood, and degree queries. By way of an enumeration argument, which is of interest in its own right, we show the space requirement of the oracle is optimal to within lower order terms for all graphs with n vertices and treewidth k. The oracle supports the mentioned queries all in constant worst-case time. The second oracle, we present, is an exact distance oracle which facilitates distance queries between any pair of vertices (i.e., an all-pairs shortest-path oracle). The space requirement of the oracle is also optimal to within lower order terms. Moreover, the distance queries perform in O(k 3log3 k) time. Particularly, for the class of graphs of popular interest, graphs of bounded treewidth (where k is constant), the distances are reported in constant worst-case time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号