A recent trend in stringology explores the possibility of utilizing text compression to speed up similarity computation between strings. In this line of investigation, run-length encoding is one of the earliest studied compression schemes. Despite its simple coding nature, the only positive result before this work is the computation of the in-del distance (dual of longest common subsequence), which requires O(mn log mn) time, where m and n denote the number of runs of the input strings. The worst-case time complexity of computing the edit distance between two run-length encoded strings still depends on the uncompressed string lengths. In this paper, we break the foundational gap by providing its first “fully compressed” algorithm whose running time depends solely on the compressed string lengths. Specifically, given two strings, compressed into m and n runs, m ≤ n, we present an O(mn^2)-time algorithm for computing the edit distance of the strings. Our approach also yields the first fully compressed solution to approximate matching of a pattern of m runs in a text of n runs in O(mn^2) time.
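
For orientation, the two ingredients named in this abstract can be sketched as follows. This is only an illustrative baseline, not the fully compressed algorithm of the paper: `run_length_encode` and `edit_distance` are hypothetical helper names, and the dynamic program runs on the uncompressed strings, so its cost depends on their full lengths rather than on the number of runs.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def edit_distance(a, b):
    """Textbook dynamic program on uncompressed strings, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

A string such as "aaabbc" has six characters but only three runs, which is the quantity the bounds in the abstract are stated in.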

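The edit distance between two strings is the minimum number of basic operations needed to transform one string into the other. Edit distance is widely used in approximate string matching, string similarity joins, and related areas. The dynamic programming method computes the edit distance of two strings from an edit distance matrix and must compute every entry of the matrix, so its time efficiency is low. Improved methods change the order in which the matrix entries are computed and reduce the number of entries that must be compared, but they still compare more than half of the entries, leaving room for improvement. This paper proposes a sequential verification method for edit distance based on basic operation sequences. First, the enumerability of basic operation sequences is analyzed and a method for enumerating them is given. Then the basic operation sequences are verified in order of increasing number of basic operations until some sequence passes verification, which yields the edit distance. In approximate string search experiments with a threshold of 2, the proposed method is more efficient than dynamic-programming-based methods.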

Evolutionary trees describing the relationship for a set of species are central in evolutionary biology, and quantifying differences between evolutionary trees is therefore an important task. The quartet distance is a distance measure between trees previously proposed by Estabrook, McMorris, and Meacham. The quartet distance between two unrooted evolutionary trees is the number of quartet topology differences between the two trees, where a quartet topology is the topological subtree induced by four species. In this paper we present an algorithm for computing the quartet distance between two unrooted evolutionary trees of n species, where all internal nodes have degree three, in time O(n log n). The previous best algorithm for the problem uses time O(n^2).

Distance labeling schemes are composed of a marker algorithm for labeling the vertices of a graph with short labels, coupled with a decoder algorithm allowing one to compute the distance between any two vertices directly from their labels (without using any additional information). As applications for distance labeling schemes concern mainly large and dynamically changing networks, it is of interest to study distributed dynamic labeling schemes. The current paper considers the problem on dynamic trees, and proposes efficient distributed schemes for it. The paper first presents a labeling scheme for distances in the dynamic tree model, with amortized message complexity O(log^2 n) per operation, where n is the size of the tree at the time the operation takes place. The protocol maintains O(log^2 n) bit labels. This label size is known to be optimal even in the static scenario. A more general labeling scheme is then introduced for the dynamic tree model, based on extending an existing static tree labeling scheme to the dynamic setting. The approach fits a number of natural tree functions, such as distance, separation level, and flow. The main resulting scheme incurs an overhead of an O(log n) multiplicative factor in both the label size and amortized message complexity in the case of dynamically growing trees (with no vertex deletions). If an upper bound on n is known in advance, this method yields a different tradeoff, with an O(log^2 n / log log n) multiplicative overhead on the label size but only an O(log n / log log n) overhead on the amortized message complexity. In the fully dynamic model the scheme also incurs an increased additive overhead in amortized communication, of O(log^2 n) messages per operation.

We propose a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string. For an uncompressed string of length n and a compressed string with M runs, the algorithm computes their edit distance in time O(Mn). This result directly implies an O(min{mN,Mn}) time algorithm for strings of lengths m and n with M and N runs, respectively. It improves the previous best known time bound O(mN+Mn).

The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N^2) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n^{2/3} N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

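An index structure supporting block edit distance   Total citations: 1 (self-citations: 0, citations by others: 1)
In approximate string matching, the traditional edit distance does not measure the similarity of data such as personal names and addresses well, whereas the block edit distance measures the similarity of two strings well. Supporting block edit distance efficiently in approximate string query processing is therefore of great importance. Computing the block edit distance between two strings is NP-complete, so effective methods are needed that strengthen filtering power and reduce the false-positive rate. A novel index structure, SHV-Trie, is designed to support the move edit distance. By studying the operational characteristics of the move edit distance, letter occurrence frequencies are used as a lower bound for move edit distance operations, and a corresponding query filtering algorithm is proposed. In addition, to address the excessive space overhead of SHV-Trie, an index structure with an optimized letter ordering and a compressed index structure are proposed, together with their query filtering algorithms. Experimental results and analysis on real data sets show that the proposed index structures have good filtering power and improve query efficiency by reducing the false-positive rate.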

The problem of outputting all parse trees of a string accepted by a context-free grammar is considered. A systolic algorithm is presented that operates in O(m·n) time, where m is the number of distinct parse trees and n is the length of the input. The systolic array uses n^2 processors, each of which requires at most O(log n) bits of storage. This is much more space-efficient than a previously reported systolic algorithm for the same problem, which required O(n log n) space per processor. The algorithm also extends previous algorithms that only output a single parse tree of the input. Research supported in part by NSF Grants DCR-8420935 and DCR-8604603.

Ferraro  Godin 《Algorithmica》2003,36(1):1-39
In this paper we propose a dynamic programming algorithm to compare two quotiented trees using a constrained edit distance. A quotiented tree is a tree defined with an additional equivalence relation on vertices and such that the quotient graph is also a tree. The core of the method relies on an adaptation of an algorithm recently proposed by Zhang for comparing unordered rooted trees. This method is currently being used in plant architecture modelling to quantify different types of variability between plants represented by quotiented trees.

A classical measure of similarity between strings is the length of the longest common subsequence (LCS) between the two given strings. The search for efficient algorithms for finding the LCS has been going on for more than three decades. To date, all known algorithms may take quadratic time (shaved by logarithmic factors) to find large LCS. In this paper, the problem of approximating LCS is studied, while focusing on the hard inputs for this problem, namely, approximating LCS of near-linear size in strings over a relatively large alphabet (of size at least n^ε for some constant ε > 0, where n is the length of the string). We show that any given string over a relatively large alphabet can be embedded into a locally non-repetitive string. This embedding has a negligible additive distortion for strings that are not too dissimilar in terms of the edit distance. We also show that LCS can be efficiently approximated in locally non-repetitive strings. Our new method (the embedding together with the approximation algorithm) gives a strictly sub-quadratic time algorithm (i.e., of complexity O(n^{2-ε}) for some constant ε) which can find common subsequences of linear (and near linear) size that cannot be detected efficiently by the existing tools.
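
The quadratic baseline that this abstract sets out to beat can be sketched as below. This is the standard exact dynamic program for the LCS length, not the embedding or the approximation algorithm of the paper; the function name `lcs_length` is a hypothetical choice for illustration.

```python
def lcs_length(a, b):
    """Exact LCS length by the classical dynamic program, O(len(a) * len(b)) time."""
    # prev[j] holds the LCS length of the current prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character of a or of b
        prev = curr
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4.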

Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string of n symbols in O(log n) time with n processors. The algorithm requires Θ(n^2) space. However, the space needed can be reduced to O(n^{1+ε}) for any 0 < ε ≤ 1, with a corresponding slow-down proportional to 1/ε. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.

We introduce a novel definition of approximate palindromes in strings, and provide an algorithm to find all maximal approximate palindromes in a string with up to k errors. Our definition is based on the usual edit operations of approximate pattern matching, and the algorithm we give, for a string of size n on a fixed alphabet, runs in O(k^2 n) time. We also discuss two implementation-related improvements to the algorithm, and demonstrate their efficacy in practice by means of both experiments and an average-case analysis.

We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new code into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.

Squares are strings of the form ww where w is any nonempty string. Main and Lorentz proposed an O(n log n)-time algorithm for finding the positions of all squares in a string of length n. Based on their result, we show how to find the positions of all squares in a run-length encoded string in time O(N log N) where N is the number of runs in this string, provided that we do not explicitly compute all the “trivial squares” occurring within runs. The algorithm is optimal and its time complexity is independent of the length of the original uncompressed string.
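
The definition of a square can be made concrete with a brute-force sketch. This is a naive cubic-time enumeration written only to illustrate the definition; it is neither the Main and Lorentz algorithm nor the run-length based algorithm of the abstract, and `find_squares` is a hypothetical name.

```python
def find_squares(s):
    """Return (start, half_length) for every substring of s of the form ww."""
    n = len(s)
    squares = []
    for start in range(n):
        # A square starting at `start` has total length 2 * half <= n - start.
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                squares.append((start, half))
    return squares
```

On the single run "aaa" this reports the squares (0, 1) and (1, 1), both of which lie entirely inside one run; these are the kind of squares the abstract calls trivial and avoids listing explicitly.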

This paper presents an efficient scheme maintaining a separator decomposition representation in dynamic trees using asymptotically optimal labels. In order to maintain the short labels, the scheme uses relatively low message complexity. In particular, if the initial dynamic tree contains only the root, then the scheme incurs an O(log^4 n) amortized message complexity per topology change, where n is the current number of vertices in the tree. As a separator decomposition is a fundamental decomposition of trees used extensively as a component in many static graph algorithms, our dynamic scheme for separator decomposition may be used for constructing dynamic versions of these algorithms. The paper then shows how to use our dynamic separator decomposition to construct efficient labeling schemes on dynamic trees, using the same message complexity as our dynamic separator scheme. Specifically, we construct efficient routing schemes on dynamic trees, for both the designer and the adversary port models, which maintain optimal labels, up to a multiplicative factor of O(log log n). In addition, it is shown how to use our dynamic separator decomposition scheme to construct dynamic labeling schemes supporting the ancestry and NCA relations using asymptotically optimal labels, as well as to extend a known result on dynamic distance labeling schemes. Supported in part at the Technion by an Aly Kaufman fellowship. Supported in part by a grant from the Israel Science Foundation.

Ferraro  Godin 《Algorithmica》2003,36(1):1-39
In this paper we propose a dynamic programming algorithm to compare two quotiented trees using a constrained edit distance. A quotiented tree is a tree defined with an additional equivalence relation on vertices and such that the quotient graph is also a tree. The core of the method relies on an adaptation of an algorithm recently proposed by Zhang for comparing unordered rooted trees. This method is currently being used in plant architecture modelling to quantify different types of variability between plants represented by quotiented trees.

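The edit distance between directed labeled rooted trees (TED) is widely used to compute the structural similarity of documents. This paper proposes the concept of a semantic edit distance between directed labeled rooted trees (TSED) and gives a formula for computing it. TED and TSED are combined into a distance measure, which is applied to the structural clustering of XML documents. Experiments show that this distance model is clearly superior, in both precision and recall of structural clustering, to clustering that uses the TED algorithm alone. The time complexity of the algorithm also equals that of the best algorithm for computing TED by dynamic programming.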

A new algorithm for string edit distance computation is given. The algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted in an off-line phase into a deterministic finite state automaton. Given an input string and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time which is only linear in the length of the input string. It is independent of the length of the dictionary word. Given not only one but N different dictionary words, their corresponding automata can be combined into a single deterministic finite state automaton. Thus the computation of the edit distance between the input word and each dictionary entry, and the determination of the nearest neighbor in the dictionary need time that is only linear in the length of the input string. However, the number of states of the automaton is exponential.

The approximate string matching problem is, given a text string, a pattern string, and an integer k, to find in the text all approximate occurrences of the pattern. An approximate occurrence means a substring of the text with edit distance at most k from the pattern. We give a new O(kn) algorithm for this problem, where n is the length of the text. The algorithm is based on the suffix automaton with failure transitions and on the diagonalwise monotonicity of the edit distance table. Some experiments showing that the algorithm has a small overhead are reported.
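
The problem statement in this abstract can be illustrated with the classical column-by-column dynamic program, in which the first row of the edit distance table is set to zero so that an occurrence may start anywhere in the text. This sketch takes time proportional to the pattern length times the text length; it is not the O(kn) suffix-automaton algorithm of the paper, and `approximate_occurrences` is a hypothetical name.

```python
def approximate_occurrences(pattern, text, k):
    """Return 1-based end positions in text of substrings within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = least edit distance between pattern[:i] and a substring of the
    # text ending at the current position; before any text is read, col[i] = i.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]  # the empty pattern prefix matches at every text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            new.append(min(col[i] + 1,          # text character left unmatched
                           new[i - 1] + 1,      # pattern character left unmatched
                           col[i - 1] + cost))  # match or substitute
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only end positions are reported, and positions are counted from 1; the empty occurrence before the first text character is deliberately not reported.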
