首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 410 毫秒
1.
张毅超  车玫  马骏 《计算机仿真》2007,24(12):97-100,116
高效求解2个字符串的最长公共子串(Longest Common Substring)是实现很多字符串算法的关键.文中首先给出了求解LCP问题的动态规划算法,广义后缀树算法,研究并分析了这两种算法,得出动态规划算法易于理解,但时间复杂度较高;广义后缀树算法的时间复杂度较低,但实现较为复杂并且广义后缀树占用的空间也较多.最后提出了一个新算法,该算法使用2个字符串的广义后缀数组,在保持和广义后缀树时间复杂度相等的基础上,可以简单地实现并且占用较少的空间.  相似文献   

2.
一种适合于GPU计算的并行后缀数组构造算法   总被引:1,自引:0,他引:1  
后缀数组广泛应用于序列分析、字符串匹配和文本压缩,近年来,有关后缀数组构造和应用算法的不断探索构成了计算机科学中一个非常活跃的研究领域.在对现有串行算法进行了分析和对比之后,提出了一种新的、简洁的适合于GPU计算的并行后缀数组倍增构造算法,以排序方法替代传统的分组策略,不但能独立完成后缀数组的并行构造,还可与现存的串行倍增算法结合使用,以达到最高的执行效率.实验结果表明该算法在解决实际应用问题时,具有易于实现、执行速度快和可扩展性强等优点,尤其在处理小字符集的数据时效率更高.  相似文献   

3.
后缀树的重要性可以为多年来学术界对它总是有新的发现而印证.它的结构简单,但可以在线性的时间里解决许多复杂的问题,被大量的使用在字符串及树的模式匹配中,对于XML标准,有很多基于关系库和对象库的索引技术和查询方案被提出来,我们试图给出一种基于后缀树进行路径导航的查询机制:用后缀树构造XML路径字典加速路径查询评价速度,我们提出可以在线地建立一个trie树的后缀树,讨论了XML路径字典中的后缀树建树算法,阐述了整个索引方案和查询机制,并探讨了包括RPE在内的它所支持的各种查询操作,XML路径字典被用于加快路径查询的评价速度.  相似文献   

4.
为了解决移动数据形成的轨迹间用户相似性问题,提出了一种基于位置序列的广义后缀树(LSGST)用户相似性计算方法。该算法首先从移动数据中抽取位置序列,同时将位置序列映射为字符串,完成了对位置序列的处理到对字符串处理的转化工作;然后,构建不同用户间的位置序列广义后缀树;最后,分别从经过的相似地方个数、最长公共子序列、频繁公共位置序列三方面对相似性进行具体计算。理论分析和仿真表明,该算法提出的三个计算指标在计算相似性方面具有理想的效果;除此之外,与构造后缀树的普通方法相比,时间复杂度较低;与动态规划和朴素字符串匹配方法相比,该算法在寻找最长公共子串、频繁公共位置序列时,效率更高。实验结果表明LSGST能够有效测量相似性,同时减少了寻找测量指标时需要处理的轨迹数据量,并在时间复杂度方面明显优于对比算法。  相似文献   

5.
在传统的字符串处理算法中往往分别考虑字符串的频度和长度.然而,在实际应用中,将字符串的频度和长度结合考虑是有意义的.基于这点我们提出了频长积的概念,规定字符串的频度和长度的乘积为字符串的频长积.并基于广义后缀树和Ukkonen算法,提出了时间复杂度为O(N)的查找算法.效率实验证实了该算法的高效性.语义实验表明,本算法找出的最大频长积字符串相比于最大频度字符串或最大长度字符串,其实际语义更为明确.这样的字符串在文本压缩、基因序列的分析以及其他注重语义的应用中将具有很高的应用价值.  相似文献   

6.
为用后缀树聚类算法对维吾尔文网页进行聚类,通过分析可扩展后缀树和维吾尔文的特点设计了维吾尔文后缀树构造算法。实验结果证明该方法能够在线性的时间范围内构造维吾尔文后缀树,并用它来对维吾尔文网页进行聚类。  相似文献   

7.
后缀树是处理字符串的一个优秀算法。利用图像化设计可使后缀树更加清晰。按照递推的思路,建立前i个字符对应的后缀树,通过插入第i+1个字符的方式,建立前i+1个字符对应的后缀树。由于字符串的任意子串都可以表示为某个后缀的前缀,因此可以设定当前节点为根节点。父节点取子节点中贡献最大的节点,同时,记录其对应的字符串。  相似文献   

8.
研究了一种重要的字符规律性--字符串的seed.我们利用等价类的概念,提出了等价类树的结构.并在构建前缀等价类树和后缀等价类树的过程中,实时地求解出任意长度的字符串的seed.给定长度为n的字符串,这一算法的时间复杂度为O(nlogn).  相似文献   

9.
快速中文字符串模糊匹配算法   总被引:9,自引:3,他引:9  
本文解决了中文字符串模糊匹配的两个主要问题:空间问题和时间问题。目前字符串模糊匹配的两个主要方法是位向量方法和过滤方法。由于汉字众多,应用位向量方法时,需要大量空间。对于某些内存很少的小型计算机,比如嵌入式系统,这将会是一个问题。本文改进了位向量方法,使其在应用于中文字符串时,空间需求降低到约5%。本文还利用汉字非常多的特点,提出一种新的基于过滤方法的中文字符串模糊匹配算法,BPM-BM,其速度比世界上最快的算法至少提高14%;在大部分情况下,是其速度的1.5~2倍。  相似文献   

10.
非编码区重复序列分析在基因组研究中起着重要作用,其基础就是在非编码DNA序列中识别和定位所有的重复结构。重复序列识别问题在计算机科学中主要体现为字符串匹配问题。在分析了后缀树和后缀数组字符串匹配算法的基础上,详细阐述了基于后缀数组的精确串联重复序列识别方法。实验表明,该方法适合用于非编码DNA序列分析。  相似文献   

11.
Practical methods for constructing suffix trees   总被引:7,自引:0,他引:7  
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.  相似文献   

12.
Many string manipulations can be performed efficiently on suffix trees. In this paper a CRCW parallel RAM algorithm is presented that constructs the suffix tree associated with a string ofn symbols inO(logn) time withn processors. The algorithm requires Θ(n 2) space. However, the space needed can be reduced toO(n 1+?) for any 0< ? ≤1, with a corresponding slow-down proportional to 1/?. Efficient parallel procedures are also given for some string problems that can be solved with suffix trees.  相似文献   

13.
A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale for large inputs. As suffix trees are larger than the input sequences and quickly outgrow the main memory, the first attempts at building large suffix trees focused on algorithms which avoid massive random access to the trees being built. However, all the existing practical algorithms perform random access to the input string, thus requiring in essence that the input be small enough to be kept in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings.  相似文献   

14.
Suffix trees and suffix arrays are fundamental full-text index data structures to solve problems occurring in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(mlog |Σ|) time where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports the pattern search in O(mlog |Σ|) time. In a different point of view, it can be considered a practical implementation of the suffix tree supporting O(mlog |Σ|)-time pattern search. In addition, we also present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.  相似文献   

15.
The suffix tree is a key data structure for biological sequence analysis, since it permits efficient solutions to many string-based problems. Constructing large suffix trees is challenging because of high memory overheads and poor memory locality. Even though efficient suffix tree construction algorithms exist, their run-time is still very high for long DNA sequences such as whole human chromosomes. In this paper, we are using a hierarchical grid system as a computational platform in order to reduce this run-time significantly. To achieve an efficient mapping onto this type of architecture we introduce a parallel suffix tree construction algorithm that makes use of a new data structure called the common prefix suffix tree. Using this algorithm together with a dynamic load balancing strategy we show that our distributed grid implementation leads to significant run-time savings.  相似文献   

16.
We consider on-line construction of the suffix tree for a parameterized string, where we always have the suffix tree of the input string read so far. This situation often arises from source code management systems where, for example, a source code repository is gradually increasing in its size as users commit new codes into the repository day by day. We present an on-line algorithm which constructs a parameterized suffix tree in randomized O(n) time, where n is the length of the input string. Our algorithm is the first randomized linear time algorithm for the on-line construction problem.  相似文献   

17.
Constructing suffix tree for gigabyte sequences with megabyte memory   总被引:7,自引:0,他引:7  
Mammalian genomes are typically 3 Gbps (gibabase pairs) in size. The largest public database NCBI (National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov)) of DNA contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure to support exact/approximate sequence matching queries as well as repetitive structure finding efficiently when they can reside in main memory. But, it has been shown as difficult to handle long DNA sequences using suffix trees due to the so-called memory bottleneck problems. The most space efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB memory space to index the human genome [S. Kurtz (1999)]. We show that suffix trees for long DNA sequences can be efficiently constructed on disk using small bounded main memory space and, therefore, all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) to construct a diskbase suffix-tree without suffix links and 2) rebuild suffix links upon the suffix-tree being constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(nlogn) experimental behavior regarding CPU cost and linearity for I/O cost. DynaCluster needs 16 MB main memory only to construct more than 200 Mbps DNA sequences and significantly outperforms the existing disk-based suffix-tree construction algorithms using prepartitioning techniques in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.  相似文献   

18.
R. Giegerich  S. Kurtz  J. Stoye 《Software》2003,33(11):1035-1049
We present an efficient implementation of a write‐only top‐down construction for suffix trees. Our implementation is based on a new, space‐efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different type. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top‐down construction is often faster and more space efficient than other methods. Copyright © 2003 John Wiley & Sons, Ltd.  相似文献   

19.
We discuss an intrinsic generalization of the suffix tree, designed to index a string of length n which has a natural partitioning into m multicharacter substrings or words . This word suffix tree represents only the m suffixes that start at word boundaries. These boundaries are determined by delimiters , whose definition depends on the application. Since traditional suffix tree construction algorithms rely heavily on the fact that all suffixes are inserted, construction of a word suffix tree is nontrivial, in particular when only O(m) construction space is allowed. We solve this problem, presenting an algorithm with O(n) expected running time. In general, construction cost is Ω(n) due to the need of scanning the entire input. In applications that require strict node ordering, an additional cost of sorting O(m') characters arises, where m' is the number of distinct words. In either case, this is a significant improvement over previously known solutions. Furthermore, when the alphabet is small, we may assume that the n characters in the input string occupy o(n) machine words. We illustrate that this can allow a word suffix tree to be built in sublinear time. Received September 2, 1997; revised December 10, 1997.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号