共查询到20条相似文献,搜索用时 46 毫秒
1.
2.
为用后缀树聚类算法对维吾尔文网页进行聚类,通过分析可扩展后缀树和维吾尔文的特点设计了维吾尔文后缀树构造算法。实验结果证明该方法能够在线性的时间范围内构造维吾尔文后缀树,并用它来对维吾尔文网页进行聚类。 相似文献
3.
4.
一种适合于GPU计算的并行后缀数组构造算法 总被引:1,自引:0,他引:1
后缀数组广泛应用于序列分析、字符串匹配和文本压缩,近年来,有关后缀数组构造和应用算法的不断探索构成了计算机科学中一个非常活跃的研究领域.在对现有串行算法进行了分析和对比之后,提出了一种新的、简洁的适合于GPU计算的并行后缀数组倍增构造算法,以排序方法替代传统的分组策略,不但能独立完成后缀数组的并行构造,还可与现存的串行倍增算法结合使用,以达到最高的执行效率.实验结果表明该算法在解决实际应用问题时,具有易于实现、执行速度快和可扩展性强等优点,尤其在处理小字符集的数据时效率更高. 相似文献
5.
提出一种基于后缀表示的构建系统发生树的蚁群算法(SR-PTC),该算法用蚂蚁访问物种集合以形成一个对应最优系统发生树的后缀表示序列。为构成一个合法的系统发生树的后缀表示,蚂蚁对内部结点的选择要受到限制,分别为叶结点和内部结点设置两个不同的选择概率,并用赌轮盘选择方法来决定两种结点的选择。另外,在信息素更新时,加入当前树的评价值来影响蚂蚁的运动方向。实验结果表明,此方法能得到较为准确的拓扑结构,在物种数目较小时可以较快地得到结果。 相似文献
6.
DNA序列中基于适应性后缀树的重复体识别算法 总被引:1,自引:0,他引:1
现有的在DNA序列中识别重复体的算法多数是基于比对的,对识别速度和吞吐量有很大的限制.针对这个问题文中根据一个平衡重复体的长度和频率的定义,提出了一种基于Ukkonen后缀树的快速识别重复体的RepSeeker算法.算法采用最低限制频率,最大程度地扩展了重复体的长度,同时为了进一步地提高RepSeeker算法的效率,对Ukkonen的后缀树构造算法进行了适应性改进,在构造时加入RepSeeker算法所需的结点信息并将叶子结点和分支结点加以区分,从而使得RepSeeker算法能通过直接读取结点信息来求得子串频率和子串位置.这种改进较大地提高了RepSeeker算法的性能,而且空间开销不大.实验中使用了NCBI中的9条典型DNA序列作为测试数据,并对后缀树改进前后的重复体识别算法做了比较分析.结果表明,RepSeeker在没有损失精度的情况下缩短了算法的运行时间.实验结果与理论上的分析一致. 相似文献
7.
入侵检测中基于后缀树的多模式匹配算法 总被引:1,自引:0,他引:1
针对模式匹配算法已经成为误用入侵检测系统性能瓶颈的现状,提出了一种新的基于后缀树的多模式匹配算法FSM(Fast String Matching Algorithm).该算法构建了一个后缀自动机,匹配中应用了好后缀启发机制进行启发跳跃.将改算法在Snort2.4.3中实现,实验结果表明,在耗费一定空间的基础上,Snort的时间性能有了较大提高. 相似文献
8.
9.
文章首先介绍了PDBMS采用的Hash-Round-Robin(HRR)数据划分方法以及基于该划分方法的并行RDBn树,最后着重、详细地给出了基于该树的并行Join算法,分析了该算法的效率。 相似文献
10.
11.
Practical methods for constructing suffix trees 总被引:7,自引:0,他引:7
Yuanyuan Tian Sandeep Tata Richard A. Hankins Jignesh M. Patel 《The VLDB Journal The International Journal on Very Large Data Bases》2005,14(3):281-299
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation
in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of
queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However,
methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not
fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior
of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large
number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios
are not well characterized.
In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show
that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction.
For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this
problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have
been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second,
we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing
very large suffix trees with very little resources, this algorithm is more efficient than TDD. 相似文献
12.
S.Srinivasa Rao 《Information Processing Letters》2002,82(6):307-311
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier. 相似文献
13.
Suffix trees are the fundamental data structure of combinatorial pattern
matching on words. Suffix trees have been used in order to give optimal
solutions to a great variety of problems on static words, but for practical
situations, such as in a text editor, where the incremental changes of
the text make dynamic updating of the corresponding suffix trees necessary, this
data structure alone has not been used with success. We prove that, for dynamic
modifications of order O(1) of words of length n, any suffix tree updating
algorithm, such as the ones proposed by McCreight, requires O(n) worst-case
running time, as for the full reconstruction of the suffix tree. Consequently,
we argue that this data structure alone is not appropriate for the solution
of combinatorial problems on words that change dynamically. 相似文献
14.
The suffix tree is a key data structure for biological sequence analysis, since it permits efficient solutions to many string-based problems. Constructing large suffix trees is challenging because of high memory overheads and poor memory locality. Even though efficient suffix tree construction algorithms exist, their run-time is still very high for long DNA sequences such as whole human chromosomes. In this paper, we are using a hierarchical grid system as a computational platform in order to reduce this run-time significantly. To achieve an efficient mapping onto this type of architecture we introduce a parallel suffix tree construction algorithm that makes use of a new data structure called the common prefix suffix tree. Using this algorithm together with a dynamic load balancing strategy we show that our distributed grid implementation leads to significant run-time savings. 相似文献
15.
On-line construction of suffix trees 总被引:47,自引:0,他引:47
E. Ukkonen 《Algorithmica》1995,14(3):249-260
An on-line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It always has the suffix tree for the scanned part of the string ready. The method is developed as a linear-time version of a very simple algorithm for (quadratic size) suffixtries. Regardless of its quadratic worst case this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give, in a natural way, the well-known algorithms for constructing suffix automata (DAWGs).This research was supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany). 相似文献
16.
后缀数组构建算法的时间和空间开销是它在实际应用中的瓶颈。该文介绍了两种较好的构建算法,对它们的性能作了评估和分析,指出了各自的适用范围,给出并比较了两种算法在不同情况下的实验结果。 相似文献
17.
We present a new simple algorithm that constructs an Aho Corasick automaton for a set of patterns, P, of total length n, in O(n) time and space for integer alphabets. Processing a text of size m over an alphabet Σ with the automaton costs O(mlog|Σ|+k), where there are k occurrences of patterns in the text.A new, efficient implementation of nodes in the Aho Corasick automaton is introduced, which works for suffix trees as well. 相似文献
18.
基于后缀树的带有通配符的模式匹配研究 总被引:1,自引:1,他引:0
由于在生物序列分析、文本索引、网络入侵检测等领域的应用需求,带有通配符的模式匹配问题一直是研究的热点。针对已有的研究工作中通配符和长度约束具有较强的局限性问题,研究带有灵活通配符的模式匹配问题,其中通配符可以在模式的任意两子串间出现且可以指定灵活的长度约束。采用非线性数据结构—后缀树,设计了求解模式所有解的完备算法PAS\"I'。预处理阶段采用在线增量式算法构建具有文本先验知识的后缀树,搜索阶段结合动态规划的思想,逐个匹配模式中字符,最终得到完备解。在基因序列上的实验表明,PAST比其他算法具有更好的时间性能。 相似文献
19.
Engineering a Lightweight Suffix Array Construction
Algorithm 总被引:6,自引:0,他引:6
In this paper we describe a new algorithm for building the suffix array of
a string. This task is equivalent to the problem of lexicographically sorting
all the suffixes of the input string. Our algorithm is based on a new
approach called deep–shallow sorting: we use a shallow sorter for the
suffixes with a short common prefix, and a deep sorter for the suffixes
with a long common prefix.
All the known algorithms for building the suffix array either require a large
amount of space or are inefficient when the input string contains many
repeated substrings. Our algorithm has been designed to overcome this
dichotomy. Our algorithm is lightweight in the sense that it
uses very small space in addition to the space required by the suffix array
itself. At the same time our algorithm is fast even when the input contains
many repetitions: this has been shown by extensive experiments with inputs of
size up to 110 Mb.
The source code of our algorithm, as well as a C library providing a
simple API, is available under the GNU GPL. 相似文献
20.
Tandem repeat在基因组成和进化中起到非常重要的作用,查找和分析Tandem repeat已经成为当前生物信息学的一个前沿领域和研究焦点.目前在这一研究领域存在多类解决方法,主要有基于LZ分解技术的方法和最近兴起的基于后缀树索引的方法.本文选取了两种时间复杂度达到O(nlogn)数量级的代表性的方法,对这两种方法进行了全面的综述,并对它们的性能进行了系统的比较和分析. 相似文献