首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
New Techniques for Regular Expression Searching   总被引:1,自引:0,他引:1  
We present two new techniques for regular expression searching and use them to derive faster practical algorithms. Based on the specific properties of Glushkovs nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using O(m2m) bits, where m is the number of characters, excluding operator symbols, in the regular expression. This compares favorably against the worst case of O(m2m||) bits needed by a classical DFA representation (where is the alphabet) and O(m22m) bits needed by the Wu and Manber approach implemented in Agrep. We also present a new way to search for regular expressions, which is able to skip text characters. The idea is to determine the minimum length of a string matching the regular expression, manipulate the original automaton so that it recognizes all the reverse prefixes of length up to of the strings originally accepted, and use it to skip text characters as done for exact string matching in previous work. We combine these techniques into two algorithms, one able and one unable to skip text characters. The algorithms are simple to implement, and our experiments show that they permit fast searching for regular expressions, normally faster than any existing algorithm.  相似文献   

2.
A minimized automaton representation of reachable states   总被引:1,自引:0,他引:1  
We consider the problem of storing a set S⊂Σkas a deterministic finite automaton (DFA). We show that inserting a new string σ∈Σk or deleting a string from the set S represented as a minimized DFA can be done in expected time O(k|Σ|), while preserving the minimality of the DFA. The method can be applied to reduce the memory requirements of model checkers that are based on explicit state enumeration. As an example, we discuss an implementation of the method for the model checker Spin.  相似文献   

3.
Faster Approximate String Matching   总被引:11,自引:0,他引:11  
We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω (log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)) , where m is the pattern length and k<m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m=O(log n) . Longer patterns can be processed by partitioning the automaton into many machine words, at O(mk/w n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost. We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with less errors using the simple automaton, to obtain an average cost close to . Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near average complexity (σ is the alphabet size). We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology. Received November 22, 1996; revised October 13 and December 5, 1997.  相似文献   

4.
Given an alphabet Σ={1,2,…,|Σ|} text string T∈Σ n and a pattern string P∈Σ m , for each i=1,2,…,nm+1 define L p (i) as the p-norm distance when the pattern is aligned below the text and starts at position i of the text. The problem of pattern matching with L p distance is to compute L p (i) for every i=1,2,…,nm+1. We discuss the problem for d=1,2,∞. First, in the case of L 1 matching (pattern matching with an L 1 distance) we show a reduction of the string matching with mismatches problem to the L 1 matching problem and we present an algorithm that approximates the L 1 matching up to a factor of 1+ε, which has an O(\frac1e2nlogmlog|S|)O(\frac{1}{\varepsilon^{2}}n\log m\log|\Sigma|) run time. Then, the L 2 matching problem (pattern matching with an L 2 distance) is solved with a simple O(nlog m) time algorithm. Finally, we provide an algorithm that approximates the L matching up to a factor of 1+ε with a run time of O(\frac1enlogmlog|S|)O(\frac{1}{\varepsilon}n\log m\log|\Sigma|) . We also generalize the problem of String Matching with mismatches to have weighted mismatches and present an O(nlog 4 m) algorithm that approximates the results of this problem up to a factor of O(log m) in the case that the weight function is a metric.  相似文献   

5.
Howard Chivers 《Software》2017,47(5):669-688
The jsre regular expression library was designed to provide fast matching of complex expressions over large input streams using user‐selectable character encodings. An established design approach was used: a simulated non‐deterministic automaton (NFA) implemented as a virtual machine, avoiding exponential cost functions in either space or time. A deterministic automaton (DFA) was chosen as a general dispatching mechanism for Unicode character classes, and this also provided the opportunity to use compact DFAs in various optimization strategies. The result was the development of a regular expression Preview which provides a summary of all the matches possible from a given point in a regular expression in a form that can be implemented as a compact DFA and can be used to further improve the performance of the standard NFA simulation algorithm. This paper formally defines a preview and describes and evaluates several optimizations using this construct. They provide significant speed improvements accrued from fast scanning of anchor positions, avoiding retesting of repeated strings in unanchored searches and efficient searching of multiple alternate expressions which in the case of keyword searching has a time complexity which is logarithmic in the number of words to be searched. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

6.
Gonzalo Navarro  Jorma Tarhio 《Software》2005,35(12):1107-1130
We present a Boyer–Moore (BM) approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the BM approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern searches. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress‐then‐search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep‐like capabilities directly searching files compressed using Unix's Compress, a LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress‐then‐search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

7.
Suffix trees and suffix arrays are fundamental full-text index data structures to solve problems occurring in string processing. Since suffix trees and suffix arrays have different capabilities, some problems are solved more efficiently using suffix trees and others are solved more efficiently using suffix arrays. We consider efficient index data structures with the capabilities of both suffix trees and suffix arrays without requiring much space. When the size of an alphabet is small, enhanced suffix arrays are such index data structures. However, when the size of an alphabet is large, enhanced suffix arrays lose the power of suffix trees. Pattern searching in an enhanced suffix array takes O(m|Σ|) time while pattern searching in a suffix tree takes O(mlog |Σ|) time where m is the length of a pattern and Σ is an alphabet. In this paper, we present linearized suffix trees which are efficient index data structures with the capabilities of both suffix trees and suffix arrays even when the size of an alphabet is large. A linearized suffix tree has all the functionalities of the enhanced suffix array and supports the pattern search in O(mlog |Σ|) time. In a different point of view, it can be considered a practical implementation of the suffix tree supporting O(mlog |Σ|)-time pattern search. In addition, we also present two efficient algorithms for computing suffix links on the enhanced suffix array and the linearized suffix tree. These are the first algorithms that run in O(n) time without using the range minima query. Our experimental results show that our algorithms are faster than the previous algorithms.  相似文献   

8.
With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.  相似文献   

9.
Given a text string of lengthn and a pattern string of lengthm over ab-letter alphabet, thek differences approximate string matching problem asks for all locations in the text where the pattern occurs with at mostk differences (substitutions, insertions, deletions). We treatk not as a constant but as a fraction ofm (not necessarily constant-fraction). Previous algorithms require at leastO(kn) time (or exponential space). We give an algorithm that is sublinear time0((n/m)k log b m) when the text is random andk is bounded by the threshold m/(logb m + O(1)). In particular, whenk=o(m/logb m) the expected running time iso(n). In the worst case our algorithm is O(kn), but is still an improvement in that it is practical and uses0(m) space compared with0(n) or0(m 2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.This work was supported in part by NSF Grants CCR-87-04184 and FD-89-02813; by the Human Genome Center, Lawrence Berkeley Laboratory, supported by the Director, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098; and by Department of Energy Grants DE-FG03-90ER60999 and DE-FG02-91ER61190. Earlier versions of this paper appeared as [8] and part of [5].  相似文献   

10.
一种用于内容过滤和检测的快速多关键词识别算法   总被引:13,自引:0,他引:13  
基于字符串匹配的检测方法是内容过滤和检测系统中一类很重要的分析方法,首先分析了现有的几种快速字符串匹配算法,然后提出了一种新的多模式字符串匹配算法,并简单分析了算法的复杂性,算法在设计的过程中吸取了BM算法中跳跃的特性,采用了后缀树算法得到了最大跳跃值,采用AC算法的匹配自动机原理从而避免对搜索树内每一个字符的匹配,最后,通过具体的实验数据验证了这些算法的性能,通过实验可以看出,新算法使得检测速度有很大提高,并有效屏蔽了关键词数量的增加对检测速度的影响。  相似文献   

11.
Mahmoud Moh'd Mhashi 《Software》2005,35(13):1299-1315
The effect of multiple reference characters and the condition types on the performance of exact string‐searching algorithms is tested. In order to perform such a test a new algorithm called the Multiple Reference Characters Algorithm (MRCA) is developed. An experiment is performed using English text; the results are compared with the known string‐matching algorithms called Boyer–Moore–Horspool (BMH) and Straight Forward (Naïve). With the MRCA algorithm, the shift distance is increased up to 3m + 1 positions in comparison with exactly one position in the Naïve algorithm and up to m positions in BMH. Furthermore, by using the new algorithm MRCA, the results suggest that the evaluation criteria of the average number of comparisons, the average number of shifts, and the clock time required by BMH are improved up to 73.1%, 64.7%, and 49.6%, respectively. The same evaluation criteria required by Naïve are improved by MRCA up to 98.1%, 98%, and 94.7%, respectively. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

12.
There is no known algorithm that solves the general case of theapproximate string matching problem with the extended edit distance, where the edit operations are: insertion, deletion, mismatch and swap, in timeo(nm), wheren is the length of the text andm is the length of the pattern. In an effort to study this problem, the edit operations were analysed independently. It turns out that the approximate matching problem with only the mismatch operation can be solved in timeO(nm logm). If the only edit operation allowed is swap, then the problem can be solved in timeO(n logm logσ), whereσ=min(m, |Σ|). In this paper we show that theapproximate string matching problem withswap andmismatch as the edit operations, can be computed in timeO(nm logm). Amihood Amir was partially supported by NSF Grant CCR-01-04494 and ISF Grant 35/05. This work is part of Estrella Eisenberg’s M.Sc. thesis. Ely Porat was partially supported by GIF Young Scientists Program Grant 2055-1168.6/2002.  相似文献   

13.
Signature-based intrusion detection is required to inspect network traffic at wire-speed. Matching packet payloads against patterns specified with regular expression is a computation intensive task. Hence, the design of hardware accelerator to speed up regular expression matching has been an active research area. A systematic approach to detect regular expression is based on finite automaton. The space-time trade-off between deterministic finite automaton (DFA) and non-deterministic finite automaton (NFA) is well-known. DFA can offer constant throughput but it may suffer from the state explosion problem. Hence, implementation of DFA for large pattern sets on embedded device with limited on-chip memory may not be viable. NFA requires linear space but the throughput can be very low. Implementations of NFA with hardwired circuits can overcome the speed deficiency by exploiting the massive parallelism offered by dedicated hardware circuitries, but this approach does not support efficient dynamic updates. In this paper, we shall present a memory-based architecture for the implementation of NFA to speed up regular expression matching for signature-based intrusion detection. The proposed method supports dynamic updates and offers constant throughput so that it can be used to supplement the existing DFA-based methods in handling large pattern sets.  相似文献   

14.
针对确定有限自动机(DFA)的正则表达式匹配技术存在状态膨胀和一次状态转移只能处理单个字符的问题,提出了一种基于布鲁姆过滤器的正则表达式匹配算法。该算法将正则表达式中的每个确定字符串组成DFA的一个状态,添加比特向量完成匹配过程,并且在一次状态转移中根据确定字符串的匹配结果达到处理多个字符的目的。实验分析表明该算法有效降低了DFA状态的膨胀,提高了匹配速率。  相似文献   

15.
We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) 1989 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings—Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (mk)(k+2) bits of space to represent an NFA with (mk)(k+1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (mk)(k+1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over/extends the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.  相似文献   

16.
PRAM和LARPBS模型上的近似串匹配并行算法   总被引:15,自引:1,他引:15  
钟诚  陈国良 《软件学报》2004,15(2):159-169
近似串匹配技术在网络信息搜索、数字图书馆、模式识别、文本挖掘、IP路由查找、网络入侵检测、生物信息学、音乐研究计算等领域具有广泛的应用.基于CREW-PRAM(parallel random access machine with concurrent read and exclusive write)模型,采用波前式并行推进的方法直接计算编辑距离矩阵D,设计了一个允许k-差别的近似串匹配动态规划并行算法,该算法使用(m+1)个处理器,时间复杂度为O(n),算法理论上达到线性加速;采取水平和斜向双并行计算编辑距离矩阵D的方法,设计了一个使用((m+1)个处理器和O(n/(+m)时间的、可伸缩的、允许k-差别的近似串匹配动态规划并行算法,.基于分治策略,通过灵活拆分总线和合并子总线动态重构光总线系统,并充分利用光总线的消息播送技术和并行计算前缀和的方法,实现了汉明距离的并行计算,设计了两个基于LARPBS(linear arrays with reconfigurable pipelined bus system)模型的通信高效、可扩放的允许k-误配的近似串匹配并行算法,其中一个算法使用n个处理器,时间为O(m);另一个为常数时间算法,使用mn个处理器.  相似文献   

17.
A word w is called synchronizing (recurrent, reset, directable) word of deterministic finite automata (DFA) if w brings all states of the automaton to a unique state. According to the famous conjecture of Cerny from 1964, every n-state synchronizing automaton possesses a synchronizing word of length at most (n - 1)2. The problem is still open. It will be proved that the Cerny conjecture holds good for synchronizing DFA with transition monoid having no involutions and for every n-state (n 〉 2) synchronizing DFA with transition monoid having only trivial subgroups the minimal length of synchronizing word is not greater than (n - 1)2/2. The last important class of DFA involved and studied by Schutzenberger is called aperiodic; its automata accept precisely star-free languages. Some properties of an arbitrary synchronizing DFA were established.  相似文献   

18.
Given a text T[1…n] and a pattern P[1…m] over some alphabet Σ of size σ, we want to find all the (exact) occurrences of P in T. The well-known shift-or algorithm solves this problem in time O(nm/w⌉), where w is the number of bits in machine word, using bit-parallelism. We show how to extend the bit-parallelism in another direction, using super-alphabets. This gives a speed-up by a factor s, where s is the number of characters processed simultaneously. The algorithm is implemented, and we show that it works well in practice too. The result is the fastest known algorithm for exact string matching for short patterns and small alphabets.  相似文献   

19.
Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer–Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.  相似文献   

20.

模式匹配算法是入侵检测系统(IDS) 中非常重要的一种算法. 在研究和分析几种常用模式匹配算法的基础 上, 提出一种快速的基于BM(Boyer-Moore) 模式匹配的改进算法—–IBM 算法. 该算法充分利用模式串的末字符和 末字符所对应的文本串的后两字符的唯一性, 同时参考文本串本身的信息来提高模式串的移动量, 使得每次失配后, 在保证不丢失匹配成功可能性的前提下尽可能多地向后跳跃. 实验结果表明, 该算法相比其他模式匹配算法, 在检测 性能和匹配效率上均具有很大优势, 并且能够有效地提高IDS 的检测效率和性能.

  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号