首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
String matching is the problem of finding all the occurrences of a pattern in a text. We propose a very fast new family of string matching algorithms based on hashing q-grams. The new algorithms are the fastest on many cases, in particular, on small size alphabets.  相似文献   

2.
We present improved variations of the BNDM algorithm for exact string matching. At each alignment our bit-parallel algorithms process a q-gram before testing the state variable. In addition we apply reading a 2-gram in one instruction. Our point of view is practical efficiency of algorithms. Our experiments show that the new variations are faster than earlier algorithms in many cases.  相似文献   

3.
Windowed pq-grams for approximate joins of data-centric XML   总被引:1,自引:0,他引:1  
In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly different in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution are windowed pq-grams which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-grams distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudo-metric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real world data confirm the analytic results and show the effectiveness and efficiency of our technique.  相似文献   

4.
This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Fixed-length approximate string matching and approximate circular string matching are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k-mismatches. The development of CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate CVM algorithm in parallel. The PCVM algorithm is integrated into two-level parallelisms that exploit not only word-level parallelism but also data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of adopting GPUs, a shared-mem parallel counter-vector-mismatches (PCVMsmem) scheme can be implemented from PCVM algorithm. The PCVMsmem scheme can exploit the memory model of GPUs to optimize performance of PCVM algorithm. Finally, this paper shows several methods to adopt CVM and PCVM algorithms in case the input pattern is in circular structure. In the experiments with real DNA packages, our proposed algorithms and scheme work greatly faster than previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.  相似文献   

5.
This is an extended version of the paper presented at the 4th International Workshop NFMCP 2015 held in conjunction with ECML PKDD 2015. The initial version has been published in NFMCP 2015 conference proceedings as part of Springer Series. This paper presents a novel approach to financial times series (FTS) prediction by mapping hourly foreign exchange data to string representations and deriving simple trading strategies from them. To measure the degree of similarity in these market strings we apply familiar string kernels, bag of words and n-grams, whilst also introducing a new kernel, time-decay n-grams, that captures the temporal nature of FTS. In the process we propose a sequential Parzen windows algorithm based on discrete representations where trading decisions for each string are learned in an online manner and are thus subject to temporal fluctuations. We evaluate the strength of a number of representations using both the string version and its continuous counterpart, whilst also comparing the performance of different learning algorithms on these representations, namely support vector machines, Parzen windows and Fisher discriminant analysis. Our extensive experiments show that the simple string representation coupled with the sequential Parzen windows approach is capable of outperforming other more exotic approaches, supporting the idea that when it comes to working in high noise environments often the simplest approach is the most effective.  相似文献   

6.
By reducing an array matching problem to a string matching problem in a natural way, it is shown that efficient string matching algorithms may be applied to arrays. In this paper, based on the ideas due to Baker, an application of the two-dimensional on-line tessellation acceptor (2-dota) is presented for very rapid on-line detection of occurrences of a fixed set of keyarrays as embedded subarrays in a text array. The main part of the algorithm described in this paper consists of constructing two finite state pattern (string) matching machines from the keyarrays. By combining these two finite state pattern matching machines, we construct the 2-dota which, given an m × n text array, solves the two-dimensional pattern matching problem in m + n?1 steps.  相似文献   

7.
Justin Zobel  Philip Dart 《Software》1995,25(3):331-345
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.  相似文献   

8.
We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) 1989 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings—Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (mk)(k+2) bits of space to represent an NFA with (mk)(k+1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (mk)(k+1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over/extends the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.  相似文献   

9.
Two improved algorithms for string matching with k mismatches are presented. One algorithm is based on fast integer multiplication algorithms whereas the other follows more closely classic string-matching techniques.  相似文献   

10.
In parameterized string matching the pattern P matches a substring t of the text T if there exist a bijective mapping from the symbols of P to the symbols of t. We give simple and practical algorithms for finding all such pattern occurrences in sublinear time on average. The algorithms work for a single and multiple patterns.  相似文献   

11.

模式匹配算法是入侵检测系统(IDS) 中非常重要的一种算法. 在研究和分析几种常用模式匹配算法的基础 上, 提出一种快速的基于BM(Boyer-Moore) 模式匹配的改进算法—–IBM 算法. 该算法充分利用模式串的末字符和 末字符所对应的文本串的后两字符的唯一性, 同时参考文本串本身的信息来提高模式串的移动量, 使得每次失配后, 在保证不丢失匹配成功可能性的前提下尽可能多地向后跳跃. 实验结果表明, 该算法相比其他模式匹配算法, 在检测 性能和匹配效率上均具有很大优势, 并且能够有效地提高IDS 的检测效率和性能.

  相似文献   

12.
Mahmoud Moh'd Mhashi 《Software》2005,35(13):1299-1315
The effect of multiple reference characters and the condition types on the performance of exact string‐searching algorithms is tested. In order to perform such a test a new algorithm called the Multiple Reference Characters Algorithm (MRCA) is developed. An experiment is performed using English text; the results are compared with the known string‐matching algorithms called Boyer–Moore–Horspool (BMH) and Straight Forward (Naïve). With the MRCA algorithm, the shift distance is increased up to 3m + 1 positions in comparison with exactly one position in the Naïve algorithm and up to m positions in BMH. Furthermore, by using the new algorithm MRCA, the results suggest that the evaluation criteria of the average number of comparisons, the average number of shifts, and the clock time required by BMH are improved up to 73.1%, 64.7%, and 49.6%, respectively. The same evaluation criteria required by Naïve are improved by MRCA up to 98.1%, 98%, and 94.7%, respectively. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

13.
A signature-based intrusion detection system identifies intrusions by comparing the data traffic with known signature patterns. In this process, matching of packet strings against signature patterns is the most time-consuming step and dominates the overall system performance. Many signature-based network intrusion detection systems (NIDS), e.g., the Snort, employ one or multiple pattern matching algorithms to detect multiple attack types. So far, many pattern matching algorithms have been proposed. Most of them use single-byte standard unit for search, while a few algorithms such as the Modified Wu-Manber (MWM) algorithm use typically two-byte unit, which guarantees better performance than others even as the number of different signatures increases. Among those algorithms, the MWM algorithm has been known as the fastest pattern matching algorithm when the patterns in a rule set rarely appear in packets. However, the matching time of the MWM algorithm increases as the length of the shortest pattern in a signature group decreases.In this paper, by extending the length of the shortest pattern, we minimize the pattern matching time of the algorithm which uses multi-byte unit. We propose a new pattern matching algorithm called the L+1-MWM algorithm for multi-pattern matching. The proposed algorithm minimizes the performance degradation that is originated from the dependency on the length of the shortest pattern. We show that the L+1-MWM algorithm improves the performance of the MWM algorithm by as much as 20% in average under various lengths of shortest patterns and normal traffic conditions. Moreover, when the length of the shortest pattern in a rule set is less than 5, the L+1-MWM algorithm shows 38.87% enhancement in average. We also conduct experiments on a real campus network and show that 12.48% enhancement is obtained in average. In addition, it is shown that the L+1-MWM algorithm provides a better performance than the MWM algorithm by as much as 25% in average under various numbers of signatures and normal traffic conditions, and 20.12% enhancement in average with real on-line traffic.  相似文献   

14.
Recently a new variation of approximate Boyer-Moore string matching was presented for the k-mismatch problem. The variation, called FAAST, is specifically tuned for small alphabets. We further improve this algorithm gaining speedups in both preprocessing and searching. We also present three variations of the algorithm for the k-difference problem. We show that the searching time of the algorithms is average-optimal and the preprocessing also has a lower time complexity than FAAST. Our experiments show that our algorithm for the k-mismatch problem is about 30% faster than FAAST and the new algorithms compare well against other state-of-the-art algorithms for approximate string matching.  相似文献   

15.
An aggressive algorithm for multiple string matching   总被引:1,自引:0,他引:1  
A new algorithm based on the Wu-Manber algorithm for multiple string matching is presented in this paper. The algorithm eliminates the functional overlap of the table HASH and SHIFT, and computes the shift distances in an aggressive manner. After each test, the algorithm examines the character next to the scan window to maximize the shift distance. This idea is consistent with that of the quick-search (QS) algorithm. Experimental results on four alphabets show that the new algorithm is more efficient than Wu-Manber and other recent algorithms, particularly on short pattern sets and large alphabet.  相似文献   

16.
Interest matching is an important data-filtering mechanism for a large-scale distributed virtual environment. Many of the existing algorithms perform interest matching at discrete timesteps. Thus, they may suffer the missing-event problem: failing to report the events between two consecutive timesteps. Some algorithms solve this problem, by setting short timesteps, but they have a low computing efficiency. Additionally, these algorithms cannot capture all events, and some spurious events may also be reported. In this paper, we present an accurate interest matching algorithm called the predictive interest matching algorithm, which is able to capture the missing events between discrete timesteps. The PIM algorithm exploits the polynomial functions to model the movements of virtual entities, and predict the time intervals of region overlaps associated with the entities accurately. Based on the prediction of the space–time intersection of regions, our algorithm can capture all missing events and does not report the spurious events at the same time. To improve the runtime performance, a technique called region pruning is proposed and used in our algorithm. In experiments, we compare the new algorithm with the frequent interest matching algorithm and the space–time interest matching algorithm on the HLA/RTI distributed infrastructure. The results prove that although an additional matching effort is required in the new algorithm, it outperforms the baselines in terms of event-capturing ability, redundant matching avoidance, runtime efficiency and scalability.  相似文献   

17.
Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer–Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.  相似文献   

18.
In this paper we show that n×n matrices with entries from a semiring R which is generated additively by q generators can be multiplied in time O(q2nω), where nω is the complexity for matrix multiplication over a ring (Strassen: ω<2.807, Coppersmith and Winograd: ω<2.376).We first present a combinatorial matrix multiplication algorithm for the case of semirings with q elements, with complexity , matching the best known methods in this class.Next we show how the ideas used can be combined with those of the fastest known boolean matrix multiplication algorithms to give an O(q2nω) algorithm for matrices of, not necessarily finite, semirings with q additive generators.For finite semirings our combinatorial algorithm is simple enough to be a practical algorithm and is expected to be faster than the O(q2nω) algorithm for matrices of practically relevant sizes.  相似文献   

19.
基于字符串匹配的检测方法是入侵检测系统中的一种重要方法。通过分析几种常见的字符串匹配算法(AC、AC_BMH、Sunday等)的基础,提出了一种对AC算法的改进,新算法每一次匹配不成功后都能跳过尽可能多的字符以进行下一轮匹配,使得匹配次数大大减少,从而提高了匹配效率。分析了该算法的性能,并用具体的实验数据给出了几种匹配算法的测试结果。  相似文献   

20.
一种用于内容过滤和检测的快速多关键词识别算法   总被引:13,自引:0,他引:13  
基于字符串匹配的检测方法是内容过滤和检测系统中一类很重要的分析方法,首先分析了现有的几种快速字符串匹配算法,然后提出了一种新的多模式字符串匹配算法,并简单分析了算法的复杂性,算法在设计的过程中吸取了BM算法中跳跃的特性,采用了后缀树算法得到了最大跳跃值,采用AC算法的匹配自动机原理从而避免对搜索树内每一个字符的匹配,最后,通过具体的实验数据验证了这些算法的性能,通过实验可以看出,新算法使得检测速度有很大提高,并有效屏蔽了关键词数量的增加对检测速度的影响。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号