Similar Documents
20 similar documents found (search time: 15 ms).
1.
Efficient string matching with wildcards and length constraints
This paper defines a challenging pattern matching problem between a pattern P and a text T, with wildcards and length constraints, and designs an efficient algorithm that returns each pattern occurrence in an online manner. In this problem, the user can specify constraints on the number of wildcards between every two consecutive letters of P and on the length of each matching substring in T. We design a complete algorithm, SAIL, that returns each matching substring of P in T as soon as it appears in T, in O(n + klmg) time with an O(lm) space overhead, where n is the length of T, k is the frequency of P's last letter occurring in T, l is the user-specified maximum length for each matching substring, m is the length of P, and g is the maximum difference between the user-specified maximum and minimum numbers of wildcards allowed between two consecutive letters in P. SAIL stands for string matching with wildcards and length constraints.
Gong Chen received the B.Eng. degree from the Beijing University of Technology, China, and the M.Sc. degree from the University of Vermont, USA, both in computer science. He is currently a graduate student in the Department of Statistics at the University of California, Los Angeles, USA. His research interests include data mining, statistical learning, machine learning, algorithm analysis and design, and database management.
Xindong Wu is a professor and the chair of the Department of Computer Science at the University of Vermont. He holds a Ph.D. in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He has published extensively in these areas in various journals and conferences, including IEEE TKDE, TPAMI, ACM TOIS, IJCAI, AAAI, ICML, KDD, ICDM and WWW, as well as 12 books and conference proceedings. Dr. Wu is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (by the IEEE Computer Society), the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), an Honorary Editor-in-Chief of Knowledge and Information Systems (by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He is the 2004 ACM SIGKDD Service Award winner.
Xingquan Zhu received his Ph.D. degree in Computer Science from Fudan University, Shanghai, China, in 2001. He spent 4 months with Microsoft Research Asia, Beijing, China, where he worked on content-based image retrieval with relevance feedback. From 2001 to 2002, he was a postdoctoral associate in the Department of Computer Science at Purdue University, West Lafayette, IN. He is currently a research assistant professor in the Department of Computer Science, the University of Vermont, Burlington, VT. His research interests include data mining, machine learning, data quality, multimedia computing, and information retrieval. Since 2000, Dr. Zhu has published extensively, including over 50 refereed papers in various journals and conference proceedings.
Abdullah N. Arslan got his Ph.D. degree in Computer Science in 2002 from the University of California at Santa Barbara. Upon his graduation he joined the Department of Computer Science at the University of Vermont as an assistant professor, and he has been with the computer science faculty there since then. Dr. Arslan's main research interests are algorithms on strings, computational biology, and bioinformatics. Dr. Arslan earned his Master's degree in Computer Science in 1996 from the University of North Texas, Denton, Texas, and his Bachelor's degree in Computer Engineering in 1990 from the Middle East Technical University, Ankara, Turkey. He worked as a programmer for the Central Bank of Turkey between 1991 and 1994.
Yu He received her B.E. degree in Information Engineering from Zhejiang University, China, in 2001. She is currently a graduate student in the Department of Computer Science at the University of Vermont. Her research interests include data mining, bioinformatics and pattern recognition.
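To make the problem definition concrete, the sketch below enumerates matches of a pattern whose consecutive letters may be separated by a bounded number of wildcards, with an overall length cap on each matching substring. It is a brute-force illustration of the problem statement only, not the SAIL algorithm; the function name gap_matches and the parameters gaps and max_len are invented for this sketch.

def gap_matches(text, pattern, gaps, max_len):
    # Brute-force matching with wildcards and length constraints (NOT SAIL):
    # gaps[i] = (lo, hi) bounds the number of wildcards allowed between
    # pattern[i] and pattern[i+1]; max_len caps each matching substring.
    results = []

    def extend(start, pos, i):
        # pos is the text index matched to pattern[i]
        if i == len(pattern) - 1:
            if pos - start + 1 <= max_len:
                results.append((start, pos))
            return
        lo, hi = gaps[i]
        for g in range(lo, hi + 1):
            nxt = pos + 1 + g
            if nxt >= len(text) or nxt - start + 1 > max_len:
                break
            if text[nxt] == pattern[i + 1]:
                extend(start, nxt, i + 1)

    for s, c in enumerate(text):
        if c == pattern[0]:
            extend(s, s, 0)
    return results

For example, gap_matches("axbyyb", "ab", [(0, 2)], 4) returns [(0, 2)]: the 'a' at position 0 matches the 'b' at position 2 with one wildcard in between, within the length cap.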

2.
The problem of pattern matching with wildcards is to find all the occurrences of a pattern of length m in a text of length n over a finite alphabet Σ (both the text and the pattern are allowed to contain wildcards). Based on the prime number encoding scheme (Chaim Linhart, Ron Shamir, Faster pattern matching with character classes using prime number encoding, J. Comput. Syst. Sci. 75 (3) (2009) 155-162), we present a new integer encoding and an efficient fast Fourier transform (FFT) based algorithm for this problem. The algorithm searches for the pattern in the text by computing a single convolution. For matching with wildcards, our encoding uses fewer prime numbers and shorter code words than the prime number encoding. We use at most 2 lg|Σ| prime numbers to encode the symbols, while the prime number encoding requires |Σ| prime numbers; this number reduces to 1.5 lg|Σ| when |Σ| > 40. The code word used in the algorithm is at most 2⌊lg|Σ|⌋⌈lg(5m)⌉ bits, shorter than that of the prime number encoding. We also show that the length of the code words can be further reduced by increasing the number of convolutions computed.

3.
Given a text T[1…n] and a pattern P[1…m] over some alphabet Σ of size σ, we want to find all the (exact) occurrences of P in T. The well-known shift-or algorithm solves this problem in time O(n⌈m/w⌉), where w is the number of bits in a machine word, using bit-parallelism. We show how to extend the bit-parallelism in another direction, using super-alphabets. This gives a speed-up by a factor of s, where s is the number of characters processed simultaneously. The algorithm is implemented, and we show that it works well in practice too. The result is the fastest known algorithm for exact string matching for short patterns and small alphabets.
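As background, a minimal single-character shift-or implementation is sketched below, using a Python integer as the bit vector and assuming the pattern fits in one machine word; the super-alphabet speed-up described above (processing s characters per step) is not shown.

def shift_or(text, pattern):
    # Classic shift-or: state bit i is 0 iff pattern[0..i] matches the
    # text suffix ending at the current position.
    m = len(pattern)
    full = (1 << m) - 1
    mask = {}                              # mask[c]: bit i cleared iff pattern[i] == c
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, full) & ~(1 << i)
    state, occ = full, []
    for j, c in enumerate(text):
        state = ((state << 1) & full) | mask.get(c, full)
        if not state & (1 << (m - 1)):
            occ.append(j - m + 1)          # occurrence ends at position j
    return occ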

4.
We present improved variations of the BNDM algorithm for exact string matching. At each alignment our bit-parallel algorithms process a q-gram before testing the state variable. In addition, we read a 2-gram in one instruction. Our focus is the practical efficiency of the algorithms. Our experiments show that the new variations are faster than earlier algorithms in many cases.
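For reference, plain BNDM (without the q-gram and 2-gram refinements added in the paper) can be sketched as below; this is an illustrative reimplementation, not the authors' code, and it assumes the pattern length does not exceed the machine word size.

def bndm(text, pattern):
    # Backward Nondeterministic DAWG Matching: scan each window right to
    # left with a bit-parallel NFA; 'last' records where the longest proper
    # pattern prefix was seen, which determines the next shift.
    n, m = len(text), len(pattern)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    occ, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):
                if j > 0:
                    last = j          # a proper pattern prefix ends here
                else:
                    occ.append(pos)   # the whole pattern was recognized
                    break
            D = (D << 1) & full
        pos += last
    return occ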

5.
We present a simple and faster solution to the problem of matching a set of patterns with variable length don't cares. Given an alphabet Σ, a pattern p is a word p1@p2@⋯@pm, where each pi is a string over Σ called a keyword and @ ∉ Σ is a symbol called a variable length don't care (VLDC) symbol. Pattern p matches a text t if t = u0 p1 u1 ⋯ um−1 pm um for some u0,…,um ∈ Σ*. The problem addressed in this paper is: given a set of patterns P and a text t, determine whether one of the patterns of P matches t. Kucherov and Rusinowitch (1997) [9] presented an algorithm that solves the problem in time O((|t|+|P|) log |P|), where |P| is the total length of keywords in every pattern of P. We give a new algorithm based on the Aho-Corasick automaton. It uses the solution of the Dynamic Marked Ancestor Problem of Chan et al. (2007) [5]. The algorithm takes O((|t|+‖P‖) log κ / log log κ) time, where ‖P‖ is the total number of keywords in every pattern of P, and κ is the number of distinct keywords in P. The algorithm is faster and simpler than the previous approach.
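For a single pattern, the matching relation itself can be checked greedily, as in the sketch below: take the leftmost occurrence of each keyword after the end of the previous one. This only illustrates the definition; it is not the paper's algorithm, which handles a whole set of patterns with an Aho-Corasick automaton and dynamic marked-ancestor queries.

def matches_vldc(text, keywords):
    # Check whether the VLDC pattern k1 @ k2 @ ... @ km matches text.
    # If any match exists, the greedy leftmost choice also finds one.
    pos = 0
    for kw in keywords:
        pos = text.find(kw, pos)
        if pos == -1:
            return False
        pos += len(kw)                # keyword occurrences must not overlap
    return True

For example, matches_vldc("xxabyycdzz", ["ab", "cd"]) returns True, since "ab" and "cd" occur in order, separated by arbitrary text.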

6.
The paper shows how to compute exactly the expectations, standard deviations, and cumulative probabilities of the searching times of string-matching algorithms based on the use of automata. This is derived from a methodology based on viewing the underlying Markov chains as exponential families and applying recent results on them.

7.
Light-based string matching
String matching is a very important problem in computer science. The problem consists of finding all the occurrences of a pattern P of length m in a text T of length n. We describe a special device which can do string matching by performing n − m + 1 text-to-pattern comparisons. The proposed device uses light and optical filters for performing computations. Two physical implementations are proposed: one uses colored glass and the other uses polarizing filters. The strengths and weaknesses of each method are discussed in depth.

8.
This paper presents simple and deterministic algorithms for partial point set pattern matching in 2D. Given a set P of n points, called the sample set, and a query set Q of k points (n ≥ k), the problem is to find a matching of Q with a subset of P under rigid motion. The match may be of two types: exact and approximate. If an exact matching exists, then each point in Q coincides with the corresponding point in P under some translation and/or rotation. For an approximate match, some translation and/or rotation may be allowed such that each point in Q lies in a predefined ε-neighborhood region around some point in P. The proposed algorithm for the exact matching needs O(n^2) space and preprocessing time. The existence of a match for a given query set Q can then be checked in worst-case time depending on α, where α is the maximum number of equidistant pairs of points in P; for a set of n points, α may be O(n^{4/3}) in the worst case. Some applications of partial point set pattern matching are then illustrated. Experimental results on random point sets and some fingerprint databases show that, in practice, the computation time is much smaller than the worst-case requirement. The algorithm is then extended to check the exact match of a set of k line segments in the query set with a k-subset of the n line segments in the sample set under rigid motion. Next, a simple version of the approximate matching problem is studied where one point of Q exactly matches a point of P, and each of the other points of Q lies in the ε-neighborhood of some point of P. The space complexity of the proposed algorithm is O(n) in the worst case. The proposed algorithms will find many applications in fingerprint matching, image registration, and object recognition.

9.
S2S: structural-to-syntactic matching similar documents
Management of large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In its simplest form, documents are generated, duplicated and updated through emails and web pages. Although redundancy may increase reliability to a degree, uncontrolled redundancy aggravates retrieval performance and might be useless if the returned documents are obsolete. Document similarity matching algorithms do not provide information on the differences between documents, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which is composed of structural and syntactic phases to compare documents. Firstly, in the structural phase, documents are decomposed into components by their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically based on the structural organization of a document. Secondly, the syntactic matching algorithm uses a heuristic look-ahead algorithm for matching consecutive tokens with a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.

10.
Multiple filtration and approximate pattern matching
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and by the optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching. This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institutes of Health under Grant No. GM-36230.
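A simpler, single-level pigeonhole filter conveys the filter-then-verify idea behind such methods: split the pattern into k + 1 disjoint blocks, locate exact block occurrences, and verify the candidate positions by Hamming distance. This is not the paper's multiple-filtration scheme, only a minimal sketch of the general approach; the function name and parameters are invented here.

def k_mismatch_filter(text, pattern, k):
    # Filtration: any occurrence with at most k mismatches contains at
    # least one of the k+1 disjoint pattern blocks exactly (pigeonhole).
    # Verification: check the Hamming distance at each candidate start.
    n, m = len(text), len(pattern)
    assert m >= k + 1
    b = m // (k + 1)                  # block length
    candidates = set()
    for j in range(k + 1):
        block = pattern[j * b:(j + 1) * b]
        start = text.find(block)
        while start != -1:
            cand = start - j * b      # implied start of a full occurrence
            if 0 <= cand <= n - m:
                candidates.add(cand)
            start = text.find(block, start + 1)
    return sorted(c for c in candidates
                  if sum(x != y for x, y in zip(text[c:c + m], pattern)) <= k)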

11.
Approximate pattern matching algorithms have become an important tool in computer-assisted music analysis and information retrieval. The number of different problem formulations has greatly expanded in recent years, not least because of the subjective nature of measuring musical similarity. From an algorithmic perspective, the complexity of each problem depends crucially on the exact definition of the difference between two strings. We present an overview of advances in approximate string matching in this field, focusing on new measures of approximation.

12.
String matching is the problem of finding all the occurrences of a pattern in a text. We propose a very fast new family of string matching algorithms based on hashing q-grams. The new algorithms are the fastest in many cases, in particular on small alphabets.
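A Horspool/Wu-Manber-style single-pattern search that shifts on the q-gram at the end of the current window gives the flavour of this family; the sketch below is illustrative only and differs from the paper's algorithms in the hash function and other details.

def qgram_search(text, pattern, q=3):
    # Shift table: the q-gram ending at pattern position i (i < m-1)
    # permits a shift of m - 1 - i; unseen q-grams permit m - q + 1.
    n, m = len(text), len(pattern)
    assert m >= q
    default = m - q + 1
    shift = {}
    for i in range(q - 1, m - 1):
        shift[pattern[i - q + 1:i + 1]] = m - 1 - i
    occ, pos = [], m - 1              # pos indexes the window's last character
    while pos < n:
        gram = text[pos - q + 1:pos + 1]
        if gram == pattern[m - q:]:   # last q-grams agree: verify the window
            start = pos - m + 1
            if text[start:pos + 1] == pattern:
                occ.append(start)
        pos += shift.get(gram, default)
    return occ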

13.
M. Farach, M. Thorup. Algorithmica, 1998, 20(4): 388-404
String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel—Ziv (LZ77) algorithm. Intuitively, once a string has been compressed—and therefore its repetitive nature has been elucidated—one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per-bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per-bit entropy, while it was recently shown that LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log^2(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare this with the naive "decompression" algorithm, which takes time Θ(u + p) to decide if P occurs in T. Writing u + p as n(u/n) + p, we see that we have improved the complexity, replacing the compression factor u/n by a factor log^2(u/n). Our algorithm is competitive in the sense that O(n log^2(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log^2(u/n) + p) = o(u + p) if n = o(u) and p = o(u). Received December 20, 1995; revised October 29, 1996, and February 6, 1997.

14.
G. De V. Smit. Software, 1982, 12(1): 57-66
Three string matching algorithms—straightforward, Knuth-Morris-Pratt and Boyer-Moore—are examined and their time complexities discussed. A comparison of their actual average behaviour is made, based on the empirical data presented. It is shown that the Boyer-Moore algorithm is extremely efficient in most cases and that, contrary to the impression one might get from the analytical results, the Knuth-Morris-Pratt algorithm is not significantly better on average than the straightforward algorithm.
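The kind of empirical comparison reported here can be reproduced by instrumenting the algorithms with comparison counters; the sketch below does so for the straightforward and Knuth-Morris-Pratt searches (Boyer-Moore is omitted for brevity). It is a minimal illustration, not the author's test harness.

def naive_count(text, pattern):
    # Straightforward search; returns (occurrences, character comparisons).
    n, m = len(text), len(pattern)
    occ, comps = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comps += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            occ.append(i)
    return occ, comps

def kmp_count(text, pattern):
    # Knuth-Morris-Pratt with the same comparison counter.
    m = len(pattern)
    fail, k = [0] * m, 0              # failure (border) function
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    occ, comps, k = [], 0, 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            comps += 1
            k = fail[k - 1]
        comps += 1
        if c == pattern[k]:
            k += 1
        if k == m:
            occ.append(i - m + 1)
            k = fail[k - 1]
    return occ, comps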

15.
Experimental results are given for the application of a new n-gram algorithm to substring searching in DNA strings. The results confirm theoretical predictions of expected running times based on the assumption that the data are drawn from a stationary ergodic source. They also confirm that the algorithms tested are the most efficient known for searches involving larger patterns.

16.
With the help of a general technique for simulating deterministic finite two-way multi-head automata by automata with blind heads, we show O(n^2/log n) to be an upper time bound on string matching. This result is tight by a previously known lower bound.

17.
Given q + 1 strings (a text t of length n and q patterns m1,…,mq) and a natural number w, the multiple serial episode matching problem consists of finding the number of size-w windows of text t which contain the patterns m1,…,mq as subsequences, i.e., for each mi, if mi = p1,…,pk, the letters p1,…,pk occur in the window, in the same order as in mi, but not necessarily consecutively (they may be interleaved with other letters). Our main contribution here is an algorithm solving this problem on-line in time O(nq) in an MP-RAM model (which is a RAM model equipped with extra operations).
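The problem statement can be pinned down by checking each window directly, as in the sketch below; this brute force runs in roughly O(n·w·q) time and is meant only to clarify the definition, unlike the paper's O(nq) on-line MP-RAM algorithm.

def count_matching_windows(text, patterns, w):
    # Count the size-w windows of text containing every pattern as a
    # subsequence (letters in order, possibly interleaved with others).
    def is_subsequence(p, s):
        it = iter(s)
        return all(ch in it for ch in p)   # 'in' advances the iterator

    count = 0
    for start in range(len(text) - w + 1):
        window = text[start:start + w]
        if all(is_subsequence(p, window) for p in patterns):
            count += 1
    return count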

18.
G. Davies, S. Bowsher. Software, 1986, 16(6): 575-601
This paper describes four algorithms of varying complexity used for pattern matching and investigates their behaviour. The algorithms are tested using patterns of varying length from several alphabets. It is concluded that although there is no overall ‘best’ algorithm, the more complex algorithms are worth considering, as they are generally more efficient in terms of the number of comparisons made and execution time.

19.
An aggressive algorithm for multiple string matching
A new algorithm based on the Wu-Manber algorithm for multiple string matching is presented in this paper. The algorithm eliminates the functional overlap of the HASH and SHIFT tables, and computes the shift distances in an aggressive manner. After each test, the algorithm examines the character next to the scan window to maximize the shift distance. This idea is consistent with that of the quick-search (QS) algorithm. Experimental results on four alphabets show that the new algorithm is more efficient than Wu-Manber and other recent algorithms, particularly on short pattern sets and large alphabets.
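The shift idea borrowed from quick search (examining the character just past the window) is easiest to see in the single-pattern setting, sketched below as Sunday's quick-search algorithm; the paper applies this idea to multiple patterns on top of Wu-Manber, which is not reproduced here.

def quick_search(text, pattern):
    # Sunday's quick search: after testing a window, the character
    # immediately to the right of the window determines the shift.
    n, m = len(text), len(pattern)
    qs = {c: m - i for i, c in enumerate(pattern)}   # rightmost occurrence wins
    occ, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            occ.append(i)
        if i + m >= n:
            break
        i += qs.get(text[i + m], m + 1)
    return occ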

20.
