Similar Documents
20 similar documents found (search time: 15 ms).
1.
Efficient string matching with wildcards and length constraints
This paper defines a challenging pattern matching problem between a pattern P and a text T, with wildcards and length constraints, and designs an efficient algorithm that returns each pattern occurrence in an online manner. In this problem, the user can specify constraints on the number of wildcards between every two consecutive letters of P and on the length of each matching substring in T. We design a complete algorithm, SAIL, that returns each matching substring of P in T as soon as it appears in T, in O(n + klmg) time with an O(lm) space overhead, where n is the length of T, k is the frequency of P's last letter occurring in T, l is the user-specified maximum length for each matching substring, m is the length of P, and g is the maximum difference between the user-specified maximum and minimum numbers of wildcards allowed between two consecutive letters in P. SAIL stands for string matching with wildcards and length constraints.
Gong Chen received the B.Eng. degree from the Beijing University of Technology, China, and the M.Sc. degree from the University of Vermont, USA, both in computer science. He is currently a graduate student in the Department of Statistics at the University of California, Los Angeles, USA. His research interests include data mining, statistical learning, machine learning, algorithm analysis and design, and database management.
Xindong Wu is a professor and the chair of the Department of Computer Science at the University of Vermont. He holds a Ph.D. in Artificial Intelligence from the University of Edinburgh, Britain. His research interests include data mining, knowledge-based systems, and Web information exploration. He has published extensively in these areas in various journals and conferences, including IEEE TKDE, TPAMI, ACM TOIS, IJCAI, AAAI, ICML, KDD, ICDM and WWW, as well as 12 books and conference proceedings. Dr. Wu is the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (by the IEEE Computer Society), the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining (ICDM), an Honorary Editor-in-Chief of Knowledge and Information Systems (by Springer), and a Series Editor of the Springer Book Series on Advanced Information and Knowledge Processing (AI&KP). He is the 2004 ACM SIGKDD Service Award winner.
Xingquan Zhu received his Ph.D. degree in Computer Science from Fudan University, Shanghai, China, in 2001. He spent 4 months with Microsoft Research Asia, Beijing, China, where he worked on content-based image retrieval with relevance feedback. From 2001 to 2002, he was a postdoctoral associate in the Department of Computer Science at Purdue University, West Lafayette, IN. He is currently a research assistant professor in the Department of Computer Science, the University of Vermont, Burlington, VT. His research interests include data mining, machine learning, data quality, multimedia computing, and information retrieval. Since 2000, Dr. Zhu has published extensively, including over 50 refereed papers in various journals and conference proceedings.
Abdullah N. Arslan got his Ph.D. degree in Computer Science in 2002 from the University of California at Santa Barbara. Upon his graduation he joined the Department of Computer Science at the University of Vermont as an assistant professor, and he has been with the computer science faculty there since then. Dr. Arslan's main research interests are algorithms on strings, computational biology, and bioinformatics. Dr. Arslan earned his Master's degree in Computer Science in 1996 from the University of North Texas, Denton, Texas, and his Bachelor's degree in Computer Engineering in 1990 from the Middle East Technical University, Ankara, Turkey. He worked as a programmer for the Central Bank of Turkey between 1991 and 1994.
Yu He received her B.E. degree in Information Engineering from Zhejiang University, China, in 2001. She is currently a graduate student in the Department of Computer Science at the University of Vermont. Her research interests include data mining, bioinformatics and pattern recognition.
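To make the problem definition concrete, the sketch below enumerates matches of a pattern whose consecutive letters may be separated by a bounded number of wildcards, with an overall length cap on each matching substring. It is a brute-force illustration of the problem statement only, not the SAIL algorithm; the function name gap_matches and the parameters gaps and max_len are invented for this sketch.

def gap_matches(text, pattern, gaps, max_len):
    # Brute-force matching with wildcards and length constraints (NOT SAIL):
    # gaps[i] = (lo, hi) bounds the number of wildcards allowed between
    # pattern[i] and pattern[i+1]; max_len caps each matching substring.
    results = []

    def extend(start, pos, i):
        # pos is the text index matched to pattern[i]
        if i == len(pattern) - 1:
            if pos - start + 1 <= max_len:
                results.append((start, pos))
            return
        lo, hi = gaps[i]
        for g in range(lo, hi + 1):
            nxt = pos + 1 + g
            if nxt >= len(text) or nxt - start + 1 > max_len:
                break
            if text[nxt] == pattern[i + 1]:
                extend(start, nxt, i + 1)

    for s, c in enumerate(text):
        if c == pattern[0]:
            extend(s, s, 0)
    return results

For example, gap_matches("axbyyb", "ab", [(0, 2)], 4) returns [(0, 2)]: the 'a' at position 0 matches the 'b' at position 2 with one wildcard in between, within the length cap.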

2.
The problem of pattern matching with wildcards is to find all the occurrences of a pattern of length m in a text of length n over a finite alphabet Σ (both the text and the pattern are allowed to contain wildcards). Based on the prime number encoding scheme (Chaim Linhart, Ron Shamir, Faster pattern matching with character classes using prime number encoding, J. Comput. Syst. Sci. 75 (3) (2009) 155-162), we present a new integer encoding and an efficient fast Fourier transform (FFT) based algorithm for this problem. The algorithm searches for the pattern in the text by computing a single convolution. For matching with wildcards, our encoding uses fewer prime numbers and shorter code words than the prime number encoding. We use at most 2 lg|Σ| prime numbers to encode the symbols, while the prime number encoding requires |Σ| prime numbers; this number reduces to 1.5 lg|Σ| when |Σ| > 40. The code word used in the algorithm is at most 2⌊lg|Σ|⌋⌈lg(5m)⌉ bits, shorter than that of the prime number encoding. We also show that the length of the code words can be further reduced by increasing the number of convolutions computed.

3.
Given a text T[1…n] and a pattern P[1…m] over some alphabet Σ of size σ, we want to find all the (exact) occurrences of P in T. The well-known shift-or algorithm solves this problem in time O(n⌈m/w⌉), where w is the number of bits in a machine word, using bit-parallelism. We show how to extend the bit-parallelism in another direction, using super-alphabets. This gives a speed-up by a factor of s, where s is the number of characters processed simultaneously. The algorithm is implemented, and we show that it works well in practice too. The result is the fastest known algorithm for exact string matching for short patterns and small alphabets.
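As background, a minimal single-character shift-or implementation is sketched below, using a Python integer as the bit vector and assuming the pattern fits in one machine word; the super-alphabet speed-up described above (processing s characters per step) is not shown.

def shift_or(text, pattern):
    # Classic shift-or: state bit i is 0 iff pattern[0..i] matches the
    # text suffix ending at the current position.
    m = len(pattern)
    full = (1 << m) - 1
    mask = {}                              # mask[c]: bit i cleared iff pattern[i] == c
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, full) & ~(1 << i)
    state, occ = full, []
    for j, c in enumerate(text):
        state = ((state << 1) & full) | mask.get(c, full)
        if not state & (1 << (m - 1)):
            occ.append(j - m + 1)          # occurrence ends at position j
    return occ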

4.
We present improved variations of the BNDM algorithm for exact string matching. At each alignment our bit-parallel algorithms process a q-gram before testing the state variable. In addition, we read a 2-gram in one instruction. Our focus is the practical efficiency of the algorithms. Our experiments show that the new variations are faster than earlier algorithms in many cases.
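For reference, plain BNDM (without the q-gram and 2-gram refinements added in the paper) can be sketched as below; this is an illustrative reimplementation, not the authors' code, and it assumes the pattern length does not exceed the machine word size.

def bndm(text, pattern):
    # Backward Nondeterministic DAWG Matching: scan each window right to
    # left with a bit-parallel NFA; 'last' records where the longest proper
    # pattern prefix was seen, which determines the next shift.
    n, m = len(text), len(pattern)
    full = (1 << m) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    occ, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):
                if j > 0:
                    last = j          # a proper pattern prefix ends here
                else:
                    occ.append(pos)   # the whole pattern was recognized
                    break
            D = (D << 1) & full
        pos += last
    return occ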

5.
We present a simple and faster solution to the problem of matching a set of patterns with variable length don't cares. Given an alphabet Σ, a pattern p is a word p1@p2@⋯@pm, where each pi is a string over Σ called a keyword and @ ∉ Σ is a symbol called a variable length don't care (VLDC) symbol. Pattern p matches a text t if t = u0 p1 u1 ⋯ um−1 pm um for some u0,…,um ∈ Σ*. The problem addressed in this paper is: given a set of patterns P and a text t, determine whether one of the patterns of P matches t. Kucherov and Rusinowitch (1997) [9] presented an algorithm that solves the problem in time O((|t|+|P|) log |P|), where |P| is the total length of keywords in every pattern of P. We give a new algorithm based on the Aho-Corasick automaton. It uses the solution of the Dynamic Marked Ancestor Problem of Chan et al. (2007) [5]. The algorithm takes O((|t|+‖P‖) log κ / log log κ) time, where ‖P‖ is the total number of keywords in every pattern of P, and κ is the number of distinct keywords in P. The algorithm is faster and simpler than the previous approach.
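For a single pattern, the matching relation itself can be checked greedily, as in the sketch below: take the leftmost occurrence of each keyword after the end of the previous one. This only illustrates the definition; it is not the paper's algorithm, which handles a whole set of patterns with an Aho-Corasick automaton and dynamic marked-ancestor queries.

def matches_vldc(text, keywords):
    # Check whether the VLDC pattern k1 @ k2 @ ... @ km matches text.
    # If any match exists, the greedy leftmost choice also finds one.
    pos = 0
    for kw in keywords:
        pos = text.find(kw, pos)
        if pos == -1:
            return False
        pos += len(kw)                # keyword occurrences must not overlap
    return True

For example, matches_vldc("xxabyycdzz", ["ab", "cd"]) returns True, since "ab" and "cd" occur in order, separated by arbitrary text.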

6.
The paper shows how to compute exactly the expectations, standard deviations, and cumulative probabilities of the searching times of string-matching algorithms based on the use of automata. This is derived from a methodology based on viewing the underlying Markov chains as exponential families and applying recent results on them.

7.
Light-based string matching
String matching is a very important problem in computer science. The problem consists of finding all the occurrences of a pattern P of length m in a text T of length n. We describe a special device which can do string matching by performing n − m + 1 text-to-pattern comparisons. The proposed device uses light and optical filters for performing computations. Two physical implementations are proposed: one uses colored glass and the other uses polarizing filters. The strengths and weaknesses of each method are discussed in depth.

8.
This paper presents simple and deterministic algorithms for partial point set pattern matching in 2D. Given a set P of n points, called the sample set, and a query set Q of k points (n ≥ k), the problem is to find a matching of Q with a subset of P under rigid motion. The match may be of two types: exact and approximate. If an exact matching exists, then each point in Q coincides with the corresponding point in P under some translation and/or rotation. For an approximate match, some translation and/or rotation may be allowed such that each point in Q lies in a predefined ε-neighborhood region around some point in P. The proposed algorithm for the exact matching needs O(n^2) space and preprocessing time. The existence of a match for a given query set Q can then be checked in worst-case time depending on α, where α is the maximum number of equidistant pairs of points in P; for a set of n points, α may be O(n^{4/3}) in the worst case. Some applications of partial point set pattern matching are then illustrated. Experimental results on random point sets and some fingerprint databases show that, in practice, the computation time is much smaller than the worst-case requirement. The algorithm is then extended to check the exact match of a set of k line segments in the query set with a k-subset of the n line segments in the sample set under rigid motion. Next, a simple version of the approximate matching problem is studied where one point of Q exactly matches a point of P, and each of the other points of Q lies in the ε-neighborhood of some point of P. The space complexity of the proposed algorithm is O(n) in the worst case. The proposed algorithms will find many applications in fingerprint matching, image registration, and object recognition.

9.
S2S: structural-to-syntactic matching similar documents
Management of large collections of replicated data in centralized or distributed environments is important for many systems that provide data mining, mirroring, storage, and content distribution. In its simplest form, documents are generated, duplicated and updated through emails and web pages. Although redundancy may increase reliability to a degree, uncontrolled redundancy aggravates retrieval performance and might be useless if the returned documents are obsolete. Document similarity matching algorithms do not provide information on the differences between documents, and file synchronization algorithms are usually inefficient and ignore the structural and syntactic organization of documents. In this paper, we propose the S2S matching approach, which is composed of structural and syntactic phases to compare documents. Firstly, in the structural phase, documents are decomposed into components by their syntax and compared at a coarse level. The structural mapping processes the decomposed documents based on their syntax without actually mapping at the word level, and it can be applied hierarchically based on the structural organization of a document. Secondly, the syntactic matching algorithm uses a heuristic look-ahead algorithm for matching consecutive tokens with a verification patch. Our two-phase S2S matching approach provides faster results than currently available string matching algorithms.

10.
Multiple filtration and approximate pattern matching
Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and by the optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching. This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institutes of Health under Grant No. GM-36230.
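A simpler, single-level pigeonhole filter conveys the filter-then-verify idea behind such methods: split the pattern into k + 1 disjoint blocks, locate exact block occurrences, and verify the candidate positions by Hamming distance. This is not the paper's multiple-filtration scheme, only a minimal sketch of the general approach; the function name and parameters are invented here.

def k_mismatch_filter(text, pattern, k):
    # Filtration: any occurrence with at most k mismatches contains at
    # least one of the k+1 disjoint pattern blocks exactly (pigeonhole).
    # Verification: check the Hamming distance at each candidate start.
    n, m = len(text), len(pattern)
    assert m >= k + 1
    b = m // (k + 1)                  # block length
    candidates = set()
    for j in range(k + 1):
        block = pattern[j * b:(j + 1) * b]
        start = text.find(block)
        while start != -1:
            cand = start - j * b      # implied start of a full occurrence
            if 0 <= cand <= n - m:
                candidates.add(cand)
            start = text.find(block, start + 1)
    return sorted(c for c in candidates
                  if sum(x != y for x, y in zip(text[c:c + m], pattern)) <= k)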

11.
Approximate pattern matching algorithms have become an important tool in computer-assisted music analysis and information retrieval. The number of different problem formulations has greatly expanded in recent years, not least because of the subjective nature of measuring musical similarity. From an algorithmic perspective, the complexity of each problem depends crucially on the exact definition of the difference between two strings. We present an overview of advances in approximate string matching in this field, focusing on new measures of approximation.

12.
String matching is the problem of finding all the occurrences of a pattern in a text. We propose a very fast new family of string matching algorithms based on hashing q-grams. The new algorithms are the fastest in many cases, in particular on small alphabets.
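A Horspool/Wu-Manber-style single-pattern search that shifts on the q-gram at the end of the current window gives the flavour of this family; the sketch below is illustrative only and differs from the paper's algorithms in the hash function and other details.

def qgram_search(text, pattern, q=3):
    # Shift table: the q-gram ending at pattern position i (i < m-1)
    # permits a shift of m - 1 - i; unseen q-grams permit m - q + 1.
    n, m = len(text), len(pattern)
    assert m >= q
    default = m - q + 1
    shift = {}
    for i in range(q - 1, m - 1):
        shift[pattern[i - q + 1:i + 1]] = m - 1 - i
    occ, pos = [], m - 1              # pos indexes the window's last character
    while pos < n:
        gram = text[pos - q + 1:pos + 1]
        if gram == pattern[m - q:]:   # last q-grams agree: verify the window
            start = pos - m + 1
            if text[start:pos + 1] == pattern:
                occ.append(start)
        pos += shift.get(gram, default)
    return occ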

13.
M. Farach, M. Thorup. Algorithmica, 1998, 20(4): 388-404
String matching and compression are two widely studied areas of computer science. The theory of string matching has a long association with compression algorithms. Data structures from string matching can be used to derive fast implementations of many important compression schemes, most notably the Lempel—Ziv (LZ77) algorithm. Intuitively, once a string has been compressed—and therefore its repetitive nature has been elucidated—one might be tempted to exploit this knowledge to speed up string matching. The Compressed Matching Problem is that of performing string matching in a compressed text, without uncompressing it. More formally, let T be a text, let Z be the compressed string representing T, and let P be a pattern. The Compressed Matching Problem is that of deciding if P occurs in T, given only P and Z. Compressed matching algorithms have been given for several compression schemes such as LZW. In this paper we give the first nontrivial compressed matching algorithm for the classic adaptive compression scheme, the LZ77 algorithm. In practice, the LZ77 algorithm is known to compress more than other dictionary compression schemes, such as LZ78 and LZW, though for strings with constant per-bit entropy, all these schemes compress optimally in the limit. However, for strings with o(1) per-bit entropy, while it was recently shown that LZ77 gives compression to within a constant factor of optimal, schemes such as LZ78 and LZW may deviate from optimality by an exponential factor. Asymptotically, compressed matching is only relevant if |Z| = o(|T|), i.e., if the compression ratio |T|/|Z| is more than a constant. These results show that LZ77 is the appropriate compression method in such settings. We present an LZ77 compressed matching algorithm which runs in time O(n log^2(u/n) + p), where n = |Z|, u = |T|, and p = |P|. Compare this with the naive "decompression" algorithm, which takes time Θ(u + p) to decide if P occurs in T. Writing u + p as n(u/n) + p, we see that we have improved the complexity, replacing the compression factor u/n by a factor log^2(u/n). Our algorithm is competitive in the sense that O(n log^2(u/n) + p) = O(u + p), and opportunistic in the sense that O(n log^2(u/n) + p) = o(u + p) if n = o(u) and p = o(u). Received December 20, 1995; revised October 29, 1996, and February 6, 1997.

14.
G. De V. Smit. Software, 1982, 12(1): 57-66
Three string matching algorithms—straightforward, Knuth-Morris-Pratt and Boyer-Moore—are examined and their time complexities discussed. A comparison of their actual average behaviour is made, based on the empirical data presented. It is shown that the Boyer-Moore algorithm is extremely efficient in most cases and that, contrary to the impression one might get from the analytical results, the Knuth-Morris-Pratt algorithm is not significantly better on average than the straightforward algorithm.
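The kind of empirical comparison reported here can be reproduced by instrumenting the algorithms with comparison counters; the sketch below does so for the straightforward and Knuth-Morris-Pratt searches (Boyer-Moore is omitted for brevity). It is a minimal illustration, not the author's test harness.

def naive_count(text, pattern):
    # Straightforward search; returns (occurrences, character comparisons).
    n, m = len(text), len(pattern)
    occ, comps = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comps += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            occ.append(i)
    return occ, comps

def kmp_count(text, pattern):
    # Knuth-Morris-Pratt with the same comparison counter.
    m = len(pattern)
    fail, k = [0] * m, 0              # failure (border) function
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    occ, comps, k = [], 0, 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            comps += 1
            k = fail[k - 1]
        comps += 1
        if c == pattern[k]:
            k += 1
        if k == m:
            occ.append(i - m + 1)
            k = fail[k - 1]
    return occ, comps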

15.
Experimental results are given for the application of a new n-gram algorithm to substring searching in DNA strings. The results confirm theoretical predictions of expected running times based on the assumption that the data are drawn from a stationary ergodic source. They also confirm that the algorithms tested are the most efficient known for searches involving larger patterns.

16.
With the help of a general technique for simulating deterministic finite two-way multi-head automata by automata with blind heads, we show O(n^2/log n) to be an upper time bound on string matching. This result is tight by a previously known lower bound.

17.
Given q + 1 strings (a text t of length n and q patterns m1,…,mq) and a natural number w, the multiple serial episode matching problem consists of finding the number of size-w windows of text t which contain the patterns m1,…,mq as subsequences, i.e., for each mi, if mi = p1,…,pk, the letters p1,…,pk occur in the window, in the same order as in mi, but not necessarily consecutively (they may be interleaved with other letters). Our main contribution here is an algorithm solving this problem on-line in time O(nq) in an MP-RAM model (which is a RAM model equipped with extra operations).
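The problem statement can be pinned down by checking each window directly, as in the sketch below; this brute force runs in roughly O(n·w·q) time and is meant only to clarify the definition, unlike the paper's O(nq) on-line MP-RAM algorithm.

def count_matching_windows(text, patterns, w):
    # Count the size-w windows of text containing every pattern as a
    # subsequence (letters in order, possibly interleaved with others).
    def is_subsequence(p, s):
        it = iter(s)
        return all(ch in it for ch in p)   # 'in' advances the iterator

    count = 0
    for start in range(len(text) - w + 1):
        window = text[start:start + w]
        if all(is_subsequence(p, window) for p in patterns):
            count += 1
    return count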

18.
G. Davies, S. Bowsher. Software, 1986, 16(6): 575-601
This paper describes four algorithms of varying complexity used for pattern matching and investigates their behaviour. The algorithms are tested using patterns of varying length from several alphabets. It is concluded that although there is no overall ‘best’ algorithm, the more complex algorithms are worth considering, as they are generally more efficient in terms of the number of comparisons made and execution time.

19.
An aggressive algorithm for multiple string matching
A new algorithm based on the Wu-Manber algorithm for multiple string matching is presented in this paper. The algorithm eliminates the functional overlap of the HASH and SHIFT tables, and computes the shift distances in an aggressive manner. After each test, the algorithm examines the character next to the scan window to maximize the shift distance. This idea is consistent with that of the quick-search (QS) algorithm. Experimental results on four alphabets show that the new algorithm is more efficient than Wu-Manber and other recent algorithms, particularly on short pattern sets and large alphabets.
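The shift idea borrowed from quick search (examining the character just past the window) is easiest to see in the single-pattern setting, sketched below as Sunday's quick-search algorithm; the paper applies this idea to multiple patterns on top of Wu-Manber, which is not reproduced here.

def quick_search(text, pattern):
    # Sunday's quick search: after testing a window, the character
    # immediately to the right of the window determines the shift.
    n, m = len(text), len(pattern)
    qs = {c: m - i for i, c in enumerate(pattern)}   # rightmost occurrence wins
    occ, i = [], 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            occ.append(i)
        if i + m >= n:
            break
        i += qs.get(text[i + m], m + 1)
    return occ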

20.
