首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
R. Nigel Horspool 《Software》1980,10(6):501-506
The problem of searching through text to find a specified substring is considered in a practical setting. It is discovered that a method developed by Boyer and Moore can outperform even special-purpose search instructions that may be built into the computer hardware. For very short substrings however, these special purpose instructions are fastest—provided that they are used in an optimal way.  相似文献   

2.
Justin Zobel  Philip Dart 《Software》1995,25(3):331-345
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.  相似文献   

3.
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((nlogkn)/B) disk pages and finds all k-error matches with O((|P|+occ)/B+logknloglogBn) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω(|P|+occ+poly(logn)) I/Os. The second index reduces the space to O((nlogn)/B) disk pages, and the I/O complexity is O((|P|+occ)/B+logk(k+1)nloglogn).  相似文献   

4.
Applications of approximate string matching to 2D shape recognition   总被引:7,自引:0,他引:7  
H Bunke  U Bü  hler 《Pattern recognition》1993,26(12):1797-1812
A new method for the recognition of arbitrary two-dimensional (2D) shapes is described. It is based on string edit distance computation. The recognition method is invariant under translation, rotation, scaling and partial occlusion. A set of experiments are described demonstrating the robustness and reliability of the proposed approach.  相似文献   

5.
字符串近似匹配在网络安全中有广泛的应用。本文从中文字符串相似度角度出发,提出了通过单个汉字的细分来提高字符相似度的想法,并从汉字"成簇性"方面进行分析,引出了汉字的Key表示方法,将汉字与Key的映射关系归结为规则,讨论了规则的获取方法。设计了基于规则的中文字符串近似匹配的框架,提出了新的相似度计算模型,并通过实验对整个流程加以验证,证明基于规则的中文字符串近似匹配的优越性。  相似文献   

6.
Multiple filtration and approximate pattern matching   总被引:5,自引:0,他引:5  
Given a text of lengthn and a query of lengthq, we present an algorithm for finding all locations ofm-tuples in the text and in the query that differ by at mostk mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the caseq=m the problem coincides with the classicalapproximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similarm-tuples. The second stage compares thesem-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.This research was supported in part by the National Science Foundation under Grant No. DMS 90-05833 and the National Institute of Health under Grant No. GM-36230.  相似文献   

7.
Many database applications have the emerging need to support approximate queries that ask for strings that are similar to a given string, such as “name similar to smith” and “telephone number similar to 412-0964”. Query optimization needs the selectivity of such an approximate predicate, i.e., the fraction of records in the database that satisfy the condition. In this paper, we study the problem of estimating selectivities of approximate string predicates. We develop a novel technique, called Sepia, to solve the problem. Given a bag of strings, our technique groups the strings into clusters, builds a histogram structure for each cluster, and constructs a global histogram. It is based on the following intuition: given a query string q, a preselected string p in a cluster, and a string s in the cluster, based on the proximity between q and p, and the proximity between p and s, we can obtain a probability distribution from a global histogram about the similarity between q and s. We give a full specification of the technique using the edit distance metric. We study challenges in adopting this technique, including how to construct the histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. Our extensive experiments on real data sets show that this technique can accurately estimate selectivities of approximate string predicates. A short version of this article appeared as [21] in the proceedings of the 31st International Conference on Very Large Data Bases (VLDB), August 30 – September 2, 2005, Trondheim, Norway. The source code of our algorithms is available at .  相似文献   

8.
The discretely-scaled string indexing problem asks one to preprocess a given text string T, so that for a queried pattern P, the matched positions in T that P appears with some discrete scale can be reported efficiently. For solving this problem, Wang et al. first show that with an O(|T|log|T|)-time preprocessing on T, all matched positions can be reported in O(|P|+Ud) time, where |T|, |P|, and Ud denote the length of T, the length of P, and the number of matched positions for discretely-scaled P in T, respectively. In this paper, for fixed alphabets we propose the first-known optimal indexing algorithm that takes O(|T|) and O(|P|+Ud) time in its preprocessing and query phases, respectively. For integer and unbounded alphabets, our new algorithm can also be extended to obtain the best-known results.  相似文献   

9.
Approximate pattern matching algorithms have become an important tool in computer assisted music analysis and information retrieval. The number of different problem formulations has greatly expanded in recent years, not least because of the subjective nature of measuring musical similarity. From an algorithmic perspective, the complexity of each problem depends crucially on the exact definition of the difference between two strings. We present an overview of advances in approximate string matching in this field focusing on new measures of approximation.  相似文献   

10.
Related problems of scaled matching and indexing, which aim to determine all positions in a text T that a pattern P occurs in its scaled form, are considered important because of their applications to computer vision. However, previous results only focus on enlarged patterns, and do not allow shrunk patterns since they may disappear. In this paper, we give the definition and an efficient indexing algorithm for proportionally-scaled patterns that can be visually enlarged or shrunk. The proposed indexing algorithm takes O(|T|) time in its preprocessing phase, and achieves time in its answering phase, where |T|, |P|, Up, and m denote the length of T, the length of P, the number of reported positions, and the length of T under run-length representation, respectively.  相似文献   

11.
在不同关键词规模、最短关键词长度和字符集大小等情况下,有效的多串匹配算法是不同的。新提出的自适应多串匹配算法(Adapted Multiple Strings Matching Algorithm,AMSM)改善了SBOM算法中Oracle树存在不精确跳跃计算的缺点,同时采用了WuManber算法的块跳跃策略和压缩形式的Oracle树比较策略,提高了算法的性能,可适用于各种情况,是一种通用多串(多模式)匹配算法。  相似文献   

12.
We propose DMP-tree, a dynamic M-way prefix tree, data structure for the string matching problem in general and prefix matching in particular. DMP-tree has been initially devised for fast and efficiently handling prefix matching which constitutes the building block of some applications in the computer realm and related area. It is assumed there are strings of an alphabet Σ which are ordered. The data strings can have different lengths and some of them can be prefixes of others. Two well known applications of prefix matching are layers 3 and 4 switching in TCP/IP protocols. In layer 3 switching, routers forward an IP packet by checking its destination address and finding the longest matching prefix from a database. In layer 4 switching, the source and destination addresses are used to classify packets for differentiated service and Quality of Services (QoS). DMP-tree is a superset of B-tree. When none of the data strings are a prefix of each other, DMP-tree is the same as B-tree. In DMP-tree, no data string can be in a higher level than another data string which is its prefix. This requires a special procedure for node splitting. Indeed, node splitting differentiates DMP-tree from B-tree. The proposed data structure is simple, well defined, easy to implement in hardware or software and efficient in terms of memory usages and search time compared to other data structures proposed for prefix matching. We have implemented DMP-tree and the experimental results for simulated IP prefixes from the {0, 1} character set show an average search time of for a large number of N, number of data elements, when the internal node branching factor M is big enough (?5).  相似文献   

13.
Let X and Y be two run-length encoded strings, of encoded lengths k and l, respectively. We present a simple O(|X|l+|Y|k) time algorithm that computes their edit distance.  相似文献   

14.
We propose two sparsity pattern selection algorithms for factored approximate inverse preconditioners to solve general sparse matrices. The sparsity pattern is adaptively updated in the construction phase by using combined information of the inverse and original triangular factors of the original matrix. In order to determine the sparsity pattern, our first algorithm uses the norm of the inverse factors multiplied by the largest absolute value of the original factors, and the second employs the norm of the inverse factors divided by the norm of the original factors. Experimental results show that these algorithms improve the robustness of the preconditioners to solve general sparse matrices.  相似文献   

15.
16.
17.
18.
A subquadratic algorithm for approximate limited expression matching   总被引:1,自引:0,他引:1  
In this paper we present an efficient subquadratic-time algorithm for matching strings and limited expressions in large texts. Limited expressions are a subset of regular expressions that appear often in practice. The generalization from simple strings to limited expressions has a negligible affect on the speed of our algorithm, yet allows much more flexibility. Our algorithm is similar in spirit to that of Masek and Paterson [MP], but it is much faster in practice. Our experiments show a factor of four to five speedup against the algorithms of Sellers [Se] and Ukkonen [Uk1] independent of the sizes of the input strings. Experiments also reveal our algorithm to be faster, in most cases, than a recent improvement by Chang and Lampe [CL2], especially for small alphabet sizes for which it is two to three times faster.The research of U. Manber was supported in part by a Presidential Young Investigator Award DCR-8451397, with matching funds from AT&T, and by NSF Grant CCR-9001619. G. Myers research was supported in part by NIH Grant LM04960, NSF Grant CCR-9001619, and the Aspen Center for Physics.  相似文献   

19.
The increasing popularity of graph data in various domains has lead to a renewed interest in developing efficient graph matching techniques, especially for processing large graphs. In this paper, we study the problem of approximate graph matching in a large attributed graph. Given a large attributed graph and a query graph, we compute a subgraph of the large graph that best matches the query graph. We propose a novel structure-aware and attribute-aware index to process approximate graph matching in a large attributed graph. We first construct an index on the similarity of the attributed graph, by partitioning the large search space into smaller subgraphs based on structure similarity and attribute similarity. Then, we construct a connectivity-based index to give a concise representation of inter-partition connections. We use the index to find a set of best matching paths. From these best matching paths, we compute the best matching answer graph using a greedy algorithm. Experimental results on real datasets demonstrate the efficiency of both index construction and query processing. We also show that our approach attains high-quality query answers.  相似文献   

20.
A stringw isprimitive if it is not a power of another string (i.e., writingw =v k impliesk = 1. Conversely,w is asquare ifw =vv, withv a primitive string. A stringx issquare-free if it has no nonempty substring of the formww. It is shown that the square-freedom of a string ofn symbols over an arbitrary alphabet can be tested by a CRCW PRAM withn processors inO(logn) time and linear auxiliary space. If the cardinality of the input alphabet is bounded by a constant independent of the input size, then the number of processors can be reduced ton/logn without affecting the time complexity of this strategy. The fastest sequential algorithms solve this problemO(n logn) orO(n) time, depending on whether the cardinality of the input alphabet is unbounded or bounded, and either performance is known to be optimal within its class. More elaborate constructions lead to a CRCW PRAM algorithm for detecting, within the samen-processors bounds, all positioned squares inx in timeO(logn) and using linear auxiliary space. The fastest sequential algorithms solve this problem inO(n logn) time, and such a performance is known to be optimal.This research was supported, through the Leonardo Fibonacci Institute, by the Istituto Trentino di Cultura, Trento, Italy. Additional support was provided by the French and Italian Ministries of Education, by the National Research Council of Italy, by the British Research Council Grant SERC-E76797, by NSF Grant CCR-89-00305, by NIH Library of Medicine Grant ROI LM05118, by AFOSR Grant 90-0107, and by NATO Grant CRG900293.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号