Similar Documents
20 similar documents found.
1.
The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.
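As a concrete illustration of the front-coding technique the abstract mentions, here is a minimal sketch in Python (plain front coding over a sorted list; the paper's compressed, bucketed variants are not reproduced, and all identifiers are illustrative):

```python
def front_encode(sorted_words):
    """Front-code a sorted list: store (shared-prefix length, suffix) pairs."""
    encoded, prev = [], ""
    for w in sorted_words:
        # length of the prefix shared with the previous word
        lcp = 0
        while lcp < min(len(prev), len(w)) and prev[lcp] == w[lcp]:
            lcp += 1
        encoded.append((lcp, w[lcp:]))
        prev = w
    return encoded

def front_decode(encoded):
    words, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        words.append(prev)
    return words

words = sorted(["http://a.org", "http://a.org/x", "http://b.org"])
assert front_decode(front_encode(words)) == words
```

Because sorted dictionaries (URLs, URIs, terms) share long prefixes, the stored suffixes are short, which is exactly what the compression techniques above then exploit further.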

2.
With the rapid development of modern society, new terms appear constantly. In existing string-matching segmentation methods the dictionary is mostly fixed, so a newly coined word cannot be identified correctly. This paper therefore proposes a segmentation method with an incrementally enriched dictionary: strings that cannot be segmented correctly are recorded with their frequencies, and once a string's frequency reaches a threshold it is regarded as a new word and added to the dictionary, so the dictionary grows dynamically. Experiments show that the method improves segmentation precision while leaving segmentation speed unaffected.
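A minimal sketch of the idea, assuming forward maximum matching and treating repeated runs of out-of-dictionary characters as new-word candidates (the paper's exact counting and threshold scheme is not specified in the abstract, so these choices are assumptions):

```python
from collections import Counter

def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching; single chars not in the
    dictionary come out as unknown one-character fragments."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i+l] in dictionary or l == 1:
                tokens.append(text[i:i+l])
                i += l
                break
    return tokens

def segment_and_learn(texts, dictionary, threshold=3):
    """Count adjacent out-of-dictionary fragments; promote frequent ones."""
    unknown = Counter()
    for text in texts:
        run = ""
        for t in fmm_segment(text, dictionary) + [""]:
            if len(t) == 1 and t not in dictionary:
                run += t                 # extend the unknown run
            else:
                if len(run) > 1:
                    unknown[run] += 1    # record a new-word candidate
                run = ""
    for cand, freq in unknown.items():
        if freq >= threshold:
            dictionary.add(cand)         # the dictionary grows dynamically
    return dictionary
```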

3.
Storing and retrieving strings in main memory is a fundamental problem in computer science. The efficiency of string data structures used for this task is of paramount importance for applications such as in-memory databases, text-based search engines and dictionaries. The burst trie is a leading choice for such tasks, as it can provide fast sorted access to strings. The burst trie, however, uses linked lists as substructures, which can result in poor use of CPU cache and main memory. Previous research addressed this issue by replacing linked lists with dynamic arrays, forming a cache-conscious array burst trie. Though faster, this variant can incur high instruction costs which can hinder its efficiency. Thus, engineering a fast, compact, and scalable trie for strings remains an open problem. In this paper, we introduce a novel and practical solution that carefully combines a trie with a hash table, creating a variant of the burst trie called the HAT-trie. We provide a thorough experimental analysis demonstrating that for large sets of strings and on alternative computing architectures, the HAT-trie, along with two novel variants engineered for further space-efficiency, is currently the leading in-memory trie-based data structure, offering rapid, compact, and scalable storage and retrieval of variable-length strings.
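For orientation, a minimal Python sketch of the bursting mechanism shared by the burst trie and the HAT-trie: leaves are containers of suffixes that split into trie nodes when they grow too large. The HAT-trie's cache-conscious hash buckets and its space-engineered variants are beyond a short example; the bucket here is a plain set and all names are illustrative:

```python
BURST_LIMIT = 4

class Node:
    def __init__(self):
        self.children = {}   # char -> Node (internal structure)
        self.bucket = set()  # container of suffixes (a hash table in a HAT-trie)
        self.eow = False     # end-of-word marker for the empty suffix

def insert(node, s):
    if node.bucket is None:              # internal node: descend one character
        if not s:
            node.eow = True
            return
        child = node.children.setdefault(s[0], Node())
        insert(child, s[1:])
        return
    node.bucket.add(s)
    if len(node.bucket) > BURST_LIMIT:   # burst: redistribute the suffixes
        suffixes, node.bucket = node.bucket, None
        for suf in suffixes:
            insert(node, suf)

def contains(node, s):
    while node.bucket is None:
        if not s:
            return node.eow
        if s[0] not in node.children:
            return False
        node, s = node.children[s[0]], s[1:]
    return s in node.bucket
```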

4.
A new algorithm for string edit distance computation is given. The algorithm assumes that one of the two strings to be compared is a dictionary entry that is known a priori. This dictionary word is converted in an off-line phase into a deterministic finite-state automaton. Given an input string and the automaton derived from the dictionary word, the computation of the edit distance between the two strings corresponds to a traversal of the states of the automaton. This procedure needs time that is only linear in the length of the input string and is independent of the length of the dictionary word. Given not one but N different dictionary words, their corresponding automata can be combined into a single deterministic finite-state automaton. Thus the computation of the edit distance between the input word and each dictionary entry, and the determination of the nearest neighbor in the dictionary, need time that is only linear in the length of the input string. However, the number of states of the automaton is exponential.
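The off-line construction is not reproduced here, but a lazy equivalent is short: each automaton state is one dynamic-programming column for the dictionary word, and memoizing the transition function means that, once cached, each input character costs a single table lookup (the state space can be exponential, as the abstract notes). A sketch under those assumptions:

```python
from functools import lru_cache

def compile_word(w):
    """Lazily built DFA whose states are edit-distance DP columns for w."""
    start = tuple(range(len(w) + 1))          # column for the empty input

    @lru_cache(maxsize=None)
    def step(state, c):
        col = [state[0] + 1]
        for j in range(1, len(w) + 1):
            col.append(min(col[j - 1] + 1,                   # insertion
                           state[j] + 1,                     # deletion
                           state[j - 1] + (w[j - 1] != c)))  # substitution
        return tuple(col)

    return start, step

def edit_distance(word_automaton, s):
    state, step = word_automaton
    for c in s:
        state = step(state, c)    # one cached transition per input character
    return state[-1]

assert edit_distance(compile_word("kitten"), "sitting") == 3
```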

5.
Chinese word segmentation is a difficult and challenging task because Chinese has no white space to mark word boundaries, and its result largely depends on the quality of the segmentation dictionary. Many domain phrases are cut into single words because they are not contained in the general dictionary. This paper presents a Chinese domain-phrase identification algorithm based on atomic word formation. First, the atomic word formation algorithm extracts candidate strings from the corpus after preprocessing; these strings form the candidate domain-phrase set. Second, strategies such as repeated-substring screening, part-of-speech (POS) combination filtering, and prefix/suffix filtering are applied to filter the candidate domain phrases. Third, a refining step decides whether a string is a domain phrase by computing its domain relevance. Finally, all identified strings are sorted and exported to users. With the help of morphological rules, the method combines statistical information with rules instead of relying on corpus-based machine learning. Experiments show that it obtains better results than traditional n-gram methods.
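The paper's domain-relevance measure is not given in the abstract; the sketch below assumes a simple relative-frequency ratio between a domain corpus and a background corpus, one common way to realize such a refining step (threshold and smoothing values are illustrative):

```python
def domain_relevance(term, domain_freq, background_freq, smoothing=1e-9):
    """Relative-frequency ratio: how much more often a candidate string
    occurs in the domain corpus than in a general background corpus."""
    d = domain_freq.get(term, 0) / max(sum(domain_freq.values()), 1)
    b = background_freq.get(term, 0) / max(sum(background_freq.values()), 1)
    return d / (b + smoothing)

def refine(candidates, domain_freq, background_freq, threshold=5.0):
    scored = [(domain_relevance(t, domain_freq, background_freq), t)
              for t in candidates]
    scored.sort(reverse=True)                 # sort, then export to users
    return [t for score, t in scored if score >= threshold]
```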

6.
Associative networks are parallel pattern-processing structures, capable of handling disturbed patterns in a robust manner. In this paper an implementation of a fast and robust dictionary-lookup algorithm for misspelt words using this technique is described. To code the words to be processed in a way that is suitable for processing in a network, the “rubber trigram” code is introduced. The program is capable of retrieving words from a dictionary of at least 25 000 words, even if the key contains severe misspellings. The presented algorithm is suitable for interactive dictionary lookup, e.g. in a word-processing system, being more flexible than the use of conventional dictionaries and lookup methods.
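The "rubber trigram" code and the associative-network retrieval are specific to the paper; for comparison, here is a conventional trigram-overlap lookup that tolerates misspellings in the same spirit (padding scheme and scoring are assumptions):

```python
from collections import defaultdict

def trigrams(word):
    padded = f"##{word}#"          # pad so short words still yield trigrams
    return {padded[i:i+3] for i in range(len(padded) - 2)}

def build_index(dictionary):
    index = defaultdict(set)
    for w in dictionary:
        for t in trigrams(w):
            index[t].add(w)
    return index

def lookup(key, index):
    """Vote for every word sharing a trigram with the (possibly misspelt) key."""
    votes = defaultdict(int)
    for t in trigrams(key):
        for w in index[t]:
            votes[w] += 1
    return max(votes, key=votes.get) if votes else None

index = build_index({"dictionary", "retrieval", "network"})
assert lookup("dictionarry", index) == "dictionary"
```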

7.
The sparsity of a signal's decomposition determines the accuracy of compressed-sensing reconstruction. To address the limited sparsity achievable with a standard orthogonal basis, this paper proposes a compressed-sensing image decomposition and reconstruction method based on a mixed dictionary. A two-dimensional Gabor dictionary matched to image edges and texture is constructed, and the image is sparsely decomposed over the combination of a discrete cosine dictionary and this Gabor dictionary, yielding smooth, edge, and texture components. CS measurements are taken of the resulting sparse components, and the image is reconstructed by solving an optimization problem. Experimental results show that the constructed mixed dictionary represents images more sparsely, and at the same sampling rate the reconstruction quality is better than decomposition over a standard orthogonal basis.
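As a sketch of what one atom of such a dictionary looks like, the following generates unit-norm 2-D Gabor atoms (a Gaussian-windowed cosine grating); the paper's exact parameterization is not given in the abstract, so the parameters below are assumptions:

```python
import numpy as np

def gabor_atom(half, theta, freq, sigma):
    """One real-valued 2-D Gabor atom on a (2*half+1)^2 grid."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)
    return g / np.linalg.norm(g)                 # unit-norm dictionary atom

# a small dictionary: a few orientations and frequencies
atoms = [gabor_atom(8, t, f, 4.0)
         for t in np.linspace(0, np.pi, 4, endpoint=False)
         for f in (0.1, 0.25)]
```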

8.
Approximate string matching is an important operation in information systems because an input string is often an inexact match to the strings already stored. Commonly known accurate methods are computationally expensive as they compare the input string to every entry in the stored dictionary. This paper describes a two-stage process. The first uses a very compact n-gram table to preselect sets of roughly similar strings. The second stage compares these with the input string using an accurate method to give an accurately matched set of strings. A new similarity measure based on the Levenshtein metric is defined for this comparison. The resulting method is both computationally fast and storage-efficient.
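A minimal sketch of the two-stage process, assuming bigrams for preselection and the plain Levenshtein metric in place of the paper's modified similarity measure:

```python
from collections import defaultdict

def ngrams(s, n=2):
    return {s[i:i+n] for i in range(len(s) - n + 1)}

def build_ngram_table(dictionary):
    table = defaultdict(set)
    for w in dictionary:
        for g in ngrams(w):
            table[g].add(w)
    return table

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match(query, table, min_shared=2, k=3):
    # stage 1: preselect roughly similar strings via shared n-grams
    votes = defaultdict(int)
    for g in ngrams(query):
        for w in table[g]:
            votes[w] += 1
    candidates = [w for w, v in votes.items() if v >= min_shared]
    # stage 2: rank the small candidate set with the accurate metric
    return sorted(candidates, key=lambda w: levenshtein(query, w))[:k]

table = build_ngram_table({"approximate", "appropriate", "matching"})
assert match("aproximate", table)[0] == "approximate"
```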

9.
Computer recognition of machine-printed letters of the Tamil alphabet is described. Each character is represented as a binary matrix and encoded into a string using two different methods. The encoded strings form a dictionary. A given text is presented symbol by symbol; information from each symbol is extracted in the form of a string and compared with the strings in the dictionary. When there is agreement, the letters are recognized and printed out in Roman letters following a special method of transliteration. The lengthening of vowels and the hardening of consonants are indicated by numerals printed above each letter.

10.
Data Structures for IP Address Lookup and an Analysis of Their Performance
Many computer applications involve string processing, and designing a good data structure is essential for processing efficiency. Taking IP address lookup as the application background, this paper analyzes the structural properties, lookup performance, and usage of the trie and its variants, demonstrating the importance of the trie as a general-purpose data structure.
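A minimal sketch of the trie in this setting: a binary trie over address bits supporting longest-prefix match. Path-compressed and multibit variants, which a performance analysis would cover, are omitted, and all identifiers are illustrative:

```python
class BitTrieNode:
    def __init__(self):
        self.children = [None, None]   # one child per bit
        self.next_hop = None           # set on nodes that end a prefix

def insert(root, prefix_bits, next_hop):
    node = root
    for b in prefix_bits:
        if node.children[b] is None:
            node.children[b] = BitTrieNode()
        node = node.children[b]
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    node, best = root, None
    for b in addr_bits:
        if node.next_hop is not None:
            best = node.next_hop       # remember the longest prefix seen so far
        node = node.children[b]
        if node is None:
            return best
    return node.next_hop if node.next_hop is not None else best

def bits(ip, length=32):
    n = sum(int(o) << (24 - 8 * i) for i, o in enumerate(ip.split(".")))
    return [(n >> (length - 1 - i)) & 1 for i in range(length)]

root = BitTrieNode()
insert(root, bits("10.0.0.0")[:8], "A")      # 10.0.0.0/8
insert(root, bits("10.1.0.0")[:16], "B")     # 10.1.0.0/16
assert longest_prefix_match(root, bits("10.1.2.3")) == "B"
```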

11.
The Trie Data Structure and Its Applications
Many computer applications involve string processing, and designing a good data structure is essential for processing efficiency. This paper briefly analyzes several common string data structures and their performance, focuses on the structural properties of three forms of the trie, and finally illustrates practical use of the trie through its application to IP address lookup.
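For reference, the simplest of the trie forms: a pointer-based character trie with prefix enumeration (array-based and compressed forms are not shown here):

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for c in word:
            node = node.children.setdefault(c, Trie())
        node.is_word = True

    def words_with_prefix(self, prefix):
        node = self
        for c in prefix:                       # walk down to the prefix node
            if c not in node.children:
                return
            node = node.children[c]
        stack = [(node, prefix)]
        while stack:                           # enumerate the subtree
            node, s = stack.pop()
            if node.is_word:
                yield s
            for c, child in node.children.items():
                stack.append((child, s + c))

t = Trie()
for w in ("tea", "ten", "trie"):
    t.insert(w)
assert sorted(t.words_with_prefix("te")) == ["tea", "ten"]
```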

12.
A typical syntactic pattern recognition (PR) problem involves comparing a noisy string with every element of a dictionary, X. The problem of classification can be greatly simplified if the dictionary is partitioned into a set of subdictionaries. In this case, the classification can be hierarchical: the noisy string is first compared to a representative element of each subdictionary, and the closest match within that subdictionary is subsequently located. Indeed, the entire problem of subdividing a set of strings into subsets where each subset contains "similar" strings has been referred to as the "String Taxonomy Problem". To our knowledge there is no reported solution to this problem. In this paper we present a learning-automaton-based solution to string taxonomy. The solution utilizes the Object Migrating Automaton, whose power in clustering objects and images has been reported. The power of the scheme for string taxonomy is demonstrated using random strings and garbled versions of string representations of fragments of macromolecules.
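The taxonomy step itself (partitioning via the Object Migrating Automaton) is not reproduced here; the sketch below only shows the two-level classification that motivates it, assuming one representative string per subdictionary:

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hierarchical_classify(noisy, subdictionaries):
    """subdictionaries: {representative string: [member strings]}.
    Compare to each representative first, then search only its subset."""
    rep = min(subdictionaries, key=lambda r: levenshtein(noisy, r))
    return min(subdictionaries[rep], key=lambda m: levenshtein(noisy, m))

subdicts = {"cat": ["cat", "car", "can"], "dog": ["dog", "dot", "don"]}
assert hierarchical_classify("cot", subdicts) == "cat"
```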

13.
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution computes the edit distance between a pair of strings of total length O(N) in O(N^2) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N^2) edit-distance algorithms work, in some sense, under this paradigm. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings compressed under any compression scheme. A rephrasing of this question is whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed differently. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ family, run-length encoding, byte-pair encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit distance of these two strings under any rational scoring function, and an O(n^{2/3} N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a speedup for compressible strings, does not surpass the quadratic time bound even in the worst case.
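A straight-line program itself is easy to picture: a grammar in which every rule derives exactly one string. The sketch below shows only the representation; naively expanding and running the quadratic DP, as one would here, forfeits exactly the speedup the paper obtains by working on the grammars directly:

```python
# A straight-line program: each rule is either a terminal character or a
# pair of previously defined rule indices; rule i derives expand(slp, i).
slp = {
    0: "a", 1: "b",
    2: (0, 1),      # "ab"
    3: (2, 2),      # "abab"   -- repetition compresses well
    4: (3, 2),      # "ababab"
}

def expand(slp, i):
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(slp, left) + expand(slp, right)

assert expand(slp, 4) == "ababab"   # 5 rules derive a string of length 6
```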

14.
To address the inefficiency of traditional sparse-representation dictionary-learning methods for image classification in large-scale distributed environments, this paper designs an image-learning method based on a global sparse-representation dictionary. The traditional dictionary-learning steps are distributed across parallel nodes; convex optimization is used to learn local dictionaries on each node while the global dictionary is updated in real time, improving the efficiency of dictionary learning and of classifying large-scale data. Parallel experiments on the MapReduce platform show that the method significantly accelerates the classification of large-scale distributed data without reducing accuracy, making it applicable to a wide range of large-scale image-classification tasks.
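A toy sketch of one such map-reduce round, with a crude top-k thresholding coder and least-squares local updates standing in for the paper's convex-optimization steps (all numerical choices here are assumptions):

```python
import numpy as np

def sparse_codes(D, X, k=3):
    """Crude sparse coding: keep the k largest correlations per sample."""
    A = D.T @ X
    idx = np.argsort(-np.abs(A), axis=0)[k:]   # indices of the smallest entries
    np.put_along_axis(A, idx, 0.0, axis=0)
    return A

def local_update(D, X, k=3):
    """One node: fit a local dictionary to its data shard by least squares."""
    A = sparse_codes(D, X, k)
    return X @ np.linalg.pinv(A)

def global_round(D, shards, k=3):
    """Map: local updates per shard. Reduce: average and renormalize."""
    local_dicts = [local_update(D, X, k) for X in shards]
    D = np.mean(local_dicts, axis=0)
    return D / np.linalg.norm(D, axis=0, keepdims=True)

rng = np.random.default_rng(0)
D = rng.normal(size=(16, 8)); D /= np.linalg.norm(D, axis=0)
shards = [rng.normal(size=(16, 50)) for _ in range(4)]
for _ in range(5):
    D = global_round(D, shards)
```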

15.
Discovering Frequent Closed Partial Orders from Strings
Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of a series of products, the partial order can be used to predict other related customers' future purchases and develop marketing campaigns. Moreover, some biological sequences (e.g., microarray data) can be clustered based on the partial orders shared by the sequences. Given a set of items, a total order of a subset of items can be represented as a string, and a string database is a multiset of strings. In this paper, we identify a novel problem of mining frequent closed partial orders from strings. Frequent closed partial orders capture the nonredundant and interesting ordering information from string databases. Importantly, mining frequent closed partial orders can discover meaningful knowledge that cannot be disclosed by previous data mining techniques. However, the problem is challenging. To tackle it, we develop Frecpo (for frequent closed partial order), a practically efficient algorithm for mining the complete set of frequent closed partial orders from large string databases. Several interesting pruning techniques are devised to speed up the search. We report an extensive performance study on both real and synthetic data sets to illustrate the effectiveness and efficiency of our approach.
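Frecpo mines complete closed partial orders with dedicated pruning; the sketch below only extracts the frequent precedence pairs from which such orders are assembled, assuming each string is a total order over distinct items:

```python
from itertools import combinations
from collections import Counter

def frequent_precedence_pairs(strings, min_support):
    """Count ordered item pairs (a before b) across a string database;
    a frequent partial order must respect the pairs that clear the
    support threshold."""
    count = Counter()
    for s in strings:
        for a, b in combinations(s, 2):   # a occurs before b in s
            count[(a, b)] += 1
    return {p for p, c in count.items() if c >= min_support}

db = ["abcd", "acbd", "abd"]
# 'a' precedes 'd' in all three strings:
assert ("a", "d") in frequent_precedence_pairs(db, 3)
```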

16.
An approach to designing very fast algorithms for approximate string matching in a dictionary is proposed. Multiple spelling errors corresponding to insert, delete, change, and transpose operations on character strings are considered in the fault model. The design of very fast approximate string matching algorithms through a four-step reduction procedure is described. The final and most effective step uses hashing techniques to avoid comparing the given word with words at large distances. The technique has been applied to a library book catalog textbase. The experiments show that performing approximate string matching for a large dictionary in real time on an ordinary sequential computer under our multiple fault model is feasible.
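The paper's four-step reduction is not reproduced here; the sketch below shows one well-known way hashing avoids comparisons against distant words, bucketing each entry under its deletion variants (the idea later popularized by SymSpell, named here plainly as a stand-in, not as the paper's method):

```python
from collections import defaultdict

def deletions(word, max_d=1):
    """All variants of word with up to max_d characters deleted."""
    variants, frontier = {word}, {word}
    for _ in range(max_d):
        frontier = {w[:i] + w[i+1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_table(dictionary, max_d=1):
    table = defaultdict(set)
    for w in dictionary:
        for v in deletions(w, max_d):
            table[v].add(w)       # hash bucket keyed by deletion variant
    return table

def candidates(query, table, max_d=1):
    """Words within small edit distance share a deletion variant with the
    query, so only these buckets are examined, never the whole dictionary."""
    out = set()
    for v in deletions(query, max_d):
        out |= table[v]
    return out

table = build_table({"string", "matching", "dictionary"})
assert "string" in candidates("strng", table)   # one deletion away
```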

17.
Multi-pattern string matching is one of the core techniques in network content filtering systems. The storage size of the automaton and its cache behavior are the key factors affecting the speed of multi-pattern matching algorithms. As the pattern set grows, the huge storage overhead of the automaton sharply degrades the performance of existing string-matching algorithms. Starting from the idea of compressing storage to improve the cache hit rate, this paper proposes an optimization of the classic SBOM algorithm: the Factor Oracle structure in SBOM is replaced by a suffix tree, the suffix tree is pruned down to near-linear space complexity, and the result is represented as a double-array trie to compress storage. Compared with SBOM, the improved algorithm not only saves storage effectively but also significantly increases matching speed, making it well suited to online high-speed matching.

18.
Chinese information processing frequently requires dictionary lookup, and when the dictionary involved is large, fast dictionary access for retrieving word knowledge becomes a problem that must be solved. This paper describes a simple and fast dictionary-lookup mechanism based on the double-array trie. Its lookup time is linear, O(n), where n is the length of the queried entry, so the double-array algorithm has a clear advantage in time but wastes space. Earlier work proposed remedies, chiefly a heuristic ordering strategy during construction of the double array: always process first the active node with the most branches. Since this heuristic is deterministic and easily falls into local-optimum traps, this paper builds on it by introducing Sherwood-style randomization and the mutation idea commonly used in genetic algorithms, improving the algorithm's space utilization while letting it escape local optima.
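A minimal double-array trie in Python, using the naive first-fit base search; the heuristic ordering, Sherwood randomization, and mutation discussed above all target exactly this placement step, and are not implemented here:

```python
class DoubleArrayTrie:
    """Minimal double-array trie: child(s, c) = base[s] + code(c), valid
    iff check[child] == s. Lookup is O(n) in the query length."""
    END = 0                                    # pseudo-code for end-of-word

    def __init__(self, words):
        self.base = [0] * 16
        self.check = [-1] * 16
        self.check[1] = 1                      # root lives at state 1
        self._build(1, sorted(set(words)))

    def _code(self, c):
        return ord(c) + 1                      # shift so END stays unique

    def _grow(self, n):
        while len(self.base) <= n:
            self.base.append(0)
            self.check.append(-1)

    def _free(self, n):
        self._grow(n)
        return self.check[n] == -1

    def _build(self, state, words):
        groups = {}                            # next code -> remaining suffixes
        for w in words:
            code = self.END if not w else self._code(w[0])
            groups.setdefault(code, []).append(w[1:] if w else "")
        b = 1
        while not all(self._free(b + c) for c in groups):
            b += 1                             # first-fit search for a base
        self.base[state] = b
        for c in groups:                       # claim all child slots first...
            self.check[b + c] = state
        for c, suffixes in groups.items():     # ...then recurse
            if c != self.END:
                self._build(b + c, suffixes)

    def contains(self, word):
        s = 1
        for c in word:
            t = self.base[s] + self._code(c)
            if t >= len(self.check) or self.check[t] != s:
                return False
            s = t
        t = self.base[s] + self.END
        return t < len(self.check) and self.check[t] == s

trie = DoubleArrayTrie(["he", "her", "his"])
assert trie.contains("her") and not trie.contains("hers")
```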

19.
To remedy the poor results of Lucene's built-in Chinese analyzer, and building on an analysis of existing segmentation-dictionary mechanisms, this paper designs a segmenter based on a fully hashed whole-word binary-matching algorithm and integrates it into Lucene. By hashing whole words, the algorithm reduces the number of entry comparisons and improves segmentation efficiency. The segmenter's dictionary file is easy to maintain and can be customized to the requirements of different applications, thereby improving retrieval efficiency.
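A sketch of whole-word hashing for segmentation, with a Python hash set standing in for the paper's fully hashed binary-search layout; the per-first-character maximum length bounds the number of probes, so each candidate length costs one hash lookup:

```python
def build_dictionary(words):
    """Hash whole entries; also record, per first character, the longest
    entry length so matching never probes further than necessary."""
    table, max_len = set(words), {}
    for w in words:
        max_len[w[0]] = max(max_len.get(w[0], 1), len(w))
    return table, max_len

def segment(text, table, max_len):
    tokens, i = [], 0
    while i < len(text):
        longest = max_len.get(text[i], 1)
        for l in range(min(longest, len(text) - i), 0, -1):
            if l == 1 or text[i:i+l] in table:   # one hash probe per length
                tokens.append(text[i:i+l])
                i += l
                break
    return tokens

table, max_len = build_dictionary(["中国", "人民", "中国人民", "银行"])
assert segment("中国人民银行", table, max_len) == ["中国人民", "银行"]
```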

20.
In recent years, several authors have presented algorithms that locate instances of a given string, or set of strings, within a text. Less consideration has been given to the complementary problem of processing a text to find out what strings appear in it, without any preconceived notion of what strings might be present. A system called PATRICIA, developed two decades ago, implements a solution to this problem. The design of PATRICIA is tightly bound to the assumptions that individual string elements are bits and that the user can provide complete lists of starting and stopping places for strings. This paper presents an approach that drops these assumptions. Our method allows different definitions of indivisible string elements for different applications, and the only information the user provides for determining the beginnings and ends of strings is a maximum length for output strings. The paper also describes a portable C implementation of the method, called PORTREP. The primary data structure of PORTREP is a trie represented as a ternary tree. PORTREP eliminates redundancy from its output, and it can function with a bounded number of nodes by employing a heuristic process that reuses seldom-visited nodes. Theoretical analysis and empirical studies, reported here, give confidence in the efficiency of the algorithms. PORTREP can form the basis for a variety of text-analysis applications; this paper considers one such application, automatic document indexing.
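The primary structure, a trie represented as a ternary tree, is compact to sketch; PORTREP's node-reuse heuristic and redundancy elimination are not shown, and the Python below is only illustrative of the data structure:

```python
class TSTNode:
    __slots__ = ("char", "lo", "eq", "hi", "count")
    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None   # ternary links
        self.count = 0                       # occurrences ending here

def insert(node, s, i=0):
    """Insert s[i:] into the trie-as-ternary-tree; returns the (new) root."""
    c = s[i]
    if node is None:
        node = TSTNode(c)
    if c < node.char:
        node.lo = insert(node.lo, s, i)
    elif c > node.char:
        node.hi = insert(node.hi, s, i)
    elif i + 1 < len(s):
        node.eq = insert(node.eq, s, i + 1)
    else:
        node.count += 1
    return node

def count(node, s, i=0):
    if node is None:
        return 0
    c = s[i]
    if c < node.char:
        return count(node.lo, s, i)
    if c > node.char:
        return count(node.hi, s, i)
    if i + 1 < len(s):
        return count(node.eq, s, i + 1)
    return node.count

root = None
for token in "the cat and the hat".split():
    root = insert(root, token)
assert count(root, "the") == 2
```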
