Similar Documents
20 similar documents found.
1.
Abstract: Multi-pattern string matching finds multiple pattern strings in a given text in parallel. This paper proposes THT-MSMA, a multi-pattern matching algorithm that uses two hash tables to reduce the number of attempted comparisons. Analysis shows that the algorithm suits settings where the shortest pattern string is long: its time complexity is lower than that of the classical algorithms, and it performs fewer attempted comparisons than traditional multi-pattern matching algorithms. Finally, experimental results show that THT-MSMA delivers good time and space performance.
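The abstract does not give the THT-MSMA construction in detail; the Python sketch below only illustrates the general dual-hash-table idea it describes, with two tables acting as successive filters before any character-by-character verification (the fragment length b and the use of Python's built-in hash are assumptions):

    # Illustrative dual-hash-table filter (not the paper's exact algorithm).
    def build_tables(patterns, b=4):
        m = min(len(p) for p in patterns)       # length of the shortest pattern
        b = min(b, m)                           # fragment length for table 1
        t1 = {hash(p[:b]) for p in patterns}    # table 1: cheap existence filter
        t2 = {}                                 # table 2: length-m prefix -> patterns
        for p in patterns:
            t2.setdefault(hash(p[:m]), []).append(p)
        return m, b, t1, t2

    def search(text, patterns):
        m, b, t1, t2 = build_tables(patterns)
        hits = []
        for i in range(len(text) - m + 1):
            if hash(text[i:i + b]) not in t1:           # table 1 rejects cheaply
                continue
            for p in t2.get(hash(text[i:i + m]), ()):   # table 2 gives candidates
                if text.startswith(p, i):               # verify the actual match
                    hits.append((i, p))
        return hits

    print(search("ababcabcde", ["abc", "abcd", "bcab"]))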

2.
Objective: Visual retrieval must accurately and efficiently find the most relevant visual content in large image or video datasets, but because such datasets contain huge numbers of images with high-dimensional features, existing methods struggle to deliver both fast retrieval and good accuracy. Method: For high-dimensional visual retrieval over image and video data, we propose weighted semantic locality-sensitive hashing (WSLSH). The algorithm partitions the reference feature space twice using a two-level visual vocabulary, and within each subspace indexes features precisely with weighted semantic locality-sensitive hashing. A dynamic variable-length hash code reduces the number of hash tables while preserving retrieval performance. In addition, to counter the random instability of locality-sensitive hashing (LSH), statistics reflecting the semantics of the reference feature space are injected into the LSH function, and a simple projection semantic hash function is designed to keep retrieval performance stable. Results: Experiments on the Holidays, Oxford5k, and DataSetB datasets show that WSLSH achieves the shortest average retrieval time of 0.03425 s on DataSetB; with 64-bit codes, its mean average precision (mAP) on the three datasets improves by 1.2%-32.6%, 1.7%-19.1%, and 2.6%-28.6% respectively, giving it an edge over several recent unsupervised hashing methods. Conclusion: WSLSH improves on LSH through a second space partition, weighted hash indexing of the reference features, dynamic variable-length hash codes, and a simple projection semantic hash function. It retrieves faster than existing work and, with long codes, achieves better performance.
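The weighted semantic components of WSLSH are not specified in the abstract; the sketch below shows only the plain random-projection LSH indexing step that WSLSH builds on (dimensions, bit count, and data are invented for illustration):

    # Minimal random-projection LSH baseline (not WSLSH itself).
    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_code(x, planes):
        # One bit per hyperplane: which side of each plane does x fall on?
        return tuple((planes @ x > 0).astype(int))

    dim, bits = 128, 16
    planes = rng.standard_normal((bits, dim))     # random projection directions
    table = {}
    for idx, vec in enumerate(rng.standard_normal((1000, dim))):
        table.setdefault(lsh_code(vec, planes), []).append(idx)

    query = rng.standard_normal(dim)
    candidates = table.get(lsh_code(query, planes), [])  # bucket to verify exactly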

3.
Authors of papers on LR parser table compaction and authors of books on compiler construction appear to have either overlooked or discounted the possibility of using hashing. In fact, hashing is easy to implement as a compaction technique and gives reasonable performance. It produces tables that are as compact as some of the other techniques reported in the literature while permitting efficient table lookups.
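As a minimal illustration of the compaction idea, the sketch below stores only the non-error entries of a hypothetical LR action table in a hash table, so the sparse two-dimensional array collapses into a dict with O(1) expected lookup (the grammar fragment is invented):

    # Hashing as LR table compaction: store only the defined (state, symbol) pairs.
    actions = {
        (0, "id"): ("shift", 5),
        (0, "("):  ("shift", 4),
        (1, "+"):  ("shift", 6),
        (1, "$"):  ("accept", None),
    }

    def action(state, symbol):
        # A missing key is exactly the dense table's "error" entry.
        return actions.get((state, symbol), ("error", None))

    print(action(0, "id"))   # ('shift', 5)
    print(action(0, "+"))    # ('error', None)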

4.
We have established a preprocessing method for determining whether a table is meaningful, to enable information extraction from tables on the Internet. A table offers a preeminent clue in text mining because it contains meaningful data displayed in rows and columns. However, tables are used on the Internet both for knowledge structuring and for document design. We therefore sought to determine whether a table is meaningful, where meaningfulness relates to the structural information provided at the abstraction level of the table head. Accordingly, we: 1) investigated the types of tables present in HTML documents, 2) established the features that distinguish meaningful tables from others, 3) constructed a training data set using the established features after filtering out obvious decorative tables, and 4) constructed a classification model using a decision tree. Based on these features, we set up heuristics for extracting the table head from meaningful tables, and obtained an F-measure of 95.0 percent in distinguishing meaningful tables from decorative tables and an accuracy of 82.1 percent in extracting the table head from the meaningful tables.
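The paper's exact feature set is not listed in the abstract; a minimal sketch of the decision-tree classification step, with invented layout features, might look like this:

    # Hypothetical feature vectors: each row holds layout features (row count,
    # column count, <th> present, cell-length variance); labels mark
    # meaningful (1) vs decorative (0) tables.
    from sklearn.tree import DecisionTreeClassifier

    X = [[5, 3, 1, 0.2], [2, 8, 0, 0.9], [10, 4, 1, 0.1], [1, 6, 0, 0.8]]
    y = [1, 0, 1, 0]

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[7, 3, 1, 0.15]]))    # -> [1]: likely a meaningful table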

5.
P. Quittner, Software, 1983, 13(6): 471-478
It is shown that for data stored on direct access devices access time can be reduced without increasing storage demand, through a single level index table which itself is accessible by hashing. If the complete index table can be stored in main memory this method is always superior to direct hashing and to sequentially organized index tables. If the index table is stored on disk it always yields smaller access time than multi-level index tables and—depending on the size of the index table and on the number of records per track—it is comparable or better than hashing the data directly. Expressions are given to determine in this case which method is more efficient.
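A minimal sketch of the single-level hashed index idea, with the slot count and "track" layout invented for illustration: the index is itself addressed by hashing, so one in-memory probe identifies the single track that must be read:

    # Hashed single-level index: one slot per (simulated) disk track.
    NSLOTS = 8
    index = [[] for _ in range(NSLOTS)]

    def store(key, record):
        index[hash(key) % NSLOTS].append((key, record))   # append to its track

    def fetch(key):
        for k, rec in index[hash(key) % NSLOTS]:          # single track scanned
            if k == key:
                return rec
        return None

    store("cust42", {"name": "Ada"})
    print(fetch("cust42"))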

6.
State-of-the-art hashing methods, such as kernelised locality-sensitive hashing and spectral hashing, have high algorithmic complexity when building hash codes and tables. Our observation on existing hashing methods is that putting two dissimilar data points into the same hash bucket only reduces the efficiency of the hash table; it does not hurt query accuracy. Putting two similar data points into different hash buckets, however, reduces the correctness (i.e. query accuracy) of a hashing method. It is therefore far more important for a good hashing method to ensure that similar data points have a high probability of landing in the same bucket than to worry about relations between dissimilar points. Moreover, attracting similar data points to the same hash bucket naturally discourages dissimilar points from sharing a bucket. Guided by this locality-preserving observation, we propose a new hashing method, locality-preserving hashing, which builds hash codes and tables with much lower algorithmic complexity. Experimental results show that the proposed method is highly competitive with the state of the art in training time on large datasets, with comparable or even better query accuracy.
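The paper's algorithm is not reproduced here; the toy sketch below only illustrates the locality-preserving principle the abstract argues for: a hash that guarantees nearby points usually share a bucket, while making no effort to separate distant ones (the grid cell size is an assumption):

    # Coarse-quantization bucketing as a toy locality-preserving hash.
    def lp_hash(x, cell=1.0):
        # Nearby vectors round to the same grid cell, hence the same bucket.
        return tuple(int(v // cell) for v in x)

    buckets = {}
    for point in [(0.1, 0.2), (0.15, 0.25), (5.0, 5.1)]:
        buckets.setdefault(lp_hash(point), []).append(point)
    print(buckets)   # the two nearby points share one bucket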

7.
We study different efficient implementations of an Aho–Corasick pattern matching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively small alphabet, for example the 7‐bit ASCII. Our aim is to examine the differences in performance arising from the use of a large alphabet, such as Unicode that is widely used today. The main concern is the representation of the transition function of the pattern matching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple‐array, and double‐array representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho–Corasick automaton in Java using the different transition function representations, and we evaluate their performance. Triple‐array and double‐array performed best in our experiments, with perfect hashing, hashing, and balanced tree coming next. We discovered that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that the use of a large alphabet can slow down the preprocessing time of the automaton considerably depending on the transition function representation used.
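A minimal sketch of the hash-based transition representation compared in the paper: each state keeps a small dict of its defined transitions, so a large alphabet such as Unicode costs space only for transitions that actually exist (failure links and output sets are omitted to keep the sketch short):

    # Goto function of an Aho-Corasick trie with per-state hash maps.
    goto = [dict()]                      # state 0 = root; one dict per state

    def add_pattern(pattern):
        state = 0
        for ch in pattern:               # dict lookup replaces a dense array[ord(ch)]
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        return state

    add_pattern("héllo")                 # a non-ASCII key costs no extra space
    print(len(goto) - 1)                 # 5 states beyond the root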

8.
9.
In sequential systems, hash tables yield almost constant time performance for single element accesses. However, in massively parallel systems, we need to consider a large number of parallel accesses. Consequently, the potential queueing delay as well as the communication overhead can alter the relative performance of hash algorithms. Thus, it is necessary to reevaluate the performance of conventional hash algorithms and investigate new algorithms that exploit the parallelism without suffering from excessive communication overheads or queueing delays. In this paper, we first study the performance of data parallel hash algorithms with conventional collision resolution strategies. For SIMD/SPMD hypercube systems, neither linear probing nor double hashing yields satisfactory performance. Thus, we develop a new collision resolution strategy, namely, hypercube hashing. Hypercube hashing combines the randomness provided in double hashing with the low communication cost inherited from linear probing to yield better performance. We also investigate efficient implementation of the chaining algorithm in data parallel systems and its performance. From the simulation results, hypercube hashing significantly outperforms the other open addressing strategies in all cases (under the assumption of random input key space). For high load factors, chaining performs better than hypercube hashing. However, with a low load factor, hypercube hashing significantly outperforms the chaining algorithm.
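The concrete SIMD/SPMD algorithm is more involved than the abstract states; the sketch below is only a guess at the intuition: collisions probe hypercube neighbours (one bit flip, hence one communication hop), in a key-dependent randomized order reminiscent of double hashing:

    import random

    DIM = 4                                    # 2**DIM slots, one per processor
    table = [None] * (1 << DIM)

    def insert(key):
        home = hash(key) % (1 << DIM)
        if table[home] is None:
            table[home] = key
            return home
        # Key-dependent bit order: randomized like double hashing, yet every
        # probe is a hypercube neighbour (one bit flip = one communication hop).
        for bit in random.Random(hash(key)).sample(range(DIM), DIM):
            probe = home ^ (1 << bit)
            if table[probe] is None:
                table[probe] = key
                return probe
        raise RuntimeError("neighbours full; a real scheme would keep probing")

    print(insert("packet-37"))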

10.
This paper shows how trees can be stored in a very compact form, called ‘Bonsai’, using hash tables. A method is described that is suitable for large trees that grow monotonically within a predefined maximum size limit. Using it, pointers in any tree can be represented within 6 + ⌈log2 n⌉ bits per node where n is the maximum number of children a node can have. We first describe a general way of storing trees in hash tables, and then introduce the idea of compact hashing which underlies the Bonsai structure. These two techniques are combined to give a compact representation of trees, and a practical methodology is set out to permit the design of these structures. The new representation is compared with two conventional tree implementations in terms of the storage required per node. Examples of programs that must store large trees within a strict maximum size include those that operate on trie structures derived from natural language text. We describe how the Bonsai technique has been applied to the trees that arise in text compression and adaptive prediction, and include a discussion of the design parameters that work well in practice.
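A dict-based sketch of the tree-in-a-hash-table representation: a node is named by the pair (parent id, child number), so no child pointers are stored at all. The 6 + ⌈log2 n⌉ bits-per-node bound comes from the compact-hashing encoding of these keys, which this sketch does not reproduce:

    # Tree stored as a hash table keyed by (parent_id, child_no).
    nodes = {}
    next_id = 1                                # id 0 is the root

    def add_child(parent, child_no):
        global next_id
        key = (parent, child_no)
        if key not in nodes:
            nodes[key] = next_id
            next_id += 1
        return nodes[key]

    def child(parent, child_no):
        return nodes.get((parent, child_no))   # one hash probe, no pointer fields

    a = add_child(0, 0)                        # the root's first child
    b = add_child(a, 1)                        # that node's second child
    print(child(a, 1) == b)                    # True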

11.
Hashing methods for temporal data
External dynamic hashing has been used in traditional database systems as a fast method for answering membership queries. Given a dynamic set S of objects, a membership query asks whether an object with identity k is in (the most current state of) S. This paper addresses the more general problem of temporal hashing. In this setting, changes to the dynamic set are time-stamped and the membership query has a temporal predicate, as in: "Find whether object with identity k was in set S at time t". We present an efficient solution for this problem that takes an ephemeral hashing scheme and makes it partially persistent. Our solution, also termed partially persistent hashing, uses space that is linear in the total number of changes in the evolution of set S and has a small O(log_B(n/B)) query overhead. An experimental comparison of partially persistent hashing with various straightforward approaches (such as external linear hashing, the multi-version B-tree, and the R*-tree) shows that it provides the fastest membership query response time. Partially persistent hashing should be seen as an extension of traditional external dynamic hashing to a temporal environment. It is independent of the ephemeral dynamic hashing scheme used; while this paper concentrates on linear hashing, the methodology applies to other dynamic hashing schemes as well.
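As a simplified illustration of the temporal membership query (not the paper's space-efficient structure), each key below keeps a time-stamped change history that the query searches:

    # "Was key k in the set at time t?" over a time-stamped change log.
    from bisect import bisect_right

    history = {}                               # key -> [(time, present)], in time order

    def record(key, time, present):
        history.setdefault(key, []).append((time, present))

    def member_at(key, t):
        events = history.get(key, [])
        i = bisect_right(events, (t, True))    # last change at or before time t
        return events[i - 1][1] if i else False

    record("k", 1, True); record("k", 5, False)
    print(member_at("k", 3), member_at("k", 7))   # True False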

12.
This paper develops a formalism that precisely characterizes when class tables are required for C++ memory layouts. A memory layout is a particular choice of data structures for implementing run‐time support for object‐oriented languages. We use this formalism to quantify and evaluate, on a set of benchmarks, the space overhead for a set of C++ memory layouts. In particular, this paper studies the space overhead due to three language features: virtual dispatch, virtual inheritance, and dynamic typing. To date, there has been no scientific quantification or evaluation of C++ memory layouts. Our approach can help C++ implementors. This work has already influenced the memory layout design choices in IBM's Visual Age C++ V5 compiler. Applying our approach to a set of five benchmarks, we demonstrate that the impact of object‐oriented space overhead can vary dramatically between applications (ranging from 0.42% to 99.79% for our benchmarks). In particular, applications whose object space is dominated by instances of classes that heavily use object‐oriented language features will be significantly impacted by the choice of a memory layout.

13.
14.
Routing table lookup is an important operation in packet forwarding. This operation has a significant influence on the overall performance of the network processors. Routing tables are usually stored in main memory which has a large access time. Consequently, small fast cache memories are used to improve access time. In this paper, we propose a novel routing table compaction scheme to reduce the number of entries in the routing table. The proposed scheme has three versions. This scheme takes advantage of ternary content addressable memory (TCAM) features. Two or more routing entries are compacted into one using don’t care elements in TCAM. A small compacted routing table helps to increase cache hit rate; this in turn provides fast address lookups. We have evaluated this compaction scheme through extensive simulations involving IPv4 and IPv6 routing tables and routing traces. The original routing tables have been compacted over 60% of their original sizes. The average cache hit rate has improved by up to 15% over the original tables. We have also analyzed port errors caused by caching, and developed a new sampling technique to alleviate this problem. The simulations show that sampling is an effective scheme in port error-control without degrading cache performance.
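The pairwise core of such a compaction rule can be sketched directly: two entries with the same output port that differ in exactly one specified bit merge into a single TCAM entry with a don't-care in that position (the bit-string representation is an assumption; the paper's three schemes generalize this rule):

    # Merge two TCAM entries differing in one specified bit into one with '*'.
    def try_merge(a, b):
        diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        if len(diff) == 1 and '*' not in (a[diff[0]], b[diff[0]]):
            i = diff[0]
            return a[:i] + '*' + a[i + 1:]
        return None        # not mergeable

    print(try_merge("10110", "10010"))   # '10*10': one TCAM entry instead of two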

15.
Execution monitors are widely used during software development for tasks that require an understanding of program behavior, such as debugging and profiling. The Icon programming language has been enhanced with a framework that supports execution monitoring. Under the enhanced translator and interpreter, neither source modification nor any special compiler command-line option is required in order to monitor an Icon program. Execution monitors are written in the source language, instead of the implementation language. Performance, portability, and detailed access to the monitored program's state are achieved using a coroutine model and dynamic loading rather than the separate-process model employed by many conventional monitoring systems.

16.
Random Sampling for Continuous Streams with Arbitrary Updates
The existing random sampling methods have at least one of the following disadvantages: they 1) are applicable only to certain update patterns, 2) entail large space overhead, or 3) incur prohibitive maintenance cost. These drawbacks prevent their effective application in stream environments (where a relation is updated by a large volume of insertions and deletions that may arrive in any order), despite the considerable success of random sampling in conventional databases. Motivated by this, we develop several fully dynamic algorithms for obtaining random samples from individual relations, and from the join result of two tables. Our solutions can handle any update pattern with small space and computational overhead. We also present an in-depth analysis that provides valuable insight into the characteristics of alternative sampling strategies and leads to precision guarantees. Extensive experiments validate our theoretical findings and demonstrate the efficiency of our techniques in practice
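For contrast, classic insert-only reservoir sampling is sketched below; the paper's point is precisely that such schemes break under deletions, which its fully dynamic algorithms handle with small overhead:

    # Reservoir sampling: a uniform sample of size k from an insert-only stream.
    import random

    def reservoir(stream, k):
        sample = []
        for n, item in enumerate(stream, 1):
            if n <= k:
                sample.append(item)
            elif random.random() < k / n:          # keep every prefix equally likely
                sample[random.randrange(k)] = item
        return sample

    print(reservoir(range(10_000), 5))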

17.
Information and Computation, 2006, 204(9): 1325-1345
Dynamic Programming solves combinatorial optimization problems by recursive decomposition and tabulation of intermediate results. The first step in the design of a dynamic programming algorithm is to decide on the set of tables that will hold optimal solutions to subproblems. This step predetermines the shape of the dynamic programming recurrences as well as the asymptotic efficiency of the algorithm in time and space. We study dynamic programming in a formal framework where design of tables and problem decomposition can be done independently. Our main result shows that choosing a good table design for a given decomposition is an NP-complete problem. A heuristic or approximate approach is therefore needed to automate good table design. We report on a strategy that combines user annotation and a brute force algorithm, which is shown to perform well in a large application.

18.
David R. Hanson, Software, 1980, 10(6): 489-500
Icon is a new programming language designed primarily for non-numerical applications. Its roots are in SNOBOL4 and SL5; as in those languages, execution-time flexibility is an important attribute of Icon, although some aspects of programs are bound at compile time to improve efficiency. Icon, which is implemented in Ratfor, is also intended to be portable and suitable for 16-bit computers. The storage management system in Icon is designed to meet the goals of portability, flexibility and efficiency. This is accomplished by subdividing the storage management system into a set of type-specific storage management subsystems. This paper describes the implementation of these subsystems, their interaction, and their performance.

19.
Hashing algorithms have been widely adopted to provide a fast address lookup process, which involves searching a large database to find the record associated with a given key. Modern examples include address lookup in network routers to determine the outgoing forwarding link, and rule matching in intrusion detection systems that compare incoming packets against a large database. Hashing algorithms transform a key inside each target data item into a hash value, in the hope that the hashing renders the database uniformly distributed with respect to this new hash value. When the database is already key-wise uniformly distributed, any regular hashing algorithm easily leads to a perfectly uniform distribution after hashing. If records in the database are not uniformly distributed, however, different hashing functions lead to different performance. This paper addresses the cases where the distribution follows a natural negative linear distribution, a partial negative linear distribution, or an exponential distribution, which are found to closely approximate many real-life database distributions. For each of these distributions, we derive a general formula for calculating the distribution variance produced by any given non-overlapping bit-grouping XOR hashing function. This distribution variance translates directly into performance variation in searching. Through this, the best XOR hashing function can easily be determined for any given key size and any given hashing target size.
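A non-overlapping bit-grouping XOR hash of the kind analyzed can be sketched in a few lines: the key's bits are cut into consecutive groups of the index width and XOR-folded together; which bits form which group is the design choice whose variance the paper's formulas evaluate (the widths below are illustrative):

    # XOR-fold consecutive, non-overlapping bit groups of the key.
    def xor_fold(key, key_bits=32, index_bits=8):
        h = 0
        for shift in range(0, key_bits, index_bits):
            h ^= (key >> shift) & ((1 << index_bits) - 1)   # next bit group
        return h

    print(xor_fold(0xDEADBEEF))    # 8-bit hash index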

20.
The next evolutionary step in wireless Internet information management is to provide support for tasks, which may be collaborative and may include multiple target devices, from desktop to handheld. This means that the information architecture supports the processes of the task, recognizes group interaction, and lets users migrate seamlessly among Internet-compatible devices without losing the thread of the session. If users are free to migrate among devices during the course of a session, then intelligent transformation of data is required to exploit the screen size and input characteristics of the target appliance with minimal loss of task effectiveness. In this paper we first review general characteristics related to the performance of users on small screens, and then examine the navigation of full tables on small screens for users in multi-device scenarios. We examine the methodologies available for access to full tables in environments where the full table cannot be viewed in its entirety. In particular, we examine the situation where users are collaborating across platforms and referring to the same table of data. We ask three basic questions: Does screen size affect the performance of table lookup tasks? Does a search function improve the performance of table-lookup tasks on reduced screen sizes? Does including context information improve the performance of table-lookup tasks on reduced screen sizes? The answers to these questions are important because individual and intuitive responses guide the designers of small-screen interfaces for use with large tables of data. We report on the results of a user study that examines factors that may affect the use of large tables on small display devices. The use of large tables on small devices in their native state becomes important in at least two circumstances: first, when collaboration involves two or more users sharing a view of data and the individual screen sizes differ; second, when exact replication of the table structure may be critical as a user moves quickly from a larger to a smaller screen, or back again, mid-task. Performance is measured by both effectiveness (correctness of the result) and efficiency (effort required to reach a result).
