Similar literature
 20 similar documents found; search time: 15 ms
1.
2.
Backward demodulation is a simplification technique used in saturation-based theorem proving with superposition and ordered paramodulation. It requires instance retrieval, i.e., search for instances of some term in a typically large set of terms. Path indexing is a family of indexing techniques that can be used to solve this problem efficiently. We propose a number of powerful optimisations to standard path indexing. We also describe a novel framework that combines path indexing with relational joins. The main advantage of the proposed scheme is flexibility, which we illustrate by sketching how to adapt the scheme to instance retrieval modulo commutativity and backward subsumption on multi-literal clauses.
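
The instance retrieval described in the abstract above can be illustrated with a minimal sketch of plain path indexing. This is not the optimised scheme the paper proposes: the term encoding (nested tuples), the convention that uppercase strings are variables, and the names `PathIndex`, `paths`, and `instances` are all assumptions made for illustration. The set intersection plays the role of the join over paths, and the result is exact only for linear query terms (no repeated variables).

```python
from collections import defaultdict

# Terms are nested tuples, e.g. ("f", ("g", "a"), "X").
# Assumption of this sketch: a string starting with an uppercase letter is a variable.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def paths(term, prefix=()):
    """Yield one path per non-variable position: the symbols and argument
    indices met on the way from the root, ending with the symbol found there."""
    if is_var(term):
        return
    if isinstance(term, str):              # constant symbol
        yield prefix + (term,)
        return
    head, *args = term
    yield prefix + (head,)
    for i, arg in enumerate(args, start=1):
        yield from paths(arg, prefix + (head, i))

class PathIndex:
    def __init__(self):
        self.terms = []
        self.by_path = defaultdict(set)    # path -> ids of terms containing it

    def insert(self, term):
        term_id = len(self.terms)
        self.terms.append(term)
        for p in paths(term):
            self.by_path[p].add(term_id)
        return term_id

    def instances(self, query):
        """Return indexed terms that are instances of a linear query term."""
        result = set(range(len(self.terms)))
        for p in paths(query):
            # Intersect (join) the id sets of all paths required by the query.
            result &= self.by_path.get(p, set())
        return [self.terms[i] for i in sorted(result)]
```

For a query with repeated variables, the returned terms would only be candidates and would still need a matching check.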

3.
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations, and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies both the decisions made by a table recognizer and the assumptions and inferencing techniques that underlie these decisions.
Received: 29 May 2003; Revised: 28 October 2003; Published online: 1 April 2004. Correspondence to: Richard Zanibbi

4.
The traditional way to interact with a database system is through a query language, the most widely used being SQL. However, when a user needs to make a minor modification to a previously issued SQL query, either the whole query has to be written from scratch or an editor has to be invoked to edit it. This is not how humans converse with each other. During a conversation, the preceding interaction serves as a context within which many incomplete and/or incremental phrases are uniquely and unambiguously interpreted, sparing the need to repeat the same things again and again. In this paper, we present an effective mechanism that allows a user to interact with a database system in a way similar to the way humans converse. More specifically, incomplete SQL queries are accepted as input and are then matched to identified parts of previously issued queries. Disambiguation is achieved by using various types of semantic information. The overall method works independently of the domain in which it is used (i.e., independently of the database schema). Several algorithms that are variations of the same basic mechanism are proposed. They are compared with one another with respect to efficiency and accuracy through a limited set of experiments on human subjects. The results have been encouraging, especially when semantic knowledge from the schema is exploited, laying a potential foundation for conversational querying in databases.

5.
The growing amount of online data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built using commodity hardware such as personal computers (PCs) can substantially reduce cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), so the effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing storage waste or maximizing storage utilization under different conditions, which leads to different storage allocation schemes. Minimizing storage waste also reduces the number of address indirections (i.e., chaining). We evaluated our storage allocation schemes using a number of reference collections. We found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents are indexed. The mean number of chainings is not large (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file. The final storage utilization of the extensible compressed inverted file can be over 90% in our evaluation, provided that a sufficient number of documents are indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.
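
The two quantities the abstract above evaluates, storage utilization and the number of chainings per term, can be made concrete with a small sketch. The abstract does not define the arrival rate scheme, so this example swaps in a simple doubling scheme for the node sizes; the class name `ExtensiblePostings`, the initial capacity, and the use of Python lists in place of real linked-list nodes are all assumptions for illustration.

```python
class ExtensiblePostings:
    """Postings for one term, stored as a chain of nodes with fixed capacities.

    Assumption of this sketch: each new node has twice the capacity of the
    previous one. This is NOT the arrival rate scheme from the abstract.
    """

    def __init__(self, first_capacity=2):
        self.nodes = [[]]                  # each inner list stands for one node
        self.capacities = [first_capacity]

    def append(self, doc_id):
        if len(self.nodes[-1]) == self.capacities[-1]:
            # The last node is full: chain a new, larger node.
            self.nodes.append([])
            self.capacities.append(self.capacities[-1] * 2)
        self.nodes[-1].append(doc_id)

    def chainings(self):
        """Number of address indirections needed to traverse the whole list."""
        return len(self.nodes) - 1

    def utilization(self):
        """Fraction of allocated posting slots that actually hold a posting."""
        used = sum(len(node) for node in self.nodes)
        return used / sum(self.capacities)
```

With larger nodes the chain is shorter but more slots may sit empty, which is the trade-off the allocation schemes in the abstract aim to balance.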

6.
The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents over the internet and intranets. HTML allows the author to organize the text into subparts delimited by special tags; these subparts are then rendered by the HTML browser in distinct ways, i.e. with distinct typographical formats. In this paper a model for indexing HTML documents is proposed which exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method to compute the significance degree of a term in a document by weighting the term instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the set of numerical weights assigned to the various tags and the respective text depends not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
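
A rough sketch of the linear form of contextual weighting described in the abstract above follows. The tag ranks in `TAG_RANK`, the normalization over the tags present in the document, and the function names are assumptions for illustration; the abstract does not give the actual weights or formulas.

```python
from collections import Counter, defaultdict

# Hypothetical importance ranks for a few HTML tags (higher = more important).
TAG_RANK = {"title": 4, "h1": 3, "b": 2, "p": 1}

def contextual_weights(tags_in_doc):
    """Normalize the ranks over the tags that actually occur in this document,
    so the same tag can receive a different weight in a different document."""
    present = {t for t in tags_in_doc if t in TAG_RANK}
    total = sum(TAG_RANK[t] for t in present)
    return {t: TAG_RANK[t] / total for t in present}

def term_significance(segments):
    """segments: list of (tag, text) pairs. Returns term -> significance."""
    weights = contextual_weights(tag for tag, _ in segments)
    counts = defaultdict(Counter)          # tag -> term counts inside that tag
    for tag, text in segments:
        if tag in weights:
            counts[tag].update(text.lower().split())
    scores = defaultdict(float)
    for tag, terms in counts.items():
        n = sum(terms.values())
        for term, c in terms.items():
            # Linear combination: tag weight times relative frequency in the tag.
            scores[term] += weights[tag] * c / n
    return dict(scores)
```

A term that appears in the title therefore outranks a term that appears only in body text, and the scores of one document sum to 1.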

7.
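A performance evaluation model for distributed transaction processing   Total citations: 2 (self-citations: 0, citations by others: 2)
This paper introduces a model for evaluating transaction processing performance in distributed databases. The model is built on simplified, restrictive assumptions, and two important performance metrics are chosen: the average number of nodes accessed and the average number of data items accessed per node.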

8.
We consider the problem of designing an efficient index for approximate pattern matching. Despite ongoing efforts, this topic is still a challenge in combinatorial pattern matching. We present a new data structure that reports all matches in worst-case time O(|Σ|m + occ), which is linear in the pattern length m and the number of occurrences occ for alphabets of constant size |Σ|. Our index uses O(n|Σ| log n) extra space on average and with high probability, where n is the length of the text (indexing) or the number of strings to index (dictionary lookup).

9.
10.
This work addresses the problem of detecting novel sentences in an incoming stream of text data by studying the performance of different novelty metrics and proposing a mixed metric that can adapt to different performance requirements. Existing novelty metrics can be divided into two types, symmetric and asymmetric, based on whether the ordering of sentences is taken into account. After a comparative study of several different novelty metrics, we observe complementary behavior in the two types of metrics. This finding motivates a new framework of novelty measurement, i.e. a mixture of symmetric and asymmetric metrics. The new framework performs better under performance requirements ranging from high precision to high recall, as well as on data with different percentages of novel sentences. Because it does not require any prior information, the new metric is well suited to real-time knowledge base applications such as novelty mining systems where no training data is available beforehand.
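
The idea of mixing a symmetric and an asymmetric novelty metric, as described in the abstract above, can be sketched as follows. The abstract does not name the metrics or the mixing rule, so the choices here are assumptions: cosine similarity as the symmetric metric, the fraction of previously unseen words as the asymmetric one, and a linear mix controlled by a hypothetical parameter `alpha`.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word-count vectors (symmetric in a and b)."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def symmetric_novelty(sentence, history):
    """1 minus the highest cosine similarity to any earlier sentence."""
    current = Counter(sentence.lower().split())
    sims = [cosine(current, Counter(h.lower().split())) for h in history]
    return 1.0 - max(sims, default=0.0)

def asymmetric_novelty(sentence, history):
    """Fraction of the sentence's distinct words not seen in earlier sentences.
    Swapping the sentence and the history changes the value, hence asymmetric."""
    seen = {w for h in history for w in h.lower().split()}
    words = set(sentence.lower().split())
    return len(words - seen) / len(words) if words else 0.0

def mixed_novelty(sentence, history, alpha=0.5):
    """Linear mix of the two metrics; alpha = 1 uses only the symmetric one."""
    return (alpha * symmetric_novelty(sentence, history)
            + (1 - alpha) * asymmetric_novelty(sentence, history))
```

A sentence would then be flagged as novel when `mixed_novelty` exceeds a threshold, and shifting `alpha` is one plausible way to trade precision against recall.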

11.
The past few years have seen tremendous advances in distributed storage infrastructure. Unstructured and structured overlay networks have been successfully used in a variety of applications, ranging from file-sharing to scientific data repositories. While unstructured networks benefit from low maintenance overhead, the associated search costs are high. On the other hand, structured networks have higher maintenance overheads, but facilitate bounded time search of installed keywords. When dealing with typical data sets, though, it is infeasible to install every possible search term as a keyword into the structured overlay.

12.
Objective: Information Retrieval (IR) is strongly rooted in experimentation where new and better ways to measure and interpret the behavior of a system are key to scientific advancement. This paper presents an innovative visualization environment: Visual Information Retrieval Tool for Upfront Evaluation (VIRTUE), which eases and makes more effective the experimental evaluation process. Methods: VIRTUE supports and improves performance analysis and failure analysis. Performance analysis: VIRTUE offers interactive visualizations based on well-known IR metrics allowing us to explore system performances and to easily grasp the main problems of the system. Failure analysis: VIRTUE develops visual features and interaction, allowing researchers and developers to easily spot critical regions of a ranking and grasp possible causes of a failure. Results: VIRTUE was validated through a user study involving IR experts. The study reports on (a) the scientific relevance and innovation and (b) the comprehensibility and efficacy of the visualizations. Conclusion: VIRTUE eases the interaction with experimental results, supports users in the evaluation process and reduces the user effort. Practice: VIRTUE will be used by IR analysts to analyze and understand experimental results. Implications: VIRTUE improves the state-of-the-art in the evaluation practice and integrates visualization and IR research fields in an innovative way.

13.
Our research shows that for large databases, without considerable additional storage overhead, cluster-based retrieval (CBR) can compete with the time efficiency and effectiveness of the inverted index-based full search (FS). The proposed CBR method employs a storage structure that blends the cluster membership information into the inverted file posting lists. This approach significantly reduces the cost of similarity calculations for document ranking during query processing and improves efficiency. For example, in terms of in-memory computations, our new approach can reduce query processing time to 39% of FS. The experiments confirm that the approach is scalable and that system performance improves with increasing database size. In the experiments, we use the cover coefficient-based clustering methodology (C3M) and the Financial Times database of TREC, which contains 210 158 documents of size 564 MB defined by 229 748 terms with a total of 29 545 234 inverted index elements. This study provides CBR efficiency and effectiveness experiments using the largest corpus in an environment that employs no user interaction or user behavior assumption for clustering.
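
The storage idea in the abstract above, posting lists that carry cluster membership so that query processing can skip documents outside the selected clusters, might look roughly like this. The layout (term, then cluster, then postings), the plain term-frequency scoring, and the names `ClusterAwareIndex`, `add`, and `search` are assumptions for illustration; the abstract does not specify the actual file organization, and selecting the best clusters (for example by matching the query against cluster centroids) is left to the caller here.

```python
from collections import Counter, defaultdict

class ClusterAwareIndex:
    def __init__(self):
        # term -> cluster id -> list of (doc id, term frequency)
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, cluster_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][cluster_id].append((doc_id, tf))

    def search(self, query, best_clusters):
        """Score only documents that belong to the selected clusters."""
        scores = defaultdict(int)
        for term in query.lower().split():
            by_cluster = self.postings.get(term, {})
            for cluster_id in best_clusters:
                # Postings of all other clusters are skipped entirely.
                for doc_id, tf in by_cluster.get(cluster_id, []):
                    scores[doc_id] += tf
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Passing every cluster id to `search` degenerates to a full search, which makes the saving from skipping clusters easy to compare.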

14.
15.
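Zhang Jiyan (张继燕), Ou Yingyuan (欧莹元). 《软件》 (Software), 2013, 34(5): 155-156
Starting from an analysis of the goals of the Information Management and Information Systems major, this paper establishes the position of the course "Information Storage and Retrieval" within that major. It then describes the multidisciplinary nature of the course, analyzes the main textbooks currently used in universities, and selects the textbook best suited to the major. Based on the selected textbook, it describes the teaching content and the teaching methods of the course.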

16.
17.
Text retrieval systems require an index to allow efficient retrieval of documents at the cost of some storage overhead. This paper proposes a novel full-text indexing model for Chinese text retrieval based on the concept of the adjacency matrix of a directed graph. With this indexing model, on the one hand, retrieval systems need to keep only the indexing data, instead of both the indexing data and the original text data as traditional retrieval systems do. On the other hand, occurrences of an index term are identified by the labels of the so-called s-strings in which the index term appears, rather than by its positions as in traditional indexing models. Consequently, the overall space cost of the system can be reduced drastically while retrieval efficiency remains satisfactory. Experiments over several real-world Chinese text collections are carried out to demonstrate the effectiveness and efficiency of this model. In addition to Chinese, the proposed indexing model is also effective and efficient for text retrieval in other Oriental languages, such as Japanese and Korean. It is especially useful for digital library application areas where storage resources are very limited (e.g., e-books and CD-based text retrieval systems).

18.
In this paper we present a robust information integration approach to identifying images of persons in large collections such as the Web. The underlying system relies on combining content analysis, which involves face detection and recognition, with context analysis, which involves extraction of text or HTML features. Two aspects are explored to test the robustness of this approach: sensitivity of the retrieval performance to the context analysis parameters and automatic construction of a facial image database via automatic pseudofeedback. For the sensitivity testing, we reevaluate system performance while varying context analysis parameters. This is compared with a learning approach where association rules among textual feature values and image relevance are learned via the CN2 algorithm. A face database is constructed by clustering after an initial retrieval relying on face detection and context analysis alone. Experimental results indicate that the approach is robust for identifying and indexing person images.
Correspondence to: Y. Alp Aslandogan

19.
One of the major challenges in Peer-to-Peer (P2P) file sharing systems is to support content-based search. Although there have been some proposals to address this challenge, they share the same weakness of using either servers or super-peers to keep global knowledge, which is required to identify the importance of terms so that popular terms can be avoided in query processing. As a result, they are not scalable and are prone to the bottleneck problem, which is caused by the high visiting load at the global knowledge maintainers. To address this, we propose a novel adaptive indexing approach for content-based search in P2P systems, which can identify the importance of terms without keeping global knowledge. Our method is based on an adaptive indexing structure that combines a Chord ring and a balanced tree. The tree is used to aggregate and classify terms adaptively, while the Chord ring is used to index terms of nodes in the tree. Specifically, at each node of the tree, the system classifies terms as either important or unimportant. Important terms, which can distinguish the node from its neighbor nodes, are indexed in the Chord ring. Unimportant terms, which are either popular or rare, are aggregated to higher-level nodes. Such classification enables the system to process queries on the fly without the need for global knowledge. In addition, compared to methods that index terms separately, term aggregation reduces the indexing cost significantly. Taking advantage of the tree structure, we also develop an efficient search algorithm to tackle the bottleneck problem near the root. Finally, our extensive experiments on both benchmark and Wikipedia datasets validate the effectiveness and efficiency of the proposed method.

20.
Content-based image indexing and searching using Daubechies' wavelets   Total citations: 8 (self-citations: 0, citations by others: 8)
This paper describes WBIIS (Wavelet-Based Image Indexing and Searching), a new image indexing and retrieval algorithm with partial sketch image searching capability for large image databases. The algorithm characterizes the color variations over the spatial extent of the image in a manner that provides semantically meaningful image comparisons. The indexing algorithm applies a Daubechies' wavelet transform for each of the three opponent color components. The wavelet coefficients in the lowest few frequency bands, and their variances, are stored as feature vectors. To speed up retrieval, a two-step procedure is used that first does a crude selection based on the variances, and then refines the search by performing a feature vector match between the selected images and the query. For better accuracy in searching, two-level multiresolution matching may also be used. Masks are used for partial-sketch queries. This technique performs much better in capturing coherence of image, object granularity, local color/texture, and bias avoidance than traditional color layout algorithms. WBIIS is much faster and more accurate than traditional algorithms. When tested on a database of more than 10 000 general-purpose images, the best 100 matches were found in 3.3 seconds.
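
The two-step retrieval procedure in the abstract above (a crude selection on variances, then a feature vector match on the survivors) can be sketched under simplifying assumptions. The wavelet transform itself is not computed here: each image is assumed to be already reduced to one scalar variance and one feature vector, and the relative tolerance `var_tolerance`, the Euclidean distance, and the function name `two_step_search` are hypothetical choices, not details taken from WBIIS.

```python
import math

def two_step_search(query, database, var_tolerance=0.5, top_k=3):
    """query and each database value: {"variance": float, "features": [float, ...]}.

    Returns the names of the top_k closest images among those that pass
    the variance filter.
    """
    # Step 1: crude selection. Keep images whose variance is close to the query's.
    limit = var_tolerance * query["variance"]
    candidates = [
        (name, entry) for name, entry in database.items()
        if abs(entry["variance"] - query["variance"]) <= limit
    ]
    # Step 2: refine by feature vector distance on the remaining candidates only.
    ranked = sorted(
        candidates,
        key=lambda item: math.dist(item[1]["features"], query["features"]),
    )
    return [name for name, _ in ranked[:top_k]]
```

The cheap first step means the more expensive vector comparison runs on only a fraction of the database, which is where the speed-up would come from.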
