Similar documents
1.
High findability of documents within a certain cut-off rank is considered an important factor in recall-oriented application domains such as patent or legal document retrieval. Findability is hindered by two aspects, namely the inherent bias favoring some types of documents over others introduced by the retrieval model, and the failure to correctly capture and interpret the context of conventionally rather short queries. In this paper, we analyze the bias impact of different retrieval models and query expansion strategies. We furthermore propose a novel query expansion strategy based on document clustering to identify dominant relevant documents. This helps to overcome limitations of conventional query expansion strategies, which suffer strongly from the noise introduced by imperfect initial query results when selecting documents for pseudo-relevance feedback. Experiments with different collections of patent documents suggest that clustering-based document selection for pseudo-relevance feedback is an effective approach for increasing the findability of individual documents and decreasing the bias of a retrieval system.

2.
Most of the common techniques in text retrieval are based on the statistical analysis of terms (words or phrases). Statistical analysis of term frequency captures the importance of the term within a document only. Thus, to achieve a more accurate analysis, the underlying model should indicate terms that capture the semantics of text. In this case, the model can capture terms that represent the concepts of the sentence, which leads to discovering the topic of the document. In this paper, a new concept-based retrieval model is introduced. The proposed concept-based retrieval model consists of a conceptual ontological graph (COG) representation and a concept-based weighting scheme. The COG representation captures the semantic structure of each term within a sentence. Then, all the terms are placed in the COG representation according to their contribution to the meaning of the sentence. The concept-based weighting analyzes terms at the sentence and document levels. This is different from the classical approach of analyzing terms at the document level only. The weighted terms are then ranked, and the top concepts are used to build a concept-based document index for text retrieval. The concept-based retrieval model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that represent the concepts capturing the sentence meaning. Experiments using the proposed concept-based retrieval model on different data sets in text retrieval are conducted. The experiments compare traditional approaches with the concept-based retrieval model obtained by combining the conceptual ontological graph and the concept-based weighting scheme. The results are evaluated using three quality measures: the preference measure (bpref), precision at 10 documents retrieved (P(10)), and the mean uninterpolated average precision (MAP). All of these quality measures are improved when the newly developed concept-based retrieval model is used, confirming that such a model enhances the quality of text retrieval.
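
Two of the measures named in this abstract, P(10) and MAP, can be illustrated with a minimal sketch. The function names, the document identifiers and the input shapes below are assumptions made for illustration and are not taken from the paper; bpref is left out.

```python
from typing import Dict, Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision for one query.

    Precision is taken at every rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    rankings: Dict[str, Sequence[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision values (MAP)."""
    if not rankings:
        return 0.0
    return sum(
        average_precision(ranked, judgments.get(query, set()))
        for query, ranked in rankings.items()
    ) / len(rankings)
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3"}`, the average precision is (1/1 + 2/3) / 2.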

3.
Cross-lingual text retrieval (CLTR) is a technique for locating relevant documents in different languages. The authors have developed fuzzy conceptual indexing (FCI) to extend CLTR to include documents that share concepts but don't contain exact translations of query terms. In FCI, documents and queries are represented as a function of language-independent concepts, thus enabling direct mapping between them across multiple languages. Experimental results suggest that concept-based CLTR outperforms translation-based CLTR in identifying conceptually relevant documents.

4.
5.
Before undertaking new biomedical research, identifying concepts that have already been patented is essential. A traditional keyword-based search on patent databases may not be sufficient to retrieve all the relevant information, especially for the biomedical domain. This paper presents BioPatentMiner, a system that facilitates information retrieval and knowledge discovery from biomedical patents. The system first identifies biological terms and relations from the patents and then integrates the information from the patents with knowledge from biomedical ontologies to create a semantic Web. Besides keyword search and queries linking the properties specified by one or more RDF triples, the system can discover semantic associations between the Web resources. The system also determines the importance of the resources to rank the results of a search and prevent information overload while determining the semantic associations.

6.
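A query expansion method based on local co-occurrence   Total citations: 16, self-citations: 2, citations by others: 16
To address the term mismatch between documents and queries in information retrieval, this paper proposes LOCOOC, a query expansion method based on local co-occurrence. LOCOOC evaluates the quality of a candidate expansion term by the degree to which it co-occurs with all query terms in the local document set, and it integrates the term's global statistics in the corpus, so that the selected expansion terms are more relevant to the topic or concept represented by the initial query. Experimental results show that, compared with no query expansion, expansion with LOCOOC improves average precision by more than 40%. Compared with traditional local feedback and with Local Context Analysis (LCA), LOCOOC not only achieves better retrieval performance but is also more robust.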
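
The general idea of scoring expansion terms by local co-occurrence combined with a global statistic can be sketched as follows. This is not the LOCOOC formula, which the abstract does not give; the scoring function, the IDF-style weight and every name below are assumptions chosen only to illustrate the technique.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def expansion_terms(
    query_terms: Sequence[str],
    local_docs: Iterable[Iterable[str]],
    doc_freq: Dict[str, int],
    n_docs: int,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank candidate expansion terms from the top-ranked (local) documents.

    Assumed scoring: for each query term, take log(1 + number of local
    documents containing both the candidate and the query term), sum over
    query terms, then multiply by an IDF-style global weight.
    """
    doc_sets = [set(doc) for doc in local_docs]
    candidates = set().union(*doc_sets) - set(query_terms) if doc_sets else set()
    scored = []
    for term in candidates:
        local = 0.0
        for q in query_terms:
            co_occurrences = sum(1 for doc in doc_sets if term in doc and q in doc)
            local += math.log(1 + co_occurrences)
        # Global statistic: rarer terms in the whole corpus get a larger weight.
        global_weight = math.log(1 + n_docs / (1 + doc_freq.get(term, 0)))
        scored.append((term, local * global_weight))
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]
```

The highest-scoring terms would then be appended to the original query before a second retrieval pass, as in other local feedback schemes.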

7.
This paper provides a formal specification for concept-based image retrieval using triples. To effectively manage a vast number of images, we may need an image retrieval system capable of indexing and searching images based on the characteristics of their content. However, such a content-based image retrieval technique alone may not satisfy user queries if retrieved images turn out to be relevant only when they are conceptually related to the queries. In this paper, we develop an image retrieval mechanism to extract semantics of images based on triples. The semantics of an image can be captured by deriving concepts from its constituent objects and the spatial relationships between them. The concepts are basically composite objects formed from the aggregation of the constituents. In our mechanism, all the spatial relationships between objects including the concepts are uniformly represented by triples, which are used for indexing images as well as capturing their semantics. We also develop a query evaluation for supporting the concept-based image retrieval. ©1999 John Wiley & Sons, Inc.

8.
Technology in the field of digital media generates huge amounts of nontextual information, audio, video, and images, along with more familiar textual information. The potential for exchange and retrieval of information is vast and daunting. The key problem in achieving efficient and user-friendly retrieval is the development of a search mechanism to guarantee delivery of minimal irrelevant information (high precision) while ensuring relevant information is not overlooked (high recall). The traditional solution employs keyword-based search. The only documents retrieved are those containing user-specified keywords. But many documents convey desired semantic information without containing these keywords. This limitation is frequently addressed through query expansion mechanisms based on the statistical co-occurrence of terms. Recall is increased, but at the expense of deteriorating precision. One can overcome this problem by indexing documents according to context and meaning rather than keywords, although this requires a method of converting words to meanings and the creation of a meaning-based index structure. We have solved the problem of an index structure through the design and implementation of a concept-based model using domain-dependent ontologies. An ontology is a collection of concepts and their interrelationships that provide an abstract view of an application domain. With regard to converting words to meaning, the key issue is to identify appropriate concepts that both describe and identify documents as well as language employed in user requests. This paper describes an automatic mechanism for selecting these concepts. An important novelty is a scalable disambiguation algorithm that prunes irrelevant concepts and allows relevant ones to associate with documents and participate in query generation. We also propose an automatic query expansion mechanism that deals with user requests expressed in natural language. This mechanism generates database queries with appropriate and relevant expansion through knowledge encoded in ontology form. Focusing on audio data, we have constructed a demonstration prototype. We have experimentally and analytically shown that our model, compared to keyword search, achieves a significantly higher degree of precision and recall. The techniques employed can be applied to the problem of information selection in all media types.
Received: 7 October 2002. Accepted: 20 May 2003. Published online: 30 September 2003. Edited by: E. Lochovsky.
This research has been funded [or funded in part] by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.

9.
Semantic search has been one of the motivations of the semantic Web since it was envisioned. We propose a model for the exploitation of ontology-based knowledge bases to improve search over large document repositories. In our view of information retrieval on the semantic Web, a search engine returns documents rather than, or in addition to, exact values in response to user queries. For this purpose, our approach includes an ontology-based scheme for the semiautomatic annotation of documents and a retrieval system. The retrieval model is based on an adaptation of the classic vector-space model, including an annotation weighting algorithm and a ranking algorithm. Semantic search is combined with conventional keyword-based retrieval to achieve tolerance to knowledge base incompleteness. Experiments on corpora of significant scale show clear improvements with respect to keyword-based search.

10.
11.
An interactive agent-based system for concept-based web search   Total citations: 1, self-citations: 0, citations by others: 1
Search engines are useful tools for finding information on the Internet. However, because appropriate queries are difficult to specify and because of the problems of the keyword-based similarity ranking that search engines currently use, general users are still not satisfied with the results retrieved. To remedy these difficulties and problems, in this paper we present a multi-agent framework in which an interactive approach is proposed to iteratively collect a user's feedback from the pages he has identified. By analyzing the pages gathered, the system can then gradually formulate queries to efficiently describe the content a user is looking for. In our framework, evolution strategies are employed to evolve critical feature words for concept modeling in query formulation. The experimental results show that the framework developed is efficient and useful for enhancing the quality of web search, and that concept-based semantic search can thus be achieved.

12.
Exploring statistical correlations for image retrieval   Total citations: 1, self-citations: 0, citations by others: 1
Bridging the cognitive gap in image retrieval has been an active research direction in recent years, of which a key challenge is to get enough training data to learn the mapping functions from low-level feature spaces to high-level semantics. In this paper, image regions are classified into two types: key regions representing the main semantic contents and environmental regions representing the contexts. We attempt to leverage the correlations between types of regions to improve the performance of image retrieval. A Context Expansion approach is explored to take advantage of such correlations by expanding the key regions of the queries using highly correlated environmental regions according to an image thesaurus. The thesaurus serves as both a mapping function between image low-level features and concepts and a store of the statistical correlations between different concepts. It is constructed through a data-driven approach which uses Web data (images and their surrounding textual annotations) as the training data source to learn the region concepts and to explore the statistical correlations. Experimental results on a database of 10,000 general-purpose images show the effectiveness of our proposed approach in improving both search precision (i.e., filtering out irrelevant images) and recall (i.e., retrieving relevant images whose context may vary). Several major factors that affect the performance of our approach are also studied.

13.
We investigate the possibility of using Semantic Web data to improve hypertext Web search. In particular, we use relevance feedback to create a ‘virtuous cycle’ between data gathered from the Semantic Web of Linked Data and web-pages gathered from the hypertext Web. Previous approaches have generally considered searching over the Semantic Web and the hypertext Web to be entirely disparate, indexing and searching over different domains. While relevance feedback has traditionally improved information retrieval performance, it is normally used to improve rankings over a single data-set. Our novel approach is to use relevance feedback from hypertext Web results to improve Semantic Web search, and results from the Semantic Web to improve the retrieval of hypertext Web data. In both cases, an evaluation is performed based on certain kinds of informational queries (abstract concepts, people, and places) selected from a real-life query log and checked by human judges. We evaluate our work over a wide range of algorithms and options, and show it improves baseline performance on these queries for deployed systems as well, such as the Semantic Web Search engine FALCON-S and Yahoo! Web search. We further show that the use of Semantic Web inference seems to hurt performance, while pseudo-relevance feedback increases performance in both cases, although not as much as actual relevance feedback. Lastly, our evaluation is the first rigorous ‘Cranfield’ evaluation of Semantic Web search.

14.
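Chinese information retrieval based on document instances   Total citations: 2, self-citations: 0, citations by others: 2
Traditional information retrieval systems build indexes on keywords and retrieve information with them. These systems have several shortcomings: queries return large document sets, precision is low, and ordinary users find it hard to construct queries. This paper therefore proposes information retrieval based on document instances, that is, taking an existing document as a sample and retrieving from the document collection all documents similar to the sample. The paper presents a solution and implementation techniques for document-instance-based Chinese information retrieval. Preliminary experimental results show that the method is effective.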

15.
Spoken content retrieval will be very important for retrieving and browsing multimedia content over the Internet, and spoken term detection (STD) is one of the key technologies for spoken content retrieval. In this paper, we show that acoustic feature similarity between spoken segments, used with pseudo-relevance feedback and graph-based re-ranking, can improve the performance of STD. This is based on the concept that spoken segments similar in acoustic feature vector sequences to those with higher/lower relevance scores should have higher/lower scores, while graph-based re-ranking further uses a graph to consider the similarity structure among all the segments retrieved in the first pass. These approaches are formulated on both word and subword lattices, and a complete framework for using them in open vocabulary retrieval of spoken content is presented. Significant improvements for these approaches with both in-vocabulary and out-of-vocabulary queries were observed in preliminary experiments.

16.
In this paper, we perform a number of experiments with large-scale queries to analyze the retrieval bias of standard retrieval models. These experiments analyze how far different retrieval models differ in terms of the retrieval bias that they impose on the collection. Along with the retrieval bias analysis, we also identify a limitation of the standard retrievability scoring function and propose a normalized retrievability scoring function. The results of the retrieval bias experiments show that when a collection has a highly skewed distribution, the standard retrievability calculation function does not take into account the differences in vocabulary richness across the documents of the collection. In such cases, documents with a large vocabulary produce many more queries, and such documents thus theoretically have a high probability of being retrievable via a much larger number of queries. We thus propose a normalized retrievability scoring function that tries to mitigate this effect by normalizing the retrievability scores of documents relative to their total number of queries. This provides an unbiased representation of the retrieval bias that could occur due to vocabulary differences between the documents of the collection, without automatically inflicting a penalty on retrieval models that favor or disfavor long documents. Finally, to examine which retrievability scoring function is more effective at correctly producing the retrievability ranks of documents, we compare the two functions using the known-item search method. Experiments on known-item search show that the normalized retrievability scoring function is more effective than the standard retrievability scoring function.
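
The contrast between a standard and a normalized retrievability score can be sketched as below. The abstract gives no formulas, so this assumes a simple cumulative definition (a document scores one point for each query that returns it within the cut-off rank) and assumes the normalizer is the number of queries generated from each document; `queries_per_doc` and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence


def retrievability(rankings: Mapping[str, Sequence[str]], cutoff: int) -> Counter:
    """Standard cumulative retrievability (assumed form).

    rankings maps each query to its ranked list of document ids; a document
    gains one point for every query that returns it within the cut-off rank.
    """
    scores: Counter = Counter()
    for ranked in rankings.values():
        for doc in ranked[:cutoff]:
            scores[doc] += 1
    return scores


def normalized_retrievability(
    scores: Mapping[str, int], queries_per_doc: Mapping[str, int]
) -> Dict[str, float]:
    """Divide each score by the number of queries generated from the document.

    queries_per_doc is a hypothetical input: how many queries each document
    contributed to the query set, which grows with its vocabulary size.
    Documents that generated no queries are skipped.
    """
    return {
        doc: scores.get(doc, 0) / count
        for doc, count in queries_per_doc.items()
        if count > 0
    }
```

Under this sketch, a document with a large vocabulary that generated many queries no longer receives a high score simply because more queries could reach it.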

17.
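As an important component of query optimization, query expansion plays a vital role in improving the performance of information retrieval systems. Traditional pseudo-relevance feedback query expansion improves retrieval performance to some extent, but the selected expansion terms include some that are unrelated to the original query, which hurts retrieval performance. This paper proposes a query expansion method based on a classification model. The algorithm combines the statistical information and multiple features of candidate expansion terms and uses a naive Bayes classification model to reclassify and select among the initially obtained candidates, further removing expansion terms that have little relevance to the query terms. Experimental results on the TREC 2013 data set show that the proposed method effectively improves the precision and recall of user queries.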

18.
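Design and implementation of a Chinese search engine based on concept retrieval   Total citations: 4, self-citations: 0, citations by others: 4
Building the semantic base and expanding queries are the main factors affecting the efficiency of concept retrieval. This paper proposes a method for automatically building a semantic base and performing relevance-based query expansion. The method uses association rule mining to automatically derive the relevance and hierarchical relationships between concepts/terms from documents and builds an association base. It then uses the association base to expand query requests by relevance, thereby realizing concept retrieval. Experimental results show that the method is effective and improves the recall and precision of information retrieval.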

19.
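Query expansion, an important information retrieval technique, starts from the user's query and adds related expansion terms to the original query according to some strategy, so that the query describes the user's information need more accurately. Learning to rank uses machine learning to construct ranking models that order data, and it is a current research focus at the intersection of machine learning and information retrieval. This paper uses pseudo-relevance feedback to introduce learning-to-rank algorithms into query expansion: it extracts features related to expansion terms from the document collection, trains a ranking model for expansion terms, and uses the model to re-rank the set of expansion terms for a new query. The re-ranked expansion terms are assigned weights according to their ranking scores and added to the original query for a second retrieval pass, thereby improving retrieval precision. Experimental results on TREC data collections show that introducing learning-to-rank algorithms helps improve the retrieval performance of pseudo-relevance feedback.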

20.
This article provides a comprehensive and comparative overview of question answering technology. It presents the question answering task from an information retrieval perspective and emphasises the importance of retrieval models, i.e., representations of queries and information documents, and retrieval functions which are used for estimating the relevance between a query and an answer candidate. The survey suggests a general question answering architecture that steadily increases the complexity of the representation level of questions and information objects. On the one hand, natural language queries are reduced to keyword-based searches. On the other hand, knowledge bases are queried with structured or logical queries obtained from the natural language questions, and answers are obtained through reasoning. We discuss different levels of processing yielding bag-of-words-based and more complex representations integrating part-of-speech tags, classification of the expected answer type, semantic roles, discourse analysis, translation into a SQL-like language and logical representations.

