首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
当前,信息检索系统通常采用“检索+重排序”的多级流水线架构。基于稠密表示的检索模型已经被逐渐应用到第一阶段检索中,并展现出了相比传统的稀疏向量空间模型更好的性能。考虑到第一阶段检索所需的高效性,大多数情况下这些模型的基本架构都采用双编码器(bi-encoder)结构。对查询和文档进行独立的编码,分别得到一个稠密表示向量,然后基于获得的查询和文档表示使用简单的相似度函数计算查询-文档对的得分。然而,在编码文档的过程中查询是不可知的,而且文档相比查询而言通常包含更多的主题信息,因此这种简单的单表示模型可能会造成严重的文档信息丢失。为了解决这个问题,设计了一种新的语义检索方法 MDR(multi-representation dense retrieval),将文档编码成多个稠密向量表示。同时,该方法引入覆盖率(coverage)机制来保证多个向量之间的差异性,从而能够覆盖文档中不同主题的信息。为了评估模型性能,在MS MARCO数据集上进行了段落排序和文档排序任务,实验结果证明了MDR方法的有效性。  相似文献   

2.
A content-search information retrieval process based on conceptual graphs   总被引:1,自引:0,他引:1  
An intelligent information retrieval system is presented in this paper. In our approach, which complies with the logical view of information retrieval, queries, document contents and other knowledge are represented by expressions in a knowledge representation language based on the conceptual graphs introduced by Sowa. In order to take the intrinsic vagueness of information retrieval into account, i.e. to search documents imprecisely and incompletely represented in order to answer a vague query, different kinds of probabilistic logic are often used. The search process described in this paper uses graph transformations instead of probabilistic notions. This paper is focused on the content-based retrieval process, and the cognitive facet of information retrieval is not directly addressed. However, our approach, involving the use of a knowledge representation language for representing data and a search process based on a combinatorial implementation of van Rijsbergen’s logical uncertainty principle, also allows the representation of retrieval situations. Hence, we believe that it could be implemented at the core of an operational information retrieval system. Two applications, one dealing with academic libraries and the other concerning audiovisual documents, are briefly presented.  相似文献   

3.
This article provides a comprehensive and comparative overview of question answering technology. It presents the question answering task from an information retrieval perspective and emphasises the importance of retrieval models, i.e., representations of queries and information documents, and retrieval functions which are used for estimating the relevance between a query and an answer candidate. The survey suggests a general question answering architecture that steadily increases the complexity of the representation level of questions and information objects. On the one hand, natural language queries are reduced to keyword-based searches, on the other hand, knowledge bases are queried with structured or logical queries obtained from the natural language questions, and answers are obtained through reasoning. We discuss different levels of processing yielding bag-of-words-based and more complex representations integrating part-of-speech tags, classification of the expected answer type, semantic roles, discourse analysis, translation into a SQL-like language and logical representations.  相似文献   

4.
5.
杨为民  李龙澍 《计算机工程》2011,37(15):134-136
传统信息检索模型的信息获取准确率较低。针对该问题,提出一种基于场论的高精度信息检索模型,在考虑标引词之间相关性的前提下,以规范化标引词的数量作为文档信息量,构建一个文档信息场。该信息场以文档间相互作用的场力大小度量文档的相关性,由此得到更精确的检索结果。实验结果验证了该模型相对于同类模型的优越性。  相似文献   

6.
WEBSOM is a recently developed neural method for exploring full-text document collections, for information retrieval, and for information filtering. In WEBSOM the full-text documents are encoded as vectors in a document space somewhat like in earlier information retrieval methods, but in WEBSOM the document space is formed in an unsupervised manner using the Self-Organizing Map algorithm. In this article the document representations the WEBSOM creates are shown to be computationally efficient approximations of the results of a certain probabilistic model. The probabilistic model incorporates information about the similarity of use of different words to take into account their semantic relations.  相似文献   

7.
张刚  郭岩  张凯 《计算机工程》2007,33(2):158-159
集合选择是分布式信息检索中的重要问题,将集合选择问题转化为文档检索问题,尝试了多种文档检索方法来解决集合选择问题,并将各种方法的文档检索结果与集合选择结果进行了对比,通过与经典的集合选择算法CORI相比较,实验发现语言模型的集合选择方法能够取得令人满意的结果。  相似文献   

8.
With the number of documents describing real-world events and event-oriented information needs rapidly growing on a daily basis, the need for efficient retrieval and concise presentation of event-related information is becoming apparent. Nonetheless, the majority of information retrieval and text summarization methods rely on shallow document representations that do not account for the semantics of events. In this article, we present event graphs, a novel event-based document representation model that filters and structures the information about events described in text. To construct the event graphs, we combine machine learning and rule-based models to extract sentence-level event mentions and determine the temporal relations between them. Building on event graphs, we present novel models for information retrieval and multi-document summarization. The information retrieval model measures the similarity between queries and documents by computing graph kernels over event graphs. The extractive multi-document summarization model selects sentences based on the relevance of the individual event mentions and the temporal structure of events. Experimental evaluation shows that our retrieval model significantly outperforms well-established retrieval models on event-oriented test collections, while the summarization model outperforms competitive models from shared multi-document summarization tasks.  相似文献   

9.
针对现有信息检索系统难以按查询需求处理检索文档的问题,提出了一种基于相关反馈的信息检索模型,分析了查询词分解,推导了相关反馈机制和正规化过程,并进一步阐述了文档提取方法。提出的模型通过相关反馈和查询词扩展,克服了传统方法无法计算文档与查询词之间的相似度问题,并能有效地处理检索文档。仿真结果证明了该模型的有效性和可行性。  相似文献   

10.
User profiles play an important role in information retrieval system. In this paper, we propose a novel method for the acquisition of ontology-based user profiles. In the method, the ontology-based user profiles can maintain the representations of personal interest. In addition, user ontologies can be automatically constructed. The method can make user profiles strong expressive and less manually interfered.  相似文献   

11.
Legal text retrieval traditionally relies upon external knowledge sources such as thesauri and classification schemes, and an accurate indexing of the documents is often manually done. As a result not all legal documents can be effectively retrieved. However a number of current artificial intelligence techniques are promising for legal text retrieval. They sustain the acquisition of knowledge and the knowledge-rich processing of the content of document texts and information need, and of their matching. Currently, techniques for learning information needs, learning concept attributes of texts, information extraction, text classification and clustering, and text summarization need to be studied in legal text retrieval because of their potential for improving retrieval and decreasing the cost of manual indexing. The resulting query and text representations are semantically much richer than a set of key terms. Their use allows for more refined retrieval models in which some reasoning can be applied. This paper gives an overview of the state of the art of these innovativetechniques and their potential for legal text retrieval.  相似文献   

12.
一种通过内容和结构查询文档数据库的方法   总被引:4,自引:0,他引:4       下载免费PDF全文
文档是有一定逻辑结构的,标题、章节、段落等这些概念是文档的内在逻辑.不同的用户对文档的检索,有不同的需求,检索系统如何提供有意义的信息,一直是研究的中心任务.结合文档的结构和内容,对结构化文件的检索,提出了一种新的计算相似度的方法.这种方法可以提供多粒度的文档内容的检索,包括从单词、短语到段落或者章节.基于这种方法实现了一个问题回答系统,测试集是微软的百科全书Encarta,通过与传统方法实验比较,证明通过这种方法检索的文章片断更合理、更有效.  相似文献   

13.
In this paper, we study the problem of extracting variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of logical document hierarchy is the vital step to support many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, containing hundreds or even thousands of pages and a variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position on the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve its effectiveness and efficiency, we study the design variants in HELD, including traversal orders of the insertion positions, heading extraction explicitly or implicitly, tolerance to insertion errors in predecessor steps, and so on. As for evaluations, we find that previous studies ignore the error that the depth of a node is correct while its path to the root is wrong. Since such mistakes may worsen the downstream applications seriously, a new measure is developed for a more careful evaluation. The empirical experiments based on thousands of long documents from Chinese financial market, English financial market and English scientific publication show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice to achieve the tradeoff between effectiveness and efficiency with the accuracy of 0.972,6, 0.729,1 and 0.957,8 in the Chinese financial, English financial and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study on this task in terms of methods, evaluations, and applications.  相似文献   

14.
面向本体的语义相似度计算及在检索中的应用   总被引:1,自引:0,他引:1       下载免费PDF全文
检索是获取信息的重要方式。传统检索只停留在关键字异同的逻辑层面,忽略了语义层面的信息。以本体的知识组织体系为基础,以检索应用为目标,提出面向本体的文档和查询的语义向量表示方法,进而建立面向本体的相似度计算方法,为语义检索创造条件,检索结果关注语义层面的匹配。并在理论的指导下,进行实验和分析。  相似文献   

15.
This paper concentrates on the problem of designing and developing a spoken query retrieval (SQR) system to access large document databases via voice. The main challenge is to identify and address issues related to the adaptation and scalability of integrating automatic speech recognition (ASR) and information retrieval (IR). In this paper, a Context Aware Language Model (CALM) framework allowing information retrieval to large document databases via voice is presented and findings from a research study using the framework will be discussed as well.  相似文献   

16.
This paper considers the use of text signatures, fixed-length bit string representations of document content, in an experimental information retrieval system: such signatures may be generated from the list of keywords characterising a document or a query. A file of documents may be searched in a bit-serial parallel computer, such as the ICL Distributed Array Processor, using a two-level retrieval strategy in which a comparison of a query signature with the file of document signatures provides a simple and efficient means of identifying those few documents that need to undergo a computationally demanding, character matching search. Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach.  相似文献   

17.
Document representation and its application to page decomposition   总被引:6,自引:0,他引:6  
Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on the connected component extraction to efficiently implement page segmentation and region identification. A new document model which preserves top-down generation information is proposed based on which a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has a high accuracy and takes approximately 1.4 seconds on a SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler) for a 2550×3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise  相似文献   

18.
查询扩展是提高检索效果的有效方法,传统的查询扩展方法大都以单个查询词的相关性来扩展查询词,没有充分考虑词项之间、文档之间以及查询之间的相关性,使得扩展效果不佳。针对此问题,该文首先通过分别构造词项子空间和文档子空间的Markov网络,用于提取出最大词团和最大文档团,然后根据词团与文档团的映射关系将词团分为文档依赖和非文档依赖词团,并构建基于文档团依赖的Markov网络检索模型做初次检索,从返回的检索结果集合中构造出查询子空间的Markov网络,用于提取出最大查询团,最后,采用迭代的方法计算文档与查询的相关概率,并构建出最终的基于迭代方法的多层Markov网络信息检索模型。实验结果表明 该文的模型能较好地提高检索效果。  相似文献   

19.
张刚  谭建龙 《软件学报》2008,19(1):136-143
分布式信息检索的文档集合划分方案的评价是一个困难的问题,目前还没有良好的评价标准.从文档集合划分问题本身出发,给出了两个划分模型来刻画文档集合划分问题,从而使这两个模型可以作为文档集合划分的有效评价指标.在此基础上,提出了一种类Huffman编码的模型快速求解算法,可以求出在给定查询测试集情况下的最优文档划分方案,该方案可以作为其他文档划分方案的参考.实验表明,两个文档划分模型可以成为有效的文档集合划分评价标准.  相似文献   

20.
一种基于锚文本的并行检索策略   总被引:1,自引:0,他引:1       下载免费PDF全文
高珊  何婷婷  胡文敏 《计算机工程》2008,34(19):30-31,3
进行Web信息检索时,页面中的锚文本与正文存在较大相关性,多数检索系统忽视了锚文本对页面正文的贡献。该文提出一种提高检索精度的方法,为文档集建立一个基于页面正文的索引和一个基于锚文本的索引,对其采取并行检索策略。实验结果表明,该方法可以有效处理特定结构的网页集。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号