Similar Literature
1.
Kwong  Linus W.  Ng  Yiu-Kai 《World Wide Web》2003,6(3):281-303
To retrieve Web documents of interest, most Web users rely on Web search engines. All existing search engines provide a query facility that lets users search for the desired documents using keywords. However, when a search engine retrieves a long list of Web documents, the user might need to browse through each retrieved document to determine which ones are of interest. We observe that two kinds of problems are involved in the retrieval of Web documents: (1) an inappropriate selection of keywords by the user; and (2) poor precision in the retrieved Web documents. To solve these problems, we propose an automatic binary-categorization method for recognizing multiple-record Web documents of interest, which appear often in advertisement Web pages. Our categorization method uses application ontologies and is based on two information retrieval models, the Vector Space Model (VSM) and the Clustering Model (CM). We analyze and cull Web documents to just those applicable to a particular application ontology. The culling analysis (i) uses CM to find a virtual centroid for the records in a Web document, (ii) computes a vector in a multi-dimensional space for this centroid, and (iii) compares the vector with the predefined ontology vector of the same multi-dimensional space using VSM, in which we consider the magnitudes of the vectors as well as the angle between them. Our experimental results show that we have achieved an average of 90% recall and 97% precision in recognizing Web documents belonging to the same category (i.e., domain of interest). Thus our categorization discards very few documents it should have kept and keeps very few it should have discarded.
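The three culling steps in this abstract can be illustrated with a minimal Python sketch. It assumes each record is already a numeric term vector; the function names, the magnitude-ratio test, and the thresholds `min_cos` and `min_ratio` are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length record vectors (the 'virtual centroid')."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def magnitude_ratio(u, v):
    """Ratio of the smaller vector length to the larger one, in [0, 1]."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return min(nu, nv) / max(nu, nv)

def accept(records, ontology_vec, min_cos=0.8, min_ratio=0.5):
    """Keep a document if its centroid is close to the ontology vector
    in both angle and magnitude (thresholds are assumed values)."""
    c = centroid(records)
    return (cosine(c, ontology_vec) >= min_cos
            and magnitude_ratio(c, ontology_vec) >= min_ratio)
```

For example, `accept([[2, 1, 0], [4, 3, 0]], [3, 2, 0])` keeps the document because the centroid coincides with the ontology vector, while an ontology vector of `[0, 0, 5]` is orthogonal to the centroid and the document is discarded.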

2.
《Knowledge》2005,18(2-3):117-124
In this paper we propose an approach for refining a document ranking by learning filtering rulesets through relevance feedback. The approach comprises two procedures. One is a filtering method, which can be incorporated into any kind of information retrieval system. The other is a learning algorithm that builds a set of filtering rules, each of which specifies a condition for identifying relevant documents using combinations of characteristic words. Our approach is useful not only for overcoming the limitations of the vector space model, but also for exploiting the tags of semi-structured documents such as Web pages. Through experiments we show that our approach improves the performance of relevance feedback in two types of IR systems: one adopting the vector space model and one built on a Web search engine.

3.
Document similarity search finds documents similar to a given query document and returns a ranked list of them to the user. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because of its good ability to measure similarity between two documents. In this paper, the quantitative performance of the above models is compared experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes into account the subtopic structures of documents. It first splits the documents into text segments with TextTiling and calculates the similarities for different pairs of text segments in the documents. The overall similarity between the documents is then obtained by combining the similarities of the segment pairs with an optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.

4.
In this paper, we present a new query reweighting method for document retrieval. The proposed method uses genetic algorithms to reweight a user's query vector, based on the user's relevance feedback, to improve the performance of document retrieval systems. It encodes the user's query vector into chromosomes and uses genetic algorithms to search for the optimal weights of the query terms. After the best chromosome is found, the method decodes it back into the user's query vector, which is then used for retrieval. In this way the method can find the best weights of the query terms based on the user's relevance feedback, and it can increase both the precision rate and the recall rate of the document retrieval system.
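The encode, search, and decode loop described in this abstract might look roughly like the following Python sketch. Everything specific here is an assumption made for illustration: the chromosome is simply the list of term weights, fitness is precision in the top `k` results, and selection, one-point crossover, and single-gene mutation are generic genetic-algorithm operators rather than the paper's own. The query needs at least two terms for the crossover to work.

```python
import random

def score(query, doc):
    """Inner product of a query weight vector and a document vector."""
    return sum(q * d for q, d in zip(query, doc))

def fitness(query, docs, relevant, k):
    """Fraction of the top-k ranked documents that the user marked relevant."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: score(query, docs[i]), reverse=True)
    return sum(1 for i in ranked[:k] if i in relevant) / k

def evolve(query, docs, relevant, k=2, pop_size=20, generations=30, seed=0):
    """Search for query term weights that rank the relevant documents higher."""
    rng = random.Random(seed)
    # Chromosome = list of term weights; seed the population with the original query.
    pop = [list(query)] + [[rng.random() for _ in query]
                           for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, docs, relevant, k), reverse=True)
        parents = pop[: pop_size // 2]           # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(query))   # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(len(child))] = rng.random()  # mutate one gene
            children.append(child)
        pop = parents + children
    # "Decoding" is trivial here: the best chromosome is the new query vector.
    return max(pop, key=lambda c: fitness(c, docs, relevant, k))
```

Because the fitter half of each generation is carried over unchanged, the returned query vector never scores worse on the feedback set than the original query did.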

5.
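Text Similarity Computation Based on Attribute Theory   Total citations: 38 (self-citations: 0, citations by others: 38)
Taking attribute theory as the theoretical basis, this paper analyzes the relationship between text attributes and the attribute barycentric subdivision model and builds a text attribute barycentric subdivision model. Text vectors and query vectors are represented in the attribute coordinate system, a matching baseline between the vectors is determined, and the matching distance is computed, yielding a formula for the matching similarity between a text and a query. The model effectively describes the relationship between text attributes and query attributes.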

6.
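Wei Lin 《微机发展》 (Microcomputer Development) 2007,17(9):65-67
Search engines return too much information and cannot tailor results to a user's interests, so users have difficulty finding documents of interest in a simple way. Personalized recommendation is an effective way to reduce the user's burden in information retrieval. This paper combines content filtering with document clustering to implement a personalized recommendation system built on search results: it organizes the search results automatically by clustering and proactively recommends documents the user is likely to find interesting. A probabilistic model of user interest is built, and content filtering is applied on top of STC clustering of the search results. Experiments show that the probabilistic model expresses users' interests and how they change better than the vector space model.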

7.
The success and intensive use of social networks make strategies for efficient document location a hot topic of research. In this paper, we propose a common vector space to describe documents and users to create a social network based on affinities, and explore epidemic routing to recommend documents according to the user’s interests. Furthermore, we propose the creation of a SoftDHT structure to improve the recommendation results. Using these mechanisms, an efficient document recommender system with a fast organization of clusters of users based on their affinity can be provided, preventing the creation of unlinked communities. We show through simulations that the proposed system has a short convergence time and presents a high recall ratio.

8.
A Knowledge-Based Approach to Effective Document Retrieval   Total citations: 3 (self-citations: 0, citations by others: 3)
This paper presents a knowledge-based approach to effective document retrieval. This approach is based on a dual document model that consists of a document type hierarchy and a folder organization. A predicate-based document query language is proposed to enable users to precisely and accurately specify the search criteria and their knowledge about the documents to be retrieved. A guided search tool is developed as an intelligent natural language oriented user interface to assist users in formulating queries. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user to retrieve the documents that satisfy the user's particular interests. A knowledge-based query processing and search engine is devised as the core component in this approach. Algorithms are developed for the search engine to effectively and efficiently retrieve the documents that match the query.

9.
This paper describes our research into a query-by-semantics approach to searching the World Wide Web. This research extends existing work, which had focused on a query-by-structure approach for the Web. We present a system that allows users to request documents containing specific content and also to specify that the documents be of a certain type. The system captures and utilizes structure information as well as content during a distributed query of the Web. The system also gives users the option of creating their own document types by providing the system with example documents. In addition, although the system still gives users the option of dynamically querying the Web, the incorporation of a document database has improved the response time of the search process. Based on extensive testing and validation presented herein, it is clear that a system that incorporates structure and document semantic information into the query process can significantly improve search results over the standard keyword search.

10.
11.
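Li Yong, Xiang Zhongqi 《计算机应用》 (Journal of Computer Applications) 2019,39(1):245-250
Existing ciphertext retrieval schemes for cloud computing do not support semantic expansion of search keywords, are not precise enough, and cannot rank their results. To address these problems, a rankable ciphertext retrieval scheme supporting semantic expansion of search keywords is proposed. First, the term frequency-inverse document frequency (TF-IDF) method is used to compute a relevance score between each keyword and its document; keywords in different zones of a document are given different position weights, and a weighted zone scoring method is used to compute a position weight score. The product of the relevance score and the position weight score is the value at the keyword's position in the document index vector. Second, the search keywords entered by an authorized user are semantically expanded using the WordNet semantic network to obtain a set of expanded search keywords; an edit distance formula is used to compute the similarity between keywords in this set, and the similarity value is the value at the keyword's position in the document search vector. Finally, a secure index and a search trapdoor are generated by encryption, inner products are computed under the vector space model (VSM), and the inner product results are used to rank the retrieved ciphertext documents. Theoretical analysis and simulation experiments show that the scheme is secure under the known ciphertext model and the known background knowledge model and can rank retrieval results. Compared with the multi-keyword ranked search over encrypted data (MRSE) scheme, the proposed scheme supports semantic expansion of keywords and achieves higher query accuracy, while its retrieval time is close to that of MRSE.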
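The scoring pipeline in this abstract (TF-IDF times a zone weight for the index vector, edit-distance similarity for the search vector, inner product for ranking) can be sketched in plaintext Python as follows. The encryption step (secure index and trapdoor) is deliberately left out, and several details are assumptions for illustration only: the `ZONE_WEIGHTS` values, the two-zone document layout, and the normalization `1 - distance / max_length` used to turn edit distance into a similarity. The expansion list is passed in by hand rather than looked up in WordNet.

```python
import math

ZONE_WEIGHTS = {"title": 0.6, "body": 0.4}  # hypothetical position weights

def tokens_of(doc):
    return [t for zone in doc.values() for t in zone]

def tf_idf(term, doc_tokens, corpus_tokens):
    """Term frequency times log inverse document frequency."""
    df = sum(1 for d in corpus_tokens if term in d)
    if df == 0:
        return 0.0
    return (doc_tokens.count(term) / len(doc_tokens)) * math.log(len(corpus_tokens) / df)

def zone_score(term, doc):
    """Weighted zone score: sum of the weights of the zones containing the term."""
    return sum(w for zone, w in ZONE_WEIGHTS.items() if term in doc[zone])

def index_vector(doc, corpus, vocab):
    flat = [tokens_of(d) for d in corpus]
    return [tf_idf(t, tokens_of(doc), flat) * zone_score(t, doc) for t in vocab]

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def query_vector(keyword, expansions, vocab):
    """Weight 1.0 for the keyword, an edit-distance similarity for each expansion."""
    sims = {keyword: 1.0}
    for e in expansions:
        sims[e] = 1 - edit_distance(keyword, e) / max(len(keyword), len(e))
    return [sims.get(t, 0.0) for t in vocab]

def rank(corpus, vocab, q):
    """Document indices ordered by inner product with the query vector."""
    scores = [sum(a * b for a, b in zip(index_vector(d, corpus, vocab), q))
              for d in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

In the actual scheme the inner product is evaluated over encrypted vectors; the plaintext version above only shows what quantity the ranking is based on.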

12.
Relevance feedback techniques have proven to be a powerful means of improving the results obtained when a user submits a query to an information retrieval system such as a World Wide Web search engine. These techniques modify the user's original query according to the user's relevance judgements on the retrieved documents, making it more similar to the documents judged relevant. The newly generated query then retrieves additional relevant documents, improving the retrieval process by increasing recall. However, although powerful relevance feedback techniques have been developed for the vector space information retrieval model, and some of them have been carried over to the classical Boolean model, such tools are lacking in more advanced and powerful information retrieval models such as the fuzzy one. In this contribution we introduce a relevance feedback process for extended Boolean (fuzzy) information retrieval systems based on a hybrid evolutionary algorithm combining simulated annealing and genetic programming components. The performance of the proposed technique is compared with the only previously existing approach to this task, Kraft et al.'s method, showing that our proposal outperforms the latter in terms of accuracy and sometimes also in time consumption. Moreover, it is shown that adapting the retrieval threshold through the relevance feedback mechanism increases the system's effectiveness.

13.
Current document retrieval methods use a vector space similarity measure to score the relevance of documents to a specific query. The central problem with these methods is that they neglect any spatial information within the documents in question. We present a new method, called Fourier Domain Scoring (FDS), which takes advantage of this spatial information, via the Fourier transform, to give a more accurate ordering of relevance to a document set. We show that FDS gives an improvement in precision over the vector space similarity measures for the common case of Web-like queries, and that it gives similar results to the vector space measures for longer queries.

14.
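Traditional searchable encryption algorithms for cloud computing do not semantically expand query keywords, so the returned results deviate semantically from the user's query intent; in addition, their relevance ranking of results is not reasonable enough to meet users' needs for intelligent search. To address this, a searchable encryption method that supports semantics is proposed. The method uses an ontology knowledge base to expand user queries semantically and uses semantic similarity to limit the number of expansion terms, so that too many expansion terms do not reduce retrieval precision. It also partitions document vectors and query vectors into blocks to construct corresponding marker vectors that filter out irrelevant documents, and it introduces semantic similarity, a position-weighted keyword score, and keyword-document relevance as factors in the query-document similarity score, which yields an effective ranking of results. Experimental results show that the method improves retrieval efficiency and significantly improves the ranking of results, raising user satisfaction.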

15.
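Application of Full-Text Indexing Technology in Office Automation Systems*   Total citations: 1 (self-citations: 0, citations by others: 1)
Content-based full-text retrieval is widely used in full-text databases. To retrieve the large number of documents in an office automation system quickly, SQL Server full-text indexing is applied in the development of such a system. The paper first describes the SQL Server full-text retrieval workflow and then applies it to implement official-document search in the document management module of the office automation system. The full-text search user interface layer is developed with ASP.NET, and the business layer is written in C#.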

16.
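A Survey of Relevance Feedback Techniques in Information Retrieval*   Total citations: 4 (self-citations: 1, citations by others: 3)
This paper reviews the relevance feedback techniques used in the vector space model, the probabilistic model, and the language model of information retrieval. It mainly covers term weight adjustment, query expansion, and document relevance feedback, as well as the adjustment of the query language model and the document language model. It discusses a recent advance in feedback, term-based feedback, points out directions for future research on relevance feedback, namely personalized feedback such as hierarchical feedback and feedback that exploits logs, and discusses the effect of relevance feedback on retrieval performance.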
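As a concrete illustration of term weight adjustment in the vector space model, the following Python sketch implements the classic Rocchio update. The survey abstract does not name this algorithm, so its use here is an assumption, and the `alpha`, `beta`, and `gamma` defaults are common textbook values rather than values from the survey.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the mean of the relevant document vectors and
    away from the mean of the non-relevant ones; negative weights are clipped to 0."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel_mean, non_mean = mean(relevant), mean(nonrelevant)
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel_mean, non_mean)]
```

Terms that had zero weight in the original query but occur in relevant documents receive a positive weight after the update, which is how the same formula also performs query expansion.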

17.
Although relevance feedback is a common technique in the field of information retrieval (IR), the feedback is usually done by means of query refinement; restructuring of the information space has not been attempted yet. The restructuring not only allows useful applications such as clustering but also is indispensable for IR if a modeling function employs correlation of terms. In this paper we present a new method of relevance feedback through the restructuring of the information space. Our method adapts the document space to the user’s mental model by manipulating a dictionary vector. Therefore, the user’s viewpoint is preserved after a series of retrieval processes and reused for retrieval performed later. We show its effectiveness through retrieval experiments on FAQ (Frequently Asked Questions) documents. Tomoko Murakami: She obtained her bachelor’s degree in Engineering from Aoyama Gakuin University in 1996, and her master’s degree in Media and Governance from Keio University in 1998. In 1998 she joined Human Interface Laboratory, Corporate Research & Development Center, Toshiba Corporation, Kawasaki, Japan. Her research interests are in Machine Learning, especially Inductive Logic Programming. She is a member of JSAI. Ryohei Orihara, Ph.D.: He is a research scientist at Human Interface Laboratory, Corporate Research & Development Center, Toshiba Corporation. He obtained his bachelor’s degree and master’s degree in Engineering and Ph.D. from University of Tsukuba in 1986, 1988 and 1999 respectively. His current research interests include machine learning, creativity support systems, analogical reasoning and metaphor understanding. He was a visiting researcher at University of Toronto from 1993 to 1995. He is a member of IPSJ, JSAI and JSSST. He is presently on the editorial committee of the Journal of JSAI.

18.
Today, digitally stored information isn't only ubiquitous, it's also increasing in volume at an exponential rate. And not only is the volume increasing, but so is the variety, as well as the ways of combining information from different sources to derive insights. Not surprisingly, our most pressing technological and business problem is finding what we need in this sea of information. The dominant paradigm for addressing this problem is information retrieval (Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, ACM Press, 1999). In this paradigm, the user enters a query (typically a few words typed into a search box), and the system retrieves documents matching the query, ranking the matches based on an estimate of their relevancy to the query. If the system finds many matches, the user sees only the highest-ranked matches. The popularity of Web search systems such as Google shows that the information retrieval paradigm can be effective. An information access framework empowers users by explicitly focusing on the interaction between users and the system. The key problem for information access systems isn't guessing which matching document is most relevant, but establishing a dialogue in which users progressively communicate their information goals while the system provides immediate, incremental feedback that guides users in the pursuit of those goals.

19.
In this paper, we propose CYBER, a CommunitY Based sEaRch engine, for information retrieval utilizing community feedback information in a DHT network. In CYBER, each user is associated with a set of user profiles that capture his/her interests. Likewise, a document is associated with a set of profiles, one for each indexed term. A document profile is updated by users who query on the term and consider the document a relevant answer. Thus, the profile acts as a consolidation of users' feedback from the same community, and reflects their interests. In this way, when one user finds a document to be relevant, another user in the same community issuing a similar query will benefit from the feedback provided by the earlier user. Hence, the search quality in terms of both precision and recall is improved. Moreover, we further improve the effectiveness of CYBER by introducing an index tuning technique. By choosing the indexing terms more carefully, community-based relevance feedback is utilized in both building/refining indices and re-evaluating queries. We first propose a naive scheme, CYBER+, which involves an index tuning technique based on past queries only, and then re-evaluates queries in a separate step. We then propose a more complex scheme, CYBER++, which refines its index based on both past queries and relevance feedback. As the index is built with more selective and accurate terms, the search performance is further improved. We conduct a comprehensive experimental study and the results show the effectiveness of our schemes.

20.
We present a virtual reality application called VR-VIBE which is intended to support the co-operative browsing and filtering of large document stores. VR-VIBE extends a visualisation approach proposed in a previous two dimensional system called VIBE into three dimensions, allowing more information to be visualised at one time and supporting more powerful styles of interaction. The essence of VR-VIBE is that multiple users can explore the results of applying several simultaneous queries to a corpus of documents. By arranging the queries into a spatial framework, the system shows the relative attraction of each document to each query by its spatial position and also shows the absolute relevance of each document to all of the queries. Users may then navigate the space, select individual documents, control the display according to a dynamic relevance threshold and dynamically drag the queries to new positions to see the effect on the document space. Co-operative browsing is supported by directly embodying users and providing them with the ability to interact over live audio connections and to attach brief textual annotations to individual documents. Finally, we conclude with some initial observations gleaned from our experience of constructing VR-VIBE and using it in the laboratory setting.
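One simple way to realise the idea of showing a document's relative attraction to each query by its position is a relevance-weighted average of the query positions. The Python sketch below assumes that layout rule, along with the function names and the use of summed scores as the absolute relevance; the abstract does not give VR-VIBE's actual formula.

```python
def place(doc_scores, query_positions):
    """3D position of a document as the relevance-weighted mean of the query
    positions; returns None if the document matches no query at all."""
    total = sum(doc_scores)
    if total == 0:
        return None
    return tuple(
        sum(s * p[axis] for s, p in zip(doc_scores, query_positions)) / total
        for axis in range(3)
    )

def visible(doc_scores, threshold):
    """Dynamic relevance threshold: show only documents whose overall
    relevance to all queries reaches the threshold."""
    return sum(doc_scores) >= threshold
```

With queries at `(0, 0, 0)` and `(4, 0, 0)`, a document scoring 3 against the first and 1 against the second lands at `(1.0, 0.0, 0.0)`, three times closer to the first query. Dragging a query to a new position changes `query_positions`, so every document it attracts moves with it.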
