Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Much information is now stored electronically in document bases, and users retrieve it by browsing and querying. Although many tools are available, little work has been done on tools that support queries involving all the characteristics of documents together with the use of domain knowledge during the search. In this paper we propose a query language that allows documents to be queried using content information, information about the logical structure of the documents, and information about properties of the documents. Domain knowledge is also taken into account during the search. We also present an architecture for a system supporting such a language, and we describe a prototype implementation together with test results.

2.
A Knowledge-Based Approach to Effective Document Retrieval   Total citations: 3, self-citations: 0, citations by others: 3
This paper presents a knowledge-based approach to effective document retrieval. The approach is based on a dual document model that consists of a document type hierarchy and a folder organization. A predicate-based document query language is proposed to enable users to specify precisely the search criteria and their knowledge about the documents to be retrieved. A guided search tool is developed as an intelligent, natural-language-oriented user interface that assists users in formulating queries. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user in order to retrieve the documents that satisfy the user's particular interests. A knowledge-based query processing and search engine is devised as the core component of this approach. Algorithms are developed for the search engine to retrieve the documents that match the query effectively and efficiently.

3.
This paper considers the use of text signatures, fixed-length bit-string representations of document content, in an experimental information retrieval system. Such signatures may be generated from the list of keywords characterising a document or a query. A file of documents may be searched on a bit-serial parallel computer, such as the ICL Distributed Array Processor, using a two-level retrieval strategy: comparing a query signature with the file of document signatures provides a simple and efficient means of identifying the few documents that need to undergo a computationally demanding character-matching search. Text retrieval experiments using three large collections of documents and queries demonstrate the efficiency of the suggested approach.
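The two-level strategy described above can be sketched in a few lines. This is only an illustration under assumptions that are not in the abstract: the 64-bit signature width, the MD5-based bit selection, whitespace tokenisation, and the function names `signature` and `two_level_search` are all hypothetical, and a real system would precompute document signatures rather than rebuild them per query.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed width; the abstract does not state one


def signature(keywords, bits=SIGNATURE_BITS):
    """Superimpose one hashed bit per keyword into a fixed-length bit string."""
    sig = 0
    for word in keywords:
        digest = hashlib.md5(word.lower().encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % bits)
    return sig


def two_level_search(query_keywords, documents):
    """documents maps doc id -> text; returns ids whose text holds every keyword."""
    query_sig = signature(query_keywords)
    # Level 1: cheap bitwise filter. It can admit false positives (hash
    # collisions) but never rejects a document that contains all keywords.
    candidates = [
        doc_id for doc_id, text in documents.items()
        if (signature(text.split()) & query_sig) == query_sig
    ]
    # Level 2: exact word comparison, run only on the surviving candidates.
    wanted = {w.lower() for w in query_keywords}
    return [
        doc_id for doc_id in candidates
        if wanted <= {w.lower() for w in documents[doc_id].split()}
    ]
```

The point of the design is that the first pass touches only small fixed-length integers, so the costly comparison in the second pass runs on a short candidate list.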

4.
The World-Wide Web can be viewed as a collection of semi-structured multimedia documents in the form of Web pages connected through hyperlinks. Unlike most Web search engines, which focus primarily on information retrieval functionality, WebDB aims to support comprehensive database-like query functionality, including selection, aggregation, sorting, summary, grouping, and projection. WebDB allows users to access (1) document-level information, such as title, URL, length, keywords types and last modified date; (2) intra-document structures, such as tables, forms and images; and (3) inter-document linkage information, such as destination URLs and anchors. With these three types of information, comprehensive queries for complex Web-based applications, such as Web mining and Web site management, can be answered. WebDB is based on object-relational concepts: object-oriented modeling and a relational query language. In this paper, we present the data model, language and implementation of WebDB. We also present the novel visual query/browsing interface for semi-structured Web and Web documents. Our system provides high usability compared with other existing systems.

5.
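PCCS (Partial Clustering and Classification): A Fast Web Document Clustering Method   Total citations: 15, self-citations: 1, citations by others: 15
PCCS is a partial clustering method for quickly clustering Web documents, intended to help Web users pick out the documents they need from the large number of document snippets returned by a search engine. It first clusters a portion of the documents, then builds a class model from the clustering result and uses it to classify the remaining documents. An interactive clustering method that refines one cluster digest at a time is used to quickly create a set of cluster digests, and the remaining documents are partitioned with a Naive-Bayes classifier. To improve the efficiency of clustering and classification, a hybrid feature selection method is proposed to reduce the dimensionality of the document representation: either the entropy of each feature in the documents is recomputed and the top features with the highest entropy are selected, or features are selected based on the feature set of a persistent classification model. Experiments show that the partial clustering method organizes Web documents quickly and accurately according to their topical content, letting users view search engine results at a higher topical level and select relevant documents from clusters of topically similar documents.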

6.
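Semantic retrieval is a very promising way to meet the accuracy and user-friendliness requirements of information retrieval. Knowledge documents are annotated with subject terms, and a two-level index structure is then built from tokens to subject terms to knowledge documents. For a user search, the query terms are converted into subject terms, semantic similarity is computed, and documents are ranked by the semantic similarity algorithm. A semantic retrieval system based on knowledge documents has been deployed and used at a group company, where the top five results satisfied 90% of all user queries, indicating that this method is an effective approach to semantic retrieval.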

7.
More people than ever before have access to information through the World Wide Web; both the volume of information and the number of users continue to expand. Traditional search methods based on keywords are not effective, resulting in large lists of documents, many of which are unrelated to users' needs. One way to improve information retrieval is to associate meaning with users' queries by using ontologies, knowledge bases that encode a set of concepts about one domain and their relationships. Encoding a knowledge base using a single ontology is usual, but a document collection can deal with different domains, each organized into an ontology. This work presents a novel way to represent and organize knowledge from distinct domains using multiple ontologies that can be related. The model allows the ontologies, as well as the relationships between concepts from distinct ontologies, to be represented independently. Additionally, fuzzy set theory techniques are employed to deal with knowledge subjectivity and uncertainty. This approach to organizing knowledge and an associated query expansion method are integrated into a fuzzy model for information retrieval based on multi-related ontologies. The performance of a search engine using this model is compared with another fuzzy-based approach for information retrieval and with the Apache Lucene search engine. Experimental results show that this model improves precision and recall measures.

8.
9.
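SISI: A Search Engine System Based on Web User Behavior   Total citations: 1, self-citations: 0, citations by others: 1
Guo Yan (郭岩), Computer Engineering (《计算机工程》), 2004, 30(16): 9-11, 13
This paper proposes SISI (Similar Interest, Similar access on Internet), a search engine based on Web user behavior. The query input to SISI is the URL of a Web document. SISI's retrieval model uses statistical methods to mine related documents from the frequency with which users access documents in Web logs, making full use of users' implicit judgments of document relatedness. The model rests on the assumption that documents visited by a group of people with similar interests are likely to be related. Compared with traditional search engines, SISI has a low initialization time cost and a low space cost. Its retrieval advantage is that it can find related documents that have no explicitly similar content; in particular, retrieval ignores document type, so text documents and multimedia documents are treated alike.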
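The core assumption, that documents visited by the same users are likely to be related, can be illustrated with a simple co-access count. This is a minimal sketch and not SISI's actual statistical model, which the abstract does not specify; the log format of `(user_id, url)` pairs and the function name `related_documents` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def related_documents(access_log, query_url, top_n=5):
    """access_log: iterable of (user_id, url) pairs taken from a Web log."""
    # Collect the set of documents each user has visited.
    visits = defaultdict(set)
    for user, url in access_log:
        visits[user].add(url)

    # Each user who visited the query document casts one vote for every
    # other document that the same user visited.
    scores = Counter()
    for urls in visits.values():
        if query_url in urls:
            scores.update(urls - {query_url})
    return scores.most_common(top_n)
```

Because only URLs and user identifiers are inspected, the sketch never looks at document content, which mirrors the abstract's point that text and multimedia documents are handled alike.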

10.
Engineers create engineering documents with their own terminologies, and want to search existing engineering documents quickly and accurately during a product development process. Keyword-based search methods have been widely used due to their ease of use, but their search accuracy has been often problematic because of the semantic ambiguity of terminologies in engineering documents and queries. The semantic ambiguity can be alleviated by using a domain ontology. Also, if queries are expanded to incorporate the engineer’s personalized information needs, the accuracy of the search result would be improved. Therefore, we propose a framework to search engineering documents with less semantic ambiguity and more focus on each engineer’s personalized information needs. The framework includes four processes: (1) developing a domain ontology, (2) indexing engineering documents, (3) learning user profiles, and (4) performing personalized query expansion and retrieval. A domain ontology is developed based on product structure information and engineering documents. Using the domain ontology, terminologies in documents are disambiguated and indexed. Also, a user profile is generated from the domain ontology. By user profile learning, user’s interests are captured from the relevant documents. During a personalized query expansion process, the learned user profile is used to reflect user’s interests. Simultaneously, user’s searching intent, which is implicitly inferred from the user’s task context, is also considered. To retrieve relevant documents, an expanded query in which both user’s interests and intents are reflected is then matched against the document collection. The experimental results show that the proposed approach can substantially outperform both the keyword-based approach and the existing query expansion method in retrieving engineering documents. 
Reflecting a user's information needs precisely has been identified as the most important factor underlying this notable improvement.

11.
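XQuery Techniques for Querying Book Information Based on XML Documents   Total citations: 2, self-citations: 0, citations by others: 2
Wei Yanjun (魏衍君), He Jieyue (何洁月), Microcomputer Development (《微机发展》), 2004, 14(4): 43-44, 47
Compared with HTML, XML has major advantages in data management and structured organization, and XQuery can query and process XML documents effectively. To query book information that is available online as XML documents, this paper designs a set of uniformly formatted XML documents as the data source for book information. Using XQuery, the query program is embedded in a Java program that handles format control; the Java program invokes and runs the XQuery program, the query results are saved to an XML file, and the resulting XML document is converted to the required format for output to the user. Preliminary experiments show that the method is feasible and practical, and that it provides a technical basis for XML-based information querying on the Web.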

12.
Today, digitally stored information isn't only ubiquitous, it's also increasing in volume at an exponential rate. And not only is the volume increasing, but so is the variety, as well as the ways of combining information from different sources to derive insights. Not surprisingly, our most pressing technological and business problem is finding what we need in this sea of information. The dominant paradigm for addressing this problem is information retrieval (Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto, ACM Press, 1999). In this paradigm, the user enters a query (typically a few words typed into a search box), and the system retrieves documents matching the query, ranking the matches based on an estimate of their relevance to the query. If the system finds many matches, the user sees only the highest-ranked matches. The popularity of Web search systems such as Google shows that the information retrieval paradigm can be effective. An information access framework empowers users by explicitly focusing on the interaction between users and the system. The key problem for information access systems isn't guessing which matching document is most relevant, but establishing a dialogue in which users progressively communicate their information goals while the system provides immediate, incremental feedback that guides users in the pursuit of those goals.

13.
A content-search information retrieval process based on conceptual graphs   Total citations: 1, self-citations: 0, citations by others: 1
An intelligent information retrieval system is presented in this paper. In our approach, which complies with the logical view of information retrieval, queries, document contents and other knowledge are represented by expressions in a knowledge representation language based on the conceptual graphs introduced by Sowa. Different kinds of probabilistic logic are often used to take the intrinsic vagueness of information retrieval into account, i.e. to search documents that are imprecisely and incompletely represented in order to answer a vague query. The search process described in this paper uses graph transformations instead of probabilistic notions. This paper focuses on the content-based retrieval process, and the cognitive facet of information retrieval is not directly addressed. However, our approach, which uses a knowledge representation language for representing data and a search process based on a combinatorial implementation of van Rijsbergen's logical uncertainty principle, also allows the representation of retrieval situations. Hence, we believe that it could be implemented at the core of an operational information retrieval system. Two applications, one dealing with academic libraries and the other concerning audiovisual documents, are briefly presented.

14.
Document similarity search finds documents similar to a given query document and returns a ranked list of them to the user. It is widely used in text and Web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because of its good ability to measure the similarity between two documents. In this paper, the quantitative performance of the above models is compared experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes into account the subtopic structures of documents. It first splits the documents into text segments with TextTiling and calculates the similarities for different pairs of text segments in the documents. The overall similarity between the documents is then obtained by combining the similarities of the segment pairs with an optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are validated as appropriate.
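For readers unfamiliar with the measures being compared, the sketch below shows the Cosine measure over raw term-frequency vectors and one simplified way of combining segment-level similarities. It makes several assumptions that depart from the abstract: segments are supplied by the caller rather than produced by TextTiling, and the combination step takes the average best match per segment instead of the optimal matching the paper describes. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def segment_similarity(segments_a, segments_b):
    """Average, over the segments of A, of the best cosine match among B's segments.

    A stand-in for optimal matching: segments of B may be reused, so this
    is an upper-bound style approximation, not a one-to-one assignment.
    """
    if not segments_a or not segments_b:
        return 0.0
    best = [max(cosine(sa, sb) for sb in segments_b) for sa in segments_a]
    return sum(best) / len(best)
```

Comparing `cosine` on whole documents with `segment_similarity` on their subtopic segments shows why structure matters: two documents can share vocabulary overall while no pair of their segments is actually about the same subtopic.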

15.
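The XBASE semantic document database system is a prototype document database system based on semantic information. Using semantic information such as external descriptions of documents and internal document features, it can effectively store, index, and query many types of documents, including structured, semi-structured, and unstructured ones. The system also provides a visual, multidimensional interactive browser for efficient browsing of the documents in the database.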

16.
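A Method for Querying Document Databases by Content and Structure   Total citations: 4, self-citations: 0, citations by others: 4
Documents have a logical structure: titles, chapters, sections, and paragraphs are the inherent logic of a document. Different users have different needs when searching documents, and how a retrieval system can provide meaningful information has long been a central research question. Combining document structure and content, this paper proposes a new similarity computation method for retrieving structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or sections. A question answering system was implemented on this basis, with Microsoft's Encarta encyclopedia as the test set; experimental comparison with traditional methods shows that the passages retrieved by this method are more reasonable and more useful.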

17.
The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented, showing that our approach generalizes well across several languages regardless of the initial conditions.
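Odds ratio term scoring, mentioned above as one of the more accurate query-generation methods, can be sketched as follows. This is a generic formulation under stated assumptions, not CorpusBuilder's code: documents are represented as sets of terms, probabilities are smoothed with an additive constant of 0.5, scores are kept on a log scale, and the function name is hypothetical.

```python
import math


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Each document is a set of terms; returns {term: log odds ratio}."""
    vocab = set().union(*relevant_docs, *nonrelevant_docs)
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for term in vocab:
        # Smoothed document frequencies keep both probabilities strictly
        # inside (0, 1), so the ratio below is always defined.
        p_rel = (sum(term in d for d in relevant_docs) + smoothing) / (n_rel + 2 * smoothing)
        p_non = (sum(term in d for d in nonrelevant_docs) + smoothing) / (n_non + 2 * smoothing)
        scores[term] = math.log((p_rel * (1 - p_non)) / ((1 - p_rel) * p_non))
    return scores
```

Terms with the highest scores are natural candidates for inclusion in the next query, and terms with the lowest scores for exclusion, which corresponds to the inclusion/exclusion term selection the abstract refers to.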

18.
Fensel, D. Computer, 2002, 35(11): 56-59
The Web's very popularity is making it more difficult to find, present, and maintain the data needed by users with a wide range of tasks and computer skills. Existing document management systems use keyword matching as a search method, combined with information retrieval rather than query answering. In addition, these systems offer limited information-sharing facilities, and they do not support different views on documents or information maintenance. To address these weaknesses, a European consortium formed the On-to-Knowledge Project to build an ontology-based tool suite that efficiently processes the many heterogeneous, distributed, and semistructured documents typically found in intranets and on the Web. The consortium's approach integrates Semantic Web search technology, document exchange via transformation operators, automated information extraction, and systematic support for information maintenance and user-specific views. The paper considers how On-to-Knowledge's tools exploit the power of ontologies to provide automated support for acquiring, maintaining, and accessing weakly structured information sources.

19.
Conventional thought from the Semantic Web community equates the use of ontologies with the representation of the meaning of content. Here, we skew this viewpoint by describing our ontology, Web Authoring for Accessibility (WAfA), which investigates the way ontologies can describe the semantic structure of documents. By understanding the way heterogeneous XHTML (Extensible Hypertext Mark-up Language) documents are structured, we can better transform documents that are currently inaccessible to visually impaired users. WAfA performs two tasks: (1) it allows us to flexibly model an XHTML document within the context of navigation and orientation through the Web resource; (2) it enables non-expert users to quickly annotate a Web document by providing a 'lingua franca' between author and Web Accessibility Domain Experts. Here we describe our ontology, its use, novelty, and importance.

20.
EDCMS: A Content Management System for Engineering Documents   Total citations: 1, self-citations: 0, citations by others: 1
Engineers often need to look for the right pieces of information by sifting through long engineering documents. It is a very tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways to help information users, including engineers, to access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (study of document structure), document mark-up (with EXtensible Mark-up Language (XML), HyperText Mark-up Language (HTML), and Scalable Vector Graphics (SVG)), and a facetted classification mechanism. Document content extraction is implemented via computer programming (with Java). An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that, as information providers, we can make document content more accessible for information users, including engineers. The main features of the EDCMS system are: 1) EDCMS enables users, especially engineers, to access and retrieve information at content level rather than document level. In other words, it provides the right pieces of information that answer specific questions, so that engineers do not need to waste time sifting through the whole document to obtain the required piece of information. 2) Users can access engineering document content through the EDCMS via both the data and the metadata of a document. 3) Users can use the EDCMS to access and retrieve content objects, i.e. text, images and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents, a textbook of CADCAM, and a set of project posters in the Engineering Design domain. Experimental results show that the system provides information users with a powerful solution for accessing document content.
