首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We describe a method for generating queries for retrieving data from distributed heterogeneous semistructured documents, and its implementation in the metadata interface DDXMI (distributed document XML metadata interchange). The proposed system generates local queries appropriate to local schemas from a user query over the global schema. The system constructs mappings between global schema and local schemas (extracted from local documents if not given), path substitution, and node identification for resolving the heterogeneity among nodes with the same label that often exist in semistructured data. The system uses Quilt as its XML query language. An experiment is reported over three local semistructured documents: ‘thesis’, ‘reports’, and ‘journal’ documents with ‘article’ global schema. The prototype was developed under Windows system with Java and JavaCC.  相似文献   

2.
The widespread adoption of XML holds the promise that document structure can be exploited to specify precise database queries. However, users may have only a limited knowledge of the XML structure, and may be unable to produce a correct XQuery expression, especially in the context of a heterogeneous information collection. The default is to use keyword-based search and we are all too familiar with how difficult it is to obtain precise answers by these means. We seek to address these problems by introducing the notion of Meaningful Query Focus (MQF) for finding related nodes within an XML document. MQF enables users to take full advantage of the preciseness and efficiency of XQuery without requiring (perfect) knowledge of the document structure. Such a Schema-Free XQuery is potentially of value not just to casual users with partial knowledge of schema, but also to experts working in data integration or data evolution. In such a context, a schema-free query, once written, can be applied universally to multiple data sources that supply similar content under different schemas, and applied “forever” as these schemas evolve. Our experimental evaluation found that it is possible to express a wide variety of queries in a schema-free manner and efficiently retrieve correct results over a broad diversity of schemas. Furthermore, the evaluation of a schema-free query is not expensive: using a novel stack-based algorithm we developed for computing MQF, the overhead is from 1 to 4 times the execution time of an equivalent schema-aware query. The evaluation cost of schema-free queries can be further reduced by as much as 68% using a selectivity-based algorithm we develop to enable the integration of MQF operation into the query pipeline.  相似文献   

3.
Boolean query mapping across heterogeneous information sources   总被引:5,自引:0,他引:5  
Searching over heterogeneous information sources is difficult because of the nonuniform query languages. Our approach is to allow a user to compose Boolean queries in one rich front end language. For each user query and target source, we transform the user query into a subsuming query that can be supported by the source but that may return extra documents. The results are then processed by a filter query to yield the correct final result. We introduce the architecture and associated algorithms for generating the supported subsuming queries and filters. We show that generated subsuming queries return a minimal number of documents; we also discuss how minimal cost filters can be obtained. We have implemented prototype versions of these algorithms and demonstrated them on heterogeneous Boolean systems  相似文献   

4.
5.
We consider data exchange for XML documents: given source and target schemas, a mapping between them, and a document conforming to the source schema, construct a target document and answer target queries in a way that is consistent with the source information. The problem has primarily been studied in the relational context, in which data-exchange systems have also been built. Since many XML documents are stored in relations, it is natural to consider using a relational system for XML data exchange. However, there is a complexity mismatch between query answering in relational and in XML data exchange. This indicates that to make the use of relational systems possible, restrictions have to be imposed on XML schemas and mappings, as well as on XML shredding schemes. We isolate a set of five requirements that must be fulfilled in order to have a faithful representation of the XML data-exchange problem by a relational translation. We then demonstrate that these requirements naturally suggest the in-lining technique for data-exchange tasks. Our key contribution is to provide shredding algorithms for schemas, documents, mappings and queries, and demonstrate that they enable us to correctly perform XML data-exchange tasks using a relational system.  相似文献   

6.
7.
The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.  相似文献   

8.
Traditional information search in which queries are posed against a known and rigid schema over a structured database is shifting toward a Web scenario in which exposed schemas are vague or absent and data come from heterogeneous sources. In this framework, query answering cannot be precise and needs to be relaxed, with the goal of matching user requests with accessible data. In this paper, we propose a logical model and a class of abstract query languages as a foundation for querying relational data sets with vague schemas. Our approach relies on the availability of taxonomies, that is, simple classifications of terms arranged in a hierarchical structure. The model is a natural extension of the relational model in which data domains are organized in hierarchies, according to different levels of generalization between terms. We first propose a conservative extension of the relational algebra for this model in which special operators allow the specification of relaxed queries over vaguely structured information. We also study equivalence and rewriting properties of the algebra that can be used for query optimization. We then illustrate a logic-based query language that can provide a basis for expressing relaxed queries in a declarative way. We finally investigate the expressive power of the proposed query languages and the independence of the taxonomy in this context.  相似文献   

9.
王进鹏  张亚非  苗壮 《计算机科学》2010,37(12):134-137
为实现异构关系数据库的语义集成,针对传统集成技术存在的问题,在对语义网等相关技术进行分析的基础上,研究基于本体的关系数据集成系统中的查询处理问题,提出了一种基于本体的关系数据库集成框架。设计了基于本体的关系数据的描述方法,使用本体作为集成的全局模式来描述关系模式的语义。设计了查询重写算法,该算法可以将基于全局模式的SPARQL查询重写为针对具体关系数据库的查询,从而实现对异构关系数据库的集成。实验表明,该算法具有良好的可扩展性。  相似文献   

10.
In the context of information retrieval (IR) from text documents, the term weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences of terms in documents and it also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged as achieving that is expensive and maybe infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposed approach is a heuristic method for improving the IR effectiveness and performance and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-words removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both, stop-words removal and the discriminative approach, have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgement for the collection.  相似文献   

11.
In this paper, we develop techniques to produce interoperable queries with object and relational databases. A user poses a local query in a local query language, against a local object or relational schema. We transparently produce appropriate queries with respect to a remote target object or relational schema, corresponding to some remote database which contains data relevant to the user's query. Mapping knowledge to resolve representational heterogeneities in local and remote schemas is expressed in a canonical representation, CRmapping, and is independent of the particular data model. A canonical representation CRquery is also used to resolve heterogeneities of query languages. A set of heterogeneous transformation algorithms define the appropriate transformations from the local queries to the remote queries. The use of canonical representations (CR) allows us to represent queries independent of the particular query language, and to resolve representational conflicts in a uniform manner, independent of models and query languages.  相似文献   

12.
Keyword proximity search in XML trees   总被引:3,自引:0,他引:3  
Recent works have shown the benefits of keyword proximity search in querying XML documents in addition to text documents. For example, given query keywords over Shakespeare's plays in XML, the user might be interested in knowing how the keywords cooccur. In this paper, we focus on XML trees and define XML keyword, proximity queries to return the (possibly heterogeneous) set of minimum connecting trees (MCTs) of the matches to the individual keywords in the query. We consider efficiently executing keyword proximity queries on labeled trees (XML) in various settings: 1) when the XML database has been preprocessed and 2) when no indices are available on the XML database. We perform a detailed experimental evaluation to study the benefits of our approach and show that our algorithms considerably outperform prior algorithms and other applicable approaches.  相似文献   

13.
异构数据源集成系统查询分解和优化的实现   总被引:54,自引:0,他引:54  
王宁  王能斌 《软件学报》2000,11(2):222-228
通用异构数据源集成系统需要集成包括WWW在内的各种数据源,有些数据源既无规则的模式结构,又无强有力的查询功能,给全局查询的分解和优化造成一定的困难.异构数据源集成系统Versatile一方面利用局部动态字典的模板操作构造集成系统全局动态字典,作为查询分解和优化的依据.一方面采用基于缓存和数据源能力的查询分解和优化策略,以便充分利用数据源的查询能力,简化包装器的设计,并取得较高的查询效率.  相似文献   

14.
Peers in a peer-to-peer data management system often have heterogeneous schemas and no mediated global schema. To translate queries across peers, we assume each peer provides correspondences between its schema and a small number of other peer schemas. We focus on query reformulation in the presence of heterogeneous XML schemas, including data–metadata conflicts. We develop an algorithm for inferring precise mapping rules from informal schema correspondences. We define the semantics of query answering in this setting and develop query translation algorithm. Our translation handles an expressive fragment of XQuery and works both along and against the direction of mapping rules. We describe the HePToX heterogeneous P2P XML data management system which incorporates our results. We report the results of extensive experiments on HePToX on both synthetic and real datasets. We demonstrate our system utility and scalability on different P2P distributions.  相似文献   

15.
目前XML查询语言及查询界面对Web用户过于复杂,该文描述了一种XML文档索引机制,在此基础上建立了一个通用的贝叶斯网络查询模型。用户只需在交互界面输入自然语言描述的查询,系统就能对其实现基于语义的构造,由它生成多个结构化查询;对这些查询建立贝叶斯网络,计算查询在给定文档下的概率,选择概率最大的前3个查询提交给系统执行。  相似文献   

16.
A Knowledge-Based Approach to Effective Document Retrieval   总被引:3,自引:0,他引:3  
This paper presents a knowledge-based approach to effective document retrieval. This approach is based on a dual document model that consists of a document type hierarchy and a folder organization. A predicate-based document query language is proposed to enable users to precisely and accurately specify the search criteria and their knowledge about the documents to be retrieved. A guided search tool is developed as an intelligent natural language oriented user interface to assist users formulating queries. Supported by an intelligent question generator, an inference engine, a question base, and a predicate-based query composer, the guided search collects the most important information known to the user to retrieve the documents that satisfy users' particular interests. A knowledge-based query processing and search engine is devised as the core component in this approach. Algorithms are developed for the search engine to effectively and efficiently retrieve the documents that match the query.  相似文献   

17.
We present a virtual reality application called VR-VIBE which is intended to support the co-operative browsing and filtering of large document stores. VR-VIBE extends a visualisation approach proposed in a previous two dimensional system called VIBE into three dimensions, allowing more information to be visualised at one time and supporting more powerful styles of interaction, The essence of VR-VIBE is that multiple users can explore the results of applying several simultaneous queries to a corpus of documents. By arranging the queries into a spatial framework, the system shows the relative attraction of each document to each query by its spatial position and also shows the absolute relevance of each document to all of the queries. Users may then navigate the space, select individual documents, control the display according to a dynamic relevance threshold and dynamically drag the queries to new positions to see the effect on the document space. Co-operative browsing is supported by directly embodying users and providing them with the ability to interact over live audio connections and to attach brief textual annotations to individual documents. Finally, we conclude with some initial observations gleaned from our experience of constructing VR-VIBE and using it in the laboratory setting.  相似文献   

18.
一种基于结构索引的XML模式匹配方法   总被引:2,自引:0,他引:2  
XML文档采用了树型的数据模型,对其查询通常是用带有选择谓词的模式树在XML数据中进行匹配.因此,找出XML文档中所有符合模式树结构的元素集,是XML查询处理的核心操作.本文提出了结构索引JoinGuide,并在此基础上提出了一种新的XML模式匹配方法.它使用JoinGuide来对模式树进行预匹配,这样在XML文档上查询时可以利用索引上的匹配结果来忽略部分连接谓词和不必要的候选XML元素序列.本文还提出了三种具体算法来利用索引匹配结果进行进一步的查询.实验结果表明本文中的模式树匹配方法优于以往的匹配方法,并且索引所需的空间很小.  相似文献   

19.
In this paper, we address the problem of document re-ranking in information retrieval, which is usually conducted after initial retrieval to improve rankings of relevant documents. To deal with this problem, we propose a method which automatically constructs a term resource specific to the document collection and then applies the resource to document re-ranking. The term resource includes a list of terms extracted from the documents as well as their weighting and correlations computed after initial retrieval. The term weighting based on local and global distribution ensures the re-ranking not sensitive to different choices of pseudo relevance, while the term correlation helps avoid any bias to certain specific concept embedded in queries. Experiments with NTCIR3 data show that the approach can not only improve performance of initial retrieval, but also make significant contribution to standard query expansion.  相似文献   

20.
Many large programs operate on collection types. Extensive libraries are available in many programming languages, such as the C++ Standard Template Library, which make programming with collections convenient. Extending programming languages to provide collection queries as first class constructs in the language would not only allow programmers to write queries explicitly in their programs but it would also allow compilers to leverage the wealth of experience available from the database domain to optimize such queries. This paper describes an approach to reduce the run time of programs involving explicit collection queries by performing run time query optimization that is effective for single runs of a program. In addition, it also leverages a cache to store previously computed results. The proposed approach relies on histograms built from the data at run time to estimate the selectivity of joins and predicates in order to construct query plans. Information from earlier executions of the same query during run time is leveraged during the construction of the query plans, even when the data has changed between these executions. An effective cache policy is also determined for caching the results of join (sub) queries. The cache is maintained incrementally, when the underlying collections change, and use of the cache space is optimized by a cache replacement policy. Our approach has been implemented within the Java Query Language (JQL) framework using AspectJ. Our approach demonstrated that its run time query optimization in integration with caching sub query result significantly improves the run time of programs with explicit queries over equivalent programs performing collection operations by iterating over those collections. This paper evaluates our approach using synthetic as well as real world Robocode programs by comparing it to JQL as a benchmark. Experimental results show that our approach performs better than the JQL approach with respect to the program run time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号