Found 20 similar documents (search time: 0 ms)
1.
Twig query pattern matching is a core operation in XML query processing, and indexing XML documents for twig query processing is of fundamental importance to effective information retrieval. In practice, many XML documents on the web are heterogeneous and have their own formats; documents describing related information can have different structures. As a result, documents of interest to the user whose structures are similar, but not identical, to the query are often missed. In this paper, we propose RRSi, a novel structural index designed for structure-based query lookup over heterogeneous sources of XML documents that supports proximate query answers. The index avoids unnecessary processing of structurally irrelevant candidates that might show good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space requirements and computational complexity. To our knowledge, these structural indexes are the first to support proximity twig queries on XML documents. The results of our preliminary experiments show that query processing based on RRSi and oRRSi significantly outperforms previously proposed techniques on XML repositories with structural heterogeneity.
Vincent T. Y. Ng
2.
3.
The flexibility of the XML data model allows a more natural representation of uncertain data than the relational model. Matching twig patterns against XML data is a fundamental problem in querying information from XML documents. For a probabilistic XML document, each twig answer has a probability value because of the uncertainty of the data. Twig answers with small probability values are useless to users, who usually want only the answers with the k largest probability values. However, existing algorithms for ordinary XML documents are not directly applicable, because of the need to handle probability-distribution nodes and to efficiently calculate the top-k probabilities of answers in probabilistic XML. In this paper, we address the problem of finding twig answers with the top-k probability values directly on probabilistic XML documents. We propose a new encoding scheme for probabilistic XML called PEDewey, and based on it we design two algorithms for finding the answers with the top-k probabilities for twig queries: ProTJFast, which processes probabilistic XML data based on element streams in document order, and PTopKTwig, which is based on element streams ordered by path probability values. Experiments have been conducted to study the performance of these algorithms.
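As an aside on the top-k constraint above: keeping only the k answers with the largest probability values is naturally done with a bounded min-heap, so a candidate can be discarded as soon as its probability cannot reach the current k-th largest value. The sketch below is a generic illustration, not the ProTJFast/PTopKTwig machinery; the (answer, probability) pairs are hypothetical inputs.

```python
import heapq

def top_k_by_probability(candidates, k):
    """Keep the k twig answers with the largest probability values.

    `candidates` yields (answer, probability) pairs in any order; a
    size-k min-heap discards an answer as soon as its probability
    cannot reach the current k-th largest value.
    """
    heap = []  # min-heap of (probability, answer)
    for answer, prob in candidates:
        if len(heap) < k:
            heapq.heappush(heap, (prob, answer))
        elif prob > heap[0][0]:
            heapq.heapreplace(heap, (prob, answer))
    return sorted(heap, reverse=True)  # highest probability first

# Hypothetical (answer, probability) pairs:
print(top_k_by_probability([("a1", 0.2), ("a2", 0.9), ("a3", 0.5)], 2))
```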
4.
Zhu Fangzhou, Li Guohui, Li Li, Zhao Xiaosong, Zhang Cong. Peer-to-Peer Networking and Applications, 2013, 6(4): 363-379
Most existing location-dependent query processing methods are based on the client-server model. However, with the increasing number of smart mobile devices, a large volume of data may have to be processed on the server side, and the server can become a performance bottleneck. This paper takes the first step towards processing probabilistic nearest neighbor queries of uncertain data objects via wireless data broadcast (BPNN). Our method leverages the key properties of Voronoi Diagrams for Uncertain Data (UV-Diagrams). To preserve the good properties of the UV-Diagram, UV-Hilbert-Partition, which exploits the locality-preserving property of the Hilbert curve, is proposed to partition the UV-Diagram into grid cells, called Hilbert-Cells. A specialized organizing method is then proposed: for a given UV-Diagram, the CellFrame structure, which can be located from the coordinates of a query client, is used to minimize the broadcast cycle and hold the probabilistic nearest-neighbor results. Based on the sequence of CellFrames, a distributed index, called UVHilbert-DI, is proposed to support BPNN query processing. Finally, efficient BPNN algorithms based on UVHilbert-DI are presented, and extensive experiments have been conducted to demonstrate the performance of our approaches.
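The abstract hinges on the Hilbert curve's locality-preserving behavior. It does not spell out UV-Hilbert-Partition, so the following sketch only shows the standard mapping from a grid cell (x, y) to its Hilbert index, which such a scheme could use to order Hilbert-Cells for a broadcast cycle; the grid side n (a power of two) and the 4 x 4 example are assumptions.

```python
def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so every sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the cells of a hypothetical 4 x 4 partition along the curve,
# e.g. to lay out Hilbert-Cells in a broadcast cycle:
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: xy2d(4, *c))
print(cells)
```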
5.
Given a set S of points in the two-dimensional space, stored in a spatial database, this paper presents an efficient algorithm to find, in the area delimited by those points, the empty circle with the largest area that contains only a query point q. Our algorithm adapts previous work in the field of computational geometry for use in spatial databases, which requires managing large amounts of data. The basic idea is to discard a large part of the points of S so that the problem can be solved by providing only the remaining points to a classical computational geometry algorithm, which, by processing a smaller collection of points, saves main-memory resources and computation time. The correctness of our algorithm is formally proven. In addition, we empirically show its efficiency and scalability through a set of experiments using both synthetic and real data.
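To make the problem statement concrete, here is a naive grid-search stand-in, not the paper's algorithm (which prunes S and then delegates to an exact computational-geometry routine): for any candidate center, the largest empty circle has radius equal to the distance to the nearest point of S, and it qualifies only if it covers q. The bounding box and step count are assumptions.

```python
import math

def largest_empty_circle_naive(S, q, bbox, steps=200):
    """Brute-force stand-in: try candidate centers on a grid over bbox
    and keep the largest circle that avoids all of S yet covers q."""
    xmin, ymin, xmax, ymax = bbox
    best_r, best_c = 0.0, None
    for i in range(steps + 1):
        for j in range(steps + 1):
            c = (xmin + (xmax - xmin) * i / steps,
                 ymin + (ymax - ymin) * j / steps)
            r = min(math.dist(c, p) for p in S)  # empty of S up to radius r
            if math.dist(c, q) < r and r > best_r:  # circle must contain q
                best_r, best_c = r, c
    return best_r, best_c

S = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(largest_empty_circle_naive(S, q=(2, 2), bbox=(0, 0, 4, 4)))
```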
6.
XML instances are not necessarily self-contained but may have connections to remote XML data residing on other servers. In this paper, we show that, in spite of its minor support and use in the XML world, the XLink language provides a powerful mechanism for expressing such links, both from the modeling point of view and for actually querying interlinked XML data. In our dbxlink approach, the links are not treated as explicit links (where users must be aware of the links and traverse them explicitly in their queries) but define views that combine into a logical, transparent XML model which serves as an external schema and can be queried by XPath/XQuery. We motivate the underlying modeling and give a concise and declarative specification as an XML-to-XML mapping. We also describe the implementation of the model as an extension of the eXist [eXist: an Open Source Native XML Database, http://exist-db.org/] XML database system. The approach can be applied both for distribution of data and for integration of data from autonomous sources.
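The dbxlink idea, links as transparent views rather than edges the user must traverse, can be illustrated by materializing the view: splice the target of each xlink:href into the document before querying. A minimal sketch in which a local dict stands in for remote fetches; eXist's actual extension resolves links inside the query engine rather than by rewriting the document.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Stand-in for remote servers: href -> XML source.
REMOTE = {"http://other.example/dept.xml": "<dept><name>Sales</name></dept>"}

def materialize(root):
    """Replace each child carrying xlink:href with the linked document,
    yielding the transparent, self-contained view that queries see."""
    for parent in root.iter():
        for i, child in enumerate(list(parent)):
            href = child.get(XLINK_HREF)
            if href and href in REMOTE:
                parent[i] = ET.fromstring(REMOTE[href])
    return root

doc = ET.fromstring(
    '<emp xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<name>Ann</name><dept xlink:href="http://other.example/dept.xml"/></emp>')
view = materialize(doc)
print(view.find("./dept/name").text)  # queries now run on the combined view
```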
7.
Optimizing queries using materialized views has not been addressed adequately in the context of XML due to the many limitations associated with the definition and usability of materialized views in traditional XML query evaluation models.
8.
Selectivity estimation is an integral part of query optimization. In this paper, we propose to approximate the data density functions of relations by cosine series and use the approximations to estimate the selectivities of range queries. We lay down the foundation for applying cosine series to range query size estimation and compare the approach with notable alternatives such as wavelets, the DCT, kernel splines, sketches, and Legendre polynomials. Experimental results show that our approach is simple to construct, easy to update, and fast at estimation. It also yields accurate estimates, especially in multi-dimensional cases.
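The core of the approach fits in a few lines: approximate the attribute's density on [0, 1] by an orthogonal cosine series, f(x) ≈ 1 + Σ a_k cos(kπx) with a_k = (2/N) Σ_i cos(kπ x_i), and integrate the series over the query range. A one-dimensional sketch under standard orthogonal-series assumptions; the paper's construction, update scheme, and multi-dimensional case go further.

```python
import math, random

def build_coeffs(values, m=16):
    """Cosine-series coefficients of the data density on [0, 1]."""
    n = len(values)
    return [2.0 / n * sum(math.cos(k * math.pi * x) for x in values)
            for k in range(1, m + 1)]

def estimate_range(coeffs, n, lo, hi):
    """Estimated number of tuples in [lo, hi]: integrate the series."""
    est = hi - lo  # the constant term: the density integrates to 1
    for k, a in enumerate(coeffs, start=1):
        est += a * (math.sin(k * math.pi * hi)
                    - math.sin(k * math.pi * lo)) / (k * math.pi)
    return n * est

data = [random.betavariate(2, 5) for _ in range(10000)]  # skewed attribute
coeffs = build_coeffs(data)
print(estimate_range(coeffs, len(data), 0.1, 0.3),   # estimate ...
      sum(0.1 <= x <= 0.3 for x in data))            # ... vs. true count
```

Updates are cheap because each coefficient a_k is a running sum over the data values, so an insert or delete touches each coefficient once.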
9.
With the increasing popularity of XML applications in enterprise and big-data systems, efficient query optimizers are becoming essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best possible query execution plan. In this work, we propose a novel selectivity estimator, called XHQE, which is a hybrid of a structural synopsis and structural statistics. The structural synopsis enhances the accuracy of estimation, and the structural statistics make it scalable to the allocated memory space. The structural synopsis is generated by labeling the nodes of the source XML dataset using a fingerprint function and merging subtrees with similar fingerprints (i.e., similar structures). The generated structural synopsis and structural statistics are then used to estimate the selectivity of given queries. We studied the performance of the proposed approach using different types of queries and four benchmark datasets with different structural characteristics, comparing XHQE with existing algorithms such as Sampling, TreeSketch, and a histogram-based algorithm. The experimental results show that XHQE is significantly better than the other algorithms in terms of estimation accuracy and scalability for semi-uniform datasets. For non-uniform datasets, the proposed algorithm has estimation accuracy comparable to TreeSketch even when the allocated memory size is greatly reduced, and its estimation-data generation time is much lower (e.g., TreeSketch took more than 50 times longer than our approach on the XMark dataset). Compared to the histogram-based algorithm, our approach supports regular twig queries in addition to achieving higher accuracy when both run under similar memory constraints.
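The synopsis construction can be sketched as follows: give every node a fingerprint determined by its tag and its children's fingerprints, then collapse structurally identical subtrees into one synopsis entry with a count. The concrete fingerprint function below (a hash over the tag and sorted child fingerprints) is a hypothetical stand-in; the abstract does not fix one, and real XHQE additionally maintains structural statistics.

```python
import hashlib
from collections import Counter

def fingerprint(tag, children):
    """Structural fingerprint: depends only on the tag and the sorted
    child fingerprints, so identically structured subtrees collide."""
    inner = ",".join(sorted(fingerprint(t, c) for t, c in children))
    return hashlib.sha1(f"{tag}({inner})".encode()).hexdigest()[:12]

def build_synopsis(tree, counts=None):
    """Collapse a (tag, children) tree into fingerprint -> count."""
    if counts is None:
        counts = Counter()
    tag, children = tree
    counts[fingerprint(tag, children)] += 1
    for child in children:
        build_synopsis(child, counts)
    return counts

doc = ("catalog", [("book", [("title", []), ("author", [])]),
                   ("book", [("title", []), ("author", [])])])
print(build_synopsis(doc))  # both <book> subtrees share one entry
```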
10.
The prevalence of XML in discussions of the Web and its application to diverse domains in the past year (1998-99) have almost eclipsed the fact that the 1.0 specification, approved as a W3C Recommendation in February 1998, is only the first part of the "structured documents" effort originally envisioned. The specifications for two more parts, hypertext link types and the stylesheet language, are nearing completion, and the W3C chartered new working groups last year (1998) to generate new members of the family.
11.
XML data stores: emerging practices
We survey emerging native XML storage approaches and identify and highlight popular implementations tailored to XML's "nature" and syntax. By understanding the storage practices of emerging native XML environments, programmers and software designers can better exploit the technology's scalability and reliability benefits. Because XML is rapidly becoming the Internet standard for data representation and exchange, efficient XML document storage has become a core data management issue. Most early XML storage practices rely on conventional database management systems. However, such systems involve mappings and transformations between XML and the underlying database structure. More recent efforts are based on XML-tailored systems that provide ad hoc functionalities. This overview of emerging XML storage approaches highlights current practices along with prospective research and implementation trends.
12.
13.
The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a result, disk is becoming slower from the perspective of applications because of the much larger data volumes they need to store and process. In database systems, the less data involved in query processing, the better the performance. The disk-based join is a common but time-consuming database operation, especially in massive-data environments where I/O cost dominates execution time. However, current join algorithms are only suitable for moderate or small data volumes; they incur high I/O cost on massive data because of multi-pass I/O operations on the joined tables and their insensitivity to join selectivity. This paper proposes PI-Join, a novel disk-based join algorithm that can efficiently process join queries involving massive data. PI-Join consists of two stages: the JPIPT construction stage (JCS) and the result output stage (ROS). JCS performs a cache-conscious construction algorithm on the join attributes, which are kept in a column-oriented model, to quickly obtain the join positional index pair table (JPIPT) of the join results. The obtained JPIPT is used in ROS to retrieve results in a one-pass sequential selective scan of each table. We provide a correctness proof and cost analysis of PI-Join. Our experimental results indicate that PI-Join has a significant advantage over existing join algorithms.
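The two-stage shape of PI-Join can be shown in miniature: the JCS stage builds the positional index pair table (JPIPT) from the narrow join columns alone, and the ROS stage fetches full rows with one position-ordered pass over each table. A toy in-memory sketch; the paper's contribution lies in making the first stage cache-conscious and the second a single sequential selective disk scan per table.

```python
from collections import defaultdict

def build_jpipt(col_r, col_s):
    """JCS stage: hash one join column and probe with the other; only
    the narrow join columns are touched, never the full rows."""
    positions = defaultdict(list)
    for j, v in enumerate(col_s):
        positions[v].append(j)
    return [(i, j) for i, v in enumerate(col_r)
            for j in positions.get(v, ())]

def fetch(jpipt, rows_r, rows_s):
    """ROS stage: sort the needed positions so each table can be read in
    a single forward (selective) pass, then assemble the join results."""
    r = {i: rows_r[i] for i in sorted({i for i, _ in jpipt})}
    s = {j: rows_s[j] for j in sorted({j for _, j in jpipt})}
    return [(r[i], s[j]) for i, j in jpipt]

rows_r = [("r1", 10), ("r2", 20)]
rows_s = [("s1", 20), ("s2", 10), ("s3", 10)]
jpipt = build_jpipt([v for _, v in rows_r], [v for _, v in rows_s])
print(fetch(jpipt, rows_r, rows_s))
```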
14.
Marina Drosou, Evaggelia Pitoura. The VLDB Journal, 2013, 22(6): 849-874
The typical user interaction with a database system is through queries. However, users often do not have a clear understanding of their information needs or of the exact content of the database. In this paper, we propose assisting users in database exploration by recommending additional items, called Ymal ("You May Also Like") results, that, although not part of the result of their original query, appear to be highly related to it. Such items are computed based on the most interesting sets of attribute values, called faSets, that appear in the result of the original query. The interestingness of a faSet is defined based on its frequency in the query result and in the database. Database frequency estimations rely on a novel approach of maintaining a set of representative rare faSets. We have implemented our approach and report results regarding both its performance and its usefulness.
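The interestingness measure invites a compact sketch: enumerate the sets of attribute-value pairs (faSets) occurring in the result and score each by its result frequency relative to its database frequency. One hedge: the paper estimates database frequencies from a maintained set of representative rare faSets, whereas this sketch simply scans the database.

```python
from itertools import combinations
from collections import Counter

def fasets(tuples, max_size=2):
    """Count the sets of (attribute, value) pairs in a list of dicts."""
    counts = Counter()
    for t in tuples:
        items = sorted(t.items())
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return counts

def interestingness(result, database, max_size=2):
    """Result frequency over database frequency; higher = more interesting."""
    res, db = fasets(result, max_size), fasets(database, max_size)
    return {f: (c / len(result)) / (db[f] / len(database))
            for f, c in res.items() if db[f]}

db = [{"genre": "sci-fi", "year": 1968}, {"genre": "drama", "year": 1968},
      {"genre": "sci-fi", "year": 1977}, {"genre": "drama", "year": 1997}]
result = [t for t in db if t["genre"] == "sci-fi"]  # the original query
print(interestingness(result, db))  # faSets scoring above 1 drive Ymal picks
```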
15.
Noga Alon. Journal of Computer and System Sciences, 2003, 66(4): 688-727
We investigate the typechecking problem for XML queries: statically verifying that every answer to a query conforms to a given output DTD, for inputs satisfying a given input DTD. This problem was previously studied by a subset of the authors in a simplified framework that captured the structure of XML documents but ignored data values. We revisit the typechecking problem here in the more realistic case where data values are present in documents and tested by queries. In this extended framework, typechecking quickly becomes undecidable. However, it remains decidable for large classes of queries and DTDs of practical interest. The main contribution of the present paper is to trace a fairly tight boundary of decidability for typechecking with data values. The complexity of typechecking in the decidable cases is also considered.
16.
PSoup: a system for streaming queries over streaming data
Recent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also require the processing of data that arrives prior to query submission or during a period of disconnection. For such applications, we have developed PSoup, a system that combines the processing of ad hoc and continuous queries by treating data and queries symmetrically, allowing new queries to be applied to old data and new data to be applied to old queries. PSoup also supports intermittent connectivity by separating the computation of query results from the delivery of those results. PSoup builds on adaptive query-processing techniques developed in the Telegraph project at UC Berkeley. In this paper, we describe PSoup and present experiments that demonstrate the effectiveness of our approach.
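PSoup's symmetry, new data probing stored queries and new queries probing stored data, fits a small sketch. Selection predicates over a single stream stand in for the paper's queries, and materialized result lists model the separation of computation from delivery.

```python
class PSoup:
    """Toy symmetric store: data and queries are both first-class streams."""
    def __init__(self):
        self.data = []      # all data seen so far
        self.queries = {}   # qid -> predicate
        self.results = {}   # qid -> materialized matches, awaiting delivery

    def on_data(self, item):
        self.data.append(item)
        for qid, pred in self.queries.items():  # new data joins old queries
            if pred(item):
                self.results[qid].append(item)

    def on_query(self, qid, pred):
        self.queries[qid] = pred
        self.results[qid] = [x for x in self.data if pred(x)]  # old data

    def fetch(self, qid):
        """Called on (re)connection: results survive disconnection."""
        out, self.results[qid] = self.results[qid], []
        return out

ps = PSoup()
ps.on_data(5)
ps.on_query("q1", lambda x: x > 3)  # applies to data that arrived earlier
ps.on_data(9)
print(ps.fetch("q1"))               # [5, 9]
```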
17.
18.
19.
Mohamed F. Mokbel, Walid G. Aref. The VLDB Journal, 2008, 17(5): 971-995
This paper presents the scalable on-line execution (SOLE) algorithm for continuous and on-line evaluation of concurrent continuous spatio-temporal queries over data streams. Incoming spatio-temporal data streams are processed in memory against a set of outstanding continuous queries. The SOLE algorithm utilizes the scarce memory resource efficiently by keeping track of only the significant objects; in-memory stored objects are expired (i.e., dropped) from memory once they become insignificant. SOLE is a scalable algorithm in which all outstanding continuous queries share the same buffer pool. In addition, SOLE is presented as a spatio-temporal join between two input streams: a stream of spatio-temporal objects and a stream of spatio-temporal queries. To cope with intervals of high arrival rates of objects and/or queries, SOLE utilizes a load-shedding approach in which some of the stored objects are dropped from memory. SOLE is implemented as a pipelined query operator that can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Performance experiments based on a real implementation of SOLE inside a prototype data stream management system show the scalability and efficiency of SOLE in highly dynamic environments.
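The significance-based expiry at the heart of SOLE can be sketched: an object stays in the shared buffer only while it is significant to at least one outstanding query, and under overload some stored objects are shed. A simplified sketch with static rectangular range queries (an assumption; SOLE handles moving queries and runs as a pipelined operator).

```python
class Sole:
    """Toy shared buffer: objects live only while significant to a query."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = {}   # qid -> (xmin, ymin, xmax, ymax)
        self.objects = {}   # oid -> (x, y); one pool shared by all queries

    def add_query(self, qid, rect):
        self.queries[qid] = rect

    def significant(self, pos):
        x, y = pos
        return any(x0 <= x <= x1 and y0 <= y <= y1
                   for x0, y0, x1, y1 in self.queries.values())

    def on_object(self, oid, pos):
        if not self.significant(pos):
            self.objects.pop(oid, None)  # expire: no query cares about it
            return []
        if len(self.objects) >= self.capacity and oid not in self.objects:
            self.objects.pop(next(iter(self.objects)))  # load shedding
        self.objects[oid] = pos
        return [qid for qid, (x0, y0, x1, y1) in self.queries.items()
                if x0 <= pos[0] <= x1 and y0 <= pos[1] <= y1]

s = Sole(capacity=1000)
s.add_query("q1", (0, 0, 10, 10))
print(s.on_object("o1", (3, 4)))    # ['q1']; o1 buffered while significant
print(s.on_object("o2", (50, 50)))  # []; insignificant, never buffered
```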
20.
Alexander R. Galloway. AI & Society, 2006, 20(4): 483-492
The goal of this paper is to offer, in straightforward terms, some practical insight into distributed data surveillance. I will use the software project Carnivore as a case study. Carnivore is a public-domain riff on the U.S. Federal Bureau of Investigation's software "Carnivore," which was developed to perform electronic wiretaps of email. As founder of the Radical Software Group (RSG) and lead developer on the Carnivore project, I will describe the technological, philosophical, and political reasons for launching the project. I will also offer an account of the development cycle of the core engine, identify trends in "client" interface designs, and present a series of design challenges that still remain. Alexander R. Galloway is an assistant professor at New York University. He is the author of Protocol: How Control Exists After Decentralization (Cambridge, The MIT Press, 2004) and a founding member of the software development group RSG.