首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We consider the XPath evaluation problem: Evaluate an XPath query Q on a streaming XML document D. We consider two versions of the problem: 1) Filtering Problem: Determine if there is a match for Q in D. 2) Node Selection Problem: Determine the set Q(D) of document nodes selected by Q. We consider Conjunctive XPath (CXPath) queries that involve only the child and descendant axes. Let d denote the depth of D, and n denote the number of location steps in Q. Bar-Yossef et al. (2007, 2005) [6] and [7] presented lower bounds on the memory space required by any algorithm to solve these two problems. Their lower bounds apply to each query in a large subset of XPath, and are obtained (mostly) using nonrecursive(Q,D). In this paper, we present larger lower bounds for a different class of queries (namely, CXPath queries with independent predicates), on recursive(Q,D). One of our results is an Ω(nmaxcands(Q,D)) lower bound for the node selection problem, for a worst-case Q; maxcands(Q,D) is the maximum number of nodes of D that can be candidates for output, at any one instant. So, there is no algorithm for the node selection problem that uses O(f(d,|Q|)+maxcands(Q,D)) space, for any function f. This shows that some previously published algorithms are incorrect.  相似文献   

2.
We consider the XPath evaluation problem: Evaluate an XPath query Q on a streaming XML document D; i.e., determine the set Q(D) of document elements selected by Q. We mainly consider Conjunctive XPath queries that involve only the child and descendant axes. Previously known in-memory algorithms for this problem use O(|D|) space and O(|Q||D|) time. Several previously known algorithms for the streaming version use Ω(dn) space and Ω(dn|D|) time in the worst case; d denotes the depth of D, and n denotes the number of location steps in Q. Their exponential space requirement could well exceed the O(|D|) space used by the in-memory algorithms. We present an efficient algorithm that uses O(d|Q|+nc) space and O((|Q|+dn)|D|) time in the worst case; c denotes the maximum number of elements of D that can be candidates for output, at any one instant. For some worst case Q and D, the memory space used by our algorithm matches our lower bound proved in a different paper; so, our algorithm uses optimal memory space in the worst case.  相似文献   

3.
One of the primary issues confronting XML message brokers is the difficulty associated with processing a large set of continuous XPath queries over incoming XML streams. This paper proposes a novel system designed to present an effective solution to this problem. The proposed system transforms multiple XPath queries before their run-time into a new data structure, called an XP-table, by sharing their common constraints. An XP-table is matched with a stream relation (SR) transformed from a target XML stream by a SAX parser. This arrangement is intended to minimize the run-time workload of continuous query processing. In addition, an early-query-termination strategy is proposed as an improved alternative to the basic approach. It optimizes query processing by arranging the evaluation sequence of the member-lists (m-lists) of an XP-table adaptively and offers increased efficiency, especially in cases of low selectivity. System performance is estimated and verified through a variety of experiments, including comparisons with previous approaches such as YFilter and LazyDFA. The proposed system is practically linear-scalable and stable for evaluating a set of XPath queries in a continuous and timely fashion.  相似文献   

4.
The XML stream filtering is gaining widespread attention from the research community in recent years. There have been many efforts to improve the performance of the XML filtering system by utilizing XML schema information. In this paper, we design and implement an XML stream filtering system, SFilter, which uses DTD or XML schema information for improving the performance. We propose the simplification and two kinds of optimization, one is static and the other is dynamic optimization. The Simplification and static optimization transform the XPath queries to make automata as an index structure for the filtering. The dynamic optimization are done in runtime at the filtering time. We developed five kinds of static optimization and two kinds of dynamic optimization. We present the novel filtering algorithm for the resulting transformed XPath queries and runtime optimizing. The experimental result shows that our system filters the XML streams efficiently.  相似文献   

5.
Existing encoding schemes and index structures proposed for XML query processing primarily target the containment relationship, specifically the parent–child and ancestor–descendant relationship. The presence of preceding-sibling and following-sibling location steps in the XPath specification, which is the de facto query language for XML, makes the horizontal navigation, besides the vertical navigation, among nodes of XML documents a necessity for efficient evaluation of XML queries. Our work enhances the existing range-based and prefix-based encoding schemes such that all structural relationships between XML nodes can be determined from their codes alone. Furthermore, an external-memory index structure based on the traditional B+-tree, XL+-tree(XML Location+-tree), is introduced to index element sets such that all defined location steps in the XPath language, vertical and horizontal, top-down and bottom-up, can be processed efficiently. The XL+-trees under the range or prefix encoding scheme actually share the same structure; but various search operations upon them may be slightly different as a result of the richer information provided by the prefix encoding scheme. Finally, experiments are conducted to validate the efficiency of the XL+-tree approach. We compare the query performance of XL+-tree with that of R-tree, which is capable of handling comprehensive XPath location steps and has been empirically shown to outperform other indexing approaches.  相似文献   

6.
Query evaluation over probabilistic XML   总被引:2,自引:0,他引:2  
Query evaluation over probabilistic XML is explored. The queries are twig patterns with projection, and the data is represented in terms of three models of probabilistic XML (that extend existing ones in the literature). The first model makes an assumption of independence among the probabilistic junctions, whereas the second model can encode probabilistic dependencies. The third model combines the first two and, hence, is the most general. An efficient algorithm (under data complexity) is given for query evaluation in the first model. In addition, various optimizations are proposed, and their effectiveness is shown both analytically and experimentally. For the other two models, it is shown that every query is either intractable or trivial. Nonetheless, efficient (additive and multiplicative) approximation algorithms are given for these two models. Finally, Boolean queries are enriched by allowing disjunctions and negations of branches. The above algorithm for the first model is extended to handle these queries. For the other two models, there is an efficient additive approximation, and a multiplicative one also exists if there is no negation; in addition, it is shown that if the query is non-monotonic, then no efficient multiplicative approximation exists unless NP = RP.  相似文献   

7.
The streaming evaluation is a popular way of evaluating queries on XML documents. Besides its many advantages, it is also the only option for a number of important XML applications. Unfortunately, existing algorithms focus almost exclusively on tree-pattern queries (TPQs). Requirements for flexible querying of XML data have motivated recently the introduction of query languages that are more general and flexible than TPQs. These languages are not supported by existing algorithms. In this paper, we consider a partial tree-pattern query (PTPQ) language which generalizes and strictly contains TPQs. PTPQs can express a fragment of XPath which comprises reverse axes and the node identity equality (is) operator, in addition to forward axes, wildcards and predicates. They constitute an important subclass of XPath, which is very useful in practice. Unfortunately, previous streaming algorithms for TPQs cannot be applied to PTPQs. PTPQs can be represented as dags enhanced with constraints. We explore this representation to design an original polynomial time streaming algorithm for PTPQs. Our algorithm aggressively filters incoming data that is irrelevant to the query and wisely avoids processing redundant query matches (i.e., matches of the query dag that do not contribute to new solutions). Our algorithm is the first one to support the streaming evaluation of such a broad fragment of XPath. We provide an analysis of it, and conduct an extensive experimental evaluation of its performance and scalability. Compared to the only known streaming algorithm that supports TPQs extended with reverse axes, our algorithm performs better by orders of magnitude while consuming a much smaller fraction of memory space. Current streaming applications have stringent requirements on query response time and memory consumption because of the large (possibly unbounded) size of data they handle. In order to keep memory usage and CPU consumption low for the PTPQ streaming evaluation, we design another streaming algorithm called Eager PSX for PTPQs. Its key feature is that it applies an eager evaluation strategy to quickly determine when node matches should be returned as solutions to the user and also to proactively detect redundant matches. We theoretically analyze Eager PSX, and experimentally test its time and space performance and scalability. We compare it with PSX. Our results show that Eager PSX not only achieves better space performance without compromising time performance, but it also greatly improves query response time for both simple and complex queries, in many cases, by orders of magnitude.  相似文献   

8.
9.
提出了一种新的XML数据流XPath查询模型GBRender,该模型通过组着色序列来直接处理元素,具有较高的处理效率与较强的适应性.  相似文献   

10.
The query rewriting plan generation over XML views has received wide attention recently. However, little work has been done on efficient evaluation of the query rewriting plans, which is not trivial since the plan may contain an exponential size of sub-plans. This paper investigates the reason for the potentially exponential number of sub-plans, and then proposes a new space-efficient form called ABCPlan (Plan with Automata Based Combinations) to equivalently represent the original query rewriting plan. ABCPlan contains a set of buckets containing suffix paths in the query tree and an automata to indicate the combination of the suffix paths from different buckets as valid query rewriting sub-plans. We also design an evaluation method called ABCScan, which constructs a unified evaluation tree for the ABCPlan and handles the evaluation tree in one scan of the XML view. In the evaluation, we introduce node existence automata to encode the structure of the sub-tree and convert the satisfaction of the ABCPlan into the intersection problem of deterministic finite automata. The experiments show that ABCPlan based method outperforms existing methods significantly in terms of scalability and efficiency.  相似文献   

11.
We introduce a new methodology for coupling language-induced partitions and index  -induced partitions on XML documents that is aimed for the benefit of efficient evaluation of XPath queries. In particular, we identify XPath fragments which are ideally coupled with the newly introduced P(k)P(k)-partition which has its definition grounded in the well-known A(k)A(k) structural index and its associated partition. We then utilize these couplings to investigate fundamental questions about the use of structural indexes in XPath query evaluation.  相似文献   

12.
In recent times, data are generated as a form of continuous data streams in many applications. Since handling data streams is necessary and discovering knowledge behind data streams can often yield substantial benefits, mining over data streams has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches often consist of two procedures including continuously maintaining synopses for data streams and finding frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be saved in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may cause memory utilization to be significantly degraded. In this paper, we consider compressing the information of all the itemsets into a structure with a fixed size using a hash-based technique. This hash-based approach skillfully summarizes the information of the whole data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets, and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented and a series of experiments is performed to show the effectiveness and the efficiency of this approach.  相似文献   

13.
Efficient memory representation of XML document trees   总被引:1,自引:0,他引:1  
Implementations that load XML documents and give access to them via, e.g., the DOM, suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. In this paper, a technique is presented that allows to represent the tree structure of an XML document in an efficient way. The representation exploits the high regularity in XML documents by compressing their tree structure; the latter means to detect and remove repetitions of tree patterns. Formally, context-free tree grammars that generate only a single tree are used for tree compression. The functionality of basic tree operations, like traversal along edges, is preserved under this compressed representation. This allows to directly execute queries (and in particular, bulk operations) without prior decompression. The complexity of certain computational problems like validation against XML types or testing equality is investigated for compressed input trees.  相似文献   

14.
Active XML (AXML) as intensional data aims to exploit potential computing powers of XML, Web services and P2P architecture. It is considered a powerful extension of XML to deal with dynamic XML data from autonomous and heterogeneous data sources on a very large scale via Web services. However, AXML is still at an immature stage and various issues need to be investigated before it can be accepted widely. This paper will focus on two issues facing the current AXML system, namely the representation and the query process. We propose superior representation and improved query evaluation for AXML. For justification purposes, we compare our proposed algorithms with the existing algorithms.  相似文献   

15.
In this paper, we address the problem of cardinality estimation of XPath queries over XML data stored in a distributed, Internet-scale environment such as a large-scale, data sharing system designed to foster innovations in biomedical and health informatics. The cardinality estimate of XPath expressions is useful in XQuery optimization, designing IR-style relevance ranking schemes, and statistical hypothesis testing. We present a novel gossip algorithm called XGossip, which given an XPath query estimates the number of XML documents in the network that contain a match for the query. XGossip is designed to be scalable, decentralized, and robust to failures—properties that are desirable in a large-scale distributed system. XGossip employs a novel divide-and-conquer strategy for load balancing and reducing the bandwidth consumption. We conduct theoretical analysis of XGossip in terms of accuracy of cardinality estimation, message complexity, and bandwidth consumption. We present a comprehensive performance evaluation of XGossip on Amazon EC2 using a heterogeneous collection of XML documents.  相似文献   

16.
Uncertain data are inevitable in many applications due to various factors such as the limitations of measuring equipment and delays in data updates. Although modeling and querying uncertain data have recently attracted considerable attention from the database community, there are still many critical issues to be resolved with respect to conducting advanced analysis on uncertain data. In this paper, we study the execution of the probabilistic skyline query over uncertain data streams. We propose a novel sliding window skyline model where an uncertain tuple may take the probability to be in the skyline at a certain timestamp t. Formally, a Wp-Skyline(p, t) contains all the tuples whose probabilities of becoming skylines are at least p at timestamp t. However, in the stream environment, computing a probabilistic skyline on a large number of uncertain tuples within the sliding window is a daunting task in practice. In order to efficiently calculate Wp-Skyline, we propose an efficient and effective approach, namely the candidate list approach, which maintains lists of candidates that might become skylines in future sliding windows. We also propose algorithms that continuously monitor the newly incoming and expired data to maintain the skyline candidate set incrementally. To further reduce the computation cost of deciding whether or not a candidate tuple belongs to the skyline, we propose an enhanced refinement strategy that is based on a multi-dimensional indexing structure combined with a grouping-and-conquer strategy. To validate the effectiveness of our proposed approach, we conduct extensive experiments on both real and synthetic data sets and make comparisons with basic techniques.  相似文献   

17.
Increasingly popular commercial streaming media applications over the Internet often use UDP as the underlying transmission protocol for performance reasons. Hand-in-hand with the increase in streaming media comes the impending threat of unresponsive UDP traffic, often cited as the major threat to the stability of the Internet. Unfortunately, there are few empirical studies that analyze the responsiveness, or lack of it, of commercial streaming media applications. In this work, we evaluate the responsiveness of RealNetworks’ RealVideo over UDP by measuring the performance of numerous streaming video clips selected from a variety of RealServers on the Internet, analyze the TCP-Friendliness of the UDP streams and correlate the results with network and application layer statistics. We find that most RealVideo UDP streams respond to Internet congestion by reducing the application layer encoding rate, and streams with a minimum encoding rate less than the fair share of the capacity often achieve a TCP-Friendly rate. In addition, our results suggest that a reason streaming applications choose not to use TCP is that the TCP API hides network information, such as loss rate and round-trip time, making it difficult to estimate the available capacity for effective media scaling.  相似文献   

18.
Media authentication of wireless transmission is becoming an increasingly important issue. Authenticated media content is constantly required to be transcoded at intermediates to accommodate heterogeneous applications. In this paper, a general and efficient authentication approach is proposed for scalable lossy media streams. Firstly, a joint coding and stream authentication (JCSA) media transmission system is described in a heterogeneous wireless network. For the JCSA system, a novel structure-maintained packetization is designed to realize flexible transcoding. Secondly, to obtain the optimal end-to-end quality and minimize the authentication overhead, a quality-optimized stream authentication (QOSA) framework is proposed for authenticating media content. Finally, an implementation of the proposed QOSA optimization framework on the consultative committee for space data systems image data compression (CCSDS IDC) coder is presented by combining graph-based and error-correction coding based (ECC-based) approaches. Experimental results demonstrate that our scheme can achieve the desired goal that it provides high robustness against packet-loss at the cost of a very low overhead.  相似文献   

19.
Efficient monitoring of skyline queries over distributed data streams   总被引:1,自引:0,他引:1  
Data management and data mining over distributed data streams have received considerable attention within the database community recently. This paper is the first work to address skyline queries over distributed data streams, where streams derive from multiple horizontally split data sources. Skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Previous work is concentrated on skyline computations over static data or centralized data streams. We present an efficient and an effective algorithm called BOCS to handle this issue under a more challenging environment of distributed streams. BOCS consists of an efficient centralized algorithm GridSky and an associated communication protocol. Based on the strategy of progressive refinement in BOCS, the skyline is incrementally computed by two phases. In the first phase, local skylines on remote sites are maintained by GridSky. At each time, only skyline increments on remote sites are sent to the coordinator. In the second phase, a global skyline is obtained by integrating remote increments with the latest global skyline. A theoretical analysis shows that BOCS is communication-optimal among all algorithms which use a share-nothing strategy. Extensive experiments demonstrate that our proposals are efficient, scalable, and stable.  相似文献   

20.
High utility pattern (HUP) mining over data streams has become a challenging research issue in data mining. When a data stream flows through, the old information may not be interesting in the current time period. Therefore, incremental HUP mining is necessary over data streams. Even though some methods have been proposed to discover recent HUPs by using a sliding window, they suffer from the level-wise candidate generation-and-test problem. Hence, they need a large amount of execution time and memory. Moreover, their data structures are not suitable for interactive mining. To solve these problems of the existing algorithms, in this paper, we propose a novel tree structure, called HUS-tree (high utility stream tree) and a new algorithm, called HUPMS (high utility pattern mining over stream data) for incremental and interactive HUP mining over data streams with a sliding window. By capturing the important information of stream data into an HUS-tree, our HUPMS algorithm can mine all the HUPs in the current window with a pattern growth approach. Furthermore, HUS-tree is very efficient for interactive mining. Extensive performance analyses show that our algorithm is very efficient for incremental and interactive HUP mining over data streams and significantly outperforms the existing sliding window-based HUP mining algorithms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号