A spatial distance join is a relatively new type of operation introduced for spatial and multimedia database applications. Additional requirements for ranking and stopping cardinality are often combined with the spatial distance join in online query processing or Internet search environments. These requirements pose new challenges as well as opportunities for more efficient processing of spatial distance join queries. In this paper, we first present an efficient k-distance join algorithm that uses spatial indexes such as R-trees. Bidirectional node expansion and plane-sweeping techniques are used for fast pruning of distant pairs, and the plane-sweeping is further optimized by novel strategies for selecting a sweeping axis and direction. Furthermore, we propose adaptive multistage algorithms for k-distance join and incremental distance join operations. Our performance study shows that the proposed adaptive multistage algorithms outperform previous work by up to an order of magnitude for both k-distance join and incremental distance join queries, under various operational conditions.
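
To make the pruning idea concrete, here is a minimal sketch of a k-distance join over two in-memory point sets. It is not the paper's algorithm: there is no R-tree, no bidirectional node expansion, and no adaptive choice of sweeping axis. It assumes 2-D points and a fixed sweep along the x axis, and the function name `k_distance_join` is hypothetical. It only illustrates how the distance along the sweeping axis can discard pairs that cannot be among the k closest.

```python
import heapq
import math


def k_distance_join(r, s, k):
    """Return the k closest (distance, p, q) pairs with p in r and q in s.

    Illustrative sketch: both inputs are sorted along the x axis, and a pair
    is skipped as soon as its x-distance alone exceeds the current k-th
    smallest distance found so far.
    """
    if k <= 0:
        return []
    r_sorted = sorted(r)
    s_sorted = sorted(s)
    heap = []  # max-heap of the best k pairs, stored with negated distances
    for p in r_sorted:
        for q in s_sorted:
            cutoff = -heap[0][0] if len(heap) == k else math.inf
            dx = q[0] - p[0]
            if dx > cutoff:
                break      # every later q lies even farther to the right
            if -dx > cutoff:
                continue   # q lies too far to the left of p
            d = math.dist(p, q)
            if d < cutoff:
                item = (-d, p, q)
                if len(heap) == k:
                    heapq.heapreplace(heap, item)
                else:
                    heapq.heappush(heap, item)
    return sorted((-neg_d, p, q) for neg_d, p, q in heap)


if __name__ == "__main__":
    r = [(0, 0), (5, 5), (10, 0)]
    s = [(1, 0), (5, 6), (20, 20)]
    for dist, p, q in k_distance_join(r, s, 2):
        print(dist, p, q)
```

The max-heap keeps the current k-th smallest distance at its root, so the cutoff tightens as better pairs are found and more candidates are pruned without a full distance computation.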

We study the subject of exploiting interoperator parallelism to optimize the execution of multi-join queries. Specifically, we focus on two major issues: (1) scheduling the execution sequence of multiple joins within a query, and (2) determining the number of processors to be allocated for the execution of each join operation obtained in (1). For the first issue, we propose and evaluate by simulation several methods to determine the general join sequences, or bushy trees. Despite their simplicity, the proposed heuristics can lead to general join sequences that significantly outperform the optimal sequential join sequence. The quality of the join sequences obtained by the proposed heuristics is shown to be fairly close to that of the optimal one. For the second issue, it is shown that the processor allocation for exploiting interoperator parallelism is subject to more constraints, such as execution dependency and system fragmentation, than those in the study of intraoperator parallelism for a single join. The concept of synchronous execution time is proposed to alleviate these constraints. Several heuristics to deal with the processor allocation, categorized into bottom-up and top-down approaches, are derived and evaluated by simulation. The relationship between issues (1) and (2) is explored. Among all the schemes evaluated, the proposed two-step approach emerges as the best solution to minimize the query execution time: it first applies the join sequence heuristic to build a bushy tree as if under a single-processor system, and then, in light of the concept of synchronous execution time, allocates processors to execute each join in the bushy tree in a top-down manner.

Because it operates under a strict time constraint, query processing for data streams should be continuous and rapid. To guarantee this constraint, most previous studies optimize the evaluation order of multiple join operations in a set of continuous queries using a greedy optimization strategy, so that the order is re-optimized dynamically at run time in response to the time-varying characteristics of data streams. However, this method often results in a sub-optimal plan because the greedy strategy traces only the first promising plan. This paper proposes a new multiple query optimization approach, Adaptive Sharing-based Extended Greedy Optimization Approach (A-SEGO), that traces multiple promising partial plans simultaneously. A-SEGO presents a novel method for sharing the results of common sub-expressions in a set of queries cost-effectively. The number of partial plans can be flexibly controlled according to the query processing workload. In addition, to avoid invoking the optimization process too frequently, optimization is performed only when the current execution plan is no longer relatively efficient. A series of experiments is analyzed comparatively to evaluate the performance of the proposed method in various stream environments.

The join is an important operator in processing data streams. To produce outputs continuously over unbounded data streams, sliding windows are generally used to limit the scope of the join at a certain time. Existing join algorithms consider only a simple type of window, which is updated whenever a new data item arrives on any input stream. A more common type of window, whose interval is updated periodically, i.e., slid by a predefined time interval, has not yet been addressed. In this paper, we consider time-slide windows in joining multiple data streams. The algorithm for the time-slide window join can vary according to (i) how frequently the join is evaluated and (ii) which structure is used for windowing. Possible algorithms are discussed along these two dimensions, and experimental results that compare their performance are provided.
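
The difference between a window that moves on every arrival and one that is slid periodically can be sketched in a few lines. The following is an illustrative sketch only, not one of the algorithms compared in the paper. It assumes two finite lists of `(timestamp, key, value)` tuples, an equi-join on `key`, and evaluation once per slide boundary; the function name and the half-open window `[end - window, end)` are assumptions made for the example.

```python
from collections import defaultdict


def time_slide_join(stream_a, stream_b, window, slide):
    """Evaluate an equi-join once per slide boundary.

    At each boundary `end` (slide, 2*slide, ...), tuples from both streams
    whose timestamps fall in [end - window, end) are joined on their key.
    Returns a list of (end, matches) pairs.
    """
    if not stream_a or not stream_b:
        return []
    last_ts = max(ts for ts, _, _ in list(stream_a) + list(stream_b))
    results = []
    end = slide
    while end - slide <= last_ts:
        start = end - window
        # Build a hash index over the tuples of stream A inside the window.
        index = defaultdict(list)
        for ts, key, val in stream_a:
            if start <= ts < end:
                index[key].append(val)
        # Probe the index with the tuples of stream B inside the window.
        matches = [
            (key, a_val, b_val)
            for ts, key, b_val in stream_b
            if start <= ts < end
            for a_val in index.get(key, ())
        ]
        results.append((end, matches))
        end += slide
    return results


if __name__ == "__main__":
    a = [(1, "x", "a1"), (4, "y", "a2")]
    b = [(2, "x", "b1"), (6, "y", "b2")]
    for end, matches in time_slide_join(a, b, window=4, slide=2):
        print(end, matches)
```

This version rebuilds the window contents at every boundary for clarity. A practical implementation would maintain the window incrementally, which is where the choice of windowing structure mentioned in the abstract matters.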

This paper proposes a semi-greedy framework for optimizing multi-join queries in shared-nothing systems. The plan generated by the framework comprises several pipelines, each performing several joins. The framework determines the “optimal” number of joins to be performed in each pipeline. The decisions are made based on the cost estimation of the entire processing plan. Two existing optimization algorithms are extended under the framework. An analytical model is presented and used to compare the quality of plans produced by each optimization algorithm. Our study shows that the new algorithms outperform their counterparts that are not extended.

The resource-constrained nature of mote-level wireless sensor networks (WSNs) poses challenges for the design of general-purpose sensor network query processors (SNQPs). Existing SNQPs tend to generate query execution plans (QEPs) that are selected on the basis of a fixed, implicit expectation, for example, that energy consumption should be kept as small as possible. However, in WSN applications, the same query may be subject to several, possibly conflicting, quality-of-service (QoS) expectations concomitantly (for example, maximizing data acquisition rates subject to keeping energy consumption low). It is also not uncommon for the QoS expectations to change over the lifetime of a deployment (for example, from low to high data acquisition rates). This paper describes optimization algorithms that respond to stated QoS expectations (about acquisition rate, delivery time, energy consumption and lifetime) when making routing, placement, and timing decisions for in-WSN query processing. The paper shows experimentally that QoS-awareness offers significant benefits in responding to, and reconciling, diverse QoS expectations, thereby enabling QoS-aware SNQPs to generate efficient QEPs for a broader range of WSN applications than has hitherto been possible.

This paper deals with the problem of scheduling spawned tasks when a query is issued to a database which resides on a MIMD multiprocessor. These tasks have the property that their associated dependency scheme can be presented as a directed tree. We present a theoretical framework with extensive experimental simulations which increase the throughput of database applications. We derive a family of algorithms for scheduling tasks. Their performance is tested on several common multiprocessor configurations. For better performance the adaptation of the scheduling algorithm to the multiprocessor configuration is examined and analyzed. The scheduling algorithms are divided into two cases: (a) permitted changes in the resources connection scheme of the multiprocessor, and (b) no changes allowed. The algorithms are scalable and their complexity is computed. In particular, we present an algorithm for scheduling tasks in the case where the construction of a central storage location is permitted. One of the main tools for the construction of the above algorithms is the notion of (t, 1)-domination and k-domination sets. Copyright © 1999 John Wiley & Sons, Ltd.

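Sliding-window aggregate queries are widely used in data stream management systems. During peak periods of stream arrival, the load-shedding problem that arises in sliding-window aggregate queries must be addressed. This paper analyzes the characteristics of the subset model and the shortcomings of existing load-shedding strategies, states the constraints of the load-shedding problem for sliding-window aggregate queries over data streams, and proposes a load-shedding algorithm, based on a strategy of dropping window updates, that guarantees the production of subset results. Theoretical analysis and experimental results show that the algorithm is highly effective and practical for handling the load-shedding problem of sliding-window aggregate queries over data streams.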

Given two sets of moving objects with nonzero extents, the continuous intersection join query reports every pair of intersecting objects, one from each of the two moving object sets, for every timestamp. This type of query is important for a number of applications, e.g., in the multi-billion dollar computer game industry, massively multiplayer online games like World of Warcraft need to monitor the intersection among players’ attack ranges and render players’ interaction in real time. The computational cost of a straightforward algorithm or an algorithm adapted from another query type is prohibitive, and answering the query in real time poses a great challenge. Those algorithms compute the query answer for either too long or too short a time interval, which results in either a very large computation cost per answer update or too frequent answer updates, respectively. This observation motivates us to optimize the query processing in the time dimension. In this study, we achieve this optimization by introducing the new concept of time-constrained (TC) processing. Further, TC processing enables a set of effective improvement techniques on traditional intersection join algorithms. Finally, we provide a method to find the optimal value for an important parameter required in our technique, the maximum update interval. As a result, we achieve a highly optimized algorithm for processing continuous intersection join queries on moving objects. With a thorough experimental study, we show that our algorithm outperforms the best adapted existing solution by several orders of magnitude. We also validate the accuracy of our cost model and its effectiveness in optimizing the performance.

Data management and data mining over distributed data streams have received considerable attention within the database community recently. This paper is the first work to address skyline queries over distributed data streams, where streams derive from multiple horizontally split data sources. A skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Previous work has concentrated on skyline computations over static data or centralized data streams. We present an efficient and effective algorithm called BOCS to handle this issue under the more challenging environment of distributed streams. BOCS consists of an efficient centralized algorithm GridSky and an associated communication protocol. Based on the strategy of progressive refinement in BOCS, the skyline is incrementally computed in two phases. In the first phase, local skylines on remote sites are maintained by GridSky. At each time, only skyline increments on remote sites are sent to the coordinator. In the second phase, a global skyline is obtained by integrating remote increments with the latest global skyline. A theoretical analysis shows that BOCS is communication-optimal among all algorithms which use a share-nothing strategy. Extensive experiments demonstrate that our proposals are efficient, scalable, and stable.
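
The dominance relation and the two-phase idea (local skylines first, then a global merge) can be illustrated with a small static example. This is a sketch under stated assumptions, not BOCS or GridSky: it works on static lists rather than streams, sends whole local skylines rather than increments, uses a quadratic scan, and assumes that smaller values are better in every dimension. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better
    in at least one (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def global_skyline(local_skylines):
    """Merge the local skylines reported by each site into a global skyline.

    A point dominated at its own site can never be in the global skyline,
    so only local skyline points need to reach the coordinator.
    """
    merged = [p for sky in local_skylines for p in sky]
    return skyline(merged)


if __name__ == "__main__":
    site1 = [(1, 5), (3, 3), (4, 4)]
    site2 = [(2, 2), (5, 1), (6, 6)]
    locals_ = [skyline(site1), skyline(site2)]
    print(locals_)
    print(global_skyline(locals_))
```

In the example, `(3, 3)` survives at its own site but is dominated by `(2, 2)` from the other site, which is why a second, global phase is needed.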

A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Query processing for such a data stream should also be continuous and rapid, which requires strict time and space constraints. In order to guarantee these constraints, we proposed a new scheme called an Attribute Selection Construct (ASC) for an attribute of a data stream in our previous study (Lee and Lee, Information Sciences 178:2416–2432, 2008). As its optimization technique, this paper proposes a new strategy that determines the evaluation order of multiple ASCs for a given query set at two different levels: macro and micro. Based on the two levels, it also proposes two different strategies, macro-sequence and hybrid-sequence, that find the optimized full evaluation sequence of all the ASCs. In addition, it provides an adaptive strategy that periodically rearranges the evaluation sequence of multiple ASCs. The performance of the proposed technique is verified by a series of experiments.

In this paper, we propose parallel processing of continuous queries over data streams to handle the bottleneck of single-processor DSMSs. Queries are executed in parallel over the logical machines in a multiprocessing environment. Parallel execution of operators is scheduled by finding the shortest path in a weighted graph called the Query Mega Graph (QMG), which is a logical view of K machines. Over time, the number of tuples waiting in the queues of different operators may differ greatly. When a queue becomes full, re-scheduling is done by updating the edge weights of the QMG. In the newly computed path, machines with more workload are used less. The proposed system is formally presented and its correctness is proved. It is also modeled in Petri nets, and its performance is evaluated and compared with serial query processing as well as the Min-Latency scheduling algorithm. The presented system is shown to outperform them with respect to tuple latency (response time), memory usage, throughput, and tuple loss, which are critical parameters in any data stream management system.
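
The scheduling step described above reduces to a shortest-path computation that is repeated after edge weights change. The sketch below uses Dijkstra's algorithm as a stand-in; the abstract does not say which shortest-path algorithm the system uses or how the QMG is encoded, so the adjacency-dictionary representation, the node names, and the weights are all assumptions for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a graph given as {node: {neighbor: weight}}.

    Returns (cost, path), or (inf, []) if the target is unreachable.
    Weights are assumed to be non-negative.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


if __name__ == "__main__":
    # Hypothetical graph: weights stand for the current load of each machine.
    qmg = {
        "src": {"m1": 1, "m2": 4},
        "m1": {"sink": 5},
        "m2": {"sink": 1},
        "sink": {},
    }
    print(shortest_path(qmg, "src", "sink"))
    # A queue on m2 fills up: raise the weight and re-schedule.
    qmg["src"]["m2"] = 10
    print(shortest_path(qmg, "src", "sink"))
```

Raising the weight of an edge that leads to an overloaded machine makes paths through it more expensive, so the recomputed path routes work elsewhere, which matches the re-scheduling behavior the abstract describes.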

In some business applications such as trading management in financial institutions, it is required to accurately answer ad hoc aggregate queries over data streams. Materializing and incrementally maintaining a full data cube or even its compression or approximation over a data stream is often computationally prohibitive. On the other hand, although previous studies proposed approximate methods for continuous aggregate queries, they cannot provide accurate answers. In this paper, we develop a novel prefix aggregate tree (PAT) structure for online warehousing of data streams and answering ad hoc aggregate queries. Often, a data stream can be partitioned into the historical segment, which is stored in a traditional data warehouse, and the transient segment, which can be stored in a PAT to answer ad hoc aggregate queries. The size of a PAT is linear in the size of the transient segment, and only one scan of the data stream is needed to create and incrementally maintain a PAT. Although query answering using a PAT costs more than with a fully materialized data cube, the query answering time is still kept linear in the size of the transient segment. Our extensive experimental results on both synthetic and real data sets illustrate the efficiency and the scalability of our design. Moonjung Cho is a Ph.D. candidate in the Department of Computer Science and Engineering at State University of New York at Buffalo. She obtained her Master's degree from the same university in 2003. She has four years of industry experience as an associate researcher. Her research interests are in the area of data mining, data warehousing and data cubing. She has received a full scholarship from the Institute of Information Technology Assessment in Korea. Jian Pei received the Ph.D. degree in Computing Science from Simon Fraser University, Canada, in 2002. He is currently an Assistant Professor of Computing Science at Simon Fraser University, Canada. 
In 2002–2004, he was an Assistant Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. His research interests can be summarized as developing advanced data analysis techniques for emerging applications. Particularly, he is currently interested in various techniques of data mining, data warehousing, online analytical processing, and database systems, as well as their applications in bioinformatics. His current research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Science Foundation (NSF). He has published over 70 papers in refereed journals, conferences, and workshops, has served on the program committees of over 60 international conferences and workshops, and has been a reviewer for some leading academic journals. He is a member of the ACM, the ACM SIGMOD, and the ACM SIGKDD. Ke Wang received his Ph.D. from the Georgia Institute of Technology. He is currently a professor at the School of Computing Science, Simon Fraser University. Before joining Simon Fraser, he was an associate professor at the National University of Singapore. He has taught in the areas of database and data mining. Ke Wang's research interests include database technology, data mining and knowledge discovery, machine learning, and emerging applications, with recent interests focusing on the end use of data mining. This includes explicitly modeling the business goal (such as profit mining, bio-mining and web mining) and exploiting user prior knowledge (such as extracting unexpected patterns and actionable knowledge). He is interested in combining the strengths of various fields such as database, statistics, machine learning and optimization to provide actionable solutions to real life problems. Ke Wang has published in database, information retrieval, and data mining conferences, including SIGMOD, SIGIR, PODS, VLDB, ICDE, EDBT, SIGKDD, SDM and ICDM. 
He is an associate editor of the IEEE TKDE journal and has served on program committees for international conferences including DASFAA, ICDE, ICDM, PAKDD, PKDD, SIGKDD and VLDB.

Management and analysis of streaming data have become crucial, with applications to the web, sensor data, network traffic data, and the stock market. Data streams consist mostly of numeric data, but what is more interesting are the events derived from the numerical data that need to be monitored. The events obtained from streaming data form event streams. Event streams have similar properties to data streams, i.e., they are seen only once in a fixed order as a continuous stream. Events appearing in the event stream have time stamps associated with them at a certain time granularity, such as second, minute, or hour. One frequently asked type of query over event streams is the count query, i.e., the frequency of an event occurrence over time. Count queries can be answered over event streams easily; however, users may ask queries over different time granularities as well. For example, a broker may ask how many times a stock increased in the same time frame, where the time frames specified could be an hour, a day, or both. Such queries are challenging, especially in the case of event streams where only a window of an event stream is available at a certain time instead of the whole stream. In this paper, we propose a technique for predicting the frequencies of event occurrences in event streams at multiple time granularities. The proposed approximation method efficiently estimates the count of events with high accuracy in an event stream at any time granularity by examining the distance distributions of event occurrences. The proposed method has been implemented and tested on different real data sets including daily price changes in two different stock exchange markets. The obtained results show its effectiveness.

In this paper we present algorithms for building and maintaining efficient collection trees that provide the conduit to disseminate data required for processing monitoring queries in a wireless sensor network. While prior techniques base their operation on the assumption that the sensor nodes that collect data relevant to a specified query need to include their measurements in the query result at every query epoch, in many event monitoring applications such an assumption is not valid. We introduce and formalize the notion of event monitoring queries and demonstrate that they can capture a large class of monitoring applications. We then show techniques which, using a small set of intuitive statistics, can compute collection trees that minimize important resources such as the number of messages exchanged among the nodes or the overall energy consumption. Our experiments demonstrate that our techniques can organize the data collection process while utilizing significantly lower resources than prior approaches.

One of the primary issues confronting XML message brokers is the difficulty associated with processing a large set of continuous XPath queries over incoming XML streams. This paper proposes a novel system designed to present an effective solution to this problem. The proposed system transforms multiple XPath queries before their run-time into a new data structure, called an XP-table, by sharing their common constraints. An XP-table is matched with a stream relation (SR) transformed from a target XML stream by a SAX parser. This arrangement is intended to minimize the run-time workload of continuous query processing. In addition, an early-query-termination strategy is proposed as an improved alternative to the basic approach. It optimizes query processing by arranging the evaluation sequence of the member-lists (m-lists) of an XP-table adaptively and offers increased efficiency, especially in cases of low selectivity. System performance is estimated and verified through a variety of experiments, including comparisons with previous approaches such as YFilter and LazyDFA. The proposed system is practically linear-scalable and stable for evaluating a set of XPath queries in a continuous and timely fashion.

Quantile computation has many applications including data mining and financial data analysis. It has been shown that an ε-approximate summary can be maintained so that, given a quantile query (φ, ε), the data item at rank ⌈φN⌉ may be approximately obtained within the rank error precision εN over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different φ and ε poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
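
One way to see why different quantile queries can share a single result: a query with quantile φ and error bound ε accepts any item whose rank lies within εN of ⌈φN⌉, so in normalized terms it accepts any answer in the interval [φ − ε, φ + ε]. Queries whose intervals share a common point can be served by one answer. The sketch below groups queries with the classic greedy interval-stabbing method, which yields the minimum number of shared points for a fixed set of intervals. It is not the paper's clustering algorithm: it ignores the error of the summary itself and does not handle queries being registered or removed. The function names are hypothetical.

```python
import math


def target_rank(phi, n):
    """Exact rank targeted by a phi-quantile query over n items."""
    return math.ceil(phi * n)


def cluster_queries(queries):
    """Group (phi, eps) queries so that each group shares one quantile point.

    Queries are processed in order of the right endpoint phi + eps. A query
    joins the current cluster if its interval [phi - eps, phi + eps] contains
    the cluster's shared point; otherwise it opens a new cluster whose shared
    point is its own right endpoint.
    Returns a list of (shared_point, [queries]) pairs.
    """
    clusters = []
    for phi, eps in sorted(queries, key=lambda q: q[0] + q[1]):
        if clusters and phi - eps <= clusters[-1][0]:
            clusters[-1][1].append((phi, eps))
        else:
            clusters.append((phi + eps, [(phi, eps)]))
    return clusters


if __name__ == "__main__":
    queries = [(0.5, 0.05), (0.52, 0.05), (0.9, 0.01), (0.1, 0.02)]
    for point, members in cluster_queries(queries):
        print(point, members)
    print(target_rank(0.5, 1001))
```

Here the queries `(0.5, 0.05)` and `(0.52, 0.05)` end up in the same cluster because their tolerance intervals overlap, while the other two each need their own result.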

