首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
The article proposes two ways to improve the sort merge based band join algorithm. The techniques proposed address issues that have not been previously discussed: to choose a right relation as the inner relation to achieve better performance and to optimally allocate and adjust buffer allocations to make the algorithms robust to data skew and estimation errors  相似文献   

2.
传统的反向k近邻查询的研究主要集中在k=1时的单色移动对象的反向最近邻查询上,单色和双色的反向k近邻查询问题还没有解决。利用网格索引结构结合60°平面修剪策略,提出了一种解决单色和双色的移动对象的连续反向k近邻查询方法。最后实验结果验证了算法的有效性。  相似文献   

3.
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called ‘Jive-join’ developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a self-join with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the impact of seek and rotational latency. The algorithm is resistant to data skew. It can also join multiple relations while still making only a single pass over each input relation. Using a detailed cost model, Stripe-join is analyzed and compared with competing algorithms. For large input relations, Stripe-join performs significantly better than Valduriez's algorithm and hash join algorithms. We demonstrate circumstances under which Stripe-join performs significantly better than Jive-join. Unlike Jive-join, Stripe-join makes no assumptions about the order of the join index.  相似文献   

4.
Top-k spatial joins   总被引:1,自引:0,他引:1  
Given two spatial data sets A and B, a top-k spatial join retrieves the k objects from A or B that intersect the largest number of objects from the other data set. Depending on the application requirements, there exist several variations of the problem. For instance, B may be a point data set, and the goal may be to retrieve the regions of A that contain the maximum number of points. The processing of such queries with conventional spatial join algorithms is expensive. However, several improvements are possible based on the fact that we only require a small subset of the result (instead of all intersection/containments pairs). In this paper, we propose output-sensitive algorithms for top-k spatial joins that utilize a variety of optimizations for reducing the overhead.  相似文献   

5.
High-dimensional similarity joins   总被引:3,自引:0,他引:3  
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε tree and the R-tree family, and show that the ε tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε tree is twice to an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε tree  相似文献   

6.
Holes in joins     
A join of two relations in real databases is usually much smaller than their Cartesian product. This means that most of the combinations of tuples in the crossproduct of the respective relations do not appear together in the join result. We characterize these combinations as ranges of attributes that do not appear together. We sketch an algorithm for finding such combinations and present experimental results from real data sets. We then explore two potential applications of this knowledge in query processing. In the first application, we model empty joins as materialized views, we show how they can be used for query optimization. In the second application, we propose a strategy that uses information about empty joins for an improved join selectivity estimation.  相似文献   

7.
We present a method for transforming some outer joins to inner joins and describe a generalized semijoin reduction technique. The first part of the paper shows how to transform a given outer join query whose join graph is a tree to an equivalent inner join query. The method uses derived relations and join predicates. Derived relations contain columns corresponding to join conditions and may have virtual row identifiers, rows and attribute values. The constructed inner join query, after elimination of virtual row identifiers, has the same join tuples as the outer join query. Both the theoretical maximum number of virtual rows and the average number in practice are shown to be low. The method confines consideration of the non-associativity of outer joins to a single step. The second part of the paper generalizes to outer joins the well known technique of semijoin reduction of inner joins. It does so by defining the notions of influencing and needing, and using them to define full reduction and reduction plans. The technique is applied here to perform one step of the method presented in the first part. Semijoin reduction is useful in practice for executing join queries in distributed databases.  相似文献   

8.
The effectiveness of parallel processing of relational join operations is examined. The skew in the distribution of join attribute values and the stochastic nature of the task processing times are identified as the major factors that can affect the effective exploitation of parallelism. Expressions for the execution time of parallel hash join and semijoin are derived and their effectiveness analyzed. When many small processors are used in the parallel architecture, the skew can result in some processors becoming sources of bottleneck while other processors are being underutilized. Even in the absence of skew, the variations in the processing times of the parallel tasks belonging to a query can lead to high task synchronization delay and impact the maximum speedup achievable through parallel execution. For example, when the task processing time on each processor is exponential with the same mean, the speedup is proportional to P/ln(P) where P is the number of processors. Other factors such as memory size, communication bandwidth, etc., can lead to even lower speedup. These are quantified using analytical models  相似文献   

9.
An evaluation of XML queries such as XQuery or XPath expressions represents a challenging task due to its complexity. Many algorithms have been introduced to cope with this problem. Some of them, called binary joins, evaluate separated parts of a query and subsequently merge intermediate results, while the others, called holistic twig joins, evaluate a query as a whole. Moreover, these algorithms also differ in what index data structure they use to handle XML data. There exist cost-based approaches utilizing binary joins and various index data structures; however, they share a limitation. The limitation is that they cannot perform a join between query nodes not having a direct XPath relationship. Such a join can be advantageous especially if their joint selectivity is high. Since holistic joins work with all query nodes they overcome this limitation. In this article, we introduce such a holistic twig join called CostTwigJoin. To the best of our knowledge, CostTwigJoin is the first holistic join capable of combining various index data structures during an evaluation of an XML query. Usage of the holistic join has yet another advantage for cost-based approaches: an optimizer does not have to resolve the order of binary joins; therefore, the search space is reduced. In this article, we perform thorough experiments on hundreds of queries to evaluate our approach and demonstrate its advantages.  相似文献   

10.
We show that any expression of the relational division operator in the relational algebra with union, difference, projection, selection, constant-tagging, and joins, must produce intermediate results of quadratic size. To prove this result, we show a dichotomy theorem about intermediate sizes of relational algebra expressions (they are either all linear, or at least one is quadratic), and we link linear relational algebra expressions to expressions using only semijoins instead of joins.  相似文献   

11.
12.
Rank join operators perform a relational join among two or more relations, assign numeric scores to the join results based on a given scoring function, and return K join results with the highest scores, while accessing a subset of data from the input relations. Most of the rank join operators compute a score upper bound for a join result that can be potentially obtained after retrieving the unseen data. A join result is kept in an output buffer, and is deterministically reported to the user if its score is greater than or equal to the score upper bound. The value of the score upper bound decreases subject to further data extraction from the relations. In case of Web services as data sources, which are characterized by non-negligible response time for every data fetch, the value of score upper bound might decrease slowly. Consequently, there is a long delay in reporting a join result stored in the output buffer. This paper addresses the problem of efficiently reporting a top join result obtained by joining the data of two Web services, which are characterized by non-negligible response time. We present a probabilistic reporting method which computes the confidence with which a join result may appear among final top-K joins. It reports a join result as soon as the measure of its confidence exceeds a given threshold. This helps in reporting a join result soon after its observation. An extensive experimental study with various settings of different operating parameters validates the importance of the proposed approach on both real and synthetic data sets. The results show that our proposed approach significantly reduces the average difference between the time when a join result is observed and the time when it is reported, while incurring negligible errors in the final results.  相似文献   

13.
Fast joins using join indices   总被引:1,自引:0,他引:1  
Two new algorithms, “Jive join” and “Slam join,” are proposed for computing the join of two relations using a join index. The algorithms are duals: Jive join range-partitions input relation tuple ids and then processes each partition, while Slam join forms ordered runs of input relation tuple ids and then merges the results. Both algorithms make a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a temporary file, whose size is half that of the join index. Both algorithms require only that the number of blocks in main memory is of the order of the square root of the number of blocks in the smaller relation. By storing intermediate and final join results in a vertically partitioned fashion, our algorithms need to manipulate less data in memory at a given time than other algorithms. The algorithms are resistant to data skew and adaptive to memory fluctuations. Selection conditions can be incorporated into the algorithms. Using a detailed cost model, the algorithms are analyzed and compared with competing algorithms. For large input relations, our algorithms perform significantly better than Valduriez's algorithm, the TID join algorithm, and hash join algorithms. An experimental study is also conducted to validate the analytical results and to demonstrate the performance characteristics of each algorithm in practice. Received July 21, 1997 / Accepted June 8, 1998  相似文献   

14.
15.
The k Nearest Neighbor (kNN) join operation associates each data object in one data set with its k nearest neighbors from the same or a different data set. The kNN join on high-dimensional data (high-dimensional kNN join) is a very expensive operation. Existing high-dimensional kNN join algorithms were designed for static data sets and therefore cannot handle updates efficiently. In this article, we propose a novel kNN join method, named kNNJoin +, which supports efficient incremental computation of kNN join results with updates on high-dimensional data. As a by-product, our method also provides answers for the reverse kNN queries with very little overhead. We have performed an extensive experimental study. The results show the effectiveness of kNNJoin+ for processing high-dimensional kNN joins in dynamic workloads.  相似文献   

16.
17.
Jongik Kim 《Information Sciences》2006,176(22):3300-3331
For accelerating a structural join operation, current techniques focus on skipping elements that do not contribute to the results. They make use of external index structures (e.g. B+ tree) to determine a bunch of elements to be skipped. However, external indexes are too heavy for a structural join and the overhead of index lookups can reduce the benefit of skipping. In this paper, we proposed element trees and distribution encoded bitmaps for efficient element skipping. With proposed techniques, we can exploit the distribution of elements as well as the context information of a query for efficient skipping of unnecessary elements.  相似文献   

18.
Semantic approximation of data stream joins   总被引:1,自引:0,他引:1  
We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We make two main contributions. First, we define the problem space by discussing architectural models for data stream join processing and surveying suitable measures for the quality of an approximation of a set-valued query result. Second, we examine in detail a large part of this problem space. More precisely, we consider the number of generated result tuples as the quality measure and we propose optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data, we show the efficacy of our solutions.  相似文献   

19.
20.

利用集合相似度自连接算法找出一个集合集中所有相似度大于给定阈值的集合对有着广泛的应用. 基于过滤-验证框架和并行分布式计算框架MapReduce的集合相似度连接是近年来的研究热点. 但现有算法在阈值低时产生较大规模的候选集,导致性能不理想. 针对这一问题,提出采用频繁模式树FP-tree及其派生结构FP-tree*将数据压缩在内存中计算集合相似度自连接以减小候选集规模. 首先设计并讨论基于现有FP-tree*的集合相似度连接计算及其优缺点,提出遍历效率更高的线性频繁模式树结构模型TELP-tree及基于它的算法TELP-SJ(TELP-tree self join),其包括分别面向构建树和遍历树的2阶段过滤算法,这些算法可以减小树规模和减少树遍历. 然后,设计基于MapReduce的并行分布式算法FastTELP-SJ. 最后,基于4组真实应用数据集进行3组性能比较实验. 实验结果表明FastTELP-SJ算法面向高维大规模集合相似度自连接计算时,包括执行时间、内存占用率、磁盘使用量和可扩展性的运行效率最好.

  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号