首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 29 毫秒
1.
混合散列连接算法(HHJ)是数据库管理系统查询处理中一种重要的连接算法. 本文提出通过缓存优化来减少随机I/O的缓存优化混合散列连接算法(OHHJ), 即通过合理优化分区阶段桶缓存的大小来尽量减少分区过程中产生的随机I/O. 文章通过对分区(桶)大小、桶缓存大小、可用缓存大小、关系表大小与硬盘随机I/O访问特性之间的关系进行定量分析, 得出桶大小以及桶缓存大小最优分配的启发式. 实验结果表明OHHJ可以较好地减少传统HHJ算法分区阶段产生的随机I/O, 提升了算法性能.  相似文献   

2.
Fast joins using join indices   总被引:1,自引:0,他引:1  
Two new algorithms, “Jive join” and “Slam join,” are proposed for computing the join of two relations using a join index. The algorithms are duals: Jive join range-partitions input relation tuple ids and then processes each partition, while Slam join forms ordered runs of input relation tuple ids and then merges the results. Both algorithms make a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a temporary file, whose size is half that of the join index. Both algorithms require only that the number of blocks in main memory is of the order of the square root of the number of blocks in the smaller relation. By storing intermediate and final join results in a vertically partitioned fashion, our algorithms need to manipulate less data in memory at a given time than other algorithms. The algorithms are resistant to data skew and adaptive to memory fluctuations. Selection conditions can be incorporated into the algorithms. Using a detailed cost model, the algorithms are analyzed and compared with competing algorithms. For large input relations, our algorithms perform significantly better than Valduriez's algorithm, the TID join algorithm, and hash join algorithms. An experimental study is also conducted to validate the analytical results and to demonstrate the performance characteristics of each algorithm in practice. Received July 21, 1997 / Accepted June 8, 1998  相似文献   

3.
We propose a new algorithm, called Stripe-join, for performing a join given a join index. Stripe-join is inspired by an algorithm called ‘Jive-join’ developed by Li and Ross. Stripe-join makes a single sequential pass through each input relation, in addition to one pass through the join index and two passes through a set of temporary files that contain tuple identifiers but no input tuples. Stripe-join performs this efficiently even when the input relations are much larger than main memory, as long as the number of blocks in main memory is of the order of the square root of the number of blocks in the participating relations. Stripe-join is particularly efficient for self-joins. To our knowledge, Stripe-join is the first algorithm that, given a join index and a relation significantly larger than main memory, can perform a self-join with just a single pass over the input relation and without storing input tuples in intermediate files. Almost all the I/O is sequential, thus minimizing the impact of seek and rotational latency. The algorithm is resistant to data skew. It can also join multiple relations while still making only a single pass over each input relation. Using a detailed cost model, Stripe-join is analyzed and compared with competing algorithms. For large input relations, Stripe-join performs significantly better than Valduriez's algorithm and hash join algorithms. We demonstrate circumstances under which Stripe-join performs significantly better than Jive-join. Unlike Jive-join, Stripe-join makes no assumptions about the order of the join index.  相似文献   

4.
A query processing strategy which is based on pipelining and data-flow techniques is presented. Timing equations are developed for calculating the performance of four join algorithms: nested block, hash, sort-merge, and pipelined sort-merge. They are used to execute the join operation in a query in distributed fashion and in pipelined fashion. Based on these equations and similar sets of equations developed for other relational algebraic operations, the performance of query execution was evaluated using the different join algorithms. The effects of varying the values of processing time, I/O time, communication time, buffer size, and join selectively on the performance of the pipelined join algorithms are investigated. The results are compared to the results obtained by employing the same algorithms for executing queries using the distributed processing approach which does not exploit the vertical concurrency of the pipelining approach. These results establish the benefits of pipelining  相似文献   

5.
This paper presents a memory-constrained hash join algorithm (PaMeCo Join) designed to operate with main-memory column-store database systems. Whilst RAM has become more affordable and the popularity of main-memory database systems continues to grow, we recognize that RAM is a finite resource and that database systems rarely have an excess of memory available to them. Therefore, we design PaMeCo to operate within an arbitrary memory limitation by processing the input relations by parts, and by using a compact hash table that represents the contained tuples in a compact format. Coupled with a radix-clustering system that lowers memory latencies, we find that PaMeCo can offer competitive performance levels to other contemporary hash join algorithms in an unconstrained environment, while being up to three times faster than a high-performing hash join when memory constraints are applied.  相似文献   

6.
The cube adaptive-hash join algorithm, which combines the merits of nested-loop and hybrid-hash, is presented. The performance of these algorithms is compared through analytical cost modeling. The nonuniform data value distribution of the inner relation is shown to have a greater impact than that of the outer relation. The cube adaptive-hash algorithm outperforms the cube hybrid-hash algorithm when bucket overflow occurs. In the worst case, this algorithm converges to the cube nested-loop-hash algorithm. When there is no hash table overflow, the cube adaptive-hash algorithm converges to the cube hybrid-hash algorithm. Since the cube adaptive-hash algorithm adapts itself depending on the characteristics of the relations, it is relatively immune to the data distribution  相似文献   

7.
张宇  张延松  陈红  王珊 《软件学报》2017,28(3):490-501
众核架构协处理器Xeon Phi成为新兴的主流高性能计算平台.对于数据库应用而言,内存分析处理是一种计算密集型负载,其主要的性能取决于大事实表与维表之间的内存外键连接性能.本文关注于一种相对于缓存相关的分区哈希连接算法和缓存不相关的无分区哈希连接算法的缓存友好型外键连接算法,以适应Xeon Phi协处理器较小的LLC和高并发线程的特点.通过挖掘OLAP模式中的代理键特征,基于键值匹配的哈希探测操作可以进一步简化为事实表与维表之间基于主-外键参照完整性约束的代理键参照访问,因此复杂的哈希表和CPU代价较高的哈希探测操作可以简化为通过映射外键值为代理键向量内存偏移地址的方法对代理向量直接访问.基于代理向量参照访问的外键连接算法能够简单并高效地应用于Xeon Phi协处理器平台,通过更多的核心和高并发线程来掩盖内存访问延迟.实验中对传统的哈希连接算法(无分区哈希连接算法和基数分区哈希连接算法)和基于代理向量参照技术的外键连接算法在Xeon E5-2650 v3 10核处理器平台和Xeon Phi 5110P 60核协处理器平台进行性能测试和比较,实验结果给出了主流的内存外键连接算法在不同数据集和不同平台上全面的性能特征.  相似文献   

8.
Hash join is used to join large, unordered relations and operates independently of the data distributions of the join relations. Real-world data sets are not uniformly distributed and often contain significant skew. Although partition skew has been studied for hash joins, no prior work has examined how exploiting data skew can improve the performance of hash join. In this paper, we present histojoin, a join algorithm that uses histograms to identify data skew and improve join performance. Experimental results show that for skewed data sets histojoin performs significantly fewer I/O operations and is faster by 10–60% than hybrid hash join.  相似文献   

9.
基于并行B+-树的并行Join算法的设计、分析与实现   总被引:1,自引:0,他引:1  
B^+-树是一种有效的数据库存储结构,被普遍应用于各种关系数据库系统。把B^+-树并行化,使之用于并行数据库系统显然是一项很有意义的重要工作。本文研究了适用于并行数据库的并行B^+-树存储结构,提出两类基于并行B^+-树工并行Join算法。理论和实验结果表明,这些算法效率高基其它并行Join算法。  相似文献   

10.
In this paper we discuss the application of the dynamic clustering feature of hash to a relational data base machine. By partitioning the relation using hash, large load reductions in join and set operations are realized. Several machine architectures based on hash are presented. We propose a data base machineGRACE which adopts a novel relational algebraic processing algorithm based on hash and sort. Whereas conventional logic-per-track machines perform poorly in a join dominant environment,GRACE can execute join efficiently inO(N+M/K) time, whereN andM are the cardinalities of two relations andK the number of memory banks.  相似文献   

11.
基于共享Cache多核处理器的Hash连接优化   总被引:1,自引:0,他引:1  
邓亚丹  景宁  熊伟 《软件学报》2010,21(6):1220-1232
针对目前主流的多核处理器,研究了基于共享缓存多核处理器环境下的数据库Hash连接优化.首先提出基于Radix-Join算法的Hash连接多线程执行框架,通过实例分析了影响多线程Radix-Join算法性能的因素.在此基础上,优化了Hash连接多线程执行框架中的各种线程及其访问共享Cache的性能,优化了聚集连接时Hash连接算法的内存访问,并分析了多线程聚集划分的加速比.基于开源数据库INGRES和EaseDB,实现了所提出的连接多线程执行框架,在实验中测试了多线程Hash连接框架的性能.实验结果表明,该算法可以有效解决Hash连接执行时共享Cache在多线程条件下的访问冲突和处理器负载均衡问题,极大地提高了Hash连接性能.  相似文献   

12.
On applying hash filters to improving the execution of multi-join queries   总被引:1,自引:0,他引:1  
In this paper, we explore an approach of interleaving a bushy execution tree with hash filters to improve the execution of multi-join queries. Similar to semi-joins in distributed query processing, hash filters can be applied to eliminate non-matching tuples from joining relations before the execution of a join, thus reducing the join cost. Note that hash filters built in different execution stages of a bushy tree can have different costs and effects. The effect of hash filters is evaluat ed first. Then, an efficient scheme to determine an effective sequence of hash filters for a bushy execution tree is developed, where hash filters are built and applied based on the join sequence specified in the bushy tree so that not only is the reduction effect optimized but also the cost associated is minimized. Various schemes using hash filters are implemented and evaluated via simulation. It is experimentally shown that the application of hash filters is in general a very powerful means to improve th e execution of multi-join queries, and the improvement becomes more prominent as the number of relations in a query increases. Edited by G. Gardarin. Received October 1994 / Accepted December 1995  相似文献   

13.
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses how to find an optimal supernode size and optimal supernode relative side lengths of a supernode transformation (also known as tiling). We identify three parameters of supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, for sufficiently large supernodes and number of processors, and for the case where multiple supernodes are mapped to a single processor, we give an order n polynomial whose real positive roots include the optimal supernode size. For two special cases, 1) two-dimensional algorithm problems and 2) n-dimensional algorithm problems, where the communication cost is dominated by the startup penalty and, therefore, can be approximated by a constant, we give a closed form expression for the optimal supernode size, which is independent of the supernode relative side lengths and cutting hyperplanes. For the case where the algorithm iteration index space and the supernodes are hyperrectangular, we give closed form expressions for the optimal supernode relative side lengths. Our experiment shows a good match of the closed form expressions with experimental data  相似文献   

14.
A novel algorithm for relational equijoin is presented. The algorithm is a modification of merge join, but promises superior performance for medium-size inputs. In many cases it even compares favorably with both merge join and hybrid hash join, which is shown using analytic cost functions  相似文献   

15.
Analyzes the costs, and describes the implementation, of three hash-based join algorithms for a general purpose shared-memory multiprocessor. The three algorithms considered are the hashed loops, GRACE and hybrid algorithms. We also describe the results of a set of experiments that validate the cost models presented and demonstrate the relative performance of the three algorithms  相似文献   

16.
Meshing Streaming Updates with Persistent Data in an Active Data Warehouse   总被引:1,自引:0,他引:1  
Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream 5" of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of 5". We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.  相似文献   

17.
Presents a parallel hash join algorithm that is based on the concept of hierarchical hashing, to address the problem of data skew. The proposed algorithm splits the usual hash phase into a hash phase and an explicit transfer phase, and adds an extra scheduling phase between these two. During the scheduling phase, a heuristic optimization algorithm, using the output of the hash phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the hash partitions with the largest skew values and splits them as necessary, assigning each of them to an optimal number of processors. Assuming for concreteness a Zipf-like distribution of the values in the join column, a join phase which is CPU-bound, and a shared nothing environment, the algorithm is shown to achieve good join phase load balancing, and to be robust relative to the degree of data skew and the total number of processors. The overall speedup due to this algorithm is compared to some existing parallel hash join methods. The proposed method does considerably better in high skew situations  相似文献   

18.
胡伟  徐福缘  马庆国 《计算机科学》2012,39(2):198-202,221
外包数据库中的连接查询比范围查询更困难,因为客户端需要验证连接结果的一致性,而传统的对单个表的签名不能有效地支持对连接查询结果的直接验证。提出了使用哈希连接保证数据一致性的2种算法,它们分别通过在服务器端和客户端计算哈希函数来实现连接查询。给出了这2种算法的详细描述,证明了它们满足一致性要求,而且分析了算法在通信量、服务器端和客户端执行的代价。最后在实验中通过设置不同的参数对它们在服务器端和客户端的运行时间进行了比较,总结了它们在实际应用中的优缺点。  相似文献   

19.
Join processing in relational database systems continues to be a difficult and challenging problem. In this research, we propose a criss-cross hash join strategy that draws from both hashing and indexing techniques, inheriting the advantages of each. To facilitate the criss-cross hash join, a simple data structure, termed page map, is introduced. The page maps aid in reducing the hashing effort incurred in the current hash based join methods. Furthermore, the page maps implicitly capture and exploit the possible inherent order among tuples in the relations, however partial it may be, to achieve superior performance. As the proposed methodology relies on the hashing scheme, the page maps are simpler, more compact, and easier to maintain than the traditional data structures associated with index based join methods. We develop the ideas intuitively first, followed by a formal development of the concepts and the algorithms. A detailed probabilistic analysis of the algorithms is presented and their performance is assessed through extensive empirical investigations. The empirical analysis suggests significant performance improvements over the current state-of-the-art hybrid hash method, especially in the presence of possible inherent order  相似文献   

20.
空间数据库中连接运算的处理与优化   总被引:7,自引:0,他引:7       下载免费PDF全文
空间数据库的性能问题严重制约了它的应用与发展 .由于空间连接运算是空间数据库中最复杂、最耗时的基本操作 ,因此其处理效率在很大程度上决定了空间数据库的整体性能 .尽管目前已经有许多空间连接算法 ,但空间连接运算的代价估计和查询优化仍然有待进一步研究 .众所周知 ,大部分空间连接算法都是基于 R树索引实现的 ,如果参与空间连接运算的关系上没有索引或只有部分索引 ,那么就需要使用特殊的算法来处理 .另外 ,各种算法的代价评估模型需要一个相对统一的计算方法 ,实践证明 ,根据空间数据库的实际情况 ,使用 I/ O代价来估计算法的复杂性较为合理 .在此基础上 ,针对复杂的空间查询中可能出现多个关系参与空间连接运算的情况 ,故还需要合理地应用动态编程算法来找出代价最优的连接顺序 ,以便最终形成一个通用的算法框架 .通过对该算法框架的复杂性分析可以看出 ,在此基础上实现的空间数据库查询优化系统将具有较高的时空效率 ,并且能够处理非常复杂的空间查询  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号