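XML Structural Join Based on Region Partitioning   Total citations: 22, self-citations: 7, citations by others: 22
WANG Jing (王静), MENG Xiaofeng (孟小峰), WANG Shan (王珊). Journal of Software (《软件学报》), 2004, 15(5): 720-729
Structural join is the core operation of XML query processing and has attracted attention from the research community. Efficient algorithms are the key to efficient query processing. Many structural join algorithms have been proposed, and most of them rest on one of the following preconditions: the input element sets are either indexed or sorted. When neither condition holds, the cost of sorting the input or building an index on the fly degrades the performance of these algorithms considerably. Based on this observation, a structural join algorithm based on region partitioning is proposed. The algorithm follows the idea of task decomposition and exploits the properties of region encoding to partition the input sets. A detailed algorithm design is given, and the I/O complexity of the algorithm is analyzed. Extensive experimental results show that the algorithm performs well, outperforms existing sort-merge algorithms when the input is unsorted or unindexed, and gives the query planner more options.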
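
As a rough illustration of the region-encoding idea in the abstract above, the following Python sketch labels each element with a (start, end) region and joins two unsorted inputs by bucketing descendants on their start position. The bucketing scheme, the `width` parameter, and all names are assumptions made for illustration; they are not the partitioning algorithm of the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Elem(NamedTuple):
    """An XML element labeled with a region code (start, end)."""
    name: str
    start: int
    end: int


def is_ancestor(a: Elem, d: Elem) -> bool:
    # Region containment: a is an ancestor of d iff a's region strictly encloses d's.
    return a.start < d.start and d.end < a.end


def partitioned_structural_join(ancestors, descendants, width):
    """Join unsorted inputs by bucketing descendants on start // width.

    Hypothetical scheme: neither input needs to be sorted or indexed.
    """
    buckets = defaultdict(list)
    for d in descendants:
        buckets[d.start // width].append(d)

    result = []
    for a in ancestors:
        # A descendant's start lies inside a's region, so only the buckets
        # that overlap [a.start, a.end] can contain matches.
        for b in range(a.start // width, a.end // width + 1):
            for d in buckets.get(b, ()):
                if is_ancestor(a, d):
                    result.append((a, d))
    return result
```

Each descendant lives in exactly one bucket, so no pair is reported twice, and buckets outside an ancestor's region are never scanned.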

Semijoin has traditionally been relied upon to reduce the cost of data transmission for distributed query processing. However, judiciously applying join operations as reducers can lead to further reduction in the amount of data transmission required. In view of this fact, we explore the approach of using join operations as reducers in distributed query processing. We first show that the problem of determining a sequence of join operations for a query can be transformed to that of finding a specific type of set of cuts to the corresponding query graph, where a cut to a graph is a partition of nodes in that graph. Then, in light of this concept, we prove that the problem of determining the optimal sequence of join operations for a given query graph is of exponential complexity, thus justifying the necessity of applying heuristic approaches to solve this problem. By mapping the problem of determining a sequence of join reducers into the one of finding a set of cuts, we develop (for tree and general query graphs, respectively) efficient heuristic algorithms to determine a join reducer sequence for distributed query processing. The algorithms developed are based on the concept of divide and conquer and are of polynomial time complexity. Simulation is performed to evaluate these algorithms.

The pipelined execution of multijoin queries in a multiprocessor-based database system is explored in this paper. Using hash-based joins, multiple joins can be pipelined so that the early results from a join, before the whole join is completed, are sent to the next join for processing. The execution of a query is usually denoted by a query execution tree. To improve the execution of pipelined hash joins, an innovative approach to query execution tree selection is proposed to exploit segmented right-deep trees, which are bushy trees of right-deep subtrees. We first derive an analytical model for the execution of a pipeline segment, and then, in the light of the model, we develop heuristic schemes to determine the query execution plan based on a segmented right-deep tree so that the query can be efficiently executed. As shown by our simulation, the proposed approach, without incurring additional overhead on plan execution, possesses more flexibility in query plan generation, and can lead to query plans of better performance than those achievable by the previous schemes using right-deep trees.
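
The pipelining described in the abstract above can be sketched with Python generators: the hash tables for all build inputs are constructed first, and each probe row then flows through every join before the next row is read. This is a minimal single-process sketch of one right-deep pipeline segment; the row format (dicts), the function names, and the join keys are assumptions, and the segmented-tree selection heuristics of the paper are not modeled.

```python
from collections import defaultdict


def build(rows, key):
    """Build an in-memory hash table on the join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def probe(stream, table, key):
    """Lazily join each incoming row against a hash table."""
    for row in stream:
        for match in table.get(row[key], ()):
            # Emit the joined row immediately so the next join can consume it.
            yield {**match, **row}


def pipelined_joins(probe_rows, build_sides):
    """Chain hash joins; build_sides is a list of (rows, key) pairs."""
    stream = iter(probe_rows)
    for rows, key in build_sides:
        # build() runs eagerly here; probe() only runs when rows are pulled.
        stream = probe(stream, build(rows, key), key)
    return stream
```

No intermediate join result is materialized: the consumer pulls one row at a time through the whole chain.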

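DING Xiangwu (丁祥武), LI Zitong (李子通). Computer Science (《计算机科学》), 2016, 43(11): 265-271, 308
The integrated multi-core CPU-GPU architecture has become the direction in which processor chips are developing, and using its parallel computing power for data processing has become a research hotspot in the database field. To improve the query performance of column-store systems, the load distribution strategy of an existing co-processing mechanism is first improved: by monitoring the CPU utilization of the database system, data is dynamically partitioned among the processors in a reasonable way. Then, for the data prefetching mechanism on the integrated multi-core CPU-GPU architecture, a model for determining the prefetch size is proposed, and GPU memory access is optimized according to its characteristics. Finally, using OpenCL as the programming language, a column-store sort-merge join algorithm is implemented on the integrated multi-core CPU-GPU architecture, and the proposed methods are applied to optimize join processing. Experiments show that the proposed optimization strategies improve the sort-merge join performance of the column-store system by 33%.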

Massive XML data are increasingly generated for the representation, storage and exchange of web information. Twig query processing over massive XML data has become a research focus. However, most traditional algorithms cannot be directly implemented in a distributed manner. Some of the existing distributed algorithms generate a lot of useless intermediate results and execute many join operations of partial results in most cases; others require a priori knowledge of the query pattern before XML partitioning, storage and query processing, which is impractical in the cases of large-scale data or frequent incoming new queries. To improve efficiency and scalability, in this paper, we propose a 3-phase distributed algorithm DisT3 based on a node distribution mechanism to avoid unnecessary intermediate results. Furthermore, we propose a lightweight local index ReP with an enhanced XML partitioning approach using an arbitrary partitioning strategy, and based on ReP we propose an improved 2-phase distributed algorithm DisT2ReP to further reduce the communication cost. After the performance guarantees are analyzed, extensive experiments are conducted to verify the efficiency and scalability of our proposed algorithms in distributed twig query applications.

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing site for query execution. This typically introduces high communication overhead. Our observation is that semijoin, effective in reducing communication overhead in distributed database query processing, can be also effective in distributed stream query processing. The challenge, however, lies in the streaming nature of the tuples, as it requires continuous and incremental processing of an unbounded sequence of tuples instead of one-time processing of a set of stored tuples. This paper describes our comprehensive work done to address the challenge. Specifically, we first propose a distributed stream join processing model that handles the issue of network delays introduced from the shipment of data streams, and allows for efficient batch processing. Then, based on the model, we propose join algorithms in a multi-way join case: first, one-way join algorithms for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering. Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different possible semijoin strategies which incur different communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose an optimization algorithm. The join algorithms are executed continuously in an incremental manner as tuples arrive, and never ship tuples redundantly. 
The optimization algorithm constructs an efficient multi-way join plan by using a greedy heuristic which adds to the plan one stream with the minimum join execution cost in each step. Through extensive experiments, we conduct comparative studies of the performance among the proposed one-way join algorithms and the efficiency of the generated plan between the optimization algorithm based on the greedy heuristic and the exhaustive search, respectively.

As RDF data continue to gain popularity, we witness the fast growing trend of RDF datasets in both the number of RDF repositories and the size of RDF datasets. Many known RDF datasets contain billions of RDF triples (subject, predicate and object). One of the grand challenges for managing these huge RDF data is how to execute RDF queries efficiently. In this paper, we address the query processing problems against the billion triple challenges. We first identify some causes for the problems of existing query optimization schemes, such as large intermediate results and initial query cost estimation errors. Then, we present our block-oriented dynamic query plan generation approach powered with pipelining execution. Our approach consists of two phases. In the first phase, a near-optimal execution plan for queries is chosen by identifying the processing blocks of queries. We group the join patterns sharing a join variable into building blocks of the query plan since executing them first provides opportunities to reduce the size of intermediate results generated. In the second phase, we further optimize the initial pipelining for a given query plan. We employ optimization techniques, such as sideways information passing and semi-join, to further reduce the size of intermediate results, improve the query processing cost estimation and speed up the performance of query execution. Experimental results on several RDF datasets of over a billion triples demonstrate that our approach outperforms existing RDF query engines that rely on dynamic programming based static query processing strategies.

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor.

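Distributed processing is an inevitable trend in the development of data stream management systems. This paper studies join queries over distributed data streams and proposes the DM3Join algorithm, which consists of two parts. First, concurrent join requests are decomposed and identical join predicates are merged to form distributed query operators. Second, data streams flow through the distributed agents to perform partial joins, and the partial results are combined into the final result at the query engine. DM3Join executes window joins using a structure similar to a routing table; because intermediate results can be shared, the algorithm needs to scan the data only once. Analysis and experiments show that the join algorithm is efficient.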

The problem of combining join and semijoin reducers for distributed query processing is studied. An approach based on interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree first. The join sequence tree provides an efficient way to identify for each semijoin its correlated semijoins as well as its reducible relations under the join sequence. In light of these properties, an algorithm for determining an effective sequence of join and semijoin reducers is developed. Examples are given to illustrate the results. They show the advantage of using a combination of joins and semijoins as reducers for distributed query processing.
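
For readers unfamiliar with semijoin reducers, the following Python sketch shows the basic mechanism these abstracts build on: only the join-column values of one relation are shipped to the other site, and rows that cannot join are dropped before the full rows are transmitted. Relations are modeled as lists of dicts, and the function names are hypothetical; the reducer-sequencing algorithm of the paper is not reproduced here.

```python
def semijoin(r, s, key):
    """R semijoin S: keep the rows of r whose key value appears in s."""
    # In a distributed setting this set is the projection shipped from S's site.
    s_keys = {row[key] for row in s}
    return [row for row in r if row[key] in s_keys]


def join(r, s, key):
    """Plain equi-join of two lists of dict rows on a shared key."""
    index = {}
    for row in s:
        index.setdefault(row[key], []).append(row)
    return [{**a, **b} for a in r for b in index.get(a[key], ())]


def join_with_reducer(r, s, key):
    """Reduce r by a semijoin first, then join the reduced relation with s."""
    reduced = semijoin(r, s, key)
    # Only len(reduced) rows, rather than len(r), need to cross the network.
    return reduced, join(reduced, s, key)
```

The reducer never changes the join result; it only shrinks what has to be shipped, which is why the choice and order of reducers is a pure cost question.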

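To address spatial join query processing in cloud environments, a Spark-based multi-way spatial join query processing algorithm, BSMWSJ, is proposed. The algorithm uses grid partitioning to divide the whole data space into grid cells of equal size and assigns the spatial objects of each dataset to the corresponding cells according to their spatial positions; the spatial objects in different grid cells are then joined in parallel. During multi-way spatial join processing, boundary filtering is used to discard useless data: the MBR of the candidate results of the preceding join operations is computed and used to filter the datasets of subsequent joins, which eliminates useless join objects and reduces redundant projection and replication of join objects. A duplicate-avoidance strategy reduces the output of duplicate results, further lowering the cost of subsequent join computation. Extensive experiments on synthetic and real datasets show that the proposed algorithm clearly outperforms existing multi-way join query processing algorithms.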
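
The grid partitioning and duplicate avoidance mentioned in the abstract above can be illustrated with a small two-way, single-machine Python sketch. It uses the common reference-point technique (a pair is reported only in the cell that contains the lower-left corner of the intersection); whether BSMWSJ uses exactly this rule is an assumption, and the Spark parallelism and MBR boundary filtering across multiple joins are not modeled. MBRs are `(x1, y1, x2, y2)` tuples and `size` is a hypothetical cell width.

```python
from collections import defaultdict
from itertools import product


def cells(mbr, size):
    """All grid cells (column, row) that an MBR overlaps."""
    x1, y1, x2, y2 = mbr
    return product(range(int(x1 // size), int(x2 // size) + 1),
                   range(int(y1 // size), int(y2 // size) + 1))


def intersects(a, b):
    """True if two MBRs overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def grid_spatial_join(rs, ss, size):
    """Join two MBR sets cell by cell, reporting each pair exactly once."""
    grid_r, grid_s = defaultdict(list), defaultdict(list)
    for r in rs:
        for c in cells(r, size):
            grid_r[c].append(r)
    for s in ss:
        for c in cells(s, size):
            grid_s[c].append(s)

    out = []
    for c, r_list in grid_r.items():  # each cell could be a parallel task
        for r in r_list:
            for s in grid_s.get(c, ()):
                if intersects(r, s):
                    # Reference point: lower-left corner of the intersection.
                    ref = (int(max(r[0], s[0]) // size),
                           int(max(r[1], s[1]) // size))
                    if ref == c:  # skip copies found in other shared cells
                        out.append((r, s))
    return out
```

Objects that span several cells are replicated into each of them, so the reference-point check is what keeps a pair from being emitted once per shared cell.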

Web search engines need to provide high throughput and short query latency. Recent results show that pipelined query processing over a term-wise partitioned inverted index may have superior throughput. However, query processing latency and scalability with respect to the collection size are the main challenges associated with this method. In this paper, we evaluate the effect of inverted index skipping on the performance of pipelined query processing. Further, we introduce a novel idea of using Max-Score pruning within pipelined query processing and a new term assignment heuristic, partitioning by Max-Score. Our current results indicate a significant improvement over the state-of-the-art approach and lead to several further optimizations which include dynamic load balancing, intra-query concurrent processing and a hybrid combination between pipelined and non-pipelined execution. Lastly, we show how the state of term-wise partitioning relates to the industry standard document-wise partitioning. Even though there are situations where pipelined query processing is advantageous, document-wise partitioning is still the road to follow.

The authors discuss various performance issues in distributed query processing. They validate and evaluate the performance of the local reduction (LR), the fragment and replicate strategy (FRS), and the partition and replicate strategy (PRS) optimization algorithms. The experimental results reveal that the choices made by these algorithms concerning which local operations should be performed, which relation should remain fragmented, or which relation should be partitioned are valid. It is shown using experimental results that various parameters, such as the number of processing sites, partitioning speed relative to join speed, and sizes of the join relations, affect the performance of PRS significantly. It is also shown that the response times of query execution are affected significantly by the degree of site autonomy, interferences among processes, the interface with the local database management systems (DBMSs), and communications facilities. Pipeline strategies for processing queries in an environment where relations are fragmented are studied.

The application of a combination of join and semi-join operations to minimize the amount of data transmission required for distributed query processing is discussed. Specifically, two important concepts that occur with the use of join operations as reducers in query processing, namely, gainful semi-joins and pure join attributes, are used. Some semi-joins, though not profitable themselves, may benefit the execution of subsequent join operations as reducers. Such a semi-join is termed a gainful semi-join. In addition, join attributes that are not part of the output attributes are referred to as pure join attributes. The authors exploit the usefulness of gainful semi-joins and use the removability of pure join attributes to reduce the amount of data transmission required for query processing. Heuristic searches are developed to determine a sequence of join and semi-join reducers for query processing. Results indicate the importance of the approach to combining joins and semi-joins for distributed query processing.

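GAO Jintao (高锦涛), LI Zhanhuai (李战怀), DU Hongtao (杜洪涛), LIU Wenjie (刘文洁). Journal of Software (《软件学报》), 2019, 30(11): 3364-3381
Sort-merge join is an important join implementation in database systems and is more widely applicable than hash join. In a distributed environment, data is sharded and stored across nodes, and with expensive network costs, performing an efficient sort-merge join is very challenging. The traditional strategy first sorts the join data and then performs a merge join on the sorted data. Both steps operate on the raw data, which usually contains useless data blocks: blocks that need not take part in the join but add overhead, including network overhead. As the data volume grows, the probability of useless blocks increases and so does the overhead. The traditional strategy does not handle these blocks in advance. To address this problem, a pruning-based parallel sort-merge join strategy for distributed environments is proposed (parallel sort-merge join based on prune, Pr_PSMJ). Its distinguishing feature is that useless data blocks of the join inputs are pruned efficiently before the join takes place, which improves overall join efficiency. The basic idea is as follows. From the statistics of the join partitions of the join inputs, a bilateral adjacency list (BAL) is built and used to prune useless data blocks while guaranteeing the correctness of the final join result. After pruning, the BAL is used to compute the best local join execution nodes and to guide the migration of partition data so that data movement is minimized. In the join phase, because the BAL guarantees that the local join execution nodes are independent, the whole join can readily run in parallel, and each node performs a local parallel sort-merge join using its multi-core environment. Finally, the local results are merged into the final result. Because the pruning in Pr_PSMJ is completed before the join executes, it suits almost any merge join operation and is also instructive for other join strategies. The correctness, efficiency, and adaptability of the Pr_PSMJ-based algorithm are analyzed, and experiments verify that, for distributed sort-merge joins over large data volumes, Pr_PSMJ reduces network overhead and improves join efficiency compared with other strategies.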
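
The pruning step described in the abstract above can be approximated in a few lines of Python. The sketch keeps only per-block min/max key statistics, links left and right blocks whose key ranges overlap, drops blocks with no partner, and then runs an ordinary sort-merge join on what remains. The adjacency dictionary is a simplified stand-in for the paper's bilateral adjacency list, and the block layout (non-empty lists of integer keys) and function names are assumptions; node placement and data migration are not modeled.

```python
def key_range(block):
    """Per-block statistics: smallest and largest join key."""
    return min(block), max(block)


def overlapping_pairs(left_blocks, right_blocks):
    """Map each left block index to the right block indexes it may join with."""
    adj = {}
    for i, lb in enumerate(left_blocks):
        lo1, hi1 = key_range(lb)
        for j, rb in enumerate(right_blocks):
            lo2, hi2 = key_range(rb)
            if lo1 <= hi2 and lo2 <= hi1:  # key ranges overlap
                adj.setdefault(i, []).append(j)
    return adj


def merge_join(left, right):
    """Classic sort-merge equi-join on key lists, handling duplicate keys."""
    left, right = sorted(left), sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            k, i2, j2 = left[i], i, j
            while i2 < len(left) and left[i2] == k:
                i2 += 1
            while j2 < len(right) and right[j2] == k:
                j2 += 1
            out.extend([(k, k)] * ((i2 - i) * (j2 - j)))
            i, j = i2, j2
    return out


def pruned_sort_merge_join(left_blocks, right_blocks):
    """Drop blocks that cannot contribute, then sort-merge join the rest."""
    adj = overlapping_pairs(left_blocks, right_blocks)
    kept_right = sorted({j for js in adj.values() for j in js})
    left = [k for i in adj for k in left_blocks[i]]
    right = [k for j in kept_right for k in right_blocks[j]]
    return merge_join(left, right)
```

Any matching pair of keys forces the ranges of its two blocks to overlap, so a block without a partner can be discarded without changing the join result.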

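Target-Node-Oriented XML Path Query Processing   Total citations: 14, self-citations: 4, citations by others: 14
WANG Jing (王静), MENG Xiaofeng (孟小峰), WANG Yu (王宇), WANG Shan (王珊). Journal of Software (《软件学报》), 2005, 16(5): 827-837
XML query languages take complex path expressions as their core. To speed up path expression processing, strategies based on path decomposition and structural join operations need deeper study. A target-node-oriented XML path query processing framework is proposed. The method uses extended basic operations to reduce the number of join operations. During path decomposition and query plan selection, the target node in the query tree is used to avoid passing intermediate results along. Besides decomposition rules and strategies, a set of extended basic operations and their implementation algorithms are proposed. Preliminary experimental results show that the method performs well and offers more options for path query processing.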

A multidatabase system (MDBS) allows the users to simultaneously access heterogeneous and autonomous databases using an integrated schema and a single global query language. The query optimization problem in MDBSs is quite different from the query optimization problem in distributed homogeneous databases due to schema heterogeneity and autonomy of local database systems. In this work, we consider the optimization of query distribution in case of data replication and the optimization of intersite joins, that is, the join of the results returned by the local sites in response to the global subqueries. The algorithms presented for the optimization of intersite joins try to maximize the parallelism in execution and take the federated nature of the problem into account. It has also been shown through a comparative performance study that the proposed intersite join optimization algorithms are efficient. The approach presented can easily be generalized to any operation required for intersite query processing. The query optimization scheme presented in this paper is being implemented within the scope of a multidatabase system which is based on OMG's object management architecture.

The basic concept of pipelined data-parallel algorithms is introduced by contrasting the algorithms with other styles of computation and by a simple example (a pipeline image distance transformation algorithm). Pipelined data-parallel algorithms are a class of algorithms which use pipelined operations and data level partitioning to achieve parallelism. Applications which involve data parallelism and recurrence relations are good candidates for this kind of algorithm. The computations are ideal for distributed-memory multicomputers. By controlling the granularity through data partitioning and overlapping the operations through pipelining, it is possible to achieve a balanced computation on multicomputers. An analytic model is presented for modeling pipelined data-parallel computation on multicomputers. The model uses timed Petri nets to describe data pipelining operations. As a case study, the model is applied to a pipelined matrix multiplication algorithm. Predicted results match closely with the measured performance on a 64-node NCUBE hypercube multicomputer.

We consider adaptive index utilization as a fine-grained problem in autonomic databases in which an existing index is dynamically determined to be used or not in query processing. As a special case, we study this problem for structural joins, the core operator in XML query processing, in the main memory. We find that index utilization is beneficial for structural joins only under certain join selectivity and distribution of matching elements. Therefore, we propose adaptive algorithms to decide whether to use an index probe or a data scan for each step of matching during the processing of a structural join operator. Our adaptive algorithms are based on the history, the look-ahead information, or both. We have developed a cost model to facilitate this adaptation and have conducted experiments with both synthetic and real-world data sets. Our results show that adaptively utilizing indexes in a structural join improves the performance by taking advantage of both sequential scans and index probes.

Distributed query processing algorithms usually perform data reduction by using a semijoin program, but the problem with these approaches is that they still require an explicit join of the reduced relations in the final phase. We introduce an efficient algorithm for join processing in distributed database systems that makes use of bipartite graphs in order to reduce data communication costs and local processing costs. The bipartite graphs represent the tuples that can be joined in two relations taking also into account the reduction state of the relations. This algorithm fully reduces the relations at each site. We then present an adaptive algorithm for response time optimization that takes into account the system configuration, i.e., the additional resources available and the data characteristics, in order to select the best strategy for response time minimization. We also report on the results of a set of experiments which show that our algorithms outperform a number of the recently proposed methods for total processing time and response time minimization.
