
1.

A top-k spatial join querying processing algorithm based on spark

《Information Systems》2020

Aiming at the problem of top-k spatial join query processing in cloud computing systems, a Spark-based top-k spatial join (STKSJ) query processing algorithm is proposed. In this algorithm, the whole data space is divided into grid cells of the same size by a grid partitioning method, and each spatial object in one data set is projected into a grid cell. The Minimum Bounding Rectangle (MBR) of all spatial objects in each grid cell is computed. The spatial objects overlapping with these MBRs in another spatial data set are replicated to the corresponding grid cells, thereby filtering out spatial objects for which there are no join results, thus reducing the cost of subsequent spatial join processing. An improved plane sweeping algorithm is also proposed that speeds up the scanning mode and applies threshold filtering, thus greatly reducing the communication and computation costs of intermediate join results in subsequent top-k aggregation operations. Experimental results on synthetic and real data sets show that the proposed algorithm has clear advantages, and better performance than existing top-k spatial join query processing algorithms.
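
The grid partitioning and MBR filtering step described in this abstract can be sketched roughly as follows. This is a minimal single-machine illustration, not the STKSJ implementation: it assumes the first data set consists of points and the second of axis-aligned rectangles `(xmin, ymin, xmax, ymax)`, and the function names `cell_of`, `cell_mbrs`, `overlaps`, and `replicate` are hypothetical. The plane sweep, threshold filtering, and top-k aggregation are not shown.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Map an (x, y) point to the integer coordinates of its grid cell."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def cell_mbrs(points, cell_size):
    """Group points by grid cell and return each non-empty cell's MBR
    as (xmin, ymin, xmax, ymax)."""
    groups = defaultdict(list)
    for p in points:
        groups[cell_of(p, cell_size)].append(p)
    mbrs = {}
    for cell, pts in groups.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        mbrs[cell] = (min(xs), min(ys), max(xs), max(ys))
    return mbrs


def overlaps(a, b):
    """True if two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def replicate(rects, mbrs):
    """Copy each rectangle of the second data set to every cell whose MBR
    it overlaps. Rectangles that overlap no MBR are dropped, because they
    cannot contribute to any join result."""
    assigned = defaultdict(list)
    for r in rects:
        for cell, mbr in mbrs.items():
            if overlaps(r, mbr):
                assigned[cell].append(r)
    return dict(assigned)
```

Filtering against the per-cell MBR rather than the full cell extent is what lets objects in sparsely populated parts of a cell be discarded before the join itself runs.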

2.

Join and data redistribution algorithms for hypercubes

Baru C.K. Padmanabhan S. 《Knowledge and Data Engineering, IEEE Transactions on》1993,5(1):161-168

An important aspect of database processing in parallel computer systems is the use of data parallel algorithms. Several parallel algorithms for the relational database join operation in a hypercube multicomputer system are given. The join algorithms are classified as cycling or global partitioning based on the tuple distribution method employed. The various algorithms are compared under a common framework, using time complexity analysis as well as an implementation on a 64-node NCUBE hypercube system. In general, the global partitioning algorithms demonstrate better speedup. However, the cycling algorithm can perform better than the global algorithms in specific situations, viz., when the difference in input relation cardinalities is large and the hypercube dimension is small. The usefulness of the data redistribution operation in improving the performance of the join algorithms, in the presence of uneven data partitions, is examined. The results indicate that redistribution significantly decreases the join algorithm execution times for unbalanced partitions.

3.

Distributed stream join under workload variance

Junhua Fang Rong Zhang Xiaotong Wang Aoying Zhou 《World Wide Web》2017,20(5):1089-1110

Flexible and self-adaptive stream join processing plays an important role in parallel shared-nothing environments. The Join-Matrix model is a high-performance model that is resilient to data skew and supports arbitrary join predicates because it uses random tuple distribution as its routing policy. To maximize system throughput and minimize network communication cost, a scalable partitioning scheme on the matrix is critical. In this paper, we present a novel flexible and adaptive scheme partitioning model for the stream join operator, which ensures high throughput with economical resource usage by allocating resources on demand. Specifically, a lightweight scheme generator, which requires a sample of each stream's volume and the processing resource quota of each physical machine, generates a join scheme; then a migration plan generator decides how to migrate data among machines so as to minimize migration cost while ensuring correctness. We conduct extensive experiments on different kinds of join workloads, and the evaluation shows that the approach is highly competitive with baseline systems on benchmark data and real data.

4.

Utilizing page-level join index for optimization in parallel join execution

Chiang Lee Zue-An Chang 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(6):900-914

This paper presents a methodology for the optimization of parallel join execution. Past research on parallel join methods mostly focused on the design of algorithms for partitioning relations (e.g., by hashing) and distributing data buckets as evenly as possible to the processors. Once data is distributed to the processors, such work assumes that all processors will complete their tasks at about the same time. We stress that this is true only if no further information, such as a page-level join index, is available. Otherwise, the join execution can be further optimized, and the workload in the processors may still be unbalanced. We study the problems that may arise in a shared-nothing architecture environment and propose algorithms for them. A simulation study is also performed to understand the characteristics of the proposed method.

5.

An effective dynamic load-balancing join algorithm for parallel databases (一种有效的并行数据库动态负载平衡连接算法)

关心欧增桂王玲《计算机工程与应用》2007,43(12):150-154

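In parallel databases built on a shared-nothing architecture, load balancing has long been an important factor affecting query processing performance. The join operation, which is used frequently in databases, can degrade overall database performance because of load skew and extra communication overhead arising from various factors. A dynamic load-balancing join algorithm based on the RCMD distribution method is proposed, which can dynamically adjust the load of each node while the join is executing. Theoretical analysis and experimental results show that the proposed algorithm balances load effectively and improves the execution efficiency of parallel databases.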

6.

Parallel Star Join+DataIndexes: efficient query processing in data warehouses and OLAP

Datta A. VanderMeer D. Ramamritham K. 《Knowledge and Data Engineering, IEEE Transactions on》2002,14(6):1299-1316

On-line analytical processing (OLAP) refers to the technologies that allow users to efficiently retrieve data from the data warehouse for decision-support purposes. Data warehouses tend to be extremely large; it is quite possible for a data warehouse to be hundreds of gigabytes to terabytes in size (Chaudhuri and Dayal, 1997). Queries tend to be complex and ad hoc, often requiring computationally expensive operations such as joins and aggregation. Given this, we are interested in developing strategies for improving query processing in data warehouses by exploring the applicability of parallel processing techniques. In particular, we exploit the natural partitionability of a star schema and render it even more efficient by applying DataIndexes, a storage structure that serves as both an index and data and lends itself naturally to vertical partitioning of the data. DataIndexes are derived from the various special-purpose access mechanisms currently supported in commercial OLAP products. Specifically, we propose a declustering strategy that incorporates both task and data partitioning and present the Parallel Star Join (PSJ) Algorithm, which provides a means to perform a star join in parallel using efficient operations involving only rowsets and projection columns. We compare the performance of the PSJ Algorithm with two parallel query processing strategies. The first is a parallel join strategy utilizing the Bitmap Join Index (BJI), arguably the state-of-the-art OLAP join structure in use today. For the second strategy we choose a well-known parallel join algorithm, namely the pipelined hash algorithm. To assist in the performance comparison, we first develop a cost model of the disk access and transmission costs for all three approaches.

7.

A graph theoretical approach to determine a join reducer sequence in distributed query processing

Ming-Syan Chen Yu P.S. 《Knowledge and Data Engineering, IEEE Transactions on》1994,6(1):152-165

Semijoin has traditionally been relied upon to reduce the cost of data transmission for distributed query processing. However, judiciously applying join operations as reducers can lead to further reduction in the amount of data transmission required. In view of this fact, we explore the approach of using join operations as reducers in distributed query processing. We first show that the problem of determining a sequence of join operations for a query can be transformed to that of finding a specific type of set of cuts to the corresponding query graph, where a cut to a graph is a partition of nodes in that graph. Then, in light of this concept, we prove that the problem of determining the optimal sequence of join operations for a given query graph is of exponential complexity, thus justifying the necessity of applying heuristic approaches to solve this problem. By mapping the problem of determining a sequence of join reducers into the one of finding a set of cuts, we develop (for tree and general query graphs, respectively) efficient heuristic algorithms to determine a join reducer sequence for distributed query processing. The algorithms developed are based on the concept of divide and conquer and are of polynomial time complexity. Simulation is performed to evaluate these algorithms.

8.

A parallel multi-way spatial join processing method (一种并行多路空间连接处理方法)

刘宇朱仲英施颂椒《小型微型计算机系统》2001,22(9):1092-1095

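Spatial join is the most time-consuming and most important spatial query, and a multi-way spatial join is a join query involving multiple spatial relations. The efficiency of sequential spatial join processing remains unsatisfactory, so using parallelism to improve spatial join efficiency has become an attractive research direction. Parallel spatial join processing consists of three phases: task creation, task assignment, and parallel task execution. This paper proposes a new plane-sweep method for the task creation phase of multi-way parallel processing, then proposes a dynamic task assignment strategy based on cost estimation, gives the cost model, and extends it to multi-way parallel join query processing to achieve load balance.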

9.

Load-Balanced Join Processing in Shared-Nothing Systems

Lu H. J. Tan K. L. 《Journal of Parallel and Distributed Computing》1994,23(3)

In a shared-nothing parallel database system, a join operation is split into a set of tasks that are allocated to the nodes in the system to be executed concurrently and independently. While parallel processing could greatly reduce the completion time of a join operation, the system performance may degrade because of load imbalance across the nodes caused by data skew in the relations. Load-balanced join processing uses various techniques to evenly distribute the load among nodes in a system and hence improves the overall system performance. In this paper, the basic issues in designing load-balanced parallel join algorithms are identified. From the solutions to those issues, a large set of load-balanced join algorithms can be constructed. Performance of four representative algorithms (two dynamic load-balancing algorithms proposed in this paper and two static load-balancing algorithms adapted from similar algorithms in the literature) is studied and compared with that of a parallel join algorithm that does not balance the join load. The results of our study clearly show the benefits of load balancing. This study also demonstrates that the dynamic load-balancing techniques proposed in this paper not only are feasible but also provide good system performance.

10.

A study of the I/O cost of MapReduce join queries (MapReduce连接查询的I/O代价研究)

宋杰李甜甜朱志良鲍玉斌于戈《软件学报》2015,26(6):1438-1456

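The exponential growth of data poses serious challenges for data management and analysis. The join query is a common operation in data analysis, and MapReduce is a programming model for parallel processing of large-scale data sets, so studying cost estimation and query optimization for MapReduce-based join queries has both academic significance and practical value. The performance of MapReduce join algorithms depends mainly on I/O cost (both local and network I/O), which in turn depends on characteristic parameters of the data sets and of the join operation; estimating the I/O cost of two-way joins makes it possible to optimize multi-way join execution plans. On this basis, an I/O cost model for two-way join queries is first proposed. Existing two-way join algorithms are then formally defined and slightly extended, yielding six MapReduce-based join algorithms whose I/O cost functions are defined through white-box analysis of the algorithms. Finally, an algorithm for selecting the optimal execution plan for multi-way joins is proposed. Experiments show that the I/O cost model is correct and accurately reflects the relative performance of the algorithms.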

11.

Histogram-based parallel structural join algorithms (基于直方图的并行结构连接算法)

李建新王国仁汤南王斌于亚新张海宁《计算机研究与发展》2004,41(10):1768-1773

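The join is the most expensive and one of the most frequently used database operations. In traditional database systems the main join operation is the equijoin, so traditional parallel join algorithms concentrate on parallel equijoins. Meanwhile, as XML has become increasingly important in Web applications, it has become a new standard for data exchange on the Internet. Joins over XML data differ from the equijoins of traditional databases: they are structural joins. Parallel join algorithms designed for equijoins cannot solve the structural join problem effectively. The parallel structural join problem is therefore posed for the first time, and by applying the idea of histograms to parallel joins, two basic parallel XML structural join algorithms are proposed: an equi-depth histogram join algorithm and an equi-width histogram join algorithm. Experiments show that both algorithms perform well.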
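
The difference between the two histogram types named in this abstract can be illustrated with a small sketch. The abstract does not say how the histograms drive the parallel join, so the code below only shows the generic idea of cutting a value range into buckets that could each be assigned to one processor; the function names are hypothetical, and the integer values stand in for whatever ordering key (for example, an element's start position) a structural join would partition on.

```python
from bisect import bisect_right


def equi_width_bounds(values, buckets):
    """Split [min, max] into `buckets` intervals of equal width and
    return the interior boundaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    return [lo + i * width for i in range(1, buckets)]


def equi_depth_bounds(values, buckets):
    """Choose interior boundaries so that each bucket receives roughly
    the same number of values (an equi-depth, or equi-height, histogram)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


def bucket_of(value, bounds):
    """Index of the bucket a value falls into, given interior boundaries."""
    return bisect_right(bounds, value)


def bucket_sizes(values, bounds):
    """Number of values that land in each bucket."""
    sizes = [0] * (len(bounds) + 1)
    for v in values:
        sizes[bucket_of(v, bounds)] += 1
    return sizes
```

On clustered input such as `[1, 2, 3, 4, 100, 101, 102, 200]` with four buckets, equal-width boundaries leave the buckets unevenly filled, while equal-depth boundaries give every bucket the same number of values, at the cost of sorting the input first.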

12.

A high-performance distribution strategy for distributed data streams (分布式数据流上的高性能分发策略)

房俊华王晓桐张蓉周傲英《软件学报》2017,28(3):563-578

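As big-data applications become widespread, efficient and scalable data stream operations play an increasingly important role in real-time analytical processing. Distributed parallel processing architectures are an effective solution for high-volume, low-latency stream processing tasks. However, in key-based grouped parallel processing, skewed data distributions together with the real-time, dynamic, and unpredictable-scale nature of data streams cause persistent and dynamic load imbalance in distributed parallel stream processing systems, which reduces timeliness and wastes hardware resources. Existing work balances load in one of two ways: (1) migration at key granularity, so that the load on parallel processing nodes becomes balanced; or (2) splitting at tuple granularity, using random distribution to balance the system. The former adjusts the system to within a given imbalance tolerance, an NP problem similar to one-dimensional bin packing; the latter splits keys and therefore incurs extra cost, such as memory and network communication, to preserve the correctness of key-based operations. This paper combines the two approaches and proposes splitting keys on demand and merging them wherever possible. Through a lightweight balance adjustment algorithm and a splitting method that preserves the semantics of key-based operations, the system achieves the balance of the latter approach while reducing the extra cost of fine-grained balancing.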

13.

XML structural join based on region partitioning (基于区域划分的XML结构连接)

王静孟小峰王珊《软件学报》2004,15(5):720-729

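The structural join is a core operation in XML query processing and has attracted attention from the research community. Efficient algorithms are the key to efficient query processing. Many structural join algorithms have been proposed, most of which rely on one of two preconditions: the input element sets are indexed, or they are sorted. When neither holds, the cost of sorting or indexing the input on the fly degrades the performance of these algorithms considerably. Based on this observation, a structural join algorithm based on region partitioning is proposed. It follows the idea of task decomposition and exploits the properties of region encoding to partition the input sets. The algorithm design is given in detail, and its I/O complexity is analyzed. Extensive experimental results show that the algorithm performs well, outperforms the existing sort-merge algorithm when the input is unsorted or unindexed, and gives query plans more options.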

14.

A MapReduce-based join algorithm for skewed data (基于MapReduce的数据倾斜连接算法)

梁俊杰何利民《计算机科学》2016,43(9):27-31

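The join is the most commonly used operation on large-scale data sets in data analysis applications. Because MapReduce by itself cannot handle joins over skewed data effectively, a MapReduce-based frequency-classification join algorithm is proposed. The whole data set is divided into three classes according to how frequently values occur in the data sets being joined. Skewed data are redistributed with a partitioning algorithm and a broadcast algorithm to eliminate the effect of skew, while non-skewed data are redistributed with a hash algorithm. After redistribution, each join can be completed within a single node, which avoids the cross-node transfer cost of joins under the MapReduce framework and balances the task load across MapReduce nodes, thereby improving join efficiency under data skew. Comparison with traditional join algorithms demonstrates the effectiveness and practicality of the proposed algorithm.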
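
The redistribution idea in this abstract can be sketched as below. The abstract does not define its three frequency classes, so this sketch makes its own assumption: a join key is "hash" when it is infrequent on both sides, "split_left" when it is frequent and at least as common in the left input, and "split_right" otherwise. For a split key, the heavy side is partitioned round-robin across nodes and the other side is broadcast to every node. All names and the `threshold` parameter are hypothetical, and plain Python lists stand in for MapReduce nodes.

```python
from collections import Counter
import zlib


def classify_keys(left, right, threshold):
    """Assign each join key to one of three (assumed) classes by frequency."""
    left_count = Counter(key for key, _ in left)
    right_count = Counter(key for key, _ in right)
    classes = {}
    for key in set(left_count) | set(right_count):
        if max(left_count[key], right_count[key]) < threshold:
            classes[key] = "hash"
        elif left_count[key] >= right_count[key]:
            classes[key] = "split_left"
        else:
            classes[key] = "split_right"
    return classes


def route(left, right, classes, nodes):
    """Return, for each node, a pair (left_tuples, right_tuples)."""
    plan = [([], []) for _ in range(nodes)]
    next_node = 0
    for side, tuples in enumerate((left, right)):
        for key, value in tuples:
            cls = classes[key]
            if cls == "hash":
                # Non-skewed key: both sides go to the same hashed node.
                node = zlib.crc32(repr(key).encode()) % nodes
                plan[node][side].append((key, value))
            elif (cls == "split_left") == (side == 0):
                # Heavy side of a skewed key: spread tuples round-robin.
                plan[next_node][side].append((key, value))
                next_node = (next_node + 1) % nodes
            else:
                # Light side of a skewed key: broadcast to every node.
                for node in range(nodes):
                    plan[node][side].append((key, value))
    return plan


def local_join(plan):
    """Join each node's tuples locally and concatenate the results."""
    result = []
    for left_part, right_part in plan:
        for lk, lv in left_part:
            for rk, rv in right_part:
                if lk == rk:
                    result.append((lk, lv, rv))
    return result
```

Because every heavy-side tuple lands on exactly one node and meets a full copy of the matching light-side tuples there, each joined pair is produced exactly once and no cross-node lookup is needed during the join.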

15.

A parallel string similarity join method based on a CPU-GPU heterogeneous architecture (基于CPU-GPU异构体系结构的并行字符串相似性连接方法)

徐坤浩聂铁铮申德荣寇月于戈《计算机研究与发展》2021,58(3):598-608

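Similarity join is important in fields such as data cleaning and data integration and has attracted wide academic attention in recent years. As data volumes keep growing, real-time requirements tighten, and processor performance gains hit a bottleneck, traditional serial similarity join methods can no longer meet the needs of big-data processing. In recent years GPUs used as coprocessors have delivered good speedups in areas such as machine learning, so GPU-based parallel algorithms have become an effective way to address many kinds of performance problems. A parallel similarity join method based on a CPU-GPU heterogeneous architecture is therefore proposed. First, the method builds an inverted index on the GPU using an SoA (struct of arrays) layout, which addresses the poor read/write efficiency of traditional index structures in parallel mode. Second, to address the performance problems of serial algorithms, a parallel double length-filtering algorithm based on the filter-and-verify framework is proposed, which uses prefix filtering and the inverted index to improve filtering. The exact similarity computation in the verification step runs on the CPU, so that the heterogeneous CPU-GPU computing resources are fully used. Finally, experiments on several data sets evaluate performance. Compared with serial similarity join algorithms, the proposed method filters better and builds its index at lower cost than existing methods, and achieves better similarity join performance and a good speedup.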

16.

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin HE Haoyu TAN Wuman LUO Shengzhong FENG Jianping FAN 《Frontiers of Computer Science》2014,8(1):83-99

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. We conduct our evaluation using real large datasets with up to 1.2 billion points. The experimental results confirm the efficiency and scalability of MR-DBSCAN.

17.

Optimal parallel scheduling of M-way join queries

Farshad Fotouhi Jason Leigh Satyendra P. Rana 《Information Systems》1991,16(6):627-639

The problem of computing multirelation (M-way) join queries on uniprocessor architectures has been considered by many researchers in the past. This paper lays the necessary foundation for work involving optimization of M-way joins in parallel architectures. We explain the inadequacies of previous uniprocessor strategies and describe a more suitable formulation based on the concept of matching in graph theory to approach the problem in a parallel environment. It has been shown that the problem of optimizing M-way joins is an NP-hard problem and hence we would expect that in a parallel processing environment the search space of possible solutions (join schedules) would be enormous, especially when a variable number of processors are considered. Our strategy seeks to reduce the region to search by partitioning the search space according to the number of available processors. Based on this a significant portion of the search space, which will produce non-optimal join schedules, may be ignored.

18.

Measuring and modelling the performance of a parallel ODMG compliant object database server

Sandra de F. Mendes Sampaio Norman W. Paton Jim Smith Paul Watson 《Concurrency and Computation》2006,18(1):63-109

Object database management systems (ODBMSs) are now established as the database management technology of choice for a range of challenging data intensive applications. Furthermore, the applications associated with object databases typically have stringent performance requirements, and some are associated with very large data sets. An important feature for the performance of object databases is the speed at which relationships can be explored. In queries, this depends on the effectiveness of different join algorithms into which queries that follow relationships can be compiled. This paper presents a performance evaluation of the Polar parallel object database system, focusing in particular on the performance of parallel join algorithms. Polar is a parallel, shared‐nothing implementation of the Object Database Management Group (ODMG) standard for object databases. The paper presents an empirical evaluation of queries expressed in the ODMG Query Language (OQL), as well as a cost model for the parallel algebra that is used to evaluate OQL queries. The cost model is validated against the empirical results for a collection of queries using four different join algorithms, one that is value based and three that are pointer based. Copyright © 2005 John Wiley & Sons, Ltd.

19.

New algorithms for parallelizing relational database joins in the presence of data skew

Wolf J.L. Dias D.M. Yu P.S. Turek J. 《Knowledge and Data Engineering, IEEE Transactions on》1994,6(6):990-997

Parallel processing is an attractive option for relational database systems. As in any parallel environment, however, load balancing is a critical issue which affects overall performance. Load balancing for one common database operation in particular, the join of two relations, can be severely hampered for conventional parallel algorithms, due to a natural phenomenon known as data skew. In a pair of recent papers (J. Wolf et al., 1993; 1993), we described two new join algorithms designed to address the data skew problem. We propose significant improvements to both algorithms, increasing their effectiveness while simultaneously decreasing their execution times. The paper then focuses on the comparative performance of the improved algorithms and their more conventional counterparts. The new algorithms outperform their more conventional counterparts in the presence of just about any skew at all, dramatically so in cases of high skew.

20.

An online join method for skewed data streams (应对倾斜数据流在线连接方法)

王春凯孟小峰《软件学报》2018,29(3):869-882

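Distributed join processing in a parallel environment requires a partitioning strategy that reduces state migration and communication overhead. Compared with database management systems, online θ-join operations in distributed data stream management systems demand more computation and memory. A join model based on a complete bipartite graph can support joins over distributed data streams. Because each relation of the join is stored only in the processing units on one side of the bipartite graph, no data need be replicated, and the processing units are independent of one another, so the model is memory-efficient, elastic, and scalable. However, unstable stream rates and unevenly distributed attribute values make joins over skewed streams prone to cluster load imbalance. For joins over skewed data streams, the model cannot allocate query nodes dynamically and requires manual tuning of the data grouping parameters; it is even less efficient for join queries over the full history. To address these problems, a framework for managing joins over skewed data streams is proposed, which uses a hybrid key-based and tuple-based partitioning scheme to handle skewed data on each side of the bipartite graph model. A strategy for dynamically reallocating query nodes and a state migration algorithm are also designed to support full-history join queries and adaptive resource management. Experiments on synthetic and real data show that the scheme handles joins over skewed data effectively and further improves the throughput of distributed data stream management systems, in particular reducing computation cost in cloud environments.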