期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Comments on `Hash-based and index-based join algorithms for cubeand ring connected multicomputers' by E.R. Omiecinski and E.T. Lin

Tien J.-Y. Yang W.-P. 《Knowledge and Data Engineering, IEEE Transactions on》1991,3(3):387-389

The cost distributions of both the parallel hybrid-hash join and the parallel join-index join algorithms proposed in the above-named work (ibid., vol.1, p.329-43, Sept. 1989) are presented in more detail. The result shows that almost the entire relation may need to be retrieved from disk, though the join selectivity is low. A table of semi-join selectives and cube sizes is given to show the condition that the join-index method performs better than the hybrid-hash method, i.e., the really low selectivity for the join-index method. An error in one of the cost formulas is corrected, and a more efficient method on the final join in the join-index method is proposed 相似文献

2.

Executing algorithms with hypercube topology on torusmulticomputers

Gonzalez A. Valero-Garcia M. Diaz de Cerio L. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(8):803-814

Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers the communication cost is kept minimum since processes can be allocated to processors in such a way that only communication between neighbor processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost-per-node increases with the total number of nodes. From scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses since an already known embedding, which is called standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as FFT and some class of sorting algorithms 相似文献

3.

A parallel distributive join algorithm for cube-connectedmultiprocessors

Chung S.M. Jaerheen Yang 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(2):127-137

This paper presents a parallel distributive join algorithm for cube-connected multiprocessors. The performance analysis shows that the proposed algorithm has an almost linear speedup over the sequential distributive join algorithm as the number of processors increases, and its performance is comparable to that of the parallel hybrid-hash join algorithm. A big advantage of the proposed algorithm over hash-based join algorithms is that it does not have the bucket overflow problem caused by nonuniform hashing of the smaller operand relation. Moreover, the proposed algorithm can easily support the nonequijoin operation, which is very hard to implement by using hash-based join algorithms 相似文献

4.

The adaptive-hash join algorithm for a hypercube multicomputer

Omiecinski E. Lin E.T. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(3):334-349

The cube adaptive-hash join algorithm, which combines the merits of nested-loop and hybrid-hash, is presented. The performance of these algorithms is compared through analytical cost modeling. The nonuniform data value distribution of the inner relation is shown to have a greater impact than that of the outer relation. The cube adaptive-hash algorithm outperforms the cube hybrid-hash algorithm when bucket overflow occurs. In the worst case, this algorithm converges to the cube nested-loop-hash algorithm. When there is no hash table overflow, the cube adaptive-hash algorithm converges to the cube hybrid-hash algorithm. Since the cube adaptive-hash algorithm adapts itself depending on the characteristics of the relations, it is relatively immune to the data distribution 相似文献

5.

一种并行处理多维连接和聚集操作的有效方法 总被引：1，自引：0，他引：1

薛永生黄震华段江娇张延松吕晓华《计算机研究与发展》2004,41(10):1661-1669

随着并行计算算法的完善和廉价、功能强大的多处理机系统的成熟，使得采用多处理机系统来并行处理多维数据仓库的连接和聚集操作成为当前有效提高OLAP查询处理性能的首选技术．为此，提出一种降低连接和聚集操作开销的并行算法PJAMDDC(parallel join and aggregation for multi-dimensional data cube)．算法充分考虑了多维数据立方体的存储机制和多处理机分布系统的结构特点，在原有聚集计算多维数据立方体的搜索点阵逻辑结构的基础上，采用多维数据仓库的层次联合代理(hierarchy combined surrogate)和对立方体的搜索点阵进行加权的方法，使得立方体数据在多个处理机间的分配达到最佳的状态，从而在分割多维数据的同时，提高了并行处理多维连接和聚集操作的效率．算法实验评估表明，PJAMDDC算法并行处理多维数据仓库的连接和聚集操作是有效的．相似文献

6.

Parallel Shear-Warp Factorization Volume Rendering Using Efficient 1-D and 2-D Partitioning Schemes for Distributed Memory Multicomputers

Ching-Feng Lin Don-Lin Yang Yeh-Ching Chung 《The Journal of supercomputing》2002,22(3):277-302

3-D data visualization is very useful for medical imaging and computational fluid dynamics. Volume rendering can be used to exhibit the shape and volumetric properties of 3-D objects. However, volume rendering requires a considerable amount of time to process the large volume of data. To deliver the necessary rendering rates, parallel hardware architectures such as distributed memory multicomputers offer viable solutions. The challenge is to design efficient parallel algorithms that utilize the hardware parallelism effectively. In this paper, we present two efficient parallel volume rendering algorithms, the 1D-partition and 2D-partition methods, based on the shear-warp factorization for distributed memory multicomputers. The 1D-partition method has a performance bound on the size of the volume data. If the number of processors is less than a threshold, the 1D-partition method can deliver a good rendering rate. If the number of processors is over a threshold, the 2D-partition method can be used. To evaluate the performance of these two algorithms, we implemented the proposed methods along with the slice data partitioning, volume data partitioning, and sheared volume data partitioning methods on an IBM SP2 parallel machine. Six volume data sets were used as the test samples. The experimental results show that the proposed methods outperform other compatible algorithms for all test samples. When the number of processors is over a threshold, the experimental results also demonstrate that the 2D-partition method is better than the 1D-partition method. 相似文献

7.

Utilizing page-level join index for optimization in parallel joinexecution

Chiang Lee Zue-An Chang 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(6):900-914

This paper presents a methodology for the optimization of parallel join execution. Past research on parallel join methods mostly focused on the design of algorithms for partitioning (e.g. hash) relations and distributing data buckets as evenly as possible to the processors. Once data is distributed to the processors, it assumes that all processors will complete their tasks at about the same time. We stress that this is true if no further information such as page-level join index is available. Otherwise, the join execution can be further optimized and the workload in the processors may still be unbalanced. We study such problems that may incur in a shared-nothing architecture environment and propose algorithms for the problems. Also, a simulation study is performed to understand the characteristics of the proposed method 相似文献

8.

Performance analysis of parallel query processing algorithms forobject-oriented databases

Su S.Y.W. Ranka S. Xiang He 《Knowledge and Data Engineering, IEEE Transactions on》2000,12(6):979-996

Two types of parallel processing and optimization algorithms for processing object-oriented databases are the hybrid-hash pointer-based (HHP) algorithms and multi-wavefront (MWF) algorithms. We analyze these two algorithms and develop analytical formulas to capture their main performance features. We study their performance in three application environments, characterized by large databases having many object classes, each of which, respectively, (1) contains a large number of instances; (2) contains a relatively small number of instances; and (3) is of varying size. A horizontal data partitioning strategy is used in (1). A class-per-node assignment strategy is used in (2). In (3), object classes are partitioned horizontally and assigned to a varying number of processors depending on their different sizes. The MWF algorithm has three distinguishing features which contribute to its better performance: (a) a two-phase processing strategy, (b) vertical partitioning of horizontal segments, and (c) dynamic determination of the collision point in MWF propagations, which results in an optimized query execution plan. If these features are adopted by an HHP algorithm, its performance is comparable with that of the MWF algorithm because the difference in CPU time between them is negligible. The computing environment is a network of workstations having a shared-nothing architecture. The schema and some queries selected from the OO7 benchmark are used in the performance analyses and comparisons. The queries are modified slightly in different data environments in order to reflect the features of diverse database applications 相似文献

9.

Efficient processor assignment algorithms and loop transformationsfor executing nested parallel loops on multiprocessors

Chien-Min Wang Sheng-De Wang 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(1):71-82

An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but also are much faster than the exiting optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations 相似文献

10.

An Adaptive Parallel Distributive Join Algorithm on a Cluster of Workstations

Soon M. Chung Arindam Chatterjee 《The Journal of supercomputing》2002,21(1):5-35

In this paper, we present an adaptive version of the parallel Distributive Join (DJ) algorithm that we proposed in [5]. The adaptive parallel DJ algorithm can handle the data skew in operand relations efficiently. We implemented the original and adaptive parallel DJ algorithms on a network of Alpha workstations using the Parallel Virtual Machine (PVM). We analyzed the performance of the algorithms, and compared it with that of the parallel Hybrid-Hash (HH) join algorithms. Our results show that the parallel DJ algorithms perform comparably with the parallel HH join algorithms over the entire range of the number of processors used and for different join selectivities. A significant advantage of the parallel DJ algorithms is that they can easily support non-equijoin operations. 相似文献

11.

The worst-case analysis of the Garey–Johnson algorithm

Claire Hanen Yakov Zinder 《Journal of Scheduling》2009,12(4):389-400

The Garey–Johnson algorithm is a well known polynomial-time algorithm constructing an optimal schedule for the maximum lateness problem with unit execution time tasks, two parallel identical processors, precedence constraints and release times. The paper is concerned with the worst-case analysis of a generalization of the Garey–Johnson algorithm to the case of arbitrary number of processors. In contrast to other algorithms for the maximum lateness problem, the tight performance guarantee for the even number of processors differs from the tight performance guarantee for the odd number of processors. 相似文献

12.

Partitioning and mapping of nested loops for linear array multicomputers 总被引：1，自引：1，他引：0

Jang-Ping Sheu Tzung-Shi Chen 《The Journal of supercomputing》1995,9(1-2):183-202

In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project ann-dimensional computational structure, representing ann-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers. 相似文献

13.

Performance Measures for Evaluating Algorithms for SIMD Machines

《IEEE transactions on pattern analysis and machine intelligence》1982,(4):319-331

This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors. 相似文献

14.

Task scheduling algorithm using minimized duplications in homogeneous systems 总被引：2，自引：0，他引：2

KwangSik MyongJin MunSuck JinHa WanOh SangBang 《Journal of Parallel and Distributed Computing》2008,68(8):1146-1156

For fine grain task graphs, duplication-based scheduling algorithms are generally more efficient than list and cluster-based algorithms. However, most duplication-based heuristics try to duplicate all possible ancestor nodes of a given join node, in order to reduce the earliest start time (EST) of the join node, even though these ancestor nodes have already been allocated in previous steps. Thus, these duplication heuristics inevitably induce redundant duplications, which lead to the superfluous consumption of resources and generally deteriorate the scheduling result in the case of a bounded number of processors. When scheduling algorithms are used on an unbounded number of processors, the required number of processors grows excessively with the size of the task graph, thereby limiting the practicality of these algorithms for large task graphs. In this paper, we propose a novel algorithm designed to allocate join nodes without redundant duplications. In the proposed algorithm, if the ancestor nodes of a join node are duplicated when scheduling the join node, the original allocations of these ancestor nodes are removed using a very efficient method. The performance of the proposed algorithm, in terms of its normalized schedule length and efficiency, is compared with that of some of the recently proposed algorithms. The proposed algorithm generates better or comparable schedules with minimized duplication. Specifically, the simulation results show that it is most useful on a bounded number of processors. 相似文献

15.

The scalability of FFT on parallel computers

Gupta A. Kumar V. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):922-932

The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture 相似文献

16.

Finding the roots of a polynomial on an MIMD multicomputer

Michel Consnard Pierre Fraigniaud 《Parallel Computing》1990,15(1-3):75-85

This paper introduces the parallelization on a distributed memory multicomputer of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communications is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. Particularly, we show that among the different classical processors networks topologies (ring, 2d-torus or n-cube), the hypercube topology minimizes the communications. For each topology is computed the optimal number of processors. Experiments on the hypercube FPS T40 illustrate the results. 相似文献

17.

A Dynamic Load Balancing Framework for Real-time Applications in Message Passing Systems

Ghada F. El Kabbany Nayer M. Wanas Nadia H. Hegazi Samir I. Shaheen 《International journal of parallel programming》2011,39(2):143-182

Load balancing algorithms are designed essentially to equally distribute the load on processors and maximize their utilities while minimizing the total task execution time. In order to achieve these goals, the load-balancing mechanism should be “fair” in distributing the load across the different processors. This implies that the difference between the heaviest-loaded and the lightest-loaded processors should be minimized. Therefore, the load information on each processor must be updated such that the load-balancing mechanism can be more effective. In this work, we present an application independent dynamic algorithm for scheduling tasks and load- balancing in message passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real time applications (DAG-DLBR) that is designed to work dynamically to cope with possible changes in the load that might occur during runtime. This algorithm addresses the challenge of devising a load balancing scheme which judicially deals with the hybrid execution of existing real-time application (represented by a Direct Acyclic Graph (DAG)) together with newly arriving jobs. The main objective of this algorithm is to reduce response times of the newly arriving jobs while maintaining the time constrains of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with the performance of two common dynamic load balancing algorithms is presented. This comparison is performed by evaluating, experimentally, the execution time of different load balancing algorithms on a homogenous real parallel machine. In addition, the values of load imbalance, the execution time, and the communication overhead time are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared to both distributed and hierarchal based algorithms by at least 12 and 19%, respectively. This improvement is true for all workloads, even with highly dependent workload. The DAG-DLBR algorithm achieves lower computation time than its corresponding values of both the distributed and the hierarchical-based algorithms for 4, 8, 12 and 16 processors. 相似文献

18.

Efficient Methods for Multi-Dimensional Array Redistribution

Hsu Ching-Hsien Chung Yeh-Ching Dow Chyi-Ren 《The Journal of supercomputing》2000,17(1):23-46

In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods for multi-dimensional array redistribution. Based on the previous work, the basic-cycle calculation technique, we present a basic-block calculation (BBC) and a complete-dimension calculation (CDC) techniques. We also developed a theoretical model to analyze the computation costs of these two techniques. The theoretical model shows that the BBC method has smaller indexing costs and performs well for the redistribution with small array size. The CDC method has smaller packing/unpacking costs and performs well when array size is large. When implemented these two techniques on an IBM SP2 parallel machine along with the PITFALLS method and the Prylli's method, the experimental results show that the BBC method has the smallest execution time of these four algorithms when the array size is small. The CDC method has the smallest execution time of these four algorithms when the array size is large. 相似文献

19.

Parallel Distributive Join Algorithm on the Intel Paragon

Chung Soon M. Chatterjee Arindam 《The Journal of supercomputing》1999,13(2):151-169

In this paper, we analyze the performance of the parallel Distributive Join algorithm that we proposed in Chung and Yang 1995. We implemented the algorithm on an Intel Paragon machine and analyzed the effect of the number of processors and the join selectivity on the performance of the algorithm. We also compared the performance of the Distributive Join (DJ) algorithm with that of the Hybrid-Hash(HH) join algorithm. Our results show that the DJ performs comparably with the HH over the entire range of number of processors used and different join selectivities. A big advantage of the parallel DJ algorithm over the HH join algorithm is that it can easily support non-equijoin operations. The results can also be used to estimate the performance of file I/O intensive applications to be implemented on the Intel Paragon machine. 相似文献

20.

Resources Snapshot Model for Concurrent Transactions in Multi-Core Processors

下载免费PDF全文

赵雷杨季文《计算机科学技术学报》2013,28(1):106-118

Transaction parallelism in database systems is an attractive way of improving transaction performance.There exists two levels of transaction parallelism,inter-transaction level and intra-transaction level.With the advent of multicore processors,new hopes of improving transaction parallelism appear on the scene.The greatest execution efficiency of concurrent transactions comes from the lowest dependencies of them.However,the dependencies of concurrent transactions stand in the way of exploiting parallelism.In this paper,we present Resource Snapshot Model(RSM) for resource modeling in both levels.We propose a non-restarting scheduling algorithm in the inter-transaction level and a processor assignment algorithm in the intra-transaction level in terms of multi-core processors.Through these algorithms,execution performance of transaction streams will be improved in a parallel system with multiple heterogeneous processors that have different number of cores. 相似文献