Similar Documents
20 similar documents found
1.
This paper analyzes the problem of optimal data placement for parallel programs targeting distributed-memory SIMD/MIMD parallel machines. On top of a FORALL program model and a MESH communication model, it studies the optimization requirements for reducing communication cost during data decomposition. We use a dimension preference graph to describe the alignment requirements among parallel arrays; by eliminating the conflicts in the dimension preference graph, a dimension alignment graph is obtained. Each dimension alignment graph corresponds to a data placement scheme, and the larger the total weight of the alignment graph, the smaller the corresponding communication cost. An approximation algorithm for finding a maximum-weight dimension alignment graph is given.
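To make the dimension-alignment idea concrete, here is a minimal sketch (not the paper's algorithm) of one greedy approximation: preference edges between array dimensions are taken in decreasing weight order and merged with a union-find, rejecting any merge that would align two dimensions of the same array. All names and weights are illustrative.

```cpp
// Greedy sketch: pick high-weight preference edges while avoiding conflicts
// (two dimensions of the same array must never land in one alignment class).
#include <algorithm>
#include <iterator>
#include <numeric>
#include <set>
#include <vector>
#include <cstdio>

struct Edge { int u, v; double w; };   // preference edge between dimension nodes

int find(std::vector<int>& p, int x) { return p[x] == x ? x : p[x] = find(p, p[x]); }

int main() {
    // Toy instance: arrays A and B, each 2-D. Node ids: A0=0, A1=1, B0=2, B1=3.
    const int n = 4;
    std::vector<int> arrayOf = {0, 0, 1, 1};       // which array each dim belongs to
    std::vector<Edge> edges = {
        {0, 2, 5.0},   // aligning A's dim 0 with B's dim 0 saves 5 cost units
        {1, 3, 4.0},
        {0, 3, 2.0},   // conflicts once the two edges above are chosen
    };
    std::vector<int> parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    std::vector<std::set<int>> members(n);          // arrays represented in each class
    for (int i = 0; i < n; ++i) members[i] = {arrayOf[i]};

    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w > b.w; });

    double total = 0.0;
    for (const Edge& e : edges) {
        int ru = find(parent, e.u), rv = find(parent, e.v);
        if (ru == rv) continue;                     // already aligned
        std::set<int> inter;
        std::set_intersection(members[ru].begin(), members[ru].end(),
                              members[rv].begin(), members[rv].end(),
                              std::inserter(inter, inter.begin()));
        if (!inter.empty()) continue;               // would align two dims of one array
        parent[ru] = rv;                            // merge the alignment classes
        members[rv].insert(members[ru].begin(), members[ru].end());
        total += e.w;
    }
    std::printf("total alignment weight: %.1f\n", total);  // prints 9.0
}
```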

2.
3.
Distributed memory architectures such as Linux clusters have become increasingly common but remain difficult to program. We target this problem and present a novel technique to automatically generate data distribution plans, and subsequently MPI implementations in C++, from programs written in a functional core language. The main novelty of our approach is that we support distributed arrays, maps, and lists in the same framework, rather than just arrays. We formalize distributed data layouts as types, which are then used both to search (via type inference) for optimal data distribution plans and to generate the MPI implementations. We introduce the core language and explain our formalization of distributed data layouts. We describe how we search for data distribution plans using an adaptation of the Damas–Milner type inference algorithm, and how we generate MPI implementations in C++ from such plans.
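A rough illustration of the "layouts as types" idea, under assumed names (`Replicated`, `BlockDim`, `Cyclic` are hypothetical, not the paper's type language): layouts become values a search procedure can compare, and a unification-style check decides whether two layouts can meet without redistribution.

```cpp
// Sketch: distributed data layouts represented as a small type language.
#include <string>
#include <variant>
#include <iostream>

struct Replicated {};            // every rank holds the whole value
struct BlockDim { int dim; };    // block-partitioned along one dimension
struct Cyclic   { int dim; };    // cyclically partitioned along one dimension

using Layout = std::variant<Replicated, BlockDim, Cyclic>;

std::string show(const Layout& l) {
    struct V {
        std::string operator()(Replicated) const { return "Repl"; }
        std::string operator()(BlockDim b) const { return "Block(d" + std::to_string(b.dim) + ")"; }
        std::string operator()(Cyclic c)   const { return "Cyclic(d" + std::to_string(c.dim) + ")"; }
    };
    return std::visit(V{}, l);
}

// A Damas-Milner-style search would unify layout variables; this crude check
// only asks: can these two layouts be combined without a redistribution?
bool compatible(const Layout& a, const Layout& b) {
    if (std::holds_alternative<Replicated>(a) || std::holds_alternative<Replicated>(b))
        return true;                           // replication matches anything
    return show(a) == show(b);                 // otherwise require equal structure
}

int main() {
    Layout a = BlockDim{0}, b = Cyclic{0};
    std::cout << show(a) << " vs " << show(b)
              << " -> needs redistribution: " << !compatible(a, b) << "\n";
}
```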

4.
We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph, suitable for distributed memory computers. The presented algorithm is based on the Push-Relabel algorithm, which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors. We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach, as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps, with the main goals being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds, and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study the weak and strong scalability of our algorithm using up to 128 processors. The algorithm can also be used to find ε-approximate matchings quickly.
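The distributed Push-Relabel algorithm itself is too involved to sketch here; as a self-contained reference point, below is Kuhn's sequential augmenting-path algorithm for maximum cardinality bipartite matching, the kind of baseline such parallel matchers are validated against (this is not the paper's algorithm).

```cpp
// Kuhn's augmenting-path algorithm for maximum cardinality bipartite matching.
#include <vector>
#include <cstdio>

struct BipartiteMatcher {
    int nLeft, nRight;
    std::vector<std::vector<int>> adj;   // adj[u] = right-side neighbors of left u
    std::vector<int> matchR;             // matchR[v] = left vertex matched to v, or -1
    std::vector<char> visited;

    BipartiteMatcher(int nl, int nr) : nLeft(nl), nRight(nr), adj(nl), matchR(nr, -1) {}
    void addEdge(int u, int v) { adj[u].push_back(v); }

    bool augment(int u) {                // try to find an augmenting path from u
        for (int v : adj[u]) {
            if (visited[v]) continue;
            visited[v] = 1;
            if (matchR[v] == -1 || augment(matchR[v])) {
                matchR[v] = u;           // flip edges along the augmenting path
                return true;
            }
        }
        return false;
    }

    int solve() {
        int size = 0;
        for (int u = 0; u < nLeft; ++u) {
            visited.assign(nRight, 0);
            size += augment(u);
        }
        return size;
    }
};

int main() {
    BipartiteMatcher m(3, 3);
    m.addEdge(0, 0); m.addEdge(0, 1);
    m.addEdge(1, 0);
    m.addEdge(2, 1); m.addEdge(2, 2);
    std::printf("maximum matching size: %d\n", m.solve()); // prints 3
}
```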

5.
The Vertical Block-cyclic Distributed Parallel LU Factorization Method (VBPLU) can be processed effectively on a distributed memory parallel computer. VBPLU is based on two techniques: the block algorithm and the aggregation of communications. Since startup time dominates the cost of data communication, aggregating messages reduces the number of startups and much improves total performance. Furthermore, the method uses long vectors, so it is also advantageous on vector processors. In this paper, we construct a model of VBPLU using a simplified LogGP model with analytical formulae, and estimate the computational cost accurately, taking into account the load distribution caused by data layout and process mapping. Some guidance for optimizing the block algorithm has been obtained. Our estimates have been verified through numerical experiments on three different distributed memory parallel computers.
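A minimal sketch of the aggregation argument under a simplified LogGP-style cost model (parameter values below are made up for illustration): sending one large message pays the startup cost once, while k small messages pay it k times.

```cpp
// Why aggregation helps, in a simplified LogGP-style model:
// one m-byte message costs roughly 2o + L + (m-1)G.
#include <cstdio>

struct LogGP {
    double L;   // network latency (s)
    double o;   // per-message send/receive overhead (s)
    double G;   // time per byte for long messages (s/byte)
};

double sendTime(const LogGP& p, double bytes) {
    return 2 * p.o + p.L + (bytes - 1) * p.G;   // startup plus bandwidth term
}

int main() {
    LogGP p{5e-6, 2e-6, 5e-10};                // illustrative parameters
    double bytes = 8.0 * 1000;                 // 1000 doubles
    int    k     = 64;                         // pieces before aggregation
    double separate   = k * sendTime(p, bytes / k);
    double aggregated = sendTime(p, bytes);
    std::printf("%d separate sends: %.1f us, one aggregated send: %.1f us\n",
                k, separate * 1e6, aggregated * 1e6);
}
```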

6.
Minimizing data communication over processors is key to compiling programs for distributed memory multicomputers. In this paper, we propose new data partition and alignment techniques for partitioning and aligning data arrays within a program so as to minimize communication over processors. We use skewed alignment instead of dimension-ordered alignment techniques to align data arrays. With the skewed scheme, we can handle more complex programs, with less data communication, than the dimension-ordered scheme. Finally, we compare the proposed scheme with dimension-ordered alignment experimentally. The experimental results show that our scheme finds more opportunities to align data arrays so that data communication over processors can be minimized.

7.
Memetic Algorithms for Parallel Code Optimization
Discovering the optimal number of processors and the distribution of data on distributed memory parallel computers for a given algorithm is a demanding task. A memetic algorithm (MA) is proposed here to find the best number of processors and the best data distribution method to be used for each stage of a parallel program. A steady-state memetic algorithm is compared with a transgenerational memetic algorithm using different crossover operators and hill-climbing methods. A self-adaptive MA based on a multimeme strategy is also implemented. All the experiments are carried out on computationally intensive, communication intensive, and mixed problem instances. The MA performs successfully on the illustrative problem instances.
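A skeleton of a steady-state memetic algorithm in the spirit described, with a stand-in cost function (the real fitness would estimate parallel run time from the processor count and per-stage distributions; every number and name here is illustrative):

```cpp
// Steady-state MA: crossover + mutation, then hill climbing (the "meme"),
// then replace-worst reinsertion.
#include <algorithm>
#include <random>
#include <vector>
#include <cstdio>

using Genome = std::vector<int>;             // distribution method id per stage
constexpr int kStages = 8, kMethods = 4, kPop = 20;
std::mt19937 rng(42);

// Hypothetical fitness: lower is better; stands in for a communication model.
double cost(const Genome& g) {
    double c = 0;
    for (size_t i = 0; i < g.size(); ++i) c += (g[i] - 1) * (g[i] - 1);  // toy optimum: all 1s
    for (size_t i = 1; i < g.size(); ++i) c += (g[i] != g[i - 1]);       // redistribution penalty
    return c;
}

Genome hillClimb(Genome g) {                 // first-improvement local search
    bool improved = true;
    while (improved) {
        improved = false;
        for (size_t i = 0; i < g.size() && !improved; ++i)
            for (int m = 0; m < kMethods && !improved; ++m) {
                Genome h = g; h[i] = m;
                if (cost(h) < cost(g)) { g = h; improved = true; }
            }
    }
    return g;
}

int main() {
    std::uniform_int_distribution<int> gene(0, kMethods - 1);
    std::uniform_int_distribution<int> stage(0, kStages - 1);
    std::uniform_int_distribution<int> pick(0, kPop - 1);
    std::vector<Genome> pop(kPop, Genome(kStages));
    for (auto& g : pop) for (auto& x : g) x = gene(rng);

    for (int iter = 0; iter < 200; ++iter) {                  // steady-state loop
        const Genome& a = pop[pick(rng)];
        const Genome& b = pop[pick(rng)];
        Genome child(kStages);
        for (int i = 0; i < kStages; ++i) child[i] = (rng() & 1) ? a[i] : b[i]; // uniform crossover
        child[stage(rng)] = gene(rng);                        // mutation
        child = hillClimb(child);                             // memetic step
        auto worst = std::max_element(pop.begin(), pop.end(),
            [](const Genome& x, const Genome& y) { return cost(x) < cost(y); });
        if (cost(child) < cost(*worst)) *worst = child;       // replace worst
    }
    auto best = std::min_element(pop.begin(), pop.end(),
        [](const Genome& x, const Genome& y) { return cost(x) < cost(y); });
    std::printf("best cost: %.0f\n", cost(*best));
}
```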

8.
In this paper, we present an efficient parallel algorithm to solve Toeplitz–block and block–Toeplitz systems on distributed memory multicomputers. This algorithm parallelizes the Generalized Schur Algorithm to obtain the semi-normal equations. Our parallel implementation reduces the communication cost and optimizes memory access. Experimental analysis on a cluster of personal computers shows the scalability of the implementation. The algorithm is portable because it is based on standard tools and libraries, such as ScaLAPACK and MPI.

9.
Parallelizing compilers for distributed memory systems must solve two problems: distributing data among the local memories and optimizing communication among the processors. This paper surveys the key techniques of parallel compilation for distributed memory systems from four aspects: parallel programming models, code and data distribution, communication optimization, and code generation, and poses the problems that further research needs to solve.

10.
Distributing data and control for ray tracing in parallel
We first briefly describe the methodology of programming ray-tracing algorithms on distributed-memory parallel computers (DMPCs) and review previous efforts to overcome the problems of data distribution and load balancing. Then we present two algorithms designed for DMPCs and implemented on an Intel iPSC/2, and compare our experimental results for them. The first algorithm, a data-oriented parallel implementation based on message passing, demonstrates how complex designing a parallel ray-tracing algorithm can be. The second algorithm shows how some of that complexity can be eliminated using a control-oriented parallel approach and a shared virtual memory.

11.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.
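A toy version of the measurement idea, assuming a standard MPI installation: time a ping-pong many times and report percentiles of the resulting distribution rather than only the mean. Unlike MPIBench, this sketch uses plain MPI_Wtime with a barrier, not a globally synchronised clock; run it with at least two ranks.

```cpp
// Collect a distribution of point-to-point times, then report percentiles.
#include <mpi.h>
#include <algorithm>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int kReps = 1000, kBytes = 65536;
    std::vector<char> buf(kBytes);
    std::vector<double> t(kReps);

    for (int i = 0; i < kReps; ++i) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf.data(), kBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), kBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), kBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), kBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        t[i] = (MPI_Wtime() - t0) / 2;          // one-way estimate from round trip
    }
    if (rank == 0) {                            // median and tail, not just the mean
        std::sort(t.begin(), t.end());
        std::printf("p50 = %.1f us, p99 = %.1f us\n",
                    t[kReps / 2] * 1e6, t[kReps * 99 / 100] * 1e6);
    }
    MPI_Finalize();
}
```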

12.
The increasing gap between the speeds of processors and main memory has led to hardware architectures with an increasing number of caches to reduce average memory access times. Such deep memory hierarchies make the sequential and parallel efficiency of computer programs strongly dependent on their memory access pattern. In this paper, we consider embedded Runge–Kutta methods for the solution of ordinary differential equations and study their efficient implementation on different parallel platforms. In particular, we focus on ordinary differential equations which are characterized by a special access pattern as it results from the spatial discretization of partial differential equations by the method of lines. We explore how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach leading to a better locality behavior and a higher scalability. Experiments show that this approach results in efficiency improvements on several recent sequential and parallel computers.

13.
An increasing number of programming languages, such as Fortran 90, HPF, and APL, provide a rich set of intrinsic array functions and array expressions. These constructs, which constitute an important part of data parallel languages, provide excellent opportunities for compiler optimizations. The synthesis of consecutive array operations or array expressions into a composite access function of the source arrays at compile time has been shown (A. T. Budd, ACM Trans. Programm. Lang. Syst. 6 (July 1984), 297–313; G. H. Hwang et al., in "Proc. of ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, 1995," pp. 112–122) to be an effective scheme for optimizing programs on flat shared memory parallel architectures. It remains, however, to be studied how the synthesis scheme can be incorporated into optimizing HPF-like programs on distributed memory machines by taking into account communication costs. In this paper, we propose solutions to address this issue. We first show how to perform array operation synthesis on HPF programs, and we demonstrate its performance benefits on distributed memory machines with real applications. In addition, to prevent a situation we call "synthesis anomaly," we present an optimal solution to guide the array synthesis process on distributed memory machines. Due to the optimal problem being NP-hard, we further develop a practical strategy that compilers can use on distributed memory machines with HPF programs. Our synthesis engine is implemented as a Web-based tool, called Syntool, and experimental results show significant performance improvement over the base codes for HPF code fragments from real applications on parallel machines. Our experiments were performed on three distributed memory machines: an 8-node DEC Alpha Farm, a 16-node IBM SP-2, and a 16-node nCUBE/2.
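A minimal sketch of array operation synthesis as access-function composition (the representation below is illustrative, not Syntool's): fusing a shift after a transpose yields one composite index map, so the intermediate array need never be materialized.

```cpp
// Synthesis of consecutive array operations = composing their access functions.
#include <functional>
#include <utility>
#include <cstdio>

using Index  = std::pair<int, int>;
using Access = std::function<Index(Index)>;   // maps a result index to a source index

Access compose(Access outer, Access inner) {  // apply outer's map, then inner's
    return [=](Index i) { return inner(outer(i)); };
}

int main() {
    const int N = 4;
    Access transpose = [](Index i) { return Index{i.second, i.first}; };
    Access shiftRow  = [N](Index i) { return Index{(i.first + 1) % N, i.second}; };
    // Fused map for "transpose the source, then shift rows": a single pass
    // over the result reads the source directly, with no temporary array.
    Access fused = compose(shiftRow, transpose);
    Index src = fused({0, 2});
    std::printf("result(0,2) reads source(%d,%d)\n", src.first, src.second);
}
```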

14.
We address the main issues when porting existing codes from serial to parallel computers and when developing portable parallel software on MIMD multiprocessors (shared memory, virtual shared memory, and distributed memory multiprocessors, and networks of computers). We discuss the use of numerical libraries as a way of developing portable and efficient parallel code. We illustrate this by using examples from our experience in porting industrial codes and in designing parallel numerical libraries. We report in some detail on the parallelization of scientific applications coming from Centre National d'Etudes Spatiales and from Aérospatiale, and we illustrate how it is possible to develop portable and efficient numerical software by considering the parallel solution of sparse linear systems of equations.

15.
Many scientific applications require array redistribution when the programs run on distributed memory parallel computers. It is essential to use efficient algorithms for redistribution; otherwise the performance of the programs will degrade considerably. The redistribution overhead consists of two parts: index computation and inter-processor communication. Without communication scheduling, a redistribution routine incurs substantial communication idle time whenever there is node contention or the message lengths differ within a particular communication step. To solve this problem, we propose in this paper an efficient scheduling scheme that not only minimizes the number of communication steps and eliminates node contention, but also minimizes the difference of message lengths within each communication step. The communication idle time in redistribution routines is thereby reduced.
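For flavor, here is a classic contention-free exchange schedule (the XOR schedule on a power-of-two machine) in which every rank has exactly one partner per step, so no node receives two messages at once; the paper's scheduler additionally balances message lengths, which this sketch ignores.

```cpp
// Contention-free pairwise-exchange schedule: in step s, rank p talks to p XOR s.
#include <cstdio>

int main() {
    const int P = 8;                       // power of two for the XOR schedule
    for (int s = 1; s < P; ++s) {          // P-1 communication steps in total
        std::printf("step %d:", s);
        for (int p = 0; p < P; ++p) {
            int partner = p ^ s;           // unique partner of p in this step
            if (p < partner) std::printf("  %d<->%d", p, partner);
        }
        std::printf("\n");
    }
}
```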

16.
Ray tracing is a well known technique for generating life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory. Distributed memory parallel computers with large memory capacities and high processing speeds are ideal candidates for ray tracing. However, the computational cost of rendering pixels and the patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed memory parallel computers, the issues of database distribution, dynamic data management, and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors; the system thus holds multiple copies of the entire database across the memory of these groups. Dynamic data management is achieved by an ALRU cache scheme which exploits image coherence to reduce data movement when ray tracing consecutive pixels. We balance the load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated.
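A minimal sketch of the LRU-style object cache such a scheme builds on (the interface is hypothetical, not the paper's ALRU variant): remotely fetched scene data stays resident until capacity forces eviction of the least recently used entry, so coherent accesses from consecutive pixels mostly hit.

```cpp
// LRU cache over object ids: hits refresh recency, misses may evict.
#include <list>
#include <unordered_map>
#include <cstdio>

class LruCache {
    size_t capacity_;
    std::list<int> order_;                                 // most recent at the front
    std::unordered_map<int, std::list<int>::iterator> pos_;
public:
    explicit LruCache(size_t cap) : capacity_(cap) {}
    // Returns true on a hit; on a miss, "fetches" the object and may evict.
    bool access(int objectId) {
        auto it = pos_.find(objectId);
        if (it != pos_.end()) {                            // hit: move to front
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == capacity_) {                  // evict least recently used
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(objectId);                       // stand-in for a remote fetch
        pos_[objectId] = order_.begin();
        return false;
    }
};

int main() {
    LruCache cache(2);
    int trace[] = {7, 7, 3, 7, 9, 3};                      // consecutive-pixel accesses
    for (int id : trace)
        std::printf("object %d: %s\n", id, cache.access(id) ? "hit" : "miss");
}
```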

17.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. To reduce or even eliminate interprocessor communication, the array elements in programs must be carefully distributed to the local memory of processors for parallel execution. We devote our efforts to techniques for allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method for mapping the partitioned array elements and iterations onto fixed-size multicomputers, with load balancing taken into account, is proposed. Based on these methods, the nested loops can execute without any communication overhead on distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data is studied for matrix multiplication.

18.
Computers & Fluids, 1996, 25(5): 485–496
Solving the Navier-Stokes equations with detailed modeling of the transport and reaction terms remains a very difficult challenge. Direct simulations of two-dimensional reactive flows using accurate models of the chemical reactions generally require days of computing time on today's most powerful serial vector supercomputers, and realistic three-dimensional simulations remain practically impossible. Working with parallel computers currently seems to be the only viable way to investigate more complicated problems at acceptable cost; however, the lack of standards for parallel architectures constitutes a real obstacle. In this paper, we describe the structure of a parallel two-dimensional direct simulation code using detailed transport, thermodynamic, and reaction models. By separating the modules controlling the parallel work from the flow solver, a high degree of compatibility is obtained among parallel computers using distributed memory and message-passing communication. A dynamic load-balancing procedure is implemented to optimize the distribution of the load among the nodes. Efficiencies obtained with this code on many different architectures are given. First example applications concerning the interaction between vortices and a diffusion flame are shown to illustrate the capabilities of the solver.

19.
Data parallel languages, like High Performance Fortran (HPF), support the notion of distributed arrays. However, the implementation of such distributed array structures and their access on message passing computers is not straightforward. This holds especially for distributed arrays that are aligned to each other and given a block-cyclic distribution. In this paper, an implementation framework is presented for HPF distributed arrays on message passing computers. Methods are presented for efficient (in space and time) local index enumeration, local storage, and communication. Techniques for local set enumeration provide the basis for constructing local iteration sets and communication sets. It is shown that both local set enumeration and local storage schemes can be derived from the same equation. Local set enumeration and local storage schemes are shown to be orthogonal, i.e., they can be freely combined. Moreover, for linear access sequences generated by our enumeration methods, the local address calculations can be moved out of the enumeration loop, yielding efficient local memory address generation. The local set enumeration methods are implemented by using a relatively simple general transformation rule for absorbing ownership tests. This transformation rule can be repeatedly applied to absorb multiple ownership tests. Performance figures are presented for local iteration overhead, a simple communication pattern, and storage efficiency.
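For reference, the standard index arithmetic for a one-dimensional CYCLIC(b) distribution, the building block that such enumeration and storage schemes generalize (a sketch, not the paper's multi-dimensional machinery):

```cpp
// Global <-> local index arithmetic for an HPF-style CYCLIC(b) distribution.
#include <cstdio>

struct BlockCyclic {
    int b, P;                                        // block size, processor count
    int owner(int g) const { return (g / b) % P; }   // rank holding global index g
    int local(int g) const {                         // position in the owner's memory:
        return (g / (b * P)) * b + g % b;            // local block number * b + offset
    }
    int global(int p, int l) const {                 // inverse: (rank, local) -> global
        return (l / b) * b * P + p * b + l % b;
    }
};

int main() {
    BlockCyclic d{4, 3};                             // CYCLIC(4) on 3 processors
    for (int g = 0; g < 12; ++g)
        std::printf("g=%2d -> rank %d, local %d\n", g, d.owner(g), d.local(g));
    // Round-trip check: d.global(d.owner(g), d.local(g)) == g for all g.
}
```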

20.
Data distribution has been one of the most important research topics in parallelizing compilers for distributed memory parallel computers. Good data distribution schemes should consider both computation load balance and communication overhead. In this paper, we show that data redistribution is necessary for executing a sequence of Do-loops if the communication cost of performing this sequence of Do-loops exceeds a threshold value. Based on this observation, we can prune the search space and derive efficient dynamic programming algorithms for determining effective data distribution schemes to execute a sequence of Do-loops with a general structure. Experimental studies on a 32-node nCUBE-2 computer are also presented.
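A minimal sketch of the dynamic programming idea with made-up cost tables: dp[i][d] is the cheapest cost of executing loops 0..i with loop i under distribution d, and a redistribution cost is paid only when consecutive loops use different distributions.

```cpp
// DP over a Do-loop sequence: choose a distribution per loop, redistribute
// between loops only when it pays off. All costs here are illustrative.
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    const int kLoops = 4, kDists = 3;   // e.g. (BLOCK,*), (*,BLOCK), (BLOCK,BLOCK)
    // exec[i][d]: modelled cost of loop i under distribution d (made-up numbers)
    double exec[kLoops][kDists] = {{10, 40, 25}, {50, 12, 30}, {45, 14, 28}, {11, 42, 26}};
    double redist = 20;                 // flat redistribution cost between loops

    std::vector<std::vector<double>> dp(kLoops, std::vector<double>(kDists));
    for (int d = 0; d < kDists; ++d) dp[0][d] = exec[0][d];
    for (int i = 1; i < kLoops; ++i)
        for (int d = 0; d < kDists; ++d) {
            double best = 1e18;         // cheapest way to arrive at (loop i, dist d)
            for (int prev = 0; prev < kDists; ++prev)
                best = std::min(best, dp[i - 1][prev] + (prev == d ? 0.0 : redist));
            dp[i][d] = best + exec[i][d];
        }
    double answer = *std::min_element(dp[kLoops - 1].begin(), dp[kLoops - 1].end());
    std::printf("cheapest total cost: %.0f\n", answer);
}
```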

