Found 20 similar documents (search time: 15 ms)
1.
An environment that lets system applications be expressed as virtual machines, through which architecture-independent multiple-instruction, multiple-data stream (MIMD) programs are written, is described. The virtual machine hides the hardware configuration from the programmer so that the MIMD programming environment always appears the same, regardless of the actual hardware. The data-definition and procedural high-level languages used in the environment and the generation of object code in the environment are discussed. The runtime configuration of the system and an implemented prototype of the system are described.
2.
Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube.
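To make the binary n-cube concrete, here is a minimal sketch of its topology: 2^n nodes labelled with n-bit numbers, where two nodes are linked when their labels differ in exactly one bit. The dimension-order routing function is an assumption added for illustration (a common textbook scheme), not a technique taken from the abstract, and the names `neighbors` and `ecube_route` are hypothetical.

```python
def neighbors(node: int, n: int) -> list[int]:
    """Return the n neighbours of `node` in a binary n-cube (flip one bit each)."""
    return [node ^ (1 << d) for d in range(n)]


def ecube_route(src: int, dst: int, n: int) -> list[int]:
    """Dimension-order route from src to dst (illustrative assumption).

    Differing bits are corrected from the lowest dimension upwards, so the
    number of hops equals the Hamming distance between the two labels.
    """
    path = [src]
    cur = src
    for d in range(n):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d
            path.append(cur)
    return path


if __name__ == "__main__":
    print(neighbors(0b000, 3))
    print(ecube_route(0b000, 0b101, 3))
```

Because every message between the same pair of nodes follows the same links under this scheme, heavy traffic between a few pairs can saturate particular links, which is one way the bottlenecks discussed in the abstract can arise.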
3.
Bronson E.C. Casavant T.L. Jamieson L.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):195-205
An experimental analysis of the architecture of an SIMD/MIMD parallel processing system is presented. Detailed implementations of parallel fast Fourier transform (FFT) programs were used to examine the performance of the prototype of the PASM (Partitionable SIMD/MIMD) parallel processing system. Detailed execution-time measurements using specialized timing hardware were made for the complete FFT and for components of SIMD, MIMD, and barrier-synchronized MIMD implementations. The component measurements isolated the effects of floating-point arithmetic operations, interconnection network transfer operations, and program control overhead. The measurements allow an accurate extrapolation of the execution time, speedup, and efficiency of the MIMD, SIMD, and barrier-synchronized MIMD programs to a full 1024-processor PASM system. This constitutes one of the first results of this kind, in which controlled experiments on fixed hardware were used to make comparisons of these fundamental modes of computing. Overall, the experimental results demonstrate the value of mixed-mode SIMD/MIMD computing and its suitability for computationally intensive algorithms such as the FFT.
4.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessors (transputer network and 64-processor IBM SP2). It is accompanied by a detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.
5.
Parallel Gaussian elimination on an MIMD computer (cited by: 3 total, 0 self-citations, 3 by others)
This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDM^T, LDL^T, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.
6.
7.
A parallel FFT on an MIMD machine (cited by: 5 total, 0 self-citations, 5 by others)
In this paper we present a parallelization of the Cooley-Tukey FFT algorithm, implemented on a shared-memory MIMD (non-vector) machine that was built in the Dept. of Computer Science, Tel Aviv University. A parallel algorithm for the one-dimensional Fourier transform is presented, together with a performance analysis. For a large array of complex numbers to be transformed, an almost linear speed-up is demonstrated. The algorithm can be executed by any number of processors, but generally the number is much less than the length of the input data.
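The following is a minimal sketch of how a radix-2 Cooley-Tukey FFT can be split among an arbitrary number of workers, under the assumption that the n/2 independent butterflies of each stage are dealt out cyclically. It is not the paper's algorithm: the workers are simulated sequentially, the input length is assumed to be a power of two, and the names `fft_stage`, `fft` and `naive_dft` are hypothetical.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input so the in-place butterflies produce natural order."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = complex(x[i])
    return out


def fft_stage(a, half, worker, p):
    """Apply the butterflies of one stage assigned to `worker` out of `p`."""
    step = 2 * half
    # Butterflies are numbered 0 .. n/2 - 1 and dealt out cyclically.
    for k in range(worker, len(a) // 2, p):
        block, j = divmod(k, half)
        i = block * step + j
        w = cmath.exp(-2j * cmath.pi * j / step)
        u, v = a[i], a[i + half] * w
        a[i], a[i + half] = u + v, u - v


def fft(x, p=4):
    """Radix-2 FFT; the loop over workers stands in for p parallel processes."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = bit_reverse_permute(x)
    half = 1
    while half < n:
        for worker in range(p):  # a real MIMD run would need a barrier here
            fft_stage(a, half, worker, p)
        half *= 2
    return a


def naive_dft(x):
    """O(n^2) reference transform, used only to check the sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

Within a stage each butterfly touches its own pair of elements, so any number of workers up to n/2 can share the work; with far fewer workers than data points, as the abstract notes is typical, each worker simply processes many butterflies per stage.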
8.
《Parallel Computing》1990,15(1-3):133-145
This paper describes a parallel algorithm for the LU decomposition of band matrices using Gaussian elimination. The matrix dimension is n × n with 2r−1 diagonals. When r lies between 1 and 2p, an optimal number of processors is determined according to an equation given in the paper. When r lies between 2p and n, the number of processors, p, stated by Veldhorst is adopted (see [7]). For a band matrix with 2r−1 diagonals (r between 1 and 2p), a task scheduling procedure is defined that aims at maximal parallelism in system operation, i.e. good load balancing. The architecture of the system is of MIMD type. The processors are connected via a common bus. Communication and synchronization are performed by message passing.
9.
The performance of a parallel program executed on a message passing MIMD computer is determined mainly by the efficiency of the communication among the processors and the efficiency of the calculation carried out in each processor. In this paper we present the results of the experiments related to the efficiency of the communication of a T800 transputer based system. The results of these experiments are used to determine the basic hardware parameters for the communication capabilities of the system. Such parameters are the asymptotic rate of data transfer ($r_\infty$) and the message length required to obtain half the asymptotic rate ($n_{1/2}$). These performance results will help us to evaluate new implementations or new architectures.
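The two parameters can be illustrated with a small sketch that assumes the usual linear timing model associated with them, t(n) = (n + n_{1/2}) / r_∞, i.e. a fixed startup time plus a per-unit transfer cost. The abstract does not state which model or fitting procedure the authors used, so the least-squares fit below and the names `fit_hockney` and `achieved_rate` are assumptions, and the data are synthetic rather than T800 measurements.

```python
def fit_hockney(sizes, times):
    """Least-squares fit of t(n) = t0 + n / r_inf; returns (r_inf, n_half).

    n_half = t0 * r_inf is the message length at which half of the
    asymptotic rate is reached (assumed linear model).
    """
    m = len(sizes)
    mean_n = sum(sizes) / m
    mean_t = sum(times) / m
    sxx = sum((n - mean_n) ** 2 for n in sizes)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(sizes, times))
    slope = sxy / sxx                  # seconds per data unit = 1 / r_inf
    t0 = mean_t - slope * mean_n       # startup time in seconds
    return 1.0 / slope, t0 / slope


def achieved_rate(n, r_inf, n_half):
    """Effective transfer rate for a message of length n under the same model."""
    return r_inf * n / (n + n_half)


if __name__ == "__main__":
    # Synthetic timings: 5 microseconds startup, 1e6 units per second.
    sizes = [8, 64, 512, 4096]
    times = [5e-6 + n / 1e6 for n in sizes]
    r_inf, n_half = fit_hockney(sizes, times)
    print(r_inf, n_half, achieved_rate(n_half, r_inf, n_half))
```

Under this model a message of length n_{1/2} spends as much time on startup as on transfer, which is why short messages achieve only a small fraction of the asymptotic rate.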
10.
《Parallel Computing》1986,3(2):93-110
This paper discusses the class of algorithms having global parallelism, i.e. those in which parallelism is introduced at the top of the program structure hierarchy. Such algorithms have performance advantages in a shared-memory MIMD computational model. A programming environment consisting of FORTRAN, enhanced by some pre-processed macros, has been built to aid in writing programs for such algorithms for the Denelcor HEP multiprocessor. Applications ranging from tens to hundreds of FORTRAN statements have been written and tested in this environment. A few parallelism constructs suffice to yield understandable programs with a high degree of parallelism. The automatic generation of programs with global parallelism seems to be a promising possibility.
11.
A mesh-vertex finite volume scheme for solving the Euler equations on triangular unstructured meshes is implemented on a MIMD (multiple instruction/multiple data stream) parallel computer. Three partitioning strategies for distributing the work load onto the processors are discussed. Issues pertaining to the communication costs are also addressed. We find that the spectral bisection strategy yields the best performance. The performance of this unstructured computation on the Intel iPSC/860 compares very favorably with that on a one-processor CRAY Y-MP/1 and an earlier implementation on the Connection Machine. The authors are employees of Computer Sciences Corporation. This work was funded under contract NAS 2-12961.
12.
13.
14.
A parallel ray tracing algorithm is presented. It subdivides the scene into 3D regions, the adjacency of which is modelled by a connectivity graph of regions. Since a ray tracing process is associated with each region, this graph becomes a graph of processes, the edges of which represent the communications between processes. This graph of processes is suitably mapped onto a hypercube topology so as to minimize the communication cost. Static load balancing is performed and solutions are brought to the problems of network congestion and termination. This work has been supported by C3 and by the CCETT (Centre Commun d'Etudes de Télédiffusion et Télécommunications) under contract 86ME46.
15.
16.
Some aspects of a long-term parallel-processing research project (PACS, Pax, and Qcd Pax) begun in 1977 at Kyoto University and Hitachi Corporation's Nuclear Power Division are discussed. The discussion is based on an analysis of a number of papers, a book detailing this work, several visits to the project laboratory in Japan, and an examination of some programs that now run on Qcd Pax. The initial name, processor array for continuum simulation (PACS), was soon changed to Processor Array experiment, or Pax. Qcd Pax (for quantum chromodynamics) is the current running computer. The characteristics of the family are described, and the hardware, communication, and memory functions of the host computer, the use of four levels of parallelism programming, and performance are examined.
17.
van Reeuwijk K. Denissen W. Sips H.J. Paalvast E.M.R.M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(9):897-914
Data parallel languages, like High Performance Fortran (HPF), support the notion of distributed arrays. However, the implementation of such distributed array structures and their access on message passing computers is not straightforward. This holds especially for distributed arrays that are aligned to each other and given a block-cyclic distribution. In this paper, an implementation framework is presented for HPF distributed arrays on message passing computers. Methods are presented for efficient (in space and time) local index enumeration, local storage, and communication. Techniques for local set enumeration provide the basis for constructing local iteration sets and communication sets. It is shown that both local set enumeration and local storage schemes can be derived from the same equation. Local set enumeration and local storage schemes are shown to be orthogonal, i.e., they can be freely combined. Moreover, for linear access sequences generated by our enumeration methods, the local address calculations can be moved out of the enumeration loop, yielding efficient local memory address generation. The local set enumeration methods are implemented by using a relatively simple general transformation rule for absorbing ownership tests. This transformation rule can be repeatedly applied to absorb multiple ownership tests. Performance figures are presented for local iteration overhead, a simple communication pattern, and storage efficiency.
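As a rough illustration of what a block-cyclic distribution means, the sketch below maps global array indices to an owning processor and a local index, and enumerates one processor's elements by stepping directly over its blocks rather than testing ownership for every index. It uses the textbook definition for an unaligned one-dimensional array; it is not the enumeration framework of the paper, and the names `owner`, `local_index` and `owned_indices` are hypothetical.

```python
def owner(i: int, b: int, nprocs: int) -> int:
    """Processor owning global index i for block size b (blocks dealt cyclically)."""
    return (i // b) % nprocs


def local_index(i: int, b: int, nprocs: int) -> int:
    """Position of global index i within its owner's local storage."""
    return (i // (b * nprocs)) * b + i % b


def owned_indices(proc: int, n: int, b: int, nprocs: int) -> list[int]:
    """Global indices owned by `proc`, in local storage order.

    The outer loop jumps from one owned block to the next, so no
    per-element ownership test is needed.
    """
    out = []
    for start in range(proc * b, n, b * nprocs):
        out.extend(range(start, min(start + b, n)))
    return out


if __name__ == "__main__":
    n, b, nprocs = 12, 2, 3
    print([owner(i, b, nprocs) for i in range(n)])
    print(owned_indices(0, n, b, nprocs))
```

With a block size of 2 on 3 processors, each processor receives every third block of two elements, and the local index of an element equals its position in the list returned by `owned_indices`.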
18.
A three-dimensional electromagnetic particle-in-cell code with Monte Carlo collision (PIC-MCC) is developed for MIMD parallel supercomputers. This code uses a standard relativistic leapfrog scheme incorporating Monte Carlo calculations to push plasma particles and to include collisional effects on particle orbits. A local finite-difference time-domain method is used to update the self-consistent electromagnetic fields. The code is implemented using the General Concurrent PIC (GCPIC) algorithm, which uses domain decomposition to divide the computation among the processors. Particles must be exchanged between processors as they move among subdomains. Message passing is implemented using the Express Cubix library and the PVM. We evaluate the performance of this code using a 512-processor Intel Touchstone Delta, a 512-processor Intel Paragon, and a 256-processor CRAY T3D. It is shown that a high parallel efficiency exceeding 95% has been achieved on all three machines for large problems. We have run PIC-MCC simulations using several hundred million particles with several million collisions per time step. For these large-scale simulations the particle push time achieved is in the range of 90–115 ns/particle/time step, and the collision calculation time in the range of a few hundred nanoseconds per collision.
19.
20.