Similar documents
20 similar documents found (search time: 31 ms)
1.
The synthetic-perturbation screening (SPS) methodology is based on an empirical approach; SPS introduces artificial perturbations into the MIMD program and captures the effects of such perturbations by using the modern branch of statistics called design of experiments. SPS can provide the basis of a powerful tool for screening MIMD programs for performance bottlenecks. This technique is portable across machines and architectures, and scales extremely well on massively parallel processors. The purpose of this paper is to explain the general approach and to extend it to address specific features that are the main source of poor performance on the shared memory programming model. These include performance degradation due to load imbalance and insufficient parallelism, and overhead introduced by synchronizations and by accessing shared data structures. We illustrate the practicality of SPS by demonstrating its use on two very different case studies: a large image understanding benchmark and a parallel quicksort.
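The core of SPS can be illustrated with a small sketch: enable an artificial delay at each candidate code site according to a two-level full-factorial design, and rank the sites by the main effect of the delay on measured run time. The function and site names below are hypothetical stand-ins, not part of the SPS tool itself.

    import itertools, random, statistics

    def run_program(delays):
        # Hypothetical stand-in for the instrumented MIMD program: site "lock_B"
        # dominates run time, so an artificial delay there inflates it the most.
        base = 10.0 + random.gauss(0, 0.05)
        return base + 0.2 * delays.get("compute_A", 0) + 2.5 * delays.get("lock_B", 0)

    def main_effects(sites, run, reps=3):
        # Two-level full factorial: every on/off combination of artificial delays.
        runs = []
        for combo in itertools.product([0, 1], repeat=len(sites)):
            times = [run(dict(zip(sites, combo))) for _ in range(reps)]
            runs.append((combo, statistics.mean(times)))
        # Main effect of a site = mean time with its delay on minus mean time with it off;
        # large effects flag likely bottlenecks.
        return {s: statistics.mean([t for c, t in runs if c[i]])
                   - statistics.mean([t for c, t in runs if not c[i]])
                for i, s in enumerate(sites)}

    print(main_effects(["compute_A", "lock_B"], run_program))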

2.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.

3.
Data-parallel volume rendering algorithms
In this presentation, we consider the image-composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders its data by using any existing volume-rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector-processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images, and not 3D volume data, are communicated among processors. Through experimental results and performance analysis, we show that our parallel algorithm is amenable to extremely efficient implementations on distributed memory, multiple instruction-multiple data (MIMD), vector-processor architectures. This algorithm is also very suitable for hardware implementation based on image composition architectures. It supports various volume-rendering algorithms, and it can be extended to provide load-balanced execution.
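The composition step described above is essentially repeated application of the "over" operator to per-processor partial images sorted front to back. A minimal NumPy sketch of that step (illustrative only, not the paper's vectorized implementation):

    import numpy as np

    def composite_over(front, back):
        # front, back: H x W x 4 arrays with premultiplied RGB plus alpha in [0, 1]
        alpha = front[..., 3:4]
        return front + (1.0 - alpha) * back      # front-to-back "over" operator

    def compose_in_visibility_order(partials):
        # partials: per-processor partial images, already sorted front to back
        image = partials[0]
        for layer in partials[1:]:
            image = composite_over(image, layer)
        return image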

4.
Ahmed M., Lester, Reda. Performance Evaluation, 2005, 60(1-4): 303-325
In studying or designing parallel and distributed systems, one should have available a robust analytical model that includes the major parameters that determine the system performance. Jackson networks have been very successful in modeling computer systems. However, the ability of Jackson networks to predict performance with system changes remains an open question, since they do not apply to systems where there are population size constraints. Also, the product-form solution of Jackson networks assumes steady-state and exponential service centers or certain specialized queueing disciplines. In this paper, we present a transient model for Jackson networks that is applicable to any population size and any finite workload (no new arrivals). Using several non-exponential distributions we show to what extent the exponential distribution can be used to approximate other distributions and transient systems with finite workloads. When the number of tasks to be executed is large enough, the model approaches the product-form solution (steady-state solution). We also study the case where the non-exponential servers have queueing (Jackson networks cannot be applied). Finally, we show how to use the model to analyze the performance of parallel and distributed systems.
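For reference, the steady-state product-form result that the abstract contrasts with its transient model is the standard open Jackson network solution: with external arrival rates \gamma_i, routing probabilities p_{ji}, and service rates \mu_i, the station arrival rates solve the traffic equations and the joint queue-length distribution factorizes,

    \lambda_i = \gamma_i + \sum_{j} \lambda_j \, p_{ji}, \qquad
    \rho_i = \frac{\lambda_i}{\mu_i} < 1, \qquad
    \pi(n_1,\dots,n_K) = \prod_{i=1}^{K} (1-\rho_i)\,\rho_i^{\,n_i},

which holds for single-server stations in steady state, and is exactly what fails under a finite workload with no new arrivals.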

5.
Message passing interface (MPI) is the de facto standard for writing parallel scientific applications on distributed memory systems. Performance prediction of MPI programs on current or future parallel systems can help to find system bottlenecks or optimize programs. To effectively analyze and predict the performance of a large and complex MPI program, an efficient and accurate communication model is highly needed. A series of communication models have been proposed, such as the LogP model family, which assume th...
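As background on the LogP family mentioned above (the abstract is truncated here), the usual point-to-point cost terms are latency L, per-message overhead o, gap g, and, in LogGP, a per-byte gap G; a small message then costs roughly

    T_{\text{small}} \approx o_{\text{send}} + L + o_{\text{recv}} \approx 2o + L,

and a k-byte message under LogGP costs roughly

    T(k) \approx o + (k-1)\,G + L + o.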

6.
The restricted synchronization structure of so-called structured parallel programming paradigms has an advantageous effect on programmer productivity, cost modeling, and scheduling complexity. However, imposing these restrictions can lead to a loss of parallelism, compared to using a programming approach that does not impose synchronization structure. In this paper we study the potential loss of parallelism when expressing parallel computations in a programming model which limits the computation graph (DAG) to series–parallel topology, which characterizes all well-known structured programming models. We present an analytical model that approximately captures this loss of parallelism in terms of simple parameters that are related to DAG topology and workload distribution. We validate the model using a wide range of synthetic and real-world parallel computations running on shared and distributed-memory machines. Although the loss of parallelism is theoretically unbounded, our measurements show that for all of the above applications the performance loss due to choosing a series–parallel structured model is invariably limited to at most 10%. In all cases, the loss of parallelism is predictable provided the topology and workload variability of the DAG are known.
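The quantity being lost can be made concrete with the usual average-parallelism measure: total work W divided by critical-path length of the DAG, which upper-bounds the attainable speedup. A minimal sketch (illustrative; this is not the paper's analytical model):

    from graphlib import TopologicalSorter

    def average_parallelism(work, preds):
        # work: node -> cost; preds: node -> iterable of predecessor nodes
        order = list(TopologicalSorter(preds).static_order())
        finish = {}
        for v in order:
            finish[v] = work[v] + max((finish[p] for p in preds.get(v, ())), default=0.0)
        total_work = sum(work.values())
        critical_path = max(finish.values())
        return total_work / critical_path      # upper bound on achievable speedup

    # tiny fork-join DAG with one unbalanced branch
    preds = {"a": (), "b": ("a",), "c": ("a",), "d": ("b", "c")}
    work  = {"a": 1.0, "b": 8.0, "c": 2.0, "d": 1.0}
    print(average_parallelism(work, preds))    # 12 / 10 = 1.2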

7.
Two parallel algorithms for determining the convex hull of a set of data points in two-dimensional space are presented. Both are suitable for MIMD parallel systems. The first is based on a divide-and-conquer strategy, in which small convex hulls are generated first and the final convex hull of all points is then obtained by repeatedly merging pairs of sub-hulls. The second algorithm works by retaining the points that are necessarily on the convex hull and discarding the points that are definitely not on it. Experimental results on an MIMD parallel system with 4 processors are analysed and presented.
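The discarding step of the second algorithm can be illustrated with the classic extreme-point filter (in the spirit of the Akl-Toussaint heuristic): any point strictly inside the quadrilateral spanned by the min-x, min-y, max-x and max-y points can never be a hull vertex. A minimal sketch, not necessarily the paper's exact test:

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means b lies to the left of ray o->a
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    def prefilter(points):
        # quadrilateral of extreme points, listed counter-clockwise
        quad = [min(points), min(points, key=lambda p: p[1]),
                max(points), max(points, key=lambda p: p[1])]
        def strictly_inside(p):
            # strictly left of every edge of the CCW quadrilateral
            return all(cross(quad[i], quad[(i+1) % 4], p) > 0 for i in range(4))
        # keep only points that could still be hull vertices
        return [p for p in points if not strictly_inside(p)]

    pts = [(-4, 0), (0, -4), (4, 0), (0, 4), (1, 1), (-2, 1)]
    print(prefilter(pts))    # the interior points (1, 1) and (-2, 1) are discarded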

8.
This paper summarizes theoretical and practical investigations into the effect of parallelization by grid-partitioning on the performance of multigrid methods for the solution of partial differential equations on general two-dimensional domains. Particular emphasis will be placed on the algorithmic scalability for MIMD distributed memory systems. Experimental results for two Navier-Stokes test problems, presented in the last section of the paper, show that the theoretically predicted dependency of the combined numerical and parallel efficiencies of multigrid methods on the number of processors employed is in fact very weak. This leads to the conclusion that multigrid is an appropriate candidate for solving partial differential equations on massively parallel machines.

9.
There are two distinct types of MIMD (Multiple Instruction, Multiple Data) computers: the shared memory machine, e.g. Butterfly, and the distributed memory machine, e.g. Hypercubes, Transputer arrays. Typically these utilize different programming models: the shared memory machine has monitors, semaphores and fetch-and-add; whereas the distributed memory machine uses message passing. Moreover there are two popular types of operating systems: a multi-tasking, asynchronous operating system and a crystalline, loosely synchronous operating system.

In this paper I firstly describe the Butterfly, Hypercube and Transputer array MIMD computers, and review monitors, semaphores, fetch-and-add and message passing; then I explain the two types of operating systems and give examples of how they are implemented on these MIMD computers. Next I discuss the advantages and disadvantages of shared memory machines with monitors, semaphores and fetch-and-add, compared to distributed memory machines using message passing, answering questions such as “is one model ‘easier’ to program than the other?” and “which is ‘more efficient’?”. One may think that a shared memory machine with monitors, semaphores and fetch-and-add is simpler to program and runs faster than a distributed memory machine using message passing but we shall see that this is not necessarily the case. Finally I briefly discuss which type of operating system to use and on which type of computer. This of course depends on the algorithm one wishes to compute.


10.
Cluster algorithms have application in diverse areas, including statistical mechanics of polymer solutions, spin models in physics, and the study of ecological systems. Most parallel cluster labeling algorithms are designed for SIMD and MIMD multiprocessors and based on relaxation methods. We present a parallel 3-D cluster labeling algorithm based on mapping tables, for distributed memory environments. The proposed algorithm focuses on minimizing interprocess communication to enhance execution performance on workstation networks. We implemented the algorithm with the aid of the EcliPSe parallel replication toolkit, exploiting special tree-combining and data reduction features of the system. We report on performance results for experiments conducted on workstation clusters.
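Within one processor, mapping-table cluster labeling is essentially a union-find (label equivalence) pass over the lattice. A minimal serial 3-D sketch, assuming a boolean occupancy array; the paper's contribution, distributing this across workstations, is not shown:

    import numpy as np

    def label_clusters(occ):
        # occ: 3-D boolean array; returns an int array of cluster labels (0 = empty)
        parent = {}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]      # path compression
                x = parent[x]
            return x
        def union(a, b):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra
        labels = np.zeros(occ.shape, dtype=int)
        nxt = 0
        for idx in np.argwhere(occ):               # visited in increasing (i, j, k) order
            i, j, k = map(int, idx)
            nbrs = [labels[i-1, j, k] if i else 0,
                    labels[i, j-1, k] if j else 0,
                    labels[i, j, k-1] if k else 0]
            nbrs = [n for n in nbrs if n]
            if not nbrs:
                nxt += 1                            # open a new provisional label
                parent[nxt] = nxt
                labels[i, j, k] = nxt
            else:
                labels[i, j, k] = nbrs[0]
                for n in nbrs[1:]:
                    union(nbrs[0], n)               # record label equivalences
        # second pass: map provisional labels to their representatives
        flat = {l: find(l) for l in parent}
        return np.vectorize(lambda l: flat.get(l, 0))(labels)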

11.
At present, commercial parallel computer systems with distributed memory architectures are usually provided with parallel FORTRAN or parallel C compilers, which are just traditional sequential FORTRAN or C compilers extended with communication statements. Programmers suffer from having to write parallel programs with explicit communication statements. The Shared Variable Oriented Parallel Precompiler (SVOPP) proposed in this paper can automatically generate appropriate communication statements based on shared variables for the SPMD (Single Program Multiple Data) computation model, greatly easing parallel programming while achieving high communication efficiency. The core functionality of the parallel C precompiler has been successfully verified on a transputer-based parallel computer. Its strong performance suggests that SVOPP may represent a breakthrough in parallel programming technique.

12.
In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; the Flex/32, an MIMD machine with 20 processors; and the Cray-2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray-2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally conclusions are presented.
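For reference, the Gaussian elimination kernel used on the Flex/32 and Cray-2 is the standard Thomas algorithm for a single tridiagonal system; a minimal sketch (the cyclic elimination variant used on the MPP trades extra arithmetic for parallelism and is not shown):

    def thomas(a, b, c, d):
        # Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = rhs.
        # a[0] and c[-1] are unused. O(n) work, but inherently sequential per system.
        n = len(d)
        cp, dp = [0.0] * n, [0.0] * n
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):
            denom = b[i] - a[i] * cp[i-1]
            cp[i] = c[i] / denom if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i-1]) / denom
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i+1]
        return x

    # quick check on a small diagonally dominant system (solution is [1, 1, 1])
    print(thomas([0, 1, 1], [4, 4, 4], [1, 1, 0], [5, 6, 5]))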

13.
王国仁, 于戈, 叶峰, 郑怀远. 《计算机学报》(Chinese Journal of Computers), 1999, 22(10): 1032-1041
A parallel hash join algorithm based on distributed shared virtual memory (DSVM) is proposed, and a benchmark for testing and evaluating parallel join algorithms is designed. The algorithm's performance is evaluated and analyzed under three different workloads with uniformly distributed data, and under Zipf-skewed data distributions with two scheduling strategies. Its performance is also compared with and analyzed against other parallel join algorithms.
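The generic shape of a hash-partitioned parallel join (partition both relations by a hash of the join key, then build and probe within each partition, one partition per processor) can be sketched as follows; this is a serial illustration with made-up sample relations, not the paper's DSVM-based algorithm:

    from collections import defaultdict

    def hash_join(R, S, key_r, key_s, nparts=4):
        # partition phase: route tuples of both relations by hash of the join key
        parts_r, parts_s = defaultdict(list), defaultdict(list)
        for t in R:
            parts_r[hash(t[key_r]) % nparts].append(t)
        for t in S:
            parts_s[hash(t[key_s]) % nparts].append(t)
        # join phase: each partition could be handled by a different processor
        out = []
        for p in range(nparts):
            table = defaultdict(list)          # build on the R partition
            for r in parts_r[p]:
                table[r[key_r]].append(r)
            for s in parts_s[p]:               # probe with the S partition
                for r in table.get(s[key_s], []):
                    out.append((r, s))
        return out

    R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
    S = [{"rid": 1, "y": 10}, {"rid": 1, "y": 20}, {"rid": 3, "y": 30}]
    print(hash_join(R, S, "id", "rid"))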

14.
Parallel processing via the application of MIMD machines offers the promise of high performance, and experience with parallel processing is accumulating rapidly. This paper briefly surveys recent results from three classes of MIMD machines: shared memory systems, non-shared memory systems, and a dataflow system. This data confirms that rapid progress is being made in the application of MIMD machines and that parallel processing can yield high performance. It also confirms that major research issues remain to be addressed.

15.
Research on and implementation of a task-based global parallel algorithm for robots
沈悦明, 陈启军. 《机器人》(Robot), 2003, 25(6): 495-500
This paper proposes a task-based global parallel algorithm for robots. On a master-slave MIMD parallel processing platform, the basic computational tasks in robot control (kinematics, dynamics, control laws, and so on) are each partitioned into subtasks, and the resulting subtasks are dynamically scheduled globally through a unified work-pool mechanism. Using pipelining together with a centralized dynamic scheduling strategy, parallel real-time simulation experiments on a planar robot were carried out on a homogeneous, loosely coupled MIMD parallel processing platform composed of five DSP processors, and satisfactory parallel performance was achieved.

16.
The implementation of the GESIMA mesoscale atmospheric model on message passing, distributed memory parallel computers is presented. Particular emphasis is given to the parallelization of the conjugate gradient solver using pre-conditioning by an incomplete LU factorization. Performance results are presented for the Cray T3D and Cray T3E systems, which show good scalability over a range of problem sizes and numbers of processors.
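The solver being parallelized is the standard preconditioned conjugate gradient iteration; in the sketch below a simple Jacobi (diagonal) preconditioner stands in for the incomplete LU factorization whose parallelization the paper addresses:

    import numpy as np

    def pcg(A, b, M_inv, tol=1e-8, maxit=1000):
        # Preconditioned CG for SPD A; M_inv applies the (approximate) inverse preconditioner.
        x = np.zeros_like(b)
        r = b - A @ x
        z = M_inv(r)
        p = z.copy()
        rz = r @ z
        for _ in range(maxit):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = M_inv(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    jacobi = lambda r: r / np.diag(A)      # diagonal stand-in for the ILU preconditioner
    print(pcg(A, b, jacobi))               # approx [0.0909, 0.6364]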

17.
A block parallel partitioning method for computing the eigenvalues of a symmetric tridiagonal matrix is presented. The algorithm is based on partitioning, in a way that ensures load balance during computation. This method is applicable to both shared memory and distributed memory MIMD systems. Compared with other parallel tridiagonal eigenvalue algorithms existing in the literature, the proposed algorithm achieves a higher speedup of O(p) on a parallel computer with p-fold parallelism, which is linear, and the data communication between processors is less than that required for other methods. The results were tested and evaluated on an MIMD machine, and were within 62% to 98% of the predicted performance.
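A common kernel behind partitioned tridiagonal eigensolvers is the Sturm-sequence count combined with bisection, which splits naturally across processors by eigenvalue index or interval; a minimal serial sketch (not the paper's block partitioning scheme):

    def count_below(a, b, x, eps=1e-30):
        # Sturm count: number of eigenvalues of the symmetric tridiagonal matrix
        # (diagonal a, off-diagonal b) that are strictly less than x
        count, d = 0, 1.0
        for i in range(len(a)):
            off = b[i-1] ** 2 if i else 0.0
            d = (a[i] - x) - off / d
            if d == 0.0:
                d = -eps                   # nudge away from exact zero
            if d < 0:
                count += 1
        return count

    def kth_eigenvalue(a, b, k, lo, hi, tol=1e-10):
        # bisection on [lo, hi] for the k-th smallest eigenvalue (k = 1, 2, ...)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if count_below(a, b, mid) >= k:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    # 3x3 example: diag [2, 2, 2], off-diag [1, 1]; eigenvalues 2-sqrt(2), 2, 2+sqrt(2)
    print(kth_eigenvalue([2, 2, 2], [1, 1], 1, lo=-1, hi=5))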

18.
In this paper we address the issue of workload decomposition in programming hierarchical distributed‐shared memory parallel systems. The workload decomposition we have devised consists of a two‐stage procedure: a higher‐level decomposition among the computational nodes; and a lower‐level one among the processors of each computational node. By focusing on porting of a case study particle‐in‐cell application, we have implemented the described work decomposition without large programming effort by using and integrating the high‐level language extensions High‐Performance Fortran and OpenMP. Copyright © 2002 John Wiley & Sons, Ltd.

19.
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI's ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.
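The work-stealing scheme can be pictured with a toy simulation: each worker owns a deque, executes its newest local task, and when it runs dry polls random victims for work to steal. The sketch below only mimics that control flow in a single process; the paper's implementation does this with MPI messages and explicit polling:

    import collections, random

    def work_stealing_sim(root_tasks, expand, nworkers=4, steal_tries=2):
        # expand(task) returns a list of child tasks (empty when the task is a leaf)
        deques = [collections.deque() for _ in range(nworkers)]
        for i, t in enumerate(root_tasks):
            deques[i % nworkers].append(t)
        processed = [0] * nworkers
        while any(deques):
            for w in range(nworkers):
                if not deques[w]:
                    for _ in range(steal_tries):                 # poll random victims for work
                        victim = random.randrange(nworkers)
                        if deques[victim]:
                            deques[w].append(deques[victim].popleft())   # steal the oldest task
                            break
                if deques[w]:
                    task = deques[w].pop()                       # run the newest local task
                    processed[w] += 1
                    deques[w].extend(expand(task))
        return processed

    # unbalanced binary-tree workload: each task carries its remaining depth
    expand = lambda d: [d - 1, d - 1] if d > 0 else []
    print(work_stealing_sim([6], expand))    # per-worker task counts, 127 tasks in total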

20.
The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized, as well as efforts at the NASA Lewis Research Center. The latter efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.
