首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The data parallel meta language (DPML) and its associated Fortran source code rewriter (DP77) support architecture independent, high performance climate and weather prediction models. The language allows the data domains over which a program operates, the communication patterns required between elements of those data domains, and some or all of the calculations of a program to be expressed at a very high level. DPML uses explicit data parallelism to express the inherent parallelism of the models, with the result that programs are easily compilable into target machine code. DP77 uses information from the DPML program to translate Fortran routines into the host specific Fortran form required for their parallel execution within the model. This paper describes the general strategy behind the development of DPML, discusses its language features using examples drawn from climate modelling, and provides details of the mechanism it uses for incorporating Fortran into data parallel programs. Encouraging results are reported for DPML versions of the standard weather benchmark models executing on vector, SIMD, and MIMD (shared memory) machines. While the paper is set within the framework of climate modelling, the technique has obvious wider implications.  相似文献   

2.
Writing programs for a distributed-memory system (DMS) is a difficult job. In this paper, a method for parallelising sequential programs for DMS is presented. The input programs are C programs and the output parallel versions are programs containing routines for the Parallel Virtual Machine (PVM). PVM allows a group of computers in a network to be specified as a DMS and provides the routines for task activation and communication. The main task in this parallelisation of program is to process the loops in the source program and determine if there exists any data dependences or not. If the loop iterations are independent, the body will be transformed to tasks that will run independently for PVM.  相似文献   

3.
A model of parallel program that can be effectively interpreted on the development computer guaranteeing the possibility of a sufficiently precise prediction of real run time for a simulated parallel program at the prescribed computer system is studied. The model is worked out for parallel programs with explicit message passing written in the Java language with MPI library access and is included into the composition of ParJava environment. The model is obtained by transforming the program control tree that can be constructed for Java programs by modifying the abstract syntax tree. To model communication functions, the model LogGP is used which allows taking into consideration the specific character of the communication network of the distributed computer system.  相似文献   

4.
Context: Message-passing parallel programs are commonly used parallel programs. Various scheduling sequences contained in these programs, however, increase the difficulty of testing them. Therefore, reducing scheduling sequences by using appropriate approaches can greatly improve the efficiency of testing these programs.Objective: This paper focuses on the problem of reducing scheduling sequences of message-passing parallel programs, and presents a novel approach to reducing scheduling sequences.Method: In this approach, scheduling sequences that affect the target statement are first determined based on the relation between a scheduling sequence and the execution of the target statement. Then, these scheduling sequences are divided into a number of equivalent classes according to the execution of the target statement. Finally, for each scheduling sequence in the same equivalent class, the values of the two proposed indexes are calculated, and the scheduling sequence with the minimal comprehensive value is selected as the one after reduction.Results: To evaluate the performance of the proposed approach, it is applied to test 12 typical message-passing parallel programs. The experimental results show that the proposed approach reduces 63% scheduling sequences on average. And compared with the method without reduction, and the method with randomly selecting scheduling sequences, the proposed approach shortens 67% and 52% execution time of a program for covering the target statement on average, respectively.Conclusion: The proposed approach can greatly reduce scheduling sequences, and shorten execution time of a program for covering the target statement, hence improving the efficiency of testing the program.  相似文献   

5.
This paper considers a model of a parallel program that can efficiently be interpreted on an instrumental computer, providing a means for a sufficiently accurate prediction of the actual time needed for execution of the parallel program on a given parallel computational system. The model is designed for parallel programs with explicit message passing written in Java with calls to the MPI library and is a part of the ParJava environment. The model is obtained by transforming the program control tree, which, for Java programs, can be constructed by modifying the abstract syntax tree. The communication functions are simulated on the basis of the LogGP model, which makes it possible to take into account specific features of the distributed computational system.  相似文献   

6.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.  相似文献   

7.
A framework for synthesizing communication-efficient distributed-memory parallel programs for block recursive algorithms such as the fast Fourier transform (FFT) and Strassen's matrix multiplication is presented. This framework is based on an algebraic representation of the algorithms, which involves the tensor (Kronecker) product and other matrix operations. This representation is useful in analyzing the communication implications of computation partitioning and data distributions. The programs are synthesized under two different target program models. These two models are based on different ways of managing the distribution of data for optimizing communication. The first model uses point-to-point interprocessor communication primitives, whereas the second model uses data redistribution primitives involving collective all-to-many communication. These two program models are shown to be suitable for different ranges of problem size. The methodology is illustrated by synthesizing communication-efficient programs for the FFT. This framework has been incorporated into the EXTENT system for automatic generation of parallel/vector programs for block recursive algorithms.  相似文献   

8.
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory, and the exploitation of shared-memory models.In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks, containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way.The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities.Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes, and with codes generated with auto-parallelizing compilers.  相似文献   

9.
A popular approach to providing nonexperts in parallel computing with an easy-to-use programming model is to design a software library consisting of a set of preparallelized routines, and hide the intricacies of parallelization behind the library's API. However, for regular domain problems (such as simple matrix manipulations or low-level image processing applications-in which all elements in a regular subset of a dense data field are accessed in turn) speedup obtained with many such library-based parallelization tools is often suboptimal. This is because interoperation optimization (or: time-optimization of communication steps across library calls) is generally not incorporated in the library implementations. We present a simple, efficient, finite state machine-based approach for communication minimization of library-based data parallel regular domain problems. In the approach, referred to as lazy parallelization, a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary. Apart from being simple and cheap, lazy parallelization guarantees to generate legal, correct, and efficient parallel programs at all times. The effectiveness of the approach is demonstrated by analyzing the performance characteristics of two typical regular domain problems obtained from the field of low-level image processing. Experimental results show significant performance improvements over nonoptimized parallel applications. Moreover, obtained communication behavior is found to be optimal with respect to the abstraction level of message passing programs.  相似文献   

10.
并行程序Petri网模型的结构性质   总被引:1,自引:0,他引:1  
正确性是并行程序的基础,但是由于它的复杂性,其验证要比串行程序困难得多,因此有必要进行建模并研究其性质.从程序的角度出发,在将基于消息传递的并行程序转换为Petri网模型之后,证明了与并行正确的并行程序对应的Petri网模型应当满足的结构性质,包括强连通性、S-不变量、T-不变量、受控死锁性质以及守恒性,并举例说明了这些性质在并行程序验证中的应用.这些性质可用于并行程序的事前验证,而且避免了使用动态性质进行验证时的状态爆炸问题,从而提高并行程序设计和验证效率.同时这些方法具有良好的可推广性.  相似文献   

11.
Influence of most resource-intensive regular system interrupts on performance of parallel programs is studied. Depending on hardware architecture and OS settings, these interrupts take 0.1–5% of CPU operation time [1, 2], yet they can cause 10–100% decrease of performance of a parallel program, for instance, for bulk operations [3–10]. The way these interrupts influence operation time of a class of parallel programs with synchronization between neighboring processes at each iteration, such as stencil computations, synchronous cellular automaton, and explicit difference scheme, is studied. Results of testing on computing clusters are given. Measures to reduce influence the interrupts exert on performance of a parallel program are discussed.  相似文献   

12.
基于模式的并行编程环境中任务队列模式的研究与实现   总被引:1,自引:0,他引:1  
并行程序的设计是并行计算的难点之一。本文在基于模式的并行编程方法的基础上,对一种典型的并行计算与通信模式-任务队列模式进行了深入的研究,并在基于模式的并行编程环境中对该模式进行了实现。本文将通过两个典型的应用实例说明在基于模式的并行编程环境中使用任务队列模式进行问题的并行求解与并行程序开发的过程,并从实现效率和可编程性方面对使用任务队列模式的并行程序和传统的MPI/PVM实现的并行程序进行了分析与比较。  相似文献   

13.
In modular programs, groups of routines constitute conceptual abstractions. A method for providing execution profiles for such programs is presented. The central idea is that the execution time for a routine is charged to the routines that call it. The implementation of this method by a profiler called gprof is described. The techniques used to gather the necessary information about the timing and structure of the program are given, as is the processing used to propagate routine execution times along arcs of the call graph of the program. The method for displaying the profile to the user is discussed. Experience using the profiles for hand-tuning large programs is summarized. Additional uses for the profiles are suggested.  相似文献   

14.
针对对称逐步超松驰预处理共轭梯度(Symmetric Successive Over Relaxation Preconditioned Conjugate Gradient,SSOR-PCG)法并行化时每步迭代都要并行求解2个三角方程组的困难,采用多色排序技术提高并行度,基于MPI+OpenMP混合编程模型开发适合于分布共享内存计算机的并行程序,通过测试选择有效的MPI通信函数,并给出3种避免共享数据竞争的措施,供不同规模问题和不同内存容量计算机情况选用.  相似文献   

15.
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture.We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden) through annotation of potential parallelism (GPH) to automatic detection of parallel skeletons in sequential code (PMLS).We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no changes to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.  相似文献   

16.
The tension between software development costs and efficiency is especially high when considering parallel programs intended to run on a variety of architectures. In the domain of shared memory architectures and explicitly parallel programs, the authors have addressed this problem by defining a programming structure that eases the development of effectively portable programs. On each target multiprocessor, an effectively portable program runs almost as efficiently as a program fine-tuned for that machine. Additionally, its software development cost is close to that of a single program that is portable across the targets. Using this model, programs are defined in terms of data structure and partitioning-scheduling abstractions. Low software development cost is attained by writing source programs in terms of abstract interfaces and thereby requiring minimal modification to port; high performance is attained by matching (often dynamically) the interfaces to implementations that are most appropriate to the execution environment. The authors include results of a prototype used to evaluate the benefits and costs of this approach  相似文献   

17.
边缘海静力数值模式是国内针对边缘海特点自主开发的数值预报模式,但该模式因物理求解方程较多且采用不宜并行化的SOR求解算法而程序计算时间过长。针对上述问题,提出基于三维网格和海洋模式特点的SOR并行求解算法,该算法在保留三维网格数据间依赖关系的同时,有效解决了SOR迭代算法难以并行化的问题。同时,引入通信避免算法,采用MPI非阻塞通信方式,细分计算和通信过程,利用计算有效隐藏通信开销,提高了并行程序效率。实验结果表明,并行后的边缘海静力数值模式程序的性能相对串行程序提升了60.71倍,3天(25920计算时间步)预报结果的均方根误差低于0.001,满足海洋数值预报的时效性和精度要求。  相似文献   

18.
Techniques are described for the automatic generation of self-scheduling parallel programs. Both scheduling algorithms and the concurrent components of applications are expressed in a high-level concurrent language. Partitioning and data dependency information are expressed by simple control statements, which may be generated either automatically or manually. A self-scheduling compiler, implemented as a source-to-source transformation, takes application code, control statements, and scheduling routines and generates a new program that can schedule its own execution on a parallel computer. The approach has several advantages compared to previous proposals. It generates programs that are portable over a wide range of parallel computers. There is no need to embed special control structures in application programs. The use of a high-level language to express applications and scheduling algorithms facilitates the development, modification, and reuse of parallel programs  相似文献   

19.
GAGAN AGRAWAL  JOEL SALTZ 《Software》1997,27(5):519-545
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications having irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routine and collective communication routines inserted for managing communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We present experimental results from two codes compiled usng our system to demonstrate the efficacy of the presented schemes. ©1997 John Wiley & Sons, Ltd.  相似文献   

20.
OpenMP规范了一系列的编译制导、环境变量和运行库,具有简单、可移植、支持增量并行等优点.但同时,采用FORK-JOIN模型所引起的频繁的线程管理开销也是制约OpenMP程序性能的瓶颈之一.本文讨论了如何利用并行区的合并与扩展,实现并行区的重构,并在此基础上利用Open64的IPA优化部件所提供的全局间过程分析能力,实现跨越过程边界的并行块的合并.最终实验表明,该方法有效地改进了OpenMP程序的运行性能.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号