Similar Documents
20 similar documents found.
1.
Programming multiprocessor parallel architectures is a complex task. This paper describes a block-structured scientific programming language, BLAZE, designed to simplify this task. BLAZE contains array arithmetic, ‘forall’ loops, and APL-style accumulation operators, which allow natural expression of fine-grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse-grained parallelism using machine-specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.

A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture, while neglecting the remainder. This paper describes the features of BLAZE, and shows how this language would be used in typical scientific programming.
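The abstract does not show BLAZE's concrete syntax, so the sketch below is only a C/OpenMP analog (compile with -fopenmp): a 'forall'-style loop whose iterations are independent, combined with an APL-style accumulation, here a sum reduction.

```c
/* Hypothetical analog of a BLAZE 'forall' loop with an accumulation
 * operator; BLAZE's own syntax is not shown in the abstract. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* fine-grained parallelism: every iteration is independent,
     * and the '+' accumulation is expressed as a reduction */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f\n", sum);
    return 0;
}
```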


2.
Kasi Anantha, Fred Long. Software, 1990, 20(6): 537–554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.

3.
Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques.
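As a rough illustration of the hybrid strategy described above, the following hand-written C sketch combines MPI across nodes with OpenMP within each node; in the paper this structure is generated automatically by the VFC compiler from HPF code, and all names and sizes here are illustrative.

```c
/* Minimal hybrid MPI+OpenMP sketch: distributed-memory parallelism
 * across cluster nodes, shared-memory parallelism within each node.
 * Compile with an MPI compiler wrapper and -fopenmp. */
#include <mpi.h>
#include <stdio.h>

#define ROWS 1024          /* rows handled per node (illustrative) */
#define COLS 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double local[ROWS][COLS];
    double local_sum = 0.0, global_sum = 0.0;

    /* shared-memory parallelism inside the SMP node */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            local[i][j] = rank + i + j;
            local_sum += local[i][j];
        }

    /* distributed-memory parallelism across nodes */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```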

4.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were initially hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.

5.
Nested parallelism appears naturally in many applications. It is required whenever a function performing parallel statements needs to call a subroutine using parallelism. A particular case occurs when the function is recursive. Nested parallelism is as basic to parallel programming as nested loops are to sequential programming. Despite this, most existing parallel languages do not provide this feature. This paper presents a new methodology to expand message passing libraries (MPL) with nested parallelism. The tool to support the methodology provides processor virtualization, load balancing, pipeline parallelism and collective operations, among other features. The computational results show that the performance obtained is comparable to that obtained using classical message passing programs. Since the methodology does not force the programmer to leave the MPL environment, all the efficiency and portability of the MPL model is preserved. Copyright © 1999 John Wiley & Sons, Ltd.
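The paper's methodology extends message-passing libraries, whose API is not reproduced in the abstract; as a conceptual stand-in, the OpenMP-task sketch below shows the kind of nested parallelism meant here: a parallel function that calls itself in parallel.

```c
/* Nested parallelism from recursion: each call spawns two parallel
 * subcalls. This is an OpenMP analog, not the MPL extension itself.
 * Compile with -fopenmp. */
#include <stdio.h>

static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)   /* nested parallel invocation */
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait         /* join the nested computations */
    return x + y;
}

int main(void) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single           /* one thread starts the recursion */
    r = fib(20);
    printf("fib(20) = %ld\n", r);
    return 0;
}
```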

6.
Features of an explicitly parallel programming language targeted for reconfigurable parallel processing systems, where the machine's N processing elements (PEs) are capable of operating in both the SIMD and SPMD modes of parallelism, are described. The SPMD (single program, multiple data) mode of parallelism is a subset of the MIMD mode in which all processors execute the same program. By providing all aspects of the language with an SIMD mode version and an SPMD mode version that are syntactically and semantically equivalent, the language facilitates experimentation with and exploitation of hybrid SIMD/SPMD machines. Language constructs (and their implementations) for data management, data-dependent control flow, and PE-address-dependent control flow are presented. These constructs are based on experience gained from programming a parallel machine prototype and are being incorporated into a compiler under development. Much of the research presented is applicable to general SIMD machines and MIMD machines.
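In the SPMD mode described, every PE executes the same program and branches on its own address. The language's actual constructs are not shown in the abstract; a minimal MPI sketch of such PE-address-dependent control flow might look as follows.

```c
/* SPMD-style PE-address-dependent control flow: all processes run
 * this one program and branch on their own rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("PE 0: coordinator among %d PEs\n", size);  /* one branch */
    else
        printf("PE %d: worker\n", rank);                   /* the other */

    MPI_Finalize();
    return 0;
}
```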

7.
8.
The Portable Parallelizing Fortran Compiler (PPFC) is an additional component for the portable programming environment developed in Tel-Aviv University for scientific code. This environment supports portable and efficient programming of diverse MIMD multiprocessors, both distributed- and shared-memory. Until now this environment has consisted of two tools: the Virtual Machine for MultiProcessors (VMMP) and the Portable Parallelizing Pascal compiler (P3C). We have added the PPFC, an automatic parallelizing compiler for the Fortran language. The compiler is fully automatic (it requires no additional declarations to assist parallelization), targets code characterized by loops operating on regular data structures, and produces efficient and portable code for a variety of multiprocessors from the same serial code. The parallel implementation uses the VMMP, which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. The PPFC parallelized 12 out of the 24 Livermore Loops. It was also applied to parallelize all 14 Fortran application programs that were parallelized by the P3C, and achieved the same speed-ups and efficiencies. In most examples the PPFC achieved high speed-ups and efficiencies on all target multiprocessors. The PPFC emphasizes efficiency and code portability. Although PPFC employs a relatively simple data flow analysis, it produces efficient code for various widely used application programs.

9.
Heterogeneous many-core architectures offer an extremely high performance-to-power ratio and have become an important direction in supercomputer architecture. However, the more complex parallel and memory hierarchies of many-core systems pose great challenges for programming and optimization, so research on parallel programming techniques for many-core systems is important both for lowering the difficulty of programming parallel applications on domestic many-core systems and for improving the performance of parallel programs. This paper proposes a multi-mode parallel programming model with a unified architecture, comprising a heterogeneous fused acceleration model and an autonomous model programmed in a homogeneous style. Based on this programming model, the Parallel C language was designed, which can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X usage mode on other many-core systems, both programming and system optimization take a global view, with distinctive features in multi-level locality description, one-sided messaging, and compatibility with existing multi-core applications. A Parallel C compilation system was built on Open64, fully supporting both the acceleration model and the autonomous model, and implementing optimizations such as data layout with automatic DMA, compiler-directed thread proxies, and topology-aware collective communication. Test results from micro-benchmarks and real applications on the Sunway TaihuLight system show that the Parallel C language and its compilation system deliver good performance and scalability and can effectively support large applications.

10.
Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd.

11.
Algorithms from scientific computing often exhibit a two-level parallelism based on potential method parallelism and potential system parallelism. We consider the parallel implementation of those algorithms on distributed memory machines. The two-level potential parallelism of algorithms is expressed in a specification consisting of an upper level hierarchy of multiprocessor tasks, each of which has an internal structure of uniprocessor tasks. To achieve an optimal parallel execution time, the parallel execution of such a program requires an optimal scheduling of the multiprocessor tasks and an appropriate treatment of uniprocessor tasks. For an important subclass of structured method parallelism we present a scheduling methodology which takes data redistributions between multiprocessor tasks into account. The cost model is based on realistic parallel runtimes. The scheduling methodology is designed for integration into a parallel compiler tool. We illustrate the multitask scheduling with several examples from numerical analysis.

12.
13.
In this paper we show how implicit parallelism in Java programs can be made explicit by a restructuring compiler using the multi-threading mechanism of the language. In particular, we focus on automatically exploiting implicit parallelism in loops and multi-way recursive methods. Expressing parallelism in Java itself clearly has the advantage that the transformed program remains portable. After compilation of the transformed Java program into byte-code, speedup can be obtained on any platform on which the Java byte-code interpreter supports the true parallel execution of threads. Moreover, we will see that the transformations presented in this paper only induce a slight overhead on uni-processors. © 1997 John Wiley & Sons, Ltd.
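The paper's compiler performs this restructuring on Java source using Java threads; as a language-neutral illustration of the same loop-to-threads transformation, here is a hand-written C/pthreads sketch (thread count and loop body are illustrative).

```c
/* A parallelizable loop split across threads: each thread runs the
 * original loop body over its own slice of the iteration space. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double a[N];

typedef struct { int lo, hi; } range_t;

static void *body(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->lo; i < r->hi; i++)
        a[i] = 2.0 * i;          /* the original loop body */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    range_t   rng[NTHREADS];

    /* split the iteration space into equal slices */
    for (int t = 0; t < NTHREADS; t++) {
        rng[t].lo = t * (N / NTHREADS);
        rng[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, body, &rng[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```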

14.
This paper introduces the JStar parallel programming language, a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallel programming, and giving the compiler and runtime maximum freedom to try alternative parallelisation strategies. We describe the execution semantics and runtime support of the language, together with several optimisations and parallelism strategies, and present some benchmark results.

15.
Compiler Optimization of OpenMP Parallel Programs
The OpenMP standard is widely used in parallel programming for its good portability and ease of use. This paper discusses compiler optimization algorithms for OpenMP parallel programs. During compilation, parallel regions are restructured by merging and expanding them, and barrier synchronization within parallel regions is optimized based on a cross-processor dependence graph. Analysis shows that these optimization strategies reduce the number of parallel regions and barrier synchronizations, effectively improving the parallel performance of OpenMP programs.
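A minimal C/OpenMP sketch of the two optimizations just described, parallel-region merging and barrier elimination, might look as follows; the paper's own analysis uses a cross-processor dependence graph, which is only approximated here by the observation that the two loops are independent.

```c
#define N 1000
static double a[N], b[N];

/* Before: two parallel regions, each ending in an implicit barrier. */
void before(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] = i;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) b[i] = 2 * i;
}

/* After: the regions are merged, and the first loop's barrier is
 * dropped with 'nowait' because b[] has no dependence on a[]. */
void after(void) {
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < N; i++) a[i] = i;
        #pragma omp for
        for (int i = 0; i < N; i++) b[i] = 2 * i;
    }
}

int main(void) { before(); after(); return 0; }
```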

16.
The widespread use of parallel machines has been hampered by the difficulty of mapping applications onto them effectively. The difficulty arises because current programming languages require the programmer to specify a problem to be solved at a low level of abstraction in an imperative form. Thus the programmer must immediately encode an architecture-specific algorithm detailing every communication and calculation. This process is prone to error and complicates the reuse of software.

An alternative approach is to specify the problem to be solved at a high-level in a functional language. Meaning-preserving program transformations can then be used to derive a parallel algorithm. Such algorithms can be run on parallel graph-reduction or dataflow machines which automatically exploit the implicit parallelism in a functional language program. Such automatic decomposition techniques, however, are not yet capable of fully yielding the extra performance offered by the parallel hardware.

We show how, by including an architecture specification with the problem specification, and extending the amount of transformation performed, it is possible to produce functional language code that explicitly expresses the calculations and communications to be performed by the processors. This simplifies compilation, yields faster programs and enables parallel software to be developed for a wide variety of parallel computer architectures.

A goal-seeking transformation methodology has been developed which enables a high-level functional specification of the problem and a high-level functional abstraction of the target computer architecture to be systematically manipulated to produce an efficient parallel algorithm tailored to the target architecture. As the transformations start from very high-level specifications, the discovery of new algorithms is facilitated.

A case study is used to demonstrate the effectiveness of the technique. We show how a high-level specification for sort can be transformed with a pipeline architecture specification to give a mergesort, and how the same specification with a dynamic-message-passing architecture specification can be transformed into a novel parallel quicksort.


17.
Users of small computers must often program in assembler language. Macros are described which assist in the construction of block-structured programs in assembler language. The macros are used in practical day-to-day programming in a cardiac electrophysiology laboratory, in which the coarse-grained control provided by the local FORTRAN compiler is not sufficient for, and even hinders, the writing of clear, easy-to-understand programs. The macros provide nestable control structures in place of the less structured transfers of conventional assembler language. The arithmetic and input/output control provided by the architecture of the machine is left fully available. The control structures implemented include conditional (IF, CASE), iteration (WHILE, REPEAT/UNTIL, FOR) and subroutine (PROC, CALL, etc.) constructs. No control of variable scope is provided. The macro implementation is discussed along with the code generated. There is a discussion of architectural features which allow the macros to be independent of specific register usage and addressing mode. Experience with use of the macros in a high-speed, real-time data acquisition and display environment is presented. We conclude that these macros are easy to use and assist in program readability and documentation.
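The macros themselves are machine-specific assembler, which the abstract does not reproduce; as a conceptual analog, the C sketch below shows how a nestable WHILE can be macro-built from unstructured jumps, with the caller supplying a unique label id (all names here are hypothetical).

```c
/* Structured control built from gotos by macros, in the spirit of
 * the assembler macros described; 'id' gives each loop unique labels
 * so loops can be nested. */
#include <stdio.h>

#define WHILE(id, cond)  id##_top: if (!(cond)) goto id##_end;
#define ENDWHILE(id)     goto id##_top; id##_end: ;

int main(void) {
    int i = 0, sum = 0;
    WHILE(loop1, i < 10)
        sum += i;
        i++;
    ENDWHILE(loop1)
    printf("sum = %d\n", sum);   /* prints 45 */
    return 0;
}
```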

18.
We describe a compiler and run-time system that allow data-parallel programs to execute on a network of heterogeneous UNIX workstations. The programming language supported is Dataparallel C, a SIMD language with virtual processors and a global name space. This parallel programming environment allows the user to take advantage of the power of multiple workstations without adding any message-passing calls to the source program. Because the performance of individual workstations in a multi-user environment may change during the execution of a Dataparallel C program, the run-time system automatically performs dynamic load balancing. We present experimental results that demonstrate the usefulness of dynamic load balancing in a multi-user environment. These results suggest that initially allocating the same amount of work to each processor and letting the dynamic load balancing algorithm adjust the load during program execution yields very good performance. Hence neither the compiler nor the run-time system needs a priori knowledge of the speeds of the machines that will participate in a program execution.
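Dataparallel C itself is not shown in the abstract; as a small single-node analog of the run-time system's dynamic load balancing, OpenMP's dynamic schedule hands out loop iterations in chunks at run time, so faster (less loaded) processors automatically claim more work.

```c
/* Dynamic self-scheduling analog: iterations are dealt out in chunks
 * of 64 as threads finish, balancing uneven processor speeds.
 * Compile with -fopenmp. */
#include <stdio.h>

#define N 100000
static double a[N];

int main(void) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < N; i++)
        a[i] = (double)i * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```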

19.
Image processing applications require both computing and communication power. The objective of the GFLOPS project was to study all aspects of the design of such computers. The project's aim was to develop a parallel architecture as well as its software environment to implement these applications efficiently. A development environment, in particular a data-parallel C language, has been built for this purpose. The parallel C language presented here simplifies the use of such architectures by providing the programmer with a global name space and a control mechanism to exploit fine- and medium-grain parallelism in applications. The main advantage of our paradigm is that it allows a unique framework to express both data and control parallelism. We have implemented this programming environment on the GFLOPS machine, which supports up to 512 processor nodes (PC motherboards) connected over a scalable and cost-effective network via the PCI bus, at a constant cost per node. The aim is to obtain, at low cost, a scalable virtually shared-memory machine. In this paper we discuss the design of the GFLOPS machine and its parallel C language, and evaluate the effectiveness of the mechanisms incorporated. The analysis of the architecture's behaviour was conducted with micro-benchmarks and image processing algorithms written in C.

20.
In this paper, we describe our experience with developing Airshed, a large pollution modeling application, in the Fx programming environment. We demonstrate that high level parallel programming languages like Fx and High Performance Fortran offer a simple and attractive model for developing portable and efficient parallel applications. Performance results are presented for the Airshed application executing on Intel Paragon and Cray T3D and T3E parallel computers. The results demonstrate that the application is “performance portable,” i.e., it achieves good and consistent performance across different architectures, and that the performance can be explained and predicted using a simple model for the communication and computation phases in the program. We also show how task parallelism was used to alleviate I/O related bottlenecks, an important consideration in many applications. Finally, we demonstrate how external parallel modules developed using different parallelization methods can be integrated in a relatively simple and flexible way with modules developed in the Fx compiler framework. Overall, our experience demonstrates that a high level parallel programming environment based on a language like HPF is suitable for developing complex multidisciplinary applications.
