Similar Literature
20 similar documents found.
1.
For pt. I, see ibid., pp. 170-80. In pt. I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and report the performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube, and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor and that scales linearly with an increasing number of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.
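The grain-size idea can be pictured with a minimal sketch (hypothetical names and threshold; a Python process pool stands in for the Chare kernel's task scheduling): a recursive computation is unfolded into coarse sequential grains, and only tasks large enough to amortize scheduling overhead are run in parallel.

```python
from concurrent.futures import ProcessPoolExecutor

GRAIN = 28  # hypothetical grain-size cutoff, tuned per machine in practice

def fib(n):
    # plain sequential evaluation, used below the grain-size cutoff
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def split(n, grain):
    # unfold the call tree until subproblems fall below the cutoff,
    # yielding independent sequential "grains"
    if n < grain:
        yield n
    else:
        yield from split(n - 1, grain)
        yield from split(n - 2, grain)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # each grain runs sequentially; only coarse tasks pay scheduling cost
        total = sum(pool.map(fib, split(32, GRAIN)))
    print(total)  # fib(32) == 2178309
```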

2.
When two or more literals in the body of a Prolog clause are solved in (AND) parallel, their solutions need to be joined to compute solutions for the clause. This is often a difficult problem in parallel Prolog systems that exploit OR and independent AND parallelism in Prolog programs. In several AND/OR parallel systems proposed recently, this problem is side-stepped at the cost of unexploited OR parallelism in the program, in part due to the complexity of the backtracking algorithm beneath AND parallel branches. In some cases, the data dependency graphs used by these systems cannot represent all the exploitable independent AND parallelism known at compile time. In this paper, we describe the compile-time analysis for an optimized join algorithm for supporting independent AND parallelism in logic programs efficiently without leaving any OR parallelism unexploited. We then discuss how this analysis can be used to yield very efficient runtime behavior. We also discuss problems associated with a tree representation of the search space when arbitrarily complex data dependency graphs are permitted. We describe how these problems can be resolved by mapping the search space onto the data dependency graphs themselves. The algorithm has been implemented in a compiler for parallel Prolog based on the Reduce-OR process model. The algorithm is suitable for the implementation of AND/OR systems on both shared and nonshared memory machines. Performance on benchmark programs exhibiting AND and OR parallelism on one shared memory machine and one message passing machine is presented. This work was supported in part by NSF Grants CCR-87-00988 and CCR-89-02496. A shorter version of this paper appears in the Proceedings of NACLP 1990.
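A toy sketch of the underlying join semantics (an illustration only, not the paper's optimized algorithm): solutions of independent AND branches carry disjoint bindings, so joining reduces to a cross product with no conflicts to reconcile.

```python
from itertools import product

def solve_p(x):
    # solutions of a hypothetical goal p(X)
    return [{"X": v} for v in (1, 2)]

def solve_q(y):
    # solutions of a hypothetical goal q(Y), independent of p(X)
    return [{"Y": v} for v in ("a", "b")]

def join(left, right):
    # cross product of solutions from independent AND branches;
    # independence guarantees no binding conflicts to reconcile
    return [{**l, **r} for l, r in product(left, right)]

# clause  r(X, Y) :- p(X), q(Y).   -- p and q solved in (AND) parallel
print(join(solve_p("X"), solve_q("Y")))
# [{'X': 1, 'Y': 'a'}, {'X': 1, 'Y': 'b'}, {'X': 2, 'Y': 'a'}, {'X': 2, 'Y': 'b'}]
```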

3.
The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications.
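A rough illustration of why a combining fetch-and-add style primitive makes synchronization cheap: arrival at a counter barrier is a single memory operation per processor, regardless of processor count. (On the SB-PRAM the combining happens in hardware in unit time; the Python lock below merely emulates one atomic instruction, and all names here are ours.)

```python
import threading

class FetchAddBarrier:
    # Illustrative counter barrier: with a hardware fetch-and-add,
    # the arrival step is one memory instruction independent of the
    # number of processors. Python needs a lock to emulate atomicity.
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.lock = threading.Lock()
        self.event = threading.Event()

    def fetch_add(self):
        with self.lock:          # stands in for one atomic instruction
            self.count += 1
            return self.count

    def wait(self):
        if self.fetch_add() == self.n:  # last arrival releases everyone
            self.event.set()
        self.event.wait()

barrier = FetchAddBarrier(4)

def worker(i):
    barrier.wait()
    print(f"worker {i} past barrier")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```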

4.
Research on a GCC-Based VLIW Compilation System
A VLIW machine issues and executes multiple parallel operations in a single machine cycle, thereby achieving a high degree of instruction-level parallelism; the dependence analysis and scheduling of these operations are left entirely to the compiler, so how well a VLIW machine's parallelism can be exploited depends on the quality of its architecture-specific compiler. GCC, developed by GNU, is one of the most widely used compilation systems; it supports multiple languages and platforms, has an open structure, and can apply a range of mature, conventional compiler optimizations to generate efficient code. This paper analyzes the structural characteristics of VLIW and GCC and proposes a design for a GCC-based VLIW compilation system: GCC performs architecture-independent optimizations (and a small number of architecture-dependent ones) at the level of the RTL intermediate code, while architecture-dependent optimizations targeting the VLIW structure are performed at the assembly-code level. This makes full use of GCC's mature compilation technology to rapidly develop an efficient multi-language VLIW compilation system.
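A minimal sketch of the kind of architecture-dependent, assembly-level pass such a design performs after GCC: greedy packing of dependence-ready operations into fixed-width issue bundles (hypothetical operations and issue width, not the actual system).

```python
def bundle(ops, deps, width=4):
    """Greedy list scheduling of operations into VLIW issue bundles.

    ops   -- operation names in original order
    deps  -- dict: op -> set of ops it depends on
    width -- issue slots per machine cycle (assumed)
    """
    done, bundles = set(), []
    remaining = list(ops)
    while remaining:
        # ready = ops whose predecessors all completed in earlier cycles
        ready = [o for o in remaining if deps.get(o, set()) <= done][:width]
        if not ready:
            raise ValueError("dependence cycle")
        bundles.append(ready)
        done |= set(ready)
        remaining = [o for o in remaining if o not in done]
    return bundles

deps = {"c": {"a", "b"}, "d": {"c"}}       # c needs a and b; d needs c
print(bundle(["a", "b", "c", "d"], deps))  # [['a', 'b'], ['c'], ['d']]
```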

5.
Andorra-I is an experimental parallel Prolog system which transparently exploits both dependent and-parallelism and or-parallelism. One of the main components of Andorra-I is its preprocessor. In order to obtain efficient execution of programs in Andorra-I, the preprocessor includes a compiler for Andorra-I. The compiler includes a determinacy analyser and a clause compiler, and generates code for a specialised abstract machine. In this paper we discuss the main issues in the Andorra-I compiler, presenting its abstract instruction set and describing the algorithms used in its implementation.

6.
An Effective Parallelism Analysis Algorithm
Parallelism analysis plays an important role in a parallelizing compilation system, and its quality directly affects the success of the compiler. With the development of cluster systems and their parallel programming environments, most parallel systems can support the execution of multiply nested parallel loops; for programming systems that support only a single level of parallel loops, selecting the loop whose parallel execution is most efficient is also important. This paper therefore proposes an effective loop parallelism analysis scheme that not only reports the parallelism of multi-level loop nests but can also handle the great majority of parallelism questions arising in practical applications. The paper modifies the traditional parallelism analysis algorithm and presents an effective parallelism analysis algorithm.
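One classic ingredient of such loop parallelism analysis is a dependence test. A minimal sketch using the standard GCD test follows (an illustration of the idea, not the algorithm proposed in the paper):

```python
from math import gcd

def gcd_test(a, b, c):
    # Classic GCD dependence test for accesses A[a*i + c1] and A[b*j + c2]
    # with c = c2 - c1: the equation a*i - b*j = c can have integer
    # solutions only when gcd(a, b) divides c. "No solution" proves
    # independence, so the loop may run in parallel.
    return c % gcd(a, b) == 0   # True => dependence possible

# for i: A[2*i] = ... A[2*i + 1] ...   (write stride 2, read offset 1)
print(gcd_test(2, 2, 1))   # False: gcd 2 does not divide 1 -> parallelizable
# for i: A[2*i] = ... A[4*i + 2] ...
print(gcd_test(2, 4, 2))   # True: dependence possible -> keep sequential
```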

7.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
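A toy sketch of the flavor of decision such an algorithm makes, combining a parallelism check with a one-line cache model (the line size, strides, and selection rule are our assumptions, far simpler than the paper's algorithm):

```python
LINE = 8  # cache line length in array elements (assumed)

def misses_per_access(stride):
    # toy cache model: unit-stride access touches a new line only once
    # every LINE elements; large strides miss on every access
    return 1.0 / LINE if stride == 1 else 1.0

def choose_parallel_loop(nest):
    # among dependence-free loops, parallelize the one with the worst
    # locality, keeping the contiguous (stride-1) loop sequential inside
    # each processor so cache lines are not split across processors
    candidates = [loop for loop in nest if loop["parallel"]]
    return max(candidates, key=lambda l: misses_per_access(l["stride"]))

nest = [
    {"name": "i", "parallel": True, "stride": 1024},  # row-major row loop
    {"name": "j", "parallel": True, "stride": 1},     # contiguous inner loop
]
print(choose_parallel_loop(nest)["name"])   # 'i'
```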

8.
Most of the current compiler projects for distributed memory architectures leave the critical and time-consuming problem of finding performance-efficient data distributions and profitable program transformations for a given parallel program almost entirely to the programmer. Performance estimators provide critical performance information to both programmers and parallelizing compilers, the most crucial part of which involves determining the communication overhead induced by a program. In this paper, we present a very practical approach to the problem of compile-time estimation of communication costs for regular codes that includes analytical methods to model the number of messages exchanged, data volume transferred, transfer time, and network contention. In order to achieve high estimation accuracy, our estimator aggressively exploits compiler analysis and optimization information. It is assumed that machine parameters and problem size are known at compile time. We conducted a variety of experiments to validate the estimation accuracy and the ability to support both the programmer and compiler in the effort of performance tuning of parallel programs. We believe that our approach can be automatically applied to a large class of regular codes.
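A minimal sketch of the shape of such a compile-time estimator, with assumed machine parameters (the alpha/beta/contention constants and the halo-exchange example are ours, not the paper's models):

```python
def comm_cost(messages, volume_bytes, alpha=1e-5, beta=1e-9, contention=1.0):
    """Toy compile-time communication estimate (all parameters assumed):
    alpha      -- per-message startup latency in seconds
    beta       -- per-byte transfer time in seconds
    contention -- multiplicative slowdown on shared links
    Mirrors the shape of such models (message count, data volume,
    transfer time, contention), not the paper's actual equations."""
    return contention * (messages * alpha + volume_bytes * beta)

# e.g. a block-distributed 1024x1024 double array exchanging one halo row
# with each of 2 neighbours per iteration:
msgs, vol = 2, 2 * 1024 * 8
print(f"{comm_cost(msgs, vol):.2e} s per iteration")
```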

9.
The Portable Parallelizing Fortran Compiler (PPFC) is an additional component for the portable programming environment developed in Tel-Aviv University for scientific code. This environment supports portable and efficient programming of diverse MIMD multiprocessors, both distributed- and shared-memory. Until now this environment has consisted of two tools: the Virtual Machine for MultiProcessors (VMMP) and the Portable Parallelizing Pascal compiler (P3C). We have added the PPFC, an automatic parallelizing compiler for the Fortran language. The compiler is fully automatic (it does not require additional declarations to assist parallelization), targets code characterized by loops operating on regular data structures, and produces efficient and portable code for a variety of multiprocessors from the same serial code. The parallel implementation uses the VMMP, which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. The PPFC parallelized 12 out of the 24 Livermore Loops. It was also applied to parallelize all 14 Fortran application programs that were parallelized by the P3C, and achieved the same speed-ups and efficiencies. In most examples the PPFC achieved high speed-ups and efficiencies on all target multiprocessors. The PPFC emphasizes efficiency and code portability. Although PPFC employs a relatively simple data flow analysis, it produces efficient code for various widely used application programs.

10.
Well designed domain specific languages have three important benefits: (1) the easy expression of problems, (2) the application of domain specific optimizations (including parallelization), and (3) dramatic improvements in productivity for their users. In this paper we describe a compiler and parallel runtime system for modeling the complex kinetics of rubber vulcanization and olefin polymerization that achieves all of these goals. The compiler allows the development of a system of ordinary differential equations describing a complex vulcanization reaction or single-site olefin polymerization reaction—a task that used to require months—to be done in hours. A specialized common sub-expression elimination and other algebraic optimizations sufficiently simplify the complex machine generated code to allow it to be compiled—eliminating all but 8.0% of the operations in our largest program and enabling over 60 times faster execution on our largest benchmark codes. The parallel runtime and dynamic load balancing scheme enables fast simulations of the model.
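A minimal sketch of the common sub-expression elimination step on expression trees (a toy tuple representation of our own; the paper's algebraic optimizations go well beyond this):

```python
def cse(exprs):
    # Hash-cons each (op, args) node so repeated subexpressions are
    # computed once; returns numbered temporaries and the final roots.
    table, temps = {}, []
    def visit(e):
        if isinstance(e, str):          # a leaf variable or constant
            return e
        key = (e[0],) + tuple(visit(a) for a in e[1:])
        if key not in table:
            table[key] = f"t{len(table)}"
            temps.append((table[key], key))
        return table[key]
    roots = [visit(e) for e in exprs]
    return temps, roots

# two d/dt equations sharing the rate term k*A*B  (tuple form: (op, lhs, rhs))
shared = ("*", ("*", "k", "A"), "B")
temps, roots = cse([("-", "0", shared), shared])
for name, key in temps:
    print(name, "=", key)   # t0 = k*A, t1 = t0*B, t2 = 0 - t1
print("roots:", roots)      # ['t2', 't1'] -- the product is computed once
```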

11.
Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with a two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely tile period P_t and intertile latency L_t. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed-form solution, which can be used by a compiler before code generation.
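The two-parameter abstraction admits a textbook-style model. A sketch under assumed costs (the per-tile overhead o and per-iteration work w are ours, and this is an illustrative model, not the paper's closed form):

```python
from math import sqrt

def exec_time(s, N, P, o, w):
    # T(s) = (N/s)(o + s*w)  +  (P-1)(o + s*w)
    #        one processor's work      pipeline fill across P processors
    # o: per-tile startup overhead, w: per-iteration work (both assumed);
    # (o + s*w) plays the role of the tile period for tile size s
    period = o + s * w
    return (N / s) * period + (P - 1) * period

def opt_tile(N, P, o, w):
    # dT/ds = -N*o/s^2 + (P-1)*w = 0  =>  s* = sqrt(N*o / ((P-1)*w))
    return sqrt(N * o / ((P - 1) * w))

N, P, o, w = 1_000_000, 16, 50.0, 1.0
s = opt_tile(N, P, o, w)
print(f"optimal tile ~{s:.0f}, time {exec_time(s, N, P, o, w):.0f}"
      f" vs tile size 1: {exec_time(1, N, P, o, w):.0f}")
```

Small tiles pay the per-tile overhead too often; large tiles stretch the pipeline fill, hence the square-root trade-off.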

12.
This is the first of two papers presenting and evaluating the power of a new framework for combinatorial optimization in graphical models, based on AND/OR search spaces. We introduce a new generation of depth-first Branch-and-Bound algorithms that explore the AND/OR search tree using static and dynamic variable orderings. The virtue of the AND/OR representation of the search space is that its size may be far smaller than that of a traditional OR representation, which can translate into significant time savings for search algorithms. The focus of this paper is on linear space search which explores the AND/OR search tree. In the second paper we explore memory intensive AND/OR search algorithms. In conjunction with the AND/OR search space we investigate the power of the mini-bucket heuristics in both static and dynamic setups. We focus on two of the most common optimization problems in graphical models: finding the Most Probable Explanation in Bayesian networks and solving Weighted CSPs. In extensive empirical evaluations we demonstrate that the new AND/OR Branch-and-Bound approach improves considerably over the traditional OR search strategy and show how various variable ordering schemes impact the performance of the AND/OR search scheme.
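A minimal sketch of depth-first Branch-and-Bound on an AND/OR tree for minimization (a toy node encoding of our own; no mini-bucket heuristic or variable ordering is included):

```python
def andor_bb(node, bound=float("inf")):
    """Depth-first Branch-and-Bound over an AND/OR tree (minimization).
    node = ("leaf", cost) | ("OR", [children]) | ("AND", [children]).
    `bound` is the best cost found so far; subtrees that cannot beat it
    are pruned. A sketch of the search scheme only."""
    kind = node[0]
    if kind == "leaf":
        return node[1]
    if kind == "OR":                     # choose the cheapest alternative
        best = bound
        for child in node[1]:
            best = min(best, andor_bb(child, best))  # tightened bound prunes
        return best
    total = 0                            # AND: all subproblems must be solved
    for child in node[1]:
        total += andor_bb(child, bound - total)
        if total >= bound:               # partial sum already too expensive
            return float("inf")
    return total

tree = ("OR", [("AND", [("leaf", 2), ("leaf", 3)]),
               ("AND", [("leaf", 1), ("OR", [("leaf", 9), ("leaf", 4)])])])
print(andor_bb(tree))   # 5 (the second AND branch ties and is cut by the bound)
```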

13.
A new grid programming environment for remote procedure call (RPC) based master–worker type task parallelization is presented. The environment is realized through the use of a set of compiler directives, called OpenGR, and is implemented in the present study based on the Omni OpenMP compiler system and Ninf-G grid-enabled RPC system as a parallel execution mechanism. Using OpenGR directives, existing sequential applications can be readily adapted to the grid environment as master–worker parallel programs using the RPC architecture. The combination of OpenGR and OpenMP directives also allows for the hybrid parallelization of sequential programs, supporting both synchronous and asynchronous parallelism.
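A rough analogy to the master-worker pattern such directives express (Python futures stand in for Ninf-G's asynchronous grid RPC; the worker function is hypothetical):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def worker(task):
    # stand-in for a remote procedure the grid RPC layer would ship
    # to a worker node; here it just squares its input
    return task * task

if __name__ == "__main__":
    tasks = range(16)
    with ProcessPoolExecutor(max_workers=4) as pool:
        # asynchronous calls: the master keeps issuing work and gathers
        # results as they complete, as an annotated sequential loop might
        futures = {pool.submit(worker, t): t for t in tasks}
        results = {futures[f]: f.result() for f in as_completed(futures)}
    print([results[t] for t in tasks])
```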

14.
Kasi Anantha, Fred Long. Software, 1990, 20(6): 537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.

15.
Parallel Computing, 1997, 23(8): 1089-1112
A new generation of data parallel languages has been proposed whereby a user specifies how data structures are to be distributed amongst the processor nodes of a distributed-memory machine. Based on this information, the compiler then generates code for the parallel application. Although this approach significantly simplifies the development of the initial version of a parallel application, selection of good data distributions leading to efficient computations is often quite difficult. Therefore, performance debuggers are needed to yield insights into the data distribution effects. On the other hand, most of the existing approaches to performance debugging are very general and thus do not provide the user feedback in terms of the high-level programming model or the source code of the parallel application. In this paper, we describe a novel approach to performance debugging of data parallel programs. The design and implementation of a visual performance debugger is described that is specifically targeted to meet the performance debugging requirements of a data-parallel programming model based on user-specified data distributions. The performance debugger is part of an integrated programming environment, called EPPP, which also supports a data parallel compiler and a parallel architecture simulator. Thus development and performance debugging of an application may be done either on the real hardware or by using the simulator. The code for EPPP can be obtained free of cost (contact web address: http://www.crim.ca/apar).

16.
The bandwidth mismatch between processor and main memory is a major throughput-limiting problem. Although streamed computations have predictable access patterns, their references have little temporal locality and are generally too long to cache. A memory and compiler co-optimization aimed at reducing low-level memory accesses using software and hardware locality optimizations is presented. We propose a scalable and predictable parallel memory based on a compiler synthesis of storage schemes for multi-dimensional arrays that are accessed by an arbitrary but known set of data access patterns. Using the algebra of non-singular Boolean matrices, we present analysis of conflict-free access to (1) parallel memories and (2) alignment networks. Finding a multi-pattern storage scheme is an NP-complete problem. An effective compiler heuristic is proposed for finding a storage matrix that minimizes overall memory access time. This applies to arbitrary linear patterns and arbitrary alignment networks. It is shown that the proposed heuristic finds an optimal storage scheme for parallel (1) FFT and (2) bitonic sorting. The proposed storage scheme outperforms statically optimized storages in the case of power-of-2 multi-stride access. The case of non-power-of-2 strides is also addressed. The performance and scalability of the proposed parallel memory and its predictable access time are presented using numerical and multimedia algorithms. It is shown that a memory utilization above 83% is achieved by our storage scheme for 64 memories, which largely outperforms previous proposals. Our approach provides a tool for matching the storage pattern with the data access patterns needed for embedded systems running streamed computations with predictable data access patterns.
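A toy instance of a Boolean-matrix (XOR-skewing) storage scheme, showing why one static scheme cannot serve every access pattern and thus motivating per-pattern synthesis (the scheme bank = i XOR j is one simple member of the family, not the paper's optimized multi-pattern storage):

```python
M = 8  # number of memory modules (power of two)

def bank(i, j):
    # XOR skewing: the bank number is a GF(2)-linear function of the
    # address bits, here simply i XOR j modulo the module count
    return (i ^ j) % M

def conflict_free(cells):
    # a pattern is conflict-free iff its cells land in distinct banks
    return len({bank(i, j) for i, j in cells}) == len(cells)

row      = [(3, j) for j in range(M)]              # one matrix row
column   = [(i, 5) for i in range(M)]              # one matrix column
print(conflict_free(row), conflict_free(column))   # True True
diagonal = [(i, i) for i in range(M)]
print(conflict_free(diagonal))   # False: i^i == 0, all cells in one bank
```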

17.
The use of a massively parallel machine is aimed at the development of applications programs to solve the most significant scientific, engineering, industrial and commercial problems. High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven to be very complicated, error-prone and architecture-specific. One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to efficiently perform some basic operations, while hiding low-level system specifics from the programmer. Efficiently porting a library to new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues. We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to. An additional level of abstraction is achieved using the programming language C++, an object-oriented language (Eckel, 1989; Stroustrup, 1986). For the programmer who is using BLAS-3, it hides both the data structures used to define linear algebra objects and the parallel nature of the operations performed on these objects. We coded BLAS on top of VMMP. This code was run without any modifications on two shared memory machines: the commercial Sequent Symmetry and the experimental Taunop. (The code should run on any machine VMMP was ported onto, given the availability of a C++ compiler.) Performance results for this implementation are given. The speed-up of the BLAS-3 routines, tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89. Application programs (e.g. Cholesky factorization) using the library routines achieved similar efficiency.

18.
Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd.

19.
In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine PAM/TGR is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on the transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.

20.
This paper presents some fundamental properties of independent and-parallelism and extends its applicability by enlarging the class of goals eligible for parallel execution. A simple model of (independent) and-parallel execution is proposed and issues of correctness and efficiency are discussed in the light of this model. Two conditions, "strict" and "nonstrict" independence, are defined and then proved sufficient to ensure correctness and efficiency of parallel execution: If goals which meet these conditions are executed in parallel, the solutions obtained are the same as those produced by standard sequential execution. Also, in the absence of failure, the parallel proof procedure does not generate any additional work (with respect to standard SLD resolution), while the actual execution time is reduced. Finally, in case of failure of any of the goals, no slowdown will occur. For strict independence, the results are shown to hold independently of whether the parallel goals execute in the same environment or in separate environments. In addition, a formal basis is given for the automatic compile-time generation of independent and-parallelism: Compile-time conditions to efficiently check goal independence at run time are proposed and proved sufficient. Also, rules are given for constructing simpler conditions if information regarding the binding context of the goals to be executed in parallel is available to the compiler.
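A toy sketch of the run-time test that such compile-time conditions are designed to simplify: strict independence as the absence of shared variables between goals (a toy term representation of our own; real systems must also handle partially instantiated terms):

```python
def vars_of(term):
    # collect variables in a toy Prolog term: strings starting with an
    # uppercase letter are variables, tuples are compound terms
    if isinstance(term, str):
        return {term} if term[:1].isupper() else set()
    return set().union(*map(vars_of, term))

def strictly_independent(goal1, goal2):
    # strict independence: the goals share no unbound variables, so
    # neither can affect the other's bindings and parallel execution
    # is safe; this is the check the compile-time conditions reduce
    return not (vars_of(goal1) & vars_of(goal2))

# p(X, a) and q(Y): no shared variable -> parallel execution is safe
print(strictly_independent(("p", "X", "a"), ("q", "Y")))   # True
# p(X) and q(X): both may bind X -> must run sequentially
print(strictly_independent(("p", "X"), ("q", "X")))        # False
```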
