Similar Articles
20 similar articles found (search time: 15 ms).
1.
Kasi Anantha  Fred Long 《Software》1990,20(6):537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.
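To make the idea of a legal code movement concrete, here is a minimal C sketch (my own illustration, not the paper's algorithm): a compacting compiler may reorder or overlap statements only when no data dependence links them.

```c
#include <stdio.h>

/* A hypothetical example of legal vs. illegal code motion. S3 has no
 * dependence on S1 or S2, so hoisting it to run alongside S1 is legal;
 * S4 reads the results of S2 and S3 and must remain last. */
int main(void) {
    int x = 1, y = 2, p = 3, q = 4;
    int a = x + y;   /* S1 */
    int b = a * 2;   /* S2: flow dependence on S1 (reads a)         */
    int c = p + q;   /* S3: independent of S1/S2 -- legally movable */
    int d = b + c;   /* S4: depends on S2 and S3 -- not movable     */
    printf("d = %d\n", d);
    return 0;
}
```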

2.
ParC is an extension of the C programming language with block-oriented parallel constructs that allow the programmer to express fine-grain parallelism in a shared-memory model. It is suitable for the expression of parallel shared-memory algorithms, and is also conducive to the parallelization of sequential C programs. In addition, performance-enhancing transformations can be applied within the language, without resorting to low-level programming. The language includes closed constructs to create parallelism, as well as instructions to cause the termination of parallel activities and to enforce synchronization. The parallel constructs are used to define the scope of shared variables, and also to delimit the sets of activities that are influenced by termination or synchronization instructions. The semantics of parallelism are discussed, especially relating to the discrepancy between the limited number of physical processors and the potentially much larger number of parallel activities in a program.
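ParC's own constructs are not reproduced here; as a rough analogue of its closed, block-oriented style, the following plain-C sketch uses OpenMP (a different, standard mechanism) to show a parallel block whose activities synchronize at the end of the construct.

```c
#include <stdio.h>

/* Not ParC syntax: an OpenMP stand-in for a closed parallel construct.
 * The parallel-for block delimits the parallel activities, and the
 * implicit barrier at its end plays the role of enforced
 * synchronization before sequential execution resumes. */
int main(void) {
    int sum[4] = {0, 0, 0, 0};
    #pragma omp parallel for        /* fine-grain parallel activities */
    for (int i = 0; i < 4; i++)
        sum[i] = i * i;
    /* all parallel activities have terminated here */
    printf("%d %d %d %d\n", sum[0], sum[1], sum[2], sum[3]);
    return 0;
}
```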

3.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
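The core question such dependence analysis answers can be shown in a few lines of C (an illustrative sketch, not taken from the paper):

```c
#include <stdio.h>
#define N 8

/* Loop 1 carries no cross-iteration dependence, so its iterations may
 * run in parallel. Loop 2 has a loop-carried flow dependence: iteration
 * i reads a[i-1], written by iteration i-1, so it must stay sequential. */
int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    for (int i = 0; i < N; i++)      /* loop 1: parallelizable       */
        b[i] = a[i] + 1.0;

    for (int i = 1; i < N; i++)      /* loop 2: must stay sequential */
        a[i] = a[i - 1] + b[i];

    printf("%f\n", a[N - 1]);
    return 0;
}
```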

4.
A 3D iteration space visualizer (ISV) is presented to analyze the parallelism in loops and to find loop transformations which enhance the parallelism. Using automatic program instrumentation, the iteration space dependency graph (ISDG) is constructed, which shows the exact data dependencies of arbitrarily nested loops. Various graphical operations such as rotation, zooming, clipping, coloring and filtering permit a detailed examination of the dependence relations. Furthermore, an animated dataflow execution shows the maximal parallelism, and the parallel loops are indicated automatically by an embedded data dependence analysis. In addition, the user may discover and indicate additional parallelism, for which a suitable unimodular loop transformation is calculated and verified. The ISV has been applied to parallelize algorithmic kernel programs, a computational fluid dynamics (CFD) simulation code, the detection of statement-level parallelism and loop variable privatization. The applications show that the visualizer is a versatile and easy-to-use tool for the high-performance application programmer.

5.
This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.
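As a toy instance of such an eigen-analysis (the map below is my own illustrative assumption, not an example from the paper), suppose every iteration $i$ of a doubly nested loop depends on iteration $\sigma(i) = Ai$ with

\[
A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix},
\qquad
A \begin{pmatrix} 1 \\ 0 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ 0 \end{pmatrix},
\qquad
A \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.
\]

Dependence chains $i, Ai, A^2 i, \ldots$ move only along the eigenvalue-2 direction $(1,0)^T$ and leave the second coordinate fixed, so the iteration space decomposes into independent chains: the second axis can be distributed as a parallel region, while each chain along the first axis must execute sequentially.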

6.
A new technique for estimating and understanding the speed improvement that can result from executing a program on a parallel computer is described. The technique requires no additional programming and minimal effort by a program's author. The analysis begins by tracing a sequential program. A parallelism analyzer uses information from the trace to simulate parallel execution of the program. In addition to predicting parallel performance, the parallelism analyzer measures many aspects of a program's dynamic behavior. Measurements of six substantial programs are presented. These results indicate that the three symbolic programs differ substantially from the numeric programs and, as a consequence, cannot be automatically parallelized with the same compilation techniques.

7.
Jie Zhao  Rongcai Zhao  Rui Ding  Pinfeng Huang 《Journal of Software》2012,23(10):2695-2704
Traditional distributed-memory parallelizing compilers are mostly developed on top of shared-memory parallelizing compilers. The parallelism-recognition techniques of shared-memory systems are suited to OpenMP code generation and treat every loop nest with the same recognition method; carried over unchanged to a distributed-memory compiler, this inevitably fails to exploit a program's parallelism efficiently. A distributed-memory parallelizing compiler should instead classify loop nests by their structural characteristics and apply recognition techniques suited to MPI code generation. To address this problem, based on the structure of loop nests and the characteristics of MPI parallel programs, this paper proposes a new classification of loop nests, together with a corresponding parallelism-recognition technique for each class. Experimental results show that, compared with a distributed-memory parallelizing compiler using the traditional recognition techniques, a compiler that classifies loop nests as proposed and applies the corresponding techniques identifies parallel loops in the benchmark programs more effectively, and the automatically generated MPI parallel code improves its speedup by more than 20%.
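For a dependence-free loop, the MPI code such a compiler generates follows a common block-distribution pattern; the following is a hedged hand-written sketch of that pattern (not output from the system described in the paper):

```c
#include <mpi.h>
#include <stdio.h>
#define N 1024

/* Each rank executes one block of the original loop's iterations and
 * the partial results are combined with a reduction. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int lo = rank * N / size, hi = (rank + 1) * N / size;
    double local = 0.0, total = 0.0;
    for (int i = lo; i < hi; i++)   /* this rank's block of iterations */
        local += (double)i * i;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", total);
    MPI_Finalize();
    return 0;
}
```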

8.
This paper describes an implementation of Paralation Lisp for a massively parallel SIMD machine, the MasPar MP-1. The system is based on a byte-code interpreter which can emulate as many virtual processors on each physical processor as desired (within the limits of memory). The implementation makes it possible to activate more virtual processors once execution has begun, and this feature can be used to support nested parallelism. Nested parallelism describes the ability to nest data-parallel constructs, a feature of Paralation Lisp, CM Lisp, and NESL; however, the outer parallel forms usually have to be sequentialized, with only the innermost forms being executed in parallel. NESL and a subset of Paralation Lisp have been implemented to fully support nested parallelism by flattening nested structures at compile time. To do this the languages must impose various restrictions on both the data and control structures. There is an overhead associated with the runtime technique described here, but it is very versatile and can execute code in parallel that cannot be "flattened." Hence this technique can be used to effectively support many of the more difficult aspects of Paralation Lisp.
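The flattening idea mentioned above can be sketched in C (a minimal illustration under my own assumptions, not the byte-code interpreter from the paper): a nested parallel loop over a ragged structure becomes one flat loop over all elements, guided by a segment descriptor, so each element maps to its own virtual processor.

```c
#include <stdio.h>

int main(void) {
    int seg_start[3] = {0, 2, 5};   /* row i begins at seg_start[i] */
    int seg_len[3]   = {2, 3, 1};   /* ragged row lengths           */
    int data[6]      = {1, 2, 3, 4, 5, 6};

    /* flattened loop: conceptually one virtual processor per element */
    for (int k = 0; k < 6; k++)
        data[k] *= 10;

    /* the nested view is recovered through the segment descriptor */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < seg_len[i]; j++)
            printf("%d ", data[seg_start[i] + j]);
    printf("\n");
    return 0;
}
```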

9.
New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.

10.
Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
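For reference, here is the textbook form of unroll-and-jam on matrix multiply (a generic illustration, not the paper's generated code): the outer loop is unrolled by 2 and the copies are jammed together so that each loaded b[j][k] is reused twice, improving register locality.

```c
#include <stdio.h>
#define N 4   /* assumes N is even */

int main(void) {
    double a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }

    for (int i = 0; i < N; i += 2)          /* outer loop unrolled by 2 */
        for (int k = 0; k < N; k++) {
            double s0 = 0.0, s1 = 0.0;      /* accumulators in registers */
            for (int j = 0; j < N; j++) {   /* jammed inner loop         */
                s0 += a[i][j]     * b[j][k];
                s1 += a[i + 1][j] * b[j][k];   /* reuses b[j][k] */
            }
            c[i][k] = s0;
            c[i + 1][k] = s1;
        }
    printf("%f\n", c[N - 1][N - 1]);
    return 0;
}
```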

11.
This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture. We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden) through annotation of potential parallelism (GPH) to automatic detection of parallel skeletons in sequential code (PMLS). We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with little or no changes to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.

12.
Ming Hsiang Huang  Wuu Yang 《Software》2020,50(10):1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphic processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
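PFACC's directive syntax is not reproduced here; for contrast, the sketch below annotates an irregular nested loop with a standard OpenACC directive. This is exactly the kind of program (data-dependent inner trip counts) that motivates PFACC's runtime iteration sharing.

```c
#include <stdio.h>
#define N 8

/* The inner trip count depends on i, so the nest is irregular; a plain
 * OpenACC annotation parallelizes only the outer loop effectively. */
int main(void) {
    float x[N], y[N];
    int len[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 0.0f; len[i] = i; }

    #pragma acc parallel loop copy(y) copyin(x, len)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < len[i]; j++)   /* irregular inner loop */
            y[i] += x[j];

    printf("%f\n", y[N - 1]);
    return 0;
}
```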

13.
NestStep is a parallel programming language for the BSP (bulk-synchronous parallel) model of parallel computation. Extending the classical BSP model, NestStep supports dynamically nested parallelism by nesting of supersteps and a hierarchical processor group concept. Furthermore, NestStep adds a virtual shared memory realization in software, where memory consistency is relaxed to superstep boundaries. Distribution of shared arrays is also supported. A prototype for a subset of NestStep has been implemented with Java as the sequential base language. The prototype implementation is targeted to a set of Java Virtual Machines coupled by Java socket communication to form a virtual parallel computer.

14.
Multicore processors are widely used in high-performance computing, yet how to effectively convert traditional sequential programs into parallel code and reduce the time consumed by nested loops remains a challenging problem in this field. This paper first analyzes the dependence characteristics of nested loops based on the polyhedral model and performs loop tiling, from which coarse-grained parallel code is generated automatically. Targeting the structural features of a multicore array processor, a genetic algorithm is used to generate a communication-optimized sequence of tile tasks, on top of which an effective task scheduling model is built. Finally, the method is applied to LU decomposition; the results show that, compared with traditional scheduling algorithms, it achieves better data locality and load balance.
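The tiling step can be illustrated with a short C sketch (an assumed example; the paper's polyhedral dependence analysis and genetic-algorithm task ordering are not reproduced): the iteration space of a loop nest is cut into T×T tiles, and each tile becomes one coarse-grain task for the scheduler.

```c
#include <stdio.h>
#define N 8
#define T 4   /* tile size; assumes N % T == 0 */

int main(void) {
    double a[N][N];
    for (int ii = 0; ii < N; ii += T)       /* tile loops: each (ii, jj) */
        for (int jj = 0; jj < N; jj += T)   /* pair is one task          */
            for (int i = ii; i < ii + T; i++)
                for (int j = jj; j < jj + T; j++)
                    a[i][j] = (double)(i * N + j);
    printf("%f\n", a[N - 1][N - 1]);
    return 0;
}
```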

15.
Multicore computers are ubiquitous. Expert developers as well as developers with little experience in parallelism are now asked to create multithreaded software to exploit parallelism in mainstream shared-memory hardware. However, finding and fixing parallel programming errors is a complex and arduous task. Programmers thus rely on tools such as race detectors that typically focus on reporting errors due to incorrect usage of synchronization constructs or due to missing synchronization. This arsenal of debugging techniques, however, is incomplete. This article presents a new perspective and addresses a largely unexplored direction of defect localization, where incorrect usage of nonparallel programming constructs might cause incorrect parallel application behavior. In particular, we make a contribution by showing how to use data-mining techniques to locate defects in multithreaded shared-memory programs. Our technique analyzes execution anomalies in a condensed representation of the dynamic call graphs of a multithreaded object-oriented application and identifies methods that contain a defect. Compared with race detectors that concentrate on finding incorrect synchronization, our method is able to reveal a wider range of defects that affect the control flow of a parallel program. Results from controlled experiments show that our data-mining approach finds not only race conditions in different types of multicore applications but also other errors that cause incorrect parallel program behavior. Data-mining techniques offer fruitful new ground for parallel program debugging, and we also discuss long-term directions for this interesting field. Copyright © 2012 John Wiley & Sons, Ltd.

16.
In this paper, we show that the classical A.I. planning problem can be modelled using simple database constructs with logic-based semantics. The approach is similar to that used to model updates and nondeterminism in active database rules. We begin by showing that planning problems can be automatically converted to Datalog1S programs with nondeterministic choice constructs, for which we provide a formal semantics using the concept of stable models. The resulting programs are characterized by a syntactic structure (XY-stratification) that makes them amenable to efficient implementation using compilation and fixpoint computation techniques developed for deductive database systems. We first develop the approach for sequential plans, and then we illustrate its flexibility and expressiveness by formalizing a model for parallel plans, where several actions can be executed simultaneously. The characterization of parallel plans as partially ordered plans allows us to develop (parallel) versions of partially ordered plans that can often be executed faster than the original partially ordered plans.

17.
18.
Nested parallelism appears naturally in many applications. It is required whenever a function performing parallel statements needs to call a subroutine using parallelism. A particular case occurs when the function is recursive. Nested parallelism is as basic to parallel programming as nested loops are to sequential programming. Despite this, most existing parallel languages do not provide this feature. This paper presents a new methodology to expand message passing libraries (MPL) with nested parallelism. The tool to support the methodology has processor virtualization, load balancing, pipeline parallelism and collective operations, among other features. The computational results prove that the performance obtained is comparable to that obtained using classical message passing programs. Since the methodology does not force the programmer to leave the MPL environment, all the efficiency and portability of the MPL model is preserved. Copyright © 1999 John Wiley & Sons, Ltd.

19.
We study the problem of guaranteeing correct execution semantics in parallel implementations of logic programming languages in the presence of built-in constructs that are sensitive to order of execution. The declarative semantics of logic programming languages permit execution of various goals in any arbitrary order (including in parallel). However, goals corresponding to extra-logical built-in constructs should respect the sequential order of execution to ensure correct semantics. Ensuring this correctness in the presence of such built-in constructs, while efficiently exploiting maximum parallelism, is a difficult problem. In this paper, we propose a formalization of this problem in terms of operations on dynamic trees. This abstraction enables us to: (i) show that existing schemes to handle order-sensitive computations used in current parallel systems are sub-optimal; and (ii) develop a novel, optimal scheme to handle order-sensitive goals that requires only a constant time overhead per operation. While we present our results in the context of logic programming, they apply equally well to most parallel non-deterministic systems.

20.
Although the problem of increasing the speed of rule-based programs has been studied for a long while, so far the level of parallelism achieved under various parallel processing schemes fails to meet expectations of high performance gains. Most of the work has focused on manipulating existing rule programs to accommodate parallel architectures. However, without changing the inherently sequential algorithms encoded in rule form and without using intrinsic parallel rule languages to exploit parallelism, speedup is severely limited. In this paper we demonstrate that to maximize parallelism, manipulations must take place at the algorithmic level in addition to the program level. However, traditional rule languages are not equipped to easily express parallel algorithms. Thus, an inherently parallel rule language must be devised to enable maximum parallelism. We present our initial specification of such a language, named PARULEL. Preliminary performance results are detailed for one test case program. PARULEL is one part of a larger research effort that considers four key approaches to parallel rule processing: inherently parallel execution semantics; the use of redaction meta-rules to allow programmer control of the set of parallel instantiations; an incremental evaluation algorithm that incorporates, in an efficient manner, updates to the initial set of assumptions; and an algorithm that maps a program to a parallel architecture.
