Similar Documents
20 similar documents found.
1.
This paper describes the design of the Fortran 90D/HPF compiler, a source-to-source parallel compiler for distributed memory systems being developed at Syracuse University. Fortran 90D/HPF is a data parallel language with special directives to specify data alignment and distributions. A systematic methodology to process the distribution directives of Fortran 90D/HPF is presented. Furthermore, techniques for data and computation partitioning, communication detection and generation, and the run-time support for the compiler are discussed. Finally, initial performance results for the compiler are presented. We believe that the methodology for processing data distributions, the computation partitioning, the communication system design, and the overall compiler design can be used by the implementors of compilers for HPF.
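For reference, a minimal HPF-style fragment of the kind such a compiler processes (the array names, sizes, and processor arrangement below are illustrative, not taken from the paper): the ALIGN and DISTRIBUTE directives specify the data mapping, and the compiler derives the computation partition and the required communication from them.

      PROGRAM map_example
!HPF$ PROCESSORS P(4)
      REAL A(100), B(100)
!HPF$ ALIGN B(I) WITH A(I)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
      B = 1.0
      A = 0.0
      ! A data-parallel array statement; the compiler partitions it according to
      ! the mapping of A and B and generates the boundary communication it needs.
      A(2:99) = B(1:98) + B(3:100)
      END PROGRAM map_example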

2.
Arrays are mapped to processors through a two-step process (alignment followed by distribution) in data-parallel languages such as High Performance Fortran. This process of mapping creates disjoint pieces of the array that are locally owned by each processor. An HPF compiler that generates code for array statements must compute the sequence of local memory addresses accessed by each processor and the sequence of sends and receives for a given processor to access nonlocal data. In this paper, we present an approach to the address sequence generation problem using the theory of integer lattices. The set of elements referenced can be generated by integer linear combinations of basis vectors. Unlike other work on this problem, we derive closed form expressions for the basis vectors as a function of the mapping of data. Using these basis vectors and exploiting the fact that there is a repeating pattern in the access sequence, we derive highly optimized code that generates the pattern at runtime. The code generated uses table-lookup of the pattern. Experimental results show that our approach is faster than other solutions to this problem.
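As an illustration of the problem being solved (the distribution parameters and sizes here are invented): with a block-cyclic mapping and a strided array section, the local elements each processor touches form a non-contiguous but periodically repeating sequence, which is what the lattice-based tables enumerate at run time.

      PROGRAM addr_example
!HPF$ PROCESSORS P(4)
      REAL A(1000), B(1000)
!HPF$ DISTRIBUTE A(CYCLIC(3)) ONTO P
!HPF$ ALIGN B(I) WITH A(I)
      B = 2.0
      ! Strided section over a CYCLIC(3) array: each processor's local memory
      ! accesses follow a repeating pattern that can be tabulated once and reused.
      A(1:1000:7) = 3.0 * B(1:1000:7)
      END PROGRAM addr_example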

3.
In languages such as High Performance Fortran (HPF), array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, efficient enumeration of local index sets and communication sets is important. A method based on a virtual processor approach has been proposed for efficient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to first develop closed forms in terms of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an efficient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are first aligned with a template at an offset and a stride, and the template is then distributed among the processors using a regular data distribution. The introduction of a nonunit stride in the alignment creates “holes” in the distributed arrays, which leads to memory wastage. In this paper, using simple mathematical properties of regular sections, we extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. We develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings. Performance results on the Cray T3D are presented to demonstrate the efficacy of the extensions and identify various trade-offs associated with the proposed method.
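A small example of the two-level mapping discussed above (template size, stride, and offset are illustrative): the nonunit alignment stride leaves template cells, and hence local memory cells, with no array element mapped to them, which is the memory-wastage problem the extended closed forms address.

      PROGRAM two_level
!HPF$ PROCESSORS P(4)
      REAL A(100)
!HPF$ TEMPLATE T(400)
!HPF$ ALIGN A(I) WITH T(3*I+1)
!HPF$ DISTRIBUTE T(CYCLIC(4)) ONTO P
      ! Only every third template cell holds an element of A, so a naive local
      ! allocation sized by T wastes memory ("holes") on every processor.
      A = 0.0
      A(1:100:2) = 1.0
      END PROGRAM two_level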

4.
An increasing number of programming languages, such as Fortran 90, HPF, and APL, provide a rich set of intrinsic array functions and array expressions. These constructs, which constitute an important part of data parallel languages, provide excellent opportunities for compiler optimizations. The synthesis of consecutive array operations or array expressions into a composite access function of the source arrays at compile time has been shown (A. T. Budd, ACM Trans. Programm. Lang. Syst. 6 (July 1984), 297–313; G. H. Hwang et al., in “Proc. of ACM SIGPLAN Conference on Principles and Practice of Parallel Programming, 1995,” pp. 112–122) to be an effective scheme for optimizing programs on flat shared memory parallel architectures. It remains to be studied, however, how the synthesis scheme can be incorporated into the optimization of HPF-like programs on distributed memory machines by taking communication costs into account. In this paper, we propose solutions to address this issue. We first show how to perform array operation synthesis on HPF programs, and we demonstrate its performance benefits on distributed memory machines with real applications. In addition, to prevent a situation we call “synthesis anomaly,” we present an optimal solution to guide the array synthesis process on distributed memory machines. Since the underlying optimization problem is NP-hard, we further develop a practical strategy that compilers can use on distributed memory machines with HPF programs. Our synthesis engine is implemented as a Web-based tool, called Syntool, and experimental results show significant performance improvement over the base codes for HPF code fragments from real applications on parallel machines. Our experiments were performed on three distributed memory machines: an 8-node DEC Alpha Farm, a 16-node IBM SP-2, and a 16-node nCUBE/2.
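A toy instance of array operation synthesis (arrays and sizes invented; the FORALL shown is just one way a synthesizing compiler could realize the composite access function): two consecutive intrinsic calls, each of which would otherwise materialize a full intermediate array, collapse into a single pass over the source array.

      PROGRAM synth_example
      INTEGER, PARAMETER :: n = 100
      REAL A(n, n), B(n, n)
      INTEGER i, j
      A = 1.0
      ! Unsynthesized form: CSHIFT produces a temporary, TRANSPOSE another pass.
      B = TRANSPOSE(CSHIFT(A, SHIFT = 1, DIM = 1))
      ! Synthesized form: one composite access function, no intermediate array.
      FORALL (i = 1:n, j = 1:n)
         B(i, j) = A(MOD(j, n) + 1, i)
      END FORALL
      END PROGRAM synth_example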

5.
Debuggers play an important role in developing parallel applications. They are used to control the state of many processes, to present distributed information in a concise and clear way, to observe the execution behavior, and to detect and locate programming errors. More sophisticated debugging systems also try to improve understanding of global execution behavior and intricate details of a program. In this paper we describe the design and implementation of SPiDER, an interactive source-level debugging system for both regular and irregular High-Performance Fortran (HPF) programs. SPiDER combines a base debugging system for message-passing programs with a high-level debugger that interfaces with an HPF compiler. In addition to conventional debugging functionality, SPiDER allows a single process of a parallel program to be inspected or the entire program to be examined from a global point of view. A sophisticated visualization system has been developed and included in SPiDER to visualize data distributions, data-to-processor mapping relationships, and array values. SPiDER enables a programmer to dynamically change data distributions as well as array values. For arrays whose distribution can change during program execution, an animated replay displays the distribution sequence together with the associated source code location. Array values can be stored at individual execution points and compared against each other to examine execution behavior (e.g. the convergence behavior of a numerical algorithm). Finally, SPiDER offers limited support for evaluating the performance of parallel programs through a graphical load diagram. SPiDER has been fully implemented and is currently being used for the development of various real-world applications. Several experiments are presented that demonstrate the usefulness of SPiDER. Copyright © 2002 John Wiley & Sons, Ltd.

6.
CHAU-WEN TSENG. Software, 1997, 27(7): 763-796.
Fortran D is a version of Fortran enhanced with data decomposition specifications. Case studies illustrate strengths and weaknesses of the prototype Fortran D compiler when compiling linear algebra codes and whole programs. Statement groups, execution conditions, inter-loop communication optimizations, multi-reductions, and array kills for replicated arrays are identified as new compilation issues. On the Intel iPSC/860, the output of the prototype Fortran D compiler approaches the performance of hand-optimized code for parallel computations, but needs improvement for linear algebra and pipelined codes. The Fortran D compiler outperforms the CM Fortran compiler (2.1 beta) by a factor of four or more on the TMC CM-5 when not using vector units. Its performance is comparable to that of the DEC and IBM HPF compilers on an Alpha cluster and an SP-2. Better analysis, run-time support, and flexibility are required for the prototype compiler to be useful for a wider range of programs. © 1997 John Wiley & Sons, Ltd.
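For context, a Fortran D style fragment of the kind these case studies compile (a sketch following the published Fortran D directive forms; the prototype compiler's accepted syntax may differ in detail): arrays are aligned with a decomposition, which is then distributed.

      PROGRAM relax
      REAL X(1024), Y(1024)
      DECOMPOSITION D(1024)
      ALIGN X, Y WITH D
      DISTRIBUTE D(BLOCK)
      INTEGER i
      Y = 1.0
      DO i = 2, 1023
         X(i) = 0.5 * (Y(i-1) + Y(i+1))
      END DO
      END PROGRAM relax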

7.
Array syntax, which is supported in many technical programming languages, adds expressive power by allowing operations on and assignments to whole arrays and array sections. To compile an array assignment statement for a uniprocessor, the language processor must convert the statement into a loop that has the same meaning. This process is called scalarization. Scalarization presents a significant technical problem because an array assignment needs to be implemented as if all inputs are fetched before any outputs are stored. Since a loop intermixes loads and stores, the compiler typically allocates a temporary array to hold the intermediate result. Because these extra temporary arrays can cause performance problems in cache, many techniques have been developed to avoid their use or minimize their size. In this paper, we present a novel application of two compiler strategies, loop alignment and loop skewing, to address this problem. We show that these strategies can achieve the asymptotically minimal memory allocation for stencil computations. Our experiments with loop alignment and loop skewing demonstrate that they are extremely effective in improving the memory hierarchy performance of Fortran 90 array code on standard uniprocessors. The result should be applicable to other array languages, such as MATLAB.
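A one-dimensional illustration (sizes invented; this shows the effect such transformations aim for, not the authors' exact algorithm): the array assignment reads elements it also writes, so naive scalarization needs a full temporary array, while an aligned loop can carry the one old value it still needs in a scalar.

      PROGRAM scalarize
      INTEGER, PARAMETER :: n = 1000
      REAL A(n), old_left, new_val
      INTEGER i
      A = 1.0
      ! Array statement to scalarize (all reads logically precede all writes):
      !    A(2:n-1) = A(1:n-2) + A(3:n)
      ! Naive scalarization would compute into a temporary array T(2:n-1) and
      ! copy it back.  Aligning reads and writes leaves only a scalar carry:
      old_left = A(1)
      DO i = 2, n - 1
         new_val  = old_left + A(i+1)   ! A(i+1) has not been overwritten yet
         old_left = A(i)                ! save the old A(i) for the next iteration
         A(i)     = new_val
      END DO
      END PROGRAM scalarize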

8.
Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd.

9.
HPF (High Performance Fortran) is a parallel language based on data-partitioning specifications. How to determine a program's computation partition from its data partition is the first fundamental problem an HPF compiler must solve. This paper introduces the concepts of data partitioning and computation partitioning in HPF and, using a triply nested loop as an example, presents an intuitive algorithm for deriving the computation partition.
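To make the problem concrete (a generic owner-computes sketch, not the algorithm proposed in the paper): once the left-hand-side array of a nested loop is distributed, each processor should execute exactly the iterations that write the elements it owns.

      PROGRAM owner_computes
!HPF$ PROCESSORS P(4)
      INTEGER, PARAMETER :: n = 64
      REAL A(n, n, n), B(n, n, n)
!HPF$ DISTRIBUTE A(*, *, BLOCK) ONTO P
!HPF$ ALIGN B(i, j, k) WITH A(i, j, k)
      INTEGER i, j, k
      B = 1.0
      ! Under the owner-computes rule, the k loop is partitioned so that each
      ! processor iterates only over its own BLOCK of the third dimension.
      DO k = 1, n
         DO j = 1, n
            DO i = 1, n
               A(i, j, k) = 2.0 * B(i, j, k)
            END DO
         END DO
      END DO
      END PROGRAM owner_computes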

10.
In compiling applications for distributed memory machines, runtime analysis is required when the data to be communicated cannot be determined at compile time. One such class of applications requiring runtime analysis is block-structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the required runtime analysis. The library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distributions, loop bounds, and/or strides are unknown at compile time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and that the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.
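A minimal sketch of why run-time analysis is needed (block sizes and variable names are invented): the coupling between two block meshes is read from input, so the loop bounds and offsets, and therefore the communication pattern, are unknown at compile time and must be computed by the runtime library.

      PROGRAM multiblock_iface
!HPF$ PROCESSORS P(4)
      INTEGER, PARAMETER :: n1 = 1000, n2 = 800
      REAL U1(n1), U2(n2)
!HPF$ DISTRIBUTE U1(BLOCK) ONTO P
!HPF$ DISTRIBUTE U2(BLOCK) ONTO P
      INTEGER lo1, lo2, nifc, i
      U1 = 1.0
      READ (*, *) lo1, lo2, nifc      ! interface extent known only at run time
      DO i = 0, nifc - 1
         U2(lo2 + i) = U1(lo1 + i)    ! communication schedule built at run time
      END DO
      END PROGRAM multiblock_iface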

11.
Fortran 90 provides a rich set of array intrinsic functions. Each of these array intrinsic functions operates on the elements of multi-dimensional array objects concurrently. They provide a rich source of parallelism and play an increasingly important role in automatic support of data parallel programming. However, there is no such support when these intrinsic functions are applied to sparse data sets. In this paper, we address this open gap by presenting an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method provides both compression schemes and distribution schemes for distributed memory environments, applicable to higher-dimensional sparse arrays. This way, programmers need not worry about low-level system details when developing sparse applications. Sparse programs can be expressed concisely using array expressions and parallelized with the help of our library. Our sparse libraries are built for the array intrinsics of Fortran 90, and they include an extensive set of array operations such as CSHIFT, EOSHIFT, MATMUL, MERGE, PACK, SUM, RESHAPE, SPREAD, TRANSPOSE, UNPACK, and section moves. To the best of our knowledge, this is the first work to provide sparse and parallel sparse support for the array intrinsics of Fortran 90. In addition, we provide a complete complexity analysis of our sparse implementation. The complexity of our algorithms is proportional to the number of nonzero elements in the arrays, which is consistent with the conventional design criteria for sparse algorithms and data structures. Our current testbed is an IBM SP2 workstation cluster. Preliminary experimental results with numerical routines, numerical applications, and data-intensive applications related to OLAP (on-line analytical processing) show that our approach is promising for speeding up sparse matrix computations on both sequential and distributed memory environments when the programs are expressed with Fortran 90 array expressions.
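The programs such a library targets are written with ordinary Fortran 90 array intrinsics, as in the dense-notation fragment below (the library's own sparse declarations and invocation details are not shown here); the library's job is to execute such operations over compressed representations in time proportional to the nonzeros.

      PROGRAM intrinsics_example
      INTEGER, PARAMETER :: n = 500
      REAL A(n, n), B(n, n), C(n, n), nz(n*n), total
      INTEGER m
      A = 0.0
      B = 0.0
      A(1, 2) = 3.0
      B(2, 2) = 4.0                        ! mostly-zero operands
      C = MATMUL(A, B)                     ! product of two sparse matrices
      total = SUM(C)                       ! reduction over a sparse result
      m = COUNT(C /= 0.0)
      nz(1:m) = PACK(C, MASK = C /= 0.0)   ! gather the nonzero elements
      C = CSHIFT(C, SHIFT = 1, DIM = 2)    ! circular shift along dimension 2
      PRINT *, total, m
      END PROGRAM intrinsics_example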

12.
Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overhead. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays.
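The index algebra underlying the virtual-block view can be written down directly (a standard block-cyclic mapping sketch, with the block size and processor count chosen arbitrarily): global element i belongs to virtual block (i-1)/b, which is mapped cyclically to a physical processor, and its local address follows from the local block number and the offset within the block.

      PROGRAM bc_mapping
      INTEGER, PARAMETER :: b = 3, p = 4   ! CYCLIC(3) over 4 processors
      INTEGER i, v, owner, lblk, loff, laddr
      DO i = 1, 24
         v     = (i - 1) / b               ! virtual processor (virtual-block view)
         owner = MOD(v, p)                 ! physical processor owning element i
         lblk  = v / p                     ! local block number on that processor
         loff  = MOD(i - 1, b)             ! offset within the block
         laddr = lblk * b + loff           ! local (0-based) memory address
         PRINT '(5I6)', i, owner, laddr, v, loff
      END DO
      END PROGRAM bc_mapping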

13.
Parallel Computation Paradigms Supported by the p-HPF Compiler in a Cluster Environment
p-HPF is a parallel compilation system we have developed that conforms to the HPF (High Performance Fortran) specification; implementing multiple paradigms of parallel computation on top of HPF is the basis for developing large parallel application systems. This paper first discusses the parallel execution paradigms in a cluster environment, including the farm-parallel paradigm, pipeline parallelism, pipelined loop parallelism, data parallelism, and combined data parallelism, and analyzes their performance abstractly. It then shows how several typical parallel paradigms can be implemented using p-HPF's extrinsic-procedure mechanism, its task-parallel mechanism, and typical parallel constructs such as FORALL and INDEPENDENT DO, gives example programs, runs them, and analyzes the results.
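As a generic illustration of two of these paradigms (the fragment below is not from the paper): data parallelism over a distributed array can be written with FORALL, and loop parallelism with the INDEPENDENT directive; the farm and task paradigms additionally rely on the extrinsic-procedure and task-parallel mechanisms mentioned above.

      PROGRAM paradigms
!HPF$ PROCESSORS P(8)
      INTEGER, PARAMETER :: n = 1024
      REAL A(n), B(n)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ ALIGN B(i) WITH A(i)
      INTEGER i
      B = 1.0
      ! Data parallelism: an elementwise FORALL over the distributed arrays.
      FORALL (i = 1:n) A(i) = 2.0 * B(i)
      ! Loop parallelism: the directive asserts the iterations are independent.
!HPF$ INDEPENDENT
      DO i = 2, n - 1
         B(i) = 0.5 * (A(i-1) + A(i+1))
      END DO
      END PROGRAM paradigms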

14.
Data parallel languages, like High Performance Fortran (HPF), support the notion of distributed arrays. However, the implementation of such distributed array structures and their access on message passing computers is not straightforward. This holds especially for distributed arrays that are aligned to each other and given a block-cyclic distribution. In this paper, an implementation framework is presented for HPF distributed arrays on message passing computers. Methods are presented for efficient (in space and time) local index enumeration, local storage, and communication. Techniques for local set enumeration provide the basis for constructing local iteration sets and communication sets. It is shown that both local set enumeration and local storage schemes can be derived from the same equation. Local set enumeration and local storage schemes are shown to be orthogonal, i.e., they can be freely combined. Moreover, for linear access sequences generated by our enumeration methods, the local address calculations can be moved out of the enumeration loop, yielding efficient local memory address generation. The local set enumeration methods are implemented by using a relatively simple general transformation rule for absorbing ownership tests. This transformation rule can be repeatedly applied to absorb multiple ownership tests. Performance figures are presented for local iteration overhead, a simple communication pattern, and storage efficiency.
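The ownership-test absorption rule can be pictured on a one-dimensional BLOCK distribution (the block size and processor number below are invented, and the sketch simulates a single node): the naive node program guards every global iteration with an ownership test, and absorbing the test into the loop bounds leaves the processor enumerating only its local iterations.

      PROGRAM absorb_test
      INTEGER, PARAMETER :: n = 100, blk = 25   ! BLOCK distribution, 25 elements/node
      INTEGER, PARAMETER :: myid = 2            ! this node's processor number
      REAL a_local(blk)
      INTEGER i
      ! Naive node program: scan all iterations, guard each with an ownership test.
      DO i = 1, n
         IF ((i - 1) / blk == myid) a_local(i - myid*blk) = REAL(i)
      END DO
      ! After absorbing the ownership test into the loop bounds:
      DO i = myid*blk + 1, MIN((myid + 1)*blk, n)
         a_local(i - myid*blk) = REAL(i)
      END DO
      PRINT *, a_local(1), a_local(blk)
      END PROGRAM absorb_test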

15.
Fork95 is an imperative parallel programming language intended to express algorithms for synchronous shared memory machines (PRAMs). It is based on ANSI C and offers additional constructs to hierarchically divide processor groups into subgroups and manage shared and private address subspaces. Fork95 makes the assembly-level synchronicity of the underlying hardware available to the programmer at the language level. Nevertheless, it supports locally asynchronous computation where desired by the programmer. We present a one-pass compiler, fcc, which compiles Fork95 and C programs to the SB-PRAM machine. The SB-PRAM is a lock-step synchronous, massively parallel multiprocessor currently being built at Saarbrücken University, with a physically shared memory and uniform memory access time. We examine three important types of parallel computation frequently used for the parallel solution of real-world problems. While farming and parallel divide-and-conquer are directly supported by Fork95 language constructs, pipelining can be easily expressed using existing language features; an additional language construct for pipelining is not required.

16.
A critical performance issue for a number of scientific and engineering applications is the efficient transfer of data to secondary storage. Languages such as High Performance Fortran (HPF) have been introduced to allow programming distributed-memory systems at a relatively high level of abstraction. However, the present version of HPF does not provide appropriate constructs for controlling the parallel I/O capabilities of these systems. In this paper, constructs to specify parallel I/O operations on multidimensional arrays in the context of HPF are proposed. The paper also presents implementation concepts that are based on the HPF compiler VFC and the parallel I/O run-time system Panda. Experimental performance results are discussed in the context of financial management and traffic simulation applications.

17.
High Performance Fortran (HPF) is a data-parallel language that provides a high-level interface for programming scientific applications, while delegating to the compiler the task of generating explicitly parallel message-passing programs. This paper provides an overview of HPF compilation and runtime technology for distributed-memory architectures, and deals with a number of topics in some detail. In particular, we discuss distribution and alignment processing, the basic compilation scheme, and methods for the optimization of regular computations. A separate section is devoted to the transformation and optimization of independent loops with irregular data accesses. The paper concludes with a discussion of research issues and outlines potential future development paths of the language.

18.
Unstructured mesh generation exposes highly irregular computation patterns, which imposes a challenge in implementing triangulation algorithms on parallel machines. This paper reports on an efficient parallel implementation of near Delaunay triangulation with High Performance Fortran (HPF). Our algorithm exploits embarrassing parallelism by performing sub-block triangulation and boundary merge independently at the same time. The sub-block triangulation is a divide & conquer Delaunay algorithm known for its sequential efficiency, and the boundary triangulation is an incremental construction algorithm with low overhead. Compared with prior work, our parallelization method is both simple and efficient. In the paper, we also describe a solution to the collinear points problem that usually arises in large data sets. Our experiences with the HPF implementation show that with careful control of the data distribution, we are able to parallelize the program using HPF's standard directives and extrinsic procedures. Experimental results on several parallel platforms, including an IBM SP2 and a DEC Alpha farm, show that a parallel efficiency of 42–86% can be achieved for an eight-node distributed memory system. We also compare the efficiency of the HPF implementation with that of a similarly hand-coded MPI implementation. Copyright © 2004 John Wiley & Sons, Ltd.
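The sub-block step maps naturally onto HPF's extrinsic-procedure mechanism, sketched generically below (this is not the authors' code; the local routine body is a placeholder for the sequential divide-and-conquer triangulation of each processor's block of points).

      EXTRINSIC(HPF_LOCAL) SUBROUTINE local_work(x)
      REAL, DIMENSION(:) :: x            ! the local section of the distributed array
      x = 2.0 * x                        ! placeholder for the per-block triangulation
      END SUBROUTINE local_work

      PROGRAM extrinsic_example
      INTERFACE
         EXTRINSIC(HPF_LOCAL) SUBROUTINE local_work(x)
         REAL, DIMENSION(:) :: x
         END SUBROUTINE local_work
      END INTERFACE
!HPF$ PROCESSORS P(8)
      REAL pts(1024)
!HPF$ DISTRIBUTE pts(BLOCK) ONTO P
      pts = 1.0
      CALL local_work(pts)               ! each processor works on its own block
      END PROGRAM extrinsic_example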

19.
Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. The paper also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization.
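The shape of the transformation can be sketched as follows (a structural illustration only: the communication is indicated by comments rather than actual message-passing calls, and the paper's algorithm chooses the loop to tile automatically): tiling the j loop creates points at which the exchange needed by the next tile can be issued while the current tile is computed.

      PROGRAM overlap_sketch
      INTEGER, PARAMETER :: n = 1024, tile = 128
      REAL A(n, n), B(n, n)
      INTEGER jt, i, j
      B = 1.0
      ! (issue the nonblocking boundary exchange needed by the first tile here)
      DO jt = 1, n, tile                        ! interchanged and tiled outer loop
         ! (issue the exchange for tile jt+tile; wait for the one covering tile jt)
         DO j = jt, MIN(jt + tile - 1, n)
            DO i = 2, n - 1
               A(i, j) = 0.25 * (B(i-1, j) + B(i+1, j) + 2.0 * B(i, j))
            END DO
         END DO
      END DO
      END PROGRAM overlap_sketch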

20.
Recently, AMT has issued an extended version of Fortran-Plus [1] which allows software to be developed without the developer needing to take explicit account of the grid size of the target processor. Fortran-Plus and its extension, Fortran-Plus Enhanced [2], have been developed for use on the AMT DAP 510 array processor. This machine has 1024 processors arranged in a square grid with nearest neighbour and wraparound connections. It is interesting to enquire whether the performance of code generated by the Fortran-Plus Enhanced compiler is, for a particular application, superior to that generated by the Fortran-Plus compiler from a program which recognises and is tailored to fit the characteristic features of the DAP 510. In this paper the performances of two implementations of an algorithm for the eigensolution of real tridiagonal symmetric matrices are compared. The algorithm is characterised by its heavy use of matrix operations, all of which can be efficiently implemented on an array processor. Some of the constituent operations commonly occur in other applications while others are specific to the problem being addressed.
