期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Data management and control-flow aspects of an SIMD/SPMD parallellanguage/compiler

Nichols M.A. Siegel H.J. Dietz H.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(2):222-234

Features of an explicitly parallel programming language targeted for reconfigurable parallel processing systems, where the machine's N processing elements (PEs) are capable of operating in both the SIMD and SPMD modes of parallelism, are described. The SPMD (single program-multiple data) mode of parallelism is a subset of the MIMD mode where all processors execute the same program. By providing all aspects of the language with an SIMD mode version and an SPMD mode version that are syntactically and semantically equivalent, the language facilitates experimentation with and exploitation of hybrid SIMD/SPMD machines. Language constructs (and their implementations) for data management, data-dependent control-flow, and PE-address-dependent control-flow are presented. These constructs are based on experience gained from programming a parallel machine prototype and are being incorporated into a compiler under development. Much of the research presented is applicable to general SIMD machines and MIMD machines 相似文献

2.

一个数据并行语言的设计及其实现

陈斯愈黄林鹏《计算机工程》1997,23(3):3-6

数据并行模型应用到ＭＩＭＤ机器上，实现ＳＰＭＤ模式的松散同步的方式越来越受到人们的重视。文中提出了一个以屏构并行系统为环境的数据并行语言Ｍｕｌｔｉ－ｃ的设计和实现。正在实现的Ｍｕｌｉｔｉ－ｃ编译器，以预编译的方式接受ＳＩＭＤ形式的程序说明，放宽同步要求，产生能以ＳＰＭＫ方式在并行系统上运行的Ｃ程序。相似文献

3.

Eliminating memory for fragmentation within partitionable SIMD/SPMDmachines

Nichols M.A. Siegel H.J. Dietz H.G. Quong R.W. Nation W.G. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(3):290-303

Efficient data layout is an important aspect of the compilation process. A model for the creation of perfect memory maps for large-scale parallel machines capable of user-controlled partitionable single-instruction-multiple data/single-program-multiple data (SIMD/SPMD) operation is developed. The term perfect implies that no memory fragmentation occurs and ensures that the memory map size is kept to a minimum. A major constraint on solving this problem is based on the single program nature of both the SIMD and SPMD modes of parallelism. It is assumed that all processors within the same submachine used identical addresses to access corresponding data items in each of their local memories. Necessary and sufficient conditions are derived for being able to create perfect memory maps, and results are applied to several partitionable interconnection networks 相似文献

4.

Data-parallel programming on a network of heterogeneous workstations

Nenad Nedeljkovlc Michael J. Quinn 《Concurrency and Computation》1993,5(4):257-268

We describe a compiler and run-time system that allow data-parallel programs to execute on a network of heterogeneous UNIX workstations. The programming language supported is Dataparallel C, a SIMD language with virtual processors and a global name space. This parallel programming environment allows the user to take advantage of the power of multiple workstations without adding any message-passing calls to the source program. Because the performance of Individual workstations in a multi-user environment may change during the execution of a Dataparallel C program, the run-time system automatically performs dynamic load balancing. We present experimental results that demonstrate the usefulness of dynamic load-balancing In a multi-user environment These results suggest that initially allocating the same amount of work to each processor and letting the dynamic load balancing algorithm adjust the load during program execution yields very good performance. Hence neither the compiler nor the run-time system need a priori knowledge of the speeds of the machines that will participate in a program execution. 相似文献

5.

Parallel Implementations of Block-Based Motion Vector Estimation for Video Compression on Four Parallel Processing Systems

Min Tan Janet M. Siegel Howard Jay Siegel 《International journal of parallel programming》1999,27(3):195-225

Parallel algorithms, based on a distributed memory machine model, for an exhaustive search technique for motion vector estimation in video compression are being designed and evaluated. Results from the execution on a 16,384 processor MasPar MP-1 (an SIMD machine), a 140 node Intel Paragon XP/S and a 16 node IBM SP2 (two M IMD machines), and the 16 processor PASM prototype (a partitionable SIMD/MIMD mixed-mode machine) are presented. The trade-offs of using different modes of parallelism (SIMD, SPMD, and mixed-mode) and different data partitioning schemes (the rectangular and stripe subimage methods) are examined. The analytical and experimental results shown in this application study will help practitioners to predict and contrast the performance of different approaches to parallel implementation of this important video compression technique. The results presented are also applicable to a large class of image and video processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping a set of independent application tasks, or the subtasks of a single application task, onto a heterogeneous suite of parallel machines. 相似文献

6.

Compilation Techniques for High Level Parallel Code

Benedict R. Gaster Tim Bainbridge David Lacey David Gardner 《International journal of parallel programming》2010,38(1):4-18

This paper describes methods to adapt existing optimizing compilers for sequential languages to produce code for parallel processors. In particular it looks at targeting data-parallel processors using SIMD (single instruction multiple data) or vector processors where users need features similar to high-level control flow across the data-parallelism. The premise of the paper is that we do not want to write an optimizing compiler from scratch. Rather, a method is described that allows a developer to take an existing compiler for a sequential language and modify it to handle SIMD extensions. As well as modifying the front-end, the intermediate representation and the code generation to handle the parallelism, specific optimizations are described to target the architecture efficiently. 相似文献

7.

一类非规则并行应用问题的通信集生成算法

胡长军李静王珏姚广利李永红丁良李建江《计算机学报》2008,31(1):120-126

非规则计算是大规模并行应用中普遍存在和影响效率的关键问题.在基于分布式内存的数据并行范例中,如何针对非规则数组引用,有效地生成本地内存访问序列和通信集,是并行编译生成SPMD结点程序所必须解决的重要问题.文中针对两重嵌套循环中,下一层循环边界是上一层循环变量的线性或非线性函数,数组下标是两层循环变量的非线性函数这样一类包含非规则数组引用的并行应用问题,提出了一种在编译时生成通信集的代数算法.并且针对cyclic(k)数据分布和线性对齐模板,借助整数格概念,给出了编译时全局地址和本地地址之间的转换方法.文中还给出了相应的经过通信优化的SPMD结点程序.最后通过实例验证了算法的正确性.该算法的意义在于避免了传统Inspector/Executor非规则计算模型中的Inspector阶段,从而节省了运行时Inspector阶段通过穷举下标生成通信集的巨大开销. 相似文献

8.

A Data Parallel Scientific Modeling Language

《Journal of Parallel and Distributed Computing》1994,21(1):46-60

The data parallel meta language (DPML) and its associated Fortran source code rewriter (DP77) support architecture independent, high performance climate and weather prediction models. The language allows the data domains over which a program operates, the communication patterns required between elements of those data domains, and some or all of the calculations of a program to be expressed at a very high level. DPML uses explicit data parallelism to express the inherent parallelism of the models, with the result that programs are easily compilable into target machine code. DP77 uses information from the DPML program to translate Fortran routines into the host specific Fortran form required for their parallel execution within the model. This paper describes the general strategy behind the development of DPML, discusses its language features using examples drawn from climate modelling, and provides details of the mechanism it uses for incorporating Fortran into data parallel programs. Encouraging results are reported for DPML versions of the standard weather benchmark models executing on vector, SIMD, and MIMD (shared memory) machines. While the paper is set within the framework of climate modelling, the technique has obvious wider implications. 相似文献

9.

A Comparison of Implicitly Parallel Multithreaded and Data-Parallel Implementations of an Ocean Model

《Journal of Parallel and Distributed Computing》1998,48(1):1-51

Two parallel implementations of a state-of-the-art ocean model are described and analyzed: one is written in the implicitly parallel language Id for the Monsoon multithreaded dataflow architecture, and the other in data-parallel CM Fortran for the CM-5. The multithreaded programming model is inherently more expressive than the data-parallel model but is not especially adapted to regular data structures common to many scientific codes. One goal of this study is to understand what, if any, are the performance penalties of multithreaded execution when implementing a program that is well suited for data-parallel execution. To avoid technology and machine configuration issues, the two implementations are compared in terms of overhead cycles perrequiredfloating point operation. When flows in complex geometries typical of ocean basins are simulated, the data-parallel model only remains efficient if redundant computations are performed over land. The generality of the Id programming model, however, allows one to easily and transparently implement a parallel code that computes only in the ocean. When ocean basins with complex and irregular geometry are simulated the normalized performance on Monsoon is comparable with that of the CM-5. For more regular geometries that map well to the computational domain, the data-parallel approach proves to be a better match. We conclude by examining the extent to which clusters of mainstream symmetric multiprocessor (SMP) systems offer a scientific computing environment which can capitalize on and combine the strengths of the two paradigms. 相似文献

10.

Pipelined data parallel algorithms-II: design

King C.-T. Chou W.-H. Ni L.M. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(4):486-499

For pt.I see ibid., p.470-85. A methodology for designing pipelined data-parallel algorithms on multicomputers is studied. The design procedure starts with a sequential algorithm which can be expressed as a nested loop with constant loop-carried dependencies. The procedure's main focus is on partitioning the loop by grouping related iterations together. Grouping is necessary to balance the communication overhead with the available parallelism and to produce pipelined execution patterns, which result in pipelined data-parallel computations. The grouping should satisfy dependence relationships among the iterations and also allow the granularity to be controlled. Various properties of grouping are studied, and methods for generating communication-efficient grouping are given. Given a grouping and an assignment of the groups to the processors, an analytic model is combined with the grouping results to describe the behavior and to estimate the performance of the resultant parallel program. Expressions characterizing the performance are derived 相似文献

11.

Performance Measures for Evaluating Algorithms for SIMD Machines

《IEEE transactions on pattern analysis and machine intelligence》1982,(4):319-331

This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors. 相似文献

12.

SPMT:一种支持混合型问题求解的并行执行方式

张宏莉胡铭曾李东《小型微型计算机系统》2000,21(3):225-227

并行执行方式是影响并行编译器效率的关键因素之一。本文首先介绍两种典型的并行执行方式：支持数据并行语言的ＳＰＭＤ方式和支持任务并行语言的ＭＰＭＤ方式,然后,分析这两种并行执行方式在实现多范式并行语言时所存在的问题。最后,提出一种并的并行执行方式：ＳＰＭＴ。相似文献

13.

PARULEL: Parallel rule processing using meta-rules for redaction

Salvatore J. Stolfo Ouri Wolfson Philip K. Chan Hasanat M. Dewan Leland Woodbury Jason S. Glazier David A. Ohsie 《Journal of Parallel and Distributed Computing》1991,13(4)

Although the problem of increasing the speed of rule-based programs has been studied for a long while, so far the level of parallelism achieved under various parallel processing schemes fails to meet expectations of high performance gains. Most of the work has focused on manipulating existing rule programs to accommodate parallel architectures. However, without changing the inherently sequential algorithms encoded in rule form and without using intrinsic parallel rule languages to exploit parallelism, speedup is severely limited. In this paper we demonstrate that to maximize parallelism, manipulations must take place at the algorithmic level in addition to the program level. However, traditional rule languages are not equipped to easily express parallel algorithms. Thus, an inherently parallel rule language must be devised to enable maximum parallelism. We present our initial specification of such a language, named PARULEL. Preliminary performance results are detailed for one test case program. PARULEL is one part of a larger research effort that considers four key approaches to parallel rule processing: inherently parallel execution semantics; the use of redaction meta-rules to allow programmer control of the set of parallel instantiations; an incremental evaluation algorithm that incorporates, in an efficient manner, updates to the initial set of assumptions; and an algorithm that maps a program to a parallel architecture. 相似文献

14.

Guaranteeing good memory bounds for parallel programs

Burton F.W. 《IEEE transactions on pattern analysis and machine intelligence》1996,22(10):762-773

The amount of memory required by a parallel program may be spectacularly larger than the memory required by an equivalent sequential program, particularly for programs that use recursion extensively. Since most parallel programs are nondeterministic in behavior, even when computing a deterministic result, parallel memory requirements may vary from run to run, even with the same data. Hence, parallel memory requirements may be both large (relative to memory requirements of an equivalent sequential program) and unpredictable. Assume that each parallel program has an underlying sequential execution order that may be used as a basis for predicting parallel memory requirements. We propose a simple restriction that is sufficient to ensure that any program that will run in n units of memory sequentially can run in mn units of memory on m processors, using a scheduling algorithm that is always within a factor of two of being optimal with respect to time. Any program can be transformed into one that satisfies the restriction, but some potential parallelism may be lost in the transformation. Alternatively, it is possible to define a parallel programming language in which only programs satisfying the restriction can be written 相似文献

15.

Data-parallel programming on multicomputers

Quinn M.J. Hatcher P.J. 《Software, IEEE》1990,7(5):69-76

The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported. They show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure. The C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed 相似文献

16.

An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

Vydyanathan Naga Krishnamoorthy Sriram Sabin Gerald M. Catalyurek Umit V. Kurc Tahsin Sadayappan Ponnuswamy Saltz Joel H. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(8):1158-1172

Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the intertask data communication volumes. A locality-conscious scheduling strategy is used to improve intertask data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications and synthetic graphs shows that our algorithm consistently generates schedules with a lower makespan as compared to Critical Path Reduction (CPR) and Critical Path and Allocation (CPA), two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches. 相似文献

17.

The BLAZE language: A parallel language for scientific programming

Piyush Mehrotra John Van Rosendale 《Parallel Computing》1987,5(3):339-361

Programming multiprocessor parallel architectures is a complex task. This paper describes a block-structured scientific programming language, BLAZE, designed to simplify this task. BLAZE contains array arithmetic, ‘forall’ loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.

A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture, while neglecting the remainder. This paper describes the features of BLAZE, and show how this language would be used in typical scientific programming. 相似文献

18.

Control structures for data-parallel SIMD languages: semantics and implementation 总被引：1，自引：0，他引：1

Luc Boug Jean-Luc Levaire 《Future Generation Computer Systems》1992,8(4):363-378

We define a simple language which encapsulates the main concepts of SIMD data-parallel programming, and we give its operational semantics. This language includes a unique data-parallel control structure called multitype conditioning and escape. We show that it suffices to express all data-parallel extensions of usual scalar control structures of C, as found in C^*, MPL, POMPC, etc. Moreover, we give a formal correctness proof for two different implementations of this new statement, respectively by a single context stack, and by a set of counters. Thus, this simple language appears as an interesting basis to study data-parallel SIMD programming methodology. 相似文献

19.

Tlib—a library to support programming with hierarchical multi-processor tasks

《Journal of Parallel and Distributed Computing》2005,65(3):347-360

The paper considers the modular programming with hierarchically structured multi-processor tasks on top of SPMD tasks for distributed memory machines. The parallel execution requires a corresponding decomposition of the set of processors into a hierarchical group structure onto which the tasks are mapped. The result is a multi-level group SPMD computation model with varying processor group structures. The advantage of this kind of mixed task and data parallelism is a potential to reduce the communication overhead and to increase scalability. We present a runtime library to support the coordination of hierarchically structured multi-processor tasks. The library exploits an extended parallel group SPMD programming model and manages the entire task execution including the dynamic hierarchy of processor groups. The library is built on top of MPI, has an easy-to-use interface, and leads to only a marginal overhead while allowing static planning and dynamic restructuring. 相似文献

20.

A Heterogeneous Mixed-Mode Execution Model for Massively Parallel Systems

《Journal of Parallel and Distributed Computing》1999,56(1):2-16

In this paper, we consider a massively parallel system that is composed of heterogeneous processors, that is, processors with different processing power, and that combines the advantages of the SIMD and MIMD architectures. The heterogeneous mixed-mode (HeMM) execution model is composed of two main components, which operate in the well-known SIMD and MIMD paradigms. The main computing power comes from a component that is composed of a massive number of processors and operates in a data parallel manner. The other component is composed of a few (or even one) fast processors which operate in the MIMD paradigm. The operation of a small number of processors in an MIMD paradigm has been well demonstrated through actual systems. The processors in this component add flexibility to the execution of the parallel programs such that it adjusts to the changing parallelism of the program to enhance the performance. Based on this execution model we analyze the gains in performance that is obtainable by this new system. We show that substantial performance gains can be obtained by using the HeMM system. 相似文献