首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem  相似文献   

2.
The paper presents a parallel programming methodology that ensures easy programming, efficiency and portability of programs to different machines belonging to the class of the general-purpose, distributed-memory, MIMD architectures. The methodology is based on the definition of a new, high-level, explicitly parallel language, called P3 L, and of a set of static tools that automatically adapt the program features for each target architecture. P3 L does not require programmers to specify process activations, the actual parallelism degree, scheduling, or interprocess communications, i.e. all those features that need to be adjusted to harness each specific target machine. Parallelism is, on the other hand, expressed in a structured and qualitative way, by hierarchical composition of a restricted set of language constructs, corresponding to those forms of parallelism that are frequently encountered in parallel applications, and that can be efficiently implemented. The efficient portability of P3 L applications is guaranteed by the compiler along with the novel structure of the support. The compiler automatically adapts the program features for each specific architecture, using the costs (in terms of performance) of the low-level mechanisms exported by the architecture itself. In our methodology, these costs, along with other features of the architecture, are viewed through an abstract machine, whose interface is used by the compiler to produce the final object code.  相似文献   

3.
Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.  相似文献   

4.
Queue processors are a viable alternative for high performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor implicitly reference their operands making the programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in level-order manner to expose natural parallelism and calculating instructions relative offset values to access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code in terms of parallelism and code size when compared to code produced by a traditional compiler for a RISC processor.  相似文献   

5.
Features of an explicitly parallel programming language targeted for reconfigurable parallel processing systems, where the machine's N processing elements (PEs) are capable of operating in both the SIMD and SPMD modes of parallelism, are described. The SPMD (single program-multiple data) mode of parallelism is a subset of the MIMD mode where all processors execute the same program. By providing all aspects of the language with an SIMD mode version and an SPMD mode version that are syntactically and semantically equivalent, the language facilitates experimentation with and exploitation of hybrid SIMD/SPMD machines. Language constructs (and their implementations) for data management, data-dependent control-flow, and PE-address-dependent control-flow are presented. These constructs are based on experience gained from programming a parallel machine prototype and are being incorporated into a compiler under development. Much of the research presented is applicable to general SIMD machines and MIMD machines  相似文献   

6.
This paper presents the results of parallelizing a three-dimensional Navier-Stokes solver on a 32K-processor Thinking Machines CM-2, a 128-node Intel iPSC/860, and an 8-processor CRAY Y-MP. The main objective of this work is to study the performance of the flow solver, INS3D-LU code, on two distributed-memory machines, a massively parallel SIMD machine (CM-2) and a moderately parallel MIMD machine (iPSC/860), and compare it with its performance on a shared-memory MIMD machine with a small number of processors (Y-MP). The code is based on a Lower-Upper Symmetric-Gauss-Seidel implicit scheme for the pseudocompressibility formulation of the three-dimensional incompressible Navier-Stokes equations. The code was rewritten in CMFORTRAN with shift operations and run on the CM-2 using the slicewise model. The code was also rewritten with distributed data and Intel message-passing calls and run on the iPSC/860. The timing results for two grid sizes are presented and analyzed using both 32-bit and 64-bit arithmetic. Also, the impact of communication and load balancing on the performance of the code is outlined. The results show that reasonable performance can be achieved on these parallel machines. However, the CRAY Y-MP outperforms the CM-2 and iPSC/860 for this particular algorithm.The author is an employee of Computer Sciences Corporation. This work was funded through NASA Contract NAS 2-12961.  相似文献   

7.
The four general methods of parallel processing are described. These are: single instruction single data stream (SISD), multiple instruction single data stream (MISD), single instruction multiple data stream (SIMD), and multiple instruction multiple data stream (MIMD). Most single computers in use are SISD machines, while most parallel processing applications use the MIMD aproach. The paper outlines and compares the four basic MIMD architectures: tightly coupled, loosely coupled, voting, and peripheral processing. The latter is one of the most practical methods using existing minicomputers. Many modern programmable intelligent I/O controllers are peripheral processors. For a computer manufacturer to remain competitive, it is concluded that he will have to include such devices in his hardware.  相似文献   

8.
《Parallel Computing》1988,7(2):211-225
An adaptive task partitioning scheme for MIMD architectures is investigated. For many serial adaptive procedures this methodology provides a direct translation into reasonably efficient parallel versions.A proto-type of a two-dimensional integration method has been implemented in this manner. This was facilitated by using a set of high-level macros, layered over the Argonne macro package, which provides the primitives in the adaptive partitioning scheme.Because of the portable nature of the Argonne macro package our code should be readily ported to other MIMD machines with shared memory. A version of the macros for other MIMD architectures is also possible.  相似文献   

9.
Adam is a high-level language for parallel processing. It is intended for programming resource scheduling applications, in particular supervisory packages for run-time scheduling of multiprocessing systems. An important design goal was to provide support for implementation of Ada and its run-time environment. Adam has been used to implement Ada task supervision and also as a high-level target language for compilation of Ada tasking. Adam provides facilities corresponding to the Ada sequential constructs (including subprograms, packages, exceptions, generics). In addition, it provides specialized module constructs for implementation of packages that may be shared between parallel processes, and new predefined types for scheduling. The parallel processing constructs of Adam are more primitive than Ada tasking. Strong restrictions are enforced on the ways in which parallel processes can interact. A compiler for Adam has been implemented in MacLisp on DEC PDP-10 computers. Runtime support packages in Adam for scheduling (on a single CPU) and I/O are also provided. The compiler contains a library manipulation facility for separate compilation. The Adam compiler has been used to build an Ada compiler for most of the July 1980 Ada, including task types and rendezvous constructs. This was achieved by implementing the translation of Ada tasking into Adam parallel processing as a preprocessor to the Adam compiler. This present Ada compiler, which has been operational since December 1980, uses a procedure call implementation of tasking. It can be easily modified to other implementations. Compilation of Ada tasking into a high-level target language such as Adam facilitates studying questions of correctness and efficiency of various compilation algorithms, and code optimizations specific to tasking, e.g. elimination of unnecessary threads of control. This paper gives an overview of Adam and examples of its use. Emphasis is placed on the differences from Ada. Experience using Adam to build the experimental Ada system is evaluated. Design of a run-time supervisor in Adam is discussed in detail.  相似文献   

10.
The multiflow trace scheduling compiler   总被引:3,自引:0,他引:3  
The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.  相似文献   

11.
The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2, show a speedup within 60% to 70% of handwritten code  相似文献   

12.
多核平台下XEN虚拟机动态调度算法研究   总被引:1,自引:0,他引:1  
虚拟机调度算法对并行任务的执行效率考虑不够充分。现代处理器平台具备了多个可用的计算核心,使多个虚拟机并发执行成为了现实。针对多核平台下的并行虚拟机调度优化问题,提出一种基于任务特征虚拟机CON-Credit调度算法。该算法在调度并行任务时,使用动态方式对计算机核心进行分配,采用传统的虚拟机调度算法为执行普通任务的虚拟机进行分配;采用定制的同步算法给执行并行任务的虚拟机分进分配。相关实验显示,CON-Credit调度算法能显著提高并行任务的执行效率。  相似文献   

13.
提出了面向科学计算的64位流体系结构——MASA,它具有强局域性、并行性、解耦合访存操作和计算操作等特征,特别适合于计算密集型的并行应用.作者使用时钟精确的模拟器评测了流体力学中的典型应用在MASA上的运行性能,结果表明MASA在500MHz的情况下能够获得比1.6GHz的Iantium2近4倍的加速,证实了流体系结构在高性能计算领域的极大潜力.  相似文献   

14.
传统的静态编译器优化存在着各种限制,为此,提出了一种运行期动态优化的对策。在程序的执行过程中,持续检测程序运行的profile信息,并根据这些信息对程序代码进行优化变换,创建并运行程序代码的优化版本。这种运行期动态优化操作是直接针对程序的二进制代码的,不针对程序语言或编译器。这不仅带来优化的透明性,还使得老版本的源代码即遗留代码也可以从优化技术中获得性能提升。  相似文献   

15.
An environment that lets system applications be expressed as virtual machines, through which architecture-independent multiple-instruction, multiple-data stream (MIMD) programs are written, is described. The virtual machine hides the hardware configuration from the programmer so that the MIMD programming environment always appears the same, regardless of the actual hardware. The data-definition and procedural high-level languages used in the environment and the generation of object code in the environment are discussed. The runtime configuration of the system and an implemented prototype of the system are described  相似文献   

16.
The structure and implementation of P3C, a compiler that uses relatively simple parallelization techniques to produce highly efficient code for both shared-memory and distributed-memory MIMD multiprocessors, are discussed. P3C is a fully automatic, portable parallelizing Pascal compiler for scientific code, which is characterized by loops operating on regular data structures. P3 C compiles the same source code to all target machines without modification. P3C gives an accurate estimate about whether the parallel code will actually reduce execution time over serial code, taking into account the associated overhead. It derives the estimate by statically analyzing the program at compile time, referring to a table of the target machine's parameters. Speedups of up to 24 have been achieved on a Sequent Symmetry with 25 active processors. The estimated speedup was within 8% of the actual figure  相似文献   

17.
A survey of parallel computer architectures   总被引:1,自引:0,他引:1  
Duncan  R. 《Computer》1990,23(2):5-16
An attempt is made to place recent architectural innovations in the broader context of parallel architecture development by surveying the fundamentals of both newer and more established parallel computer architectures and by placing these architectural alternatives in a coherent framework. The primary emphasis is on architectural constructs rather than specific parallel machines. Three categories of architecture are defined and discussed: synchronous architectures, comprising vector, SIMD (single-instruction-stream, multiple-data-stream) and systolic machines; MIMD (multiple-instruction-stream, multiple-data-stream) with either distributed or shared memory; and MIMD-based paradigms, comprising MIMD/SIMD hybrid, dataflow, reduction, and wavefront types  相似文献   

18.
Data-parallel languages allow programmers to easily express parallel computations by means of high-level constructs. To reduce overheads, the compiler partitions the computations among the processors at compile-time, on the basis of the static data distribution suggested by the programmer. When execution costs are nonuniform and unpredictable, some processors may be assigned more work than others. Workload imbalance can be mitigated by cyclically distributing data and associated computations or by employing adaptive strategies which build a more balanced schedule at run-time, on the basis of the actual execution costs. This paper discusses static and hybrid (static+dynamic) scheduling strategies which can be used to balance the workloads derived from the execution of nonuniform parallel loops. A multidimensional flame simulation kernel has been used to evaluate different implementation strategies on a Cray T3E. We fed the benchmark code with synthetic input data sets built on the basis of a load imbalance model and we report and compare the results obtained.  相似文献   

19.
Although considerable technology has been developed for debugging and developing sequential programs, producing verifiably correct parallel code is a much harder task. In view of the large number of possible scheduling sequences, exhaustive testing is not a feasible method for determining whether a given parallel program is correct; nor have there been sufficient theoretical developments to allow the automatic verification of parallel programs. PTOOL, a tool being developed at Rice University in collaboration with users at Los Alamos National Laboratory, provides an alternative mechanism for producing correct parallel code. PTOOL is a semi-automatic tool for detecting implicit parallelism in sequential Fortran code. It uses vectorizing compiler techniques to identify dependences preventing the parallelization of sequential regions. According to the model supported by PTOOL, a programmer should first implement and test his program using traditional sequential debugging techniques. Then, using PTOOL, he can select loop bodies that can be safely executed in parallel. At Los Alamos, we have been interested in examining the role of dependence-analysis tools in the parallel programming process. Therefore, we have used PTOOL as a static debugging tool to analyze parallel Fortran programs. Our experiences using PTOOL lead us to conclude that dependence-analysis tools are useful to today's parallel programmers. Dependence-analysis is particularly useful in the development of asynchronous parallel code. With a tool like PTOOL, a programmer can guarantee that processor scheduling cannot affect the results of his parallel program. If a programmer wishes to implement a partially parallelized region through the use of synchronization primitives, however, he will find that dependence analysis is less useful. While a dependence-analysis tool can greatly simplify the task of writing synchronization code, the ultimate responsibility of correctness is left to the programmer.This work was performed under the auspices of the U.S. Department of Energy.  相似文献   

20.
We are concerned here with the parallelisation of the N3S-MUSCL industrial code which aims to solve the three-dimensional compressible Navier–Stokes equations by implementing a mixed finite element/finite volume formulation on unstructured tetrahedral meshes. Defining a good strategy for the parallelisation of an unstructured mesh based solver is a challenge, particularly when one aims at reaching a high level of performance while maintaining portability of the source code between scalar, vector and parallel machines. The parallelisation strategy adopted in this study combines mesh partitioning techniques and a message-passing programming model. The targeted parallel architectures are of MIMD type.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号