Similar Documents
20 similar documents found.
1.
The implementation of two compilers for the data-parallel programming language Dataparallel C is described. One compiler generates code for Intel and nCUBE hypercube multicomputers; the other generates code for Sequent multiprocessors. A suite of Dataparallel C programs has been compiled and executed, and their execution times and speedups on the Intel iPSC/2, the nCUBE 3200, and the Sequent Symmetry are presented.

2.
For pt. I, see ibid., pp. 170-80. In pt. I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and provide performance results for the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the nCUBE 2, the Intel i860 hypercube, and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor, and to scale linearly with an increasing number of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.

3.
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup of up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT-enabled single-CPU system and a 1.99 speedup on an HT-enabled dual-CPU system for the audio-visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for the two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs.
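A minimal C++/OpenMP sketch of the nested-parallelism pattern this abstract describes: the `omp_set_nested`/`omp_set_max_active_levels` calls and the two-level pragma structure are standard OpenMP, while the frame/slice decomposition and the thread counts are illustrative assumptions, not code from the paper.

```cpp
// Nested OpenMP parallelism: an outer team across video frames,
// an inner team across slices of each frame. Build with: g++ -fopenmp
#include <omp.h>

void encode_slice(int frame, int slice) { /* per-slice work (placeholder) */ }

int main() {
    omp_set_nested(1);             // enable nested parallel regions
    omp_set_max_active_levels(2);  // outer level + inner level
    #pragma omp parallel for num_threads(2)      // outer: frames
    for (int frame = 0; frame < 8; ++frame) {
        #pragma omp parallel for num_threads(2)  // inner: slices
        for (int slice = 0; slice < 4; ++slice)
            encode_slice(frame, slice);
    }
    return 0;
}
```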

4.
Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, utilizing all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data-parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster.
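The PARADIGM optimization lives inside an HPF compiler; as a rough hand-written analogue of the idea, the hedged C++/OpenMP sketch below runs two independent data-parallel tasks concurrently, each capped at half the processors, instead of running each in turn across the whole machine. Task structure and loop bodies are placeholders.

```cpp
// Task parallelism used to cap the degree of data parallelism: two
// independent data-parallel tasks run side by side on half the machine
// each, rather than sequentially on all processors.
#include <omp.h>
#include <algorithm>

void data_parallel_task(int width) {
    #pragma omp parallel for num_threads(width)
    for (int i = 0; i < 1000000; ++i) { /* per-element work (placeholder) */ }
}

int main() {
    omp_set_nested(1);                              // allow inner parallel regions
    int half = std::max(1, omp_get_num_procs() / 2);
    #pragma omp parallel sections num_threads(2)    // outer task parallelism
    {
        #pragma omp section
        data_parallel_task(half);                   // task A, capped width
        #pragma omp section
        data_parallel_task(half);                   // task B, capped width
    }
    return 0;
}
```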

5.
The research focus in parallel logic programming is shifting rapidly from theoretical considerations and simulation on uniprocessors to implementation on true multiprocessors. This report presents performance figures from such a system, Boplog, for OR-parallel Horn clause logic programs on the BBN Butterfly Parallel Processor. Boplog is designed expressly for a large-scale shared memory multiprocessor with an Omega interconnect. The target machine and underlying execution model are described briefly. The primary focus of the paper is on detailed statistics taken from the execution of benchmark programs to assess the performance of the model and clarify the impact of design and architecture decisions. They show that while speedup of this implementation on highly OR-parallel problems is very good, overall performance is poor. Despite its speed drawback, many aspects of the implementation and its performance can prove useful in designing future systems for similar machines. A binding model that prohibits constant-time access to bindings, and the inability of the machine to support an ambitious use of machine memory, appear to be the most damaging factors. This work was carried out at the University of Utah, Salt Lake City, Utah. It was supported by a University of Utah Graduate Research Fellowship, the National Science Foundation under Grant DCR-856000, and by an unrestricted gift from L. M. Ericsson Telefon AB, Stockholm, Sweden. Production of the document was supported by the Rockwell International Science Center.

6.
7.
As moderate-scale multiprocessors become widely used, we foresee an increased demand for effective compiler parallelization and efficient management of parallelism. While parallelizing compilers are achieving success at identifying parallelism, they are less adept at predetermining the degree of parallelism in different program phases. Thus, a compiler-parallelized application may execute on more processors than it can effectively use – a waste of computational resources that becomes more acute as the number of processors increases, particularly for systems used as multiprogrammed compute servers. This paper examines the dynamic parallelism behavior of multiprogrammed workloads using programs from the SPECfp95 and NAS benchmark suites, automatically parallelized by the Stanford SUIF compiler. Our results demonstrate that even the programs with good overall speedups display wide variability in the number of processors each phase (or loop) can exploit. We propose and evaluate a run-time system mechanism that dynamically adjusts the number of processors used by a compiler-parallelized application, responding to observed performance during the program's execution. Programs can thus adapt processor usage as they run, responding both to poor parallelism within certain parts of their code and to heavy multiprogramming loads during execution. This mechanism improves workload performance by up to 33% over one-at-a-time runs of the workload programs. © 1998 John Wiley & Sons, Ltd.
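A hedged C++/OpenMP illustration of such a run-time mechanism: the 5% threshold, the halving policy, and the phase structure are invented for this sketch, and the paper's actual mechanism is not reproduced.

```cpp
#include <omp.h>
#include <chrono>

// Dynamically adjust the processor allotment of a recurring parallel
// phase based on its observed execution time.
void parallel_phase(int n) {
    static int team = omp_get_num_procs();   // current allotment
    static double prev = -1.0;               // time at the previous team size
    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(team)
    for (int i = 0; i < n; ++i) { /* phase body (placeholder) */ }
    double t = std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    // Naive policy: keep releasing half the processors as long as the
    // previous shrink cost less than 5%. A real runtime would also grow
    // the team back when parallelism or machine load improves.
    if ((prev < 0.0 || t <= prev * 1.05) && team > 1) team /= 2;
    prev = t;
}
```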

8.
Parallel programming is orders of magnitude more complex than writing sequential programs. This is particularly true for programming distributed memory multiprocessor architectures based on message passing programming models. Apart from understanding the sequential parts of the parallel program, new degrees of freedom lead to additional problems. Understanding the synchronization and communication behavior of parallel programs is the most critical issue in programming distributed memory multiprocessors. The paper describes methods and tools for visualization and animation of the dynamic execution of parallel programs. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel systems) is presented as part of the integrated tool environment TOPSYS (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports the interactive on-line visualization of message passing programs based on various views; in particular, a process-graph-based concurrency view for detecting synchronization and communication bugs.

9.
Programming multiprocessor parallel architectures is a complex task. This paper describes a block-structured scientific programming language, BLAZE, designed to simplify this task. BLAZE contains array arithmetic, ‘forall’ loops, and APL-style accumulation operators, which allow natural expression of fine-grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse-grained parallelism using machine-specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.

A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture, while neglecting the remainder. This paper describes the features of BLAZE and shows how this language would be used in typical scientific programming.
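BLAZE source is not quoted in the abstract, so as a stand-in, here is the same pattern, element-wise array arithmetic inside a forall-style loop of independent iterations, sketched in C++/OpenMP; the `axpy` example is an invented illustration.

```cpp
// A C++/OpenMP stand-in for a BLAZE-style 'forall': every iteration is
// independent, so a compiler or runtime may execute them in any parallel order.
#include <omp.h>
#include <vector>

std::vector<double> axpy(double a, const std::vector<double>& x,
                         const std::vector<double>& y) {
    std::vector<double> z(x.size());
    #pragma omp parallel for                   // 'forall i in 1..n'
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = a * x[i] + y[i];                // pure, element-wise arithmetic
    return z;
}
```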


10.
We propose a new paradigm for programming multiprocessor systems, 2DT (two-dimensional transformations). 2DT programs are composed of local computations on linear data (columns) and global transformations on 2-dimensional combinations of columns (2D-arrays). Local computations can be expressed in a functional or imperative base language; a typed variant of Backus' FP is chosen for 2DT-FP. The level of abstraction of 2DT makes it suitable for programming a relevant set of algorithms: in general, any algorithm whose data can be easily mapped to 2-dimensional representations. A set of examples supports this claim. An interleaving semantics for 2DT-FP is given, exposing the potential for parallel execution of 2DT-FP programs. It is proved that any sequential, and thus any parallel, execution will deliver the same result. The implementation realized on the Intel Hypercube is described. Supported by the Deutsche Forschungsgemeinschaft, DFG.
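A hedged C++ sketch of the 2DT split, with plain C++ standing in for the FP-style base language (`scale` and `transpose` are invented examples): local computations touch one column at a time and are trivially parallelizable across columns, while global transformations rearrange whole 2D-arrays.

```cpp
#include <vector>
using Column  = std::vector<double>;
using Array2D = std::vector<Column>;   // a 2D-array as a combination of columns

Column scale(const Column& c) {        // local: independent per column
    Column r(c.size());
    for (std::size_t i = 0; i < c.size(); ++i) r[i] = 2.0 * c[i];
    return r;
}

Array2D transpose(const Array2D& a) {  // global: whole-array transformation
    Array2D t(a[0].size(), Column(a.size()));
    for (std::size_t j = 0; j < a.size(); ++j)
        for (std::size_t i = 0; i < a[j].size(); ++i)
            t[i][j] = a[j][i];
    return t;
}
```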

11.
Hill, M.D. Computer, 1998, 31(8): 28-34
In the future, many computers will contain multiple processors, in part because the marginal cost of adding a few additional processors is so low that only minimal performance gain is needed to make the additional processors cost-effective. Intel, for example, now makes cards containing four Pentium Pro processors that can easily be incorporated into a system. Multiple-processor cards like Intel's will help multiprocessing spread from servers to the desktop. But how will these multiprocessors be programmed? The evolution of the programming model is already under way. One important function of the programming model is to describe how memory operates. For a multiprocessor, a reasonable model is sequential consistency (SC), which makes a multiprocessor behave like a multitasking uniprocessor. Nevertheless, many commercial multiprocessors support more relaxed memory models. The author argues that multiprocessors should support SC because, with speculative execution, relaxed models do not provide sufficient additional performance to justify exposing their complexity to the authors of low-level software.
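The stakes can be seen in the classic store-buffering litmus test, shown here with C++11 atomics; this illustrates SC versus relaxed models in general and is not code from the article.

```cpp
// Store-buffering litmus test. With seq_cst (SC) ordering, the outcome
// r1 == 0 && r2 == 0 is impossible; with memory_order_relaxed it is allowed.
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_seq_cst);
                        r1 = y.load(std::memory_order_seq_cst); });
    std::thread t2([] { y.store(1, std::memory_order_seq_cst);
                        r2 = x.load(std::memory_order_seq_cst); });
    t1.join(); t2.join();
    assert(r1 == 1 || r2 == 1);  // guaranteed only under seq_cst
    return 0;
}
```

Replacing the orders with memory_order_relaxed permits both loads to return 0, exactly the kind of surprise SC spares the authors of low-level software.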

12.
We describe a binding environment for the AND and OR parallel execution of logic programs that is suitable for both shared and nonshared memory multiprocessors. The binding environment was designed with a view to rendering a compiler using this binding environment machine independent. The binding environment is similar to the closed environments proposed by J. Conery. However, unlike Conery's scheme, it supports OR and independent AND parallelism on both types of machines. The term representation, the algorithms for unification, and the join algorithms for parallel AND branches are presented in this paper. We also detail the differences between our scheme and Conery's scheme. A compiler based on this binding environment has been implemented on a platform for machine independent parallel programming called the Chare Kernel.

13.
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial, due to interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications, running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.
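A hedged sketch of the dynamic-feedback idea in C++/OpenMP; the candidate thread counts and the one-pass selection policy are invented, and the paper's preprocessor-generated code is not reproduced.

```cpp
// One-shot adaptive mode selection for a single parallel loop: time the
// loop once at each candidate thread count, then keep the winner.
#include <omp.h>
#include <chrono>

int pick_threads(void (*body)(int), int n) {
    int best = 1;
    double best_t = 1e30;
    for (int t : {1, 2, 4, 8}) {                 // candidate execution modes
        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for num_threads(t)
        for (int i = 0; i < n; ++i) body(i);
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        if (dt < best_t) { best_t = dt; best = t; }
    }
    return best;  // use this thread count for the remaining invocations
}
```

A production version would re-probe periodically, since the best mode can shift with input size and machine load.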

14.
Because multicore CPUs have become the standard with all major hardware manufacturers, it becomes increasingly important for programming languages to provide programming abstractions that can be mapped effectively onto parallel architectures. Stream processing is a programming paradigm where computations are expressed as independent actors that communicate via FIFO data-channels. The coarse-grained parallelism exposed in stream programs facilitates such an efficient mapping of actors onto the underlying multicore hardware. We propose a stream-parallel programming abstraction that extends object-oriented languages with stream-programming facilities. StreamPI consists of a class hierarchy for actor-specification together with a language-independent runtime system that supports the execution of stream programs on multicore architectures. We show that the language-specific part of StreamPI, i.e., the class hierarchy, can be implemented as a library-level programming language extension. A library-level extension has the advantage that an existing programming language implementation need not be touched. Legacy code can be mixed with a stream-parallel application, and the use of sequential legacy code with actors is supported. Unlike previous approaches, StreamPI allows dynamic creation and subsequent execution of stream programs. StreamPI actors are typed. Type-safety is achieved through type-checks at stream graph creation time. We have implemented StreamPI's language-independent runtime system and language interfaces for Ada 2005 and C++ for Intel multicore architectures. We have evaluated StreamPI for up to 16 cores on a two-CPU 8-core Intel Xeon X7560 server, and we provide a performance comparison with StreamIt (Gordon et al., International Conference on Architectural Support for Programming Languages and Operating Systems, 2006), which is the de facto standard for stream-parallel programming. Although our approach provides greater programming flexibility than StreamIt, the performance of StreamPI compares favorably to the static compilation model of StreamIt.
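StreamPI's actual class hierarchy is not reproduced here; the self-contained C++ sketch below only illustrates the underlying model the abstract assumes, independent actors connected by a FIFO data-channel.

```cpp
// Minimal actor-with-FIFO-channel sketch in the stream-processing style.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {                        // unbounded FIFO data-channel
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> l(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T pop() {                          // blocks until data is available
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        T v = std::move(q.front()); q.pop();
        return v;
    }
};

int main() {
    Channel<int> ch;
    // Two independent actors: a producer and a consumer thread.
    std::thread producer([&] { for (int i = 0; i < 10; ++i) ch.push(i); });
    std::thread consumer([&] { for (int i = 0; i < 10; ++i) (void)ch.pop(); });
    producer.join(); consumer.join();
    return 0;
}
```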

15.
16.
A para-functional programming language is a functional language that has been extended with special annotations that provide an extra degree of control over parallel evaluation. Of most interest are annotations that allow one to express the dynamic mapping of a program onto a known multiprocessor topology. Since it is quite desirable to provide a precise semantics for any programming language, in this paper a denotational semantics is given for a simple para-functional programming language with mapping annotations. A precise meaning is given not only to the normal functional behavior of the program (i.e., the answer), but also to the operational notion of where (i.e., on what processor) expressions are evaluated. The latter semantics is accomplished through an abstract entity called an execution tree. This research was supported in part by the National Science Foundation under Grants DCR-8403304 and DCR-8451415, and the Department of Energy under Grant DE-FG02-86ER25012.

17.
Algorithms from scientific computing often exhibit a two-level parallelism based on potential method parallelism and potential system parallelism. We consider the parallel implementation of those algorithms on distributed memory machines. The two-level potential parallelism of algorithms is expressed in a specification consisting of an upper level hierarchy of multiprocessor tasks each of which has an internal structure of uniprocessor tasks. To achieve an optimal parallel execution time, the parallel execution of such a program requires an optimal scheduling of the multiprocessor tasks and an appropriate treatment of uniprocessor tasks. For an important subclass of structured method parallelism we present a scheduling methodology which takes data redistributions between multiprocessor tasks into account. As costs we use realistic parallel runtimes. The scheduling methodology is designed for an integration into a parallel compiler tool. We illustrate the multitask scheduling by several examples from numerical analysis.

18.
The Muse (multiple sequential Prolog engines) approach has been used to make a simple and efficient OR-parallel implementation of the full Prolog language. The performance results of the Muse system on bus-based multiprocessor machines have been presented in previous papers. This paper discusses the implementation and performance results of the Muse system on switch-based multiprocessors (the BBN Butterfly GP1000 and TC2000). The results of Muse execution show that high real speedups can be achieved for Prolog programs that exhibit coarse-grained parallelism. The scheduling overhead is equivalent to around 8–26 Prolog procedure calls per task on the TC2000. The paper also compares the Muse results with corresponding results for the Aurora OR-parallel Prolog system. For a large set of benchmarks, the results are in favor of the Muse system.

19.
The Portable Parallelizing Fortran Compiler (PPFC) is an additional component of the portable programming environment developed at Tel-Aviv University for scientific code. This environment supports portable and efficient programming of diverse MIMD multiprocessors, both distributed- and shared-memory. Until now this environment has consisted of two tools: the Virtual Machine for MultiProcessors (VMMP) and the Portable Parallelizing Pascal compiler (P3C). We have added the PPFC, an automatic parallelizing compiler for the Fortran language. The compiler is fully automatic (it does not require additional declarations to assist parallelization), targets code characterized by loops operating on regular data structures, and produces efficient and portable code for a variety of multiprocessors from the same serial code. The parallel implementation uses the VMMP, a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. The PPFC parallelized 12 of the 24 Livermore Loops. It was also applied to parallelize all 14 Fortran application programs that were parallelized by the P3C, and it achieved the same speed-ups and efficiencies. In most examples the PPFC achieved high speed-ups and efficiencies on all target multiprocessors. The PPFC emphasizes efficiency and code portability. Although the PPFC employs a relatively simple data-flow analysis, it produces efficient code for various widely used application programs.

20.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked, algorithm; non-loop programs should not be software-speculated; and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
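The blocked-versus-cyclic distinction maps directly onto OpenMP's schedule clause; a hedged C++ sketch (loop bodies are placeholders, and the OpenMP spelling is an illustration, not the paper's code):

```cpp
// Blocked vs. cyclic distribution of loop iterations across threads.
// The abstract's results favor the cyclic scheme on SMT, where co-resident
// threads benefit from working on neighboring, cache-shared iterations.
#include <omp.h>

void blocked(int n) {
    #pragma omp parallel for schedule(static)     // contiguous chunks per thread
    for (int i = 0; i < n; ++i) { /* body */ }
}

void cyclic(int n) {
    #pragma omp parallel for schedule(static, 1)  // round-robin iterations
    for (int i = 0; i < n; ++i) { /* body */ }
}
```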
