期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compiling parallel programs by optimizing performance

Marina Chen Young-Il Choo Jingke Li 《The Journal of supercomputing》1988,2(2):171-207

This paper describes how Crystal, a language based on familiar mathematical notation and lambda calculus, addresses the issues of programmability and performance for parallel supercomputers. Some scientifc programmers and theoreticians may ask, “What is new about Crystal?” or “How is it different from existing functional languages?” The answers lie in its model of parallel computation and a theory of parallel program optimization, and we examine this in the text to follow. We illustrate the power of our approach with benchmarks of compiled parallel code from Crystal source. The target machines are hypercube multiprocessors with distributed memory, on which it is considered difficult for functional programs to achieve high efficiency. 相似文献

2.

Adaptive execution techniques of parallel programs for multiprocessors

Jaejin Lee Jung-Ho Park Honggyu Kim Changhee Jung Daeseob Lim SangYong Han 《Journal of Parallel and Distributed Computing》2010

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems. 相似文献

3.

Retargeting sequential image-processing programs for data parallel execution

Baumstark L.B. Jr. Wills L.M. 《IEEE transactions on pattern analysis and machine intelligence》2005,31(2):116-136

New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations. 相似文献

4.

Efficient parallel execution of irregular recursive programs

《Parallel and Distributed Systems, IEEE Transactions on》2002,13(2):167-178

Programs whose parallelism stems from multiple recursion form an interesting subclass of parallel programs with many practical applications. The highly irregular shape of many recursion trees makes it difficult to obtain good load balancing with small overhead. We present a system, called REAPAR, that executes recursive C programs in parallel on SMP machines. Based on data from a single profiling run of the program, REAPAR selects a load-balancing strategy that is both effective and efficient and it generates parallel code implementing that strategy. The performance obtained by REAPAR on a diverse set of benchmarks matches that published for much more complex systems requiring high-level problem-oriented explicitly parallel constructs. A case study even found REAPAR to be competitive to handwritten (low-level, machine-oriented) thread-parallel code 相似文献

5.

Space efficient execution of deterministic parallel programs

Simpson D.J. Burton F.W. 《IEEE transactions on pattern analysis and machine intelligence》1999,25(6):870-882

We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of memory (i.e., heap memory allocation can be modeled). We call a parallel schedule “space efficient” if the amount of memory required is at most equal to the number of processors times the amount of memory required for some depth-first execution of the program by a single processor. We describe a simple, locally depth-first scheduling algorithm and show that it is always space efficient. Since the scheduling algorithm is greedy, it will be within a factor of two of being optimal with respect to time. For the special case of a program having a series-parallel structure, we show how to efficiently compute the worst case memory requirements over all possible depth-first executions of a program. Finally, we show how scheduling can be decentralized, making the approach scalable to a large number of processors when there is sufficient parallelism 相似文献

6.

A parallel execution model of logic programs

Chen A.C. Wu C.-I. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(1):79-92

A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic problems in parallel, making tradeoffs among number of processes, degree of parallelism, and combination bandwidth. For interpreting a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models that have a large single process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can reduce the number of unification and/or merging steps substantially. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the number of data sent to a process and/or the number of communication channels connected to a process 相似文献

7.

Stochastic bounds on execution times of parallel programs

Yazia-Pekergin N. Vincent J.-M. 《IEEE transactions on pattern analysis and machine intelligence》1991,17(10):1005-1012

Stochastic bounds are obtained on execution times of parallel programs when the number of processors is unlimited. A parallel program is considered to consist of interdependent tasks with synchronization constraints. These constraints are described by an acyclic directed graph called a task graph. The execution times of tasks are considered to be independently identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution of the considered parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs having `new better than used in expectation' (NBUE) random variables as task execution times. NBUE random variables are replaced with exponential random variables of the same mean to derive this upper bound 相似文献

8.

Using true concurrency to model execution of parallel programs

Yosi Ben-Asher Eitan Farchi 《International journal of parallel programming》1994,22(4):375-407

Parallel execution of a programR (intuitively regarded as a partial order) is usually modeled by sequentially executing one of the total orders (interleavings) into which it can be embedded. Our work deviates from this serialization principle by usingtrue concurrency to model parallel execution. True concurrency is represented via completions ofR tosemi total orders, called time diagrams. These orders are characterized via a set of conditions (denoted byCt), yielding orders or time diagrams which preserve some degree of the intended parallelism inR. Another way to express semi total orders is to use re-writing or derivation rules (denoted byCx) which for any programR generates a set of semi-total orders. This paper includes a classification of parallel execution into three classes according to three different types ofCt conditions. For each class a suitableCx is found and a proof of equivalence between the set of all time diagrams satisfyingCt and the set of all terminalCx derivations ofR is devised. This equivalence between time diagram conditions and derivation rules is used to define a novel notion of correctness for parallel programs. This notion is demonstrated by showing that a specific asynchronous program enforces synchronous execution, which always halts, showing that true concurrency can be useful in the context of parallel program verification. 相似文献

9.

OR parallel execution of Prolog programs with side effects

Laxmikant V. Kale David A. Padua David C. Sehr 《The Journal of supercomputing》1988,2(2):209-223

With the growing availability of multiprocessors, a great deal of attention has been given to executing Prolog in parallel. A question that naturally arises is how to execute standard sequential Prolog programs with side effects in parallel. The problem of performing side effects in AND parallel systems has been considered elsewhere. This paper presents a method that generates sequential semantics of side effect predicates in an OR parallel system. First, a general method is given for performing data side effects such as read and write. This method is then extended to control side effects such as asserta, assertz, and retract. Finally, a constant-time algorithm for performing cut is presented.The work of L. V. Kale was supported by the National Science Foundation under Grant NSF-CCR-8700988. The work of D. A. Padua and D. C. Sehr was supported in part by the National Science Foundation under Grant NSF-MIP-8410110, the Department of Energy under Grant DOE DE-FG02-85ER25001, and a donation from the IBM Corporation to the Center for Supercomputing Research and Development. D. C. Sehr holds a fellowship from the Office of Naval Research. 相似文献

10.

A survey of parallel execution strategies for transitive closure and logic programs

Filippo Cacace Stefano Ceri Maurice Houtsma 《Distributed and Parallel Databases》1993,1(4):337-382

An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In particular, hash-based fragmentation is used to distribute data to disks under the control of different processors in order to perform selections and joins in parallel. With the development of new query languages, and in particular with the definition of transitive closure queries and of more general logic programming queries, the new dimension of recursion has been added to query processing. Recursive queries are complex; at the same time, their regular structure is particularly suited for parallel execution, and parallelism may give a high efficiency gain. We survey the approaches to parallel execution of recursive queries that have been presented in the recent literature. We observe that research on parallel execution of recursive queries is separated into two distinct subareas, one focused on the transitive closure of Relational Algebra expressions, the other one focused on optimization of more general Datalog queries. Though the subareas seem radically different because of the approach and formalism used, they have many common features. This is not surprising, because most typical Datalog queries can be solved by means of the transitive closure of simple algebraic expressions. We first analyze the relationship between the transitive closure of expressions in Relational Algebra and Datalog programs. We then review sequential methods for evaluating transitive closure, distinguishing iterative and direct methods. We address the parallelization of these methods, by discussing various forms of parallelization. Data fragmentation plays an important role in obtaining parallel execution; we describe hash-based and semantic fragmentation. Finally, we consider Datalog queries, and present general methods for parallel rule execution; we recognize the similarities between these methods and the methods reviewed previously, when the former are applied to linear Datalog queries. We also provide a quantitative analysis that shows the impact of the initial data distribution on the performance of methods. Recommended by: Patrick Valduriez 相似文献

11.

Optimized Parallel Execution of Declarative Programs on Distributed Memory Multiprocessors

下载免费PDF全文

Shen Meiming Tian Xinmin Wang Dingxing Zheng Weimin Wen Dongchan 《计算机科学技术学报》1993,8(3):43-52

In this paper,we focus on the compiling implementation of parlalel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstact machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed.The performance statistics on transputer array demonstrate the effectiveness of our model,parallel abstract machine,optimizing compilation strategies and compiler. 相似文献

12.

Algebra of algorithms and dynamic parallel execution of sequential programs

V. M. Glushkov Yu. V. Kapitonova A. A. Letichevskii 《Cybernetics and Systems Analysis》1982,18(5):533-542

相似文献

13.

DIALOG — A dataflow model for parallel execution of logic programs

Kang Zhang Ray Thomas 《Future Generation Computer Systems》1991,6(4):373-388

The paper presents a dataflow execution model, DIALOG, for logic programs which operates on an intermediate virtual machine. The virtual machine is granulated at clause argument level to exploit argument parallelism through unification. The model utilises a new variable binding scheme that eliminates dereference operations for accessing variables, and therefore supports OR-parallelism in the highly distributed dataflow environment. The model has been implemented in Occam. A conventional dataflow architecture in support of the model has been simulated as a testbed for the evaluation. The simulation indicates some encouraging results and suggests future improvements. 相似文献

14.

Compiling programs for distributed-memory multiprocessors 总被引：1，自引：0，他引：1

David Callahan Ken Kennedy 《The Journal of supercomputing》1988,2(2):151-169

We describe a new approach to programming distributed-memory computers. Rather than having each node in the system explicitly programmed, we derive an efficient message-passing program from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors. This article describes one possible input language for describing distributions and then details the compilation process and the optimization necessary to generate an efficient program.Research supported by Intel. 相似文献

15.

A wide instruction word architecture for parallel execution of logic programs coded in BSL

Kemal Ebcioĝlu Manoj Kumar 《New Generation Computing》1990,7(2-3):219-242

This paper begins by describing BSL, a new logic programming language fundamentally different from Prolog. BSL is a nondeterministic Algol-class language whose programs have a natural translation to first order logic; executing a BSL program without free variables amounts to proving the corresponding first order sentence. A new approach is proposed for parallel execution of logic programs coded in BSL, that relies on advanced compilation techniques for extracting fine grain parallelism from sequential code. We describe a new “Very Long Instruction Word” (VLIW) architecture for parallel execution of BSL programs. The architecture, now being designed at the IBM Thomas J. Watson Research Center, avoids the synchronization and communication delays (normally associated with parallel execution of logic programs on multiprocessors), by determining data dependences between operations at compile time, and by coupling the processing elements very tightly, via a single central shared register file. A simulator for the architecture has been implemented and some simulation results are reported in the paper, which are encouraging. 相似文献

16.

Compiling quantum programs 总被引：4，自引：0，他引：4

Paolo Zuliani 《Acta Informatica》2005,41(7-8):435-474

In this paper we study a possible compiler for a high-level imperative programming language for quantum computation, the quantum Guarded-Command Language (qGCL). It is important because it liberates us from thinking of quantum algorithms at the data-flow level, in the same way as happened for standard computation a few decades ago.We make use of the normal-form approach to compiler design, introduced by Hoare, Jifeng and Sampaio. In this approach a source program is transformed, by means of algebraic manipulations, into a particular form which can be directly executed by a target machine. This entails the definition of a simple quantum hardware architecture, derived from Hoare et al.’s computing model.Our work provides a general framework for the construction of a compiler for qGCL, focusing mainly on the correctness of the design. Here we do not deal with other topics such as efficiency of compiled code, factorisation of unitary transformations and compilation of quantum data structures. 相似文献

17.

Compiling communication-efficient programs for massively parallelmachines

Li J. Chen M. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(3):361-376

A method of generating parallel target code with explicit communication for massively parallel distributed-memory machines is presented. The source programs are shared-memory parallel programs with explicit control structures. The method extracts syntactic reference patterns from a program with shared address space, selects appropriate communication routines, places these routines in appropriate locations in the target program text and sets up correct conditions for invoking these routines. An explicit communication metric is used to guide the selection of data layout strategies 相似文献

18.

An execution profiler for modular programs

Susan L. Graham Peter B. Kessler Marshall K. McKusick 《Software》1983,13(8):671-685

In modular programs, groups of routines constitute conceptual abstractions. A method for providing execution profiles for such programs is presented. The central idea is that the execution time for a routine is charged to the routines that call it. The implementation of this method by a profiler called gprof is described. The techniques used to gather the necessary information about the timing and structure of the program are given, as is the processing used to propagate routine execution times along arcs of the call graph of the program. The method for displaying the profile to the user is discussed. Experience using the profiles for hand-tuning large programs is summarized. Additional uses for the profiles are suggested. 相似文献

19.

Optimization methods of parallel execution of numerical programs in the LuNA fragmented programming system

Victor Malyshkin Vladislav Perepelkin 《The Journal of supercomputing》2012,61(1):235-248

The organization of high-performance execution of a fragmented program has encountered with the problem of choosing of an acceptable way of its execution. The potentialities of optimizing the execution at the stages of fragmented program development, compilation and execution are considered. The methods and algorithms of such an optimization are proposed to be included into the LuNA fragmented programming language, compiler, generator and run-time system. 相似文献

20.

Compiling global name-space parallel loops for distributedexecution

Koelbel C. Mehrotra P. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(4):440-451

Compiler support required to allow programmers to express their algorithms using a global name-space is discussed. A general method for the analysis of a high-level source program and its translation into a set of independently executing tasks that communicate using messages is presented. It is shown that if the compiler has enough information, the translation can be carried out at compile time; otherwise; run-time code is generated to implement the required data movement. The analysis required in both situations is described, and the performance of the generated code on the Intel iPSC/2 hypercube is presented 相似文献