期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Memory requirements for parallel programs

F. Warren Burton David J. Simpson 《Parallel Computing》2000,26(13-14)

Parallel execution is normally used to decrease the amount of time required to run a program. However, the parallel execution may require far more space than that required by the sequential execution. Worse yet, the parallel space requirement may be very much more difficult to predict than the sequential space requirement because there are more factors to consider. These include essentially nondeterministic factors that can influence scheduling, which in turn may dramatically influence space requirements. We survey some scheduling algorithms that attempt to place bounds on the amount of time and space used during parallel execution. We also outline a direction for future research. This direction takes us into the area of functional programming, where the declarative nature of the languages can help the programmer to produce correct parallel programs, a feat that can be difficult with procedural languages. Currently the high-level nature of functional languages can make it difficult for the programmer to understand the operational behavior of the program. We look at some of the problems in this area, with the goal of achieving a programming environment that supports correct, efficient parallel programs. 相似文献

2.

Scaling parallel programs for multiprocessors: methodology andexamples

Singh J.P. Hennessy J.L. Gupta A. 《Computer》1993,26(7):42-50

Models for the constraints under which an application should be scaled, including constant problem-size scaling, memory-constrained scaling, and time-constrained scaling, are reviewed. A realistic method is described that scales all relevant parameters under considerations imposed by the application domain. This method leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size. The primary example application is a simulation of galaxies using the Barnes-Hut hierarchical N-body method 相似文献

3.

PProto: An environment for prototyping parallel programs

Ramón D. Acosta 《Journal of Systems Integration》1991,1(3-4):339-365

This paper describes Parallel Proto (PProto), an integrated environment for constructing prototypes of parallel programs. Using functional and performance modeling of dataflow specifications, PProto assists in analysis of high-level software and hardware architectural tradeoffs. Facilities provided by PProto include a visual language and an editor for describing hierarchical dataflow graphs, a resource modeling tool for creating parallel architectures, mechanisms for mapping software components to hardware components, an interactive simulator for prototype interpretation, and a reuse capability. The simulator contains components for instrumenting, animating, debugging, and displaying results of functional and performance models. The Pproto environment is built on top of a substrate for managing user interfaces and database objects to provide consistent views of design objects across system tools. 相似文献

4.

On-line monitoring: a tutorial

Schroeder B.A. 《Computer》1995,28(6):72-78

Online monitoring can complement formal techniques to increase application dependability. The article outlines the concepts and identifies the activities that comprise event based monitoring, describing several representative monitoring systems. It focuses on four areas: dependability, performance enhancement, correctness checking, and security 相似文献

5.

Compiling lisp programs for parallel execution

James R. Larus 《LISP and Symbolic Computation》1991,4(1):29-99

Curare, the program restructurer described in this paper automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor.Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result.Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity,Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke.Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part.This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics). 相似文献

6.

Locality analysis for parallel C programs

Yingchun Zhu Hendren L.J. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):99-114

相似文献

7.

Asynchronous parallel simulation of parallel programs

Prakash S. Deelman E. Bagrodia R. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(5):385-400

Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models are synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model 相似文献

8.

Estimating and optimizing performance for parallel programs 总被引：1，自引：0，他引：1

Fahringer T. 《Computer》1995,28(11):47-56

The article describes P³T, a parameter-based performance prediction tool that estimates performance for parallel programs running on distributed-memory parallel architectures. P³T has been carefully designed to address all of the above performance estimation issues. To achieve high estimation accuracy, P ³T aggressively exploits compiler analysis and optimization information. Our method is based on modeling loop iteration spaces, array access patterns, and data distributions by intersection and volume operations on n-dimensional polytopes. The most critical architecture-specific factors, such as cache line sizes, number of cache lines available, routing policy, start-up times, message transfer time per byte, and so forth, are modeled to reflect the performance impact of the target machine. P³T has been developed in the context of the Vienna Fortran Compilation Systems (VFCS), a state-of-the-art parallelization tool for distributed-memory systems. VFCS translates Fortran programs into explicitly parallel message-passing programs. P ³T successfully guides the interactive and automatic restructuring of programs under this system. The article describes the underlying compilation and programming model and discusses the most critical design decisions made for P³T; in addition, it outlines the implementation of the parallel program parameters. Also described are the VFCS context under which P³T is applied and the P³T graphical user interface 相似文献

9.

Guaranteeing good memory bounds for parallel programs

Burton F.W. 《IEEE transactions on pattern analysis and machine intelligence》1996,22(10):762-773

The amount of memory required by a parallel program may be spectacularly larger than the memory required by an equivalent sequential program, particularly for programs that use recursion extensively. Since most parallel programs are nondeterministic in behavior, even when computing a deterministic result, parallel memory requirements may vary from run to run, even with the same data. Hence, parallel memory requirements may be both large (relative to memory requirements of an equivalent sequential program) and unpredictable. Assume that each parallel program has an underlying sequential execution order that may be used as a basis for predicting parallel memory requirements. We propose a simple restriction that is sufficient to ensure that any program that will run in n units of memory sequentially can run in mn units of memory on m processors, using a scheduling algorithm that is always within a factor of two of being optimal with respect to time. Any program can be transformed into one that satisfies the restriction, but some potential parallelism may be lost in the transformation. Alternatively, it is possible to define a parallel programming language in which only programs satisfying the restriction can be written 相似文献

10.

Low-latency, concurrent checkpointing for parallel programs 总被引：2，自引：0，他引：2

Kai Li Naughton J.F. Plank J.S. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):874-879

Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed 相似文献

11.

Livelocks in parallel programs

《国际计算机数学杂志》2012,89(3-4):201-212

This paper is the second of a two-part series exploring the subtle correctness criterion of the absence of livelocks in parallel programs. In this paper we are concerned with the issue of proving this correctness criterion. It is shown that livelocks are not preserved by reduction, implying that reduction cannot be used directly in proving the absence of livelocks. Two applicable proof techniques are also presented. One is based on the notion of establishing sufficient conditions for livelock-freedom; the other is an extension of the well-founded set method for proving termination in sequential programs. 相似文献

12.

Interrupt replay: a debugging method for parallel programs with interrupts

KMR Audenaert LJ Levrouw 《Microprocessors and Microsystems》1994,18(10):601-612

The behaviour of programs for multiprocessors may be indeterminate, due to processor timing variations. This poses a problem for cyclic debugging, since a bug may disappear from one execution to another. Replay is an elegant solution to this problem, in which ‘sufficient’ information is recorded in a log. This information is then used to control subsequent executions of the same program so that repeatability is guaranteed. Interrupts are another source of non-determinism, even in sequential programs. This paper presents an extension of the well-known Instant Replay method, termed Interrupt Replay, for replaying programs in the presence of interrupts. The correctness of Interrupt Replay is based on the assumption that there are no interrupt races: an interrupt service routine must not access data that is also accessed by the foreground process whenever the interrupt is enabled. If such races are present then replay may fail to produce deterministic results. This assumption is similar to the basic assumption of Instant Replay that shared variables are properly protected by mutual exclusion. Also as in Instant Replay, it is assumed that the behaviour of the environment (input data, external interrupts) is replayed by some other tracing mechanism. 相似文献

13.

Dynamic models of parallel computations for macropipelined programs

A. E. Doroshenko 《Cybernetics and Systems Analysis》1995,31(6):818-834

Conclusion We have considered dynamic models of parallel computations and transformation algorithms for macropipelined programs that increase the internal exchange asynchronism of the parallel components. The models generalize the well-known CSP synchronous exchange paradigm (the rendezvous mechanism) [18] and provide a theoretical justification for the development of a whole range of interconnected semantic models of asynchronous computation with increasing degree of exchange asynchronism. Some publications ensure asynchronous exchange only by buffering [15, 16]. The dynamic model approach developed in this study combines buffering and analysis of data dependences of the exchange operators. It not only reduces losses associated with synchronization of communicating parallel processes, but also ensures automatic resolution of some classes of data exchange deadlocks. Experience with dynamic models as a means of increasing the efficiency of macropipelined programs and multilevel memory in multiprocessor systems [12, 17] can be applied in other programming systems and parallel program design systems of both compiling (S1, S2) and interpreting (S3) type. Translated from Kibernetika i Sistemnyi Analiz, No. 6, pp. 45–65, November–December, 1995. 相似文献

14.

Adaptive execution techniques of parallel programs for multiprocessors

Jaejin Lee Jung-Ho Park Honggyu Kim Changhee Jung Daeseob Lim SangYong Han 《Journal of Parallel and Distributed Computing》2010

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems. 相似文献

15.

Generating data transfers for distributed GPU parallel programs

F. Silber-Chaussumier A. Muller R. Habel 《Journal of Parallel and Distributed Computing》2013

Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such an heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into an heterogeneous HMPP + MPI exploiting multiple GPUs located on different computing nodes. 相似文献

16.

An axiomatic proof technique for parallel programs I 总被引：3，自引：1，他引：3

Susan Owicki David Gries 《Acta Informatica》1976,6(4):319-340

Summary A language for parallel programming, with a primitive construct for synchronization and mutual exclusion, is presented. Hoare's deductive system for proving partial correctness of sequential programs is extended to include the parallelism described by the language. The proof method lends insight into how one should understand and present parallel programs. Examples are given using several of the standard problems in the literature. Methods for proving termination and the absence of deadlock are also given.This research was partially supported by National Science Foundation grant GJ-42512. 相似文献

17.

Designing checkers for programs that run in parallel

R. Rubinfeld 《Algorithmica》1996,15(4):287-301

Program correctness for parallel programs is an even more problematic issue than for serial programs. We extend the theory of program result checking to parallel programs, and find general techniques for designing such result checkers that work for many basic problems in parallel computation. These result checkers are simple to program and are more efficient than the actual computation of the result. For example, sorting, multiplication, parity, the all-pairs shortest-path problem and majority all have constant depth result checkers, and the result checkers for all but the last problem use a linear number of processors. We show that there are P-complete problems (evaluating straight-line programs, linear programming) that have very fast, even constant depth, result checkers.This research was done while at the Computer Science Division, University of California, Berkeley, and the International Computer Science Institute, Berkeley, California. Supported in part by an IBM Graduate Fellowship and NSF Grant No. CCR 88-13632. 相似文献

18.

Retargeting sequential image-processing programs for data parallel execution

Baumstark L.B. Jr. Wills L.M. 《IEEE transactions on pattern analysis and machine intelligence》2005,31(2):116-136

New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations. 相似文献

19.

Recursive assertions and parallel programs

Krzysztof R. Apt 《Acta Informatica》1981,15(3):219-232

Summary We prove that recursive assertions are enough for proofs of parallel programs considered in Owicki and Gries [7]. In other words, we prove that for any parallel program S and recursive assertions p and q if {p} S{q} is true under the standard interpretation in natural numbers then all intermediate assertions needed in the proof can be chosen recursive. Finally, we show that if auxiliary variables are used only as program counters then the above result does not hold. 相似文献

20.

Visualizing parallel programs and performance

Hackstadt S.T. Malony A.D. 《Computer Graphics and Applications, IEEE》1995,15(4):12-14

Performance visualization uses graphical display techniques to analyze performance data and improve understanding of complex performance phenomena. Current parallel performance visualizations are predominantly two-dimensional. A primary goal of our work is to develop new methods for rapidly prototyping multidimensional performance visualizations. By applying the tools of scientific visualization, we can prototype these next-generation displays for performance visualization-if not implement them as end user tools-using existing software products and graphical techniques that physicists, oceanographers, and meteorologists have used for several years 相似文献