首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 10 毫秒
1.
Parallel execution is normally used to decrease the amount of time required to run a program. However, the parallel execution may require far more space than that required by the sequential execution. Worse yet, the parallel space requirement may be very much more difficult to predict than the sequential space requirement because there are more factors to consider. These include essentially nondeterministic factors that can influence scheduling, which in turn may dramatically influence space requirements. We survey some scheduling algorithms that attempt to place bounds on the amount of time and space used during parallel execution. We also outline a direction for future research. This direction takes us into the area of functional programming, where the declarative nature of the languages can help the programmer to produce correct parallel programs, a feat that can be difficult with procedural languages. Currently the high-level nature of functional languages can make it difficult for the programmer to understand the operational behavior of the program. We look at some of the problems in this area, with the goal of achieving a programming environment that supports correct, efficient parallel programs.  相似文献   

2.
This paper describes Parallel Proto (PProto), an integrated environment for constructing prototypes of parallel programs. Using functional and performance modeling of dataflow specifications, PProto assists in analysis of high-level software and hardware architectural tradeoffs. Facilities provided by PProto include a visual language and an editor for describing hierarchical dataflow graphs, a resource modeling tool for creating parallel architectures, mechanisms for mapping software components to hardware components, an interactive simulator for prototype interpretation, and a reuse capability. The simulator contains components for instrumenting, animating, debugging, and displaying results of functional and performance models. The Pproto environment is built on top of a substrate for managing user interfaces and database objects to provide consistent views of design objects across system tools.  相似文献   

3.
Singh  J.P. Hennessy  J.L. Gupta  A. 《Computer》1993,26(7):42-50
Models for the constraints under which an application should be scaled, including constant problem-size scaling, memory-constrained scaling, and time-constrained scaling, are reviewed. A realistic method is described that scales all relevant parameters under considerations imposed by the application domain. This method leads to different conclusions about the effectiveness and design of large multiprocessors than the naive practice of scaling only the data set size. The primary example application is a simulation of galaxies using the Barnes-Hut hierarchical N-body method  相似文献   

4.
Schroeder  B.A. 《Computer》1995,28(6):72-78
Online monitoring can complement formal techniques to increase application dependability. The article outlines the concepts and identifies the activities that comprise event based monitoring, describing several representative monitoring systems. It focuses on four areas: dependability, performance enhancement, correctness checking, and security  相似文献   

5.
The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High‐Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade‐off between failure‐free and recovery performances. In this paper we describe our contribution in fault tolerance for high‐level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure‐free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

6.
7.
Curare, the program restructurer described in this paper automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor.Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result.Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity,Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke.Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part.This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics).  相似文献   

8.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models are synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model  相似文献   

9.
《国际计算机数学杂志》2012,89(3-4):201-212
This paper is the second of a two-part series exploring the subtle correctness criterion of the absence of livelocks in parallel programs. In this paper we are concerned with the issue of proving this correctness criterion. It is shown that livelocks are not preserved by reduction, implying that reduction cannot be used directly in proving the absence of livelocks. Two applicable proof techniques are also presented. One is based on the notion of establishing sufficient conditions for livelock-freedom; the other is an extension of the well-founded set method for proving termination in sequential programs.  相似文献   

10.
The amount of memory required by a parallel program may be spectacularly larger than the memory required by an equivalent sequential program, particularly for programs that use recursion extensively. Since most parallel programs are nondeterministic in behavior, even when computing a deterministic result, parallel memory requirements may vary from run to run, even with the same data. Hence, parallel memory requirements may be both large (relative to memory requirements of an equivalent sequential program) and unpredictable. Assume that each parallel program has an underlying sequential execution order that may be used as a basis for predicting parallel memory requirements. We propose a simple restriction that is sufficient to ensure that any program that will run in n units of memory sequentially can run in mn units of memory on m processors, using a scheduling algorithm that is always within a factor of two of being optimal with respect to time. Any program can be transformed into one that satisfies the restriction, but some potential parallelism may be lost in the transformation. Alternatively, it is possible to define a parallel programming language in which only programs satisfying the restriction can be written  相似文献   

11.
Low-latency, concurrent checkpointing for parallel programs   总被引:2,自引:0,他引:2  
Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed  相似文献   

12.
Estimating and optimizing performance for parallel programs   总被引:1,自引:0,他引:1  
Fahringer  T. 《Computer》1995,28(11):47-56
The article describes P3T, a parameter-based performance prediction tool that estimates performance for parallel programs running on distributed-memory parallel architectures. P3 T has been carefully designed to address all of the above performance estimation issues. To achieve high estimation accuracy, P 3T aggressively exploits compiler analysis and optimization information. Our method is based on modeling loop iteration spaces, array access patterns, and data distributions by intersection and volume operations on n-dimensional polytopes. The most critical architecture-specific factors, such as cache line sizes, number of cache lines available, routing policy, start-up times, message transfer time per byte, and so forth, are modeled to reflect the performance impact of the target machine. P3T has been developed in the context of the Vienna Fortran Compilation Systems (VFCS), a state-of-the-art parallelization tool for distributed-memory systems. VFCS translates Fortran programs into explicitly parallel message-passing programs. P 3T successfully guides the interactive and automatic restructuring of programs under this system. The article describes the underlying compilation and programming model and discusses the most critical design decisions made for P3T; in addition, it outlines the implementation of the parallel program parameters. Also described are the VFCS context under which P3T is applied and the P3T graphical user interface  相似文献   

13.
The behaviour of programs for multiprocessors may be indeterminate, due to processor timing variations. This poses a problem for cyclic debugging, since a bug may disappear from one execution to another. Replay is an elegant solution to this problem, in which ‘sufficient’ information is recorded in a log. This information is then used to control subsequent executions of the same program so that repeatability is guaranteed. Interrupts are another source of non-determinism, even in sequential programs. This paper presents an extension of the well-known Instant Replay method, termed Interrupt Replay, for replaying programs in the presence of interrupts. The correctness of Interrupt Replay is based on the assumption that there are no interrupt races: an interrupt service routine must not access data that is also accessed by the foreground process whenever the interrupt is enabled. If such races are present then replay may fail to produce deterministic results. This assumption is similar to the basic assumption of Instant Replay that shared variables are properly protected by mutual exclusion. Also as in Instant Replay, it is assumed that the behaviour of the environment (input data, external interrupts) is replayed by some other tracing mechanism.  相似文献   

14.
An axiomatic proof technique for parallel programs I   总被引:4,自引:1,他引:3  
Summary A language for parallel programming, with a primitive construct for synchronization and mutual exclusion, is presented. Hoare's deductive system for proving partial correctness of sequential programs is extended to include the parallelism described by the language. The proof method lends insight into how one should understand and present parallel programs. Examples are given using several of the standard problems in the literature. Methods for proving termination and the absence of deadlock are also given.This research was partially supported by National Science Foundation grant GJ-42512.  相似文献   

15.
Nowadays, high performance applications exploit multiple level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such an heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into an heterogeneous HMPP + MPI exploiting multiple GPUs located on different computing nodes.  相似文献   

16.
聚合物挤出在线超声监测传感器研究   总被引:2,自引:0,他引:2  
共混聚合物在挤出过程中需要对其在螺杆挤出机机筒内的混合状态实时监测,针对这一情况,设计了由超声探头、聚醚醚酮(PEEK)缓冲杆和冷却装置组成的超声传感器,并将其应用于螺杆挤出机中聚合物的混合状态监测.文章分析了钢杆和PEEK杆超声波信噪比情况,PEEK缓冲杆具有较高的信噪比.本文的研究说明设计的传感器结构适合于较低温度的聚合物混合状态的在线监测.  相似文献   

17.
New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.  相似文献   

18.
R. Rubinfeld 《Algorithmica》1996,15(4):287-301
Program correctness for parallel programs is an even more problematic issue than for serial programs. We extend the theory of program result checking to parallel programs, and find general techniques for designing such result checkers that work for many basic problems in parallel computation. These result checkers are simple to program and are more efficient than the actual computation of the result. For example, sorting, multiplication, parity, the all-pairs shortest-path problem and majority all have constant depth result checkers, and the result checkers for all but the last problem use a linear number of processors. We show that there are P-complete problems (evaluating straight-line programs, linear programming) that have very fast, even constant depth, result checkers.This research was done while at the Computer Science Division, University of California, Berkeley, and the International Computer Science Institute, Berkeley, California. Supported in part by an IBM Graduate Fellowship and NSF Grant No. CCR 88-13632.  相似文献   

19.
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.  相似文献   

20.
Conclusion We have considered dynamic models of parallel computations and transformation algorithms for macropipelined programs that increase the internal exchange asynchronism of the parallel components. The models generalize the well-known CSP synchronous exchange paradigm (the rendezvous mechanism) [18] and provide a theoretical justification for the development of a whole range of interconnected semantic models of asynchronous computation with increasing degree of exchange asynchronism. Some publications ensure asynchronous exchange only by buffering [15, 16]. The dynamic model approach developed in this study combines buffering and analysis of data dependences of the exchange operators. It not only reduces losses associated with synchronization of communicating parallel processes, but also ensures automatic resolution of some classes of data exchange deadlocks. Experience with dynamic models as a means of increasing the efficiency of macropipelined programs and multilevel memory in multiprocessor systems [12, 17] can be applied in other programming systems and parallel program design systems of both compiling (S1, S2) and interpreting (S3) type. Translated from Kibernetika i Sistemnyi Analiz, No. 6, pp. 45–65, November–December, 1995.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号