Similar literature
 Found 20 similar documents; search took 15 ms
1.
The amount of memory required by a parallel program may be spectacularly larger than the memory required by an equivalent sequential program, particularly for programs that use recursion extensively. Since most parallel programs are nondeterministic in behavior, even when computing a deterministic result, parallel memory requirements may vary from run to run, even with the same data. Hence, parallel memory requirements may be both large (relative to memory requirements of an equivalent sequential program) and unpredictable. Assume that each parallel program has an underlying sequential execution order that may be used as a basis for predicting parallel memory requirements. We propose a simple restriction that is sufficient to ensure that any program that will run in n units of memory sequentially can run in mn units of memory on m processors, using a scheduling algorithm that is always within a factor of two of being optimal with respect to time. Any program can be transformed into one that satisfies the restriction, but some potential parallelism may be lost in the transformation. Alternatively, it is possible to define a parallel programming language in which only programs satisfying the restriction can be written.

2.
Stochastic bounds are obtained on execution times of parallel programs when the number of processors is unlimited. A parallel program is considered to consist of interdependent tasks with synchronization constraints. These constraints are described by an acyclic directed graph called a task graph. The execution times of tasks are considered to be independent and identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution time of the considered parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs having "new better than used in expectation" (NBUE) random variables as task execution times. NBUE random variables are replaced with exponential random variables of the same mean to derive this upper bound.
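
The task-graph setting in this abstract can be illustrated with a minimal sketch. It is not the paper's bounding method: it only assumes a hypothetical diamond-shaped task graph (`GRAPH`), i.i.d. exponential task times with mean `MEAN`, and unlimited processors, so the completion time is the longest path through the graph. Because that longest path is a convex function of the task times, the completion time evaluated at the mean task times is a lower bound on the expected completion time (Jensen's inequality), and a Monte Carlo estimate should land above it.

```python
import random


def makespan(graph, durations):
    """Longest path through a DAG; graph maps each task to its predecessors."""
    finish = {}

    def finish_time(task):
        # A task starts once all of its predecessors have finished.
        if task not in finish:
            start = max((finish_time(p) for p in graph[task]), default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(t) for t in graph)


def estimate_expected_makespan(graph, mean, samples=20000, seed=0):
    """Monte Carlo estimate with i.i.d. exponential task times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        durations = {t: rng.expovariate(1.0 / mean) for t in graph}
        total += makespan(graph, durations)
    return total / samples


# Hypothetical task graph (an assumption for illustration): a -> (b, c) -> d
GRAPH = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
MEAN = 1.0

# Completion time at the mean task times: a lower bound on the expectation.
lower_bound = makespan(GRAPH, {t: MEAN for t in GRAPH})
estimate = estimate_expected_makespan(GRAPH, MEAN)
```

For this particular graph the exact expectation is 1 + 1.5 + 1 = 3.5, since the maximum of two independent exponential variables with mean 1 has mean 1.5, so the estimate should fall between the lower bound of 3.0 and roughly 3.5.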

3.
Performance visualization uses graphical display techniques to analyze performance data and improve understanding of complex performance phenomena. Current parallel performance visualizations are predominantly two-dimensional. A primary goal of our work is to develop new methods for rapidly prototyping multidimensional performance visualizations. By applying the tools of scientific visualization, we can prototype these next-generation displays for performance visualization (if not implement them as end-user tools) using existing software products and graphical techniques that physicists, oceanographers, and meteorologists have used for several years.

4.
Visualizing the performance of parallel programs   Cited by: 1 (self-citations: 0, citations by others: 1)
ParaGraph, a software tool that provides a detailed, dynamic, graphical animation of the behavior of message-passing parallel programs and graphical summaries of their performance, is presented. ParaGraph animates trace information from actual runs to depict behavior and obtain the performance summaries. It provides twenty-five perspectives on the same data, lending insight that might otherwise be missed. ParaGraph's features are described, its use is explained, its software design is briefly discussed, and its displays are examined in some detail. Future work on ParaGraph is indicated.

5.
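Shang Yueqiang 《Computer Engineering and Design》2007,28(13):3100-3102,3129
Network parallel computing is one of the most important directions in the development of parallel and distributed computing. Drawing on concrete numerical experiments, this paper examines several important factors that affect the performance of PVM parallel programs in PVM-based network parallel numerical computing under the Windows operating system, including load balancing, communication overhead, network performance, task granularity, number of processors, precision requirements, and processor memory capacity. It also proposes corresponding strategies for improving the performance of PVM parallel programs so that problems can be solved quickly and efficiently.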

6.
This paper describes how Crystal, a language based on familiar mathematical notation and lambda calculus, addresses the issues of programmability and performance for parallel supercomputers. Some scientific programmers and theoreticians may ask, “What is new about Crystal?” or “How is it different from existing functional languages?” The answers lie in its model of parallel computation and a theory of parallel program optimization, which we examine in the text that follows. We illustrate the power of our approach with benchmarks of compiled parallel code from Crystal source. The target machines are hypercube multiprocessors with distributed memory, on which it is considered difficult for functional programs to achieve high efficiency.

7.
Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes. The single program, multiple data (SPMD) programming model is widely used for both high performance computing and cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed.
Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code, MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.

8.
This paper describes Parallel Proto (PProto), an integrated environment for constructing prototypes of parallel programs. Using functional and performance modeling of dataflow specifications, PProto assists in analysis of high-level software and hardware architectural tradeoffs. Facilities provided by PProto include a visual language and an editor for describing hierarchical dataflow graphs, a resource modeling tool for creating parallel architectures, mechanisms for mapping software components to hardware components, an interactive simulator for prototype interpretation, and a reuse capability. The simulator contains components for instrumenting, animating, debugging, and displaying results of functional and performance models. The PProto environment is built on top of a substrate for managing user interfaces and database objects to provide consistent views of design objects across system tools.

9.
Estimating and optimizing performance for parallel programs   Cited by: 1 (self-citations: 0, citations by others: 1)
Fahringer  T. 《Computer》1995,28(11):47-56
The article describes P3T, a parameter-based performance prediction tool that estimates performance for parallel programs running on distributed-memory parallel architectures. P3T has been carefully designed to address all of the above performance estimation issues. To achieve high estimation accuracy, P3T aggressively exploits compiler analysis and optimization information. Our method is based on modeling loop iteration spaces, array access patterns, and data distributions by intersection and volume operations on n-dimensional polytopes. The most critical architecture-specific factors, such as cache line sizes, number of cache lines available, routing policy, start-up times, message transfer time per byte, and so forth, are modeled to reflect the performance impact of the target machine. P3T has been developed in the context of the Vienna Fortran Compilation System (VFCS), a state-of-the-art parallelization tool for distributed-memory systems. VFCS translates Fortran programs into explicitly parallel message-passing programs. P3T successfully guides the interactive and automatic restructuring of programs under this system. The article describes the underlying compilation and programming model and discusses the most critical design decisions made for P3T; in addition, it outlines the implementation of the parallel program parameters. Also described are the VFCS context under which P3T is applied and the P3T graphical user interface.

10.
A technology for the iterative development of parallel programs and the corresponding development tools based on the ParJava environment are considered. A benefit of the ParJava environment is that most of the work can be done on a development computer using a model of the parallel program to be developed. This considerably reduces the overhead in terms of time and resources used. The proposed technology is illustrated using the development of a parallel program for simulating intense atmospheric vortices as an example.

11.
A key index of the performance of a rule based program used in real time monitoring and control is its response time, defined by the longest program execution time before a fixed point of the program is reached from a start state. Previous work in computing the response time bounds for rule based programs effectively assumes that all rules take the same amount of firing time. It is also assumed that if two rules are enabled, then either one of them may be scheduled first for firing. These assumptions can result in loose bounds, especially when programmers choose to impose a priority structure on the set of rules. We remove the uniform firing cost assumption and discuss how to get tighter bounds by taking rule priority information into account. We show that the rule suppression relation we previously introduced can be extended to incorporate rule priority information. A bound derivation algorithm for programs whose potential trigger relations satisfy an acyclicity condition is presented, followed by its correctness proof and an analysis example.

12.
Martonosi  M. Gupta  A. Anderson  T.E. 《Computer》1995,28(4):32-40
To improve program memory performance, programmers and compiler writers can transform the application so that its memory-referencing behavior better exploits the memory hierarchy. The challenge in achieving these program transformations is overcoming the difficulty of statically analyzing or reasoning about an application's referencing behavior and interactions. In addition, many performance-monitoring tools collect high-level information that is inadequately detailed to analyze specific memory performance bugs. We describe MemSpy, a performance-monitoring tool we designed to help programmers discern where and why memory bottlenecks occur. MemSpy guides programmers toward program transformations that improve memory performance through detailed statistics on cache-miss causes and frequency. Because of the natural link between data-reference patterns and memory performance, MemSpy helps programmers comprehend data structure and code segment interactions by displaying statistics in terms of both the program's data and code structures, rather than for code structures alone.

13.
Structural symmetries in stochastic well-formed colored Petri nets (SWNs) lead to behavioral symmetries that can be exploited by using the symbolic reachability graph (SRG) construction algorithm. The SRG allows one to compute an aggregated reachability graph (RG) and a “lumped” continuous time Markov chain (CTMC) that contain all the information needed to study the qualitative properties and the performance of the modeled system, respectively. Some models exhibit qualitative behavioral symmetries that are not completely reflected at the CTMC level. We call them quasi-lumpable SWN models. In these cases, exact performance indices can be obtained by avoiding the aggregation of those markings that are qualitatively, but not quantitatively, equivalent. An alternative approach consists of aggregating all the qualitatively equivalent states and computing approximated performance indices. In this paper, a technique is proposed to compute bounds on the performance of SWN models of this kind, using the results we have presented elsewhere. The technique is based on the Courtois and Semal bounded aggregation method.

14.
《Parallel Computing》1997,22(13):1747-1770
To provide high-level graphical support for PVM (Parallel Virtual Machine) based program development, a complex programming environment (GRADE) is being developed. GRADE currently provides tools to construct, execute, debug, monitor and visualize message-passing parallel programs. It offers a high-level graphical programming abstraction mechanism to construct parallel applications by introducing a new graphical language called GRAPNEL. GRADE also provides the programmer with the same graphical user interface during the program design and debugging stages. A distributed debugging engine (DDBG) assists the user in debugging GRAPNEL programs on distributed memory computer architectures. Tape/PVM and PROVE support the performance monitoring and visualization of parallel programs developed in the GRADE environment.

15.
16.
We have studied the interaction between process-based parallel programs whose characteristics change in various ways at run time and the operation of load-balancing, as implemented by process migration. In order to do this, we propose a simple performance model, whose parameters represent features of the program's execution such as the frequency and regularity of the changes in computational characteristics, and conduct a series of experiments involving simulated executions of synthetic programs with controlled parameter values. From these we can deduce the relative importance of the parameters from the point of view of their influence on performance. We can explain our observations in terms of a simplified stochastic model that relates local changes in load to global behaviour. We show that the dynamics of load-balancing can be represented approximately by a first-order difference equation, and that the distributed process migration algorithm is consistent with a behaviour on the global scale which can be regarded as that of a traditional feedback controller.
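
The abstract does not state its difference equation, so the following is only a generic sketch of what a first-order model of a proportional feedback controller looks like. It assumes a hypothetical load imbalance e[k] and a hypothetical migration gain, with the update e[k+1] = (1 - gain) * e[k]; the names `simulate_imbalance` and `is_stable` are illustrative and do not come from the paper.

```python
def simulate_imbalance(initial, gain, steps):
    """Iterate e[k+1] = (1 - gain) * e[k] and return the whole trajectory."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append((1.0 - gain) * trajectory[-1])
    return trajectory


def is_stable(gain):
    """The imbalance decays to zero exactly when |1 - gain| < 1."""
    return abs(1.0 - gain) < 1.0


# Hypothetical values: each balancing step removes half of the imbalance.
damped = simulate_imbalance(initial=8.0, gain=0.5, steps=3)
# damped == [8.0, 4.0, 2.0, 1.0]
```

Under this assumed form, a gain between 0 and 2 makes the imbalance shrink, while a larger gain overcorrects and the imbalance grows in magnitude from step to step.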

17.
This paper reports on the memory performance of parallel scientific algorithms, written in both pure and impure functional styles. The Id programming language is used, since it allows both pure and impure parallel functional programs to be expressed. The non-strict storage model of Id is introduced. The study focuses on two algorithms: the Dongarra Sorensen Eigensolver and the NAS FT three-dimensional heat equation solver, based on FFTs. This study verifies the claim that functional languages allow a composition of programs from modules, exploiting the inter- and intra-module parallelism without the need for rewriting these modules. But it also shows that memory use of pure functional programs can be excessive, and that impure functional programs can be as memory-efficient as imperative implementations.

18.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model.

19.
Many real‐world optimization problems in the scientific and engineering fields can be solved by genetic algorithms (GAs), but complex problems still require long execution times. At the same time, there are many under‐utilized workstations on the Internet. In this paper, we present a self‐adaptive parallel GA system named APGAIN, which utilizes the spare power of the heterogeneous workstations on the Internet to solve complex optimization problems. In order to maintain a balance between exploitation and exploration, we have devised a novel probabilistic rule‐driven adaptive model (PRDAM) to adapt the GA parameters automatically. APGAIN is implemented on an Internet Computing system called DJM. In the implementation, we discover that DJM's original load balancing strategy is insufficient. Hence the strategy is extended with the job migration capability. The performance of the system is evaluated by solving the traveling salesman problem with data from a public database. Copyright © 2003 John Wiley & Sons, Ltd.

20.
In order to improve a parallel program's performance it is critical to evaluate how evenly the work contained in a program is distributed over all processors dedicated to the computation. Traditional work distribution analysis is commonly performed at the machine level. The disadvantage of this method is that it cannot identify whether the processors are performing useful or redundant (replicated) work. The paper describes a novel method of statically estimating the useful work distribution of distributed-memory parallel programs at the program level, which carefully distinguishes between useful and redundant work. The amount of work contained in a parallel program, which correlates with the number of loop iterations to be executed by each processor, is estimated by accurately modeling loop iteration spaces, array access patterns and data distributions. A cost function defines the useful work distribution of loops, procedures and the entire program. Lower and upper bounds of the described parameter are presented. The computational complexity of the cost function is independent of the program's problem size, statement execution and loop iteration counts. As a consequence, estimating the work distribution based on the described method is considerably faster than simulating or actually compiling and executing the program. Automatically estimating the useful work distribution is fully implemented as part of P3T, which is a static parameter based performance prediction tool under the Vienna Fortran Compilation System (VFCS). The Lawrence Livermore Loops are used as a test case to verify the approach.
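
The idea of measuring work distribution by per-processor loop iteration counts can be illustrated with a small sketch. This is not P3T's cost function, which the abstract does not specify: it assumes a single loop whose iterations are block-distributed over the processors, and it uses the ratio of the largest share to the mean share as a stand-in imbalance measure. The names `block_iterations` and `imbalance` are hypothetical.

```python
def block_iterations(n_iterations, n_procs):
    """Iterations assigned to each processor under a block distribution."""
    base, extra = divmod(n_iterations, n_procs)
    # The first `extra` processors take one additional iteration each.
    return [base + (1 if p < extra else 0) for p in range(n_procs)]


def imbalance(work):
    """Largest share divided by the mean share; 1.0 means perfectly even."""
    mean = sum(work) / len(work)
    return max(work) / mean if mean else 1.0


# Hypothetical loop: 10 iterations spread over 4 processors.
shares = block_iterations(10, 4)   # [3, 3, 2, 2]
ratio = imbalance(shares)          # 3 / 2.5
```

Both functions depend only on the loop bounds and the processor count, not on executing the loop, which mirrors the abstract's point that a static estimate is much cheaper than running or simulating the program.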
