首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A common debugging strategy involves reexecuting a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of nondeterminacy, different runs on the given input may produce different results. This nonrepeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs. By tracing the order in which messages are delivered, a reexecution can be forced to deliver messages in their original order, reproducing the original execution. To reduce the overhead of such a scheme, we show that the delivery'order of only messages involved inraces need be traced (and not every message). Our technique makes run-time decisions to detect and trace racing messages and is usuallyoptimal in the sense that the minimal number of racing messages is traced. Experiments indicate that only 1% of the messages are often traced, gaining a reduction of two orders of magnitude over traditional techniques that trace every message. These traces allow an execution to be reproduced any number of times for debugging. Our work is novel in that we adaptively decide what to trace, and trace only those messages that introduce nondeterminacy. With our strategy, large reductions in trace size allow long-running programs to be replayed that were previously unmanageable. In addition, the reduced tracing requirements alleviate tracing bottle-necks, allowing executions to be debugged with substantially lower execution time overhead.This work was supported in part by National Science Foundation grants CCR-8815928 and CCR-9100968, Office of Naval Research grant N00014-89-J-1222, and a grant from Sequent Computer Systems, Inc.  相似文献   

2.
We describe a novel technique for bounded analysis of asynchronous message-passing programs with ordered message queues. Our bounding parameter does not limit the number of pending messages, nor the number of “context-switches” between processes. Instead, we limit the number of process communication cycles, in which an unbounded number of messages are sent to an unbounded number of processes across an unbounded number of contexts. We show that remarkably, despite the potential for such vast exploration, our bounding scheme gives rise to a simple and efficient program analysis by reduction to sequential programs. As our reduction avoids explicitly representing message queues, our analysis scales irrespectively of queue content and variation.  相似文献   

3.
针对分布内存结构的并行化将串行程序转变为在各处理节点上运行的SPMD并行程序,节点程序包含该节点所执行的运算和与其它节点交换信息的通信操作。讨论了在已知数据分解和计算划分的前提下生成分布内存结构下的消息传递并行程序的算法,以Lam提出的线性不等式基本框架为基础,在Paraguin工作基础上进行了有效的改进:第一在代码生成算法中引入了数据分布;第二将处理器空间由一维扩展到多维;第三将虚拟处理器到物理处理器的映射关系引入代码生成算法,从而减少了节点间通信的数量,提高了生成并行代码的性能。  相似文献   

4.
Employing genetic algorithms to generate test data for path coverage has been an important method in software testing. Previous work, however, is suitable mainly for serial programs. Automatic test data generation for path coverage of message-passing parallel programs without non-determinacy is investigated in this study by using co-evolutionary genetic algorithms. This problem is first formulated as a single-objective optimization problem, and then a novel co-evolutionary genetic algorithm is proposed to tackle the formulated optimization problem. This method employs the alternate co-evolution of two kinds of populations to generate test data that meet path coverage. The proposed method is applied to seven parallel programs, and compared with the other three methods. The experimental results show that the proposed method has the best success rate and the least number of evaluated individuals and time consumption.  相似文献   

5.
Today's massively parallel machines are typically message-passing systems consisting of hundreds or thousands of processors. Implementing parallel applications efficiently in this environment is a challenging task, and poor parallel design decisions can be expensive to correct. Tools and techniques that allow the fast and accurate evaluation of different parallelization strategies would significantly improve the productivity of application developers and increase throughput on parallel architectures. This paper investigates one of the major issues in building tools to compare parallelization strategies: determining what type of performance models of the application code and of the computer system are sufficient for a fast and accurate comparison of different strategies. The paper is built around a case study employing the performance prediction tool (PerPreT) to predict performance of the parallel spectral transform shallow water model code (PSTSWM) on the Intel Paragon. PSTSWM is a parallel application code that was designed to evaluate different parallel strategies for the spectral transform method as it is used in climate modeling and weather forecasting. Multiple parallel algorithms and algorithm variants are embedded in the code. PerPreT uses a relatively simple algebraic model to predict execution time for SPMD (single program multiple data) parallel applications. Applications are modeled through parameterized formulae for communication and computation, where the parameters include the problem size, the number of processors used to execute the program, and system characteristics (e.g. setup times for communication, link bandwidth and sustained computing performance per processor). In this paper we describe performance models that predict the performance of the different algorithms in PSTSWM accurately enough to allow them to be compared, establishing the feasibility of such a demanding application of performance modeling. We also discuss issues in generating and validating the performance models, emphasizing the practical importance of tools such as PerPreT in such studies. © 1998 John Wiley & Sons, Ltd.  相似文献   

6.
Large-scale scientific and engineering computation problems are usually complex and consequently the development of parallel programs for solving these problems is a difficult task. In this paper, we describe the graph-oriented programming (GOP) model and environment for building and evaluating parallel applications. The GOP model provides higher level abstractions for message-passing parallel programming and the software environment offers tools which can ease programmers for parallelizing, writing, and deploying scientific and engineering computing applications. We discuss the motivations and various issues in developing the model and the software environment, present the design of the system architecture and the components, and describe the evaluation of the environment implemented on top of MPI with a sample parallel scientific application program. With the support of the high-level abstractions provided by the proposed GOP environment, programming of parallel applications on various parallel architectures can be greatly simplified.  相似文献   

7.
8.
In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP and LogGP models to account for the impact of network contention and network interface DMA behavior on the performance of message passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50 percent of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify trade-offs between synchronous and asynchronous message passing styles  相似文献   

9.
Modern multicomputer components such as the inmos T800 and the Texas Instruments TMS320C40 provide hardware which supports efficient communication between directly connected processors. The paper describes techniques which have been developed to provide efficient global communications for machines based on these and similar processors. This service is implemented by a packet-routing kernel adopting a standard, portable communications interface. The design of this kernel is discussed with particular emphasis on the manner in which it guarantees reliable, lightweight, deadlock-free communication for arbitrary interconnection topologies. The kernel has been developed and tested on transputers. Extensive results on the performance of the router are presented to demonstrate that the substantial advantages associated with its well founded, structured design are not at odds with high efficiency. This router has been used to support several multi-transputer environments, including a commercially available POSIX-conformant operating system and a distributed occam environment freed from conventional configuration constraints.  相似文献   

10.
This paper discusses the compile time task scheduling of parallel program running on cluster of SMP workstations. Firstly, the problem is stated formally and transformed into a graph parti-tion problem and proved to be NP-Complete. A heuristic algorithm MMP-Solver is then proposed to solve the problem. Experiment result shows that the task scheduling can reduce communication over-head of parallel applications greatly and MMP-Solver outperforms the existing algorithms.  相似文献   

11.
F-MPJ: scalable Java message-passing communications on parallel systems   总被引:1,自引:0,他引:1  
This paper presents F-MPJ (Fast MPJ), a scalable and efficient Message-Passing in Java (MPJ) communication middleware for parallel computing. The increasing interest in Java as the programming language of the multi-core era demands scalable performance on hybrid architectures (with both shared and distributed memory spaces). However, current Java communication middleware lacks efficient communication support. F-MPJ boosts this situation by: (1) providing efficient non-blocking communication, which allows communication overlapping and thus scalable performance; (2) taking advantage of shared memory systems and high-performance networks through the use of our high-performance Java sockets implementation (named JFS, Java Fast Sockets); (3) avoiding the use of communication buffers; and (4) optimizing MPJ collective primitives. Thus, F-MPJ significantly improves the scalability of current MPJ implementations. A performance evaluation on an InfiniBand multi-core cluster has shown that F-MPJ communication primitives outperform representative MPJ libraries up to 60 times. Furthermore, the use of F-MPJ in communication-intensive MPJ codes has increased their performance up to seven times.  相似文献   

12.
We develop a parallel algorithm for partitioning the vertices of a graph intop2 sets in such a way that few edges connect vertices in different sets. The algorithm is intended for a message-passing multiprocessor system, such as the hypercube, and is based on the Kernighan-Lin algorithm for finding small edge separators on a single processor.(1) We use this parallel partitioning algorithm to find orderings for factoring large sparse symmetric positive definite matrices. These orderings not only reduce fill, but also result in good processor utilization and low communication overhead during the factorization. We provide a complexity analysis of the algorithm, as well as some numerical results from an Intel hypercube and a hypercube simulator.Publication of this report was partially supported by the National Science Foundation under Grant DCR-8451385 and by AT&T Bell Laboratories through their Ph.D scholarship program.  相似文献   

13.
D.A.  P.D. 《Performance Evaluation》2005,60(1-4):165-187
We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results, for example, parallel programs to show the power and accuracy of the methodology.  相似文献   

14.
We present a parallel algorithm for computing an optimal sequence alignment in efficient space. The algorithm is intended for a message-passing architecture with one-dimensional-array topology. The algorithm computes an optimal alignment of two sequences of lengthsM andN inO((M+N) 2 /P) time andO((M+N)/P) space per processor, where the number of processors isP>=max(M, N). Thus, whenP=max(M, N) it achieves linear speedup and requires constant space per processor. Some experimental results on an Intel hypercube are provided.This research was supported by NIH Grant LM05110 from the National Library of Medicine.  相似文献   

15.
16.
A message-passing class library C++ for portable parallel programming   总被引:1,自引:1,他引:0  
An object-oriented message-passing class library in C++, called PPI++, for portable parallel programming has been developed. PPI++ (parallel portability interface in C++) is designed to serve as a stable (unchanging) interface between the client parallel code and the rapidly evolving distributed computing environments. By taking advantage of encapsulation, inheritance, and polymorphism supported by C++, PPI++ provides a clean and consistent programming interface, which helps improve the clarity and expressiveness of client parallel codes and hides implementation details and complexity from the user to ease parallel programming tasks. In addition, the use of strong type-checking in C++ allows the detection of potential misuses of the library at compile time, and thus promotes code reliability. This paper describes the object-oriented design and implementation of PPI++. Evaluation of PPI++ on important performance issues, such as portability, ease-of-use, extensibility, and efficiency, is also discussed.  相似文献   

17.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models are synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model  相似文献   

18.
Scheduling the tasks of a parallel algorithm onto a network of processors to minimize the completion time of the task graph is an NP-hard problem, and heuristic methods are commonly used to solve this problem. Published works in this area, however, do not take advantage of the following aspects of the problem: (i) the availability of the full knowledge of the data that is being transferred during inter-task communication, and (ii) the availability of full duplex high-speed communication links in many multiprocessors (such as transputers). The scheduling approach presented in this paper, the data token heuristic (DTH) approach, exploits the above features, leading to a reduced schedule length. This is achieved by checking the pool of data tokens in the processors, and routing the required data token to the processor through the dynamic shortest path. The DTH approach is then used to find the best transputer network topology that gives the minimum schedule length for the parallel implementation of the Kalman algorithm. Quantitative results of scheduling the Kalman algorithm on a 4-transputer network with T-805 transputers are presented.  相似文献   

19.
《Performance Evaluation》1987,7(2):111-124
We analyze a probabilistic model of two processors sharing a common task, such as distributed simulation, proceeding at possibly different rates and with each allowed its own clock and ‘virtual time’. When the task requires it, the processors communicate by messages which are stamped with the local time at the point of origination. This feature is also used to synchronize the processors by ‘rollback’, in which technique a processor receiving a message in its past is required to set its clock and state back as necessary to guarantee correctness. The probabilistic model incorporates for each processor: (i) a distribution for the amount by which the local clock is incremented at each state transition, and (ii) a probability indicating the likelihood of generating a message at each state transition.For the case of exponential distributions we obtain by the method of Wiener-Hopf factorization explicit formulas for the steady-state distribution of the difference in virtual times, the average rate of progress and the average amount of rollback per state transition. We introduce a performance measure which reflects both the average rate of progress and the cost of implementing rollback. This performance measure is optimized with respect to the system parameters and an explicit solution is obtained. This result provides remarkably explicit insights into optimal operations.  相似文献   

20.
《国际计算机数学杂志》2012,89(3-4):201-212
This paper is the second of a two-part series exploring the subtle correctness criterion of the absence of livelocks in parallel programs. In this paper we are concerned with the issue of proving this correctness criterion. It is shown that livelocks are not preserved by reduction, implying that reduction cannot be used directly in proving the absence of livelocks. Two applicable proof techniques are also presented. One is based on the notion of establishing sufficient conditions for livelock-freedom; the other is an extension of the well-founded set method for proving termination in sequential programs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号