首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Parallel programs present some features such as concurrency, communication and synchronization that make the test a challenging activity. Because of these characteristics, the direct application of traditional testing is not always possible and adequate testing criteria and tools are necessary. In this paper we investigate the challenges of validating message‐passing parallel programs and present a set of specific testing criteria. We introduce a family of structural testing criteria based on a test model. The model captures control and data flow of the message‐passing programs, by considering their sequential and parallel aspects. The criteria provide a coverage measure that can be used for evaluating the progress of the testing activity and also provide guidelines for the generation of test data. We also describe a tool, called ValiPar, which supports the application of the proposed testing criteria. Currently, ValiPar is configured for parallel virtual machine (PVM) and message‐passing interface (MPI). Results of the application of the proposed criteria to MPI programs are also presented and analyzed. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

2.
《Parallel Computing》1997,23(10):1405-1420
Performance prediction is necessary in order to deal with multi-dimensional performance effects on parallel systems. The compiler-generated analytical model developed in this paper accounts for the effects of cache behavior, CPU execution time and message passing overhead for real programs written in high level data-parallel languages. The performance prediction technique is shown to be effective in analyzing several non-trivial data-parallel applications as the problem size and number of processors vary. We leverage technology from the Maple symbolic manipulation system and the S-PLUS statistical package in order to present users with critical performance information necessary for performance debugging, architectural enhancement and procurement of parallel systems. The usability of these results is improved through specifying confidence intervals as well as predicted execution times for data-parallel applications.  相似文献   

3.
Parallel programming is orders of magnitudes more complex than writing sequential programs. This is particularly true for programming distributed memory multiprocessor architectures based on message passing programming models. Apart from understanding the sequential parts of the parallel program, new degrees of freedom lead to additional problems. Understanding the synchronization and communication behavior of parallel programs is the most critical issue in programming distributed memory multiprocessors. The paper describes methods and tools for visualization and animation of the dynamic execution of parallel programs. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel Systems) is presented as part of the integrated tool environment TOPSY S (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports the interactive on-line visualization of message passing programs based on various views; in particular, a process graph based concurrency view for detecting synchronization and communication bugs.  相似文献   

4.
Several large real‐world applications have been developed for distributed and parallel architectures. We examine two different program development approaches. First, the usage of a high‐level programming paradigm which reduces the time to create a parallel program dramatically but sometimes at the cost of a reduced performance; a source‐to‐source compiler, has been employed to automatically compile programs—written in a high‐level programming paradigm—into message passing codes. Second, a manual program development by using a low‐level programming paradigm—such as message passing—enables the programmer to fully exploit a given architecture at the cost of a time‐consuming and error‐prone effort. Performance tools play a central role in supporting the performance‐oriented development of applications for distributed and parallel architectures. SCALA—a portable instrumentation, measurement, and post‐execution performance analysis system for distributed and parallel programs—has been used to analyze and to guide the application development, by selectively instrumenting and measuring the code versions, by comparing performance information of several program executions, by computing a variety of important performance metrics, by detecting performance bottlenecks, and by relating performance information back to the input program. We show several experiments of SCALA when applied to real‐world applications. These experiments are conducted for a NEC Cenju‐4 distributed‐memory machine and a cluster of heterogeneous workstations and networks. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

5.
This paper considers a model of a parallel program that can efficiently be interpreted on an instrumental computer, providing a means for a sufficiently accurate prediction of the actual time needed for execution of the parallel program on a given parallel computational system. The model is designed for parallel programs with explicit message passing written in Java with calls to the MPI library and is a part of the ParJava environment. The model is obtained by transforming the program control tree, which, for Java programs, can be constructed by modifying the abstract syntax tree. The communication functions are simulated on the basis of the LogGP model, which makes it possible to take into account specific features of the distributed computational system.  相似文献   

6.
We proposed in Ref. 5) a new,message-oriented implementation technique for Moded Flat GHC that compiled unification for data transfer into message passing. The technique was based on constraint-based program analysis, and significantly improved the performance of programs that used goals and streams to implement reconfigurable data structures. In this paper we discuss how the technique can be parallelized. We focus on a method for shared-memory multiprocessors, called theshared-goal method, though a different method could be used for distributed-memory multiprocessors. Unlike other parallel implementations of concurrent logic languages which we callprocess-oriented, the unit of parallel execution is not an individual goal but a chain of message sends caused successively by an initial message send. Parallelism comes from the existence of different chains of message sends that can be executed independently or in a pipelined manner. Mutual exclusion based on busy waiting and on message buffering controls access to individual, shared goals. Typical goals allowlast-send optimization, the message-oriented counterpart of last-call optimization. We have built an experimental implementation on Sequent Symmetry. In spite of the simple scheduling currently adopted, preliminary evaluation shows good parallel speedup and good absolute performance for concurrent operations on binary process trees.  相似文献   

7.
Benchmark evaluation of the IBM SP2 for parallel signal processing   总被引:1,自引:0,他引:1  
This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library (MPL) through STAP (Space-Time Adaptive Processing) benchmark experiments. Only coarse-grain parallelism was exploited on the SP2 due to its high communication overhead. A new parallelization scheme is developed for programming message passing multicomputers. Parallel STAP benchmark structures are illustrated with domain decomposition, efficient mapping of partitioned programs, and optimization of collective communication operations. We measure the SP2 performance in terms of execution time, Gflop/s rate, speedup over a single SP2 node, and overall system utilization. With 256 nodes, the Maul SP2 demonstrated the best performance of 23 Gflop/s in executing the High-Order Post-Doppler program, corresponding to a 34% system utilization. We have conducted a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size. Important lessons learned from these parallel processing benchmark experiments are discussed in the context of real-time, adaptive, radar signal processing on massively parallel processors (MPP)  相似文献   

8.
The Computer Aided Parallelisation Tools (CAPTools) [Ierotheou, C, Johnson SP, Cross M, Leggett PF, Computer aided parallelisation tools (CAPTools)—conceptual overview and performance on the parallelisation of structured mesh codes, Parallel Computing, 1996;22:163–195] is a set of interactive tools aimed to provide automatic parallelisation of serial FORTRAN Computational Mechanics (CM) programs. CAPTools analyses the user's serial code and then through stages of array partitioning, mask and communication calculation, generates parallel SPMD (Single Program Multiple Data) messages passing FORTRAN.The parallel code generated by CAPTools contains calls to a collection of routines that form the CAPTools communications Library (CAPLib). The library provides a portable layer and user friendly abstraction over the underlying parallel environment. CAPLib contains optimised message passing routines for data exchange between parallel processes and other utility routines for parallel execution control, initialisation and debugging. By compiling and linking with different implementations of the library, the user is able to run on many different parallel environments.Even with today's parallel systems the concept of a single version of a parallel application code is more of an aspiration than a reality. However for CM codes the data partitioning SPMD paradigm requires a relatively small set of message-passing communication calls. This set can be implemented as an intermediate ‘thin layer’ library of message-passing calls that enables the parallel code (especially that generated automatically by a parallelisation tool such as CAPTools) to be as generic as possible.CAPLib is just such a ‘thin layer’ message passing library that supports parallel CM codes, by mapping generic calls onto machine specific libraries (such as CRAY SHMEM) and portable general purpose libraries (such as PVM an MPI). This paper describe CAPLib together with its three perceived advantages over other routes:
  • •as a high level abstraction, it is both easy to understand (especially when generated automatically by tools) and to implement by hand, for the CM community (who are not generally parallel computing specialists);
  • •the one parallel version of the application code is truly generic and portable;
  • •the parallel application can readily utilise whatever message passing libraries on a given machine yield optimum performance.
  相似文献   

9.
PARC++ is a system that supports object-oriented parallel programming in C++. PARC++ provides the user with a set of predefined C++ classes that can easily be used for the construction of parallel C++ programs. With the help of PARC++ objects, the programmer is able to create and start new processes (threads), to synchronize their activities (Blocklock, Monitor) and to manage communication via message passing (Mailbox). PARC++ is written in C++ and currently runs on top of the EMEX operating system on a FORCE machine with 11 processing elements and an EDS (European Declarative System) with 28 processing elements. The paper also contains information about the run-time system model, the implementation and some performance measurements.  相似文献   

10.
Advanced application domains such as computer-aided design, computer-aided software engineering, and office automation are characterized by their need to store, retrieve, and manage large quantities of data having complex structures. A number of object-oriented database management systems (OODBMS) are currently available that can effectively capture and process the complex data. The existing implementations of OODBMS outperform relational systems by maintaining and querying cross-references among related objects. However, the existing OODBMS still do not meet the efficiency requirements of advanced applications that require the execution of complex queries involving the retrieval of a large number of data objects and relationships among them. Parallel execution can significantly improve the performance of complex OO queries. In this paper, we analyze the performance of parallel OO query processing algorithms for various benchmark application domains. The application domains are characterized by specific mixes of queries of different semantic complexities. The performance of the application domains has been analyzed for various system and data parameters by running parallel programs on a 32-node transputer based parallel machine developed at the IBM Research Center at Yorktown Heights. The parallel processing algorithms, data routing techniques, and query management and control strategies have been implemented to obtain accurate estimation of controlling and processing overheads. However, generation of large complex databases for the study was impractical. Hence, the data used in the simulation have been parameterized. The parallel OO query processing algorithms analyzed in this study are based on a query graph approach rather than the traditional query tree approach. Using the query graph approach, a query is processed by simultaneously initiating the execution at several object classes, thereby, improving the parallelism. During processing, the algorithms avoid the execution of time-consuming join operations by making use of the object references among the objects. Further, the algorithms do not generate any temporary data, thereby, reducing disk accesses. This is accomplished by marking the selected objects and by employing a two-phase query processing strategy.  相似文献   

11.
《Parallel Computing》1997,22(13):1837-1851
The PAPS (Performance Analysis of Parallel Systems) toolset is a testbed for the model based performance prediction of message passing parallel applications executed on private memory multiprocessor computer systems. PAPS allows to describe the execution behavior of the computer hardware and operating system software resources up to a very detailed level. This enables very accurate performance prediction of parallel applications even in the case of substantial performance degradation due to contention for shared resources. In this paper the fundamental design principles and implementation methodologies for the development of the PAPS toolset are presented and the PAPS parallel system specification formalisms are described. A simplified performance study of a parallel Gaussian elimination application on the nCUBE 2 multiprocessor system is used to demonstrate the usage of the tool.  相似文献   

12.
In this paper we present a new environment called MERPSYS that allows simulation of parallel application execution time on cluster-based systems. The environment offers a modeling application using the Java language extended with methods representing message passing type communication routines. It also offers a graphical interface for building a system model that incorporates various hardware components such as CPUs, GPUs, interconnects and easily allows various formulas to model execution and communication times of particular blocks of code. A simulator engine within the MERPSYS environment simulates execution of the application that consists of processes with various codes, to which distinct labels are assigned. The simulator runs one Java thread per label and scales computations and communication times adequately. This approach allows fast coarse-grained simulation of large applications on large-scale systems. We have performed tests and verification of results from the simulator for three real parallel applications implemented with C/MPI and run on real HPC clusters: a master-slave code computing similarity measures of points in a multidimensional space, a geometric single program multiple data parallel application with heat distribution and a divide-and-conquer application performing merge sort. In all cases the simulator gave results very similar to the real ones on configurations tested up to 1000 processes. Furthermore, it allowed us to make predictions of execution times on configurations beyond the hardware resources available to us.  相似文献   

13.
This research defines and analyzes a methodology for deriving a performance model for SPMD hybrid parallel applications. Hybrid parallelism combines shared memory and message passing computing models. This work extends the current practice of application performance modelling by development of a methodology for hybrid applications with these procedures.
  • Creation of a model based on complexity analysis of an application code and its data structures.
  • Enhancement of a static complexity model by dynamic factors to capture execution time phenomena, such as memory hierarchy effects.
  • Quantitative analysis of model characteristics and the effects of perturbations in measured parameters.
These research results are presented in the context of a hybrid parallel implementation of a sparse linear algebra kernel. A model for this kernel is derived and analyzed using the methodology. Application of the model on two large parallel computing platforms provides case studies for the methodology. Operating system issues, machine balance factor, and memory hierarchy effects on model accuracy are examined. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

14.
The VMMP (virtual machine for multiprocessors) software package is presented. It provides a coherent set of services for parallel application programs running on diverse multiple input multiple data (MIMD) multiprocessors, including shared memory and message passing multiprocessors. The communication, synchronization, and data distribution requirements of parallel algorithms are analyzed. Related languages and tools are described. VMMP services are identified. VMMP implementation, coding and portability are discussed. Some measurements of the performance of VROMP application programs and VMMP overhead are given. Several hints for improving the performance of application programs are described  相似文献   

15.
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Eleven representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. Based on the measurement results, both the computation and communication behavior of these parallel programs were investigated. The various time interval distributions were modeled by statistical functions which were verified by a nonlinear regression technique using the empirical data. The temporal and spatial localities of message destinations were also studied. A model for the temporal locality of message length was introduced and used to analyze the communication traces. A trace-drive simulation environment, which uses the communication patterns of the parallel programs as inputs, was developed to study the behavior of the communication hardware under real workload. Simulation results on DMA and link utilizations are reported  相似文献   

16.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessor (transputer network and 64-processor IBM SP2). It is accompanied with detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.  相似文献   

17.
The exploitation of parallelism in a message passing platform implies a previous modeling phase of the parallel application as a task graph, which properly reflects its temporal behavior. In this paper, we analyze the classical task graph models of the literature and their drawbacks when modeling message passing programs with an arbitrary task structure. We define a new task graph model called temporal task interaction graph (TTIG) that integrates the classical models used in the literature. The TTIG allows us to explicitly capture the ability of concurrency of adjacent tasks for applications where adjacent tasks can communicate at any point inside them. A mapping strategy is developed from this model, which minimizes the expected execution time by properly exploiting task parallelism. The effectiveness of this approach has been proved in different experimentation scopes for a wide range of message passing applications.  相似文献   

18.
D.A.  P.D. 《Performance Evaluation》2005,60(1-4):165-187
We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results, for example, parallel programs to show the power and accuracy of the methodology.  相似文献   

19.
Solving a system of linear simultaneous equations representing an electrical circuit is one of the most time consuming tasks for large scale circuit simulations. In order to facilitate a multiprocessor implementation of the circuit simulation program SPICE, decomposition algorithms should be employed to partition a sparse matrix equation of the overall circuit into a number of subcircuit equations for parallel processing. In this paper, the performance of a parallel direct method matrix equation solving routine was studied in several contexts: the theoretical lower bound on performance was derived and the tradeoff between parallelism and communication is presented; various implementation and performance tuning issuing is also reported. This routine is written in such a manner that the data structure is compatible with SPICE Version 3c1. The speedup obtained from the simulation of two test circuits on a message passing multiprocessor system built on Transputers will be reported. Finally, the factors affecting the performance of the multiprocessor system are outlined and the overheads affecting the system performance in the implementation are identified.  相似文献   

20.
New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号