首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Research on multiprocessor interconnection networks has primarily focused on wormhole switching, virtual channel flow control, and routing algorithms to enhance their performance. The rationale behind this research is that by alleviating the network latency for high network loads, the overall system performance would improve; many studies have used synthetic workloads to support this claim. However, such workloads may not necessarily capture the behavior of real applications. In this paper, we have used parallel applications for a closer examination of the network behavior. In particular, the performance benefit from enhancing a 2D mesh with virtual channels (VCs) and a fully adaptive routing algorithm is examined with a set of shared-memory and message passing applications. Execution time and average message latency of shared memory applications are measured using execution-driven simulation and by varying many architectural attributes that affect the network workload. The communication traces of message passing applications, collected on an IBM-SP2, are used to run a trace-driven simulation of the mesh architecture to obtain message latency. Simulation results show that VCs and adaptive routing can reduce the network latency to varying degrees depending on the application. However, these modest benefits do not translate to significant improvements in the overall execution time because the load on the network is not high enough to exploit the advantages of the network enhancements. Moreover, this benefit may be negated if the architectural enhancements increase the network cycle time. Rather, emphasis should be placed on improving the raw network bandwidth and faster network interfaces  相似文献   

2.
We consider the time-dependent demands for data movement that a parallel program makes on the architecture that executes it. The result is an architecture-independent metric that represents the temporal behavior of data-movement requirements. Programs are described as series of computations and data movements, and while message passing is not ruled out, we focus on explicit parallel programs using a fixed number of processes in a distributed shared-memory environment. Operations are assumed to be explicitly allocated to processors when the metric is applied, which might correspond to intermediate code in a parallelizing compiler. The metric is called the interprocess read (IR) temporal metric. A key to developing an architecture-independent temporal metric is modeling program execution time in an architecture-independent way. This is possible because well-synchronized parallel programs make coordinated progress above a certain level of granularity. Our execution time characterization takes into account barrier synchronization and critical sections. We illustrate the metric using instruction count on simple code fragments and then from multiprocessor program traces (Splash benchmarks). Results of running the benchmarks on simulated network architectures show that the IR metric for the time scale of network response predicts performance better than whole program measures.  相似文献   

3.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models are synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model  相似文献   

4.
多任务环境—并行处理仿真中的核心模块   总被引:2,自引:0,他引:2  
  相似文献   

5.
In this paper, we describe the traffic simulation system which analyzes a traffic flow in a road network. The system is constructed and operated on the parallel computer AP1000. A database of a road network is implemented with the object-oriented programming technique. Elements of a road network are regarded as objects, and vehicles which move on the lanes are regarded as attributive data. The database is divided into sub-databases and assigned to each processor. The system calculates the behavior of vehicles in parallel. Sending and receiving data among the objects are carried out with message passing on the communication network. Moreover, we show the results of the vehicles behavior, and evaluate the parallel efficiency.  相似文献   

6.
The exploitation of parallelism in a message passing platform implies a previous modeling phase of the parallel application as a task graph, which properly reflects its temporal behavior. In this paper, we analyze the classical task graph models of the literature and their drawbacks when modeling message passing programs with an arbitrary task structure. We define a new task graph model called temporal task interaction graph (TTIG) that integrates the classical models used in the literature. The TTIG allows us to explicitly capture the ability of concurrency of adjacent tasks for applications where adjacent tasks can communicate at any point inside them. A mapping strategy is developed from this model, which minimizes the expected execution time by properly exploiting task parallelism. The effectiveness of this approach has been proved in different experimentation scopes for a wide range of message passing applications.  相似文献   

7.
Parallel programming is orders of magnitudes more complex than writing sequential programs. This is particularly true for programming distributed memory multiprocessor architectures based on message passing programming models. Apart from understanding the sequential parts of the parallel program, new degrees of freedom lead to additional problems. Understanding the synchronization and communication behavior of parallel programs is the most critical issue in programming distributed memory multiprocessors. The paper describes methods and tools for visualization and animation of the dynamic execution of parallel programs. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel Systems) is presented as part of the integrated tool environment TOPSY S (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports the interactive on-line visualization of message passing programs based on various views; in particular, a process graph based concurrency view for detecting synchronization and communication bugs.  相似文献   

8.
A practical methodology for evaluating and comparing the performance of distributed memory Multiple Instruction Multiple Data (MIMD) systems is presented. The methodology determines machine parameters and program parameters separately, and predicts the performance of a given workload on the machines under consideration. Machine parameters are measured using benchmarks that consist of parallel algorithm structures. The methodology takes a workload-based approach in which a mix of application programs constitutes the workload. Performance of different systems are compared, under the given workload, using the ratio of their speeds. In order to validate the methodology, an example workload has been constructed and the time estimates have been compared with the actual runs, yielding good predicted values. Variations in the workload are analysed in terms of increase in problem sizes and changes in the frequency of particular algorithm groups. Utilization and scalability are used to compare the systems when the number of processors is increased. It has been shown that performance of parallel computers is sensitive to the changes in the workload and therefore any evaluation and comparison must consider a given user workload. Performance improvement that can be obtained by increasing the size of a distributed memory MIMD system depends on the characteristics of the workload as well as the parameters that characterize the communication speed of the parallel system.  相似文献   

9.
A popular approach to providing nonexperts in parallel computing with an easy-to-use programming model is to design a software library consisting of a set of preparallelized routines, and hide the intricacies of parallelization behind the library's API. However, for regular domain problems (such as simple matrix manipulations or low-level image processing applications-in which all elements in a regular subset of a dense data field are accessed in turn) speedup obtained with many such library-based parallelization tools is often suboptimal. This is because interoperation optimization (or: time-optimization of communication steps across library calls) is generally not incorporated in the library implementations. We present a simple, efficient, finite state machine-based approach for communication minimization of library-based data parallel regular domain problems. In the approach, referred to as lazy parallelization, a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary. Apart from being simple and cheap, lazy parallelization guarantees to generate legal, correct, and efficient parallel programs at all times. The effectiveness of the approach is demonstrated by analyzing the performance characteristics of two typical regular domain problems obtained from the field of low-level image processing. Experimental results show significant performance improvements over nonoptimized parallel applications. Moreover, obtained communication behavior is found to be optimal with respect to the abstraction level of message passing programs.  相似文献   

10.
We present a fast parallel SAT-solver on a message based MIMD machine. The input formula is dynamically divided into disjoint subformulas. Small subformulas are solved by a fast sequential SAT-solver running on every processor, which is based on the Davis-Putnam procedure with a special heuristic for variable selection. The algorithm uses optimized data structures to modify Boolean formulas. Additionally efficient workload balancing algorithms are used, to achieve a uniform distribution of workload among the processors. We consider the communication network topologiesd-dimensional processor grid and linear processor array. Tests with up to 256 processors have shown very good efficiency-values (>0.95).This research was supported by the Federal State of Nordrhein-Westfalen in theForschungsverbund Paralleles Rechnen, Az., IV A3-107 021 91.  相似文献   

11.
The performance evaluation of large file systems, such as storage and media streaming, motivates scalable generation of representative traces. We focus on two key characteristics of traces, popularity and temporal locality. The common practice of using a system-wide distribution obscures per-object behavior, which is important for system evaluation. We propose a model based on delayed renewal processes which, by sampling interarrival times for each object, accurately reproduces popularity and temporal locality for the trace. A lightweight version reduces the dimension of the model with statistical clustering. It is workload-agnostic and object type-aware, suitable for testing emerging workloads and ‘what-if’ scenarios. We implemented a synthetic trace generator and validated it using: (1) a Big Data storage (HDFS) workload from Yahoo!, (2) a trace from a feature animation company, and (3) a streaming media workload. Two case studies in caching and replicated distributed storage systems show that our traces produce application-level results similar to the real workload. The trace generator is fast and readily scales to a system of 4.3 million files. It outperforms existing models in terms of accurately reproducing the characteristics of the real trace.  相似文献   

12.
A model of parallel program that can be effectively interpreted on the development computer guaranteeing the possibility of a sufficiently precise prediction of real run time for a simulated parallel program at the prescribed computer system is studied. The model is worked out for parallel programs with explicit message passing written in the Java language with MPI library access and is included into the composition of ParJava environment. The model is obtained by transforming the program control tree that can be constructed for Java programs by modifying the abstract syntax tree. To model communication functions, the model LogGP is used which allows taking into consideration the specific character of the communication network of the distributed computer system.  相似文献   

13.
In this paper we present a novel approach, based on the integration of static program analysis and simulation techniques, for the performance prediction of message passing programs. PS, a simulator of PVM applications developed in the last years by our research group, is fed with traces collected by executing the parallel program to be analyzed in quasi-concurrent mode on a single workstation. Since this process is typically a non negligible part of the simulation complexity, we have devised a technique based on static analysis and code restructuring for significantly speeding up the trace generation. We show how, by statically analyzing and restructuring the program, it is possible to obtain a simplified code (shrinked code) to be run for collecting a reduced version of the traces.  相似文献   

14.
A parallel computational implementation of modern meshless system is presented for explicit for 3D bulk forming simulation problems. The system is implemented by reproducing kernel particle method. Aspects of a coarse grain parallel paradigm—domain decompose method—are detailed for a Lagrangian formulation using model partitioning. Integration cells are uniquely assigned on each process element and particles are overlap in boundary zones. Partitioning scheme multilevel recursive spectrum bisection approach is applied. The parallel contact search algorithm is also presented. Explicit message passing interface statements are used for all communication among partitions on different processors. The parallel 3D system is developed and implemented into 3D bulk metal forming problems, and the simulation results demonstrated the efficiency of the developed parallel reproducing kernel particle method system.  相似文献   

15.
Parallel programs present some features such as concurrency, communication and synchronization that make the test a challenging activity. Because of these characteristics, the direct application of traditional testing is not always possible and adequate testing criteria and tools are necessary. In this paper we investigate the challenges of validating message‐passing parallel programs and present a set of specific testing criteria. We introduce a family of structural testing criteria based on a test model. The model captures control and data flow of the message‐passing programs, by considering their sequential and parallel aspects. The criteria provide a coverage measure that can be used for evaluating the progress of the testing activity and also provide guidelines for the generation of test data. We also describe a tool, called ValiPar, which supports the application of the proposed testing criteria. Currently, ValiPar is configured for parallel virtual machine (PVM) and message‐passing interface (MPI). Results of the application of the proposed criteria to MPI programs are also presented and analyzed. Copyright © 2008 John Wiley & Sons, Ltd.  相似文献   

16.
将Parareal算法中的预估校正格式加以改进,提出时域分解并行算法。基于主从模式和消息传递,具体考察了群体通信和非阻塞通信模式,并设计出通用而简便的并行化模型。在集群系统下对热传导方程和对流扩散方程的数值模拟结果表明:算法具有较高的加速性能以及良好的可扩展性,体现了时域分解的独特优势。  相似文献   

17.
The Convex SPP-1000 is the most recent of the new generation of Scalable Parallel Computing systems being offered commercially. The SPP-1000 is distinguished by incorporating the first commercial version of directory based cache coherence mechanisms and the emerging Scalable Coherent Interface protocol to achieve a true global shared memory capability. Pairs of HP PA-RISC processors are combined in clusters of 8 processors using a cross-bar switch. Up to 16 clusters are interconnected using 4 ring networks in parallel with a distributed global cache. To evaluate this new system in a Beta test environment, the Goddard Space Flight Center conducted three classes of operational experiments with an emphasis on applications related to Earth and space science. A cluster was tested as a platform for executing a multiple program workload exploiting job-stream level parallelism. Synthetic programs were run to measure overhead costs of barrier, fork-join, and message passing synchronization primitives. A key problem for Earth and space science studies is gravitational N-body simulation of solar systems to galactic clusters. An efficient tree-code version of this problem was run to reveal scaling properties of the system and to measure the overall efficiency. This paper presents the experimental results and findings of this study and provides the earliest published evaluation of this new scalable architecture.  相似文献   

18.
随着粒子模拟并行计算在相关领域应用的不断深入和并行节点计算能力的不断增强,粒子模拟并行程序中通信耗时对整体性能的影响越来越显著,甚至成为主要性能瓶颈.本文在分析影响并行程序通信性能多种因素的基础上,从进程划分方式选择、通信协议优化的角度,对1个典型粒子模拟并行程序——二维宏观拟颗粒并行模拟程序在千兆以太网环境下的通信性能的优化策略进行了测试研究,通过改进并行进程划分方式,采用用户级通信协议等方法,使测试程序通信性能获得明显提高,进而提出了粒子模拟并行程序通信性能优化的思路和建议.  相似文献   

19.
This paper deals with the construction and use of simple synthetic programs that model the behavior of more complex, real parallel programs. Synthetic programs can be used in many ways: to construct an easily ported suite of benchmark programs, to experiment with alternate parallel implementations of a program without actually writing them, and to predict the behavior and performance of an algorithm on a new or hypothetical machine. Synthetic programs are constructed easily from scratch and from existing programs and can even be constructed using nothing but information obtained from traces of the real program's execution.  相似文献   

20.
Distributed-memory parallel systems rely on explicit message exchange for communication, but the communication operations they support can differ in many aspects. One key difference is the way messages are generated or consumed. With systolic communication, a message is transmitted as it is generated. For example, the result computed by the multiplier is sent directly to the communication subsystem for transmission to another node. With memory communication, the complete message is generated and stored in memory, and then transmitted to its destination. Since sender and receiver nodes are individually controlled, they can use different communication styles. One example of memory communication is message passing: both the sender and receiver buffer the message in memory. These two communication styles place different demands on processor design. This article illustrates each style's effect on processor resources for some key application kernels. We are targeting the iWarp system because it supports both communication styles. Two parallel-program generators, one for each communication style, automatically map the sample programs  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号