Similar Articles
20 similar articles found.
1.
Analyzing and understanding the performance behavior of parallel applications on parallel computing platforms is a long-standing concern in the High Performance Computing community. When the targeted platforms are not available, simulation is a reasonable approach to obtain objective performance indicators and explore various hypothetical scenarios. In the context of applications implemented with the Message Passing Interface, two simulation methods have been proposed, on-line simulation and off-line simulation, both with their own drawbacks and advantages. In this work, we present an off-line simulation framework, that is, one that simulates the execution of an application based on event traces obtained from an actual execution. The main novelty of this work, when compared to previously proposed off-line simulators, is that traces that drive the simulation can be acquired on large, distributed, heterogeneous, and non-dedicated platforms. As a result, the scalability of trace acquisition is increased, which is achieved by enforcing that traces contain no time-related information. Moreover, our framework is based on a state-of-the-art scalable, fast, and validated simulation kernel. We introduce the notion of performing off-line simulation from time-independent traces, propose and evaluate several trace acquisition strategies, describe our simulation framework, and assess its quality in terms of trace acquisition scalability, simulation accuracy, and simulation time. Copyright © 2014 John Wiley & Sons, Ltd.
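For illustration only (this is our sketch, not the framework's actual trace format): a time-independent trace records what each rank did and how much, never when, so one hypothetical event record might look like the following.

    /* Hypothetical record layout for a time-independent trace event
     * (illustrative only): actions are logged as volumes, never as
     * wall-clock times. */
    #include <stdint.h>

    typedef enum { EV_COMPUTE, EV_SEND, EV_RECV } event_kind_t;

    typedef struct {
        int32_t      rank;    /* MPI rank that issued the action          */
        event_kind_t kind;    /* compute, send, or receive                */
        int32_t      peer;    /* partner rank for send/recv, -1 otherwise */
        int32_t      tag;     /* MPI tag for point-to-point events        */
        uint64_t     volume;  /* instructions for compute, bytes for comm */
    } ti_event_t;

Because no wall-clock values are stored, recording on slow, heterogeneous, or time-shared nodes cannot distort the trace, which is what makes distributed trace acquisition feasible.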

2.
With the increasing uniprocessor and symmetric multiprocessor computational power available today, interprocessor communication has become an important factor that limits the performance of clusters of workstations/multiprocessors. Many factors, including communication hardware overhead, communication software overhead, and user environment overhead (multithreading, multiuser), affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is due to a number of message copying operations. Ideally, it is desirable to have a true zero-copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early arrival messages have to be buffered in a temporary area. In this paper, we show that there is message reception communication locality in message-passing applications. We have utilized this communication locality and devised different message predictors at the receiver sides of communications. In essence, these message predictors can be efficiently used to drain the network and cache the incoming messages even if the corresponding receive calls have not yet been posted. The performance of these predictors, in terms of hit ratio, on some parallel applications is quite promising and suggests that prediction has the potential to eliminate most of the remaining message copies. We also show that the proposed predictors are not sensitive to the starting message reception call, and that they perform better than (or at least as well as) our previously proposed predictors. Copyright © 2002 John Wiley & Sons, Ltd.
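As a rough illustration of the idea (a minimal sketch under our own assumptions, not one of the paper's predictors), even a history of depth one — guess that the next message repeats the (source, tag) of the previous one — can be expressed and measured as follows.

    /* Depth-one "last value" predictor sketch (not the paper's code):
     * guess that the next message repeats the (source, tag) of the one
     * received most recently, and track the hit ratio. */
    typedef struct { int source; int tag; } msg_id_t;

    static msg_id_t last_seen = { -1, -1 };
    static long hits = 0, total = 0;

    /* Call on every message reception. */
    void predictor_observe(int source, int tag)
    {
        total++;
        if (last_seen.source == source && last_seen.tag == tag)
            hits++;
        last_seen.source = source;   /* update the one-entry history */
        last_seen.tag    = tag;
    }

    /* Prediction used to pre-post a speculative receive buffer. */
    msg_id_t predictor_predict(void) { return last_seen; }

    double predictor_hit_ratio(void) { return total ? (double)hits / total : 0.0; }

Real predictors in this line of work keep richer histories (cycles, tag sequences), but the interface is the same: observe every arrival, and use the prediction to pre-post a buffer so an early-arriving message can be drained from the network.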

3.
4.
The multiplication of cores in today's architectures raises the importance of intra-node communication in modern clusters and its impact on overall parallel application performance. Although several proposals have focused on this issue in the past, there is still a need for a portable and hardware-independent solution that addresses the requirements of both point-to-point and collective MPI operations inside shared-memory computing nodes.

5.
The purpose of this paper is to compare the communication performance and scalability of MPI communication routines on a Windows Cluster, a Linux Cluster, a Cray T3E-600, and an SGI Origin 2000. All tests in this paper were run using various numbers of processors and two message sizes. Although the Cray T3E-600 is about seven years old, it performed best of all the machines on most of the tests. The Linux Cluster with the Myrinet interconnect and Myricom's MPI performed and scaled quite well and, in most cases, performed better than the Origin 2000, and in some cases better than the T3E. The Windows Cluster using the Giganet Full Interconnect and MPI/Pro's MPI performed and scaled poorly for small messages compared with all of the other machines. Copyright © 2004 John Wiley & Sons, Ltd.
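The abstract does not list the individual tests, but comparisons of this kind are typically built from simple kernels such as a ping-pong; a generic sketch of such a test (ours, not the authors' code) is shown below.

    /* Generic ping-pong latency test (illustrative; not the authors'
     * benchmark).  Run with at least two MPI processes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000, size = 8;      /* 8-byte "small" message */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(size);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("avg one-way latency: %g us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }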

6.
The Message Passing Interface (MPI) 3.0 standard includes a significant revision to MPI's remote memory access (RMA) interface, which provides support for one-sided communication. MPI-3 RMA is expected to greatly enhance the usability and performance of MPI RMA. We present the first complete implementation of MPI-3 RMA and document implementation techniques and performance optimization opportunities enabled by the new interface. Our implementation targets messaging-based networks and is publicly available in the latest release of the MPICH MPI implementation. Using this implementation, we explore the performance impact of new MPI-3 functionality and semantics. Results indicate that the MPI-3 RMA interface provides significant advantages over the MPI-2 interface by enabling increased communication concurrency through relaxed semantics in the interface and additional routines that provide new window types, synchronization modes, and atomic operations. Copyright © 2016 John Wiley & Sons, Ltd.
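A minimal sketch of the MPI-3 additions discussed above (illustrative, not taken from the paper): MPI_Win_allocate creates the window memory and the window in one call, and a passive-target epoch with MPI_Win_lock_all/MPI_Win_flush allows one-sided puts without the target's participation.

    /* Minimal MPI-3 RMA sketch (illustrative): each rank exposes one
     * double through MPI_Win_allocate and writes into its right
     * neighbor's window during a passive-target epoch. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double *base, value;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI-3: allocate the window memory and create the window at once. */
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = -1.0;
        value = (double)rank;

        /* MPI-3 passive target: lock all targets once, then put and flush. */
        MPI_Win_lock_all(0, win);
        MPI_Put(&value, 1, MPI_DOUBLE, (rank + 1) % size, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush((rank + 1) % size, win);
        MPI_Win_unlock_all(win);

        MPI_Barrier(MPI_COMM_WORLD);
        printf("rank %d received %g from its left neighbor\n", rank, *base);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }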

7.
The purpose of this paper is to compare the performance of MPICH with the vendor Message Passing Interface (MPI) on a Cray T3E-900 and an SGI Origin 3000. Seven basic communication tests, which include basic point-to-point and collective MPI communication routines, were chosen to represent commonly used communication patterns. Cray's MPI performed better (and sometimes significantly better) than Mississippi State University's (MSU's) MPICH for small and medium messages. They both performed about the same for large messages; however, for three tests MSU's MPICH was about 20% faster than Cray's MPI. SGI's MPI performed and scaled better (and sometimes significantly better) than MPICH for all messages, except for the scatter test, where MPICH outperformed SGI's MPI for 1 kbyte messages. The poor scalability of MPICH on the Origin 3000 suggests there may be scalability problems with MPICH. Copyright © 2003 John Wiley & Sons, Ltd.

8.
An accurate and highly efficient analysis approach is crucial for a designer to evaluate the performance of on-chip networks. To this end, novel M/G/1/N queuing models that capture the various blocking phenomena of a wormhole-switching router are presented to compute the average waiting time accurately. Using the M/G/1/N queuing models, this paper then presents a performance analysis algorithm to estimate key metrics such as packet latency and buffer utilization. Comparisons with SystemC simulation results show that the proposed approach, with mean errors of 6.9% and 7.8%, achieves speedups of 117 and 101 for single-channel and multi-channel routers, respectively. In our design methodology, this approach can effectively guide the NoC synthesis process and can then be conveniently applied to multi-objective optimization to find the best mappings.
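The paper's finite-capacity M/G/1/N models are not reproduced in the abstract; for orientation only, they extend the classical M/G/1 result, whose Pollaczek-Khinchine mean waiting time is

    \[ W = \frac{\lambda\,\mathrm{E}[S^2]}{2\,(1-\rho)}, \qquad \rho = \lambda\,\mathrm{E}[S], \]

with \(\lambda\) the packet arrival rate and \(S\) the per-packet service time; the finite buffer size N and the wormhole blocking effects modeled in the paper modify this basic form.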

9.
This paper discusses several approaches to designing and implementing shared-memory communication protocol modules for message-passing interface (MPI) libraries, colloquially called 'shared-memory devices'. The authors present a new taxonomy for classifying designs for shared-memory MPI communication devices and formulate design evaluation criteria. Using these criteria, the authors compare three existing shared-memory devices for MPICH and choose the best one. The authors also present experimental results that support their choice. The contributions of this paper are three-fold. First, the authors present the taxonomy for shared-memory communication devices. Second, they show advantages and potential problems of the devices that belong to different classes of their taxonomy using the formulated design criteria. Third, they analyze communication performance of existing MPICH shared-memory devices, discuss optimizations of their performance, and show the performance gains that these optimizations yield. MPICH is used for comparison, since it is a widely used MPI implementation. Copyright © 2000 John Wiley & Sons, Ltd.

10.
This paper studies how to alleviate the MPI program performance degradation caused by communication conflicts in InfiniBand clusters. From a system-administration perspective, it proposes improving application communication performance by changing the process mapping used when MPI jobs are launched. A communication performance loss ratio (CPLR) metric is designed to evaluate MPI job placement schemes, a search algorithm based on simulated annealing is designed to optimize the placement, and both the metric and the algorithm are implemented and tested. The test results show that MPI programs launched with the optimized placement achieve a measurable improvement in communication performance.
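The abstract gives no algorithmic details; presumably (our reading, not stated in the abstract) the simulated-annealing search accepts a candidate process mapping with the usual Metropolis criterion,

    \[ P(\text{accept}) = \begin{cases} 1, & \Delta C \le 0,\\ e^{-\Delta C/T}, & \Delta C > 0, \end{cases} \]

where \(\Delta C\) would be the change in the CPLR-based cost of the candidate mapping and \(T\) the current temperature.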

11.
This paper presents a new MPI_Allgather algorithm, the neighbor exchange algorithm. It proposes the concept of average logical communication distance, together with a formula for computing it, which effectively measures communication locality. Analysis shows that, among four MPI_Allgather algorithms, the neighbor exchange and ring algorithms have the best communication locality. Performance tests and analysis of the four MPI_Allgather algorithms on the teraflops clusters DeepComp 6800 and Dawning 4000A show that the neighbor exchange algorithm performs best for long messages, is unstable for medium-length messages, and is inferior to the recursive doubling and Bruck algorithms for short messages.
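The abstract includes no code; as a reference point, the ring algorithm it compares against (which shares the optimal communication locality) can be sketched as follows — each rank only ever talks to its two neighbors, which is exactly the locality that the average logical communication distance is meant to capture. This is a sketch of the ring variant, not the paper's neighbor-exchange implementation.

    /* Ring allgather sketch (illustrates the communication locality
     * discussed above; not the paper's neighbor-exchange code).
     * Each rank places its own block in recvbuf[rank] and, in p-1
     * steps, forwards the most recently received block to the right. */
    #include <mpi.h>
    #include <string.h>

    void ring_allgather(const double *sendblk, double *recvbuf, int blklen,
                        MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        memcpy(recvbuf + (size_t)rank * blklen, sendblk,
               blklen * sizeof(double));

        int right = (rank + 1) % p, left = (rank - 1 + p) % p;
        for (int s = 0; s < p - 1; s++) {
            int sendidx = (rank - s + p) % p;       /* block forwarded */
            int recvidx = (rank - s - 1 + p) % p;   /* block arriving  */
            MPI_Sendrecv(recvbuf + (size_t)sendidx * blklen, blklen, MPI_DOUBLE,
                         right, 0,
                         recvbuf + (size_t)recvidx * blklen, blklen, MPI_DOUBLE,
                         left, 0, comm, MPI_STATUS_IGNORE);
        }
    }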

12.
An Assessment of MPI Environments for Windows NT
In this paper we evaluate the MPI environments currently available for Windows NT on the Intel IA32 and Compaq/DEC Alpha architectures. We present benchmark results for low-level communication and for the NAS Parallel Benchmarks to allow comparison with other systems, but our primary interest is determining real application performance and robustness in production cluster environments. For this we use PAFEC-FE, a large FORTRAN code for finite-element analysis. We present results from three MPI implementations, two architectures, and three networking technologies (10 and 100 Mbit/s Ethernet and 1 Gbit/s Myrinet).

13.
One of the main challenges for scheduling algorithms in Grid environments is the autonomy of sites, which makes it difficult for the grid scheduler to estimate the exact cost of executing a task on different sites. In this paper, we present a solution to this problem based on data history (workload traces) and time series techniques. The main focus of this work is forecasting the task waiting time in a resource queue, which is under the control of a local scheduler with its own distinctive scheduling policy. The main contribution of this work is the consideration of a special property of grid resources, dynamic membership, i.e. a resource may leave and then rejoin the grid environment repeatedly. When the resource belongs to the grid environment, its workload trace (log file) is considered a correct log. On the other hand, when the resource leaves the grid, the log file covering that period is considered a defective part of the trace. As the defective parts still contain useful information, they can be used for forecasting once they have been repaired. To this end, we employ a state-space model, together with the associated Kalman recursions and the Expectation-Maximization algorithm, to repair the defective waiting-time series so that the resulting log reads as if the resource had never left the grid. Experimental results on a number of workload logs demonstrate that this approach achieves an average prediction error between 22% and 64% lower than that of rival methods. Copyright © 2009 John Wiley & Sons, Ltd.
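The abstract does not state the model; a generic linear Gaussian state-space model, the kind the Kalman recursions and EM operate on, has the form

    \[ x_{t+1} = F x_t + w_t, \quad w_t \sim \mathcal{N}(0, Q), \qquad
       y_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R), \]

where \(y_t\) would be the observed waiting time; the Kalman smoother reconstructs the missing or defective observations while EM re-estimates \(F\), \(H\), \(Q\), and \(R\) from the repaired trace.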

14.
MPI is commonly used to write parallel programs for distributed memory parallel computers. MPI-CHECK is a tool developed to aid in the debugging of MPI programs that are written in free or fixed format Fortran 90 and Fortran 77. MPI-CHECK provides automatic compile-time and run-time checking of MPI programs. MPI-CHECK automatically detects the following problems in the use of MPI routines: (i) mismatch in argument type, kind, rank or number; (ii) messages which exceed the bounds of the source/destination array; (iii) negative message lengths; (iv) illegal MPI calls before MPI_INIT or after MPI_FINALIZE; (v) inconsistencies between the declared type of a message and its associated DATATYPE argument; and (vi) actual arguments which violate the INTENT attribute. Copyright © 2003 John Wiley & Sons, Ltd.
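MPI-CHECK itself targets Fortran 77/90 sources; purely to illustrate error class (ii) above, the following C fragment (our example, not from the paper) contains a receive whose message length exceeds the bounds of the destination array.

    /* Illustration of error class (ii): a message larger than the
     * destination array.  (MPI-CHECK itself checks Fortran sources.) */
    #include <mpi.h>

    void buggy_recv(MPI_Comm comm)
    {
        double buf[10];
        MPI_Status st;
        /* BUG: the sender may deliver up to 100 doubles, but buf holds
         * only 10 -- a bounds checker should flag this receive. */
        MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &st);
    }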

15.
The Message Passing Interface (MPI) has become a de facto standard for programming models in high-performance computing, but its rich and flexible interface semantics make it easy for programs to produce communication deadlocks, which seriously affects system usability. However, existing detection tools for MPI communication deadlock are not scalable enough to keep pace with the continuous growth of system scale. In this context, we propose a framework for MPI runtime communication deadlock detection, MPI-RCDD, which comprises three main mechanisms. First, MPI-RCDD has a message logging protocol coupled to deadlock detection that ensures the communication messages required for deadlock analysis are not lost. Second, it uses the asynchronous processing thread provided by MPI to transfer dependencies between processes, so that multiple processes can participate in deadlock detection simultaneously, alleviating the performance bottleneck of centralized analysis. In addition, it performs deadlock analysis with an algorithm named AODA, based on an AND⊕OR model. The AODA algorithm combines the advantages of both timeout-based and dependency-based deadlock analysis approaches and allows processes in the timeout state to search for a deadlock cycle or knot during dependency transfer. Furthermore, the AODA algorithm does not produce false positives and accurately identifies the source of a deadlock. Experimental results on typical MPI communication deadlock benchmarks such as the Umpire Test Suite demonstrate the capability of MPI-RCDD. In addition, experiments on the NPB benchmarks show a satisfactory performance cost, indicating that MPI-RCDD has strong scalability.
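A textbook example of the kind of deadlock such runtime detectors look for (our illustration, not from the paper) is two ranks that both issue a blocking standard-mode send before either posts its receive; whether it actually hangs depends on the eager threshold, which is exactly why runtime detection matters.

    /* A textbook MPI deadlock of the kind runtime detectors target:
     * both ranks block in MPI_Send before either posts its MPI_Recv.
     * For messages above the eager threshold neither send can
     * complete, and the two processes wait on each other forever. */
    #include <mpi.h>

    #define N (1 << 22)              /* large enough to exceed eager limits */
    static double sendbuf[N], recvbuf[N];

    int main(int argc, char **argv)
    {
        int rank, peer;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;             /* run with exactly two processes */

        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);  /* blocks */
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                /* never reached */

        MPI_Finalize();
        return 0;
    }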

16.
A combination of particle swarm optimization and neural networks (PSONN) was proposed for twin-spirals scroll compressor (TSSC) performance prediction. The method integrated the evolutionary mechanism of PSO with the self-learning, nonlinear approximation ability of NNs. The main structural parameters of the TSSC were used as input variables and the main performance parameters as output variables of the established NN, and PSO was used to train the NN. The trained NN predicts TSSC performance well. The training results showed that this approach converges to better solutions much faster than earlier reported approaches and overcomes the weaknesses of slow convergence and local minima. PSONN offers a new method for TSSC performance optimization.
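The abstract does not give the formulation; the standard PSO velocity and position updates that would be used to adjust the network weights are

    \[ v_i \leftarrow \omega v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i),
       \qquad x_i \leftarrow x_i + v_i, \]

where \(x_i\) is particle \(i\)'s position (here a candidate NN weight vector), \(p_i\) its personal best, \(g\) the swarm's global best, \(r_1, r_2\) are uniform random numbers in \([0, 1]\), and \(\omega, c_1, c_2\) are the inertia and acceleration coefficients.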

17.
Modern service-oriented enterprise systems have increasingly complex and dynamic loosely-coupled architectures that often exhibit poor performance and resource efficiency and have high operating costs. This is due to the inability to predict at run-time the effect of workload changes on performance-relevant application-level dependencies and adapt the system configuration accordingly. Architecture-level performance models provide a powerful tool for performance prediction; however, current approaches to modeling the context of software components are not suitable for use at run-time. In this paper, we analyze typical online performance prediction scenarios and propose a performance meta-model for (i) expressing and resolving parameter and context dependencies, (ii) modeling service abstractions at different levels of granularity, and (iii) modeling the deployment of software components in complex resource landscapes. The presented meta-model is a subset of the Descartes Meta-Model (DMM) for online performance prediction, specifically designed for use in online scenarios. We motivate and validate our approach in the context of realistic and representative online performance prediction scenarios based on the SPECjEnterprise2010 standard benchmark.

18.
The Message-Passing Interface (MPI) standard provides basic means for adapting the mapping of MPI process ranks to processing elements to better match the communication characteristics of applications to the capabilities of the underlying systems. The MPI process topology mechanism enables the MPI implementation to rerank processes by creating a new communicator that reflects user-supplied information about the application communication pattern. With the newly released MPI 2.2 version of the MPI standard, the process topology mechanism has been enhanced with new interfaces for scalable and informative user-specification of communication patterns. Applications with relatively static communication patterns are encouraged to take advantage of the mechanism whenever convenient by specifying their communication pattern to the MPI library. Reference implementations of the new mechanism can be expected to be readily available (and come at essentially no cost), but non-trivial implementations pose challenging problems for the MPI implementer. This paper is first and foremost addressed to application programmers wanting to use the new process topology interfaces. It explains the use and the motivation for the enhanced interfaces and the advantages gained even with a straightforward implementation. For the MPI implementer, the paper summarizes the main issues in the efficient implementation of the interface and explains the optimization problems that need to be (approximately) solved by a good MPI library. Copyright © 2010 John Wiley & Sons, Ltd.
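A minimal sketch of the scalable MPI 2.2 graph-topology interface discussed above (illustrative; an application would pass its real communication graph): each rank declares only its own neighbors, and reorder = 1 permits the library to rerank processes to fit the machine.

    /* Minimal sketch of the MPI 2.2 distributed graph topology
     * interface: each rank declares only its own neighbours (here a
     * 1-D ring) and allows the library to rerank processes. */
    #include <mpi.h>

    MPI_Comm make_ring_topology(MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int neighbours[2] = { (rank - 1 + p) % p, (rank + 1) % p };
        MPI_Comm ring;

        MPI_Dist_graph_create_adjacent(comm,
                                       2, neighbours, MPI_UNWEIGHTED,  /* sources */
                                       2, neighbours, MPI_UNWEIGHTED,  /* dests   */
                                       MPI_INFO_NULL, 1 /* reorder */, &ring);
        /* Subsequent communication should use the (possibly reranked)
         * rank within `ring`. */
        return ring;
    }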

19.
A Study of Process Arrival Patterns for MPI Collective Operations
Process arrival pattern, which denotes the timing when different processes arrive at an MPI collective operation, can have a significant impact on the performance of the operation. In this work, we characterize the process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study the process arrival patterns in MPI programs with balanced loads, and investigate the impacts of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually sufficiently large to affect the performance; (2) application developers in general cannot effectively control the process arrival patterns in their MPI programs in the cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that process arrival pattern is an important factor that must be taken into consideration in developing and optimizing MPI collective routines. We propose a scheme that achieves high performance with different process arrival patterns, and demonstrate that by explicitly considering process arrival pattern, more efficient MPI collective routines than the current ones can be obtained.
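A simple way to observe an arrival pattern (our probe, not the authors' micro-benchmark, and it assumes reasonably synchronized node clocks) is to timestamp each rank immediately before the collective and gather the timestamps.

    /* Simple arrival-pattern probe (illustrative): record when each
     * rank reaches the collective and let rank 0 report the spread
     * between the earliest and latest arrival.  Assumes the node
     * clocks are reasonably synchronized. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    void probe_arrival(double *data, int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        double arrive = MPI_Wtime();            /* arrival timestamp */
        MPI_Allreduce(MPI_IN_PLACE, data, n, MPI_DOUBLE, MPI_SUM, comm);

        double *all = NULL;
        if (rank == 0) all = malloc(p * sizeof(double));
        MPI_Gather(&arrive, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);

        if (rank == 0) {
            double min = all[0], max = all[0];
            for (int i = 1; i < p; i++) {
                if (all[i] < min) min = all[i];
                if (all[i] > max) max = all[i];
            }
            printf("arrival spread: %g us\n", (max - min) * 1e6);
            free(all);
        }
    }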

20.
The MPI all-to-all exchange is one of the communication operations commonly used in simulation computations on cluster computers; the compute nodes use it to exchange the intermediate results of the previous step. Because the dense many-to-many communication of an all-to-all exchange easily causes congestion at the receiving ends and thus increases communication latency, the operation is optimized by organizing it as a ring of multiple regular, ordered communication steps; for all-to-all exchanges with large data volumes this yields a clear performance improvement.
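One common way to impose such a ring-like ordered schedule (a sketch consistent with the description above, not the article's code) is a pairwise exchange in which, at step s, rank i sends to (i + s) mod p and receives from (i - s) mod p, so no receiver is hit by more than one sender at a time.

    /* Sketch of a ring-ordered (pairwise-exchange) all-to-all schedule
     * (illustrative).  At step s, rank i sends its block for rank
     * (i+s)%p and receives the block from rank (i-s+p)%p, so each
     * receiver serves one sender per step instead of being flooded. */
    #include <mpi.h>
    #include <string.h>

    void ordered_alltoall(const char *sendbuf, char *recvbuf, int blksize,
                          MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        for (int s = 0; s < p; s++) {
            int dst = (rank + s) % p;
            int src = (rank - s + p) % p;
            if (dst == rank) {          /* own block: local copy only */
                memcpy(recvbuf + (size_t)rank * blksize,
                       sendbuf + (size_t)rank * blksize, blksize);
            } else {
                MPI_Sendrecv(sendbuf + (size_t)dst * blksize, blksize, MPI_CHAR,
                             dst, 0,
                             recvbuf + (size_t)src * blksize, blksize, MPI_CHAR,
                             src, 0, comm, MPI_STATUS_IGNORE);
            }
        }
    }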
