Similar literature
 20 similar articles found
1.
Recent studies show that MPI processes in real applications could arrive at an MPI collective operation at different times. This imbalanced process arrival pattern can significantly affect the performance of the collective operation. MPI_Alltoall() and MPI_Allgather() are communication-intensive collective operations that are used in many scientific applications. Therefore, their efficient implementations under different process arrival patterns are critical to the performance of scientific applications running on modern clusters. In this paper, we propose novel RDMA-based process arrival pattern aware MPI_Alltoall() and MPI_Allgather() for different message sizes over InfiniBand clusters. We also extend the algorithms to be shared memory aware for small to medium size messages under process arrival patterns. The performance results indicate that the proposed algorithms outperform the native MVAPICH implementations as well as other non-process arrival pattern aware algorithms when processes arrive at different times. Specifically, the RDMA-based process arrival pattern aware MPI_Alltoall() and MPI_Allgather() are 3.1 times faster than MVAPICH for 8 KB messages. On average, the applications studied in this paper (FT, RADIX, and N-BODY) achieve a speedup of 1.44 using the proposed algorithms.

2.
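Liu Zhiqiang, Song Junqiang, Lu Fengshun, Xu Fen. Journal of Software (《软件学报》), 2011, 22(10): 2509-2522
To improve the performance of MPI broadcast under the unbalanced process arrival (UPA) pattern, this paper analyzes the broadcast problem under UPA theoretically and proves that, on multi-core clusters, competition among the multiple MPI processes within a node can effectively reduce the impact of UPA on MPI broadcast performance. On this basis it proposes a new optimization method, the competitive and pipelined (CP) method. CP uses an intra-node process competition mechanism to start inter-node communication as early as possible during the broadcast. A broadcast algorithm optimized with CP communicates within a node through shared memory, and the leader process produced by the competition mechanism executes the original algorithm for inter-node communication. The method also overlaps inter-node and intra-node communication in a pipelined fashion, which exploits the multi-core advantage of each cluster node, reduces the impact of UPA on MPI broadcast, and improves performance. To validate the CP method, three typical MPI broadcast algorithms, each suited to a different message length, were optimized with it. On a real system, CP broadcast was evaluated with micro-benchmarks and two real applications; the results show that the method effectively improves the performance of traditional broadcast algorithms under UPA. In the application workload tests, CP broadcast performed about 16% better than pipelined broadcast and 18%~24% better than the broadcast in MVAPICH2 1.2.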

3.
In this paper, we present an adaptive extension library that combines the advantage of using a portable MPI library with the ability to optimize the performance of specific collective communication operations. The extension library is built on top of MPI and can be used with any MPI library. Using the extension library, performance improvements can be achieved by an orthogonal organization of the processors in 2D or 3D meshes and by decomposing the collective communication operations into several consecutive phases of MPI communication. Additional point‐to‐point‐based algorithms are also provided. The extension library works in two steps, an a priori configuration phase detecting possible improvements for implementing collective communication for the MPI library used and an execution phase selecting a better implementation during execution time. This allows an adaptation of the performance of MPI programs to a specific execution platform and communication situation. The experimental evaluation shows that significant performance improvements can be obtained for different MPI libraries by using the library extension for collective MPI communication operations in isolation as well as in the context of application programs. Copyright © 2007 John Wiley & Sons, Ltd.

4.
The Message‐Passing Interface (MPI) is commonly used to write parallel programs for distributed memory parallel computers. MPI‐CHECK is a tool developed to aid in the debugging of MPI programs that are written in free or fixed format Fortran 90 and Fortran 77. This paper presents the methods used in MPI‐CHECK 2.0 to detect many situations where actual and potential deadlocks occur when using blocking and non‐blocking point‐to‐point routines as well as when using collective routines. Copyright © 2002 John Wiley & Sons, Ltd.

5.
Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represents a great challenge to developing efficient high performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics since it considers transient network characteristics for dividing the nodes into clusters. Our experiments with the broadcast operation on a real-grid setup indicate that an adaptive framework allows significant performance improvements on MPI collective communications.

6.
Message Passing Interface (MPI) allows a group of computers in a network to be specified as a cluster system. It provides the routines for task activation and communication. Writing programs for a cluster system is a difficult job. In this paper the Message-passing Interface is presented. Parallel programs using WMPI, a version of MPI, to solve the pi (π) calculation, the quick sort algorithm, and the Torsion problem are presented. The programs are written and compiled in Microsoft Visual C++.
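The abstract does not show how the π program is decomposed. As a minimal sketch, assuming the common textbook approach of integrating 4/(1+x²) over [0, 1] with the midpoint rule, the work can be split cyclically across ranks and the partial sums added together. The code below simulates the ranks in plain Python instead of calling MPI; `local_sum` and `parallel_pi` are hypothetical names, and the final `sum` stands in for a sum reduction across processes.

```python
def local_sum(rank, size, n):
    """Partial midpoint-rule sum of 4/(1+x^2) over the intervals owned by `rank`."""
    h = 1.0 / n
    total = 0.0
    # Cyclic distribution: rank r handles intervals r, r+size, r+2*size, ...
    for i in range(rank, n, size):
        x = h * (i + 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total


def parallel_pi(size, n):
    """Stand-in for a sum reduction: add up every simulated rank's partial result."""
    return sum(local_sum(rank, size, n) for rank in range(size))


# Four simulated ranks sharing 100,000 intervals.
pi_estimate = parallel_pi(size=4, n=100_000)
```

In a real MPI program each rank would compute only its own `local_sum` and the addition would be done by a reduction call; the split of the loop is the part this sketch is meant to show.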

7.
Wang Hao, Zhang Wei, Xie Min, Dong Yong. Computer Engineering & Science (《计算机工程与科学》), 2020, 42(11): 1981-1987
MPI collective communication operations are widely used in parallel scientific applications and have an important impact on program scalability. The Tianhe interconnect network supports triggered communication operations, which can offload message passing and processing work and improve performance between nodes. Allreduce and Reduce algorithms under different tree topologies are designed using the triggered operations to lower the latency of reduction communication between nodes. Tests on the actual system platform show that, compared with the point-to-point implementation of these two operations in MPICH, the trigger-based offload algorithm can reduce the running time by up to 59.6% at different node scales.

8.
As supercomputers scale to 1000 PFlop/s over the next decade, investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices for high-performance computing (HPC) hardware/software co-design is crucial. This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance investigation toolkit, such as its scalability to 2^27 simulated Message Passing Interface (MPI) ranks on 960 real processor cores, the capability to evaluate the performance of different MPI collective communication algorithms, and the ability to evaluate the performance of a basic Monte Carlo application with different architectural parameters.

9.
Current technologies allow efficient data collection by several sensors to determine an overall evaluation of the status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we perform a comparison of different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of both library‐function and operating‐system calls of parallel programs written with C and MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a false‐positive rate near 0% in real time, and we show that the added computational load on each node is small. As an example, the monitoring of function calls using a hidden Markov model generates less than 5% overhead. The proposed system is able to automatically detect deviations of a process from its expected behavior in any node of the cluster, and thus it can be used as an anomaly detector, for performance monitoring to complement other systems or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.

10.
The Message‐passing Interface (MPI) standard provides basic means for adaptations of the mapping of MPI process ranks to processing elements to better match the communication characteristics of applications to the capabilities of the underlying systems. The MPI process topology mechanism enables the MPI implementation to rerank processes by creating a new communicator that reflects user‐supplied information about the application communication pattern. With the newly released MPI 2.2 version of the MPI standard, the process topology mechanism has been enhanced with new interfaces for scalable and informative user‐specification of communication patterns. Applications with relatively static communication patterns are encouraged to take advantage of the mechanism whenever convenient by specifying their communication pattern to the MPI library. Reference implementations of the new mechanism can be expected to be readily available (and come at essentially no cost), but non‐trivial implementations pose challenging problems for the MPI implementer. This paper is first and foremost addressed to application programmers wanting to use the new process topology interfaces. It explains the use and the motivation for the enhanced interfaces and the advantages gained even with a straightforward implementation. For the MPI implementer, the paper summarizes the main issues in the efficient implementation of the interface and explains the optimization problems that need to be (approximately) solved by a good MPI library. Copyright © 2010 John Wiley & Sons, Ltd.

11.
MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on High Performance Computing clusters and multicore processors. In this paper, we extend the MPJ Express software to provide two new communication devices. The first device, called hybrid, enables MPJ Express to exploit hybrid parallelism on clusters of multicore processors by sitting on top of existing shared memory and network communication devices. The second device, called native, uses JNI wrappers in interfacing MPJ Express to native MPI implementations like MPICH and Open MPI. We evaluate the performance of these devices on a range of interconnects including 1G/10G Ethernet, 10G Myrinet and 40G InfiniBand. In addition, we analyze and evaluate the cost of the MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve comparable performance to native MPI libraries (for latency and bandwidth of point-to-point and collective communications), which is a significant gain in performance compared to existing communication devices. The hybrid communication device, without any modifications at application level, also helps parallel applications achieve better speedups and scalability by exploiting multicore architecture. Our performance evaluation quantifies the cost incurred by buffering and its impact on overall performance of software. We witnessed comparative performance as both new devices improve application performance and achieve up to 90% of the theoretical bandwidth available without application rewriting effort, including NAS Parallel Benchmarks, point-to-point and collective communication.

12.
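Wang Jie, Zeng Yu, Zhang Jianlin. Computer Science (《计算机科学》), 2010, 37(6): 229-232
The new features of multi-core processors open up new optimization opportunities for MPI applications, and tuning MPI runtime parameters has been shown to be an effective way to optimize them. However, the optimal runtime parameters depend not only on the architecture of the multi-core cluster but also on the characteristics of the MPI application. This paper proposes and analyzes an optimization model based on artificial neural networks for a given multi-core cluster, which automatically predicts near-optimal runtime parameters for unseen MPI programs. Experiments on two different benchmarks demonstrate the effectiveness of the method: the speedup obtained with the predicted runtime parameters reaches, on average, more than 95% of the actual maximum speedup.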

13.
The purpose of this paper is to compare the performance of MPICH with the vendor Message Passing Interface (MPI) on a Cray T3E‐900 and an SGI Origin 3000. Seven basic communication tests which include basic point‐to‐point and collective MPI communication routines were chosen to represent commonly‐used communication patterns. Cray's MPI performed better (and sometimes significantly better) than Mississippi State University's (MSU's) MPICH for small and medium messages. They both performed about the same for large messages; however, for three tests MSU's MPICH was about 20% faster than Cray's MPI. SGI's MPI performed and scaled better (and sometimes significantly better) than MPICH for all messages, except for the scatter test where MPICH outperformed SGI's MPI for 1 kbyte messages. The poor scalability of MPICH on the Origin 3000 suggests there may be scalability problems with MPICH. Copyright © 2003 John Wiley & Sons, Ltd.

14.
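Wang Jie, Zhong Lujie, Zeng Yu. Computer Science (《计算机科学》), 2011, 38(10): 281-284
The new features of multi-core processors make the memory hierarchy of multi-core clusters more complex and also open up new optimization opportunities for MPI programs. Researchers in China and abroad have proposed many optimization methods and techniques for MPI programs on multi-core clusters. This paper measures the communication performance of three different multi-core clusters and experimentally evaluates, on Intel and AMD multi-core clusters, several generally applicable optimization techniques: hybrid MPI/OpenMP, tuning MPI runtime parameters, and optimizing MPI process placement. The experimental results and the optimization gains are also analyzed.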

15.
Barriers are widely used for synchronization in parallel programs. Because a process that arrives earlier than the others must wait at the barrier, the total processor utilization decreases. In this paper, to find the sources of the barrier waiting time, parallel programs are executed with various grain sizes through execution-driven simulations. In simulation studies, we found that even if approximately equal amounts of work are distributed to each processor, all processes may not arrive at a barrier at the same time. The reason is that different numbers of cache misses and instructions within the partitioned grains cause processors to arrive at the barrier at different times. In this paper, the two-phased barrier is considered to reduce the blind waiting time in the traditional barrier scheme; it can be simply constructed by dividing one specific stage for the synchronization into two stages. At each stage, each process decides whether to stall, depending on the current execution state of the grains running on any given processor. Simulation results show that the reduced barrier waiting times attributed to the two-phased barrier contribute to the performance improvement of parallel programs.

16.
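The embedded zerotree wavelet compression algorithm is an effective image compression algorithm, but its compression time is long. This paper studies the algorithm and implements a parallel version of it on a multi-core cluster system, improving its performance. Two parallel algorithms, MPI and MPI+OpenMP, are implemented, and the serial algorithm, the MPI parallel algorithm, and the MPI+OpenMP parallel algorithm are compared. The results show that, as the data volume grows, both the MPI and the MPI+OpenMP parallel algorithms run markedly more efficiently than the serial algorithm, with the MPI+OpenMP parallel algorithm being the more efficient of the two.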

17.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.

18.
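Modern supercomputers have more and more compute nodes, each with multiple processor cores. Because interconnect bandwidth differs, inter-node and intra-node communication form two levels with different communication performance, the latter faster than the former. However, the default process mapping of MPI programs does not take this difference into account and cannot exploit the better intra-node bandwidth, which severely limits supercomputer performance. To address this problem, this paper designs and implements POM, a tool that automatically optimizes the process mapping of MPI programs by exploiting the communication hierarchy, and provides an efficient, low-overhead way to obtain an MPI program's communication information. By optimizing how communication is distributed across the levels of the hierarchy, it improves communication efficiency and thus application performance. The paper solves three problems: abstracting the communication hierarchy of the hardware platform, obtaining the communication information of MPI programs with low overhead, and computing the mapping. First, according to differences in communication capability, the supercomputer is abstracted into two levels: different compute nodes connected by a high-speed interconnect, and multiple processor cores on the same node. Second, a simple method for converting collective communication into point-to-point communication is proposed. Finally, an undirected weighted graph is used to represent the inter-process communication of an MPI program, which turns the process mapping problem into a graph partitioning problem. Experimental results on the Dawning 5000A and Dawning 4000A show that POM significantly improves the performance of MPI programs.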
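The graph formulation above can be illustrated with a small sketch. It is not the POM tool: the function names, the trace, and the rule used to break an allgather into pairwise messages are all assumptions made for illustration. The sketch builds an undirected weighted graph of inter-process traffic and scores a rank-to-node mapping by the traffic that crosses nodes, which is the quantity a graph partitioner would try to minimize.

```python
from collections import defaultdict
from itertools import combinations


def add_p2p(graph, a, b, nbytes):
    """Accumulate traffic on the undirected edge {a, b}."""
    if a != b:
        graph[frozenset((a, b))] += nbytes


def add_allgather(graph, ranks, nbytes):
    """Naive decomposition (an assumption): every pair exchanges nbytes each way."""
    for a, b in combinations(ranks, 2):
        add_p2p(graph, a, b, 2 * nbytes)


def inter_node_bytes(graph, node_of):
    """Total edge weight cut by a rank -> node mapping."""
    return sum(weight for edge, weight in graph.items()
               if len({node_of[r] for r in edge}) > 1)


graph = defaultdict(int)
# Hypothetical trace: ranks 0-2 and 1-3 exchange heavily, plus one small allgather.
add_p2p(graph, 0, 2, 1000)
add_p2p(graph, 1, 3, 1000)
add_allgather(graph, [0, 1, 2, 3], 10)

default_map = {0: 0, 1: 0, 2: 1, 3: 1}  # ranks packed in order, two per node
tuned_map = {0: 0, 2: 0, 1: 1, 3: 1}    # heavy pairs share a node

default_cut = inter_node_bytes(graph, default_map)
tuned_cut = inter_node_bytes(graph, tuned_map)
```

With this toy trace the in-order mapping sends both heavy exchanges across nodes, while the tuned mapping keeps them inside a node, so `tuned_cut` is much smaller than `default_cut`. A real tool would search for such a mapping with a graph partitioning algorithm instead of having it supplied by hand.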

19.
In a previous article,(1) Gupta and Hill introduced an adaptive combining tree algorithm for busy-wait barrier synchronization on shared-memory multiprocessors. The intent of the algorithm was to achieve a barrier in logarithmic time when processes arrive simultaneously, and in constant time after the last arrival when arrival times are skewed. A fuzzy(2) version of the algorithm allows a process to perform useful work between the point at which it notifies other processes of its arrival and the point at which it waits for all other processes to arrive. Unfortunately, adaptive combining tree barriers as originally devised perform a large amount of work at each node of the tree, including the acquisition and release of locks. They also perform an unbounded number of accesses to nonlocal locations, inducing large amounts of memory and interconnect contention. We present new adaptive combining tree barriers that eliminate these problems. We compare the performance of the new algorithms to that of other fast barriers on a 64-node BBN Butterfly 1 multiprocessor, a 35-node BBN TC2000, and a 126-node KSR 1. The results reveal scenarios in which our algorithms outperform all known alternatives, and suggest that both adaptation and the combination of fuzziness with tree-style synchronization will be of increasing importance on future generations of shared-memory multiprocessors. At the University of Rochester, this work was supported in part by NSF Institutional Infrastructure grant number CDA-8822724 and ONR research contract number N00014-92-J-1801 (in conjunction with the ARPA Research in Information Science and Technology - High Performance Computing, Software Science and Technology program, ARPA Order No. 8930). At Rice University, this work was supported in part by NSF Cooperative Agreements CCR-8809615 and CCR-912008.

20.
This paper deals with a technique that can support the re-engineering of parallel programs based on point-to-point communication primitives by detecting typical process interaction patterns in the code. Pattern detection is performed by the static analysis of the parallel program and by solving Diophantine sets of inequalities. The objective is to determine process interactions and to classify them into a set of commonly occurring interaction patterns.

Information on the patterns contained in the program, besides being useful for code comprehension and documentation, makes it possible to obtain more structured and, possibly, efficient versions of the same programs through the use of collective communication constructs. These are primitives for collective data movement or computation often available in current message-passing programming environments.

After the presentation of the basic program analysis technique, several examples involving the detection of common communication patterns are shown. Then the structure of PPAR, a prototype tool that allows the analysis of parallel programs written in Fortran 77 with calls to PVM or MPI unstructured communication primitives, is outlined, and conclusions are drawn.

