Similar Documents
20 similar documents found (search time: 963 ms)
1.
王洁  衷璐洁  曾宇 《计算机科学》2011,38(10):281-284
The new features of multi-core processors make the memory hierarchy of multi-core clusters more complex, while also opening up new optimization opportunities for MPI programs. Researchers at home and abroad have proposed many methods and techniques for optimizing MPI programs on multi-core clusters. This work measures the communication performance of three different multi-core clusters and experimentally evaluates, on Intel and AMD multi-core clusters respectively, several generally applicable optimization techniques: hybrid MPI/OpenMP, tuning MPI runtime parameters, and optimizing MPI process placement; the experimental results and the optimization gains are analyzed.
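As a rough illustration of the hybrid MPI/OpenMP technique evaluated above (a generic sketch, not the benchmark code used in the paper; the loop body and problem size are placeholders), a minimal hybrid program might look like this:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal hybrid MPI/OpenMP sketch: one MPI process per node (or socket),
 * OpenMP threads inside each process do the node-local work. */
int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* FUNNELED is enough when only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 20;            /* placeholder problem size per process */
    double local = 0.0, global = 0.0;

    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (1.0 + (double)i + (double)rank * n);  /* placeholder work */

    /* The inter-node part stays in MPI. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("procs=%d threads=%d sum=%f\n", nprocs, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```

The other two techniques evaluated, process placement and runtime parameters, are usually controlled outside the source code, for example with launcher options such as Open MPI's --map-by and --bind-to; the exact flags and tunables depend on the MPI implementation.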

2.
The new features of multi-core processors make the memory hierarchy of multi-core clusters more complex, while also opening up new optimization opportunities for MPI programs. Researchers at home and abroad have proposed many methods and techniques for optimizing MPI programs on multi-core clusters. This work measures the communication performance of three different multi-core clusters and experimentally evaluates, on Intel and AMD multi-core clusters respectively, several generally applicable optimization techniques: hybrid MPI/OpenMP, tuning MPI runtime parameters, and optimizing MPI process placement; the experimental results and the optimization gains are analyzed.

3.
张伟哲  张宏莉  张元竞 《软件学报》2010,21(Z1):238-250
For the problem of predicting the performance of MPI-based parallel jobs, and given the limitations of history-based prediction and analytical modeling in heterogeneous network computing environments, a parallel job performance prediction method based on case construction is proposed. Wrapper functions are inserted at the PMPI interface of the MPI library to capture communication logs, and algorithms for log normalization and merging are designed. The most critical problem, collapsing loops in the log, is transformed into the problem of collapsing repeated cyclic substrings of a string, and a suffix-array-based algorithm is proposed that outperforms existing algorithms both in theory and in measured performance. In the automatic case-program construction stage, the problem of scaling computation time and communication time proportionally is solved, and a method for automatically building an executable case program is designed. Experiments on homogeneous and heterogeneous clusters show that the case-based prediction method estimates job run time fairly accurately, with an error of no more than 3% on homogeneous clusters and no more than 10% on heterogeneous clusters, and it achieves better overall performance than comparable algorithms.
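The communication log is captured by interposing wrapper functions at the MPI profiling (PMPI) interface. A minimal sketch of that mechanism, with an invented log format rather than the authors' actual one:

```c
#include <mpi.h>
#include <stdio.h>

/* PMPI wrapper sketch: intercept MPI_Send, record a log entry, then call
 * the real implementation through the PMPI_ name. Link this object file
 * ahead of the MPI library so it overrides the default symbol. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int rank, size_bytes;
    PMPI_Comm_rank(comm, &rank);
    PMPI_Type_size(datatype, &size_bytes);

    /* Hypothetical log line: source rank, destination, bytes, tag. */
    fprintf(stderr, "SEND %d -> %d bytes=%d tag=%d\n",
            rank, dest, count * size_bytes, tag);

    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```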

4.
吕海  邸瑞华  龚华 《计算机科学》2012,39(1):305-310
By analyzing the performance of an open-source finite element analysis code, implemented with the MPI programming model, on a multi-core cluster platform, the program's bottlenecks and their causes are identified, and the MPI-based parallel program is optimized for multi-core computing environments. Based on the bottleneck analysis, two optimization schemes are proposed: solving the large-scale linear/nonlinear systems of equations with a hybrid MPI/OpenMP parallel programming model, and having multiple threads in multiple processes perform message communication simultaneously. Experiments at different problem sizes show that, on the multi-core cluster, the large-scale nonlinear equation solver implemented with the hybrid MPI/OpenMP model performs 2 to 3 times better than the parallel program implemented with MPI alone; the multi-thread, multi-process concurrent message-passing scheme does improve performance, but it is not the best way to resolve the program's communication bottleneck. The overall analysis of the two schemes shows that parallel programs implemented with the hybrid MPI/OpenMP model exploit the computing power of the hardware better on multi-core cluster platforms.
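The second scheme, several threads of one process exchanging messages at the same time, requires initializing MPI with full thread support. A hedged sketch of the pattern (buffer sizes, thread count, and the pairing of ranks are placeholders, not the solver's actual communication):

```c
#include <mpi.h>
#include <omp.h>

/* Sketch: each OpenMP thread of each MPI process posts its own exchange.
 * Requires MPI_THREAD_MULTIPLE; fallback handling is omitted. */
int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int peer = rank ^ 1;                 /* pair ranks 0-1, 2-3, ... */
    if (peer < nprocs) {
        #pragma omp parallel num_threads(4)
        {
            int tid = omp_get_thread_num();
            double sendbuf[1024] = {0}, recvbuf[1024];
            /* Use the thread id as the tag so messages from different
             * threads are matched independently. */
            MPI_Sendrecv(sendbuf, 1024, MPI_DOUBLE, peer, tid,
                         recvbuf, 1024, MPI_DOUBLE, peer, tid,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}
```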

5.
Modern supercomputers have more and more compute nodes, and each node contains multiple processor cores. Because of differences in interconnect bandwidth, inter-node and intra-node communication form two layers with different communication performance, the latter being better than the former. However, the default process mapping of MPI programs does not take this hierarchy into account and cannot exploit the better intra-node bandwidth, which severely limits the performance a supercomputer can deliver. To address this problem, this paper designs and implements POM, a tool that automatically optimizes the process mapping of MPI programs by exploiting the hierarchical communication differences; it provides an efficient, low-overhead way to obtain the communication information of an MPI program and ultimately improves communication efficiency, and hence application performance, by optimizing how communication is distributed over the communication hierarchy. Three problems are solved: abstracting the communication hierarchy of the hardware platform, obtaining the communication information of MPI programs at low overhead, and computing the mapping scheme. First, according to differences in communication capability, the supercomputer is abstracted into two layers: different compute nodes connected by a high-speed interconnect, and multiple processor cores on the same node. Second, a simple method is proposed to convert collective communication into point-to-point communication. Finally, an undirected graph with weighted edges is used to represent the inter-process communication of the MPI program, turning the process mapping problem into a graph partitioning problem. Experiments on the Dawning 5000A and Dawning 4000A show that the POM tool significantly improves the performance of MPI programs.
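POM computes the mapping externally by partitioning the weighted communication graph. For comparison only, the MPI standard exposes the same idea, describing each rank's weighted communication edges and letting the library reorder ranks, through the distributed graph topology interface; the sketch below uses that standard interface with invented weights and is not part of POM:

```c
#include <mpi.h>

/* Sketch: describe each rank's communication peers and edge weights, and
 * allow the MPI library to reorder ranks to match the hardware hierarchy.
 * (Standard MPI-2.2 interface, not the POM tool itself.) */
MPI_Comm remap_ring(MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int peers[2]   = { (rank - 1 + p) % p, (rank + 1) % p };  /* ring neighbors */
    int weights[2] = { 100, 100 };   /* invented weights (e.g. message volume) */

    MPI_Comm newcomm;
    MPI_Dist_graph_create_adjacent(comm, 2, peers, weights,
                                   2, peers, weights,
                                   MPI_INFO_NULL, 1 /* reorder */, &newcomm);
    return newcomm;   /* ranks in newcomm may differ from ranks in comm */
}
```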

6.
For the structured-grid part of MuSiC-CCASSIM, a two-dimensional/axisymmetric high-order computational fluid dynamics method for compressible multiphase flow, a parallel domain decomposition scheme is designed; for the exchange of boundary data between processors, parallel algorithms based on blocking and on non-blocking communication are designed; and to reduce communication overhead, a hybrid MPI/OpenMP parallel optimization algorithm is designed. Tests were run on the Tianhe-2 supercomputer with a fixed grid size of 625*250 per core, using up to 8,192 cores. The results show that the hybrid MPI/OpenMP version, the pure-MPI non-blocking version, and the pure-MPI blocking version reach average parallel efficiencies of 86%, 83%, and 77%, respectively, and that all three algorithms scale well.
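To illustrate the non-blocking boundary exchange mentioned above, here is a generic 1-D halo exchange sketch (not MuSiC-CCASSIM code; the data layout is assumed):

```c
#include <mpi.h>

/* Non-blocking halo exchange sketch for a 1-D domain decomposition:
 * each rank swaps one layer of ghost cells with its left/right neighbor.
 * N is the number of interior cells owned by this rank (placeholder). */
void exchange_halo(double *u, int N, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];
    /* u[0] and u[N+1] are ghost cells; u[1] and u[N] are boundary cells. */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* Interior cells u[2..N-1] could be updated here, overlapping the
     * communication; the blocking variant would use MPI_Sendrecv instead. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```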

7.
A new MPI Allgather algorithm, the neighbor exchange algorithm, is presented. The proposed concept and formula of average logical communication distance effectively measure the locality of communication. Analysis shows that, among four MPI Allgather algorithms, neighbor exchange and the ring algorithm have the best communication locality. Performance tests and analysis of the four MPI Allgather algorithms on the teraflops-scale clusters DeepComp 6800 and Dawning 4000A show that the neighbor exchange algorithm performs best for long messages, is unstable for medium-to-long messages, and is inferior to recursive doubling and the Bruck algorithm for short messages.
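For reference, a minimal sketch of the ring algorithm, one of the four Allgather algorithms compared (fixed to MPI_INT blocks for brevity; the neighbor exchange algorithm itself is not reproduced here):

```c
#include <mpi.h>
#include <string.h>

/* Ring allgather sketch: each of the p ranks contributes `count` ints,
 * and blocks are passed around the ring p-1 times. */
void ring_allgather(const int *sendbuf, int count, int *recvbuf, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;

    /* Place our own block first. */
    memcpy(recvbuf + rank * count, sendbuf, count * sizeof(int));
    for (int s = 0; s < p - 1; s++) {
        int send_block = (rank - s + p) % p;      /* block we forward this step */
        int recv_block = (rank - s - 1 + p) % p;  /* block arriving from the left */
        MPI_Sendrecv(recvbuf + send_block * count, count, MPI_INT, right, 0,
                     recvbuf + recv_block * count, count, MPI_INT, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```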

8.
To address the problem that existing communication optimization algorithms cannot make MPI auto-parallelizing compilers generate message-passing programs with satisfactory speedup, a communication optimization algorithm based on reordering transformations and loop distribution is proposed. Guided by a given set of interprocedural side effects and by reordering rules for moving mpi_wait/mpi_irecv calls, the algorithm applies reordering transformations and loop distribution in an ordered way to enlarge, as safely as possible, the window in which point-to-point non-blocking communication overlaps with computation, so that the MPI auto-parallelizing compiler generates message-passing code in which more communication is overlapped with computation. Experimental results show that the algorithm hides more of the cost of point-to-point non-blocking communication and noticeably improves the speedup of the generated message-passing programs.
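The effect of the reordering transformation is to move independent computation into the window between mpi_irecv and mpi_wait. A schematic before/after sketch (the loops are placeholders):

```c
#include <mpi.h>

/* Before: the non-blocking receive is completed immediately, so nothing
 * overlaps with the transfer. */
void no_overlap(double *buf, int n, int src, double *a, int m, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Irecv(buf, n, MPI_DOUBLE, src, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* waits right away */
    for (int i = 0; i < m; i++) a[i] *= 2.0;    /* independent work */
}

/* After the reordering transformation: the independent loop, which does not
 * touch buf, sits between MPI_Irecv and MPI_Wait, enlarging the
 * communication/computation overlap window. */
void with_overlap(double *buf, int n, int src, double *a, int m, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Irecv(buf, n, MPI_DOUBLE, src, 0, comm, &req);
    for (int i = 0; i < m; i++) a[i] *= 2.0;    /* overlapped with the receive */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```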

9.
As high-performance computing systems and parallel applications keep growing in scale, the startup time of large parallel jobs can no longer be neglected. Previous work has reported the performance of loading MPI jobs on the Tianhe-1A system. By analyzing the time spent in each phase of job startup, including control message passing, file access, and MPI environment initialization, it is found that for large-scale MPI jobs the environment initialization time is the main startup cost. Based on this finding, several optimizations are applied that reduce the amount of data exchanged during MPI environment initialization and avoid unnecessary data transfer overhead, which significantly improves the performance of parallel job startup. A hierarchical, scalable process management structure is then proposed to further improve the scalability of job startup, and the resulting startup times are compared with the process management mechanisms of other mainstream MPI implementations.

10.
MPI (Message Passing Interface) is the most widely used programming environment on large clusters and grid platforms, and MPICH is its most widely used portable implementation. In cluster systems, communication time depends on many factors, such as the number of nodes, network bandwidth, topology, and software algorithms. Much work so far has studied communication patterns at the program level in order to improve communication efficiency, but the communication time required inside MPICH itself, especially the time spent in job submission, is often overlooked. This paper analyzes the current MPICH job submission method and proposes a series of improved algorithms, including a synchronous binary tree method, an asynchronous binary tree method, and a doubling (recursive-doubling) method, which reduce communication time and optimize communication performance.

11.
Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to support a dynamic process model suitable for use on clusters or distributed systems would have reduced their performance. As current HPC collaborative applications increase in size and distribution, the potential levels of node and network failures increase. This is especially true when MPI implementations are used as the communication media for GRID applications, where the GRID architectures themselves are inherently unreliable, thus requiring new fault-tolerant MPI systems to be developed. Here we present a new implementation of MPI called FT-MPI that allows the semantics and associated modes of failure to be explicitly controlled by an application via a modified MPI API. An overview is given of the FT-MPI semantics, design, example applications, and some performance issues such as efficient group communications and complex data handling. Also briefly described is the HARNESS g_hcore system that handles low-level system operations on behalf of the MPI implementation. This includes details of the plug-in services developed and their interaction with the FT-MPI runtime library.

12.
As the size of High Performance Computing clusters grows, so does the probability of interconnect hot spots that degrade the latency and effective bandwidth the network provides. This paper presents a solution to this scalability problem for real-life constant bisectional-bandwidth fat-tree topologies. It is shown that maximal bandwidth and cut-through latency can be achieved for MPI global collective traffic. To form such a congestion-free configuration, MPI programs should utilize collective communication, MPI-node-order should be topology aware, and the packet routing should match the MPI communication patterns. First, we show that MPI collectives can be classified into unidirectional and bidirectional shifts. Using this property, we propose a scheme for congestion-free routing of the global collectives in fully and partially populated fat trees running a single job. The no-contention result is then obtained for multiple jobs running on the same fat tree by applying some job size and placement restrictions. Simulation results of the proposed routing, MPI-node-order, and communication patterns show no contention, which provides a 40% throughput improvement over previously published results for all-to-all collectives.

13.
As parallel applications grow in complexity, the needs in terms of computational power are continually growing. Recent trends in High-Performance Computing (HPC) have shown that improvements in single-core performance will not be sufficient to face the challenges of an exascale machine: we expect an enormous growth in the number of cores as well as a multiplication of the data volume exchanged across compute nodes. To scale applications up to exascale, the communication layer has to minimize the time spent waiting for network messages. This paper presents a message progression scheme based on Collaborative Polling which allows an efficient, auto-adaptive overlapping of communication phases with computation. This approach is new in that it increases the application's overlap potential without introducing the overheads of a threaded message progression. We designed our approach for InfiniBand in a thread-based MPI runtime called MPC. We evaluate the gain from Collaborative Polling on the NAS Parallel Benchmarks and three scientific applications, where we show significant improvements in communication time, up to a factor of 2.
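Collaborative Polling itself lives inside the MPC runtime; the closest portable, application-level analogue is to test a pending request periodically during computation so the library gets a chance to progress the message. A hedged sketch of that pattern (not the MPC mechanism):

```c
#include <mpi.h>

/* Application-level analogue: occasionally test a pending receive while
 * computing, so the MPI library can progress the message in the meantime.
 * (Collaborative Polling does this transparently inside the MPC runtime.) */
void compute_and_progress(double *a, int m, MPI_Request *req)
{
    int done = 0;
    for (int i = 0; i < m; i++) {
        a[i] = a[i] * a[i] + 1.0;               /* placeholder computation */
        if (!done && (i % 1024) == 0)           /* poll from time to time */
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(req, MPI_STATUS_IGNORE);
}
```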

14.
Users of parallel machines need to have a good grasp of how different communication patterns and styles affect the performance of message-passing applications. LogGP is a simple performance model that reflects the most important parameters required to estimate the communication performance of parallel computers. The Message Passing Interface (MPI) standard provides new opportunities for developing high-performance parallel and distributed applications. In this paper, we use LogGP as a conceptual framework for evaluating the performance of MPI communications on three platforms: the Cray Research T3D, the Convex Exemplar 1600SP, and a network of workstations (NOW). We develop a simple set of communication benchmarks to extract the LogGP parameters. Our objective is to compare the performance of MPI communication on several platforms and to identify a performance model suitable for MPI performance characterization. In particular, two problems are addressed: how LogGP quantifies MPI performance and what extra features are required for modeling MPI, and how MPI performance compares across the three computing platforms: the Cray Research T3D, the Convex Exemplar 1600SP, and workstation clusters.
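For reference, the usual LogGP estimate for a single k-byte point-to-point message (the standard form of the model; the paper's MPI-specific extensions are not reproduced here) is:

```latex
% LogGP cost of one k-byte message:
% o = per-message CPU overhead at sender and receiver, G = gap per byte, L = latency.
T(k) = o + (k - 1)\,G + L + o
```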

15.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.

16.
Comparison of JIAJIA and MPI on PC Clusters
JIAJIA and MPI (Message Passing Interface) are compared. JIAJIA and MPI represent the shared-memory and message-passing programming models, respectively. MPI transfers data explicitly and is complex to program; JIAJIA maintains data consistency in the underlying layer and additionally provides simple message-passing functions, making programming easy and flexible. JIAJIA incurs a relatively large overhead when allocating shared memory, and its initialization time is longer than MPI's. An approximate empirical formula relating parallel speedup to the number of processes is proposed, from which it is concluded that the performance gap between JIAJIA and MPI widens as the number of processes increases. Test results show that for most applications the difference in parallel performance between the JIAJIA and MPI versions is within 10%. For applications with very little communication, the gap between JIAJIA and MPI is small, while for communication-heavy applications the gap mainly depends on the actual communication volume generated at run time.

17.
Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represents a great challenge to developing efficient high-performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics since it considers transient network characteristics when dividing the nodes into clusters. Our experiments with the broadcast operation on a real grid setup indicate that an adaptive framework allows significant performance improvements in MPI collective communications.

18.
Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance, since most communication in MPI applications is performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity of communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and we propose protocol customization, that is, using different protocols in different situations, to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can provide high performance for all protocol invocation scenarios. Armed with this set of protocols that can collectively achieve high performance in all situations, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the performance of the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can outperform traditional communication schemes by a large margin in many situations.

19.
To improve the execution performance of homogeneous applications on a computational grid, a sub-job assignment method is proposed. For compute-intensive applications, inter-task communication is negligible, so such a job is divided into several sub-jobs that are assigned to different clusters; the partitioning is done according to the load balance of the grid. Applications that are not compute-intensive rarely achieve satisfactory performance when run across multiple sites, so such a job is assigned to a single cluster as a whole. To find the most suitable cluster, the processor performance and inter-processor communication performance of each cluster are measured, and the job's run time is predicted with an application performance model. Experiments show that this sub-job assignment method is effective in optimizing the execution performance of homogeneous applications.

20.
With the development of multi-core processors and the continuous growth of computing demand, high-performance computing systems keep growing in scale. Simulating HPC systems plays an important role in system design and optimization, and interconnect network simulation is an indispensable part of it. A large-scale InfiniBand interconnect simulation system based on OMNeT++ is designed and implemented; it drives the network simulation with recorded MPI messages of parallel programs, can reproduce the working state of the interconnect while the program runs, and can be integrated with a message-driven simulator of high-performance computers. Its accuracy is validated by comparison with the inter-node communication latency measured on a real cluster, and its simulation performance is evaluated.
