Similar literature
18 similar documents found (search time: 171 ms)
1.
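Mapping the MPI (Message Passing Interface) process topology effectively onto the processor topology helps improve the communication performance of MPI programs. Most existing MPI process-mapping approaches consider only point-to-point communication and rarely account for collective communication, because the process topology of collective communication is difficult to obtain. Most current profiling tools consider only the interface semantics of collective functions and ignore their implementation semantics, so they cannot correctly capture the detailed communication among the processes taking part in a collective operation. This paper proposes a set of profiling algorithms that accurately compute the communication volume between every pair of processes participating in a collective communication and present the process topology as a communication matrix. Experiments confirm the correctness of the profiling algorithms and show that the process topology obtained this way improves the results of mapping processes to processor cores.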
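
To make the difference between interface semantics and implementation semantics concrete, here is a minimal sketch that builds two communication matrices for an allgather over `p` processes. It assumes a ring-based implementation, which is one common choice and not necessarily the algorithm the paper profiles; the function names and the `block_bytes` parameter are hypothetical.

```python
def allgather_interface_matrix(p, block_bytes):
    """Interface-level view: every rank appears to send its block to every other rank."""
    return [[block_bytes if src != dst else 0 for dst in range(p)] for src in range(p)]


def allgather_ring_matrix(p, block_bytes):
    """Implementation-level view for an assumed ring algorithm.

    In each of the p - 1 steps, every rank forwards one block to its
    right-hand neighbour, so all traffic sits on neighbour pairs.
    """
    matrix = [[0] * p for _ in range(p)]
    for _step in range(p - 1):
        for rank in range(p):
            matrix[rank][(rank + 1) % p] += block_bytes
    return matrix


def total_bytes(matrix):
    """Sum of all entries, i.e. the total communication volume."""
    return sum(sum(row) for row in matrix)
```

Both matrices carry the same total volume, but the pairs that actually communicate differ, and a mapping algorithm that consumes the matrix depends on exactly that.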

2.
王洁, 衷璐洁, 曾宇. 《计算机科学》 (Computer Science), 2011, 38(10): 281-284
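New features of multi-core processors make the memory hierarchy of multi-core clusters more complex and also open new optimization opportunities for MPI programs. Researchers in China and abroad have proposed many methods and techniques for optimizing MPI programs on multi-core clusters. This paper measures the communication performance of three different multi-core clusters and experimentally evaluates several generally applicable optimization techniques on Intel and AMD multi-core clusters: hybrid MPI/OpenMP, tuning MPI runtime parameters, and optimizing MPI process placement. The experimental results and the optimization gains are also analyzed.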

3.
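Using symmetric multiprocessors (SMPs) as nodes can give embedded clusters a better computing price/performance ratio, but the multiple levels of parallelism and memory also raise problems of memory consistency, scalability, and performance disparity. This paper proposes LESC, a shared-memory-based embedded cluster model. Through a high degree of integration, the model realizes a highly scalable three-level structure, "computing unit / interconnect coherence module / system", and achieves power and cost effectiveness. LESC provides the basic functions of distributed shared memory; its directory-based cache coherence and extended shared-memory mechanism improve on the traditional memory hierarchy, and its "shared-memory virtual network" provides efficient module-level communication, avoids network hardware overhead, and supports MPI programming. Tests on a real system platform built on this model show that intra-module MPI communication performance is more than 3 times that of a traditional embedded cluster, inter-unit communication performance reaches more than 86% of intra-unit performance, and under the Linpack benchmark the scaling performance is close to 70% of the ideal value in the worst case.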

4.
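Based on the structured-grid part of MuSiC-CCASSIM, a high-accuracy two-dimensional/axisymmetric computational fluid dynamics method for compressible multiphase flow, a parallel domain-decomposition method is designed. For communicating boundary data between processors, parallel algorithms using blocking and non-blocking communication are designed; to reduce communication overhead, a hybrid MPI/OpenMP parallel optimization algorithm is designed. Tests were run on the Tianhe-2 supercomputer with a fixed grid size of 625 × 250 per core and up to 8192 cores. The test data show that programs using the hybrid MPI/OpenMP algorithm, the pure-MPI non-blocking algorithm, and the pure-MPI blocking algorithm reach average parallel efficiencies of 86%, 83%, and 77%, respectively, and that all three algorithms scale well.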

5.
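New features of multi-core processors make the memory hierarchy of multi-core clusters more complex and also open new optimization opportunities for MPI programs. Researchers in China and abroad have proposed many methods and techniques for optimizing MPI programs on multi-core clusters. This paper measures the communication performance of three different multi-core clusters and experimentally evaluates several generally applicable optimization techniques on Intel and AMD multi-core clusters: hybrid MPI/OpenMP, tuning MPI runtime parameters, and optimizing MPI process placement. The experimental results and the optimization gains are also analyzed.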

6.
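As high-performance computers are applied and developed, parallel applications use more and more processors and the volume of inter-process communication keeps growing, which strongly affects application performance. While benchmarking the Dawning 5000A (曙光5000A) with HFFT, a fast Fourier transform, it was found that the large communication overhead of the MPI collective function MPI_Alltoall is the bottleneck of the parallel program. The mainstream Alltoall algorithms were therefore tested and analyzed on the Dawning 5000A and the DeepComp 7000 (深腾7000), with the aim of informing future optimization of Alltoall algorithms. The performance of several Alltoall algorithms was measured with different message lengths and process counts, including the 2D mesh algorithm, the 3D mesh algorithm, the Bruck algorithm, the original algorithm, the pairwise exchange algorithm, the recursive doubling algorithm, the ring algorithm, and the simple algorithm in LAM/MPI. The results show that for short messages the original algorithm and the Bruck algorithm perform well on the Dawning 5000A, while the simple algorithm and the Bruck algorithm take the least time on the DeepComp 7000; for long messages, the ring algorithm is best on the Dawning 5000A and pairwise exchange is best on the DeepComp 7000.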
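
As an illustration of one of the algorithms named in this abstract, the sketch below generates the partner schedule of a pairwise-exchange all-to-all without calling any MPI library. It assumes the commonly described formulation (XOR partners when the process count is a power of two, shifted partners otherwise); the paper's exact variant is not given in the abstract, and the function name is hypothetical.

```python
def pairwise_exchange_schedule(p):
    """Return, for each of the p - 1 steps, a list of (rank, send_to, recv_from).

    Assumed formulation: with a power-of-two process count each rank
    exchanges with rank XOR step; otherwise it sends to (rank + step) % p
    and receives from (rank - step) % p.
    """
    is_power_of_two = p > 0 and (p & (p - 1)) == 0
    schedule = []
    for step in range(1, p):
        step_pairs = []
        for rank in range(p):
            if is_power_of_two:
                send_to = recv_from = rank ^ step
            else:
                send_to = (rank + step) % p
                recv_from = (rank - step) % p
            step_pairs.append((rank, send_to, recv_from))
        schedule.append(step_pairs)
    return schedule
```

Each rank talks to exactly one partner per step, which is why this scheme tends to suit long messages: the full message to a given destination goes out in a single transfer.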

7.
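Dataflow programming languages are domain-specific languages that separate computation from communication and expose the parallelism of an application. The complexity of low-level resources in multi-core clusters, such as computation, storage, and communication, poses new challenges for the performance of dataflow programs. To address the low resource utilization and poor scalability of dataflow programs running on multi-core clusters, this paper uses the synchronous dataflow graph as the intermediate representation and proposes and implements a hierarchical pipeline parallel optimization method for multi-core clusters. The method comprises task partitioning and scheduling, hierarchical pipeline scheduling, and data locality optimization, and after compiler optimization it generates MPI-based target code that can run in parallel. Task partitioning and scheduling exploits the data and task parallelism in the program to map tasks onto compute cores, achieving load balance and low communication and synchronization overhead; hierarchical pipeline scheduling exploits the parallelism in the program to construct a low-latency pipeline schedule; data locality optimization is a memory-oriented optimization targeting the cache false sharing that occurs in data accesses. The experiments use a cluster of x86 multi-core processors as the platform and typical algorithms from the media-processing domain as test programs to analyze the hierarchical pipeline optimization. The results demonstrate the effectiveness of the optimization method.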

8.
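To address the lack of a general computation model for parallel computer architectures, several typical existing models are analyzed and compared in terms of synchronization, communication style, and parameters, and an improved model, mzLogGP, is proposed on the basis of the LogGP model. MPI parallel algorithms are used to analyze and test parallel programs under conditions where node computing resources are not exclusive and the network is congested. By adding two parameters, the number of memory hierarchy levels and a network congestion index, the model calculates computation and communication overhead. Comparing measured times with predicted times shows that the systematic error decreases steadily as the number of nodes increases, which indicates that the new model can improve the performance of parallel applications running on multi-core cluster platforms and scales well.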

9.
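This work studies how to mitigate the performance degradation of MPI programs caused by communication conflicts in InfiniBand clusters. From a system-management perspective, it proposes optimizing the MPI job-loading scheme by changing the process mapping, thereby improving the communication performance of applications. A metric for evaluating MPI job-loading schemes, the communication performance loss coefficient (CPLR), is designed, a search algorithm for optimizing the loading scheme is designed on the basis of simulated annealing, and the proposed metric and algorithm are implemented and tested. The test results show that MPI programs loaded with the optimized scheme achieve a certain degree of improvement in communication performance.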
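
The general idea of searching for a process mapping with simulated annealing can be sketched as follows. The abstract does not define CPLR, so this sketch substitutes a much simpler, hypothetical cost (total traffic between processes placed on different nodes); the function names, the cooling parameters, and the symmetric `traffic` matrix are all assumptions made for illustration.

```python
import math
import random


def inter_node_traffic(mapping, traffic, cores_per_node):
    """Stand-in cost (not CPLR): traffic between processes on different nodes."""
    total = 0
    n = len(mapping)
    for a in range(n):
        for b in range(a + 1, n):
            if mapping[a] // cores_per_node != mapping[b] // cores_per_node:
                total += traffic[a][b]
    return total


def anneal_mapping(traffic, cores_per_node, steps=2000, t0=10.0, alpha=0.995, seed=0):
    """Search for a process-to-core mapping by swapping two processes per step."""
    rng = random.Random(seed)
    n = len(traffic)
    mapping = list(range(n))  # process i starts on core i
    cost = inter_node_traffic(mapping, traffic, cores_per_node)
    best, best_cost = mapping[:], cost
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        mapping[a], mapping[b] = mapping[b], mapping[a]
        new_cost = inter_node_traffic(mapping, traffic, cores_per_node)
        # Always accept improvements; accept a worse mapping with a
        # probability that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = mapping[:], cost
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # undo the swap
        temp *= alpha
    return best, best_cost
```

A call such as `anneal_mapping(traffic, cores_per_node=2)` returns the best mapping found and its cost; swapping in a different cost function only requires replacing `inter_node_traffic`.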

10.
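Solving the generalized dense symmetric eigenproblem is a central task in many areas of applied science and engineering, and an important part of computations in computational electromagnetics, electronic structure, finite element models, and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is the key computational step in solving it. For GPU clusters, this paper presents an MPI+CUDA implementation of a blocked algorithm for this reduction to standard form. To suit the GPU cluster architecture, the algorithm combines the Cholesky factorization of the positive definite matrix with the traditional blocked reduction algorithm, which reduces unnecessary communication overhead in the reduction and increases the parallelism of the algorithm. In the MPI+CUDA reduction algorithm, data transfers between the GPU and the CPU are used to hide data-copy operations within the GPU, which eliminates the time spent on copying and thus improves program performance. The paper also presents a fully parallel point-to-point transpose algorithm between the row and column communicators of a two-dimensional communication grid, and an MPI+CUDA parallel blocked algorithm for solving the triangular matrix equation BX=A with multiple right-hand sides. On the supercomputer system "元" at the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node has 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, the MPI+CUDA reduction algorithm was tested on matrices of different sizes using up to 32 GPUs; it achieved good speedup and performance and scaled well. When 32 GPUs were used on a matrix of order 50000×50000, the peak performance reached about 9.21 Tflops.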

11.
Considering application behavior in graph partitioning is an arduous task because of the chicken-and-egg problem: the application behavior depends on how the graph is decomposed, while achieving load balance requires knowledge of how the application utilizes the underlying resources. Advances in multi-core processors further complicate the endeavor by introducing hardware diversity and intra-node contention. As an attempt to quantify performance for partitioning refinement, we propose a model that predicts execution times of iterative mesh-based applications running on heterogeneous multi-core clusters. Apart from considering resource heterogeneity, the model takes into account hierarchical communication characteristics, overlap between computation and communication, as well as performance penalties due to intra-node contention. We present a detailed methodology on how to obtain key parameters from a real system and highlight potential pitfalls of conventional approaches in obtaining the parameters. Experiments were conducted using a synthetic application benchmark solving a partial differential equation. The evaluation shows good agreement between measured execution times and the performance model.

12.
Nowadays, high-performance applications exploit multi-level architectures, due to the presence of hardware accelerators like GPUs inside each computing node. Data transfers occur at two different levels: inside the computing node between the CPU and the accelerators, and between computing nodes. We consider the case where the intra-node parallelism is handled with HMPP compiler directives and message-passing programming with MPI is used to program the inter-node communications. This way of programming on such a heterogeneous architecture is costly and error-prone. In this paper, we specifically demonstrate the transformation of HMPP programs designed to exploit a single computing node equipped with a GPU into a heterogeneous HMPP + MPI program exploiting multiple GPUs located on different computing nodes.

13.
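As a new-generation big-data stream computing framework, Heron ignores the differences between the communication modes used among task instances and the imbalance in node resource utilization, which degrades system performance. To address this problem, a node resource constraint model, a communication overhead optimization model, and an instance dataflow relationship model are designed, and on this basis a transmission load optimization strategy based on instance reallocation in Heron (TLIR-Heron) is proposed. The strategy comprises a node resource constraint algorithm and an instance reallocation algorithm; by checking the conditions for instance reallocation and running the reallocation algorithm, it converts inter-node dataflows into intra-node dataflows and thereby reduces communication overhead. Experimental results show that, across three test topologies, TLIR-Heron reduces inter-node communication overhead and the computing latency of the system compared with Heron's default scheduling strategy, and improves the balance of resource utilization across compute nodes.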

14.
This paper proposes a scheduling algorithm for task scheduling in a cloud computing system with time-varying communication conditions. The algorithm converts the scheduling problem with communication changes into a directed acyclic graph (DAG) scheduling problem in which fuzzy communication task nodes exist, that is, the scheduling problem for a communication-change DAG (CC-DAG). The CC-DAG contains both computation task nodes and communication task nodes. First, the paper proposes a weighted time-series network bandwidth model to resolve the indefinite processing time (cost) of a fuzzy communication task node. This model can accurately predict the processing time of a fuzzy communication task node. Second, to determine the scheduling order of the computation task nodes, a dynamic pre-scheduling search strategy (DPSS) is proposed. This strategy computes the essential paths for the pre-scheduling of the computation task nodes based on the actual computation costs (times) of the computation task nodes and the predicted processing costs (times) of the fuzzy communication task nodes during the scheduling process. The computation task node with the longest essential path is scheduled first because its completion time directly influences the completion time of the task graph. Finally, the proposed algorithm is demonstrated through simulation experiments. The experimental results show that the proposed DPSS improves total execution time by between 11.5% and 21.2%. In view of these results, the proposed algorithm provides a better-quality scheduling solution for scientific application task execution in the cloud computing environment than the HEFT, PEFT, and CEFT algorithms.

15.
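To address the high communication overhead that traditional task-partitioning methods incur in the task-assignment phase of parallel computation on three-dimensional meshes, a parallel task-assignment strategy based on the multilevel k-way partitioning algorithm is proposed. The 3D mesh is first partitioned with the multilevel k-way partitioning algorithm, turning the task-partitioning problem into a graph-partitioning problem; then, based on the graph-partitioning result, a parallel task-mapping algorithm assigns the computing tasks to the compute nodes. Experiments solving the shortest-path problem on 3D mesh models on the DeepComp 1800 (深腾1800) show that, compared with the traditional row/column partitioning strategy, this strategy effectively reduces communication overhead while preserving load balance, shortens the running time of the algorithm, and improves the speedup.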

16.
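Parallel computing is one of the important directions in the development of computer technology. Current parallel programming models are mainly of two kinds: the message-passing model and the shared-memory model. With the development of multi-core technology, in which two or more complete computing engines (cores) are integrated into one multi-core processor, making full use of the characteristics of multi-core computers and exploiting their performance has become an important research direction. This paper introduces a new MPI implementation mechanism that combines the advantages of the shared-memory and message-passing models: it uses the shared-memory model within a node and the message-passing model between nodes, and it obtains better performance by automatically generating thread-level tasks.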

17.
Partitioning graphs into equally large groups of nodes while minimizing the number of edges between different groups is an extremely important problem in parallel computing. For instance, efficiently parallelizing several scientific and engineering applications requires the partitioning of data or tasks among processors such that the computational load on each node is roughly the same, while communication is minimized. Obtaining exact solutions is computationally intractable, since graph partitioning is NP-complete. For a large class of irregular and adaptive data parallel applications (such as adaptive graphs), the computational structure changes from one phase to another in an incremental fashion. In incremental graph-partitioning problems the partitioning of the graph needs to be updated as the graph changes over time; a small number of nodes or edges may be added or deleted at any given instant. In this paper, we use a linear programming-based method to solve the incremental graph-partitioning problem. All the steps used by our method are inherently parallel and hence our approach can be easily parallelized. By using an initial solution for the graph partitions derived from recursive spectral bisection-based methods, our methods can achieve repartitioning at considerably lower cost than can be obtained by applying recursive spectral bisection. Further, the quality of the partitioning achieved is comparable to that achieved by applying recursive spectral bisection to the incremental graphs from scratch.

18.
The synchronous model of computation is well suited for real-time systems, because it allows static analysis in order to find and guarantee their reaction times. Today's multi-core systems are becoming the predominant computing platforms. Synchronous programs are typically compiled into single-threaded code, which makes them unsuitable for exploiting the parallelism of multi-core platforms. Moreover, static timing analysis becomes highly intractable for multi-core systems. This article proposes a novel methodology that aims at finding the mapping and schedule of synchronous programs that statically guarantees reaction times when mapped onto a multi-core system consisting of two types of time-predictable cores. The proposed methodology combines design space exploration based on an evolutionary algorithm with scheduling of parts of synchronous programs. It allows minimizing the resource usage in terms of number of cores by finding the mapping and schedule with the guaranteed reaction time for architectures with different numbers of cores. In particular, we: (a) transform a synchronous program written in synchronous SystemJ to a graph-based model represented with two types of computation nodes suitable for execution on two types of time-predictable cores, (b) perform mapping of computation nodes on a customizable multi-core platform using genetic operations, and (c) generate a resulting static schedule of computation nodes for each mapping as part of the design space exploration. The design flow, from program specification and node mapping to the design space exploration and multi-core scheduling, is completely automated.
