Found 20 similar documents (search time: 62 ms)
1.
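A New MPI Allgather Algorithm and Its Implementation and Performance Analysis on Teraflops Cluster Systems  Total citations: 4, self-citations: 0, citations by others: 4
This paper presents a new MPI Allgather algorithm, the neighbor exchange algorithm. It introduces the concept of average logical communication distance, together with a formula for computing it, as an effective measure of communication locality. Analysis shows that, among four MPI Allgather algorithms, neighbor exchange and the ring algorithm both have optimal communication locality. Performance tests and analysis of the four algorithms on the teraflops clusters DeepComp 6800 (深腾6800) and Dawning 4000A (曙光4000A) show that neighbor exchange gives the best performance for long messages, unstable performance for medium-to-long messages, and worse performance than recursive doubling and the Bruck algorithm for short messages.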
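The abstract above does not give the steps of neighbor exchange, so the following is only a minimal simulation sketch of the pattern as it is commonly described: an even number of ranks, p/2 steps, and each rank alternating between its two ring neighbors. The pairing scheme, the function names `neighbor_exchange_allgather` and `ring_distance`, and the use of block ids instead of real message buffers are assumptions made for illustration, not the authors' implementation. The `ring_distance` helper is likewise a simple stand-in and not the paper's average logical communication distance formula, which the abstract does not reproduce.

```python
def ring_distance(a, b, p):
    """Distance between ranks a and b on a logical ring of p ranks (illustrative helper)."""
    return min((a - b) % p, (b - a) % p)


def neighbor_exchange_allgather(p):
    """Simulate a neighbor-exchange allgather for an even number of ranks p.

    Returns (blocks, partners):
      blocks[r]      set of block ids held by rank r after the last step
      partners[s][r] rank that r exchanges data with in step s
    """
    if p < 2 or p % 2 != 0:
        raise ValueError("this sketch assumes an even number of ranks >= 2")

    blocks = [{r} for r in range(p)]    # every rank starts with its own block
    outgoing = [{r} for r in range(p)]  # blocks each rank sends in the next step
    partners = []

    for step in range(p // 2):
        if step % 2 == 0:
            # Even steps: pairs (0,1), (2,3), ...
            partner = [r + 1 if r % 2 == 0 else r - 1 for r in range(p)]
        else:
            # Odd steps: pairs (1,2), (3,4), ..., (p-1,0)
            partner = [(r - 1) % p if r % 2 == 0 else (r + 1) % p for r in range(p)]
        partners.append(partner)

        # All exchanges in a step happen simultaneously.
        received = [set(outgoing[partner[r]]) for r in range(p)]
        for r in range(p):
            blocks[r] |= received[r]

        if step == 0:
            # After the first exchange each rank forwards its pair's two blocks.
            outgoing = [set(blocks[r]) for r in range(p)]
        else:
            # Afterwards each rank forwards the two blocks it has just received.
            outgoing = received

    return blocks, partners


if __name__ == "__main__":
    final_blocks, schedule = neighbor_exchange_allgather(6)
    for step, partner in enumerate(schedule):
        print("step", step, "partners", partner)
    print("rank 0 holds", sorted(final_blocks[0]))
```

In this sketch every partner is a direct ring neighbor, which is one way to picture the communication locality the abstract attributes to neighbor exchange and the ring algorithm.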
2.
3.
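As high-performance computers are applied and developed further, parallel applications use more and more processors and the volume of interprocess communication keeps growing, which has a large impact on application performance. While benchmarking Dawning 5000A (曙光5000A) with HFFT, a fast Fourier transform, the authors found that the heavy communication overhead of the MPI collective function MPI_Alltoall is a bottleneck for parallel program design. They therefore tested and analyzed the mainstream Alltoall algorithms on Dawning 5000A and DeepComp 7000 (深腾7000), with the aim of contributing to future Alltoall optimization work. Several Alltoall algorithms were tested with different message lengths and process counts: the 2D mesh, 3D mesh, Bruck, basic, pairwise exchange, recursive doubling, and ring algorithms, and the simple algorithm in LAM/MPI. The experiments show that for short messages the basic and Bruck algorithms perform well on Dawning 5000A, while the simple and Bruck algorithms take the least time on DeepComp 7000; for long messages, the ring algorithm is best on Dawning 5000A and pairwise exchange is best on DeepComp 7000.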
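Of the Alltoall algorithms listed above, pairwise exchange is the simplest to illustrate. The abstract does not describe its schedule, so the sketch below assumes the textbook form: in step k (k = 1 .. p-1) rank r sends one block to rank (r + k) mod p and receives one block from rank (r - k) mod p. The function name `pairwise_exchange_alltoall` and the nested-list buffers are hypothetical and stand in for real MPI send and receive buffers.

```python
def pairwise_exchange_alltoall(send):
    """Simulate an all-to-all exchange with a pairwise-exchange schedule.

    send[r][d] is the block that rank r wants to deliver to rank d.
    Returns (recv, steps) where recv[d][r] is the block rank d received
    from rank r and steps is the number of communication steps used.
    """
    p = len(send)
    recv = [[None] * p for _ in range(p)]

    # Each rank copies its own block locally; no communication is needed.
    for r in range(p):
        recv[r][r] = send[r][r]

    steps = 0
    for k in range(1, p):
        # In step k every rank sends exactly one block and receives exactly one.
        for r in range(p):
            dst = (r + k) % p
            recv[dst][r] = send[r][dst]
        steps += 1

    return recv, steps


if __name__ == "__main__":
    p = 4
    send = [[f"{r}->{d}" for d in range(p)] for r in range(p)]
    recv, steps = pairwise_exchange_alltoall(send)
    print("steps:", steps)
    for d in range(p):
        print("rank", d, "received", recv[d])
```

The schedule uses p-1 steps with one block per rank per step. That trade-off (many steps, each moving a full block directly to its destination) is typically associated with long messages, which fits the DeepComp 7000 result reported above.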
4.
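The domestically developed teraflops high-performance computer KD-50-I features low power consumption, a small footprint, and high integration, and serves as a demonstration for the future development of domestic petaflops computer systems and for strengthening independent innovation. For KD-50-I to become practical, it needs correspondingly efficient communication performance. Given that the network of KD-50-I has a fixed inter-node topology and a simple hierarchy, a simplified LBP communication model is used to analyze and optimize point-to-point and global communication, which is important for the wider adoption of the domestic KD-50-I machine.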
5.
6.
7.
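Analysis of the Current Running Status of High-Performance Application Software on a Teraflops Cluster System  Total citations: 2, self-citations: 0, citations by others: 2
By calling PAPI (Performance Application Programming Interface) functions, the authors traced some of the applications that ran between March and April 2004 on LSSC-II, the teraflops cluster system of a national "973" Program project, and collected a large amount of valuable performance data. Based on these data, they give a preliminary analysis of how high-performance software currently runs in China. The analysis shows that most applications perform at a fairly low level: parallel programs generally use between 1 and 64 processors, average processor efficiency is below 10%, and average performance is below 300 Mflops.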
8.
9.
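The network communication speed of a cluster is an important factor in overall system performance. This paper discusses several common interconnection networks in cluster systems, the types of network communication, the basic metrics for measuring communication performance, and the corresponding measurement schemes.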
10.
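Performance Testing of Web-Based Software  Total citations: 6, self-citations: 0, citations by others: 6
Compared with traditional applications, Web-based software has many new characteristics, which place new demands on software testing. This article studies software performance testing and analyzes what software performance means and how it is evaluated. The results offer useful guidance for improving the performance of Web-based software.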
11.
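Traditional evaluation models for cluster area networks (cLAN) mainly consider factors such as latency, bandwidth, routing, congestion, and network topology. But are these factors sufficient to describe the communication behavior of real applications on a cluster, or to predict their performance on a cluster system well? Extensive tests of the NAS Parallel Benchmark (version 2.4) on the cluster system DeepComp 1800 (深腾1800) showed that the communication performance of a cluster network can be severely affected by one particular communication pattern, the LU pattern. Further study shows that the factor affecting the LU pattern is independent of the factors listed above (latency, bandwidth, routing, congestion, and network topology). It is therefore necessary to re-examine the evaluation model for cluster networks and add a new performance evaluation factor that reflects this newly observed phenomenon. The results suggest that this re-examination will also offer new ideas for parallel algorithm design on cluster systems and for optimizing the performance of real large-scale scientific computing applications.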
12.
This paper describes the implementation of a sizable subset of OpenMP on networks of workstations (NOWs). A source-to-source OpenMP compiler (AutoPar) is used for the JIAJIA home-based shared virtual memory (SVM) system. The paper suggests some simple modifications and extensions to the OpenMP standard to account for the differences between SVM and the symmetric multiprocessor (SMP) systems at which the OpenMP specification is aimed. The OpenMP translator is based on an automatic parallelization compiler, so it can check the semantic correctness of OpenMP programs, which is not required of an OpenMP-compliant implementation. AutoPar is evaluated on five applications, including programs from the NAS Parallel Benchmarks and real applications, on a cluster of eight Pentium Ⅱ PCs connected by 100 Mbps switched Ethernet. The evaluation shows that parallelization by annotating OpenMP directives is simple and that the performance of the generated JIAJIA code is still acceptable on NOWs.
13.
A parallel computation model is an abstraction of the performance characteristics of parallel computers, and should evolve with the development of computational infrastructure. Heterogeneous CPU/Graphics Processing Unit (GPU) systems have been and will remain important platforms for scientific computing, which creates an urgent demand for new parallel computation models targeting this kind of supercomputer. In this research, we propose a parallel computation model called HLognGP to abstract the computation and communication features of heterogeneous platforms like TH‐1A. All the substantial parameters of HLognGP are in vector form and deal with the new features of GPU clusters. A simplified version of the proposed model, HLog3GP, is mapped to a specific GPU cluster and verified with two typical benchmarks. Experimental results show that HLog3GP outperforms the other two evaluated models and can model the new particularities of GPU clusters well. Copyright © 2015 John Wiley & Sons, Ltd.
14.
Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, David Skinner, Stéphane Ethier, Rupak Biswas, Jahed Djomehri, Rob Van der Wijngaart 《Concurrency and Computation》2005,17(1):69-93
The growing gap between sustained and peak performance for scientific applications is a well‐known problem in high‐performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX‐6 vector processor, and compares it against the cache‐based IBM Power3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low‐level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX‐6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache‐based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX‐6 effectively. Copyright © 2005 John Wiley & Sons, Ltd.
15.
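Construction and Performance Testing of a High-Performance Parallel Cluster Computing Environment  Total citations: 10, self-citations: 0, citations by others: 10
High-performance parallel cluster systems play an increasingly important role in large-scale scientific computing. This paper describes the hardware and software setup of a cluster system, and uses common benchmarks to test the system's performance and compare the results.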
16.
This paper describes the ARGUS prototype, a high‐density, low‐power supercomputer built from an IXIA network analyzer chassis and load modules. The prototype is configured as a diskless distributed system that is scalable to 128 processors in a single 9U chassis. The entire system has a footprint of 0.25 m² (2.5 ft²), a volume of 0.09 m³ (3.3 ft³) and maximum power consumption of less than 2200 W. We compare and contrast the characteristics of ARGUS against various machines including our on‐site 32‐node Beowulf and LANL's Green Destiny. Our results show that the computing density (Gflops ft⁻³) of ARGUS is about 30 times higher than that of the Beowulf and about three times higher than that of Green Destiny with a comparable performance. Copyright © 2006 John Wiley & Sons, Ltd.
17.
The rapid rise of OpenMP as the preferred parallel programming paradigm for small‐to‐medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model‐of‐choice for large scale high‐performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc‐NUMA, stems from the lack of capabilities for data placement among processors and threads for achieving data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy‐inside–copy‐back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automatic and implicit approaches. Moreover, the CC approach improves on the OpenMP‐SPMD style of programming, making the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse‐grained parallelism, although a fast copy mechanism is essential. Copyright © 2004 John Wiley & Sons, Ltd.
18.
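Optimization and Performance Evaluation of Parallel Programs  Total citations: 5, self-citations: 1, citations by others: 5
This paper discusses the optimization of parallel programs and argues that it should proceed along three lines: data partitioning, communication optimization, and serial optimization. To address the shortcomings of traditional speedup, the authors propose an optimized speedup model for evaluating the performance of optimized parallel programs. They optimize the NAS benchmark programs MG and FT and use the optimized speedup model to analyze the performance of both programs on an IBM SP2.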
19.
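Research and Implementation of a Clustered NAS Network Backup System  Total citations: 2, self-citations: 1, citations by others: 2
As data security receives increasing attention, the authors design and implement a clustered NAS network backup system on top of a NAS cluster system with distributed services and centralized management. The paper describes the overall design of the system in detail, focusing on the purpose-built NBP (Network Backup Protect) protocol, and presents the corresponding experimental tests and performance analysis.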
20.
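The traditional performance evaluation model for parallel computing is speedup. This paper discusses the shortcomings of speedup and, on that basis, proposes a new performance evaluation model for optimized parallel computing, which the authors call optimized speedup. Optimized speedup is then used to analyze the performance of the NAS benchmark programs MG and FT on an IBM SP2 (66 MHz/wn).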
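For reference, the traditional speedup discussed above is usually defined as follows. The symbols are assumptions introduced here: $T_1$ is the run time on one processor, $T_p$ the run time on $p$ processors, and $E_p$ the parallel efficiency. The abstract does not give the definition of optimized speedup, so it is not reproduced here.

```latex
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p \, T_p}
```

Under this definition, ideal scaling corresponds to $S_p = p$ and $E_p = 1$.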