Similar Literature
1.
This paper presents a class of generalized discrete fast Fourier transform algorithms (HFFT) defined on hexagonal, non-tensor-product domains, together with test runs on a domestically developed 100-teraflops supercomputer (Dawning 5000A). It reports the speedup and scalability of the algorithm in large-scale cluster tests on the Dawning 5000A and shows, through analysis, that HFFT scales well in the massively parallel environment of a domestic supercomputer: with 8192 processor cores, HFFT reaches a speedup of 277. The FFTW package was tested for comparison. The analysis offers preliminary reference points and suggestions for addressing the scalability of other scientific computing programs on domestic 100-teraflops-class clusters.
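For context, if the 277x speedup is taken relative to a single-core run (an assumption; the abstract does not state the baseline), the implied parallel efficiency on 8192 cores is

    E = S / p = 277 / 8192 ≈ 0.034  (about 3.4%)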

2.
Analysis of the Running Status of High-Performance Application Software on a Teraflops Cluster System   (Cited by 2; self-citations 0; by others 2)
Using the PAPI (Performance Application Programming Interface) functions, a subset of the applications running on the LSSC-II teraflops cluster system of the national "973" program was traced between March and April 2004, and a large amount of valuable performance data was collected. Based on these data, a preliminary analysis of the current running status of high-performance software in China is given. The results show that most applications perform at a fairly low level: parallel programs typically use between 1 and 64 processors, average processor efficiency is below 10%, and average performance is below 300 Mflops.
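The abstract does not show the instrumentation itself. As a hypothetical sketch of this kind of PAPI-based measurement (using the classic PAPI high-level counter calls, which are standard PAPI usage rather than code from the paper; newer PAPI releases replace them with a different high-level API), one might wrap a kernel like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* Minimal sketch: count floating-point operations and total cycles
 * around a computational kernel, then report the raw counters. */
int main(void)
{
    int events[2] = { PAPI_FP_OPS, PAPI_TOT_CYC };
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return EXIT_FAILURE;
    }

    /* ... computational kernel to be measured goes here ... */

    if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return EXIT_FAILURE;
    }
    printf("FP ops: %lld, cycles: %lld\n", counts[0], counts[1]);
    return EXIT_SUCCESS;
}
```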

3.
The domestically developed teraflops computer KD-50-I features low power consumption, a small footprint, and high integration, and serves as a demonstration for the future development of domestic petaflops systems and for strengthening independent innovation. For KD-50-I to be practical, it must be matched by efficient communication. Exploiting the fixed inter-node topology and simple hierarchy of the KD-50-I network, a simplified LBP communication model is used to analyze and optimize both point-to-point and collective communication, which is of significance for the wider adoption of the KD-50-I domestic high-performance computer.
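The paper's simplified LBP model is not given in the abstract. As a generic, hypothetical illustration of a latency-bandwidth style cost model (the function names and the log2-round broadcast estimate below are assumptions, not the paper's model):

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical latency-bandwidth estimate of a point-to-point transfer:
 * T(m) = L + m / B, with latency L in seconds, bandwidth B in bytes/s,
 * and message size m in bytes. */
static double p2p_time(double latency_s, double bandwidth_Bps, double msg_bytes)
{
    return latency_s + msg_bytes / bandwidth_Bps;
}

/* A tree-based broadcast over p nodes costs roughly ceil(log2(p)) such steps. */
static double bcast_time(double latency_s, double bandwidth_Bps,
                         double msg_bytes, int nprocs)
{
    int rounds = (int)ceil(log2((double)nprocs));
    return rounds * p2p_time(latency_s, bandwidth_Bps, msg_bytes);
}

int main(void)
{
    double L = 5e-6, B = 125e6;   /* assumed: 5 us latency, ~125 MB/s bandwidth */
    printf("p2p  64 KiB: %.2f us\n", 1e6 * p2p_time(L, B, 65536.0));
    printf("bcast 64 KiB over 64 nodes: %.2f us\n",
           1e6 * bcast_time(L, B, 65536.0, 64));
    return 0;
}
```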

5.
LUNF: A Cluster Job Scheduling Policy Based on Node Failure Characteristics   (Cited by 1; self-citations 0; by others 1)
Good scalability allows the required computing power to be obtained by enlarging a cluster, but as the number of nodes grows, node failure becomes an unavoidable issue for the operation of large-scale cluster systems. Cluster job scheduling, an important part of the cluster operating system software, is responsible for efficient resource management and sensible job scheduling; functionally it can be divided into a job selection policy and a node allocation policy. Based on the characteristics of node failures in cluster systems, a longest-uptime-node-first (LUNF) node allocation policy is proposed. Simulation results show that, compared with random node allocation, LUNF reduces average job response time and average job slowdown by about 10%.
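As a hypothetical sketch of the selection rule implied by the policy name (the node record and its fields below are invented for illustration, not taken from the paper), allocation reduces to sorting candidate nodes by uptime and taking the longest-running ones first:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node record: id and current uptime in seconds. */
typedef struct { int id; double uptime_s; } node_t;

/* Sort in descending order of uptime: longest-uptime node first. */
static int by_uptime_desc(const void *a, const void *b)
{
    double ua = ((const node_t *)a)->uptime_s;
    double ub = ((const node_t *)b)->uptime_s;
    return (ua < ub) - (ua > ub);
}

int main(void)
{
    node_t nodes[] = { {0, 3600.0}, {1, 86400.0}, {2, 7200.0}, {3, 43200.0} };
    int need = 2;   /* number of nodes requested by the job */

    qsort(nodes, 4, sizeof(node_t), by_uptime_desc);
    for (int i = 0; i < need; i++)
        printf("allocate node %d (uptime %.0f s)\n", nodes[i].id, nodes[i].uptime_s);
    return 0;
}
```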

6.
It has been reported that KD-50-I, China's first teraflops high-performance computer built with the domestically developed general-purpose processor "Loongson 2F" and other domestic components, equipment, and technology, was recently completed at the University of Science and Technology of China. Occupying only 0.89 cubic meters, the machine has passed appraisal by an expert committee and marks an important breakthrough in the localization of Chinese high-performance computing. KD-50-I uses a single cabinet integrating more than 330 Loongson 2F processors, with a theoretical peak performance of one teraflops. The hardware system uses the Chinese-designed Loongson 2F processor and a Gigabit Ethernet switch independently developed by Huawei, …

7.
OpenMP is one of the main parallel programming models on modern multi-core cluster systems. It delivers good speedup within a single multi-core CPU, but its poor scalability must be addressed when it is used across a whole cluster. A parallel algorithm for solving non-equilibrium dynamics equations is first designed. On a distributed-shared multi-core cluster, an explicit data-distribution OpenMP approach is adopted: the data are partitioned and assigned to individual OpenMP threads, and data exchange is realized through shared data. The results show that the explicit OpenMP parallel program remains readable while scaling well; on a distributed-shared cluster of quad-core Xeon processors, the parallel numerical solution of the non-equilibrium dynamics equations scales to 1024 CPU cores with a clear parallel speedup.
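The paper's equations and data layout are not shown in the abstract. The sketch below (a 1D stencil with invented array names) only illustrates the general style of explicit per-thread data partitioning in OpenMP, where each thread owns a contiguous block and neighbouring blocks exchange boundary values through the shared array:

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    static double u[N], unew[N];
    for (int i = 0; i < N; i++) u[i] = (double)i;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        /* Explicit block decomposition: thread tid owns indices [lo, hi). */
        int chunk = N / nthreads;
        int lo = tid * chunk;
        int hi = (tid == nthreads - 1) ? N : lo + chunk;

        for (int step = 0; step < 10; step++) {
            /* Boundary "exchange" happens implicitly through the shared array u;
             * the barriers make the neighbours' halo values visible. */
            for (int i = (lo == 0 ? 1 : lo); i < (hi == N ? N - 1 : hi); i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            #pragma omp barrier
            for (int i = (lo == 0 ? 1 : lo); i < (hi == N ? N - 1 : hi); i++)
                u[i] = unew[i];
            #pragma omp barrier
        }
    }
    printf("u[N/2] = %f\n", u[N / 2]);
    return 0;
}
```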

8.
Molecular dynamics simulation codes often run at very low efficiency on modern high-performance computers, achieving only a few percent of peak system performance. This paper optimizes the parallel molecular dynamics program PMD3D on the Lenovo DeepComp 6800 supercomputer. Performance analysis showed that the mutually dependent floating-point operations in the particle interaction force computation severely limit the processor's instruction-level parallelism. We therefore applied a computation-caching technique: large numbers of irregular floating-point computations are buffered and, once a sufficient batch has accumulated, evaluated in vectorized form. After optimization, single-node performance improved more than fourfold, reaching 32.3% of the processor's 5.2 GFlops peak. Finally, parallel performance tests on 256 CPUs across 64 DeepComp 6800 nodes reached 27% of the 1.3 teraflops peak.
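As a hypothetical sketch of the computation-caching idea (the force expression, buffer size, and array names are invented for illustration, not taken from PMD3D), irregular pair interactions are first recorded into contiguous buffers and the floating-point work is then done in one long, vectorizable loop:

```c
#include <math.h>
#include <stdio.h>

#define BUF 4096   /* flush the cache of pending pairs at this size */

static double dx[BUF], dy[BUF], dz[BUF];  /* buffered pair separations      */
static int    ia[BUF], ja[BUF];           /* particle indices of each pair  */
static int    npend = 0;

static double fx[1000], fy[1000], fz[1000];  /* accumulated forces */

/* Flush: one regular loop over all buffered pairs, amenable to vectorization. */
static void flush_pairs(void)
{
    for (int k = 0; k < npend; k++) {
        double r2   = dx[k]*dx[k] + dy[k]*dy[k] + dz[k]*dz[k];
        double inv6 = 1.0 / (r2 * r2 * r2);
        double f    = (48.0 * inv6 * inv6 - 24.0 * inv6) / r2;  /* LJ-like force */
        fx[ia[k]] += f * dx[k];  fx[ja[k]] -= f * dx[k];
        fy[ia[k]] += f * dy[k];  fy[ja[k]] -= f * dy[k];
        fz[ia[k]] += f * dz[k];  fz[ja[k]] -= f * dz[k];
    }
    npend = 0;
}

/* Called from the irregular neighbour-list traversal: just record the pair. */
static void cache_pair(int i, int j, double x, double y, double z)
{
    ia[npend] = i; ja[npend] = j;
    dx[npend] = x; dy[npend] = y; dz[npend] = z;
    if (++npend == BUF)
        flush_pairs();
}

int main(void)
{
    cache_pair(0, 1, 1.1, 0.0, 0.0);
    cache_pair(1, 2, 0.0, 1.2, 0.0);
    flush_pairs();   /* flush whatever remains in the buffer */
    printf("fx[0]=%g fy[1]=%g\n", fx[0], fy[1]);
    return 0;
}
```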

9.
On December 4, at the Lenovo Commercial Technology Development Forum, Lenovo Group officially announced the successful development of DeepComp 7000, the first heterogeneous cluster system in China whose sustained performance exceeds 100 teraflops, with a computing capability of 106.5 teraflops.

10.
To study the properties of metallic materials under extreme conditions, the three-dimensional parallel dislocation dynamics program PDD3D was developed on the JASMIN framework. It integrates the physical schemes and numerical algorithms of discrete dislocation dynamics simulation. By designing efficient distributed data structures, a scalable fast multipole solver, and a ghost-region-based communication scheme for topological operations, the program achieves high performance and good scalability. On 1024 processors, simulations of a physical model containing 30 million dislocation lines show that PDD3D attains 81% parallel efficiency.

11.
QsNet II optimizes interprocessor communication in systems built from standard server building blocks. Its short-message processing unit permits fast injection of small messages, providing ultra-low latency and scalability to thousands of nodes. Thus, in a sense, the high-performance network in a cluster computer is the computer because it largely defines achievable performance, widening the range of the applications a cluster can efficiently execute, as well as defining its scalability, fault tolerance, system software, and overall usability.

12.
Linpack Performance Testing and Analysis of Parallel Cluster Systems   (Cited by 2; self-citations 0; by others 2)
§1. Introduction. In recent years, with advances in computer hardware and software, and especially in the performance of network components, cluster technology has developed continuously. Traditional PVP (Parallel Vector Processor) supercomputers and MPP (Massively Parallel …

13.
Synchronization plays an important role in ensuring data consistency and correctness among threads on multi-core processors. As the number of cores grows, the cost of synchronization also increases. Barrier synchronization is one of the key methods for multi-core synchronization in parallel applications. Software synchronization typically takes thousands of cycles to synchronize multiple cores, and this high-latency, serialized synchronization significantly degrades multi-core program performance. Compared with software barriers, hardware barriers achieve lower synchronization latency, but traditional centralized hardware barriers scale poorly and cannot meet the synchronization needs of many-core processors. For many-core processors, a hierarchical hardware barrier mechanism, HSync, is proposed; it consists of local barrier units and global barrier units that cooperate to deliver fast synchronization at low hardware cost. Experimental results show that, compared with a traditional centralized hardware barrier, the hierarchical mechanism improves many-core system performance by a factor of 1.13 while reducing network traffic by 74%.
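HSync itself is a hardware mechanism; as a software analogue of the same two-level idea (a sketch using POSIX barriers, not the paper's design), threads first synchronize within a group, one leader per group synchronizes globally, and the leaders then release their groups:

```c
#include <pthread.h>
#include <stdio.h>

#define NGROUPS           4
#define THREADS_PER_GROUP 4

static pthread_barrier_t local_bar[NGROUPS];
static pthread_barrier_t global_bar;

/* Two-level barrier: local gather, leaders gather globally, local release. */
static void hierarchical_barrier(int group)
{
    int rc = pthread_barrier_wait(&local_bar[group]);      /* local gather   */
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD)
        pthread_barrier_wait(&global_bar);                  /* leaders gather */
    pthread_barrier_wait(&local_bar[group]);                /* local release  */
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int group = id / THREADS_PER_GROUP;
    for (int step = 0; step < 3; step++)
        hierarchical_barrier(group);
    return NULL;
}

int main(void)
{
    pthread_t tid[NGROUPS * THREADS_PER_GROUP];

    for (int g = 0; g < NGROUPS; g++)
        pthread_barrier_init(&local_bar[g], NULL, THREADS_PER_GROUP);
    pthread_barrier_init(&global_bar, NULL, NGROUPS);

    for (long i = 0; i < NGROUPS * THREADS_PER_GROUP; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NGROUPS * THREADS_PER_GROUP; i++)
        pthread_join(tid[i], NULL);

    printf("all threads passed the hierarchical barrier\n");
    return 0;
}
```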

14.
Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, benchmarking of these processors is necessary to identify possible performance issues. In this paper, a broad range of homogeneous multi-core architectures is investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from the High-Performance Computing Challenge (HPCC), the NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from the UltraSPARC and x86 architectures, including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. Also, the effect of multi-core architectures on cluster performance is examined using a Clovertown-based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. The experimental analysis and observations in this study provide a better understanding of emerging homogeneous multi-core systems.

15.
In this paper we consider the scalability of parallel space‐filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space‐filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptive mesh refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptive mesh refined codes, the mesh is constantly changing resulting in load imbalance among the processors requiring a load‐balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space‐filling curves. Space‐filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors serial generation of space‐filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space‐filling curves quickly in parallel by reducing the generation to integer sorting. Copyright © 2007 John Wiley & Sons, Ltd.
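As a hypothetical sketch of reducing curve generation to integer sorting (using a Morton/Z-order key for brevity rather than the Hilbert key discussed in the paper), each point is assigned an interleaved-bit key and the curve order is obtained by sorting the keys; in a parallel setting, the serial sort below would be replaced by a distributed sort:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) key. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t key = 0;
    for (int b = 0; b < 16; b++) {
        key |= (uint32_t)((x >> b) & 1u) << (2 * b);
        key |= (uint32_t)((y >> b) & 1u) << (2 * b + 1);
    }
    return key;
}

typedef struct { uint16_t x, y; uint32_t key; } pt_t;

static int by_key(const void *a, const void *b)
{
    uint32_t ka = ((const pt_t *)a)->key, kb = ((const pt_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

int main(void)
{
    pt_t pts[] = { {3, 5, 0}, {1, 1, 0}, {6, 2, 0}, {4, 7, 0} };
    int n = 4;

    for (int i = 0; i < n; i++)
        pts[i].key = morton2d(pts[i].x, pts[i].y);
    /* Ordering the points along the curve is now just an integer-key sort. */
    qsort(pts, n, sizeof(pt_t), by_key);

    for (int i = 0; i < n; i++)
        printf("(%u,%u) key=%u\n", (unsigned)pts[i].x, (unsigned)pts[i].y,
               (unsigned)pts[i].key);
    return 0;
}
```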

16.
Parallel computing scalability evaluates the extent to which parallel programs and architectures can effectively utilize increasing numbers of processors. In this paper, we compare a group of existing scalability metrics and evaluation models with an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. To provide insight into dynamic system performance, we have developed an integrated software environment prototype for measuring and evaluating multiprocessor scalability performance, called Scale-Graph. Scale-Graph uses a graphical instrumentation monitor to collect, measure and analyze latency-related data, and to display scalability performance based on various program execution patterns. The graphical software tool is X-Windows based and is currently implemented on standard workstations to analyze performance data of the KSR-1, a hierarchical ring-based shared-memory architecture.

17.
The scalability of the multigrid solvers SMG and BoomerAMG from the high-performance preconditioner library HYPRE is tested and analyzed on several thousand processors of a domestic large-scale parallel computer, yielding a number of conclusions that are instructive for research on linear solver algorithms and for the development of parallel implementation techniques. These conclusions offer guidance for the application and development of linear solvers in the numerical simulation of real, complex physical systems.

18.
Parallel Algorithm Design on Some Distributed Systems   (Cited by 3; self-citations 0; by others 3)
Some testing results on the DAWNING-1000, the Paragon, and a workstation cluster are described in this paper. On the home-made parallel system DAWNING-1000 with 32 computational processors, practical performance of 1.1777 Gflops and 1.58 Gflops has been measured for solving a dense linear system and for matrix multiplication, respectively. The scalability is also investigated. The importance of designing efficient parallel algorithms for evaluating parallel systems is emphasized.

19.
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.
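As a hypothetical sketch of the centralized work-sharing scheme described (task granularity, tags, and the placeholder "work" are assumptions, not the UTS implementation), rank 0 acts as the work server and the other ranks repeatedly request task indices until none remain:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of centralized work sharing: rank 0 is the work server,
 * the other ranks repeatedly request a task index until none remain. */

#define TOTAL_TASKS 100
#define TAG_REQUEST 1
#define TAG_ASSIGN  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                  /* work server */
        int next = 0, done_workers = 0, dummy, task;
        MPI_Status st;
        while (done_workers < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            task = (next < TOTAL_TASKS) ? next++ : -1;   /* -1 = no work left */
            if (task == -1) done_workers++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                     MPI_COMM_WORLD);
        }
    } else {                          /* worker */
        int dummy = 0, task, count = 0;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;
            count++;                  /* ... process task here ... */
        }
        printf("rank %d processed %d tasks\n", rank, count);
    }
    MPI_Finalize();
    return 0;
}
```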

20.
In this paper, a source-to-source parallelizing compiler system, AutoPar, is presented. The system transforms FORTRAN programs into multi-level hybrid MPI/OpenMP parallel programs. Integrated parallel optimizing technologies are utilized extensively to derive an effective program decomposition over the whole program scope. Other features such as synchronization optimization and communication optimization improve the performance scalability of the generated parallel programs at both the intra-node and inter-node levels. The system goes to great lengths to automate parallelization: profiling feedback is used in the performance estimation that underlies automatic program decomposition. Performance results for eight benchmarks from NAS NPB1.0 on an SMP cluster are given, and the speedups are good. Notably, in the experiments the user inserts at most one data-distribution directive and one reduction directive in BT/SP/LU. The compiler is based on ORC, the Open Research Compiler, a powerful compiler infrastructure with such features as robustness, flexibility and efficiency. ORC's strong analysis capability and well-defined infrastructure made the system implementation quite fast.

