Similar Literature
1.
This paper presents a class of generalized discrete fast Fourier transform algorithms (HFFT) defined on hexagonal, non-tensor-product domains, together with test runs on a domestically developed 100-teraflops supercomputer (Dawning 5000A). It reports the speedup and scalability of the algorithm in large-scale cluster tests on the Dawning 5000A and shows, through analysis, that HFFT scales well in the massively parallel environment of a domestic supercomputer: with 8192 processor cores, HFFT reaches a speedup of 277. The FFTW package was tested for comparison. The analysis offers preliminary reference points and suggestions for addressing the scalability of other scientific computing programs on domestic 100-teraflops-class clusters.
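For context, if the 277x speedup is taken relative to a single-core run (an assumption; the abstract does not state the baseline), the implied parallel efficiency on 8192 cores is

    E = S / p = 277 / 8192 ≈ 0.034  (about 3.4%)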

2.
Analysis of the Running Status of High-Performance Application Software on a Teraflops Cluster System   (Cited by 2; self-citations 0; by others 2)
Using the PAPI (Performance Application Programming Interface) functions, a subset of the applications running on the LSSC-II teraflops cluster system of the national "973" program was traced between March and April 2004, and a large amount of valuable performance data was collected. Based on these data, a preliminary analysis of the current running status of high-performance software in China is given. The results show that most applications perform at a fairly low level: parallel programs typically use between 1 and 64 processors, average processor efficiency is below 10%, and average performance is below 300 Mflops.
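The abstract does not show the instrumentation itself. As a hypothetical sketch of this kind of PAPI-based measurement (using the classic PAPI high-level counter calls, which are standard PAPI usage rather than code from the paper; newer PAPI releases replace them with a different high-level API), one might wrap a kernel like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* Minimal sketch: count floating-point operations and total cycles
 * around a computational kernel, then report the raw counters. */
int main(void)
{
    int events[2] = { PAPI_FP_OPS, PAPI_TOT_CYC };
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return EXIT_FAILURE;
    }

    /* ... computational kernel to be measured goes here ... */

    if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return EXIT_FAILURE;
    }
    printf("FP ops: %lld, cycles: %lld\n", counts[0], counts[1]);
    return EXIT_SUCCESS;
}
```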

3.
The domestically developed teraflops computer KD-50-I features low power consumption, a small footprint, and high integration, and serves as a demonstration for the future development of domestic petaflops systems and for strengthening independent innovation. For KD-50-I to be practical, it must be matched by efficient communication. Exploiting the fixed inter-node topology and simple hierarchy of the KD-50-I network, a simplified LBP communication model is used to analyze and optimize both point-to-point and collective communication, which is of significance for the wider adoption of the KD-50-I domestic high-performance computer.
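The paper's simplified LBP model is not given in the abstract. As a generic, hypothetical illustration of a latency-bandwidth style cost model (the function names and the log2-round broadcast estimate below are assumptions, not the paper's model):

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical latency-bandwidth estimate of a point-to-point transfer:
 * T(m) = L + m / B, with latency L in seconds, bandwidth B in bytes/s,
 * and message size m in bytes. */
static double p2p_time(double latency_s, double bandwidth_Bps, double msg_bytes)
{
    return latency_s + msg_bytes / bandwidth_Bps;
}

/* A tree-based broadcast over p nodes costs roughly ceil(log2(p)) such steps. */
static double bcast_time(double latency_s, double bandwidth_Bps,
                         double msg_bytes, int nprocs)
{
    int rounds = (int)ceil(log2((double)nprocs));
    return rounds * p2p_time(latency_s, bandwidth_Bps, msg_bytes);
}

int main(void)
{
    double L = 5e-6, B = 125e6;   /* assumed: 5 us latency, ~125 MB/s bandwidth */
    printf("p2p  64 KiB: %.2f us\n", 1e6 * p2p_time(L, B, 65536.0));
    printf("bcast 64 KiB over 64 nodes: %.2f us\n",
           1e6 * bcast_time(L, B, 65536.0, 64));
    return 0;
}
```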

5.
LUNF: A Cluster Job Scheduling Policy Based on Node Failure Characteristics   (Cited by 1; self-citations 0; by others 1)
Good scalability allows the required computing power to be obtained by enlarging a cluster, but as the number of nodes grows, node failure becomes an unavoidable issue for the operation of large-scale cluster systems. Cluster job scheduling, an important part of the cluster operating system software, is responsible for efficient resource management and sensible job scheduling; functionally it can be divided into a job selection policy and a node allocation policy. Based on the characteristics of node failures in cluster systems, a longest-uptime-node-first (LUNF) node allocation policy is proposed. Simulation results show that, compared with random node allocation, LUNF reduces average job response time and average job slowdown by about 10%.
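As a hypothetical sketch of the selection rule implied by the policy name (the node record and its fields below are invented for illustration, not taken from the paper), allocation reduces to sorting candidate nodes by uptime and taking the longest-running ones first:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node record: id and current uptime in seconds. */
typedef struct { int id; double uptime_s; } node_t;

/* Sort in descending order of uptime: longest-uptime node first. */
static int by_uptime_desc(const void *a, const void *b)
{
    double ua = ((const node_t *)a)->uptime_s;
    double ub = ((const node_t *)b)->uptime_s;
    return (ua < ub) - (ua > ub);
}

int main(void)
{
    node_t nodes[] = { {0, 3600.0}, {1, 86400.0}, {2, 7200.0}, {3, 43200.0} };
    int need = 2;   /* number of nodes requested by the job */

    qsort(nodes, 4, sizeof(node_t), by_uptime_desc);
    for (int i = 0; i < need; i++)
        printf("allocate node %d (uptime %.0f s)\n", nodes[i].id, nodes[i].uptime_s);
    return 0;
}
```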

6.
It has been reported that KD-50-I, China's first teraflops high-performance computer built with the domestically developed general-purpose processor "Loongson 2F" and other domestic components, equipment, and technology, was recently completed at the University of Science and Technology of China. Occupying only 0.89 cubic meters, the machine has passed appraisal by an expert committee and marks an important breakthrough in the localization of Chinese high-performance computing. KD-50-I uses a single cabinet integrating more than 330 Loongson 2F processors, with a theoretical peak performance of one teraflops. The hardware system uses the Chinese-designed Loongson 2F processor and a Gigabit Ethernet switch independently developed by Huawei, …

7.
OpenMP is one of the main parallel programming models on modern multi-core cluster systems. It delivers good speedup within a single multi-core CPU, but its poor scalability must be addressed when it is used across a whole cluster. A parallel algorithm for solving non-equilibrium dynamics equations is first designed. On a distributed-shared multi-core cluster, an explicit data-distribution OpenMP approach is adopted: the data are partitioned and assigned to individual OpenMP threads, and data exchange is realized through shared data. The results show that the explicit OpenMP parallel program remains readable while scaling well; on a distributed-shared cluster of quad-core Xeon processors, the parallel numerical solution of the non-equilibrium dynamics equations scales to 1024 CPU cores with a clear parallel speedup.
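The paper's equations and data layout are not shown in the abstract. The sketch below (a 1D stencil with invented array names) only illustrates the general style of explicit per-thread data partitioning in OpenMP, where each thread owns a contiguous block and neighbouring blocks exchange boundary values through the shared array:

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    static double u[N], unew[N];
    for (int i = 0; i < N; i++) u[i] = (double)i;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        /* Explicit block decomposition: thread tid owns indices [lo, hi). */
        int chunk = N / nthreads;
        int lo = tid * chunk;
        int hi = (tid == nthreads - 1) ? N : lo + chunk;

        for (int step = 0; step < 10; step++) {
            /* Boundary "exchange" happens implicitly through the shared array u;
             * the barriers make the neighbours' halo values visible. */
            for (int i = (lo == 0 ? 1 : lo); i < (hi == N ? N - 1 : hi); i++)
                unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
            #pragma omp barrier
            for (int i = (lo == 0 ? 1 : lo); i < (hi == N ? N - 1 : hi); i++)
                u[i] = unew[i];
            #pragma omp barrier
        }
    }
    printf("u[N/2] = %f\n", u[N / 2]);
    return 0;
}
```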

8.
Molecular dynamics simulation codes often run at very low efficiency on modern high-performance computers, achieving only a few percent of peak system performance. This paper optimizes the parallel molecular dynamics program PMD3D on the Lenovo DeepComp 6800 supercomputer. Performance analysis showed that the mutually dependent floating-point operations in the particle interaction force computation severely limit the processor's instruction-level parallelism. We therefore applied a computation-caching technique: large numbers of irregular floating-point computations are buffered and, once a sufficient batch has accumulated, evaluated in vectorized form. After optimization, single-node performance improved more than fourfold, reaching 32.3% of the processor's 5.2 GFlops peak. Finally, parallel performance tests on 256 CPUs across 64 DeepComp 6800 nodes reached 27% of the 1.3 teraflops peak.
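As a hypothetical sketch of the computation-caching idea (the force expression, buffer size, and array names are invented for illustration, not taken from PMD3D), irregular pair interactions are first recorded into contiguous buffers and the floating-point work is then done in one long, vectorizable loop:

```c
#include <math.h>
#include <stdio.h>

#define BUF 4096   /* flush the cache of pending pairs at this size */

static double dx[BUF], dy[BUF], dz[BUF];  /* buffered pair separations      */
static int    ia[BUF], ja[BUF];           /* particle indices of each pair  */
static int    npend = 0;

static double fx[1000], fy[1000], fz[1000];  /* accumulated forces */

/* Flush: one regular loop over all buffered pairs, amenable to vectorization. */
static void flush_pairs(void)
{
    for (int k = 0; k < npend; k++) {
        double r2   = dx[k]*dx[k] + dy[k]*dy[k] + dz[k]*dz[k];
        double inv6 = 1.0 / (r2 * r2 * r2);
        double f    = (48.0 * inv6 * inv6 - 24.0 * inv6) / r2;  /* LJ-like force */
        fx[ia[k]] += f * dx[k];  fx[ja[k]] -= f * dx[k];
        fy[ia[k]] += f * dy[k];  fy[ja[k]] -= f * dy[k];
        fz[ia[k]] += f * dz[k];  fz[ja[k]] -= f * dz[k];
    }
    npend = 0;
}

/* Called from the irregular neighbour-list traversal: just record the pair. */
static void cache_pair(int i, int j, double x, double y, double z)
{
    ia[npend] = i; ja[npend] = j;
    dx[npend] = x; dy[npend] = y; dz[npend] = z;
    if (++npend == BUF)
        flush_pairs();
}

int main(void)
{
    cache_pair(0, 1, 1.1, 0.0, 0.0);
    cache_pair(1, 2, 0.0, 1.2, 0.0);
    flush_pairs();   /* flush whatever remains in the buffer */
    printf("fx[0]=%g fy[1]=%g\n", fx[0], fy[1]);
    return 0;
}
```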

9.
On December 4, at the Lenovo Commercial Technology Development Forum, Lenovo Group officially announced the successful development of DeepComp 7000, the first heterogeneous cluster system in China whose sustained performance exceeds 100 teraflops, with a computing capability of 106.5 teraflops.

10.
To study the properties of metallic materials under extreme conditions, the three-dimensional parallel dislocation dynamics program PDD3D was developed on the JASMIN framework. It integrates the physical schemes and numerical algorithms of discrete dislocation dynamics simulation. By designing efficient distributed data structures, a scalable fast multipole solver, and a ghost-region-based communication scheme for topological operations, the program achieves high performance and good scalability. On 1024 processors, simulations of a physical model containing 30 million dislocation lines show that PDD3D attains 81% parallel efficiency.

11.
QsNet II optimizes interprocessor communication in systems built from standard server building blocks. Its short-message processing unit permits fast injection of small messages, providing ultra-low latency and scalability to thousands of nodes. Thus, in a sense, the high-performance network in a cluster computer is the computer because it largely defines achievable performance, widening the range of the applications a cluster can efficiently execute, as well as defining its scalability, fault tolerance, system software, and overall usability.

12.
Linpack Performance Testing and Analysis of Parallel Cluster Systems   (Cited by 2; self-citations 0; by others 2)
§1. Introduction. In recent years, with advances in computer hardware and software, and especially in the performance of network components, cluster technology has developed continuously. Traditional PVP (Parallel Vector Processor) supercomputers and MPP (Massively Parallel …

13.
Synchronization plays an important role in ensuring data consistency and correctness among threads on multi-core processors. As the number of cores grows, the cost of synchronization also increases. Barrier synchronization is one of the key methods for multi-core synchronization in parallel applications. Software synchronization typically takes thousands of cycles to synchronize multiple cores, and this high-latency, serialized synchronization significantly degrades multi-core program performance. Compared with software barriers, hardware barriers achieve lower synchronization latency, but traditional centralized hardware barriers scale poorly and cannot meet the synchronization needs of many-core processors. For many-core processors, a hierarchical hardware barrier mechanism, HSync, is proposed; it consists of local barrier units and global barrier units that cooperate to deliver fast synchronization at low hardware cost. Experimental results show that, compared with a traditional centralized hardware barrier, the hierarchical mechanism improves many-core system performance by a factor of 1.13 while reducing network traffic by 74%.
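HSync itself is a hardware mechanism; as a software analogue of the same two-level idea (a sketch using POSIX barriers, not the paper's design), threads first synchronize within a group, one leader per group synchronizes globally, and the leaders then release their groups:

```c
#include <pthread.h>
#include <stdio.h>

#define NGROUPS           4
#define THREADS_PER_GROUP 4

static pthread_barrier_t local_bar[NGROUPS];
static pthread_barrier_t global_bar;

/* Two-level barrier: local gather, leaders gather globally, local release. */
static void hierarchical_barrier(int group)
{
    int rc = pthread_barrier_wait(&local_bar[group]);      /* local gather   */
    if (rc == PTHREAD_BARRIER_SERIAL_THREAD)
        pthread_barrier_wait(&global_bar);                  /* leaders gather */
    pthread_barrier_wait(&local_bar[group]);                /* local release  */
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    int group = id / THREADS_PER_GROUP;
    for (int step = 0; step < 3; step++)
        hierarchical_barrier(group);
    return NULL;
}

int main(void)
{
    pthread_t tid[NGROUPS * THREADS_PER_GROUP];

    for (int g = 0; g < NGROUPS; g++)
        pthread_barrier_init(&local_bar[g], NULL, THREADS_PER_GROUP);
    pthread_barrier_init(&global_bar, NULL, NGROUPS);

    for (long i = 0; i < NGROUPS * THREADS_PER_GROUP; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NGROUPS * THREADS_PER_GROUP; i++)
        pthread_join(tid[i], NULL);

    printf("all threads passed the hierarchical barrier\n");
    return 0;
}
```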

14.
Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, benchmarking of these processors is necessary to identify possible performance issues. In this paper, a broad range of homogeneous multi-core architectures is investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from the High-Performance Computing Challenge (HPCC), the NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from the UltraSPARC and x86 architectures, including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. Also, the effect of multi-core architectures on cluster performance is examined using a Clovertown-based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. The experimental analysis and observations in this study provide a better understanding of emerging homogeneous multi-core systems.

15.
In this paper we consider the scalability of parallel space‐filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space‐filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptive mesh refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptive mesh refined codes, the mesh is constantly changing resulting in load imbalance among the processors requiring a load‐balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space‐filling curves. Space‐filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors serial generation of space‐filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space‐filling curves quickly in parallel by reducing the generation to integer sorting. Copyright © 2007 John Wiley & Sons, Ltd.
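As a hypothetical sketch of reducing curve generation to integer sorting (using a Morton/Z-order key for brevity rather than the Hilbert key discussed in the paper), each point is assigned an interleaved-bit key and the curve order is obtained by sorting the keys; in a parallel setting, the serial sort below would be replaced by a distributed sort:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Interleave the low 16 bits of x and y into a 32-bit Morton (Z-order) key. */
static uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t key = 0;
    for (int b = 0; b < 16; b++) {
        key |= (uint32_t)((x >> b) & 1u) << (2 * b);
        key |= (uint32_t)((y >> b) & 1u) << (2 * b + 1);
    }
    return key;
}

typedef struct { uint16_t x, y; uint32_t key; } pt_t;

static int by_key(const void *a, const void *b)
{
    uint32_t ka = ((const pt_t *)a)->key, kb = ((const pt_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

int main(void)
{
    pt_t pts[] = { {3, 5, 0}, {1, 1, 0}, {6, 2, 0}, {4, 7, 0} };
    int n = 4;

    for (int i = 0; i < n; i++)
        pts[i].key = morton2d(pts[i].x, pts[i].y);
    /* Ordering the points along the curve is now just an integer-key sort. */
    qsort(pts, n, sizeof(pt_t), by_key);

    for (int i = 0; i < n; i++)
        printf("(%u,%u) key=%u\n", (unsigned)pts[i].x, (unsigned)pts[i].y,
               (unsigned)pts[i].key);
    return 0;
}
```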

16.
Parallel computing scalability evaluates the extent to which parallel programs and architectures can effectively utilize increasing numbers of processors. In this paper, we compare a group of existing scalability metrics and evaluation models with an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. To provide insight into dynamic system performance, we have developed an integrated software environment prototype for measuring and evaluating multiprocessor scalability performance, called Scale-Graph. Scale-Graph uses a graphical instrumentation monitor to collect, measure and analyze latency-related data, and to display scalability performance based on various program execution patterns. The graphical software tool is X-Windows based and is currently implemented on standard workstations to analyze performance data of the KSR-1, a hierarchical ring-based shared-memory architecture.

17.
The scalability of the multigrid solvers SMG and BoomerAMG from the high-performance preconditioner library HYPRE is tested and analyzed on several thousand processors of a domestic large-scale parallel computer, yielding a number of conclusions that are instructive for research on linear solver algorithms and for the development of parallel implementation techniques. These conclusions offer guidance for the application and development of linear solvers in the numerical simulation of real, complex physical systems.

18.
Parallel Algorithm Design on Some Distributed Systems   (Cited by 3; self-citations 0; by others 3)
Some testing results on the DAWNING-1000, the Paragon, and a workstation cluster are described in this paper. On the home-made parallel system DAWNING-1000 with 32 computational processors, practical performance of 1.1777 Gflops and 1.58 Gflops has been measured for solving a dense linear system and for matrix multiplication, respectively. The scalability is also investigated. The importance of designing efficient parallel algorithms for evaluating parallel systems is emphasized.

19.
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.
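As a hypothetical sketch of the centralized work-sharing scheme described (task granularity, tags, and the placeholder "work" are assumptions, not the UTS implementation), rank 0 acts as the work server and the other ranks repeatedly request task indices until none remain:

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of centralized work sharing: rank 0 is the work server,
 * the other ranks repeatedly request a task index until none remain. */

#define TOTAL_TASKS 100
#define TAG_REQUEST 1
#define TAG_ASSIGN  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                  /* work server */
        int next = 0, done_workers = 0, dummy, task;
        MPI_Status st;
        while (done_workers < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            task = (next < TOTAL_TASKS) ? next++ : -1;   /* -1 = no work left */
            if (task == -1) done_workers++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                     MPI_COMM_WORLD);
        }
    } else {                          /* worker */
        int dummy = 0, task, count = 0;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;
            count++;                  /* ... process task here ... */
        }
        printf("rank %d processed %d tasks\n", rank, count);
    }
    MPI_Finalize();
    return 0;
}
```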

20.
In this paper, a source-to-source parallelizing compiler system, AutoPar, is presented. The system transforms FORTRAN programs into multi-level hybrid MPI/OpenMP parallel programs. Integrated parallel optimizing technologies are utilized extensively to derive an effective program decomposition over the whole program scope. Other features such as synchronization optimization and communication optimization improve the performance scalability of the generated parallel programs at both the intra-node and inter-node levels. The system goes to great lengths to automate parallelization: profiling feedback is used in the performance estimation that underlies automatic program decomposition. Performance results for eight benchmarks from NAS NPB1.0 on an SMP cluster are given, and the speedups are good. Notably, in the experiments the user inserts at most one data-distribution directive and one reduction directive in BT/SP/LU. The compiler is based on ORC, the Open Research Compiler, a powerful compiler infrastructure with such features as robustness, flexibility and efficiency. ORC's strong analysis capability and well-defined infrastructure made the system implementation quite fast.

