Similar Documents
20 similar documents found
1.
This paper describes the interconnection network and message mechanism of the Sunway exascale prototype system, the third generation of the Sunway family after Sunway BlueLight and Sunway TaihuLight. As an exascale prototype, the machine delivers a peak performance of 3.13 PFlops. One of its most distinctive features is its 28 Gbps transmission technology, for which two new chips were designed and developed: a new-generation Sunway high-radix router and a Sunway high-performance network interface. On the basis of the traditional fat tree, a dual-rail...

2.
《Parallel Computing》2007,33(2):116-123
We present an application of the Ewald algorithm for electrostatic force computation on a supercomputer with a torus network, like those of QCDOC and BlueGene/L. Typical bio-molecular systems have thousands, possibly millions, of interacting atoms, with simulation times ranging from microseconds to milliseconds. The dominant, most time-consuming calculation for bio-molecules is the electrostatic interaction. The importance of an efficient all-gather method is discussed, in particular for QCDOC, since it lacks a network dedicated to global communication like the tree network on BlueGene/L. In addition, we demonstrate the ability of QCDOC to run non-QCD (Quantum Chromodynamics) applications, in particular electrostatic force computation on bio-molecules.
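For reference, the standard Ewald decomposition underlying this computation splits the slowly convergent Coulomb sum into a short-range real-space part, a reciprocal-space part, and a self-energy correction (shown in Gaussian units as a sketch; the paper's exact formulation may differ):

```latex
E = E_{\mathrm{real}} + E_{\mathrm{recip}} + E_{\mathrm{self}},\qquad
E_{\mathrm{real}} = \frac{1}{2}\sum_{i\neq j} q_i q_j\,\frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}},
E_{\mathrm{recip}} = \frac{2\pi}{V}\sum_{\mathbf{k}\neq 0}\frac{e^{-k^{2}/4\alpha^{2}}}{k^{2}}\,\lvert S(\mathbf{k})\rvert^{2},\qquad
S(\mathbf{k}) = \sum_i q_i\,e^{i\mathbf{k}\cdot\mathbf{r}_i},\qquad
E_{\mathrm{self}} = -\frac{\alpha}{\sqrt{\pi}}\sum_i q_i^{2}
```

The reciprocal-space sum needs every particle's coordinates and charges on every node, which is why an efficient all-gather is critical on a machine without a dedicated global-communication network.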

3.
In order to model complex heterogeneous biophysical macrostructures with non-trivial charge distributions, such as globular proteins in water, it is important to evaluate the long-range forces present in these systems accurately and efficiently. The Smooth Particle Mesh Ewald summation technique (SPME) is commonly used to determine the long-range part of the electrostatic energy in large-scale molecular simulations. While the SPME technique does not give rise to a performance bottleneck on a single processor, current implementations of SPME on massively parallel supercomputers become problematic at large processor counts, limiting the time and length scales that can be reached. Here, a synergistic investigation involving method improvement, parallel programming and novel architectures is employed to address this difficulty. A relatively simple modification of the SPME technique is described which gives rise to both improved accuracy and efficiency on both massively parallel and scalar computing platforms. Our fine-grained parallel implementation of the modified SPME method for the novel QCDOC supercomputer with its 6D-torus architecture is then given. Numerical tests of algorithm performance on up to 1024 processors of the QCDOC machine at BNL are presented for two systems of interest: a β-hairpin solvated in explicit water (1142 water molecules and a 20-residue protein, 3579 atoms in total), and the HIV-1 protease solvated in explicit water (9331 water molecules and a 198-residue protein, 29508 atoms in total).

4.
Parallel matrix multiplication is one of the most important basic operations in linear algebra and a cornerstone of many scientific applications. As high performance computing (HPC) moves toward exascale, the communication overhead of parallel matrix multiplication accounts for an ever larger share of the runtime. How to reduce this communication overhead and improve the scalability of parallel matrix multiplication is a current research focus. This paper proposes a new distributed parallel dense matrix multiplication algorithm, namely a 2.5D version of PUMMA (Parallel...
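For context, 2.5D matrix multiplication algorithms in the style of Solomonik and Demmel replicate the input matrices across c process layers, trading memory for communication. Under the standard analysis (a sketch of the general 2.5D cost model, not figures from this paper), the asymptotic per-processor costs on P processors are:

```latex
M = O\!\left(\frac{c\,n^{2}}{P}\right)\ \text{words of memory},\qquad
W = O\!\left(\frac{n^{2}}{\sqrt{cP}}\right)\ \text{words communicated},\qquad
S = O\!\left(\sqrt{\frac{P}{c^{3}}} + \log c\right)\ \text{messages}
```

so raising the replication factor c reduces communication volume by a factor of sqrt(c); c = 1 recovers the classical 2D algorithms such as PUMMA, SUMMA, and Cannon's algorithm.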

5.
Providing high-performance inter-node communication is a key capability for running high performance computing applications efficiently on parallel architectures. In fact, current system deployments aggregate a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms that enable zero-copy and kernel-bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built-in multithreading support, portability, easy-to-learn properties, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA-enabled networks. This paper presents efficient low-level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low-latency and high-bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point-to-point performance increases compared with previous Java communication middleware, yielding up to a 40% improvement in application-level performance on 4096 cores of a Cray XE6 supercomputer.

6.
The growth in size of networked high performance computers, along with novel accelerator-based node architectures, has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub-optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer: a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter-task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic-algorithm-based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. Application benchmarks after task reordering through the genetic algorithm show a significant improvement in performance and reduction in variability, thereby enabling the applications to achieve better time to solution and scalability on Titan during production.
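A minimal sketch of this kind of genetic-algorithm task reordering, assuming a hypothetical task-to-task communication-volume matrix comm and a node-to-node hop-distance matrix dist (the paper's actual encoding, operators, and parallelization are not detailed in the abstract):

```python
import random

def cost(perm, comm, dist):
    """Total volume-weighted hop count for a task-to-node permutation."""
    n = len(perm)
    return sum(comm[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(i + 1, n))

def crossover(p1, p2):
    """Order crossover (OX): keep a slice of p1, fill the rest in p2's order."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    hole = set(p1[a:b])
    rest = [g for g in p2 if g not in hole]
    return rest[:a] + p1[a:b] + rest[a:]

def mutate(perm, rate=0.1):
    """Occasionally swap two task positions."""
    if random.random() < rate:
        i, j = random.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def ga_reorder(comm, dist, pop_size=50, gens=200):
    """Evolve a task-to-node mapping that minimizes communication cost."""
    n = len(comm)
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: cost(p, comm, dist))
        nxt = pop[: pop_size // 4]                    # keep the best quarter
        while len(nxt) < pop_size:
            p1, p2 = random.sample(pop[: pop_size // 2], 2)
            nxt.append(mutate(crossover(p1, p2)))
        pop = nxt
    return min(pop, key=lambda p: cost(p, comm, dist))
```

The fitness here is the volume-weighted hop count; on Titan, dist would come from the torus coordinates of the nodes the scheduler actually allocated.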

7.
Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Until now, the tool considered network bandwidth to be constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth on modern large supercomputers decreases depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation.

8.
Solving large-scale sparse linear systems over GF(2) plays a key role in fluid mechanics, simulation and design of materials, petroleum seismic data processing, numerical weather prediction, computational electromagnetics, and the numerical simulation of nuclear explosions. Developing algorithms for this problem is therefore a significant research topic. In this paper, we propose a hyper-scale custom supercomputer architecture that matches specific data features to process the key procedure of the block Wiedemann algorithm, together with its parallel algorithm on the custom machine. To increase computation, communication, and storage performance, four optimization strategies are proposed. This paper builds a performance model to evaluate the execution performance and power consumption of our custom machine. The model shows that the optimization strategies result in a considerable speedup, even three times faster than the fastest supercomputer, TH2, while consuming less power.
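The dominant kernel in the block Wiedemann algorithm is a long sequence of sparse matrix-vector products over GF(2). A minimal bit-packed sketch (plain Python, not the paper's custom-hardware design) shows why the operation reduces to AND, XOR, and population counts, which custom hardware can accelerate:

```python
# Rows of a sparse GF(2) matrix packed into Python integers:
# bit i of rows[r] set means A[r][i] = 1; x is a packed bit vector.
def gf2_matvec(rows, x):
    y = 0
    for r, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1  # dot product modulo 2
        y |= parity << r
    return y

# Block Wiedemann repeatedly applies A to a block of vectors and records
# projections; the matvec above is the step that dominates the runtime.
A = [0b1010, 0b0110, 0b1001, 0b0011]
print(bin(gf2_matvec(A, 0b1100)))  # -> 0b111
```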

9.
Fault Tolerance Analysis of E-2D Mesh Networks Based on a Probabilistic Model
This paper studies E-2D Mesh, a new topology for the core switching network of terabit routers, and proposes a new method for computing the connectivity probability of an E-2D Mesh network. It is proved that when the node failure rate is kept below 0.66%, an E-2D Mesh network with more than forty thousand nodes maintains a connectivity probability of no less than 99%, and that at equal scale the node fault tolerance of an E-2D Mesh network is at least 11.09 times that of a plain Mesh network. The results show that the method is effective for computing the connectivity probability of E-2D Mesh networks and can also be applied to other levels of networks and other network communication problems.
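The paper derives its connectivity bounds analytically from a probabilistic model; as an independent illustration of the quantity being computed, the connectivity probability at a given node failure rate can also be estimated by Monte Carlo simulation (a generic sketch; neighbors is any adjacency map, e.g. of an E-2D Mesh):

```python
import random
from collections import deque

def connected(alive, neighbors):
    """BFS over surviving nodes; True if they form a single component."""
    if not alive:
        return False
    start = next(iter(alive))
    seen = {start}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in neighbors[u]:
            if v in alive and v not in seen:
                seen.add(v)
                q.append(v)
    return len(seen) == len(alive)

def connectivity_rate(nodes, neighbors, fail_p, trials=1000):
    """Fraction of trials in which the surviving nodes stay connected."""
    ok = 0
    for _ in range(trials):
        alive = {n for n in nodes if random.random() > fail_p}
        ok += connected(alive, neighbors)
    return ok / trials
```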

10.
Exascale computing is one of the major challenges of this decade, and several studies have shown that communication is becoming one of the bottlenecks for scaling parallel applications. Analyzing the characteristics of communication can effectively help improve the performance of scientific applications. In this paper, we focus on the statistical regularity of time-dimension communication characteristics for representative scientific applications on supercomputer systems, and then prove that the distribution of communication-event intervals has a power-law decay, a pattern common in scientific interests and human activities. We verify that the distribution of communication-event intervals indeed exhibits a power-law decay on the Tianhe-2 supercomputer, as well as on six other parallel systems with three different network topologies and two routing policies. In order to study the power-law distribution quantitatively, we exploit two groups of statistics: burstiness vs. memory and periodicity vs. dispersion. Our results indicate that the communication events show a “strong-bursty and weak-memory” characteristic, and that the communication-event intervals show periodicity and dispersion. Finally, our research provides an insight into the relationship between communication optimizations and time-dimension communication characteristics.
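The "burstiness vs. memory" pair of statistics is standard for interval sequences; assuming the usual Goh-Barabási definitions (the paper may normalize differently), they can be computed directly from the communication-event intervals:

```python
import statistics

def burstiness(intervals):
    """Goh-Barabasi burstiness B = (sigma - mu) / (sigma + mu)."""
    mu = statistics.mean(intervals)
    sigma = statistics.pstdev(intervals)
    return (sigma - mu) / (sigma + mu)

def memory(intervals):
    """Memory coefficient M: correlation between consecutive intervals."""
    a, b = intervals[:-1], intervals[1:]
    ma, mb = statistics.mean(a), statistics.mean(b)
    sa, sb = statistics.pstdev(a), statistics.pstdev(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (sa * sb)
```

B lies in (-1, 1), is 0 for a Poisson process, and approaches 1 for strongly bursty traffic; M is the correlation between consecutive intervals, so "strong-bursty and weak-memory" means B well above 0 with M near 0.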

11.
A phase-difference-based algorithm for disparity and optical flow estimation is implemented on a TI-C40-based parallel DSP system. The module performs real-time computation of disparity maps on images of size 128 × 128 pixels and computation of optical flows on images of size 64 × 64 pixels. This paper describes the algorithm and its parallel implementation. Processing times required for the computation of disparity maps and velocity fields and measures of the algorithm's performance are reported in detail.
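In phase-difference-based methods, disparity is typically recovered from the local phases of quadrature (e.g. Gabor) filter responses in the left and right views, divided by the local spatial frequency (the standard formulation; this implementation's details may vary):

```latex
d(x)\;\approx\;\frac{\phi_L(x)-\phi_R(x)}{\bar{\omega}(x)}
```

where phi_L and phi_R are the filter-response phases and bar-omega(x) is the local frequency estimate; optical flow follows analogously from phase differences between consecutive frames divided by the frame interval.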

12.
To efficiently execute a finite element program on a 2D torus, we need to map the nodes of the corresponding finite element graph to processors of a 2D torus such that each processor has approximately the same computational load and the communication among processors is minimized. If the nodes of a finite element graph do not increase during the execution of a program, the mapping only needs to be performed once. However, if a finite element graph is solution-adaptive, that is, its nodes increase discretely due to the refinement of some finite elements during the execution of a program, a dynamic load-balancing algorithm has to be performed many times in order to balance the computational load of the processors while keeping the communication cost as low as possible. In this paper we propose a parallel dynamic load-balancing algorithm (LB) to deal with the load-imbalance problem of a solution-adaptive finite element program on a 2D torus. The algorithm uses an iterative approach to achieve load balancing. We have implemented the proposed algorithm along with two parallel mapping algorithms, parallel orthogonal recursive bisection (ORB) and parallel recursive mincut bipartitioning (MC), on a simulated 2D torus. Three criteria are used for performance evaluation: the execution time of the load-balancing algorithms, the computation time of an application program under the different load-balancing algorithms, and the total execution time of an application program (over several refinement phases). Simulation results show that (1) the execution of LB is faster than those of MC and ORB; (2) the mappings produced by LB are better than those of ORB and MC; and (3) the speedups of LB are better than those of ORB and MC.
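The abstract specifies only that LB balances iteratively; as a hedged sketch of what one neighbor-local iteration on a 2D torus can look like, here is a generic diffusion-style step (not necessarily the paper's exact scheme):

```python
# One sweep of a diffusion-style balance step on an R x C torus: each
# processor moves a fraction alpha of its load difference toward each of
# its four neighbors. load is a flat list indexed by i = r * C + c.
def diffusion_step(load, R, C, alpha=0.2):
    new = load.copy()
    for r in range(R):
        for c in range(C):
            i = r * C + c
            for j in (((r - 1) % R) * C + c, ((r + 1) % R) * C + c,
                      r * C + (c - 1) % C, r * C + (c + 1) % C):
                new[i] += alpha * (load[j] - load[i])
    return new
```

Each sweep conserves the total load and converges toward a uniform distribution for suitable alpha; an actual finite element LB must also decide which elements to migrate, not just how much load.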

13.
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, motivating us to port it onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units and Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+SIMD (single instruction, multiple data) programming model. With cache blocking and a SIMD-friendly data structure transformation, we dramatically improve SIMD and cache efficiency for single-thread performance on both the CPU and the Phi, achieving speedups of 7.9X and 8.8X, respectively, compared with the baseline code. To make the CPUs and Phi processors collaborate efficiently, we propose a load-balance scheme to distribute workloads among the two CPUs and three Phi processors within a node, and use an asynchronous model to overlap the collaborative computation and communication as much as possible. The collaborative approach with two CPUs and three Phi processors improves performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest-scale CPU-MIC collaborative LBM simulation for 3D multi-phase flow problems.
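The intra-node load balance amounts to splitting lattice slabs across the two CPUs and three Phis in proportion to their measured throughput. A small largest-remainder sketch with hypothetical relative rates (the paper's measured CPU/Phi throughputs are not given in the abstract):

```python
# Split `total` lattice slabs across devices in proportion to their
# measured processing rates, using the largest-remainder method.
def split_slabs(total, rates):
    shares = [total * r / sum(rates) for r in rates]
    counts = [int(s) for s in shares]
    leftover = total - sum(counts)
    by_frac = sorted(range(len(rates)),
                     key=lambda k: shares[k] - counts[k], reverse=True)
    for i in by_frac[:leftover]:
        counts[i] += 1
    return counts

# Two CPUs and three Phis; the relative rates are assumed for illustration.
print(split_slabs(96, [1.0, 1.0, 2.6, 2.6, 2.6]))  # -> [10, 10, 26, 25, 25]
```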

14.
To address the large time overhead of network booting in supercomputer systems, this paper argues that the network boot distribution algorithm is one of the main factors affecting network boot performance and thus a primary direction for optimization. First, the main factors affecting large-scale network boot performance are analyzed. Second, in the context of a typical supercomputer system, the network boot data-flow topologies of the supernode cyclic distribution algorithm (SCDA) and the board cyclic distribution algorithm (BCDA) are analyzed. Finally, the pressure each algorithm places on each network path segment and the attainable network performance are analyzed quantitatively, showing that BCDA outperforms SCDA by a factor of 1 to 20. Theoretical analysis and model derivation show that a finer-grained mapping between compute nodes and boot servers allows as many boot servers as possible to be used when booting a subset of resources, reducing premature contention for local network resources and improving network boot performance.

15.
We mainly consider the quantum multi-unicast problem over a directed acyclic network, where each source wishes to transmit an independent message to its target via a bottleneck channel. Taking advantage of globally entangled states, namely 2D and 3D cluster states, these problems can be solved efficiently. First, a universal scheme for the generation of resource states among distant communication nodes is provided. The correspondence between clusters and bigraphs leads to a constant temporal resource cost. Furthermore, a new approach based on the stabilizer formalism to analyze the solvability of several underlying quantum multi-unicast networks is presented. It is found that the solvability closely depends on the choice of stabilizer generators for a given cluster state. Then, with a designed measurement basis and parallel measurements on intermediate nodes, we propose optimal protocols for these multi-unicast sessions. The analysis also reveals that the resource consumption, involving spatial, operational and temporal resources, mostly reaches the lower bounds.

16.
GPGPU has drawn much attention for accelerating non-graphic applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of the Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load to multiple GPUs, and GPU-to-GPU data transfer becomes a severe overhead for the total performance. The parallel results of 1D, 2D and 3D domain partitionings were compared and analyzed. As a result, with a 384 × 384 × 384 mesh system and 96 GPUs, the performance of 3D partitioning is about 3-4 times higher than that of 1D partitioning. The performance curve deviates from the idealistic line due to the long communication time between GPUs. In order to hide the communication time, we introduced an overlapping technique between computation and communication, in which the data transfer and the computation were done in two streams simultaneously. Using 8-96 GPUs, performance increases by a factor of about 1.1-1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of the flow around a sphere at Re = 13,000 was carried out successfully using a 2000 × 1000 × 1000 mesh system and 100 GPUs. For such a computation with 2 Giga lattice nodes, 6.0 h were needed to process 100,000 time steps. Under this condition, the computational time (2.79 h) and the data communication time (3.06 h) are almost the same.
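The gap between 1D and 3D partitioning follows directly from the halo surface each GPU must exchange per step. A back-of-the-envelope sketch for the 384^3 mesh on 96 GPUs (a 4 x 4 x 6 process grid is one possible factorization; corner and edge effects are ignored):

```python
# Halo cells exchanged per GPU per step for an N^3 periodic mesh.
def halo_1d(N):
    return 2 * N * N                      # two full N x N faces

def halo_3d(N, gx, gy, gz):
    nx, ny, nz = N // gx, N // gy, N // gz
    return 2 * (ny * nz + nx * nz + nx * ny)  # six smaller faces

N = 384
print(halo_1d(N))           # 294912 halo cells per GPU (1D slabs)
print(halo_3d(N, 4, 4, 6))  # 43008 halo cells per GPU (4x4x6 = 96 GPUs)
```

The roughly 7x smaller halo for the 3D split is consistent with the observed 3-4x overall speedup once the computation time, which is unchanged, is factored in.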

17.
THNPSC-1: A Networked Parallel Supercomputing System
Network parallel computing (also known as cluster computing) is an important approach to high performance computing. This paper introduces THNPSC-1, a network parallel supercomputing system developed at Tsinghua University. It is composed of Pentium III SMP compute nodes interconnected by two high-speed networks: THNet, a custom crossbar network with dynamic arbitration and routing, and 100 Mbps Ethernet. The crossbar switch THSwitch in THNet is built from 150K-gate ALTERA FPGA chips, and THNet also includes THNIA, a network interface adapter with a DMA engine. Each THNet port provides a data transfer rate of 1.056 Gbps, for an aggregate bandwidth of up to 8.4 Gbps. Using fixed user buffers and an extended active message passing scheme, THNet performs user-level message passing that bypasses operating system calls and achieves zero-copy message transfer. Ping-pong tests show that the one-way message latency is reduced to 8 μs. The THNet software includes the THNIA driver and a library supporting user-level communication. The paper briefly compares related work and describes applications of the system.

18.
The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance.

In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incarnation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided. Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF.

In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system, a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.


19.
《Journal of Systems Architecture》2008,54(10):919-928
A torus network has become increasingly important to multicomputer design because of its many features, including scalability, low bandwidth and fixed degree of nodes. Multicast communication is a significant operation in multicomputer systems and can be used to support several other collective communication operations. This paper presents an efficient algorithm, TTPM, to find deadlock-free multicast wormhole routes in two-dimensional torus parallel machines. The introduced algorithm is designed such that messages can be sent to any number of destinations within two start-up communication phases; hence the name Torus Two Phase Multicast (TTPM) algorithm. An efficient routing function is developed and used as a basis for the introduced algorithm. TTPM also allows some intermediate nodes that are not in the destination set to perform multicast functions. This feature allows flexibility in multicast path selection and therefore improves performance. Performance results of a simulation study on torus networks are discussed to compare the TTPM algorithm with a previous algorithm.
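A generic two-phase multicast on an R x C torus reaches one "leader" per destination column in the first start-up phase and completes delivery within columns in the second (an illustrative scheme only; TTPM's routing function and choice of intermediate nodes differ):

```python
# Nodes are numbered 0 .. R*C-1 on an R x C torus; column = id % C.
def two_phase_multicast(src, dests, R, C):
    cols = {}
    for d in dests:
        cols.setdefault(d % C, []).append(d)
    # Phase 1: source sends one message per destination column to a leader.
    phase1 = [(src, min(ds)) for ds in cols.values()]
    # Phase 2: each leader delivers to the remaining nodes in its column.
    phase2 = [(min(ds), d) for ds in cols.values()
              for d in ds if d != min(ds)]
    return phase1, phase2

p1, p2 = two_phase_multicast(0, [5, 9, 13, 6, 10], R=4, C=4)
print(p1)  # phase-1 (source -> leader) sends
print(p2)  # phase-2 (leader -> member) sends
```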

20.
The Earth Simulator (ES), developed under the Japanese government’s initiative “Earth Simulator project”, is a highly parallel vector supercomputer system. In this paper, an overview of the ES, its architectural features, its hardware technology and the results of its performance evaluation are described.

In May 2002, the ES was acknowledged to be the most powerful computer in the world: 35.86 teraflop/s for the LINPACK HPC benchmark and 26.58 teraflop/s for an atmospheric general circulation code (AFES). Such remarkable performance may be attributed to the following three architectural features: vector processors, shared memory and a high-bandwidth non-blocking crossbar interconnection network.

The ES consists of 640 processor nodes (PN) and an interconnection network (IN), which are housed in 320 PN cabinets and 65 IN cabinets. The ES is installed in a specially designed building, 65 m long, 50 m wide and 17 m high. In order to accomplish this advanced system, many kinds of hardware technologies have been developed, such as high-density and high-frequency LSI, high-frequency signal transmission, high-density packaging, and a high-efficiency, low-noise cooling and power supply system, so as to reduce the overall volume of the ES and its total power consumption.

For highly parallel processing, a special synchronization mechanism connecting all nodes, the Global Barrier Counter (GBC), has been introduced.

