
1.

Hybrid hierarchy storage system in MilkyWay-2 supercomputer

Weixia XU Yutong LU Qiong LI Enqiang ZHOU Zhenlong SONG Yong DONG Wei ZHANG Dengping WEI Xiaoming ZHANG Haitao CHEN Jianying XING Yuan YUAN 《Frontiers of Computer Science》2014,8(3):367-377

With the rapid improvement of computation capability in high performance supercomputer systems, the performance imbalance between the computation subsystem and the storage subsystem has become increasingly serious, especially when big data ranging from tens of gigabytes up to terabytes are produced. To reduce this gap, large-scale storage systems need to be designed and implemented with high performance and scalability. The MilkyWay-2 (TH-2) supercomputer system, with a peak performance of 54.9 Pflops, has exactly this kind of storage requirement. This paper introduces the storage system in the MilkyWay-2 supercomputer, including the hardware architecture and the parallel file system. The storage system exploits a novel hybrid hierarchy storage architecture to enable high scalability of I/O clients, I/O bandwidth and storage capacity. To fit this architecture, a user-level virtualized file system named H²FS is designed and implemented; it combines local storage and shared storage into a dynamic single namespace to optimize I/O performance in I/O-intensive applications. The evaluation results show that the storage system in the MilkyWay-2 supercomputer can satisfy the critical requirements of a large-scale supercomputer, such as performance and scalability.

2.

SUPRENUM: A trendsetter in modern supercomputer development

Wolfgang K. Giloi 《Parallel Computing》1988,7(3):283-296

The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance.

In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incorporation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided. Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF.

In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system, a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.

3.

A fine grained parallel smooth particle mesh Ewald algorithm for biophysical simulation studies: Application to the 6-D torus QCDOC supercomputer

Bin Fang Yuefan Deng 《Computer Physics Communications》2007,177(4):362-377

In order to model complex heterogeneous biophysical macrostructures with non-trivial charge distributions, such as globular proteins in water, it is important to evaluate the long range forces present in these systems accurately and efficiently. The Smooth Particle Mesh Ewald summation technique (SPME) is commonly used to determine the long range part of the electrostatic energy in large scale molecular simulations. While the SPME technique does not give rise to a performance bottleneck on a single processor, current implementations of SPME on massively parallel supercomputers become problematic at large processor numbers, limiting the time and length scales that can be reached. Here, a synergistic investigation involving method improvement, parallel programming and novel architectures is employed to address this difficulty. A relatively simple modification of the SPME technique is described which gives rise to improved accuracy and efficiency on both massively parallel and scalar computing platforms. Our fine grained parallel implementation of the modified SPME method for the novel QCDOC supercomputer with its 6D-torus architecture is then given. Numerical tests of algorithm performance on up to 1024 processors of the QCDOC machine at BNL are presented for two systems of interest: a β-hairpin solvated in explicit water, a system which consists of 1142 water molecules and a 20 residue protein for a total of 3579 atoms, and the HIV-1 protease solvated in explicit water, a system which consists of 9331 water molecules and a 198 residue protein for a total of 29508 atoms.

4.

Special report: Supercomputing-the view from Japan

《Micro, IEEE》1993,13(1):67-70

Japan's Ministry of International Trade and Industry's (MITI's) Superspeed project, which investigated high-speed devices and computer architecture, algorithms, and languages for parallel computing, is reviewed. The supercomputing industries in Japan and the United States are compared. The architecture and performance of current supercomputers and the current states of supercomputer technology and supercomputer software are discussed.

5.

The Fujitsu VPP300/500 vector parallel supercomputer system

蒋江 张民选 《计算机工程与科学》1997,19(1):27-31

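This paper discusses the system architecture, hardware characteristics, and software configuration of the VPP300/500 vector parallel MPP supercomputer released by Fujitsu. It also introduces some of the key technologies adopted in the VPP300/500 system and how they are implemented, and forecasts future development trends of MPP.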

6.

The TH Express high performance interconnect networks

Zhengbin PANG Min XIE Jun ZHANG Yi ZHENG Guibin WANG Dezun DONG Guang SUO 《Frontiers of Computer Science》2014,8(3):357-366

The interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in the MilkyWay-2 system to provide high-bandwidth and low-latency interprocessor communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state of the art of our proprietary interconnect, with particular emphasis on the design of the network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offloaded collective operations, and hardware reliable end-to-end communication. The design of a low-level message passing infrastructure and upper-level message passing services is also proposed. The preliminary performance results demonstrate the efficiency of the TH interconnect interface.

7.

Kernel Polynomial Method on GPU

Shixun Zhang Shinichi Yamagiwa Masahiko Okumura Seiji Yunoki 《International journal of parallel programming》2013,41(1):59-88

The simulation of lattice model systems for quantum materials is one of the most important approaches to understanding the quantum properties of matter in condensed matter physics. The main task in the simulation is to diagonalize a Hamiltonian matrix for the system and evaluate the electronic density of energy states. The kernel polynomial method (KPM) is one of the promising simulation methods. Because KPM contains a fine-grain recursive part in the algorithm, it is hard to parallelize with thread level parallelism such as on a supercomputer or a cluster computer. This paper focuses on methods to parallelize KPM in the massively parallel environment of a GPU, aiming to achieve high parallelism and greater speedups than recent CPUs. This paper proposes two implementation methods, called the full map and the sliding window methods, and evaluates their performance on a recent GPU platform. To enlarge the available simulation sizes and at the same time enhance performance, this paper also describes additional optimization techniques that depend on the GPU architecture.

8.

Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer

Bin Fang Glenn Martyna 《Computer Physics Communications》2007,176(8):531-538

QCDOC is a massively parallel supercomputer with tens of thousands of nodes distributed on a six-dimensional torus network. The 6D structure of the network provides the needed communication resources for many communication-intensive applications. In this paper, we present a parallel algorithm for the three-dimensional Fast Fourier Transform and its implementation for a 4096-node QCDOC prototype. Two techniques have been used to increase its parallel performance: simultaneous multi-dimensional communication and communication-and-computation overlapping. Benchmarking experiments suggest that 3D FFTs of size 128×128×128 can scale well on such platforms up to 4096 nodes. Our performance results suggest stronger scalability on QCDOC than on the IBM BlueGene/L supercomputer.

9.

Research and implementation of parallelism in the software platform of a general-purpose radar parallel signal processing system

周鸣昕 汤俊 彭应宁 王秀坛 《计算机应用研究》2002,19(11):47-50

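The software platform discussed here is implemented on a prototype of a high-speed, real-time, general-purpose radar signal processing system that uses a supercomputer architecture. Taking full account of the parallelism at various granularities inherent in modern radar signal processing, the system defines a dedicated high-level parallel language. Its main features include: defining the matrix as the system's basic data structure; defining parallel control structures to achieve a degree of medium-grained parallelism; and a data-flow-driven mechanism to achieve coarse-grained parallel processing between processes. Through cooperation between the compiler and the operating system, the software platform resolves these particular parallelism problems and ensures that users can program simply and quickly.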

10.

Understanding the role of licenses and evolution in open architecture software ecosystems

Walt Scacchi Thomas A. Alspaugh 《Journal of Systems and Software》2012,85(7):1479-1494

The role of software ecosystems in the development and evolution of open architecture systems whose components are subject to different licenses has received insufficient consideration. Such systems are composed of components potentially under two or more licenses, open source or proprietary or both, in an architecture in which evolution can occur by evolving existing components, replacing them, or refactoring. The software licenses of the components both facilitate and constrain the system's ecosystem and its evolution, and the licenses’ rights and obligations are crucial in producing an acceptable system. Consequently, software component licenses and the architectural composition of a system help to better define the software ecosystem niche in which a given system lies. Understanding and describing software ecosystem niches for open architecture systems is a key contribution of this work. An example open architecture software system that articulates different niches is employed to this end. We examine how the architecture and software component licenses of a composed system at design time, build time, and run time help determine the system's software ecosystem niche and provide insight and guidance for identifying and selecting potential evolutionary paths of system, architecture, and niches.

11.

Using high performance Fortran for parallel programming

G. Sarma T. Zacharia D. Miles 《Computers & Mathematics with Applications》1998,35(12):41-57

A finite element code with a polycrystal plasticity model for simulating deformation processing of metals has been developed for parallel computers using High Performance Fortran (HPF). The conversion of the code from an original implementation on the Connection Machine systems using CM Fortran is described. The sections of the code requiring minimal inter-processor communication are easily parallelized, by changing only the syntax for specifying data layout. However, the solver routine based on the conjugate gradient method required additional modifications, which are discussed in detail. The performance of the code on a massively parallel distributed-memory Intel PARAGON supercomputer is evaluated through timing statistics. Published by Elsevier Science Ltd.

12.

The concurrent element level processing for nonlinear dynamic analysis on a massively parallel computer

《Computing Systems in Engineering》1995,6(3):285-293

The goal of this paper is to explore parallel methodologies with the desired flexibility, generality and accuracy for nonlinear dynamic finite element analysis on massively parallel computers. This paper tests the generality of the concurrent element processing approach and proposes a basic software design strategy to take full advantage of the features available in massively parallel computers having a hierarchical ring architecture. As a testbed, a large scale general purpose code, DYNA3D, was used and modified as appropriate to test the proposed parallel design concepts on a KSR1 parallel computer.

13.

Mapping massive SIMD parallelism onto vector architectures for simulation

Jonathan B. Rosenberg Jonathan D. Becker 《Software》1989,19(8):739-756

A software behavioural simulator for a new massively parallel single-instruction/multiple-data (SIMD) architecture has been developed that can accurately simulate the entire 16,384 bit-serial processor array. The key to this high performance modelling is the exploitation of an inherent mapping that exists between massively parallel SIMD architectures and the vector architectures used in many high performance scientific super-computers. The new SIMD architecture, called BLITZEN, is based on the Massively Parallel Processor (MPP) built for NASA by Goodyear in the late 1970s. By simulating the full-scale machine with very high performance, the simulator allows development of algorithms and high-level software to proceed before realization of the hardware. This paper describes the SIMD-vector architecture mapping, the highly vectorized simulator in which it is used, and how the result was a simulator that achieved a level of performance three orders of magnitude faster than the conventional uniprocessor approach.

14.

Scalable communication architectures for massively parallel hardware multi-processors

Yahya Jan Lech Jóźwiak 《Journal of Parallel and Distributed Computing》2012

Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly demanding applications. We demonstrate that in massively parallel hardware multi-processors the influence of the communication network on both the throughput and the circuit area dominates the influence of the processors, while the traditionally used flat communication architectures do not scale well as parallelism increases. Therefore, we propose to design highly optimized application-specific partitioned hierarchical organizations of the communication architectures by exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven model-based multi-processor design methodology and related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments. Some of the results of the experiments are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and related framework are able to efficiently synthesize well-scalable communication architectures even for the high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands.

15.

Scalable mpNoC for massively parallel systems – Design and implementation on FPGA

M. Baklouti Y. Aydi Ph. Marquet J.L. Dekeyser M. Abid 《Journal of Systems Architecture》2010,56(7):278-292

High chip-level integration enables the implementation of large-scale parallel processing architectures with 64 or more processing nodes on a single chip or on an FPGA device. These parallel systems require a cost-effective yet high-performance interconnection scheme to provide the needed communications between processors. The massively parallel Network on Chip (mpNoC) was proposed to address the demand for parallel irregular communications for the massively parallel processing System on Chip (mppSoC). Targeting FPGA-based design, an efficient low-level RTL implementation of mpNoC is proposed that takes design constraints into account. The proposed network is designed as an FPGA-based Intellectual Property (IP) block that can be configured in different communication modes. It can communicate between processors and also perform parallel I/O data transfer, which is clearly a key issue in an SIMD system. The mpNoC RTL implementation performs well in terms of area, throughput and power consumption, which are important metrics for an on-chip implementation. mpNoC is a flexible architecture that is suitable for use in FPGA-based parallel systems. This paper introduces the basic mppSoC architecture. It mainly focuses on the flexible IP-based design of mpNoC and its implementation on FPGA. The integration of mpNoC in mppSoC is also described. Implementation results on a Stratix II FPGA device are given for three data-parallel applications run on mppSoC. The good performance obtained justifies the effectiveness of the proposed parallel network. It is shown that mpNoC is a lightweight parallel network, making it suitable for both small and large FPGA-based parallel systems.

16.

Design and implementation of the Parallel C language for domestic heterogeneous many-core systems

何王全 刘勇 方燕飞 魏迪 漆锋滨 《软件学报》2017,28(4):764-785

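Heterogeneous many-core architectures offer an extremely high performance-to-power ratio and have become an important direction in supercomputer architecture. However, the more complex parallelism and memory hierarchies of many-core systems pose great challenges for programming and optimization, so research on parallel programming techniques for many-core systems is significant both for reducing the difficulty of programming parallel applications on domestic many-core systems and for improving the performance of parallel programs. This paper proposes a multi-mode parallel programming model on a unified architecture, comprising a heterogeneous-fusion accelerated computing model and an autonomous computing model programmed in a homogeneous manner. Based on this programming model, the Parallel C language is designed; it can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X usage pattern on other many-core systems, it gives both programming and system optimization a global view, and it is distinctive in multi-level locality description, one-sided messaging, and compatibility with existing multi-core applications. A Parallel C compilation system is built on Open64 that fully supports the accelerated computing model and the autonomous computing model, and optimizations such as data layout with automatic DMA, compiler-directed thread proxies, and topology-aware collective communication are proposed and implemented. Test data from micro-benchmarks and real applications on the Sunway TaihuLight computer system show that the Parallel C language and compilation system have good performance and scalability and can effectively support large-scale applications.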

17.

The IC* model of parallel computation and programming environment

Cameron E.J. Cohen D.M. Gopinath B. Keese W.M. II Ness L. Uppaluru P. Vollaro J.R. 《IEEE transactions on pattern analysis and machine intelligence》1988,14(3):317-326

The IC* project is an effort to create an environment for the design, specification, and development of complex systems such as communication protocols, parallel machines, and distributed systems. The basis of the project is the IC* model of parallel computation, in which a system is specified by a set of invariant expressions which describe its behavior in time. The features of this model include temporal and structural constraints, inherent parallelism, explicit modeling of time, nondeterministic evolution, and dynamic activation. The project also includes the construction of a parallel computer specifically designed to support the model of computation. The authors discuss the IC* model and the current user language, and describe the architecture and hardware of the prototype supercomputer built to execute IC* programs.

18.

Large-scale parallel application testing on Tianhe-1

朱小谦 孟祥飞 营晓东 冯景华 《计算机科学》2012,39(3):265-267

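Tianhe-1, currently installed at the National Supercomputing Center in Tianjin, is China's first petaflops supercomputer and ranked first in the world in the November 2010 Top500 list of the world's supercomputers. Tianhe-1 adopts a heterogeneous computing architecture that combines CPUs and GPUs, together with an independently designed and implemented high-speed interconnect communication system. Across multiple high performance computing application areas it offers strong application adaptability, stable and reliable operation, and good performance scalability, providing an important high performance computing platform for scientific research and applications. Large-scale parallel program tests were carried out on Tianhe-1 using typical cases from applications including petroleum seismic data processing, aircraft flow field simulation, biomolecular dynamics simulation, magnetic confinement fusion numerical simulation, turbulence numerical simulation, crystalline silicon molecular dynamics simulation, fully implicit numerical simulation of global atmospheric shallow water waves, and numerical simulation of thermal flow in the Earth's outer core. The results show that Tianhe-1 has good scalability and parallel efficiency in these application areas.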

19.

Performance modeling of hyper-scale custom machine for the principal steps in block Wiedemann algorithm

Tong Zhou Jingfei Jiang 《The Journal of supercomputing》2016,72(11):4181-4203

Solving large-scale sparse linear systems over GF(2) plays a key role in fluid mechanics, simulation and design of materials, petroleum seismic data processing, numerical weather prediction, computational electromagnetics, and numerical simulation of nuclear explosions. Therefore, developing algorithms for this problem is a significant research topic. In this paper, we propose a hyper-scale custom supercomputer architecture that matches specific data features to process the key procedure of the block Wiedemann algorithm, together with its parallel algorithm on the custom machine. To increase the computation, communication, and storage performance, four optimization strategies are proposed. This paper builds a performance model to evaluate the execution performance and power consumption of our custom machine. The model shows that the optimization strategies result in a considerable speedup, even three times faster than the fastest supercomputer, TH2, while consuming less power.

20.

How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms

Felipe A. Cruz Simon K. Layton L.A. Barba 《Computer Physics Communications》2011,(10):2084-2098

Computing on graphics processors is maybe one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. Like then, the opportunity comes with challenges. The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied algorithmic redesign for attaining performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU. The end result has been GPU kernels that run at over 500 Gop/s on one NVIDIA Tesla C1060 card, thereby reaching close to practical peak.