1.

Takashi Yanagawa Kenji Suehiro 《Parallel Computing》2004,30(12):1315-1327

The Earth Simulator (ES) is a large scale, distributed memory, parallel computer system consisting of 640 processor nodes (PN) with shared memory vector multi-processors (64GFLOPS/PN, 5120 APs in total, AP: arithmetic processor). All the nodes are connected via a high speed (16GB/s) single-stage crossbar network called the Interconnection Network (IN).

The operating system for the Earth Simulator is based on SUPER-UX, the UNIX operating system for the SX series scientific supercomputers. In order to realize high-performance parallel processing on the highly parallel machine, the operating system is enhanced for scalability.

The Earth Simulator system is managed as a two-level cluster system called the Super Cluster System. In the Super Cluster System, the Earth Simulator system is divided into 40 clusters (16 PNs/cluster). A single controller called Super Cluster Control Station (SCCS) manages all these clusters. This management system provides Single System Image (SSI) operation, management and job control for the large scale multi-node system.
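The configuration figures quoted above are mutually consistent, which a few lines of Python can check. This is only an arithmetic sketch: the value of 8 APs per node is derived here from 5120 / 640 and is not stated in the abstract, and the variable names are illustrative.

```python
# Configuration figures quoted in the abstract.
NODES = 640               # processor nodes (PN)
TOTAL_APS = 5120          # arithmetic processors (AP) in total
GFLOPS_PER_NODE = 64      # peak per PN
CLUSTERS = 40             # clusters in the Super Cluster System

aps_per_node = TOTAL_APS // NODES               # derived, not stated in the abstract
nodes_per_cluster = NODES // CLUSTERS           # should match "16 PNs/cluster"
peak_tflops = NODES * GFLOPS_PER_NODE / 1000    # whole-system peak in TFLOPS

print(aps_per_node, nodes_per_cluster, peak_tflops)
```

The derived system peak matches the 40.96 Teraflops figure quoted in entry 4 of this listing.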

The Job Scheduler (JS) and NQS running on the SCCS control all jobs of the system. They schedule the resources such as processing nodes and files which have not usually been treated as scheduling resources. This allows efficient scheduling of large scale jobs.

The MPI library (MPI/ES) and the HPF compiler (HPF/ES) are available for distributed parallel programming on the Earth Simulator. MPI/ES conforms to the MPI 2.0 standard and is optimized to exploit the hardware features. HPF/ES conforms to the core part of HPF 2.0 and supports some features of the HPF 2.0 approved extensions and HPF/JA 1.0 extensions. HPF/ES suitably handles the 3-level parallelism of the Earth Simulator system, that is, vectorization, shared-memory parallelization, and distributed-memory parallelization. Moreover, HPF/ES extends the language to easily handle irregular problems.

2.

Scalability of hybrid programming for a CFD code on the Earth Simulator

K. Itakura A. Uno M. Yokokawa T. Ishihara Y. Kaneda 《Parallel Computing》2004,30(12):1329-1343

The Earth Simulator (ES) is an SMP cluster system. There are two types of parallel programming models available on the ES. One is a flat programming model, in which a parallel program is implemented by MPI interfaces only, both within an SMP node and among nodes. The other is a hybrid programming model, in which a parallel program is written by using thread programming within an SMP node and MPI programming among nodes simultaneously. It is generally known that it is difficult to obtain the same high level of performance using the hybrid programming model as can be achieved with the flat programming model.

In this paper, we evaluate the scalability of a code for direct numerical simulation (DNS) of the Navier–Stokes equations on the ES. The hybrid programming model achieves a sustained performance of 346.9 Gflop/s, while the flat programming model achieves 296.4 Gflop/s, with 16 PNs of the ES for a DNS problem size of 256³. For small-scale problems, however, the hybrid programming model is not as efficient because of microtasking overhead. The results show that the hybrid programming model has an advantage on the ES for larger problems.
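To put the two reported rates side by side, the following sketch computes the process layout of each model and the fraction of peak each one reaches. It rests on assumptions that come from the first abstract in this listing rather than from this one: 64 Gflop/s peak per node and 8 arithmetic processors per node. The hybrid layout of one MPI process per node is also an assumption, and the function names are hypothetical.

```python
# Sustained performance reported in the abstract (16 PNs, 256^3 DNS problem).
HYBRID_GFLOPS = 346.9
FLAT_GFLOPS = 296.4
NODES = 16

# Assumptions taken from the first abstract in this listing, not from this one:
# 64 Gflop/s peak per node and 5120 / 640 = 8 arithmetic processors per node.
PEAK_GFLOPS_PER_NODE = 64
APS_PER_NODE = 8


def layout(model):
    """Return (MPI processes, threads per process) for a programming model."""
    if model == "flat":
        return NODES * APS_PER_NODE, 1   # one MPI process per AP
    if model == "hybrid":
        return NODES, APS_PER_NODE       # assumed: one MPI process per node
    raise ValueError(f"unknown model: {model}")


def peak_fraction(sustained_gflops):
    """Fraction of the 16-node peak that a sustained rate represents."""
    return sustained_gflops / (NODES * PEAK_GFLOPS_PER_NODE)


hybrid_over_flat = HYBRID_GFLOPS / FLAT_GFLOPS

print(layout("flat"), layout("hybrid"))
print(round(peak_fraction(FLAT_GFLOPS), 3), round(peak_fraction(HYBRID_GFLOPS), 3))
print(round(hybrid_over_flat, 2))
```

Under these assumptions the hybrid model runs about 17% faster than the flat model at this problem size, with far fewer MPI processes taking part in inter-node communication.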

3.

Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale

《Future Generation Computer Systems》2014

As supercomputers scale to 1000 PFlop/s over the next decade, investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices for high-performance computing (HPC) hardware/software co-design is crucial. This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance investigation toolkit, such as its scalability to 2²⁷ simulated Message Passing Interface (MPI) ranks on 960 real processor cores, the capability to evaluate the performance of different MPI collective communication algorithms, and the ability to evaluate the performance of a basic Monte Carlo application with different architectural parameters.

4.

The Earth Simulator: roles and impacts

Tetsuya Sato 《Parallel Computing》2004,30(12):1279-1286

The Earth Simulator Research Project started in March 2002 with the primary objective of producing reliable prediction data for global climate change. Within a couple of months after the start of operation, the Earth Simulator achieved an amazing performance of 35.86 Teraflops (about 90% of the peak performance of 40.96 Teraflops) in the Linpack benchmark test and, more surprisingly, 26.58 Teraflops for a typical application program, a global atmospheric circulation model (called AFES) with a horizontal resolution of 10 km. These facts assure us that the real contribution of the Earth Simulator will be far greater than originally expected. Undoubtedly, the Earth Simulator will help bring about a paradigm shift in science, industry, and human thinking, as well as help find the best of human wisdom for keeping a sustainable symbiotic relationship with nature.

5.

Extending Unix for scalable computing

DeBenedictis E.P. Johnson S.C. 《Computer》1993,26(11):43-53

Because it retrieves all instructions and data from a single memory, the von Neumann computer architecture has a fundamental speed limit. The scalable multicomputer architecture, which uses many microprocessors together to solve a single problem and can run at teraflop speeds, may be a solution. While teraflop processor technology is known, the scalable operating and I/O system technology necessary for those speeds is not known. The authors describe how Unix can be extended to scalable computing, permitting teraflop speeds and offering parallel computing to users unfamiliar with parallel programming. They designed this technology into the system software of the Ncube-2, the predecessor to Ncube's announced teraflop parallel computer. The authors describe the system in detail and provide some performance results.

6.

Packet synchronization for synchronous optical deflection-routed interconnection networks

Feehrer J.R. Ramfelt L.H. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(6):605-611

Deflection routing resolves output port contention in packet-switched multiprocessor interconnection networks by granting the preferred port to the highest priority packet and directing contending packets out other ports. When combined with optical links and switches, deflection routing yields simple bufferless nodes, high bit rates, scalable throughput, and low latency. We discuss the problem of packet synchronization in synchronous optical deflection networks with nodes distributed across boards, racks, and cabinets. Synchronous operation is feasible due to very predictable optical propagation delays. A routing control processor at each node examines arriving packets and assigns them to output ports. Packets arriving on different input ports must be bitwise aligned; there are no elastic buffers to correct for mismatched arrivals. “Time of flight” packet synchronization is done by balancing link delays during network design. Using a directed graph network model, we formulate a constrained minimization problem for minimizing link delays subject to synchronization and packaging constraints. We demonstrate our method on a ShuffleNet graph, and show modifications to handle multiple packet sizes and latency-critical paths.
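The port-assignment rule in the first sentence of this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' routing control processor: the tuple layout, the tie-break on packet id, and the choice of the lowest-numbered free port for deflected packets are all assumptions, and `deflection_route` is a hypothetical name.

```python
def deflection_route(packets, num_ports):
    """Assign each arriving packet to a distinct output port.

    packets: list of (packet_id, priority, preferred_port); larger priority wins.
    Returns a dict mapping packet_id to the output port it is sent on.
    """
    if len(packets) > num_ports:
        # A bufferless node must forward every packet in the same time slot.
        raise ValueError("more packets than output ports")

    free = set(range(num_ports))
    assignment = {}
    deflected = []

    # Serve packets from highest to lowest priority (ties broken by id, an assumption).
    for pid, _prio, preferred in sorted(packets, key=lambda p: (-p[1], p[0])):
        if preferred in free:
            free.remove(preferred)
            assignment[pid] = preferred
        else:
            deflected.append(pid)   # lost the contention for its preferred port

    # Deflected packets leave on whatever ports remain (lowest-numbered first).
    for pid in deflected:
        port = min(free)
        free.remove(port)
        assignment[pid] = port
    return assignment


routes = deflection_route([("a", 3, 0), ("b", 1, 0), ("c", 2, 1)], num_ports=3)
print(routes)
```

In the example, packets "a" and "b" both prefer port 0. The higher-priority "a" wins it, "c" takes its uncontended port 1, and "b" is deflected to the remaining port 2.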

7.

Parallel iterative solvers for finite-element methods using an OpenMP/MPI hybrid programming model on the Earth Simulator (cited by: 1; self-citations: 0; citations by others: 1)

Kengo Nakajima 《Parallel Computing》2005,31(10-12):1048

An efficient parallel iterative method for the finite-element method has been developed for symmetric multiprocessor (SMP) cluster architectures with vector processors such as the Earth Simulator. The method is based on a three-level hybrid parallel programming model, including message passing for inter-SMP node communication, loop directives by OpenMP for intra-SMP node parallelization and vectorization for each processing element (PE). Simple 3D linear elastic problems with more than 2.2 × 10⁹ DOF have been solved using a 3 × 3 block ICCG(0) method with additive Schwarz domain decomposition and PDJDS/CM-RCM reordering on 176 nodes of the Earth Simulator, achieving a performance of 3.80 TFLOPS. Furthermore, the effect of the number of colors in reordering has been evaluated on various types of computers.

8.

SUPRENUM: A trendsetter in modern supercomputer development

Wolfgang K. Giloi 《Parallel Computing》1988,7(3):283-296

The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the side of the hardware technology there exists the dichotomy between the use of very high-speed circuitry or very large-scale integrated circuitry. On the side of the architecture there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node has only private memory, and communication takes place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault-tolerance.

In the paper the various dichotomies are discussed and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present incorporation, SUPRENUM is configurable to up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance, IEEE double precision. The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are on the hardware side the need for a bottleneck-free interconnection structure as well as the highest possible node performance obtained with the highest possible packaging density, in order to accommodate a node on a single circuit board. On the side of the system software the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the side of the user an appropriate program development environment must be provided. Last but not least, the system must exhibit a high degree of fault tolerance, if for nothing else but for the sake of obtaining a sufficiently high MTBF.

In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system—a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.
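The node counts and peak figures quoted in this abstract can be cross-checked with a short calculation. The sketch below only restates numbers from the abstract; the variable names are illustrative.

```python
CLUSTERS = 16             # "up to 16 clusters"
NODES_PER_CLUSTER = 16    # working nodes per cluster
MFLOPS_PER_NODE = 20      # peak per node, IEEE double precision

max_nodes = CLUSTERS * NODES_PER_CLUSTER           # should match "up to 256 nodes"
peak_gflops = max_nodes * MFLOPS_PER_NODE / 1000   # should be close to "5 Gflops"

print(max_nodes, peak_gflops)
```

The product of 256 working nodes and 20 Mflops per node gives 5.12 Gflops, which the abstract rounds to 5 Gflops.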

9.

Analysis of cycle stealing with switching times and thresholds

Takayuki Mor Alan 《Performance Evaluation》2005,61(4):347-369

We consider two processors, each serving its own M/GI/1 queue, where one of the processors (the “donor”) can help the other processor (the “beneficiary”) with its jobs, during times when the donor processor is idle. That is, the beneficiary processor “steals idle cycles” from the donor processor. There is a switching time required for the donor processor to start working on the beneficiary jobs, as well as a switching back time. We also allow for threshold constraints on both the beneficiary and donor sides, whereby the decision to help is based not only on idleness but also on satisfying threshold criteria in the number of jobs.

We analyze the mean response time for the donor and beneficiary processors. Our analysis is approximate, but can be made as accurate as desired, and is validated via simulation. Results of the analysis illuminate principles on the general benefits of cycle stealing and the design of cycle stealing policies.

10.

An analysis of the Intel 80x86 security architecture and implementations

Sibert O. Porras P.A. Lindell R. 《IEEE transactions on pattern analysis and machine intelligence》1996,22(5):283-293

An in-depth analysis of the 80x86 processor families identifies architectural properties that may have unexpected, and undesirable, results in secure computer systems. In addition, reported implementation errors in some processor versions render them undesirable for secure systems because of potential security and reliability problems. We discuss the imbalance in scrutiny for hardware protection mechanisms relative to software, and why this imbalance is increasingly difficult to justify as hardware complexity increases. We illustrate this difficulty with examples of architectural subtleties and reported implementation errors.

11.

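A comparison of the Triple-based Hierarchical Interconnection Network and the 2-D Mesh

Qiao Baojun, Shi Feng, Ji Weixing 《Computer Science》2007,34(9):253-255

Multi-core processors have become the direction of research and development in high-performance processor architecture, and the way the cores are interconnected plays an important role in multi-core processor performance. With the aim of lowering node degree, reducing the number of network links, and shortening the network diameter, a new hierarchical interconnection network for on-chip inter-core interconnection is proposed: the Triple-based Hierarchical Interconnection Network (THIN). The network has a simple topology, a low node degree, and relatively few links, and it shows clear hierarchy and symmetry as well as good scalability. The static metrics and contention-free latency of THIN and the 2-D Mesh are compared in depth. The results show that, for small network sizes, THIN is better suited than the 2-D Mesh for building on-chip inter-core communication networks.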

12.

A parallel signal processing system based on multiple interconnection networks

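Wang Yilin, Cai Ping, Mei Jidan 《Computer Engineering》2008,34(10):259-260

In parallel processing systems, the communication overhead between processing nodes is the main bottleneck limiting processor performance. This paper proposes a parallel processing system built around the TMS320C641X digital signal processor and designs several parallel interconnection networks, including a PCI bus, serial ports, and a packet-switched network. The input, output, control, and other data streams are thereby separated and carried over the network best suited to each, which improves transmission efficiency and combines high-performance DSPs with a high-performance interconnection system.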

13.

VLSI design for massively parallel signal processors

SY Kung Jurgen Annevelink 《Microprocessors and Microsystems》1983,7(10):461-468

14.

Optimal software multicast in wormhole-routed multistage networks

Hong Xu Yadong Gui Ni L.M. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):597-607

Multistage interconnection networks are a popular class of interconnection architecture for constructing scalable parallel computers (SPCs). The focus of this paper is on the multistage network system which supports wormhole-routed turnaround routing. Existing machines characterized by such a system model include the IBM SP-1 and SP-2, TMC CM-5, and Meiko CS-2. Efficient collective communication among processor nodes is critical to the performance of SPCs. A system-level multicast service, in which the same message is delivered from a source node to an arbitrary number of destination nodes, is fundamental in supporting collective communication primitives including the application-level broadcast, reduction, and barrier synchronization. This paper addresses how to efficiently implement multicast services in wormhole-routed multistage networks, in the absence of hardware multicast support, by exploiting the properties of the turnaround switching technology. An optimal multicast algorithm is proposed. The results of implementations on a 64-node SP-1 show that the proposed algorithm significantly outperforms the application-level broadcast primitives provided by currently existing collective communication libraries, including the public domain MPI.
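Without hardware multicast support, a multicast has to be built from unicast messages. The sketch below shows the generic recursive-doubling (binomial tree) schedule, in which every node that already holds the message forwards it to one node that does not. It is not the optimal algorithm proposed in the paper: it ignores channel contention in the wormhole-routed network, which is what a turnaround-routing-aware ordering of destinations would have to address. The function name is hypothetical.

```python
import math


def binomial_multicast(source, destinations):
    """Schedule a software multicast as rounds of unicast sends.

    Returns a list of steps; each step is a list of (sender, receiver) pairs
    that can proceed in the same round.
    """
    holders = [source]            # nodes that already have the message
    pending = list(destinations)  # nodes still waiting for it
    steps = []
    while pending:
        step = []
        # Snapshot the holders so nodes reached in this round start sending next round.
        for sender in list(holders):
            if not pending:
                break
            receiver = pending.pop(0)
            step.append((sender, receiver))
            holders.append(receiver)
        steps.append(step)
    return steps


schedule = binomial_multicast(0, [1, 2, 3, 4, 5, 6, 7])
for round_no, sends in enumerate(schedule, start=1):
    print(round_no, sends)
```

Because the number of holders doubles each round, d destinations are reached in ⌈log₂(d + 1)⌉ rounds, so seven destinations take three rounds.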

15.

A rapid-prototyping environment for digital-signal processors

Hartley R. Welles K. II Hartman M. Chatterjee A. Delano P. Molnar B. Rafferty C. 《Design & Test of Computers, IEEE》1991,8(2):11-25

16.

A scalable high-performance graphics processor: GVIP

Tsuneo Ikedo 《The Visual computer》1995,11(3):121-133

The GVIP (geometric and TV image processor) graphics processor, which creates and synthesizes computer graphics and TV images and meets the requirements of multi-media systems, is described. The hardware modules that make up this graphics processor include: a 32-bit embedded RISC processor, a Phong and Gouraud shading processor, a texture mapping processor, a hidden surface removal processor, an HDTV video image processor, a BitBlt processor, an image processing module, and an outline font fill generator. These hardware modules, fabricated using 0.8 µm CMOS standard cells, have been placed in three integrated circuit chips. The total number of gates used for one set of chips is approximately 350,000.

17.

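Design of a CAN-bus-based flight simulator cockpit system (cited by: 1; self-citations: 0; citations by others: 1)

Wang Shuyun, Gu Shushan, Tian Jierong, Lin Yajun 《Software》2011,32(2):119-121,124

This paper proposes a design for a flight simulator cockpit system based on the CAN bus. It analyzes the functions and overall structure of the cockpit system of a flight simulator for a particular aircraft type, groups the signals in the simulator cockpit into multiple bus nodes according to their type and location, and gives hardware and software design methods for the nodes. Practice has shown that the CAN-bus-based flight simulator cockpit system offers high reliability, strong practicality, flexible expansion, a short development cycle, and good cost-effectiveness.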

18.

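MCIM: a memory-centric interconnection mechanism for parallel systems (cited by: 2; self-citations: 1; citations by others: 1)

Li Sanli, Ge Yi, Wu Jianfeng 《Chinese Journal of Computers》1999,22(4):395-402

The interconnection network (IN) between the compute nodes of a parallel system has long been an active research topic in parallel architecture. Over the past 30 years, many IN structures and their properties have been studied, but all of these INs are based on logic circuits. This paper proposes an interconnection mechanism for parallel systems that is centered on multi-port fast static memory, called MCIM. Unlike shared memory, MCIM has a small capacity and is divided into multiple communication mailbox areas for message passing; through the port access interface (PAI) of each port, it connects 8 to 16 compute nodes. A commonly used four-port memory can make up a 32- to 64-compute-node parallel …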

19.

Optimal floating point multiplication processor for signal processing

HC Yung CR Allen 《Image and vision computing》1983,1(3):152-156

The design of a floating point matrix-vector multiplication processor array for VLSI, which has an optimal area-time complexity product, is presented. This processor array is capable of performing the function (where n = 1,…, N) and can be applied in many digital signal processing applications, by simply changing the matrix coefficients stored in that array. Each N-bit mantissa, M-bit exponent (N, M) processor element of the array comprises a mantissa multiplier/adder circuit and hardware to handle the floating point control. The multiplier/adder circuit is implemented by a new optimal algorithm, which is regular, recursive and fast. Secondly, the algorithm offers a highly local and regular interconnection network, which is a fundamental requirement in VLSI circuit design methodology.

20.

A new algorithm based on Givens rotations for solving linear equations on fault-tolerant mesh-connected processors

Murthy K.N.B. Bhuvaneswari K. Ram Murthy C.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):825-832

In this paper, we propose a new I/O-overhead-free parallel algorithm, based on Givens rotations, for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation, as identical processors with a simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm-based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures.
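For readers unfamiliar with the underlying numerical step, the sketch below solves a small linear system with Givens rotations in plain, sequential Python. It is the textbook one-sided elimination followed by back substitution, not the two-sided elimination or the mesh mapping proposed in the paper. The augmented matrix it works on happens to have the same N×(N+1) shape as the processor array in the abstract, but how entries map to processors is not modeled here. The function name is hypothetical.

```python
import math


def solve_givens(a, b):
    """Solve a x = b for a square, nonsingular matrix a using Givens rotations."""
    n = len(a)
    # Work on the augmented N x (N+1) matrix [A | b].
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        for row in range(col + 1, n):
            if m[row][col] == 0.0:
                continue  # entry is already zero, no rotation needed
            # Rotation in the (col, row) plane that zeroes m[row][col].
            r = math.hypot(m[col][col], m[row][col])
            c, s = m[col][col] / r, m[row][col] / r
            for k in range(col, n + 1):
                upper, lower = m[col][k], m[row][k]
                m[col][k] = c * upper + s * lower
                m[row][k] = -s * upper + c * lower

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / m[i][i]
    return x


solution = solve_givens([[2, 1], [1, 3]], [3, 5])
print(solution)
```

Each rotation is orthogonal, so the elimination does not amplify rounding errors the way unpivoted Gaussian elimination can, and every rotation touches only two rows, which suits a regular array of identical processors.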