Similar Documents
20 similar documents found.
1.
《Parallel Computing》1999,25(10-11):1257-1280
We present physics results from large-scale simulations of Quantum Chromodynamics (QCD) on the space-time lattice carried out with the CP-PACS computer. The CP-PACS is a massively parallel system with a peak speed of 614 Gflops and 320 Gbyte of main memory, developed at the Center for Computational Physics, University of Tsukuba. Since the start of full operation of CP-PACS in October 1996, precision calculation of the light hadron spectrum in the quenched approximation of QCD and a systematic attempt at a calculation without this approximation have been pursued. The physics motivations for these calculations, the computational difficulties, and the advances brought by the CP-PACS are discussed. The performance of the CP-PACS for lattice QCD computations is described in a companion paper by S. Aoki et al.

2.
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision, respectively, on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed-precision approach for Krylov solvers using reliable updates, which allows for full double-precision accuracy while using only single- or half-precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
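The mixed-precision idea summarized above can be made concrete with a small sketch: run the Krylov iteration in low precision and periodically replace the iterated residual by one recomputed from the current solution in full precision (a "reliable update"). The code below is a minimal serial illustration on a toy SPD system, not the authors' GPU solver; the matrix, the update interval and the tolerance are assumptions made for the example.

```cpp
// Minimal sketch of mixed-precision CG with reliable updates. The bulk of the
// arithmetic runs in float; every `reliable_freq` iterations the true residual
// r = b - A*x is recomputed in double and replaces the iterated float residual.
// The matrix (a 1-D Laplacian), the update interval and the tolerance are
// illustrative assumptions, not the authors' GPU implementation.
#include <cstdio>
#include <vector>
#include <cmath>

static const int N = 64;

// y = A*x for the tridiagonal stencil (-1, 2, -1), templated on precision.
template <typename T>
void apply_A(const std::vector<T>& x, std::vector<T>& y) {
    for (int i = 0; i < N; ++i) {
        T v = T(2) * x[i];
        if (i > 0)     v -= x[i - 1];
        if (i < N - 1) v -= x[i + 1];
        y[i] = v;
    }
}

int main() {
    std::vector<double> x(N, 0.0), b(N, 1.0), r_true(N), Ax(N);  // solution kept in double
    std::vector<float> r(N), p(N), Ap(N);
    for (int i = 0; i < N; ++i) r[i] = float(b[i]);              // r0 = b - A*0 = b
    p = r;

    const int reliable_freq = 16;      // assumption: reliable-update interval
    const double tol = 1e-10;          // assumption: target residual norm

    for (int k = 1; k <= 2000; ++k) {
        apply_A(p, Ap);                                          // low-precision work
        float rr = 0.0f, pAp = 0.0f;
        for (int i = 0; i < N; ++i) { rr += r[i] * r[i]; pAp += p[i] * Ap[i]; }
        float alpha = rr / pAp;
        for (int i = 0; i < N; ++i) { x[i] += double(alpha) * p[i]; r[i] -= alpha * Ap[i]; }

        if (k % reliable_freq == 0) {                            // reliable update in double
            apply_A(x, Ax);
            double norm2 = 0.0;
            for (int i = 0; i < N; ++i) { r_true[i] = b[i] - Ax[i]; norm2 += r_true[i] * r_true[i]; }
            for (int i = 0; i < N; ++i) r[i] = float(r_true[i]);
            std::printf("iter %4d  true |r| = %.3e\n", k, std::sqrt(norm2));
            if (norm2 < tol * tol) break;
        }
        float rr_new = 0.0f;
        for (int i = 0; i < N; ++i) rr_new += r[i] * r[i];
        float beta = rr_new / rr;
        for (int i = 0; i < N; ++i) p[i] = r[i] + beta * p[i];
    }
    return 0;
}
```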

3.
CP-PACS: A massively parallel processor at the University of Tsukuba
Computational Physics by Parallel Array Computer System (CP-PACS) is a massively parallel processor developed and in full operation at the Center for Computational Physics at the University of Tsukuba. It is an MIMD machine with distributed memory, equipped with 2048 processing units and 128 GB of main memory. The theoretical peak performance of CP-PACS is 614.4 Gflops. CP-PACS achieved 368.2 Gflops with the Linpack benchmark in 1996, which at that time was the fastest rating in the world. CP-PACS has two remarkable features: a Pseudo Vector Processing feature (PVP-SW) on each node processor, which performs high-speed vector processing on a single-chip superscalar microprocessor, and a 3-dimensional Hyper-Crossbar (3-D HXB) interconnection network, which provides high-speed, flexible communication among node processors. In this article, we present an overview of CP-PACS, the architectural topics, some details of the hardware and support software, and several performance results.
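As a quick consistency check, the per-node figures follow directly from the totals quoted in the abstract (a small illustrative calculation; nothing here beyond the numbers already given):

```cpp
// Per-node figures implied by the CP-PACS totals quoted above.
#include <cstdio>

int main() {
    const double peak_gflops = 614.4;   // system peak
    const int    nodes       = 2048;    // processing units
    const double mem_gb      = 128.0;   // main memory
    const double linpack     = 368.2;   // 1996 Linpack result

    std::printf("peak per node     : %.0f Mflops\n", peak_gflops / nodes * 1000.0);
    std::printf("memory per node   : %.0f MB\n", mem_gb / nodes * 1024.0);
    std::printf("Linpack efficiency: %.0f%% of peak\n", linpack / peak_gflops * 100.0);
    return 0;
}
```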

4.
We developed MDGRAPE-2, a hardware accelerator that calculates forces at high speed in molecular dynamics (MD) simulations. MDGRAPE-2 is connected to a PC or a workstation as an extension board. The sustained performance of one MDGRAPE-2 board is 15 Gflops, roughly equivalent to the peak performance of the fastest supercomputer processing element. One board is able to calculate all forces between 10 000 particles in 0.28 s (i.e. 310000 time steps per day). If 16 boards are connected to one computer and operated in parallel, this calculation speed becomes ∼10 times faster. In addition to MD, MDGRAPE-2 can be applied to gravitational N-body simulations, the vortex method and smoothed particle hydrodynamics in computational fluid dynamics.
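The work an MDGRAPE-2-style board offloads is the all-pairs force evaluation, which scales as O(N²) and dominates the cost of an MD step. The sketch below shows the kind of loop such an accelerator evaluates on the host's behalf, here for a simple Coulomb-like pair force in plain C++; the potential, softening and particle data are illustrative assumptions, not the board's actual pipeline.

```cpp
// All-pairs force evaluation: the O(N^2) kernel that MDGRAPE-2-style
// accelerators offload from the host. Coulomb-like pair force; the softening
// eps2 avoids division by zero and is used here for illustration only.
#include <cstdio>
#include <vector>
#include <cmath>
#include <random>

struct Vec3 { double x, y, z; };

int main() {
    const int N = 10000;                 // matches the particle count quoted above
    const double eps2 = 1e-6;            // softening (assumption)
    std::vector<Vec3> pos(N), force(N, Vec3{0, 0, 0});
    std::vector<double> q(N, 1.0);       // unit charges (assumption)

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 10.0);
    for (auto& p : pos) p = {u(rng), u(rng), u(rng)};

    // N*(N-1)/2 pair interactions, about 5e7 pairs for N = 10 000.
    for (int i = 0; i < N; ++i) {
        for (int j = i + 1; j < N; ++j) {
            double dx = pos[j].x - pos[i].x;
            double dy = pos[j].y - pos[i].y;
            double dz = pos[j].z - pos[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + eps2;
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            double s = q[i] * q[j] * inv_r3;
            force[i].x -= s * dx; force[i].y -= s * dy; force[i].z -= s * dz;
            force[j].x += s * dx; force[j].y += s * dy; force[j].z += s * dz;
        }
    }
    std::printf("f[0] = (%g, %g, %g)\n", force[0].x, force[0].y, force[0].z);
    return 0;
}
```

At 0.28 s per all-pairs evaluation, one day allows roughly 86 400 / 0.28 ≈ 310 000 such evaluations, which matches the time-step figure quoted above.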

5.
The combination of a non-overlapping Schwarz preconditioner and the Hybrid Monte Carlo (HMC) algorithm is shown to yield an efficient simulation algorithm for two-flavor lattice QCD with Wilson quarks. Extensive tests are performed, on lattices of size up to 32×24³, with lattice spacings a ≈ 0.08 fm and at bare current-quark masses as low as 21 MeV.

6.
We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic processor elements. Speedups of 3.4 × for the 8 × 8 × 16 × 16 lattice and 5.7 × for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell’s synergistic processor elements, whereas performance of the application’s parallel execution framework is limited by the bandwidth between main memory and the Cell’s power processor element.
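The abstract's conclusion that the kernels are bandwidth-limited can be made concrete with a simple roofline-style estimate: the sustained flop rate of a memory-bound kernel is capped by its arithmetic intensity times the memory bandwidth. The constants below (bytes and flops per lattice-site update, bus bandwidth) are rough assumptions chosen for illustration, not measurements from the paper.

```cpp
// Roofline-style estimate of a bandwidth-limited lattice kernel.
// All constants are illustrative assumptions, not figures from the paper.
#include <cstdio>

int main() {
    const double bandwidth      = 25.6e9;  // bytes/s, assumed memory bandwidth
    const double flops_per_site = 1146;    // assumed flop count per site update
    const double bytes_per_site = 1440;    // assumed gauge-link + spinor traffic per site

    // For a memory-bound kernel, sustained rate <= intensity * bandwidth.
    double intensity  = flops_per_site / bytes_per_site;
    double max_gflops = intensity * bandwidth / 1e9;

    std::printf("arithmetic intensity : %.2f flops/byte\n", intensity);
    std::printf("bandwidth-bound peak : %.1f Gflops\n", max_gflops);
    return 0;
}
```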

7.
Fast neural net simulation with a DSP processor array
This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-bit floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system, with 3.8 Gflops peak performance, consumes less than 800 W of electrical power and fits into a 19-inch rack. While reaching the speed of modern supercomputers, MUSIC can still be used as a personal desktop computer at a single researcher's disposal. In neural net simulation, this gives a single user a computing performance that was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications.
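The relation between the two performance figures quoted above is simple arithmetic: 330 million connection updates per second at 1.4 Gflops sustained implies a little over four floating-point operations per connection update (a back-of-the-envelope check; the exact per-connection operation count of the implementation is not given in the abstract).

```cpp
// Back-of-the-envelope check: flops per connection update implied by the
// MUSIC figures quoted above (330 MCUPS sustained at 1.4 Gflops).
#include <cstdio>

int main() {
    const double cups      = 330e6;   // connection updates per second
    const double sustained = 1.4e9;   // floating-point operations per second
    std::printf("flops per connection update: %.1f\n", sustained / cups);
    // Plausible for backpropagation with continuous weight update: roughly one
    // multiply-add in the forward pass plus one multiply-add in the weight update.
    return 0;
}
```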

8.
A parallel algorithm for the iterative solution of sparse linear systems is presented. This algorithm is shown to be efficient for arbitrarily sparse matrices. Analysis of this algorithm suggests that a network of Processing Elements [PE's] equal in number to the number R of non-zero matrix entries is particularly useful. If this collection of PE's is interconnected by a message-passing, or a synchronous, communication network which is fast enough, the iteration time grows as the logarithm of the number of PE's. A comparison with earlier work, which suggested that only √R PE's are useful for this task, is also presented. The performance of three proposed networks of PE's on this algorithm is analyzed. The networks investigated all have the topology of the Cube Connected Cycles [CCC] graph, and all employ the same silicon technology, and the same number of chips and wires, and hence all should cost the same. One, the Boolean Vector Machine [BVM], employs 2²⁰ bit-serial PE's implemented in 4096 VLSI chips; the other two networks use different 32-bit parallel microprocessors, and a 32-bit parallel CCC to interconnect 2048 2-chip processors. One of the microprocessors is assumed to deliver about 1 Mflop, while the other is assumed to deliver 32 Mflops per PE. The comparison indicates that the BVM network would have superior performance to both of these parallel networks.
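The central idea, one processing element per non-zero entry with a logarithmic-depth combine, can be sketched serially: each non-zero independently forms a product a_ij·x_j, and the products belonging to a row are summed, which a parallel machine can do with a reduction tree of depth O(log R). The plain-C++ Jacobi iteration below is organized this way purely for illustration; the matrix and iteration count are assumptions, and the paper's networks are not reproduced here.

```cpp
// Serial sketch of the "one PE per non-zero" view of a sparse iteration.
// Each non-zero (i, j, a) independently forms a product a * x[j]; the products
// for a row are then combined, which a parallel machine can do in O(log R)
// depth with a reduction tree. Matrix and iteration count are assumptions.
#include <cstdio>
#include <vector>

struct NonZero { int row, col; double val; };

int main() {
    const int n = 4;
    // Small diagonally dominant test matrix in coordinate (COO) form.
    std::vector<NonZero> A = {
        {0, 0, 4}, {0, 1, -1},
        {1, 0, -1}, {1, 1, 4}, {1, 2, -1},
        {2, 1, -1}, {2, 2, 4}, {2, 3, -1},
        {3, 2, -1}, {3, 3, 4}};
    std::vector<double> b = {3, 2, 2, 3}, x(n, 0.0), diag(n, 0.0);
    for (const auto& nz : A) if (nz.row == nz.col) diag[nz.row] = nz.val;

    for (int iter = 0; iter < 50; ++iter) {
        // Step 1: every non-zero "PE" computes its product independently.
        std::vector<double> rowsum(n, 0.0);
        for (const auto& nz : A)
            if (nz.row != nz.col)
                rowsum[nz.row] += nz.val * x[nz.col];   // tree-reducible per row

        // Step 2: Jacobi update x_i = (b_i - sum_{j != i} a_ij x_j) / a_ii.
        for (int i = 0; i < n; ++i) x[i] = (b[i] - rowsum[i]) / diag[i];
    }
    std::printf("x = (%g, %g, %g, %g)\n", x[0], x[1], x[2], x[3]);  // converges to (1,1,1,1)
    return 0;
}
```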

9.
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed-memory message-passing architecture with a two-dimensional mesh topology. We analyze and compare three algorithms and obtain an implementation, BiMMeR, that uses communication primitives highly suited to the Delta and exploits the single-node assembly-coded matrix multiplication. Our algorithm is completely general, i.e. able to deal with various data layouts as well as arbitrary mesh aspect ratios and matrix dimensions, and has achieved a parallel efficiency of 86%, with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe BiMMeR's design and implementation and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.
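The quoted figures can be back-calculated directly: 8 Gflops spread over 256 nodes is about 31 Mflops per node, and at 86% parallel efficiency this implies a single-node baseline of roughly 36 Mflops (an illustrative back-calculation from the abstract's numbers, not data reported in the paper).

```cpp
// Back-calculation from the quoted BiMMeR figures.
#include <cstdio>

int main() {
    const double total_gflops = 8.0;   // achieved on 256 nodes
    const int    nodes        = 256;
    const double efficiency   = 0.86;  // reported parallel efficiency

    double per_node = total_gflops / nodes * 1000.0;   // Mflops per node achieved
    double baseline = per_node / efficiency;           // implied single-node rate
    std::printf("achieved per node            : %.1f Mflops\n", per_node);
    std::printf("implied single-node baseline : %.1f Mflops\n", baseline);
    return 0;
}
```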

10.
Lattice quantum chromodynamics (QCD) is a non-perturbative method for solving QCD from first principles. By simulating the interaction of gluon and fermion fields on a hypercubic lattice, its results are regarded as a reliable description of strong-interaction phenomena, and lattice calculations are of great importance to QCD theory. However, lattice QCD calculations involve an extremely large number of degrees of freedom, which makes computational efficiency hard to improve; domain decomposition of the lattice is usually employed to achieve scalable parallel computation, but improving the efficiency of data-parallel computation remains the core problem. Taking the representative lattice QCD package Grid as an example, this paper studies data-parallel computation patterns in lattice QCD. Focusing on the complex tensor computations in lattice QCD and on improving the efficiency of large-scale parallel computation, it first presents a theoretical analysis of the data-parallel characteristics of lattice QCD methods, then reports performance tests of Grid's concrete data-parallel mechanisms such as SIMD and OpenMP, and finally discusses the significance of data-parallel computation patterns for lattice QCD applications.
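The SIMD and OpenMP data parallelism discussed above can be illustrated with a minimal site-parallel loop: threads split the lattice volume while the per-site arithmetic is written so the compiler can vectorize it. The sketch below applies a 3×3 complex (SU(3)-like) matrix to a 3-component vector at every site in plain C++ with an OpenMP pragma; it is a schematic illustration only, not Grid's actual data layout or API.

```cpp
// Schematic data-parallel lattice kernel: OpenMP threads across sites, SIMD
// within the per-site arithmetic. Not Grid's layout or API; illustration only.
#include <cstdio>
#include <vector>
#include <complex>

using cplx = std::complex<double>;
constexpr long VOL = 16L * 16 * 16 * 16;   // lattice volume (assumption)

int main() {
    // One 3x3 link matrix and one 3-vector per site, flattened into arrays.
    std::vector<cplx> U(VOL * 9, cplx(0.5, 0.1));
    std::vector<cplx> in(VOL * 3, cplx(1.0, 0.0)), out(VOL * 3);

    #pragma omp parallel for              // thread-level parallelism over sites
    for (long s = 0; s < VOL; ++s) {
        const cplx* u = &U[s * 9];
        const cplx* v = &in[s * 3];
        cplx* w = &out[s * 3];
        for (int i = 0; i < 3; ++i) {     // small dense kernel: a vectorizing
            cplx acc(0.0, 0.0);           // compiler can apply SIMD here
            for (int j = 0; j < 3; ++j)
                acc += u[i * 3 + j] * v[j];
            w[i] = acc;
        }
    }
    std::printf("out[0] = (%g, %g)\n", out[0].real(), out[0].imag());
    return 0;
}
```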

11.
The designer of a numerical supercomputer is confronted with fundamental design decisions stemming from some basic dichotomies in supercomputer technology and architecture. On the hardware side there exists the dichotomy between very high-speed circuitry and very large-scale integrated circuitry. On the architecture side there exists the dichotomy between the SIMD vector machine and the MIMD multiprocessor architecture. In the latter case, the ‘nodes’ of the system may communicate through shared memory, or each node may have only private memory, with communication taking place through the exchange of messages. All these design decisions have implications with respect to performance, cost-effectiveness, software complexity, and fault tolerance.

In the paper the various dichotomies are discussed, and a rationale is provided for the decision to realize the SUPRENUM supercomputer, a large ‘number cruncher’ with 5 Gflops peak performance, in the form of a massively parallel MIMD/SIMD multicomputer architecture. In its present configuration, SUPRENUM can be scaled up to 256 nodes, where each node is a pipeline vector machine with 20 Mflops peak performance (IEEE double precision). The crucial issues of such an architecture, which we consider the trendsetter for future numerical supercomputer architecture in general, are, on the hardware side, the need for a bottleneck-free interconnection structure as well as the highest possible node performance at the highest possible packaging density, in order to accommodate a node on a single circuit board. On the system software side the design goal is to obtain an adequately high degree of operational safety and data security with minimum software overhead. On the user side an appropriate program development environment must be provided. Last but not least, the system must exhibit a high degree of fault tolerance, if only for the sake of obtaining a sufficiently high MTBF.

In the paper a detailed discussion of the hardware and software architecture of the SUPRENUM supercomputer, whose design is based upon the considerations discussed, is presented. A largely bottleneck-free interconnection structure is accomplished in a hierarchical manner: the machine consists of up to 16 ‘clusters’, and each cluster consists of 16 working ‘nodes’ plus some organisational nodes. The node is accommodated on a single circuit board; its architecture is based on the principle of data structure architecture explained in the paper. SUPRENUM is strictly a message-based system; consequently, the local node operating system has been designed to handle a secured message exchange with a considerable degree of hardware support and with the lowest possible software overhead. SUPRENUM is organized as a distributed system—a prerequisite for the high degree of fault tolerance required; therefore, there exists no centralized global operating system. The paper concludes with an outlook on the performance limits of a future supercomputer architecture of the SUPRENUM type.
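The quoted peak figure follows directly from the configuration described: 16 clusters of 16 working nodes at 20 Mflops each give about 5 Gflops (a simple consistency check on the numbers in the abstract).

```cpp
// Consistency check of the SUPRENUM figures quoted above.
#include <cstdio>

int main() {
    const int clusters = 16, nodes_per_cluster = 16;
    const double mflops_per_node = 20.0;   // pipeline vector node, peak
    double peak = clusters * nodes_per_cluster * mflops_per_node / 1000.0;
    std::printf("configured peak: %.2f Gflops\n", peak);   // ~5 Gflops as quoted
    return 0;
}
```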


12.
Computational fluid dynamic simulations are in general very compute-intensive. Only parallel simulations on modern supercomputers can satisfy the computational demands of complex simulation tasks. Facing these demands, GPUs offer high performance, as they provide high floating-point performance and high memory-to-processor bandwidth. To successfully utilize GPU clusters for the daily business of a large community, usable software frameworks must be established on these clusters. The development of such software frameworks is only feasible with maintainable software designs that consider performance as a design objective right from the start. For this work we extend the software design concepts to achieve more efficient and highly scalable multi-GPU parallelization within our software framework waLBerla for multi-physics simulations centered around the lattice Boltzmann method. Our software designs now also support a pure-MPI and a hybrid parallelization approach capable of heterogeneous simulations using CPUs and GPUs in parallel. For the first time, weak and strong scaling performance results obtained with waLBerla on the Tsubame 2.0 cluster for more than 1000 GPUs are presented. With the help of a new communication model the parallel efficiency of our implementation is investigated and analyzed in a detailed and structured performance analysis. The suitability of the waLBerla framework for production runs on large GPU clusters is demonstrated. As one possible application we show results of strong scaling experiments for flows through a porous medium.
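A communication model of the kind mentioned above typically estimates parallel efficiency from the ratio of computation time to total time per step; one simple form (an illustrative sketch, not the paper's actual model) is:

```latex
% Illustrative strong-scaling estimate (an assumption, not waLBerla's actual model):
E(P) \;=\; \frac{T_{\mathrm{comp}}(P)}{T_{\mathrm{comp}}(P) + T_{\mathrm{comm}}(P)},
\qquad
T_{\mathrm{comp}}(P) \approx \frac{N_{\mathrm{cells}}\, t_{\mathrm{cell}}}{P},
\qquad
T_{\mathrm{comm}}(P) \approx L + \frac{S_{\mathrm{ghost}}(P)}{B},
```

where P is the number of GPUs, t_cell the update time per lattice cell, L the message latency, S_ghost the ghost-layer data volume per GPU, and B the effective network bandwidth.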

13.
This paper describes the FPGA implementation of FastCrypto, which extends a general-purpose processor with a crypto coprocessor for encrypting/decrypting data. Moreover, it studies the trade-offs between FastCrypto performance and design parameters, including the number of stages per round, the number of parallel Advanced Encryption Standard (AES) pipelines, and the size of the queues. It also shows the effect of memory latency on FastCrypto performance. FastCrypto is implemented in the VHDL programming language on a Xilinx Virtex V FPGA. A throughput of 222 Gb/s at 444 MHz can be achieved on four parallel AES pipelines. To reduce power consumption, the frequency of the four parallel AES pipelines is lowered to 100 MHz while the other components run at 400 MHz. In this case, our results show a FastCrypto performance of 61.725 bits per clock cycle (b/cc) when a 128-bit single-port L2 cache memory is used. However, increasing the memory bus width to 256 bits, or using a 128-bit dual-port memory, improves the performance to 112.5 b/cc (45 Gb/s at 400 MHz), which represents 88% of the ideal performance (128 b/cc).
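The quoted figures are internally consistent and follow from simple bits-per-clock-cycle arithmetic: 222 Gb/s at 444 MHz is 500 b/cc across the four pipelines (against an ideal 4 × 128 = 512 b/cc), and 112.5 b/cc at 400 MHz is exactly 45 Gb/s and about 88% of the 128 b/cc ideal quoted above (a quick check using only numbers already given in the abstract).

```cpp
// Quick consistency checks on the FastCrypto throughput figures.
#include <cstdio>

int main() {
    // Four parallel AES pipelines at 444 MHz reach 222 Gb/s:
    double bpcc_4pipes = 222e9 / 444e6;                 // bits per clock cycle
    std::printf("4 pipelines: %.0f b/cc (ideal 4 x 128 = 512 b/cc)\n", bpcc_4pipes);

    // 112.5 b/cc at 400 MHz and its fraction of the 128 b/cc ideal:
    std::printf("throughput       : %.1f Gb/s\n", 112.5 * 400e6 / 1e9);
    std::printf("fraction of ideal: %.0f%%\n", 112.5 / 128.0 * 100.0);
    return 0;
}
```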

14.
This paper presents two coupled software packages which receive widespread use in the field of numerical simulations of Quantum Chromo-Dynamics. These consist of the BAGEL library and the BAGEL fermion sparse-matrix library, BFM. The BAGEL library can generate assembly code for a number of architectures and is configurable – supporting several precision and memory pattern options to allow architecture-specific optimisation. It provides high performance on the QCDOC, BlueGene/L and BlueGene/P parallel computer architectures that are popular in the field of lattice QCD. The code includes a complete conjugate gradient implementation for the Wilson and domain wall fermion actions, making it easy to use for third-party codes including the Jefferson Laboratory's CHROMA, UKQCD's UKhadron, and the Riken–Brookhaven–Columbia Collaboration's CPS packages.

Program summary

Program title: Bagel
Catalogue identifier: AEFE_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFE_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Public License V2
No. of lines in distributed program, including test data, etc.: 109 576
No. of bytes in distributed program, including test data, etc.: 892 841
Distribution format: tar.gz
Programming language: C++, assembler
Computer: Massively parallel message passing. BlueGene/QCDOC/others.
Operating system: POSIX, Linux and compatible.
Has the code been vectorised or parallelized?: Yes. 16 384 processors used.
Classification: 11.5
External routines: QMP, QDP++
Nature of problem: Quantum Chromo-Dynamics sparse matrix inversion for Wilson and domain wall fermion formulations.
Solution method: Optimised Krylov linear solver.
Unusual features: Domain specific compiler generates optimised assembly code.
Running time: 1 h per matrix inversion; multi-year simulations.

15.
ESPRIT Project 5417 – BECAUSE – is concerned with the benchmarking of parallel computer architectures for compute-intensive scientific and engineering applications. Fluid flow, semiconductor modelling and electromagnetic field simulation software was profiled on a range of example applications and problem sizes. The profiling results were used to select the algorithms and functions that ought to be included in the BECAUSE Benchmark Set (BBS). This paper reviews the BBS results obtained on the ‘control’ sequential-architecture computer. The results show consistent performance with some interesting variations in achieved Mflops.

16.
Parallel Algorithm Design on Some Distributed Systems
Some testing results on DAWNING-1000, Paragon and workstation clusters are described in this paper. On the home-made parallel system DAWNING-1000 with 32 computational processors, practical performances of 1.1777 Gflops and 1.58 Gflops have been measured in solving a dense linear system and in matrix multiplication, respectively. The scalability is also investigated. The importance of designing efficient parallel algorithms for evaluating parallel systems is emphasized.

17.
Explicit methods for the solution of fluid flow problems are of considerable interest in supercomputing. These methods parallelize well. The treatment of the boundaries is of particular interest with respect to both the numeric behavior of the solution and the computational efficiency. We have solved the three-dimensional Euler equations for a twisted channel using second-order centered difference operators and a three-stage Runge-Kutta method for the integration. Three different fourth-order dissipation operators were studied for numeric stabilization: one positive definite, one positive semidefinite, and one indefinite. The operators differ only in the treatment of the boundary. For computational efficiency all dissipation operators were designed with a constant bandwidth in matrix representation, with the bandwidth determined by the operator in the interior. The positive definite dissipation operator results in a significant growth in entropy close to the channel walls. The other operators maintain constant entropy. Several different implementations of the semidefinite operator obtained through factoring of the operator were also studied. We show the difference in both convergence rate and robustness for the different dissipation operators, and for the factorizations of the operator due to Eriksson. For the simulations in this study one of the factorizations of the semidefinite operator required 70%–90% of the number of iterations required by the positive definite operator. The indefinite operator was sensitive to perturbations in the inflow boundary conditions. The simulations were performed on an 8,192-processor Connection Machine CM-2 system. Full processor utilization was achieved, and a performance of 135 Mflops in single precision was obtained. A performance of 1.1 Gflops for a fully configured system with 65,536 processors was demonstrated.
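In the interior, fourth-order dissipation of the kind discussed above acts as a difference approximation of a fourth derivative added with a small coefficient to stabilize the centered scheme; a common interior stencil (shown here as a general illustration, since the boundary modifications are precisely what distinguishes the three operators in the paper) is:

```latex
% Standard interior stencil of fourth-order artificial dissipation (illustrative;
% the three operators studied in the paper differ in their boundary treatment):
(D_4 u)_i \;=\; -\,\epsilon_4 \left( u_{i-2} - 4\,u_{i-1} + 6\,u_i - 4\,u_{i+1} + u_{i+2} \right),
\qquad 0 < \epsilon_4 \ll 1 .
```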

18.
The stream architecture is a novel microprocessor architecture with wide application potential. It is critical to study how the stream architecture can be used to accelerate scientific computing programs. However, existing stream processors and stream programming languages are not designed for scientific computing. To address this issue, we design and implement a 64-bit stream processor, Fei Teng 64 (FT64), which has a peak performance of 16 Gflops. FT64 supports two kinds of communication, message passing and stream communication, based on which an interconnection architecture is designed for an FT64-based high-performance computer. This high-performance computer contains multiple modules, with each module containing eight FT64s. We also design a novel stream programming language, Stream Fortran 95 (SF95), together with its compiler, SF95Compiler, to facilitate the development of scientific applications. We test nine typical scientific application kernels on our FT64 platform to evaluate this design. The results demonstrate the effectiveness and efficiency of FT64 and its compiler for scientific computing.

19.
A toroidal lattice architecture (TLA) and a planar lattice architecture (PLA) are proposed as massively parallel neurocomputer architectures for large-scale simulations. The performance of these architectures is almost proportional to the number of node processors, and they adopt the most efficient two-dimensional processor connections for WSI implementation. They also give a solution to the connectivity problem, the performance degradation caused by the data transmission bottleneck, and the load balancing problem for efficient parallel processing in large-scale neural network simulations. The general neuron model is defined. Implementation of the TLA with transputers is described. A Hopfield neural network and a multilayer perceptron have been implemented and applied to the traveling salesman problem and to identity mapping, respectively. Proof that the performance increases almost in proportion to the number of node processors is given.

20.
Special-purpose computer for holography HORN-4 with recurrence algorithm
We designed and built a special-purpose computer for holography, HORN-4 (HOlographic ReconstructioN), using PLD (Programmable Logic Device) technology. HORN computers have a pipeline architecture. We use HORN-4 as an attached processor to enhance the performance of a general-purpose computer when it generates holograms using the "recurrence formulas" algorithm developed in our previous paper. In the HORN-4 system, we designed the pipeline around this recurrence algorithm, which computes the phase on a hologram. As a result, we could integrate a pipeline composed of 21 units into one PLD chip. The pipeline consists of one BPU (Basic Phase Unit) and twenty CUs (Cascade Units). The CU units can compute twenty light intensities on a hologram plane at one time. By mounting two of these PLD chips on a PCI (Peripheral Component Interconnect) universal board, HORN-4 can calculate holograms at a speed equivalent to about 42 Gflops. The cost of the HORN-4 board is about 1700 US dollars. With the HORN-4 system we could obtain an 800×600-point hologram from a 3D image composed of 415 points in about 0.45 s.
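A recurrence-based phase computation of this general kind can be sketched in a few lines: under the Fresnel approximation the phase contributed by one object point is a quadratic function of the pixel index along a scan line, so successive phase values follow from two additions per pixel instead of a square root per pixel. The code below is a plain-C++ illustration of that idea with made-up wavelength, pitch and object points; it is not HORN-4's pipeline, whose exact formulation and number format the abstract does not give.

```cpp
// Sketch of a recurrence-based hologram phase computation. Under the Fresnel
// approximation the phase contributed by one object point is quadratic in the
// pixel index along a scan line, so its second difference is constant and each
// pixel needs only two additions instead of a square root. This illustrates the
// general idea only; HORN-4's exact formulation and number format may differ.
#include <cstdio>
#include <vector>
#include <cmath>

int main() {
    const double PI     = 3.14159265358979323846;
    const double lambda = 633e-9;          // wavelength (assumption)
    const double pitch  = 10e-6;           // pixel pitch (assumption)
    const int    width  = 800;             // pixels on one scan line
    const double k = 2.0 * PI / lambda;

    struct Point { double x, z, amp; };
    std::vector<Point> object = { {1.0e-3, 0.10, 1.0}, {-0.5e-3, 0.15, 0.8} };  // toy points

    std::vector<double> field(width, 0.0);
    for (const auto& pt : object) {
        // Fresnel phase: phi(n) = k * (n*pitch - x)^2 / (2 z), quadratic in n.
        double x0  = -pt.x;                                                   // offset of pixel 0
        double phi = k * x0 * x0 / (2.0 * pt.z);                              // phi(0)
        double d   = k * (2.0 * x0 * pitch + pitch * pitch) / (2.0 * pt.z);   // phi(1) - phi(0)
        double dd  = k * pitch * pitch / pt.z;                                // constant 2nd difference
        for (int n = 0; n < width; ++n) {
            field[n] += pt.amp * std::cos(phi);                               // accumulate object wave
            phi += d;                                                         // two additions per pixel
            d   += dd;
        }
    }
    std::printf("field[0] = %g, field[400] = %g\n", field[0], field[400]);
    return 0;
}
```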
