期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Lattice quantum hadrodynamics on a CRAY Y-MP

Joachim Frank Siegfried Knecht 《The Journal of supercomputing》1992,6(3-4):195-209

Quantum corrections to the mean-field equation of state for nuclear matter are estimated in a lattice simulation of quantum hadrodynamics on a CRAY Y-MP. In contrast with lattice quantum chromodynamics, where coordinate space methods are the standard, the calculations are carried out in momentum space and on nonhypercubic (irregular) lattices. The quantum corrections to the known, mean-field equation of state were found to be considerable. The time frame of the project and the large computational needs of the program required the use of powerful supercomputers, like the CRAY Y-MP, which are capable of performing at a very high computing speed by using both vector and parallel hardware, the latter being exploited by means of autotasking. The paper describes the applied analytical and the numerical methods as well as the changes needed for the program to be executed in parallel. After some code modifications a very efficient version could be obtained on a CRAY Y-MP8/832, leading to an overall performance of 2.13 gigaflops. 相似文献

2.

Solving Linear Recurrences with Loop Raking

《Journal of Parallel and Distributed Computing》1995,25(1):91-97

We present a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On four processors, we gain an additional speedup of at least 3.7. 相似文献

3.

Performance of parallel Cholesky factorization algorithms using BLAS

Glenn R. Luecke Jae Heon Yun Philip W. Smith 《The Journal of supercomputing》1992,6(3-4):315-329

This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which call parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%;, depending on the problem size and the number of processors used.This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990. 相似文献

4.

The arithmetic mean method for solving essentially positive systems on a vector computer ∗

《国际计算机数学杂志》2012,89(1-2):113-121

This paper is concerned with a generalization of the arithmetic mean method developed in [3] for solving large “essentially positive” dynamic systems which are asymptotically stable. This method is well suitable for parallel implementation on a multiprocessor system that can execute concurrently different tasks on a few vector processors with shared central memory, such as the CRAY X-MP. A high-level parallelism among independent tasks is obtained using the Cray multitasking. The consistency and the stability of the method are analysed. 相似文献

5.

Ultrahigh-performance FFTs for the CRAY-2 and CRAY Y-MP supercomputers

David A. Carlson 《The Journal of supercomputing》1992,6(2):107-116

In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small-to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5 to over 300%. 相似文献

6.

Implementation of a vector quantization codebook design technique based on a competitive learning artificial neural network

Stanley C. Ahalt Prakoon Chen Cheng-Taou Chou Tzyy-Ping Jung 《The Journal of supercomputing》1992,5(4):307-330

We describe an implementation of a vector quantization codebook design algorithm based on the frequencysensitive competitive learning artificial neural network. The implementation, designed for use on high-performance computers, employs both multitasking and vectorization techniques. A C version of the algorithm tested on a CRAY Y-MP8/864 is discussed. We show how the implementation can be used to perform vector quantization, and demonstrate its use in compressing digital video image data. Two images are used, with various size codebooks, to test the performance of the implementation. The results show that the supercomputer techniques employed have significantly decreased the total execution time without affecting vector quantization performance.This work was supported by a Cray University Research Award and by NASA Lewis research grant number NAG3-1164. 相似文献

7.

The use of intermediate memories for low-latency memory access in supercomputer scalar units

Gurindar S. Sohi Wei-Chung Hsu 《The Journal of supercomputing》1990,4(1):5-21

One of the prime considerations for high scalar performance in supercomputers is a low memory latency. With the increasing disparity between main memory and CPU clock speeds, the use of an intermediate memory in the hierarchy becomes necessary. In this paper, we present an intermediate memory structure called a programmable cache. A programmable cache exploits structural locality to decrease the average memory access time. We evaluate the concept of a programmable cache by using the vector registers in the CRAY X-MP and Y-MP supercomputers as a programmable cache. Our results indicate that a programmable cache can be used profitably to reduce the memory latency if the pattern of references to a data structure can be determined at compile time.The work of the first author was supported in part by NSF Grant CCR-8706722. 相似文献

8.

Short Data Parallel Vector Slant Transform

《Journal of Parallel and Distributed Computing》1994,23(1):27-36

In this paper, a comparative study of one- and two-dimensional Slant transform algorithms using vector and parallel computers is presented. A new factorization for the Slant matrix is discussed. General methods of vectorizing short data vectors are included. These methods are suitable for applications where long data may not be available to take advantage of a vector computer. Microtasking with four processors is implemented to improve the speed performance of two-dimensional transforms. Simulation results on the Cray Y-MP and Cray X-MP using single processors and multiprocessors are also included. 相似文献

9.

An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

Jayasimha D. N. Hayder M. E. Pillay S. K. 《The Journal of supercomputing》1997,11(1):41-60

We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time-accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared-memory multiprocessor (the CRAY Y-MP), and distributed-memory multiprocessors with different topologies (the IBM SP and the CRAY T3D). We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message-passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms. This revised version was published online in June 2006 with corrections to the Cover Date. 相似文献

10.

Optimizing INTBIS on the CRAY Y-MP

Chenyi Hu Joe Sheldon R. Baker Kearfott Qing Yang 《Reliable Computing》1995,1(3):265-274

INTBIS is a well-tested software package which uses an interval Newton/generalized bisection method to find all numerical solutions to nonlinear systems of equations. Since INTBIS uses interval computations, its results are guaranteed to contain all solutions. To efficiently solve very large nonlinear systems on a parallel vector computer, it is necessary to effectively utilize the architectural features of the machine In this paper, we report our implementations of INTBIS for large nonlinear systems on the Cray Y-MP supercomputer. We first present the direct implementation of INTBIS on a Cray. Then, we report our work on optimizing INTBIS on the Cray Y-MP 相似文献

11.

Relationship between average and real memory behavior

Kay A. Robbins Steven Robbins 《The Journal of supercomputing》1994,8(3):209-232

The CRAY Y-MP has a nonintrusive hardware performance monitor that accurately accumulates certain data about program performance. This paper examines the relationship between the averages obtained from the hardware performance monitor and actual memory behavior of the Perfect Club Benchmarks run on a single processor of an eight-processor CRAY Y-MP. I/O and instruction buffer fetches are not considered. The vectorized programs show regular behavior characterized by dominant vector lengths and interburst times. The distribution of vector lengths is not well-predicted by hardware performance monitor averages. Scalar programs also exhibit some clumping of memory references but have less temporal regularity than the vectorized programs. While overall port utilization is surprisingly low, there is considerable cyclic variation, and all of the ports tend to experience their maximal loading at the same time. A simple probabilistic model is developed to allow estimation of port utilitzation from hardware performance monitor data. The results can be used as a guide for generating more realistic synthetic memory workloads and port utilization estimates for shared-memory machines. 相似文献

12.

Vector performance estimation for CRAY X-MP/Y-MP supercomputers

Allen R. Hainline Steven R. Thompson Lawrence L. Halcomb 《The Journal of supercomputing》1992,6(1):49-70

Optimization of vector-intensive applications for the CRAY X-MP/Y-MP often requires arranging the operations to take full advantage of such architectural features as the memory system, independent memory ports, chaining, and independent functional units. Estimation of performance is not straightforward since many operations can occur concurrently. As a tool for making trades between vector algorithms, a method has been developed and used successfully at E-Systems Inc. to predict the execution time of a sequence of vector operations without resorting to actual code development. This method reduced our software development time, produced significantly more efficient code, and provided for a systematic approach to optimization. The performance estimation is generally accurate to within 10% and accounts for memory conflicts that result from fixed stride references. 相似文献

13.

Use of Supercomputers in Quantitative Electron Diffraction

C. Birkeland R. Holmestad K. Marthinsen R. Høier 《Journal of scientific computing》1998,13(1):1-18

The method of Quantitative Convergent Beam Electron Diffraction (QCBED) is used to study bonding effects in crystals. Because the accurate determination of electron charge densities requires extremely time consuming computations, the use of supercomputers is often necessary. In this article, we describe how the QCBED algorithm was modified to run on a parallel computer, an Intel Paragon. The implementation is based on the master-slave process and is explained in detail. For comparison, the program is also run on a state-of-the-art vector computer, a CRAY Y-MP. The performances of the two supercomputers are compared and results from a workstation are given as a reference. The parallel implementation is found successful. It demonstrates that parallelization of serial algorithms can be efficient when increasing accuracy and better performance are required. 相似文献

14.

Processor preallocation and load balancing of DOALL loops

Gary W. Elsesser Viet N. Ngo Sourav Bhattacharya Wei -Tek Tsai 《The Journal of supercomputing》1994,8(2):135-161

Load balance is important because it may affect the speedup attained through the concurrent execution of loop iterations on a parallel processor. We study loop load balance in the context of the well-known Perfect benchmarks. Several static and dynamic characteristics of the Perfect benchmark DOALL loops are observed and interpreted. Thelate arrival of processors is noted as a major source of load imbalance. This observation suggested the idea ofprocessor preallocation. An analytic cost model is presented and the advantages of processor preallocation are demonstrated by experimental evaluation on a CRAY Y-MP8 under the Unicos operating system. 相似文献

15.

PARALLEL INCREMENTAL ALGORITHMS FOR ANALYZING ACTIVITY NETWORKS

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(2):153-165

This paper presents parallel incremental algorithms for analyzing activity networks. The start-over algorithm used for this problem is a modified version of an algorithm due to Chaudhuri and Ghosh (BIT 26 (1986), 418-429). The computational model used is a shared memory single-instruction stream, multiple-data stream computer that allows both read and write conflicts. It is shown that the incremental algorithms for the event and activity insertion problems both require only O(loglogn) parallel time, in contrast to O(logn log logn) parallel time for the corresponding start-over algorithm. 相似文献

16.

A comparison of autotasking and macrotasking in a quantum chemical application program on a cray Y-MP computer

《Computers & chemistry》1993,17(4):379-381

The results of some experiments in parallel processing using either autotasking or macrotasking on a dedicated CRAY Y-MP/4 computer under the UNICOS (version 7.0.2) operating system are reported. Two sets of calculations were carried out using the quantum chemical application program ccMBPT (concurrent computation Many-Body Perturbation Theory) for the study of electron correlation effects in atoms and molecules. In the first set, multiprocessing was achieved by exploiting the autotasking features of the compiler whilst, in the second set, an explicitly macrotasked code was employed. The performance of the system was mentioned as a function of the size of the problem, i.e. the number of virtual orbitals. It is demonstrated that the explicitly macrotasked code leads to uniformly better performance than the autotasked code. 相似文献

17.

Implementation of a Portable Nested Data-Parallel Language

《Journal of Parallel and Distributed Computing》1994,21(1):4-14

相似文献

18.

Adapting a Navier-Stokes solver for three parallel machines

R. A. Fatoohi 《The Journal of supercomputing》1994,8(2):91-115

This paper presents the results of parallelizing a three-dimensional Navier-Stokes solver on a 32K-processor Thinking Machines CM-2, a 128-node Intel iPSC/860, and an 8-processor CRAY Y-MP. The main objective of this work is to study the performance of the flow solver, INS3D-LU code, on two distributed-memory machines, a massively parallel SIMD machine (CM-2) and a moderately parallel MIMD machine (iPSC/860), and compare it with its performance on a shared-memory MIMD machine with a small number of processors (Y-MP). The code is based on a Lower-Upper Symmetric-Gauss-Seidel implicit scheme for the pseudocompressibility formulation of the three-dimensional incompressible Navier-Stokes equations. The code was rewritten in CMFORTRAN with shift operations and run on the CM-2 using the slicewise model. The code was also rewritten with distributed data and Intel message-passing calls and run on the iPSC/860. The timing results for two grid sizes are presented and analyzed using both 32-bit and 64-bit arithmetic. Also, the impact of communication and load balancing on the performance of the code is outlined. The results show that reasonable performance can be achieved on these parallel machines. However, the CRAY Y-MP outperforms the CM-2 and iPSC/860 for this particular algorithm.The author is an employee of Computer Sciences Corporation. This work was funded through NASA Contract NAS 2-12961. 相似文献

19.

Radiative Heat Transfer Simulation on a SPARCStation Farm

Ronald G. Minnich Daniel V. Pryor 《Concurrency and Computation》1993,5(4):345-357

At the Supcrcomputing Research Center we have built a computing farm consisting of 16 SPARCStation. ELCs. The ELCs all support the Mother distributed shared memory, which has primitives to support efficient synchronization and use of the network and processors. Mother docs not support the traditional consistency semantics provided by, for example, Ivy or Mach external pagers. The first parallel application we ran on the farm was a Monte Carlo radiative heat transfer simulation. The performance we achieved on the farm was within an order of magnitude of the performance we would expect to achieve on a 16-processor model of the C90 supercomputer available from Cray Research. With this application we found that the use of the Mother distributed shared memory allowed us to run the same code on the Cray as we ran on the SPARCStatlons, and we did not require the complex cache-coherent memory semantics provided by, say, Ivy or Mach to run this application effectively. 相似文献

20.

HASPRNG: Hardware Accelerated Scalable Parallel Random Number Generators

JunKyu Lee Yu Bi Gregory D. Peterson Robert J. Hinde Robert J. Harrison 《Computer Physics Communications》2009,180(12):2574-2581

The Scalable Parallel Random Number Generators library (SPRNG) supports fast and scalable random number generation with good statistical properties for parallel computational science applications. In order to accelerate SPRNG in high performance reconfigurable computing systems, we present the Hardware Accelerated SPRNG library (HASPRNG). Ported to the Xilinx University Program (XUP) and Cray XD1 reconfigurable computing platforms, HASPRNG includes the reconfigurable logic for Field Programmable Gate Arrays (FPGAs) along with a programming interface which performs integer random number generation that produces identical results with SPRNG. This paper describes the reconfigurable logic of HASPRNG exploiting the mathematical properties and data parallelism residing in the SPRNG algorithms to produce high performance and also describes how to use the programming interface to minimize the communication overhead between FPGAs and microprocessors. The programming interface allows a user to be able to use HASPRNG the same way as SPRNG 2.0 on platforms such as the Cray XD1. We also describe how to install HASPRNG and use it. For HASPRNG usage we discuss a FPGA π-estimator for a High Performance Reconfigurable Computer (HPRC) sample application and compare to a software π-estimator. HASPRNG shows 1.7x speedup over SPRNG on the Cray XD1 and is able to obtain substantial speedup for a HPRC application.

Program summary

Program title: HASPRNGCatalogue identifier: AEER_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEER_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.htmlNo. of lines in distributed program, including test data, etc.: 594 928No. of bytes in distributed program, including test data, etc.: 6 509 724Distribution format: tar.gzProgramming language: VHDL (XUP and Cray XD1), C++ (XUP), C (Cray XD1)Computer: PowerPC 405 (XUP) / AMD 2.2 GHz Opteron processor (Cray XD1)Operating system: LinuxFile size: 15 MB (XUP) / 22 MB (Cray XD1)Classification: 4.13Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations such as π-estimation are able to consume limitless random numbers forthe computation as long as hardware resources for the computing are supported. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The library presented here accelerates the generators of independent streams of random numbers.Solution method: Multiple copies of random number generators in FPGAs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. HASPRNG is a random number generators library to allow a computational science application to employ the multiple copies of random number generators to boost performance. Users can interface HASPRNG with software code executing on microprocessors and/or with hardware applications executing on FPGAs. 相似文献