Similar Literature
A total of 20 similar documents were found (search time: 31 ms).
1.
At the first VAPP conference, attention was drawn to the difficulty of calculating angular integrals on the CRAY-1. In this paper we describe how multitasking on the CRAY-2 and CRAY X-MP can be exploited to improve the efficiency of the calculation of angular integrals. Timings for the CRAY-2 and CRAY X-MP are presented. One surprising result is that, for this application, the CRAY X-MP is faster than the CRAY-2 in both unitasking and multitasking modes.

2.
The availability of a multiprocessor vector machine such as the CRAY X-MP, along with large, fast secondary memory such as the CRAY SSD, opens new frontiers in numerical algorithm design for 3-D simulations. The 3-D seismic migration, which is of crucial importance in exploration seismology, is studied as a model problem. The numerical model discussed in this paper employs an alternating direction implicit (ADI) Crank-Nicolson scheme that takes full advantage of the parallel architecture of the underlying machine. It is demonstrated that careful algorithm design can lead to a significant speedup of the calculation when more than one processor is used. The throughput times obtained in this study are an order of magnitude faster than those of some conventional approaches.
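For orientation, the ADI Crank-Nicolson idea can be sketched in its generic Peaceman-Rachford form for an equation u_t = (A + B)u, where A and B are the discretized operators along two coordinate directions (a textbook sketch of the splitting, not the paper's specific migration operator):

```latex
% Generic Peaceman-Rachford ADI Crank-Nicolson splitting for u_t = (A + B)u
\begin{align*}
\Bigl(I - \tfrac{\Delta t}{2}A\Bigr)\,u^{\,n+1/2} &= \Bigl(I + \tfrac{\Delta t}{2}B\Bigr)\,u^{\,n},\\
\Bigl(I - \tfrac{\Delta t}{2}B\Bigr)\,u^{\,n+1}   &= \Bigl(I + \tfrac{\Delta t}{2}A\Bigr)\,u^{\,n+1/2}.
\end{align*}
```

Each half-step reduces to many independent tridiagonal solves along one coordinate direction, and it is this independence that maps naturally onto the multiple processors of the X-MP.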

3.
Various scientific codes were benchmarked on three vector computers: the CRAY X-MP/48 and CRAY-2 supercomputers and the SCS-40/XM minisupercomputer. On the X-MP, two Fortran compilers were also compared. The benchmarks, which were initially all in Fortran, consisted of six research codes from Caltech, the 24 Livermore loops, and two cases from the LINPACK benchmark. As a corollary effort, the effect of manual optimization on the Caltech codes was also considered, including the selective use of assembly-language math routines.

On each machine the ratio of the maximum to the minimum speeds for the various benchmarks was more than a factor of 50, even though the study was restricted to unitasked (i.e., single-CPU) runs. The maximum speed for all-Fortran codes was more than 80% of the peak speed on the X-MP and SCS, but less than 40% of the peak speed on the CRAY-2.

Despite having a clock that is 2.3 times faster, the CRAY-2 generally runs slower than the X-MP, typically by a factor of 1.3 for scalar code and even slower for moderately vectorized code. Only for highly vectorized codes does the CRAY-2 marginally outperform the X-MP, at least for in-core benchmarks. The poorer performance of the CRAY-2 is due to its slower scalar speed, its lack of chaining, its single port between each CPU and memory, and its relatively slow memory.

The SCS runs slower than the X-MP by a factor of 2.6 in the scalar limit and by a factor of 4.7 (the clock ratio) in the vector limit when the same CFT compiler is used on both machines. Use of the newer CFT77 compiler on the X-MP negates the relative enhancement of the SCS scalar performance.

On the X-MP, the CFT77 3.0 compiler produces significantly faster code than CFT 1.14, typically by a factor of 1.4. This is obtained, however, at the expense of compilation times that are three to five times longer. Regardless of the compiler, manual optimization is still worthwhile. For three of the six Caltech codes compiled with CFT77, run-time speedups of 2, 4, and 16 were achieved due to Fortran optimization only.

4.
The serial and parallel performance of one of the world's fastest general purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given.

An earlier version of this paper was presented at Supercomputing '88. This work was performed under the auspices of the U.S. Department of Energy.

5.
FFTs in external or hierarchical memory
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit-stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation.

Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops.

A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89.
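The two-pass property rests on the standard Cooley-Tukey splitting of an N-point DFT with N = N1*N2 into N2-point and N1-point subtransforms, separated by twiddle-factor multiplications and a transposition. In a notation chosen here for illustration (not the paper's):

```latex
% N = N1*N2,  n = n2*N1 + n1,  k = k1*N2 + k2,  omega_M = exp(-2*pi*i/M)
\begin{equation*}
X_{k_1 N_2 + k_2}
  \;=\; \sum_{n_1=0}^{N_1-1} \omega_{N_1}^{\,n_1 k_1}\,
        \omega_{N}^{\,n_1 k_2}
        \sum_{n_2=0}^{N_2-1} x_{n_2 N_1 + n_1}\,\omega_{N_2}^{\,n_2 k_2}.
\end{equation*}
```

One pass over the external data performs the N1 independent N2-point FFTs and applies the twiddle factors; after a transposition, a second pass performs the remaining N2 independent N1-point FFTs, which is how the m-pass requirement of conventional algorithms is avoided.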

6.
In this paper we discuss code optimization techniques for implementing the Level 2 and Level 3 Basic Linear Algebra Subprograms (BLAS) on a single processor of the CRAY Y-MP and the CRAY-2. Our performance measurements show that the use of these techniques leads to a significant improvement in performance, and most subroutines achieve close to the peak performance of the machine even for computations of relatively small size.
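The central idea behind most Level 3 optimizations of this kind is blocking: operating on small sub-matrices that fit in the fastest level of storage so that each operand loaded from memory is reused many times. A minimal, machine-independent C sketch of the blocking pattern (not the Cray-specific routines from the paper; `BS` is a hypothetical tuning parameter) is shown below.

```c
#include <stddef.h>

/* Generic blocked matrix multiply, C += A*B, illustrating the kind of
 * tiling used to keep small sub-blocks in fast storage (vector registers
 * or cache) so that each element loaded from memory is reused many times.
 * Hypothetical helper, not the routine from the paper; matrices are
 * stored row-major with leading dimension n. */
#define BS 64

void gemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS tile of A by one tile of B */
                for (size_t i = ii; i < ii + BS && i < n; i++)
                    for (size_t k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];   /* reused across the j loop */
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

On a vector machine the innermost j loop runs at unit stride over long vectors while the tiling bounds the working set, which is the same trade-off the Level 3 BLAS kernels exploit.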

7.
Recently, a number of advanced architecture machines have become commercially available. These new machines promise better cost performance than traditional computers, and some of them have the potential of competing with current supercomputers, such as the CRAY X-MP, in terms of maximum performance. This paper describes the methodology and results of a pilot study of the performance of a broad range of advanced architecture computers using a number of complete scientific application programs. The computers evaluated include:
  1. shared-memory bus-architecture machines such as the Alliant FX/8, the Encore Multimax, and the Sequent Balance and Symmetry;
  2. shared-memory network-connected machines such as the Butterfly;
  3. distributed-memory machines such as the NCUBE, Intel, and Jet Propulsion Laboratory (JPL)/Caltech hypercubes;
  4. very long instruction word machines such as the Cydrome Cydra-5;
  5. SIMD machines such as the Connection Machine;
  6. ‘traditional’ supercomputers such as the CRAY X-MP, CRAY-2, and SCS-40.
Seven application codes from a number of scientific disciplines have been used in the study, although not all the codes were run on every machine. The methodology and guidelines for establishing a standard set of benchmark programs for advanced architecture computers are discussed. The CRAYs offered the best performance on the benchmark suite; the shared-memory multiprocessor machines generally permitted some parallelism and, when coupled with substantial floating-point capabilities (as in the Alliant FX/8 and Sequent Symmetry), provided an order of magnitude less speed than the CRAYs. Likewise, the early-generation hypercubes studied here generally ran slower than the CRAYs but permitted substantial parallelism from each of the application codes.

8.
One of the many interesting architectural features of the CRAY-2 supercomputer is that each processor has access to 16K 64-bit words of local memory. This is in addition to the extremely large, 268-million-word common memory that is accessible by all four processors. By using local memory judiciously, it is possible to achieve increased performance on the CRAY-2. This is partly because accesses to local memory can be done simultaneously with accesses to common memory and other operations. Also, it is slightly faster to start up a vector access to local memory, and a processor does not have to compete with other processors when accessing its local memory. In this paper, we present an algorithm for computing the fast Fourier transform that takes advantage of the CRAY-2's local memory. It operates by solving subproblems, which are themselves Fourier transforms, entirely within local memory. By doing so it achieves a performance increase of between 25 and 40 percent over an equivalent algorithm that uses only common memory, and for some input sizes it is able to outperform the CRAY-2 library FFT.

9.
In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum-length vectors throughout all stages of small- to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5% to over 300%.
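Technique (1) can be illustrated schematically: instead of letting later FFT stages operate on short or strided vectors, the operands of each stage are gathered into a contiguous work array so that the butterfly arithmetic always sees long, unit-stride vectors. The sketch below uses hypothetical helper names and C99 complex types; it is not the Fortran implementation measured in the paper.

```c
#include <complex.h>
#include <stddef.h>

/* Gather the elements a given FFT stage needs (index[] encodes the stage's
 * butterfly pattern) into a contiguous buffer, and scatter results back. */
void gather(double complex *restrict dst,
            const double complex *restrict src,
            const size_t *restrict index, size_t n)
{
    for (size_t i = 0; i < n; i++)      /* vectorizable gather */
        dst[i] = src[index[i]];
}

void scatter(double complex *restrict dst,
             const double complex *restrict src,
             const size_t *restrict index, size_t n)
{
    for (size_t i = 0; i < n; i++)      /* vectorizable scatter */
        dst[index[i]] = src[i];
}
```

On a machine with hardware gather/scatter, both loops vectorize fully, so the cost of the data movement is hidden while the arithmetic stages run at full vector length.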

10.
Strassen's algorithm for fast matrix-matrix multiplication has been implemented for matrices of arbitrary shapes on the CRAY-2 and CRAY Y-MP supercomputers. Several techniques have been used to reduce the scratch space requirement for this algorithm while simultaneously preserving a high level of performance. When the resulting Strassen-based matrix multiply routine is combined with some routines from the new LAPACK library, LU decomposition can be performed at rates significantly higher than those achieved by conventional means. We succeeded in factoring a 2048 × 2048 matrix on the CRAY Y-MP at a rate equivalent to 325 MFLOPS.

This work is supported through NASA Contract NAS 2-12961.
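For reference, the classical Strassen recursion on a 2x2 block partitioning trades the eight block multiplications of the conventional algorithm for seven, at the cost of extra block additions and the temporaries M1 through M7 (which is where the scratch-space question arises):

```latex
% Strassen's recursion for C = AB with A, B partitioned into 2x2 blocks
\begin{align*}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})B_{11},\\
M_3 &= A_{11}(B_{12}-B_{22}),          & M_4 &= A_{22}(B_{21}-B_{11}),\\
M_5 &= (A_{11}+A_{12})B_{22},          & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & &\\
C_{11} &= M_1+M_4-M_5+M_7,             & C_{12} &= M_3+M_5,\\
C_{21} &= M_2+M_4,                     & C_{22} &= M_1-M_2+M_3+M_6.
\end{align*}
```

Applied recursively, this yields an asymptotic operation count of O(n^{log2 7}) ≈ O(n^{2.81}) instead of O(n^3).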

11.
One of the prime considerations for high scalar performance in supercomputers is low memory latency. With the increasing disparity between main memory and CPU clock speeds, the use of an intermediate memory in the hierarchy becomes necessary. In this paper, we present an intermediate memory structure called a programmable cache. A programmable cache exploits structural locality to decrease the average memory access time. We evaluate the concept of a programmable cache by using the vector registers in the CRAY X-MP and Y-MP supercomputers as a programmable cache. Our results indicate that a programmable cache can be used profitably to reduce the memory latency if the pattern of references to a data structure can be determined at compile time.

The work of the first author was supported in part by NSF Grant CCR-8706722.

12.
The position of the CRAY-1S within the computing facilities provided by the SERC is presented, and its relationship to the front-end machine at DL and to the SERC network is discussed. The physical characteristics of a CRAY computer are then described, followed by a list of the design features that contribute to the high speed of these machines.

The second area discussed is the type of scientific work currently being carried out on the CRAY. The general principles involved in selecting research work suitable for implementation on the CRAY are presented, together with a list of all the different areas of scientific calculation currently pursued. The performance of the CRAY relative to other computer systems is also given for several of the applications.

The accessibility of the CRAY-1S is described, together with the facilities available for job input/output and dataset transfer. The available program packages are listed and their range of applications described. The nature of the algorithms used for job scheduling is presented, together with some statistics on the characteristics of jobs run on the machine.

13.
As information processing applications take greater roles in our everyday life, database management systems (DBMSs) are growing in importance. DBMSs have traditionally exhibited poor cache performance and large memory footprints, therefore performing at only a fraction of their ideal execution speed and exhibiting low processor utilization. Previous research has studied the memory system of DBMSs on research-oriented simultaneous multithreading (SMT) processors. Recently, several differences have been noted between the real hyper-threaded architecture implemented by the Intel Pentium 4 and the earlier SMT research architectures. This paper characterizes the performance of a prototype open-source DBMS running TPC-equivalent benchmark queries on an Intel Pentium 4 Hyper-Threading processor. We use the hardware counters provided by the Pentium 4 to evaluate the micro-architecture and study the memory system behavior of each query running on the DBMS. Our results show a performance improvement of up to 1.16x on TPC-C-equivalent queries and 1.26x on TPC-H-equivalent queries due to hyper-threading.

14.
The bottleneck of a disk-based database system lies mainly in disk I/O; the usual design therefore employs a buffer pool, placing data pages read from disk into an in-memory buffer pool before they are operated on. The size of the buffer pool thus directly determines database performance. Based on a study of the characteristics of flash-based solid state drives, this paper proposes an auxiliary buffer pool design built on flash SSDs. Finally, by modifying the InnoDB storage engine of the open-source database MySQL and running TPC-C experiments, it is shown that enabling the auxiliary buffer pool improves database performance by 100%-320%.
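One plausible shape for such an auxiliary pool is to interpose the SSD between the in-memory buffer pool and the disk on the read path, so that a page missed in RAM may still be served from flash instead of the much slower disk. The C sketch below uses hypothetical placeholder functions (declarations only) and is not the modified InnoDB code from the paper; actual policies for when pages enter the SSD pool (for example, on eviction from the RAM pool) vary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a page-fetch path with a flash-SSD auxiliary buffer pool
 * between the in-memory buffer pool and the disk.  All functions are
 * hypothetical placeholders (declarations only), not InnoDB APIs. */
typedef uint64_t page_id_t;
typedef struct page page_t;

page_t *ram_pool_lookup(page_id_t id);            /* main buffer pool (RAM)  */
page_t *ram_pool_insert(page_id_t id, page_t *p); /* may evict a victim page */
bool    ssd_pool_lookup(page_id_t id, page_t *out);
void    ssd_pool_insert(page_id_t id, const page_t *p);
void    disk_read(page_id_t id, page_t *out);

page_t *fetch_page(page_id_t id, page_t *scratch)
{
    page_t *p = ram_pool_lookup(id);
    if (p)                              /* hit in the RAM buffer pool         */
        return p;

    if (ssd_pool_lookup(id, scratch))   /* hit in the SSD auxiliary pool      */
        return ram_pool_insert(id, scratch);

    disk_read(id, scratch);             /* miss everywhere: fall back to disk */
    ssd_pool_insert(id, scratch);       /* stage the page on the SSD as well  */
    return ram_pool_insert(id, scratch);
}
```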

15.
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the CRAY X-MP and CRAY-2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and from other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. In addition to computing summary execution statistics from the traces, interesting execution dynamics appear when studying the trace histories. It is also possible to model application performance based on properties identified from traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation.

An earlier version of this paper was presented at Supercomputing '90. Supported in part by the National Science Foundation under Grants No. NSF MIP-88-07775 and No. NSF ASC-84-04556, and by the NASA Ames Research Center under Grant No. NCC-2-559. Supported in part by the National Science Foundation under grant NSF ASC-84-04556. Supported in part by the National Science Foundation under grants NSF CCR-86-57696, NSF CCR-87-06653 and NSF CDA-87-22836 and by the National Aeronautics and Space Administration under NASA Contract Number NAG-1-613.

16.
In this paper a new algorithm for the computation of the eigenvalues of real symmetric matrices is presented. The algorithm may be expressed in terms of a collection of communicating processes and is suitable for implementation as a dedicated engine constructed from a network of transputers. However, it can also be implemented efficiently on a multiprocessor supercomputer such as the CRAY X-MP or on a set of interconnected SIMD machines.

17.
This paper presents an approach for the parallel computation of structural optimization problems on the CRAY X-MP using parallel sensitivity analysis calculations. In this approach, a main processor is chosen to perform all the optimization calculations except the constraint gradient evaluations. When a sensitivity analysis is needed, the main processor decomposes it into several computation tasks, assigns these tasks to the other available associate processors, and manages the communication. Owing to the uncoupled nature of the constraint gradient calculations, the associate processors perform the computation tasks in parallel. The algorithm for the structural optimization process with parallel design sensitivity analysis is presented, along with some numerical test cases that demonstrate the efficiency of this approach.
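A minimal sketch of the idea, using OpenMP and forward finite differences purely for illustration (the paper's actual sensitivity formulation and task-assignment code are not reproduced here): because the gradient of each constraint is independent of the others, the m gradient evaluations can simply be distributed across the available processors while the optimizer itself runs on a single "main" thread of control.

```c
#include <omp.h>

/* Parallel evaluation of the m constraint gradients by forward finite
 * differences.  g(j, x) returns the value of the j-th constraint at design
 * point x (an illustrative, hypothetical signature); grad is m x n,
 * row-major; h is the finite-difference step. */
void constraint_gradients(int m, int n, const double *x, double h,
                          double (*g)(int j, const double *x),
                          double *grad)
{
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < m; j++) {
        double xp[n];                      /* private perturbed design (C99 VLA) */
        for (int i = 0; i < n; i++) xp[i] = x[i];
        double gj = g(j, x);               /* baseline constraint value          */
        for (int i = 0; i < n; i++) {
            xp[i] = x[i] + h;
            grad[j * n + i] = (g(j, xp) - gj) / h;
            xp[i] = x[i];
        }
    }
}
```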

18.
The traditional hard disk drive (HDD) is often a bottleneck in the overall performance of modern computer systems. With the development of solid state drives (SSDs) based on flash memory, new possibilities are available to improve secondary storage performance. In this work, we propose a new hybrid SSD–HDD storage system and a selection of algorithms designed to assign pages across an HDD and an SSD to optimise I/O performance. The hybrid system combines the advantages of the SSD’s fast random seek speed with the sequential access speed and large storage capacity of the HDD to produce significantly improved performance in a variety of situations. We further improve performance by allowing concurrent access across the two types of storage devices. We show that the drive assignment problem is NP-complete and accordingly propose effective heuristic solutions. Extensive experiments using both synthetic and real data sets show that our system with a small SSD can outperform a striped dual-HDD configuration and remain competitive with a dual-SSD configuration.
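As a flavor of what a placement heuristic can look like, the greedy sketch below ranks pages by how often they are accessed randomly (where the SSD's fast seeks pay off most) and fills the SSD with the highest-ranked pages, leaving the rest on the HDD. This is a generic illustration, not one of the specific heuristics proposed in the paper.

```c
#include <stdlib.h>
#include <stdint.h>

/* Greedy page placement for a hybrid SSD-HDD store: pages with many random
 * accesses go to the SSD until it is full; everything else stays on the HDD. */
enum device { ON_HDD = 0, ON_SSD = 1 };

struct page_stats {
    uint64_t page_id;
    uint64_t random_reads;   /* accesses that were not part of a sequential run */
    enum device placement;
};

static int by_random_reads_desc(const void *a, const void *b)
{
    const struct page_stats *pa = a, *pb = b;
    if (pa->random_reads == pb->random_reads) return 0;
    return pa->random_reads < pb->random_reads ? 1 : -1;
}

void assign_pages(struct page_stats *pages, size_t npages, size_t ssd_capacity_pages)
{
    qsort(pages, npages, sizeof pages[0], by_random_reads_desc);
    for (size_t i = 0; i < npages; i++)
        pages[i].placement = (i < ssd_capacity_pages) ? ON_SSD : ON_HDD;
}
```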

19.
The Transaction Processing Performance Council benchmark C (TPC-C) is the de facto standard for evaluating the performance of high-end computers running on-line transaction processing applications. Unlike other standard benchmarks, the Transaction Processing Performance Council only defines specifications for the TPC-C benchmark and does not provide any standard implementation for end-users. Due to the complexity of the TPC-C workload, obtaining optimal performance for a TPC-C evaluation on a large-scale high-end computer is a challenging task. In this paper, we designed and implemented a large-scale TPC-C evaluation system based on the latest TPC-C specification using solid-state drive (SSD) storage devices. By analyzing the characteristics of the TPC-C workload, we propose a series of system-level optimization methods to improve TPC-C performance. First, we propose an approach based on SmallFile table spaces that organizes the test data in a round-robin fashion across all of the disk array partitions; this makes full use of the underlying disk arrays. Second, we propose using a NOOP-based disk scheduling algorithm to reduce processor utilization and improve the average input/output service time. Third, to improve the system translation lookaside buffer hit rate and reduce processor overhead, we take advantage of the huge page technique to manage a large amount of memory resources. Lastly, we propose a locality-aware interrupt mapping strategy based on the asymmetry characteristic of non-uniform memory access systems to improve system performance. Using these optimization methods, we performed the TPC-C test on two large-scale high-end computers using SSD arrays. The experimental results show that our methods can effectively improve TPC-C performance. For example, the performance of the TPC-C test on an Intel Westmere server reached 1.018 million transactions per minute.

20.
Optimization of vector-intensive applications for the CRAY X-MP/Y-MP often requires arranging the operations to take full advantage of such architectural features as the memory system, independent memory ports, chaining, and independent functional units. Estimating performance is not straightforward, since many operations can occur concurrently. As a tool for making trade-offs between vector algorithms, a method has been developed and used successfully at E-Systems Inc. to predict the execution time of a sequence of vector operations without resorting to actual code development. This method reduced our software development time, produced significantly more efficient code, and provided a systematic approach to optimization. The performance estimate is generally accurate to within 10% and accounts for memory conflicts that result from fixed-stride references.

