1.
The serial and parallel performance of one of the world's fastest general-purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given. An earlier version of this paper was presented at Supercomputing '88. This work was performed under the auspices of the U.S. Department of Energy.
2.
In this paper we discuss code optimization techniques for implementing the Level 2 and 3 basic linear algebra subprograms on a single processor for the CRAY Y-MP and the CRAY-2. Our performance measurements show that the use of these techniques leads to a significant improvement in performance, and most subroutines achieve close to the peak performance of the machine for computations of relatively small sizes.
3.
Allen R. Hainline, Steven R. Thompson, Lawrence L. Halcomb 《The Journal of Supercomputing》1992,6(1):49-70
Optimization of vector-intensive applications for the CRAY X-MP/Y-MP often requires arranging the operations to take full advantage of such architectural features as the memory system, independent memory ports, chaining, and independent functional units. Estimation of performance is not straightforward since many operations can occur concurrently. As a tool for making trades between vector algorithms, a method has been developed and used successfully at E-Systems Inc. to predict the execution time of a sequence of vector operations without resorting to actual code development. This method reduced our software development time, produced significantly more efficient code, and provided for a systematic approach to optimization. The performance estimation is generally accurate to within 10% and accounts for memory conflicts that result from fixed stride references.
4.
David A. Carlson 《The Journal of Supercomputing》1991,4(4):345-356
One of the many interesting architectural features of the CRAY-2 supercomputer is that each processor has access to 16K 64-bit words of local memory. This is in addition to the extremely large, 268-million-word common memory that is accessible by all four processors. By using local memory judiciously, it is possible to achieve increased performance on the CRAY-2. This is partly because accesses to local memory can be done simultaneously with accesses to common memory and other operations. Also, it is slightly faster to start up a vector access to local memory, and a processor does not have to compete with other processors when accessing its local memory. In this paper, we present an algorithm for computing the fast Fourier transform that takes advantage of the CRAY-2's local memory. It operates by solving subproblems, which are themselves Fourier transforms, entirely within local memory. By doing so it achieves a performance increase of between 25 and 40 percent over an equivalent algorithm that uses only common memory, and for some input sizes is able to outperform the CRAY-2 library FFT.
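The idea in entry 4 of building one large transform out of smaller transforms can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's CRAY-2 algorithm: it uses the standard Cooley-Tukey factorization of a length n1*n2 transform, and the names `dft` and `fft_by_subproblems` are hypothetical. The small `dft` calls stand in for subproblems that would fit in a processor's local memory.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT; stands in for a small transform done in fast local memory."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_by_subproblems(x, n1, n2):
    """Length n1*n2 DFT computed from n2 sub-DFTs of length n1 and n1 sub-DFTs of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: one length-n1 sub-DFT per residue class j2 (elements x[j2], x[j2+n2], ...).
    cols = [dft(x[j2::n2]) for j2 in range(n2)]  # cols[j2][k1]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*j2*k1/n).
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Step 3: one length-n2 sub-DFT per k1; the result lands at index k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

Each sub-DFT touches only n1 or n2 elements at a time, which is the property that would let a subproblem be staged into a small, fast memory before being transformed.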
5.
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the CRAY X-MP and CRAY-2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. In addition to computing summary execution statistics from the traces, interesting execution dynamics appear when studying the trace histories. It is also possible to model application performance based on properties identified from traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation. An earlier version of this paper was presented at Supercomputing '90. Supported in part by the National Science Foundation under Grants No. NSF MIP-88-07775 and No. NSF ASC-84-04556, and the NASA Ames Research Center Grant No. NCC-2-559. Supported in part by the National Science Foundation under grant NSF ASC-84-04556. Supported in part by the National Science Foundation under grants NSF CCR-86-57696, NSF CCR-87-06653 and NSF CDA-87-22836 and by the National Aeronautics and Space Administration under NASA Contract Number NAG-1-613.
6.
The fast Fourier transform (FFT) is undoubtedly an essential primitive that has been applied in various fields of science and engineering. In this paper, we present a decomposition method for the parallelization of multi-dimensional FFTs with the smallest communication amounts for all ranges of the number of processes compared to previously proposed methods. This is achieved by two distinguishing features: adaptive decomposition and transpose order awareness. In the proposed method, the FFT data is decomposed based on a row-wise basis that maps the multi-dimensional data into one-dimensional data, and translates the corresponding coordinates from multi-dimensions into one dimension so that the one-dimensional data can be divided and allocated equally to the processes using a block distribution. As a result and different from previous works that have the dimensions of decomposition pre-defined, our method can adaptively decompose the FFT data on the lowest possible dimensions depending on the number of processes. In addition, this row-wise decomposition provides plenty of alternatives in data transpose, and different transpose order results in different amounts of communication. We identify the best transpose orders with the smallest communication amounts for the 3-D, 4-D, and 5-D FFTs by analyzing all possible cases. We also develop a general parallel software package for the most popular 3-D FFT based on our method using the 2-D domain decomposition. Numerical results show good performance and scaling properties of our implementation in comparison with other parallel packages. Given both communication efficiency and scalability, our method is promising in the development of highly efficient parallel packages for the FFT.
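The row-wise mapping and block distribution described in entry 6 might look roughly like the sketch below. The function names (`linear_index`, `block_range`, `owner`) are hypothetical, and the handling of a remainder when the data does not divide evenly among processes is an assumption; the abstract only states that the one-dimensional data is divided equally.

```python
def linear_index(coords, shape):
    """Map multi-dimensional coordinates to a row-major (row-wise) 1-D index."""
    idx = 0
    for c, s in zip(coords, shape):
        idx = idx * s + c
    return idx

def block_range(total, nprocs, rank):
    """Half-open range [start, stop) of 1-D indices owned by `rank`.

    Assumption: when total is not a multiple of nprocs, the first
    (total % nprocs) ranks each receive one extra element.
    """
    base, extra = divmod(total, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def owner(coords, shape, nprocs):
    """Rank that owns the element at `coords` under the block distribution."""
    total = 1
    for s in shape:
        total *= s
    idx = linear_index(coords, shape)
    for rank in range(nprocs):
        start, stop = block_range(total, nprocs, rank)
        if start <= idx < stop:
            return rank
    raise ValueError("coordinates outside the array")
```

Because ownership is decided on the flattened index, the same code covers 3-D, 4-D, or 5-D data, and a block boundary may fall inside any dimension depending on the number of processes.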
7.
Steven R. Thompson, Allen R. Hainline, Lawrence L. Halcomb 《The Journal of Supercomputing》1993,7(4):437-467
The conventional method of assessing supercomputer performance by measuring the execution time of software has many shortcomings. First, effort is required to write and debug the software. Second, time on the machine is required, and additional effort is needed to verify the validity of the test. Third, alterations to the algorithm require changing the code and retiming. Fourth, a black box approach to determining machine performance leaves the user with little confidence in how well the software was optimized. We present a pencil and paper methodology for computing the execution time of vectorized loops on a Cray Research X-MP/Y-MP. With this methodology a user can accurately compute the processing rate of an algorithm before the software is actually written. When several implementations of an algorithm are designed, this methodology can be used to select the best one for development, preventing wasted coding effort on less efficient implementations. Since this methodology computes optimal machine performance, it can be used to verify the efficiency of compiler translation. Changes to algorithms are easily appraised to determine their effect on performance. While the purpose of the methodology is to compute an algorithm's execution time, a side benefit is that this technique induces the user to think in terms of optimization. Bottlenecks in the code are pinpointed, and possible options for increased performance become obvious. At E-Systems, this methodology has become an integral part of the software development of vector-intensive code. This article is written specifically for Cray Research X-MP/Y-MP supercomputers, but many of the general concepts are applicable to other machines and therefore should benefit a number of supercomputer users.
8.
C. Froese Fischer, N. S. Scott, J. Yoo 《Parallel Computing》1988,8(1-3):385-390
At the first VAPP conference attention was drawn to the difficulty of calculating angular integrals on the CRAY-1. In this paper we describe how multitasking on the CRAY-2 and CRAY X-MP can be exploited to improve the efficiency of the calculation of angular integrals. Timings for the CRAY-2 and CRAY X-MP are presented. One surprising result is that for this application the CRAY X-MP is faster than the CRAY-2 in both unitasking and multitasking modes.
9.
The CRAY-2 is considered to be one of the most powerful supercomputers. Its state-of-the-art technology features a faster clock and more memory than any other supercomputer available today. In this report the single processor performance of the CRAY-2 is compared with the older, more mature CRAY X-MP. Benchmark results are included for both the slow and the fast memory DRAM MOS CRAY-2. Our comparison is based on a kernel benchmark set aimed at evaluating the performance of these two machines on some standard tasks in scientific computing. Particular emphasis is placed on evaluating the impact of the availability of large real memory on the CRAY-2 versus fast secondary memory on the CRAY X-MP with SSD. Our benchmark includes large linear equation solvers and FFT routines, which test the capabilities of the different approaches to providing large memory. We find that in spite of its higher processor speed the CRAY-2 does not perform as well as the CRAY X-MP on the Fortran kernel benchmark. We also find that for large-scale applications, which have regular and predictable memory access patterns, a high-speed secondary memory device such as the SSD can provide performance equal to the large real memory of the CRAY-2. The author is an employee of SCA Division of Boeing Computer Services.
10.
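Fast convolution of long signals and its application to speech processing (total citations: 1; self-citations: 0; citations by others: 1)
Based on an analysis of fast algorithms for the convolution of finite-length signals, and on the structural characteristics of long-sequence signals and the mathematical properties of convolution, an algorithm for fast convolution and correlation of long signals is proposed. The corresponding C program is given, and the algorithm is improved by incorporating the arithmetic Fourier transform. The algorithm is applied to practical speech processing, where it achieves good speed and reconstruction quality.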
11.
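Speed-sensorless techniques have advanced considerably over the past decade. This paper describes a novel speed-sensorless estimation method for AC variable-speed drive systems. The method does not require knowledge of the motor structure, electrical parameters, or load conditions, and it needs no additional tuning in the system. Building on transient spectral estimation, the new method uses the fast Fourier transform to obtain the rotor slot harmonic frequency, which is related to the rotor speed. On the MATLAB/SIMULINK platform, digital signal processing is used to extract the speed-related information contained in the slot harmonics produced by the air gap of the induction motor, and satisfactory estimates of the induction motor speed are obtained at different PWM inverter supply frequencies.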
12.
We present a new parallel radix-4 FFT algorithm based on the BSP model. Our parallel algorithm uses the group-cyclic distribution family, which makes it simple to understand and easy to implement. We show how to reduce the communication cost of the algorithm by a factor of 3, in the case that the input/output vector is in the cyclic distribution. We also show how to reduce computation time on computers with a cache-based architecture. We present performance results on a Cray T3E with up to 64 processors, obtaining reasonable efficiency levels for local problem sizes as small as 256 and very good efficiency levels for local sizes larger than 2048.
13.
Wayne Pfeiffer, Arnold Alagar, Anke Kamrath, Robert H. Leary, Jack Rogers 《The Journal of Supercomputing》1990,4(2):131-152
Various scientific codes were benchmarked on three vector computers: the CRAY X-MP/48 and CRAY-2 supercomputers and the SCS-40/XM minisupercomputer. On the X-MP, two Fortran compilers were also compared. The benchmarks, which were initially all in Fortran, consisted of six research codes from Caltech, the 24 Livermore loops, and two cases from the LINPACK benchmark. As a corollary effort, the effect of manual optimization on the Caltech codes was also considered, including the selected use of assembly-language math routines. On each machine the ratio of the maximum to the minimum speeds for the various benchmarks was more than a factor of 50, even though the study was restricted to unitasked (i.e., single CPU) runs. The maximum speed for all-Fortran codes was more than 80% of the peak speed on the X-MP and SCS, but less than 40% of the peak speed on the CRAY-2. Despite having a clock that is 2.3 times faster, the CRAY-2 generally runs slower than the X-MP, typically by a factor of 1.3 for scalar code and even slower for moderately vectorized code. Only for highly vectorized codes does the CRAY-2 marginally outperform the X-MP, at least for in-core benchmarks. The poorer performance of the CRAY-2 is due to its slower scalar speed, its lack of chaining, its single port between each CPU and memory, and its relatively slow memory. The SCS runs slower than the X-MP by a factor of 2.6 in the scalar limit and by a factor of 4.7 (the clock ratio) in the vector limit when the same CFT compiler is used on both machines. Use of the newer CFT77 compiler on the X-MP negates the relative enhancement of the SCS scalar performance. On the X-MP, the CFT77 3.0 compiler produces significantly faster code than CFT 1.14, typically by a factor of 1.4. This is obtained, however, at the expense of compilation times that are three to five times longer. Regardless of the compiler, manual optimization is still worthwhile. For three of the six Caltech codes compiled with CFT77, run time speedups of 2, 4, and 16 were achieved due to Fortran optimization only.
14.
A program for a direct solution of the Poisson equation in cylindrically symmetric geometry is described. It is based on the use of fast Fourier transforms for the axial solution, and an expansion in cubic B-splines for the radial solution.
15.
In recent years several approaches have been proposed to overcome the multiple-minima problem associated with nonlinear optimization techniques used in the analysis of molecular conformations. One such technique based on a parallel Monte Carlo search algorithm is analyzed. Experiments on the Intel iPSC/2 confirm that the attainable parallelism is limited by the underlying acceptance rate in the Monte Carlo search. It is proposed that optimal performance can be achieved in combination with vector processing. Tests on both the IBM 3090 and Intel iPSC/2-VX indicate that vector performance is related to molecule size and vector pipeline latency.
16.
The speed of three supercomputers, the CRAY-1M, CRAY-X/MP, and FUJITSU VP-200, has been measured several times. Published figures include technical numbers such as cycle time and start-up times, and speeds of basic arithmetic operations as a function of vector length, for kernel programs, and for a few special production programs. This article gives numbers and experience for a broader program set: real production programs from a heterogeneous workload, typical of a university computer center with technically oriented research problems. The intention was to measure speed relative to the CONTROL DATA CYBER 76, the main computer that the Regionales Rechenzentrum für Niedersachsen (RRZN) at the University of Hannover has operated for more than 10 years. Replacing this computer required benchmarking several new computers and supercomputers. Experience with migrating real programs to the supercomputers is reported, and the benchmark used is described. The measured speed factors are given for the three supercomputers compared with the CYBER 76. The wide range of the speed factors is very remarkable. Some general thoughts about benchmarking, the interpretation of the results for the benchmark used, and some special programs with their effects are discussed.
17.
A. S. Ilinski 《Computers & Mathematics with Applications》2000,40(12):1363-1373
The singular integral equation method is used for solving the problem of diffraction by an infinite nonhomogeneous cylinder. A numerical method is developed, and certain aspects of its practical application are considered. Some numerical results are described.
18.
19.
There are two ways, other than the standard fast Fourier transform (FFT) algorithm, of computing Fourier transforms of real data, namely, (1) the real fast Fourier transform (RFFT) algorithm, and (2) the fast Hartley transform (FHT) algorithm. On a sequential computer, it has been shown that both the RFFT and the FHT algorithms are faster than the FFT algorithm. However, it is not obvious that the same is true on a parallel machine. The communication requirements of the RFFT and the FHT algorithms, which are critical to the cost of any parallel implementation, are different from those of the FFT algorithm. In this paper we present efficient implementations of the RFFT and the FHT algorithms on a hypercube machine. Experimental results are given for the implementation of the RFFT and the FHT algorithms on the NCUBE machine.
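For readers unfamiliar with the Hartley transform mentioned in entry 19, the sketch below shows why it can stand in for a Fourier transform of real data. It is not the paper's hypercube implementation: it uses direct O(n^2) sums rather than fast algorithms, and the function names are hypothetical. It relies only on the standard identities H[k] = Re F[k] - Im F[k] and F[k] = (H[k] + H[n-k])/2 + i(H[n-k] - H[k])/2 for real input.

```python
import cmath
import math

def dft(x):
    """Direct DFT with the convention F[k] = sum_j x[j] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dht(x):
    """Direct discrete Hartley transform: H[k] = sum_j x[j] * (cos + sin)(2*pi*j*k/n)."""
    n = len(x)
    return [sum(x[j] * (math.cos(2 * math.pi * j * k / n)
                        + math.sin(2 * math.pi * j * k / n)) for j in range(n))
            for k in range(n)]

def dft_from_dht(h):
    """Recover the DFT of the real input from its Hartley transform."""
    n = len(h)
    return [complex((h[k] + h[-k % n]) / 2, (h[-k % n] - h[k]) / 2)
            for k in range(n)]
```

The Hartley transform works entirely in real arithmetic, which is the source of its appeal for real data; the final recombination pairs index k with index n-k, a data dependence that a parallel implementation has to account for.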
20.
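When processing geoscience data, removing the interference from the raw data while retaining the true information is a problem that urgently needs to be solved. This paper introduces the fast Fourier transform for processing geoscience data. Taking geochemical data as an example, the data are denoised through analysis of spectral density plots, with the aim of optimizing the data and improving the accuracy of information extraction.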
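A generic version of the frequency-domain denoising that entry 20 describes might be sketched as below. The abstract gives no details of the actual procedure, so everything here is an assumption: the cutoff rule (keep only the lowest `keep` frequency bins), the function names, and the use of a direct O(n^2) transform in place of a fast one.

```python
import cmath

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT, including the 1/n scaling)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def power_spectrum(samples):
    """Power per frequency bin, the quantity one would plot to choose a cutoff."""
    n = len(samples)
    return [abs(v) ** 2 / n for v in dft(samples)]

def lowpass_denoise(samples, keep):
    """Zero every bin whose frequency index exceeds `keep`, then transform back."""
    n = len(samples)
    spec = dft(samples)
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency for real data
        if freq > keep:
            spec[k] = 0
    return [v.real for v in dft(spec, inverse=True)]
```

In practice the cutoff would be chosen by inspecting the output of `power_spectrum` for the point where signal peaks give way to a flat noise floor.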