Similar Literature (20 results)
1.
The CRAY-2 is considered one of the most powerful supercomputers. Its state-of-the-art technology features a faster clock and more memory than any other supercomputer available today. In this report the single-processor performance of the CRAY-2 is compared with that of the older, more mature CRAY X-MP. Benchmark results are included for both the slower DRAM and the faster MOS memory versions of the CRAY-2. Our comparison is based on a kernel benchmark set aimed at evaluating the performance of these two machines on some standard tasks in scientific computing. Particular emphasis is placed on evaluating the impact of the large real memory of the CRAY-2 versus the fast secondary memory of the CRAY X-MP with SSD. Our benchmark includes large linear equation solvers and FFT routines, which test the capabilities of the different approaches to providing large memory. We find that, in spite of its higher processor speed, the CRAY-2 does not perform as well as the CRAY X-MP on the Fortran kernel benchmark. We also find that for large-scale applications with regular and predictable memory access patterns, a high-speed secondary memory device such as the SSD can provide performance equal to the large real memory of the CRAY-2. (The author is an employee of the SCA Division of Boeing Computer Services.)

2.
Recently, a number of advanced architecture machines have become commercially available. These new machines promise better cost performance than traditional computers, and some of them have the potential of competing with current supercomputers, such as the CRAY X-MP, in terms of maximum performance. This paper describes the methodology and results of a pilot study of the performance of a broad range of advanced architecture computers using a number of complete scientific application programs. The computers evaluated include:
  1. shared-memory bus-architecture machines such as the Alliant FX/8, the Encore Multimax, and the Sequent Balance and Symmetry;
  2. shared-memory network-connected machines such as the Butterfly;
  3. distributed-memory machines such as the NCUBE, Intel and Jet Propulsion Laboratory (JPL)/Caltech hypercubes;
  4. very long instruction word machines such as the Cydrome Cydra-5;
  5. SIMD machines such as the Connection Machine;
  6. ‘traditional’ supercomputers such as the CRAY X-MP, CRAY-2 and SCS-40.
Seven application codes from a number of scientific disciplines have been used in the study, although not all the codes were run on every machine. The methodology and guidelines for establishing a standard set of benchmark programs for advanced architecture computers are discussed. The CRAYs offer the best performance on the benchmark suite; the shared-memory multiprocessor machines generally permitted some parallelism and, when coupled with substantial floating-point capabilities (as in the Alliant FX/8 and Sequent Symmetry), ran an order of magnitude slower than the CRAYs. Likewise, the early-generation hypercubes studied here generally ran slower than the CRAYs, but permitted substantial parallelism in each of the application codes.

3.
CHAU-WEN TSENG 《Software》1997,27(7):763-796
Fortran D is a version of Fortran enhanced with data decomposition specifications. Case studies illustrate strengths and weaknesses of the prototype Fortran D compiler when compiling linear algebra codes and whole programs. Statement groups, execution conditions, inter-loop communication optimizations, multi-reductions, and array kills for replicated arrays are identified as new compilation issues. On the Intel iPSC/860, the output of the prototype Fortran D compiler approaches the performance of hand-optimized code for parallel computations, but needs improvement for linear algebra and pipelined codes. The Fortran D compiler outperforms the CM Fortran compiler (2.1 beta) by a factor of four or more on the TMC CM-5 when not using vector units. Its performance is comparable to that of the DEC and IBM HPF compilers on an Alpha cluster and an SP-2. Better analysis, run-time support, and flexibility are required for the prototype compiler to be useful for a wider range of programs. © 1997 John Wiley & Sons, Ltd.

4.
At the first VAPP conference attention was drawn to the difficulty of calculating angular integrals on the CRAY-1. In this paper we describe how multitasking on the CRAY-2 and CRAY X-MP can be exploited to improve the efficiency of the calculation of angular integrals. Timings for the CRAY-2 and CRAY X-MP are presented. One surprising result is that for this application the CRAY X-MP is faster than the CRAY-2 in both unitasking and multitasking modes.

5.
《Computers & chemistry》1991,15(1):79-85
The AMBER 3.0 molecular mechanics and molecular dynamics programs have been ported to and vectorized on the NEC SX-2/400 supercomputer. A detailed discussion of the vector enhancement of the AMBER non-bonded pair-list generation subroutine is presented. Automatic vectorization using the FORT77SX compiler yielded speed-up factors of 1.2 to 1.5 over unvectorized code. Recoding of key portions of the program, as described in this paper, yielded speed-up factors of 1.8 to 2.7. The perturbation molecular dynamics program, PERDYN, now runs up to 35 times faster on the SX-2/400 than the VAX-optimized version of the same program runs on the VAX 8650.

6.
7.
The serial and parallel performance of one of the world's fastest general-purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given. (An earlier version of this paper was presented at Supercomputing '88. This work was performed under the auspices of the U.S. Department of Energy.)

8.
Because they are interpreted, Java executables run slower than their compiled counterparts. The native executable translation (NET) compiler's objective is to optimize the translation of Java byte-code to native machine code so that it runs nearly as fast as native code generated directly from source. The article presents some preliminary results for several large application programs and standard benchmarks. It compares the NET-compiled code's performance with Sun's Java VM, Microsoft's Java just-in-time compiler, and equivalent C and C++ programs compiled directly. The results show that the optimizing NET compiler is capable of achieving better performance than the two other byte-code execution methods, in some cases achieving speeds comparable to directly compiled native code.

9.
A Vectorizing Compiler for Multimedia Extensions
In this paper, we present an implementation of a vectorizing C compiler for Intel's MMX (Multimedia Extension). The compiler identifies data-parallel sections of the code using scalar and array dependence analysis. To enhance the scope for application of the subword semantics, our compiler performs several code transformations, including strip mining, scalar expansion, grouping and reduction, and distribution. Thereafter, inline assembly instructions corresponding to the data-parallel sections are generated. We have used the Stanford University Intermediate Format (SUIF), a public-domain compiler tool, for our implementation. We evaluated the performance of the code generated by our compiler for a number of benchmarks. Initial performance results reveal that our compiler-generated code achieves a reasonable performance improvement (speedup of 2 to 6.5) over the code generated without the vectorizing transformations/inline assembly. In certain cases, the performance of the compiler-generated code is within 85% of the hand-tuned code for the MMX architecture.
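The strip-mining transformation mentioned in the abstract can be illustrated with a minimal sketch (ours, not the paper's compiler): a loop over n elements is rewritten into strips of the subword width, where each inner strip models one packed MMX operation, plus a scalar remainder loop. The saturating-add example and all names here are illustrative assumptions.

```python
VL = 4  # assumed vector length: four 16-bit subwords per 64-bit MMX register

def saturating_add_scalar(a, b):
    # Original scalar loop: one element per iteration.
    return [min(x + y, 32767) for x, y in zip(a, b)]

def saturating_add_strip_mined(a, b):
    # Strip-mined form: the inner loop models a single packed
    # instruction (e.g. a saturating add) over VL subwords at once.
    n = len(a)
    out = [0] * n
    for i in range(0, n - n % VL, VL):      # vectorized strips
        for j in range(VL):                 # one packed operation
            out[i + j] = min(a[i + j] + b[i + j], 32767)
    for i in range(n - n % VL, n):          # scalar remainder loop
        out[i] = min(a[i] + b[i], 32767)
    return out
```

Both forms compute identical results; the payoff on real hardware is that each vectorized strip retires VL additions per instruction.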

10.
Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low-level, hardware-specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter-vdt and intra-vdt) to using VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels into vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro-benchmarks and macro-benchmarks. With micro-benchmarks, we study the execution model of VDTs and the role of the compiler-aided vectorizer on five devices. With macro-benchmarks, we explore the changes in memory access patterns before and after using VDTs, and the resulting performance impact. Not only does our evaluation provide insights into how OpenCL's VDTs are mapped onto different processors, but it also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs. Copyright © 2014 John Wiley & Sons, Ltd.
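The scalar-to-vector kernel translation the abstract describes can be sketched roughly as follows (a conceptual model of ours, not the paper's code): when each work-item consumes a float4-style chunk instead of one element, the launch shrinks by the vector width and memory is touched in contiguous 4-element groups, which is exactly the change in access pattern the macro-benchmarks measure.

```python
WIDTH = 4  # lanes in the assumed float4-style vector data type

def scalar_kernel(gid, a, b, out):
    # One work-item handles one element.
    out[gid] = a[gid] * b[gid]

def vectorized_kernel(gid, a, b, out):
    # One work-item handles one 4-lane vector; memory is accessed in
    # contiguous WIDTH-element chunks, changing the access pattern.
    base = gid * WIDTH
    for lane in range(WIDTH):
        out[base + lane] = a[base + lane] * b[base + lane]

def launch(kernel, n_items, a, b):
    # Stand-in for an OpenCL NDRange launch: run each work-item in turn.
    out = [0.0] * len(a)
    for gid in range(n_items):
        kernel(gid, a, b, out)
    return out
```

For 8 elements, the scalar version needs 8 work-items while the vectorized one needs only 2, producing identical output.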

11.
An automatic vectorizing compiler called V-Pascal is described in detail. The compiler has been designed and implemented with a view to vectorizing Pascal source programs. Using the mechanism of vector indirect addressing, it reduces multiply nested for loops to equivalent single loops, which are then executed in vector mode with sufficiently long vector lengths. The D matrix, an adjacency matrix giving dependences between intermediate-code nodes, plays an important role in the V-Pascal compiler. It is demonstrated that, in some cases, the V-Pascal compiler yields object code that runs faster than its Fortran counterpart. This paper mainly presents the basic constituents of Version 1 of the V-Pascal compiler. Version 2 includes higher functions such as vectorization of while-do loops and recursive procedures, vectorization of character-string manipulations and relational database operations (written in Pascal), and automatic parallel decomposition for multiprocessor environments.
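The loop-collapsing idea behind V-Pascal's vector indirect addressing can be sketched like this (our illustration, not the compiler's actual scheme): precomputed gather-index arrays let a doubly nested loop body be driven by one long single loop, giving a vector length of rows*cols instead of cols.

```python
def collapse_2d(rows, cols):
    # Gather-index arrays: element k of the single loop maps back to
    # the (ii[k], jj[k]) iteration of the original nested loops.
    ii = [i for i in range(rows) for _ in range(cols)]
    jj = [j for _ in range(rows) for j in range(cols)]
    return ii, jj

def add_nested(a, b):
    # Original multiply nested form.
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))]
            for i in range(len(a))]

def add_collapsed(a, b):
    # Equivalent single loop using vector indirect addressing.
    rows, cols = len(a), len(a[0])
    ii, jj = collapse_2d(rows, cols)
    out = [[0] * cols for _ in range(rows)]
    for k in range(rows * cols):   # one loop, vector length rows*cols
        out[ii[k]][jj[k]] = a[ii[k]][jj[k]] + b[ii[k]][jj[k]]
    return out
```

On a machine with hardware gather/scatter, the single loop keeps the vector pipeline full even when each row alone would be too short.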

12.
We present a Monte Carlo programme version written in Vector FORTRAN 200 which allows fast computation of thermodynamic properties of dense model fluids on the CYBER 205 vector-processing computer. A comparison of the execution speed of this programme, a scalar version, and a vectorized molecular dynamics programme showed the following: (i) the vectorized form of the Monte Carlo programme runs about a factor of 8 faster on the CYBER 205 than the scalar version on the conventional CYBER 855; (ii) for small ensembles of 32-108 particles, the Monte Carlo programme runs at about the same speed as the molecular dynamics one. For larger numbers of particles, however, the molecular dynamics programme executes vastly faster on the CYBER 205 than the Monte Carlo programme, particularly when neighbour tables are used. We propose a technique to accelerate the Monte Carlo programme for larger ensembles.

13.
The conventional method of assessing supercomputer performance by measuring the execution time of software has many shortcomings. First, effort is required to write and debug the software. Second, time on the machine is required, and additional effort is needed to verify the validity of the test. Third, alterations to the algorithm require changing the code and retiming. Fourth, a black-box approach to determining machine performance leaves the user with little confidence in how well the software was optimized. We present a pencil-and-paper methodology for computing the execution time of vectorized loops on a Cray Research X-MP/Y-MP. With this methodology a user can accurately compute the processing rate of an algorithm before the software is actually written. When several implementations of an algorithm are designed, this methodology can be used to select the best one for development, preventing wasted coding effort on less efficient implementations. Since this methodology computes optimal machine performance, it can be used to verify the efficiency of compiler translation. Changes to algorithms are easily appraised to determine their effect on performance. While the purpose of the methodology is to compute an algorithm's execution time, a side benefit is that this technique induces the user to think in terms of optimization. Bottlenecks in the code are pinpointed, and possible options for increased performance become obvious. At E-Systems, this methodology has become an integral part of the software development of vector-intensive code. This article is written specifically for Cray Research X-MP/Y-MP supercomputers, but many of the general concepts are applicable to other machines and should therefore benefit a number of supercomputer users.
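A pencil-and-paper estimate of this kind might take the following shape; this is our simplified model under assumed numbers (an 8.5 ns clock, 64-element vector registers, a flat 50-cycle strip startup), not the article's actual methodology or constants. Each 64-element strip pays a fixed startup, and chained vector operations then deliver one result per clock per "chime" (one pass of chained vector instructions).

```python
import math

CYCLE_NS = 8.5       # assumed clock period in nanoseconds
MAX_VL = 64          # hardware vector register length
STARTUP_CYCLES = 50  # assumed fixed overhead per vector strip

def loop_time_ns(n, chimes):
    # Estimated execution time of a vectorized loop over n elements:
    # one startup cost per 64-element strip, then chimes cycles per element.
    strips = math.ceil(n / MAX_VL)
    cycles = strips * STARTUP_CYCLES + n * chimes
    return cycles * CYCLE_NS

def mflops(n, flops_per_elem, chimes):
    # Processing rate implied by the estimate, in millions of flops/second.
    return n * flops_per_elem / (loop_time_ns(n, chimes) * 1e-3)
```

Comparing `mflops` for two candidate loop structures (different chime counts) before any code exists is precisely the kind of design-time comparison the article advocates.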

14.
Reference [1] proposed a data-fusion optimization method for arrays and tested its effect on an IA-32 server platform. Those results showed that, on IA-32 machines, data-fusion optimization under the control of a performance cost model can substantially improve the cache utilization of applications with non-contiguous data access patterns. How, then, does data-fusion optimization perform on the new-generation IA-64 architecture? In this paper, the programs before and after the data-fusion transformation were compiled and run on an Intel IA-32 server and an HP Itanium server, using the Intel Fortran compilers ifc and efc and the free compiler g95, and execution times and related performance data were collected on both platforms. The results show that source-level data-fusion optimization does not cooperate well with the efc compiler's advanced optimizations on the IA-64 platform: under the O3 optimization level, the effect of the transformation is negative. This further indicates that advanced compiler optimizations such as data prefetching, loop transformations, and data transformations must be considered together with the characteristics of the target architecture in order to achieve good overall optimization results. This paper serves as a starting point for studying the performance portability, on the IA-64 architecture, of compiler optimization algorithms designed for IA-32.

15.
This paper addresses how to automatically generate code for multimedia extension architectures in the presence of conditionals. We evaluate the costs and benefits of exploiting branches on the aggregate condition codes associated with the fields of a superword (an aggregate object larger than a machine word) such as the branch-on-any instruction of the AltiVec. Branch-on-superword-condition-codes (BOSCC) instructions allow fast detection of aggregate conditions, an optimization opportunity often found in multimedia applications. This paper presents compiler analyses and techniques for generating efficient parallel code using BOSCC instructions. We evaluate our approach, which has been implemented in the SUIF compiler, through a set of experiments with multimedia benchmarks, and compare it with the default approach previously implemented in our compiler. Our experimental results show that using BOSCC instructions can result in better performance for applications where the aggregate condition codes of a superword often evaluate to the same value.
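The BOSCC idea can be modeled in a few lines (our sketch of the concept, not the paper's generated code): before executing a guarded block, test the aggregate condition codes of the whole superword; when no lane satisfies the condition, which the abstract notes is the common case in many multimedia kernels, branch past the guarded code entirely. The 4-lane width and the clamp-negatives example are illustrative assumptions.

```python
WIDTH = 4  # assumed lanes per superword

def clamp_negatives(data):
    # Guarded work: clamp negative values to zero, superword by superword.
    skipped = 0
    for base in range(0, len(data), WIDTH):
        lanes = data[base:base + WIDTH]
        mask = [x < 0 for x in lanes]   # per-lane condition codes
        if not any(mask):               # branch-on-any: all lanes false,
            skipped += 1                # so skip the guarded block entirely
            continue
        for j, hit in enumerate(mask):
            if hit:
                data[base + j] = 0
    return data, skipped
```

When the condition rarely fires, the cheap aggregate test replaces four per-lane select/merge operations per superword, which is where the measured speedups come from.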

16.
Abstract machine modelling is a popular technique for developing portable compilers. A compiler can be quickly realized by translating the abstract machine operations to target machine operations. The problem with these compilers is that they trade execution efficiency for portability. Typically, the code emitted by these compilers runs two to three times slower than the code generated by compilers that employ sophisticated code generators. This paper describes a C compiler that uses abstract machine modelling to achieve portability. The emitted target machine code is improved by a simple, classical rule-directed peephole optimizer. Our experiments with this compiler on four machines show that a small number of very general handwritten patterns (under 40) yields code that is comparable to the code from compilers that use more sophisticated code generators. As an added bonus, compilation time on some machines is reduced by 10 to 20 per cent.
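A rule-directed peephole pass of the kind described can be sketched as follows; the rule set, wildcard syntax (`%1` binds an operand), and instruction mnemonics are all our own illustrative assumptions, not the paper's patterns. Each rule matches a short window of emitted instructions and rewrites it in place.

```python
RULES = [
    # (pattern window, replacement); %1 must bind the same operand
    # everywhere it appears within one pattern.
    (["push %1", "pop %1"], []),   # store immediately undone by reload
    (["add %1, 0"], []),           # adding zero is a no-op
    (["mov %1, %1"], []),          # move of a register to itself
]

def match(pattern, window):
    # Return the wildcard binding if the window matches, else None.
    binding = {}
    for pat, ins in zip(pattern, window):
        p_op, *p_args = pat.replace(",", "").split()
        i_op, *i_args = ins.replace(",", "").split()
        if p_op != i_op or len(p_args) != len(i_args):
            return None
        for p, i in zip(p_args, i_args):
            if p.startswith("%"):
                if binding.setdefault(p, i) != i:
                    return None      # wildcard bound inconsistently
            elif p != i:
                return None          # literal operand mismatch
    return binding

def peephole(code):
    # Slide over the instruction list, applying the first matching rule.
    out, i = [], 0
    while i < len(code):
        for pattern, repl in RULES:
            window = code[i:i + len(pattern)]
            if len(window) == len(pattern) and match(pattern, window) is not None:
                out.extend(repl)
                i += len(pattern)
                break
        else:
            out.append(code[i])
            i += 1
    return out
```

A real pass would iterate to a fixed point so that one rewrite can expose another; a single sweep is enough to show the mechanism.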

17.
18.
The time‐dependent Maxwell equations are one of the most important approaches to describing dynamic or wide‐band frequency electromagnetic phenomena. A sequential finite‐volume, characteristic‐based procedure for solving the time‐dependent, three‐dimensional Maxwell equations has been successfully implemented in Fortran before. Due to its need for a large memory space and high demand on CPU time, it is impossible to test the code for a large array. Hence, it is essential to implement the code on a parallel computing system. In this paper, we discuss an efficient and scalable parallelization of the sequential Fortran time‐dependent Maxwell equations solver using High Performance Fortran (HPF). The background to the project, the theory behind the efficiency being achieved, the parallelization methodologies employed and the experimental results obtained on the Cray T3E massively parallel computing system will be described in detail. Experimental runs show that the execution time is reduced drastically through parallel computing. The code is scalable up to 98 processors on the Cray T3E and has a performance similar to that of an MPI implementation. Based on the experimentation carried out in this research, we believe that a high‐level parallel programming language such as HPF is a fast, viable and economical approach to parallelizing many existing sequential codes which exhibit a lot of parallelism. Copyright © 2003 John Wiley & Sons, Ltd.

19.
This is the first of a series of papers on the Genesis distributed-memory benchmarks, which were developed under the European ESPRIT research program. The benchmarks provide a standard reference Fortran77 uniprocessor version, a distributed-memory MIMD version, and in some cases a Fortran90 version suitable for SIMD computers. The problems selected all have a scientific origin (mostly from physics or theoretical chemistry), and range from synthetic code fragments designed to measure the basic hardware properties of the computer (especially communication and synchronisation overheads), through commonly used library subroutines, to full application codes. This first paper defines the methodology to be used to analyse the benchmark results, and gives an example of a fully analysed application benchmark from General Relativity (GR1). First, suitable absolute performance metrics are carefully defined; then the performance analysis treats the execution time and absolute performance as functions of at least two variables, namely the problem size and the number of processors. The theoretical predictions are compared with, or fitted to, the measured results, and then used to predict (with due caution) how the performance might scale for larger problems and more processors than were actually available during the benchmarking. Benchmark measurements are given primarily for the German SUPRENUM computer, but also for the IBM 3083J, Convex C210 and a Parsys Supernode with 32 T800-20 transputers.
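Treating performance as a function of both problem size and processor count can be made concrete with the conventional speedup and efficiency definitions (standard textbook metrics, not the paper's own carefully defined absolute metrics; the timing table below is entirely hypothetical).

```python
def speedup(t_serial, t_parallel):
    # Conventional speedup: serial time over parallel time.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, procs):
    # Fraction of ideal linear speedup achieved per processor.
    return speedup(t_serial, t_parallel) / procs

# Hypothetical timings in seconds, indexed by (problem size, processors),
# illustrating the two-variable analysis.
timings = {
    (256, 1): 10.0, (256, 4): 3.2, (256, 16): 1.5,
    (1024, 1): 42.0, (1024, 4): 11.6, (1024, 16): 3.4,
}

for (n, p), tp in sorted(timings.items()):
    t1 = timings[(n, 1)]
    print(f"n={n:5d} p={p:2d} "
          f"S={speedup(t1, tp):5.2f} E={efficiency(t1, tp, p):4.2f}")
```

Tabulating efficiency over the (size, processors) grid is what makes cautious extrapolation to larger problems and machines possible.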

20.
Embedded modeling and simulation were carried out using the dedicated tools STM32CubeMX and MATLAB. After successful simulation, the model is translated into efficient MDK C code in the compiler, which greatly improves the efficiency of embedded program development and shortens the development cycle; MATLAB's code optimization toolbox can also be used at the same time to improve code quality. Experimental results show that the code generated with MATLAB and STM32CubeMX runs well on the target system and outperforms hand-written code in design efficiency and maintainability.
