Similar literature
1.
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a tracing library for the CRAY X-MP and CRAY-2 supercomputers that supports the low-overhead capture of execution events for sequential and multitasked programs. This library has been extended to use the automatic instrumentation facilities on these machines, allowing trace data from routine entry and exit, and from other program segments, to be captured. To assess the utility of the trace-based tools, three of the Perfect Benchmark codes have been tested in scalar and vector modes with the tracing instrumentation. Beyond yielding summary execution statistics, the trace histories reveal interesting execution dynamics. It is also possible to model application performance based on properties identified from traces. Our conclusion is that adding tracing support in Cray supercomputers can have significant returns in improved performance characterization and evaluation. An earlier version of this paper was presented at Supercomputing '90. Supported in part by the National Science Foundation under Grants No. NSF MIP-88-07775 and No. NSF ASC-84-04556, and the NASA Ames Research Center Grant No. NCC-2-559. Supported in part by the National Science Foundation under grant NSF ASC-84-04556. Supported in part by the National Science Foundation under grants NSF CCR-86-57696, NSF CCR-87-06653 and NSF CDA-87-22836 and by the National Aeronautics and Space Administration under NASA Contract Number NAG-1-613.

2.
The serial and parallel performance of one of the world's fastest general purpose computers, the CRAY-2, is analyzed using the standard Los Alamos Benchmark Set plus codes adapted for parallel processing. For comparison, architectural and performance data are also given for the CRAY X-MP/416. Factors affecting performance, such as memory bandwidth, size and access speed of memory, and software exploitation of hardware, are examined. The parallel processing environments of both machines are evaluated, and speedup measurements for the parallel codes are given. An earlier version of this paper was presented at Supercomputing '88. This work was performed under the auspices of the U.S. Department of Energy.

3.
The CRAY-2 is considered to be one of the most powerful supercomputers. Its state-of-the-art technology features a faster clock and more memory than any other supercomputer available today. In this report the single-processor performance of the CRAY-2 is compared with that of the older, more mature CRAY X-MP. Benchmark results are included for both the slow and the fast memory DRAM MOS CRAY-2. Our comparison is based on a kernel benchmark set aimed at evaluating the performance of these two machines on some standard tasks in scientific computing. Particular emphasis is placed on evaluating the impact of the availability of large real memory on the CRAY-2 versus fast secondary memory on the CRAY X-MP with SSD. Our benchmark includes large linear equation solvers and FFT routines, which test the capabilities of the different approaches to providing large memory. We find that, in spite of its higher processor speed, the CRAY-2 does not perform as well as the CRAY X-MP on the Fortran kernel benchmark. We also find that for large-scale applications with regular and predictable memory access patterns, a high-speed secondary memory device such as the SSD can provide performance equal to that of the large real memory of the CRAY-2. The author is an employee of SCA Division of Boeing Computer Services.

4.
This paper presents an approach for parallel computation of structural optimization problems on the CRAY X-MP by using parallel sensitivity analysis calculation. In this approach, a main processor is chosen to perform all the optimization calculations except the constraint gradient evaluations. When a sensitivity analysis is needed, the main processor decomposes it into several computation tasks, assigns these tasks to the other available associate processors, and manages the communication. Because the constraint gradient calculations are uncoupled, the associate processors perform the computation tasks in parallel. The algorithm for the structural optimization process with parallel design sensitivity is presented along with some numerical test cases to demonstrate the efficiency of this approach.
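The task structure described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the gradients are taken by forward finite differences, the names `constraint_gradient` and `parallel_sensitivities` are hypothetical, and a Python thread pool stands in for the associate processors (it shows the decomposition and gathering, not true CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor


def constraint_gradient(constraint, x, step=1e-6):
    """Forward-difference gradient of one constraint at design point x.

    Finite differencing is an assumption made for this sketch; the
    abstract does not say how the sensitivities are computed.
    """
    base = constraint(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        grad.append((constraint(shifted) - base) / step)
    return grad


def parallel_sensitivities(constraints, x, workers=4):
    """Main task: farm out one gradient per constraint, gather in order.

    Each constraint gradient is independent of the others, so the
    tasks need no communication with one another.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(constraint_gradient, g, x) for g in constraints]
        return [f.result() for f in futures]
```

Because the results are collected in submission order, row `j` of the returned list is the gradient of constraint `j`, which is the form an optimizer running on the main task would typically consume.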

5.
In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small- to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5% to over 300%.

6.
In this paper we discuss code optimization techniques for implementing the Level 2 and 3 basic linear algebra subprograms on a single processor for the CRAY Y-MP and the CRAY-2. Our performance measurements show that the use of these techniques leads to a significant improvement in performance, and most subroutines achieve close to the peak performance of the machine for computations of relatively small sizes.

7.
The efficient use of MIMD computers calls for a careful choice of adequate algorithms as well as for an implementation that takes the particular architecture into account. To demonstrate this, a parallel algorithm for finding an approximate solution to the Euclidean Traveling Salesman Problem (ETSP) is presented. The algorithm is a parallelization of Karp's partitioning algorithm, a divide-and-conquer method for solving the ETSP approximately. Since the successor vertex to any vertex in the tour is usually a nearby vertex, the problem can be 'geographically' partitioned into subproblems, which can then be solved independently. The resulting subtours can be combined into a single tour that is an approximate solution to the ETSP. The algorithm is implemented on a CRAY X-MP with two and four processors, and results using macrotasking and microtasking are presented.

8.
We consider direct methods based on Gaussian elimination for solving sparse sets of linear equations. Among conventional approaches, band and frontal methods are obviously vectorizable, whereas general sparse methods do not vectorize easily since they involve indirect addressing in inner loops. We illustrate these effects with actual times from runs on the Cray at Harwell. To avoid indirect addressing we have been developing code that uses a "multi-frontal" technique. This moves the reals within storage in such a way that all operations are performed on full matrices although the pivotal strategy is minimum degree. We describe how the in-core and out-of-core versions perform on the CRAY-1.

9.
The architectures of the Alliant FX, Convex C-1, and SCS-40 minisupercomputers are compared, together with their performance on a set of standard FORTRAN benchmark codes.

10.
Parallelism in dynamic programming is considered in the specific context of optimal control. We present the program PDVP, developed for solving a general deterministic discrete-time optimization problem by means of a dynamic programming algorithm that is parallel over the state variables. Multitasking and vectorization are considered with a view to implementing PDVP on a CRAY-2. Performance is analysed through a significant application to the optimization of satellite trajectories. Promising results are obtained.

11.
One of the many interesting architectural features of the CRAY-2 supercomputer is that each processor has access to 16K 64-bit words of local memory. This is in addition to the extremely large, 268-million-word common memory that is accessible by all four processors. By using local memory judiciously, it is possible to achieve increased performance on the CRAY-2. This is partly because accesses to local memory can be done simultaneously with accesses to common memory and other operations. Also, it is slightly faster to start up a vector access to local memory, and a processor does not have to compete with other processors when accessing its local memory. In this paper, we present an algorithm for computing the fast Fourier transform that takes advantage of the CRAY-2's local memory. It operates by solving subproblems, which are themselves Fourier transforms, entirely within local memory. By doing so it achieves a performance increase of between 25 and 40 percent over an equivalent algorithm that uses only common memory, and for some input sizes is able to outperform the CRAY-2 library FFT.
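The idea of building a large transform out of subproblems that are themselves Fourier transforms can be illustrated with the standard Cooley-Tukey factorization of a length-`n1 * n2` transform. The sketch below is not the paper's algorithm: the function names are hypothetical, a naive `dft` stands in for whatever kernel would run inside the fast local buffer, and no real memory hierarchy is modeled. It only shows that each subtransform touches a short block (length `n1` or `n2`) that could be staged in a small memory.

```python
import cmath


def dft(block):
    """Naive discrete Fourier transform of a short block (stand-in kernel)."""
    n = len(block)
    return [
        sum(block[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def blocked_fft(x, n1, n2):
    """Transform of length n1*n2 assembled from length-n1 and length-n2 DFTs.

    Index maps: input n = n2*a + c, output k = k1 + n1*k2.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Step 1: one length-n1 subtransform per strided column x[c], x[c+n2], ...
    cols = [dft(x[c::n2]) for c in range(n2)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*c*k1/n).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: one length-n2 subtransform per k1, scattered to the output.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[c][k1] for c in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

For example, `blocked_fft(x, 3, 4)` on a 12-point list agrees with `dft(x)` up to rounding while never transforming more than four points at a time.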

12.
13.
Storage and access optimization for vector spatial data based on HBase   Total citations: 2 (self-citations: 0, citations by others: 2)

14.
15.
16.
The development of massively parallel supercomputers provides a unique opportunity to advance the state of the art in N-body simulations. These N-body codes are of great importance for simulations in stellar dynamics and plasma physics. For systems with long-range forces, such as gravity or electromagnetic forces, it is important to increase the number of particles to N on the order of 10^7. Significantly improved modeling of N-body systems can be expected by increasing N, arising from a more realistic representation of physical transport processes involving particle diffusion and energy and momentum transport. In addition, it will be possible to guarantee that physically significant portions of complex physical systems, such as Lindblad resonances of galaxies or current sheets in magnetospheres, will have an adequate population of particles for a realistic simulation. Particle-mesh (PM) and particle-particle particle-mesh (P³M) algorithms present the best prospects for the simulation of large-scale N-body systems. As an example we present a two-dimensional PM simulation of a disk galaxy that we have developed on the Connection Machine-2, a massively parallel Boolean hypercube supercomputer. The code is scalable to any CM-2 configuration available and, on the largest configuration, simulations with N = 128 M = 2^27 particles are possible in reasonable run times.

17.
We report performance measurements made on the 2-CPU CRAY X-MP at ECMWF, Reading. Vector (SIMD) performance on one CPU is interpreted by the two parameters (r_∞, n_1/2), and we find for dyadic operations using FORTRAN r_∞ = 70 Mflop/s, n_1/2 = 53 flop. All-vector triadic operations produce r_∞ = 107 Mflop/s, n_1/2 = 45 flop; and a triadic operation with two vectors and one scalar gives r_∞ = 148 Mflop/s and n_1/2 = 60 flop. MIMD performance using both CPUs on one job is interpreted with the two parameters (r_∞, s_1/2), where s_1/2 is the amount of arithmetic that could have been done during the time taken to synchronize the two CPUs. We find, for dyadic operations using the TSKSTART and TSKWAIT synchronization primitives, that r_∞ = 130 Mflop/s and s_1/2 = 5700 flop. This means that a job must contain more than about 6000 floating-point operations if it is to run at more than 50% of the maximum performance when split between both CPUs by this method. Less expensive synchronization methods using LOCKS and EVENTS reduce s_1/2 to 4000 flop and 2000 flop respectively. A simplified form of LOCK synchronization written in CAL code further reduces s_1/2 to 220 flop. This is probably the minimum possible value for synchronization overhead on the CRAY X-MP.
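The two-parameter description above (an asymptotic rate plus a half-performance length or overhead) is usually read through a linear timing model, in which the time for a job of n operations is (n + h) / r, where r is the asymptotic rate and h is the half-performance parameter. Assuming that model, which the abstract implies but does not write out, the following sketch shows why a job needs roughly h operations to reach 50% of the asymptotic rate. The function names are hypothetical.

```python
def achieved_rate(asymptotic_rate, half_param, n_ops):
    """Average rate for n_ops operations under t = (n_ops + half_param) / asymptotic_rate."""
    return asymptotic_rate / (1.0 + half_param / n_ops)


def ops_for_fraction(half_param, fraction):
    """Operations needed to reach `fraction` of the asymptotic rate.

    Solving fraction = 1 / (1 + half_param / n) gives
    n = half_param * fraction / (1 - fraction).
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return half_param * fraction / (1.0 - fraction)
```

With the TSKSTART/TSKWAIT figures quoted in the abstract, `achieved_rate(130.0, 5700.0, 5700.0)` is 65.0 Mflop/s, exactly half of 130 Mflop/s, and `ops_for_fraction(5700.0, 0.5)` returns 5700.0, consistent with the "more than about 6000 operations" rule of thumb.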

18.
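To address the problem of selecting the values of important modeling parameters in support vector machine (SVM) modeling of fermentation, we propose using the particle swarm optimization algorithm, which has strong global search capability, to tune the important parameters of the SVM modeling process; the position vector of each particle corresponds to one set of SVM modeling parameters. After the parameters have been iteratively optimized, a model with better fitting and prediction performance is obtained and used to predict the penicillin fermentation process. Simulation results show that the method gives the model good predictive performance.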
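A minimal sketch of the particle swarm search described above, in which each particle's position is one candidate parameter set. Everything specific here is an assumption for illustration: the inertia and acceleration constants, the function names, and in particular `toy_validation_error`, which is a hypothetical stand-in for the validation error of an SVM trained with parameters (log10 C, log10 gamma). No real SVM is trained.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best particle swarm optimization over box bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_val = [objective(p) for p in pos]       # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # global best so far
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move, then clamp the coordinate to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def toy_validation_error(params):
    """Hypothetical smooth error surface with its minimum at (1, -2)."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2
```

Calling `pso_minimize(toy_validation_error, [(-3.0, 3.0), (-5.0, 1.0)])` returns the best parameter pair found and its error. In an actual application the objective would train and validate an SVM for each candidate, which dominates the cost, so the swarm size and iteration count would be chosen with that in mind.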

19.
Choosing optimal parameters for support vector regression (SVR) is an important step in SVR design, which strongly affects the performance of SVR. In this paper, based on an analysis of the influence of SVR parameters on generalization error, a new two-step approach is proposed for selecting SVR parameters. First, the kernel function and SVM parameters are optimized roughly through a genetic algorithm; then the kernel parameter is finely adjusted by local linear search. This approach has been successfully applied to the prediction model of the sulfur content in hot metal. The experimental results show that the proposed approach can yield better generalization performance of SVR than other methods.

20.
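Based on the mathematical model of the permanent magnet linear synchronous motor (PMLSM), a PMLSM servo control system with a three-closed-loop control structure is constructed. Using a voltage space vector control strategy, an overall simulation model of the PMLSM servo control system is built in the MatLab/Simulink environment. Analysis of the simulation results demonstrates the correctness of the model, which provides a base model for subsequent research on high-precision PMLSM control methods.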
