Similar Literature
20 similar documents found (search time: 31 ms).
1.
Achieving high sustained simulation performance is the most important concern in the HPC community. To this end, many kinds of HPC system architectures have been proposed, and the diversity of HPC systems is growing rapidly. Under these circumstances, the vector-parallel supercomputer SX-ACE has been designed to achieve high sustained performance on memory-intensive applications by providing a memory bandwidth commensurate with its high computational capability. This paper examines the potential of the modern vector-parallel supercomputer through performance evaluations of SX-ACE using practical engineering and scientific applications. To improve the sustained performance of practical applications, SX-ACE adopts an advanced memory subsystem with several new architectural features. This paper discusses how these features, such as MSHRs, a large on-chip memory, and novel vector processing mechanisms, help achieve high sustained performance for large-scale engineering and scientific simulations. Evaluation results clearly indicate that the high sustained memory performance per core enables the modern vector supercomputer to achieve performance levels that are unreachable by simply increasing the number of fine-grain scalar processor cores. This paper also discusses HPCG benchmark results to evaluate the potential of supercomputers with balanced memory and computational performance against heterogeneous and cutting-edge scalar parallel systems.

2.
The paper describes the implementation of the Successive Overrelaxation (SOR) method on an asynchronous multiprocessor computer for solving large linear systems. The parallel algorithm is derived by dividing the serial SOR method into noninterfering tasks, which are then combined under an optimal schedule on a feasible number of processors. The important features of the algorithm are: (i) it achieves a speedup S_p of O(N/3) and an efficiency E_p of about 2/3 using P = ⌊N/2⌋ processors, where N is the number of equations; (ii) it contains a high level of inherent parallelism, while the convergence theory of the parallel SOR method remains the same as that of its sequential counterpart; and (iii) it may be modified to use block methods in order to minimise the overhead due to communication and synchronisation of the processors.
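For reference, here is a minimal sketch of the sequential SOR sweep that the parallel algorithm decomposes into noninterfering row-update tasks. It assumes a dense NumPy matrix and is purely illustrative, not the paper's implementation:

```python
import numpy as np

def sor_sweep(A, b, x, omega):
    """One sequential SOR sweep for A x = b, updating x in place.
    Each row update is the unit task a parallel schedule distributes."""
    n = len(b)
    for i in range(n):
        sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
        x[i] += omega * ((b[i] - sigma) / A[i, i] - x[i])
    return x
```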

3.
In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers, chosen to encompass a variety of architectures: the MPP, an SIMD machine with 16K bit-serial processors; the Flex/32, an MIMD machine with 20 processors; and the Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve the resulting set of tridiagonal systems on the Flex/32 and Cray/2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures, and performance measurements on each machine are given. Simple performance models are used to describe the performance; these models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
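As a reference for the tridiagonal solves mentioned above, a hedged sketch of Gaussian elimination specialised to tridiagonal systems (the Thomas algorithm); the interface is illustrative and assumes n >= 2:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-diagonal (length n-1),
    b = diagonal (length n), c = super-diagonal (length n-1),
    d = right-hand side (length n). Assumes n >= 2, no pivoting."""
    n = len(b)
    cp, dp = np.empty(n - 1), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```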

4.
Atomistic simulations of thin film deposition, based on the lattice Monte Carlo method, provide insights into microstructure evolution at the atomic level. However, large-scale atomistic simulation is limited on a single computer by memory and speed constraints. Parallel computation, although promising in both memory and speed, has not been widely applied in these simulations because of the substantial communication overhead. The key issue in achieving optimal performance is therefore to reduce communication overhead among processors. In this paper, we propose a new parallel algorithm for the simulation of large-scale thin film deposition incorporating two optimization strategies: (1) domain decomposition with sub-domain overlapping and (2) asynchronous communication. This algorithm was implemented both on massively parallel processing (MPP) systems and on cluster computers. We found that both architectures are suitable for parallel Monte Carlo simulation of thin film deposition in either a distributed memory mode or a shared memory mode with message-passing libraries.

5.
In this paper, we present parallel simulations of three-dimensional complex flows obtained on an ORIGIN 3800 computer and on homogeneous and heterogeneous (processors of different speeds and RAM) computational grids. The solver under consideration, which is representative of modern numerics used in industrial computational fluid dynamics (CFD) software, is based on a mixed element-volume method on unstructured tetrahedrisations. The parallelisation strategy combines mesh partitioning techniques, a message-passing programming model and an additive Schwarz algorithm. The parallelisation performance is analysed on a two-phase compressible flow and a turbulent flow past a square cylinder.

6.
In modern processors, the memory controller manages and carries out the processor chip's accesses to off-chip memory, and its scheduling algorithm has a major impact on the memory-access performance actually achieved. To address the poor adaptivity of existing scheduling algorithms across different workload characteristics, this paper proposes ALHS, a reinforcement-learning-based algorithm that learns an optimal policy by adaptively adjusting the upper limit on consecutive page hits served under a page-hit-first scheduling policy. Simulation results on a variety of typical memory-access patterns show that ALHS runs 10.98% faster on average than the traditional FR-FCFS policy and approaches the performance gains of the optimal policy, demonstrating that the algorithm can autonomously explore its environment and optimize itself.
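A toy Q-learning sketch of the core idea, tuning the cap on consecutive page hits served under a hit-first policy. The state, action and reward design below is illustrative only and is not taken from the paper:

```python
import random

CAPS = [1, 2, 4, 8, 16, 32]   # candidate consecutive-page-hit limits (illustrative)
ACTIONS = [-1, 0, +1]         # lower / keep / raise the cap index
Q = {(s, a): 0.0 for s in range(len(CAPS)) for a in range(len(ACTIONS))}
alpha, gamma, eps = 0.1, 0.9, 0.1

def choose(state):
    """Epsilon-greedy selection over the cap-adjustment actions."""
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(state, a)])

def apply_action(state, action):
    """Move to the neighbouring cap, clipped to the candidate range."""
    return max(0, min(len(CAPS) - 1, state + ACTIONS[action]))

def update(state, action, reward, next_state):
    """One-step Q-learning update; the reward would be the measured
    memory throughput over the last scheduling window."""
    best = max(Q[(next_state, a)] for a in range(len(ACTIONS)))
    Q[(state, action)] += alpha * (reward + gamma * best - Q[(state, action)])
```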

7.
We present a polynomial preconditioner that can be used with the conjugate gradient method to solve symmetric and positive definite systems of linear equations. Each step of the preconditioning is achieved by simultaneously taking an iteration of the SOR method and an iteration of the reverse SOR method (equations taken in reverse order) and averaging the results. This yields a symmetric preconditioner that can be implemented on parallel computers by performing the forward and reverse SOR iterations simultaneously. We give necessary and sufficient conditions for additive preconditioners to be positive definite.
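A minimal sketch of one application of this preconditioner, with both sweeps started from zero (illustrative NumPy code, not the authors' implementation); note that the two sweeps are independent of each other and can run simultaneously:

```python
import numpy as np

def sor_additive_apply(A, r, omega):
    """Apply the additive preconditioner to a residual r: average one
    forward SOR sweep and one reverse-order SOR sweep for A z = r,
    each started from z = 0. For symmetric A the result is symmetric."""
    n = len(r)
    zf = np.zeros(n)
    for i in range(n):                     # forward sweep
        zf[i] = omega * (r[i] - A[i, :i] @ zf[:i]) / A[i, i]
    zr = np.zeros(n)
    for i in range(n - 1, -1, -1):         # reverse sweep
        zr[i] = omega * (r[i] - A[i, i + 1:] @ zr[i + 1:]) / A[i, i]
    return 0.5 * (zf + zr)                 # average the two results
```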

We find an optimal parameter, ω, for the SOR-Additive linear stationary iterative method applied to 2-cyclic matrices. We show this method is asymptotically twice as fast as SSOR when the optimal ω is used.

We compare our preconditioner to the SSOR polynomial preconditioner for a model problem. With the optimal ω, our preconditioner was found to be as effective as the SSOR polynomial preconditioner in reducing the number of conjugate gradient iterations. Parallel implementations of both methods are discussed for vector and multiple processors. Results show that if the same number of processors is used for both preconditioners, the SSOR preconditioner is more effective. If twice as many processors are used for the SOR-Additive preconditioner, it becomes more efficient than the SSOR preconditioner when the number of equations assigned to each processor is small. These results were confirmed on the Blue Chip emulator at the University of Washington.


8.
The Building-Cube Method (BCM), based on equally-spaced Cartesian meshes, has been proposed as a next-generation CFD method. Thanks to its equally-spaced meshes, it is well suited to highly parallel computation. This paper proposes a parallel implementation scheme of BCM on a GPU cluster system, which requires efficient hierarchical parallel processing to exploit the potential of the cluster. The proposed scheme employs the Red-Black SOR method for the pressure calculation, the most time-consuming part of BCM, to expose its massive data parallelism. By exploiting both the coarse-grain and fine-grain parallelism of BCM, the scheme hierarchically assigns equally-divided tasks to the GPU cluster system. Furthermore, to exploit the computational power of the GPUs, the scheme employs efficient data management such as coalesced data transfers and data reuse in on-chip memory. Experimental results show that the single-GPU implementation achieves about three times the performance of the single-CPU one, and the multi-GPU implementation achieves almost ideal scalability. Finally, the possibility of further accelerating not only the pressure calculation but the whole of BCM is discussed.
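A vectorised NumPy sketch of a Red-Black SOR kernel for a generic 2D Poisson problem (illustrative, not the BCM/GPU code): points of one colour depend only on points of the other colour, so each half-sweep is fully data-parallel, which is the property the GPU implementation exploits:

```python
import numpy as np

def red_black_sor(p, f, h, omega, sweeps):
    """Red-Black SOR for the 2D Poisson equation lap(p) = f on a grid
    with spacing h; boundary values of p are held fixed."""
    for _ in range(sweeps):
        for colour in (0, 1):              # each half-sweep is data-parallel
            for i in range(1, p.shape[0] - 1):
                j0 = 1 + (i + colour) % 2  # first interior column of this colour
                p[i, j0:-1:2] = (1 - omega) * p[i, j0:-1:2] + omega * 0.25 * (
                    p[i - 1, j0:-1:2] + p[i + 1, j0:-1:2]
                    + p[i, j0 - 1:-2:2] + p[i, j0 + 1::2]
                    - h * h * f[i, j0:-1:2])
    return p
```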

9.
In this paper the author makes a comprehensive comparison of different parallelizations of a sequential number-theoretic algorithm with large memory requirements. Brunotte's algorithm is one of the best currently known methods for deciding the canonical number system (or, more generally, shift radix system) property. Still, it can be very space-consuming in some cases. Pushing the algorithm to its limits may hopefully shed light on mathematical patterns that would otherwise not be discernible. The algorithm contains many n-dimensional vector operations and set operations such as insert, find, and clear. The parallel algorithms face two different kinds of concurrency problems: first, they need computationally intensive arithmetic vector operations; second, the set implementations require a huge amount of memory and general-purpose processors. The algorithms described in this article are designed for two platforms. The first is a generic symmetric multiprocessing (SMP) architecture without any vector processor extension; the second is the Cell Broadband Engine, whose processors include Synergistic vector processors.

10.
In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial. In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated into the algorithm: improved Discrete Sine Transform routines, use of the Gohberg-Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed-memory parallel programming paradigm dictated by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques merging OpenMP and MPI have been used to increase performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK.

11.
A parallelized point rowwise Successive Over-Relaxation (SOR) iterative algorithm is developed for the Heterogeneous Element Processor (HEP) multiple-instruction-stream computer. The classical point SOR method is not easily vectorizable with rowwise ordering of the grid points, but it can be effectively parallelized on a multiple-instruction-stream machine without sacrificing computational efficiency or convergence rate. The details of the implementation are presented, including the restructuring of a serial FORTRAN program and the techniques needed to exploit the parallel processing architecture of the HEP. The parallelized algorithm is analyzed in detail. The lessons learned in this study are documented and may provide guidelines for similar future efforts, since programming a multiple-instruction-stream machine requires approaches and restructuring techniques entirely different from those used to program an algorithm on a vector processor. To assess the capabilities of the parallelized algorithm, it was used to solve Laplace's equation on a rectangular field with Dirichlet boundary conditions. Computer run times are presented which indicate a significant speed gain over a scalar version of the code. For a moderate to large problem, seventeen or more processes are required to make efficient use of the parallel processing hardware. To demonstrate the capability of the algorithm on a realistic problem, it was also used to obtain the numerical solution of a viscous incompressible fluid in a square cavity. Since point iterative relaxation schemes are at the core of many systems of elliptic as well as non-elliptic partial differential equations occurring in engineering and scientific applications, the present study suggests the possibility of both reducing real processing time and broadening the scope of computational modeling.

12.
Parallel Computing, 1988, 6(2): 157-164
A general algorithm for calculating finite element matrices efficiently on both serial and vector processors is presented. In contrast to other element calculation algorithms, the degree of vectorisation of the proposed algorithm is not limited by the order of the element approximation, so the potential for increased efficiency applies to any existing finite element code. To illustrate the degree of vectorisation and scalar optimisation that can be realised in practice, computation times on a serial computer and a CYBER 205 vector processor are compared for three distinct types of element matrix, each having physical significance.

13.
Contour (iso-line) maps are used extensively by the Meteorological Office for both routine and research tasks. The techniques employed in a new FORTRAN subroutine (CMAP) for drawing such maps are described. The algorithm employed is raster-based, suited to both scalar and vector processors, and inherently device-independent. In addition, some normally CPU-intensive operations, such as contour shading, chart masking (e.g. with respect to geographical areas) and isometric projections, are natural extensions of the method requiring comparatively little CPU time.

14.
The performance of memory and I/O systems is insufficient to keep pace with that of COTS (Commercial Off-The-Shelf) CPUs. PC clusters built from COTS CPUs have been widely employed for HPC, but a cache-based processor is far less effective than a vector processor in applications with low spatial locality. Moreover, for HPC, Google-like server farms and database processing, insufficient main memory capacity poses a serious problem, and the power consumption of a Google-like server farm or a high-end HPC PC cluster is huge. To overcome these problems, we propose the concept of a memory and network enhancer equipped with scatter and gather vector access functions, high-performance network connectivity, and capacity extensibility. Communication mechanisms named LHS and LHC are also proposed; these are architectures for reducing the latency of mixed messages carrying small control data and a large data body. Examples of killer applications for this new type of hardware are presented. This paper presents not only concepts and simulations but also real hardware prototypes named DIMMnet-2 and DIMMnet-3, and evaluates both memory and network issues. We evaluate the module with the NAS CG benchmark (class C) and the Wisconsin benchmarks as applications with memory issues. Although evaluating CG class C is difficult with conventional cycle-accurate simulation methods, we obtained results for class C with our own method. We find that the module can improve maximum performance by a factor of about 25 on the Wisconsin benchmarks. However, results on a cache-based PC show that cache-line flushing degrades the acceleration ratio. This shows the high potential of the proposed extended memory module in combination with processors using DMA-based main memory access, such as the SPUs on the Cell/B.E., which do not need cache-line flushing. The LHS and LHC communication mechanisms are also evaluated, showing their effects on latency.

15.
Advances in computer technology are now so profound that the arithmetic capability and repertoire of computers can and should be expanded. Nowadays the elementary floating-point operations +, −, ×, / give computed results that coincide with the rounded exact result for any operands. Advanced computer arithmetic extends this accuracy requirement to all operations in the usual product spaces of computation: the real and complex vector spaces as well as their interval correspondents. This enhances the mathematical power of the digital computer considerably. A new computer operation, the scalar product, is fundamental to the development of advanced computer arithmetic. This paper studies the design of arithmetic units for advanced computer arithmetic. Scalar product units are developed for different kinds of computers such as personal computers, workstations, mainframes, supercomputers and digital signal processors. The new expanded computational capability is gained at modest cost. The units bring into modern computer hardware a methodology that was available on old calculators before the electronic computer entered the scene. In general the new arithmetic units increase both the speed of computation and the accuracy of the computed result. The circuits developed in this paper show that there is no way to compute an approximation of a scalar product faster than the correct result. A collection of constructs through which a source language may accommodate advanced computer arithmetic is described, and the development of programming languages in the context of advanced computer arithmetic is reviewed. The simulation of the accurate scalar product on existing, conventional processors is discussed. Finally, the theoretical foundation of advanced computer arithmetic is reviewed and a comparison with other approaches to achieving higher accuracy in computation is given. Shortcomings of existing processors and standards are discussed.
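A small software analogue of the accurate scalar product (exact rational accumulation stands in for the long fixed-point accumulator the paper's hardware uses; the function name is illustrative): accumulate the exact value and round only once at the end.

```python
from fractions import Fraction

def dot_correctly_rounded(x, y):
    """Accumulate the dot product exactly, then round once at the end.
    Fraction(float) is exact, so no intermediate rounding occurs."""
    exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))
    return float(exact)

x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)))   # naive: 0.0 (catastrophic cancellation)
print(dot_correctly_rounded(x, y))        # 1.0 (correctly rounded)
```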

16.
Because of the ever-increasing size of the output data from scientific simulations, supercomputers are increasingly relied upon to generate visualizations. One such use is generating field lines from large-scale flow fields. When generating field lines in parallel, the vector field is generally decomposed into blocks, which are then assigned to processors. Since different regions of the vector field can have different flow complexity, processors require varying amounts of computation time to trace their particles, causing load imbalance and thus limiting the performance speedup. To achieve load-balanced streamline generation, we propose a workload-aware partitioning algorithm that decomposes the vector field into partitions with near-equal workloads. Since actual workloads are unknown beforehand, we propose a workload estimation algorithm to predict the workload in the local vector field, using a graph-based representation of the field to generate these estimates. Once the workloads have been estimated, our partitioning algorithm is applied hierarchically to distribute the workload across all partitions. We examine the performance of our workload estimation and workload-aware partitioning algorithms in several timing studies, which demonstrate that these methods achieve better scalability with little overhead.
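A simplified 1D stand-in for the partitioning step (illustrative, not the paper's hierarchical algorithm): given per-block workload estimates, group consecutive blocks so that each partition's estimated workload is close to the global average.

```python
def partition_by_workload(estimates, nparts):
    """Greedily cut a sequence of per-block workload estimates into
    nparts contiguous groups with near-equal total workload."""
    target = sum(estimates) / nparts
    parts, current, acc = [], [], 0.0
    for i, w in enumerate(estimates):
        current.append(i)
        acc += w
        if acc >= target and len(parts) < nparts - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

# Blocks with uneven flow complexity split into 3 balanced partitions.
print(partition_by_workload([5, 1, 1, 1, 8, 2, 2, 4], 3))
# [[0, 1, 2, 3], [4], [5, 6, 7]]  -- estimated workloads 8, 8, 8
```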

17.
One of the many interesting architectural features of the CRAY-2 supercomputer is that each processor has access to 16K 64-bit words of local memory. This is in addition to the extremely large, 268-million-word common memory that is accessible by all four processors. By using local memory judiciously, it is possible to achieve increased performance on the CRAY-2. This is partly because accesses to local memory can be done simultaneously with accesses to common memory and other operations. Also, it is slightly faster to start up a vector access to local memory, and a processor does not have to compete with other processors when accessing its local memory. In this paper, we present an algorithm for computing the fast Fourier transform that takes advantage of the CRAY-2's local memory. It operates by solving subproblems, which are themselves Fourier transforms, entirely within local memory. By doing so it achieves a performance increase of between 25 and 40 percent over an equivalent algorithm that uses only common memory, and for some input sizes is able to outperform the CRAY-2 library FFT.
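The subproblem decomposition described above is in the spirit of the classic four-step FFT, in which each small sub-transform can be solved entirely within a fast local memory. A hedged NumPy sketch of that decomposition (not the paper's CRAY-2 kernel):

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """FFT of length n1*n2 via n1 FFTs of length n2, twiddle factors,
    and n2 FFTs of length n1 -- each sub-FFT small enough to fit in
    a local memory in a scheme like the one the paper describes."""
    n = n1 * n2
    a = x.reshape(n2, n1).T.copy()             # a[j1, j2] = x[j1 + n1*j2]
    b = np.fft.fft(a, axis=1)                  # n1 FFTs of length n2
    j1 = np.arange(n1)[:, None]
    k2 = np.arange(n2)[None, :]
    b *= np.exp(-2j * np.pi * j1 * k2 / n)     # twiddle factors
    c = np.fft.fft(b, axis=0)                  # n2 FFTs of length n1
    return c.reshape(-1)                       # X[k2 + n2*k1] = c[k1, k2]

x = np.random.rand(32) + 1j * np.random.rand(32)
assert np.allclose(four_step_fft(x, 4, 8), np.fft.fft(x))
```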

18.
The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed-memory Beowulf cluster is examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on a communication-intensive transpose operation involving P processors. Communication is based upon the standard portable Message Passing Interface (MPI). We show that 1/P scaling of execution time at fixed problem size N³ (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged as non-overlapping pairwise exchanges between processors, thus eliminating blocking when standard fast Ethernet interconnects are employed. This method provides the basis for scalable and efficient spectral-method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.
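A serial NumPy emulation of the transpose-based decomposition (illustrative; the real code distributes slabs across processors with MPI): transform the two axes local to each slab, transpose globally, then transform the remaining axis.

```python
import numpy as np

def fft3d_transposed(u):
    """3D FFT via the 1D (slab) decomposition: x-y FFTs local to each
    z-slab, a global transpose (the communication-intensive step in the
    parallel code), then z FFTs. Returns the transform with axes reversed."""
    v = np.fft.fftn(u, axes=(1, 2))            # local x-y transforms
    v = np.transpose(v, (2, 1, 0)).copy()      # the communication-heavy step
    return np.fft.fft(v, axis=2)               # formerly distributed axis

u = np.random.rand(8, 8, 8)
assert np.allclose(fft3d_transposed(u), np.fft.fftn(u).transpose(2, 1, 0))
```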

19.
There is extensive literature on divisible load theory (DLT). According to DLT, a load can be divided into arbitrarily many independent parts, each of which can be processed independently by a processor. DLT has also been examined with processors that cheat the algorithm, i.e., processors that do not report their true computation rates. According to the literature, if the processors do not report their true computation rates, the divisible load scheduling model fails to achieve its optimal performance. This paper focuses on divisible load scheduling where the processors cheat the algorithm, and proposes a multi-objective method for it. The goal is to improve the performance of divisible load scheduling when the processors cheat. The proposed method has been examined on several function approximation problems, and the experimental results indicate approximately a 66% decrease in finish time in the best case.
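For intuition, a minimal DLT split (communication costs ignored; illustrative, not the paper's multi-objective method): each processor receives a fraction of the load proportional to its reported speed so that all finish simultaneously. A processor that misreports its speed skews exactly this split, which is the failure mode the paper addresses.

```python
def divisible_split(total_load, speeds):
    """Split a divisible load in proportion to reported speeds so that
    every processor finishes at the same time (no communication cost)."""
    s = sum(speeds)
    return [total_load * v / s for v in speeds]

chunks = divisible_split(100.0, [4.0, 2.0, 1.0])
print(chunks)                                            # [57.14..., 28.57..., 14.28...]
print([c / v for c, v in zip(chunks, [4.0, 2.0, 1.0])])  # equal finish times
```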

20.
Implementations of the domain decomposition, SOR, multigrid and conjugate gradient methods on the CM-5 and Cray C-90 are described for Laplace's equation on the unit square and an L-shaped region. The domain decomposition method uses the Schwarz alternating method. In each domain we take a one-dimensional FFT to convert the problem into tridiagonal systems, which are solved by the scientific libraries installed on the CM-5 and the Cray C-90. On the CM-5, V-cycle multigrid with symmetric smoothings on P-1 finite element spaces is run with red/black Gauss-Seidel relaxation; multigrid with natural-order Gauss-Seidel relaxation is used on the Cray C-90. Similarly, natural-order SOR is used on the Cray C-90 while red/black SOR is performed on the CM-5. Multigrid is the fastest method on the CM-5, and the three methods other than SOR give similar performance on the Cray C-90. This research was partially supported by the National Science Foundation under Grant No. CDA-9024618.
