Similar Documents
20 similar documents found (search time: 687 ms).
1.
We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware cost is spent on communication). According to our benchmark measurements this type of communication results in a communication time fraction of around 40% for lattices up to 48³·96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.

2.
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs), though with different evaluation tools each. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA, which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.

3.
We report an empirical study of n-gram posterior probability confidence measures for statistical machine translation (SMT). We first describe an efficient and practical algorithm for rapidly computing n-gram posterior probabilities from large translation word lattices. These probabilities are shown to be a good predictor of whether or not the n-gram is found in human reference translations, motivating their use as a confidence measure for SMT. Comprehensive n-gram precision and word coverage measurements are presented for a variety of different language pairs, domains and conditions. We analyze the effect on reference precision of using single or multiple references, and compare the precision of posteriors computed from k-best lists to those computed over the full evidence space of the lattice. We also demonstrate improved confidence by combining multiple lattices in a multi-source translation framework.
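To make the confidence measure concrete, the sketch below estimates n-gram posteriors from a k-best list rather than from a full word lattice: each hypothesis score is normalized into a posterior, and an n-gram's confidence is the total posterior mass of the hypotheses that contain it. This is only a minimal Python illustration of the idea, not the paper's lattice algorithm; all names, the scoring scale, and the toy data are assumptions.

```python
import math
from collections import defaultdict

def ngram_posteriors(kbest, n=2):
    """Estimate n-gram posterior probabilities from a k-best list.

    kbest: list of (tokens, log_score) pairs produced by a decoder.
    Returns a dict mapping each n-gram (tuple) to the total posterior
    mass of the hypotheses that contain it at least once.
    """
    # Normalize hypothesis log-scores into posteriors (softmax).
    max_log = max(score for _, score in kbest)
    weights = [math.exp(score - max_log) for _, score in kbest]
    z = sum(weights)
    posteriors = [w / z for w in weights]

    ngram_post = defaultdict(float)
    for (tokens, _), p in zip(kbest, posteriors):
        # Count each distinct n-gram once per hypothesis.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for gram in seen:
            ngram_post[gram] += p
    return dict(ngram_post)

# Toy usage: two hypotheses with (hypothetical) log-domain model scores.
kbest = [("the cat sat".split(), -1.2), ("a cat sat".split(), -2.0)]
print(ngram_posteriors(kbest, n=2))
```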

4.
Parallel Computing, 1999, 25(10–11): 1257–1280
We present physics results from large-scale simulations of Quantum Chromodynamics (QCD) on the space-time lattice carried out with the CP-PACS computer. The CP-PACS is a massively parallel system with a peak speed of 614 Gflops and 320 Gbyte of main memory, developed at the Center for Computational Physics, University of Tsukuba. Since the start of full operation of CP-PACS in October 1996, precision calculation of the light hadron spectrum in the quenched approximation of QCD and a systematic attempt at a calculation without this approximation have been pursued. The physics motivations of these calculations, the computational difficulties, and the advances brought in by the CP-PACS are discussed. The performance of the CP-PACS for lattice QCD computations is described in a companion paper by S. Aoki et al.

5.
In this paper we introduce a mixed formulation of the Bingham fluid flow problem. We consider both the original and a regularized version of the problem, where a parameter ε is introduced, forcing the entire domain to be formally a fluid region. In general, common solvers for the regularized problem experience a performance degradation as the parameter ε gets smaller. The method studied here introduces an auxiliary tensor variable and shows enhanced numerical properties for small values of ε. Good performance is also observed in the non-regularized case. The well-posedness of the regularized problem and the equivalence, at the continuous level, between the original (primitive variables) and the mixed formulation are demonstrated. We analyze properties of linearized problems that are relevant for the convergence of numerical solvers. A finite element method for the mixed formulation is discussed. Numerical results confirm the predicted better performance of the mixed formulation when compared to the primitive variables formulation. A comparison to a non-regularized solver based on the augmented Duvaut–Lions–Glowinski formulation of the problem is carried out as well.

6.
This paper presents, to the author's knowledge, the first graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe. We present the implementation in NVIDIA's Compute Unified Device Architecture (CUDA) and compare the performance to other similar programs in chaotic inflation models. We report speedups between one and two orders of magnitude, depending on the hardware and software used, while achieving small errors in single precision. Simulations that used to take roughly one day to compute can now be done in hours, and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. The program is available at http://www.physics.utu.fi/theory/particlecosmology/cudaeasy/ under the GNU General Public License.

7.
This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a linear system of equations using Gaussian elimination in single precision, with iterative refinement of the solution to full double-precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than the performance offered by the standard double-precision algorithm. The code is effectively an implementation of the high-performance LINPACK benchmark, as it meets all of the requirements concerning the problem being solved and the numerical properties of the solution. Copyright © 2007 John Wiley & Sons, Ltd.
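The general recipe described here (factorize in single precision, then refine the solution to double-precision accuracy) can be sketched in a few lines of NumPy/SciPy. This is only an illustration of the mixed-precision iterative refinement technique, not the Cell-specific code; the iteration count, matrix size, and names are arbitrary choices.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b: LU-factorize in float32, refine the solution in float64."""
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                            # cheap single-precision LU
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double precision
        dx = lu_solve((lu, piv), r.astype(np.float32))  # correction reuses the same LU
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))    # small relative residual
```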

8.
Fast neural net simulation with a DSP processor array
This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-bit floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system, with 3.8 Gflops peak performance, consumes less than 800 W of electrical power and fits into a 19-inch rack. While reaching the speed of modern supercomputers, MUSIC can still be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a single user a computing performance that was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications.

9.
It is well known that block Krylov subspace solvers work efficiently for some cases of the solution of differential equations with multiple right-hand sides. In lattice QCD, the calculation of physical quantities on a given configuration requires solving the Dirac equation with multiple sources. We show that a new block Krylov subspace algorithm recently proposed by the authors reduces the computational cost significantly, without losing numerical accuracy, for the solution of the O(a)-improved Wilson-Dirac equation.
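For readers unfamiliar with block Krylov methods, the sketch below shows a generic block conjugate gradient for a symmetric positive definite system with several right-hand sides. It is not the authors' algorithm and does not involve the Wilson-Dirac operator; it only illustrates why blocking helps, namely that one block matrix product serves all sources at once.

```python
import numpy as np

def block_cg(A, B, tol=1e-10, maxiter=500):
    """Generic block CG for an SPD matrix A and a block of right-hand sides B."""
    X = np.zeros_like(B)
    R = B - A @ X
    P = R.copy()
    RtR = R.T @ R
    for _ in range(maxiter):
        AP = A @ P                                   # one block product for all sources
        alpha = np.linalg.solve(P.T @ AP, RtR)
        X += P @ alpha
        R -= AP @ alpha
        RtR_new = R.T @ R
        if np.sqrt(np.trace(RtR_new)) < tol:
            break
        beta = np.linalg.solve(RtR, RtR_new)
        P = R + P @ beta
        RtR = RtR_new
    return X

# Toy SPD system with 4 right-hand sides (stand-ins for multiple sources).
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
B = rng.standard_normal((200, 4))
X = block_cg(A, B)
print(np.linalg.norm(A @ X - B))
```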

10.
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. In this paper, we develop a general purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single precision floating point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction.
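To make the kernel concrete, here is a minimal CPU-side D2Q9 BGK collision-and-streaming step in single-precision NumPy. It only illustrates the type of lattice update that such codes port to the GPU; the grid size, relaxation time, and initial state are arbitrary, and none of this is the paper's code.

```python
import numpy as np

# D2Q9 lattice: discrete velocities and their weights.
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4, dtype=np.float32)

def lbm_step(f, tau=np.float32(0.6)):
    """One BGK collision + streaming step on a periodic grid, float32 throughout."""
    rho = f.sum(axis=0)                                            # density
    u = np.tensordot(c.T.astype(np.float32), f, axes=1) / rho      # velocity (2, nx, ny)
    cu = np.tensordot(c.astype(np.float32), u, axes=([1], [0]))    # c_i . u  (9, nx, ny)
    usq = (u ** 2).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu ** 2 - 1.5 * usq)
    f = f - (f - feq) / tau                                        # BGK collision
    for i, (cx, cy) in enumerate(c):                               # streaming by shifts
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f.astype(np.float32)

nx = ny = 128
f = np.tile(w[:, None, None], (1, nx, ny)).astype(np.float32)      # fluid at rest
f = lbm_step(f)
print(f.dtype, float(f.sum()))                                     # mass is conserved
```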

11.
Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive, and Beowulf clusters or even supercomputers are often used to run the codes that implement these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author's knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. An accuracy evaluation comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement, depending on the graphics card used, the problem size, and the precision, when compared to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support visualization of the simulation results, so that the whole MHD simulation and visualization process can be performed entirely on GPUs.

12.
Parallel Computing, 1999, 25(10–11): 1357–1370
The locally lexicographic symmetric successive overrelaxation algorithm (ll-SSOR) is the most effective parallel preconditioner known for iterative solvers used in lattice gauge theory. After reviewing the basic properties of ll-SSOR, the focus of this contribution is put on its parallel aspects: the administrative overhead of the parallel implementation of ll-SSOR, which is due to many conditional operations, decreases its efficiency by up to one third. A simple generalization of the algorithm is proposed that allows the application of the lexicographic ordering along specified axes, while along the other dimensions odd–even preconditioning is used. In this way one can tune the preconditioner towards optimal performance by balancing ll-SSOR effectiveness and administrative overhead.
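For context, the sketch below applies a standard, globally ordered SSOR preconditioner to a residual vector using sparse triangular solves. The locally lexicographic ordering and the odd–even blending discussed above are not reproduced; the matrix is a toy SPD example and the relaxation parameter is arbitrary.

```python
import numpy as np
from scipy.sparse import diags, tril, triu, eye, random as sprandom
from scipy.sparse.linalg import spsolve_triangular

def ssor_preconditioner(A, omega=1.0):
    """Return a function applying z = M^{-1} r for the SSOR preconditioner
    M = (w/(2-w)) (D/w + L) D^{-1} (D/w + U), where A = L + D + U."""
    d = A.diagonal()
    D = diags(d)
    lower = (D / omega + tril(A, k=-1)).tocsr()          # D/w + strictly lower part
    upper = (D / omega + triu(A, k=1)).tocsr()           # D/w + strictly upper part

    def apply(r):
        y = spsolve_triangular(lower, r, lower=True)     # forward sweep
        z = spsolve_triangular(upper, d * y, lower=False)  # backward sweep
        return (2.0 - omega) / omega * z
    return apply

# Toy sparse SPD matrix and a residual vector.
n = 200
A = sprandom(n, n, density=0.02, random_state=0)
A = (A @ A.T + 4 * eye(n)).tocsr()
apply_Minv = ssor_preconditioner(A, omega=1.2)
print(apply_Minv(np.ones(n))[:5])
```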

13.
Numerical characteristics of various Kalman filter algorithms are illustrated with a realistic orbit determination study. The case study of this paper highlights the numerical deficiencies of the conventional and stabilized Kalman algorithms. Computational errors associated with these algorithms are found to be so large as to obscure important mismodeling effects and thus cause misleading estimates of filter accuracy. The positive result of this study is that the U-D covariance factorization algorithm has excellent numerical properties and is computationally efficient, having CPU costs that differ negligibly from the conventional Kalman costs. Accuracies of the U-D filter using single precision arithmetic consistently match the double precision reference results. Numerical stability of the U-D filter is further demonstrated by its insensitivity to variations in the a priori statistics.
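The core of the U-D approach is factoring the covariance as P = U D U^T with U unit upper triangular and D diagonal, so that the filter never forms P explicitly. The sketch below computes that factorization in NumPy as an illustration only; the full Bierman/Thornton measurement and time updates of the filter are not shown.

```python
import numpy as np

def udu_factor(P):
    """Factor a symmetric positive definite P as P = U @ diag(d) @ U.T,
    with U unit upper triangular and d the diagonal of D."""
    P = P.copy().astype(float)
    n = P.shape[0]
    U = np.eye(n)
    d = np.zeros(n)
    for j in range(n - 1, -1, -1):            # work backwards over columns
        d[j] = P[j, j]
        U[:j, j] = P[:j, j] / d[j]
        P[:j, :j] -= d[j] * np.outer(U[:j, j], U[:j, j])
    return U, d

# Toy covariance matrix.
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
P = M @ M.T + 5 * np.eye(5)
U, d = udu_factor(P)
print(np.allclose(U @ np.diag(d) @ U.T, P))   # True: the factorization is exact
```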

14.
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We compare single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the potential of the processor for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA® GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed. Copyright © 2009 John Wiley & Sons, Ltd.

15.
The ability to generate crew pairings quickly is essential to solving the airline crew scheduling problem. Although techniques for doing so are well established, they are also highly customized and require significant implementation effort. This greatly impedes researchers studying important problems such as robust planning, integrated planning, and automated recovery, all of which also require the generation of crew pairings. As an alternative, we present an integer programming (IP) approach to generating crew pairings, which can be solved via traditional methods such as branch-and-bound using off-the-shelf commercial solvers. This greatly facilitates the prototyping and testing of new research ideas. In addition, we suggest that our modeling approach, which uses both connection variables and marker variables to capture the non-linear cost function and constraints of the crew scheduling problem, can be applicable in other scheduling contexts as well. Computational results using data from a major US hub-and-spoke carrier demonstrate the performance of our approach.

16.
We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards, are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively.
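As a plain point of reference, the following NumPy code computes a 7-point stencil (center plus six face neighbours) over the interior of a 3-D array. The shared-memory tiling, halo loading, and prefetching described above are GPU-side optimizations not reproduced here; the coefficients and array size are placeholders.

```python
import numpy as np

def stencil7(u, c0=-6.0, c1=1.0):
    """Apply a 7-point stencil to the interior points of a 3-D array;
    boundary values are left unchanged."""
    out = u.copy()
    out[1:-1, 1:-1, 1:-1] = (
        c0 * u[1:-1, 1:-1, 1:-1]
        + c1 * (u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
                + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
                + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2])
    )
    return out

u = np.random.rand(64, 64, 64).astype(np.float32)   # single-precision grid
v = stencil7(u)
print(v.shape, v.dtype)
```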

17.
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to the low efficiency of its fragmented data exchange. Our approach makes 3D decomposition practical on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to the common MPI-only 2D decomposition implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.

18.
Significant discrepancies have been observed and discussed regarding the lattice stability of Cr between the predictions of ab initio calculations and the CALPHAD approach. In the current work, we carefully examined the possible structures for pure Cr and reviewed the history of how Kaufman originally determined the Gibbs energy of FCC-Cr in the 1970s. The reliability of the Cr lattice stability derived by the CALPHAD and ab initio approaches was systematically discussed. It is concluded that the Cr lattice stability based on the CALPHAD approach has large uncertainty. Meanwhile, we cannot claim that the ab initio H(FCC-Cr) is error-free, as FCC-Cr is an unstable phase under ambient conditions. The present work shows that the ab initio H(FCC-Cr) can be a viable scientific approach. As both approaches have their limitations, the present work proposes to integrate the ab initio results into the CALPHAD platform for the development of the next-generation CALPHAD database. The Fe-Cr and Ni-Cr binary systems were chosen as two case studies demonstrating the capability to adopt the ab initio Cr lattice stability directly into the current CALPHAD database framework.

19.
We present the novel parallel linear least squares solvers ARPLS-IR and ARPLS-MPIR for dense overdetermined linear systems. All internode communication of our ARPLS solvers arises in the context of all-reduce operations across the parallel system, and therefore they benefit directly from efficient implementations of such operations. Our approach is based on the semi-normal equations, which are in general not backward stable. However, the method is stabilised by using iterative refinement. We show that performing iterative refinement in mixed precision also increases the parallel performance of the algorithm. We consider different variants of the ARPLS algorithm depending on the conditioning of the problem and in this context also evaluate the method of normal equations with iterative refinement. For ill-conditioned systems, we demonstrate that the semi-normal equations with standard iterative refinement achieve the best accuracy compared to other parallel solvers. We discuss the conceptual advantages of ARPLS-IR and ARPLS-MPIR over alternative parallel approaches based on QR factorisation or the normal equations. Moreover, we analytically compare the communication cost to an approach based on communication-avoiding QR factorisation. Numerical experiments on a high performance cluster illustrate speed-ups of up to 3820 on 2048 cores for ill-conditioned tall and skinny matrices over state-of-the-art solvers from DPLASMA or ScaLAPACK.
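The serial idea behind such a solver (solve the semi-normal equations R^T R x = A^T b with R taken from a QR factorization of A, then stabilize with iterative refinement) can be sketched in dense NumPy as below. The all-reduce based parallelization and the mixed-precision refinement of ARPLS are not shown, and the refinement count and problem sizes are arbitrary.

```python
import numpy as np

def sne_refined_lstsq(A, b, refine=3):
    """Least squares via the semi-normal equations R^T R x = A^T b,
    stabilized by iterative refinement (corrected semi-normal equations)."""
    R = np.linalg.qr(A, mode="r")                 # only the triangular factor is kept
    def solve_rtr(rhs):
        y = np.linalg.solve(R.T, rhs)             # forward solve with R^T
        return np.linalg.solve(R, y)              # backward solve with R
    x = solve_rtr(A.T @ b)
    for _ in range(refine):
        r = b - A @ x                             # least-squares residual
        x += solve_rtr(A.T @ r)                   # correction reuses the same R
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 50))               # tall and skinny matrix
b = rng.standard_normal(2000)
x = sne_refined_lstsq(A, b)
print(np.linalg.norm(A.T @ (b - A @ x)))          # normal-equations residual ~ 0
```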

20.
In this paper, an efficient solver for high-dimensional lattice equations is introduced. We present a new concept, the recovery method, to define a bilinear form on the continuous level whose energy is equivalent to that of the original lattice equation. The finite element discretisation of the continuous bilinear form leads to a stiffness matrix which serves as a quasi-optimal preconditioner for the lattice equations. Since a large variety of efficient solvers are available for linear finite element problems, the new recovery method allows these solvers to be applied to unstructured lattice problems.
