首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied in solving series of cosmological problems. N‐body focuses on the motion of the interaction of N particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing units (GPUs) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in tremendous waste of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses CPU and GPU to process the calculation of short‐range force. To ensure CPU and GPU workload balance, a dynamic task allocation scheme is proposed according to the computational performance difference between the CPU and GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particles in million‐size magnitudes. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy obtained overall speedup and partial speedup factors of 6% and 20%, respectively. Furthermore, the scalability of the hybrid strategy is very fine – its performance will be enhanced when the problem scale is increasing. However, this strategy also has its limitation that the performance enhancement will be decreasing if the ratio(the number of CPU cores divides that of the GPU cards) reduces. Finally, in our hybrid strategy, the CPU coefficient of utilization improved by 17.14% or better. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

2.
For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU–GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.  相似文献   

3.
Graphics processing units (GPUs) offer parallel computing power that usually requires a cluster of networked computers or a supercomputer to accomplish. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble, a single CPU core and a parallel GPU implementations. Using Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load of the device resources and demonstrate by experimental results that the efficiency of this algorithm is bounded by available GPU resource. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.  相似文献   

4.
Li  Kun  Li  Shigang  Huang  Shan  Chen  Yifeng  Zhang  Yunquan 《The Journal of supercomputing》2020,76(7):5501-5520

In the molecular dynamics simulation, an important step is the establishment of neighbor list for each particle, which involves the distance calculation for each particle pair in the simulation space. However, the distance calculation will cause costly floating-point operations. In this paper, we propose a novel algorithm, called Fast Neighbor List, which establishes the neighbor lists mainly using the bitwise operations. Firstly, we design a data layout, which uses an integer value to represent the three-dimensional coordinates of a particle. Then, a bunch of bitwise operations and two subtraction operations are used to judge whether the distance between a pair of particles is within the cutoff radius. We demonstrate that our algorithm can deal with the periodic boundary seamlessly. We also use single instruction multiple data (SIMD) instructions to further improve the performance. We implement our algorithm on Intel Xeon E5-2670, ARM v8, and Sunway many-core processors, respectively. Compared with the traditional method, our algorithm achieves on average 1.79x speedup on Intel Xeon E5-2670 processor, 3.43x speedup on ARM v8 processor, and 4.03x speedup on Sunway many-core processor. After using SIMD instructions, our algorithm achieves on average 2.64x speedup and 14.43x speedup on Intel Xeon E5-2670 and ARM v8 processors, respectively.

  相似文献   

5.
We present simple changes to the cell method for neighbor list construction that enable it to be used in molecular dynamics studies of systems subject to a planar elongational flow field. The modifications for planar elongational flow are similar to those required for planar shear flow and should be easy to incorporate into any cell neighbor list method that is used in simulations of homogeneous shear. The execution time of the code at equilibrium is shown to be proportional to the number of particles N. The introduction of the modifications allowing shear, and more importantly, elongational flow are shown to affect the performance of the code in both CPU time and memory usage. The modifications to enable the simulation of planar elongational flow using the cell method of neighbor list construction will not introduce any higher order dependency if applied to code that is N-dependent in planar shear flow. We use this code to study large systems of diatomic molecules at low strain-rates, and find that the linear regime in planar elongational flow can be determined by the ratio of the two planar elongational viscosity functions. The properties investigated in planar shear flow, such as angular velocity and alignment angle, were inconsistent with the shear viscosity results in their evaluation of where the linear regime ends. The high precision of the results allowed us to accurately determine the coefficients in the retarded-motion expansion.  相似文献   

6.
An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced and its portability to graphics processing units (GPUs) is explored. The computations are performed on the NVIDIA GTX480 GPU. The results are compared with those obtained on a single core of Intel Core i7-920 (2.67 GHz) in terms of calculation runtime. The BCR linear solver achieves the maximum speedup of 5.84x with block size of 32 over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves the maximum total speedup of 7.45x for the piecewise quadratic solution over the CPU platform in double precision.  相似文献   

7.
We implemented a GPU-based parallel code to perform Monte Carlo simulations of the two-dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns independent random number generators to each thread. The implementation allows to simulate systems up to ~109 spins with an average time per spin flip of 0.147 ns on the fastest GPU card tested, representing a speedup up to 155×, compared with an optimized serial code running on a high-end CPU.The possibility of performing high speed simulations at large enough system sizes allowed us to provide a positive numerical evidence about the existence of metastability on very large systems based on Binder?s criterion, namely, on the existence or not of specific heat singularities at spinodal temperatures different of the transition one.  相似文献   

8.
Cross-Approximate Entropy (Cross-ApEn) is a useful measure to quantify the statistical dissimilarity of two time series. In spite of the advantage of Cross-ApEn over its one-dimensional counterpart (Approximate Entropy), only a few studies have applied it to biomedical signals, mainly due to its high computational cost. In this paper, we propose a fast GPU-based implementation of the Cross-ApEn that makes feasible its use over a large amount of multidimensional data. The scheme followed is fully scalable, thus maximizes the use of the GPU despite of the number of neural signals being processed. The approach consists in processing many trials or epochs simultaneously, with independence of its origin. In the case of MEG data, these trials can proceed from different input channels or subjects. The proposed implementation achieves an average speedup greater than 250× against a CPU parallel version running on a processor containing six cores. A dataset of 30 subjects containing 148 MEG channels (49 epochs of 1024 samples per channel) can be analyzed using our development in about 30 min. The same processing takes 5 days on six cores and 15 days when running on a single core. The speedup is much larger if compared to a basic sequential Matlab® implementation, that would need 58 days per subject. To our knowledge, this is the first contribution of Cross-ApEn measure computation using GPUs. This study demonstrates that this hardware is, to the day, the best option for the signal processing of biomedical data with Cross-ApEn.  相似文献   

9.
To improve the simulation efficiency of turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based simulation solution of lattice Boltzmann method for large eddy simulation (LES) using multiple graphics processing units (GPUs) is proposed. Our solution adopts the “collision after propagation” lattice evolution way and puts the misaligned propagation phase at global memory read process. The latest GPU platform allows a single CPU thread to control up to four GPUs that run in parallel. In order to make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement Smagorinsky model and Vreman model respectively to verify our multi-GPU solution. These two LES models have different relaxation time calculation behavior and lead to different CUDA implementation characteristics. The implementation based on Smagorinsky model achieves 190 times speedup over the sequential implementation on CPU, while the implementation based on Vreman model archives more than 90 times speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well on multiple GPUs. Therefore large-scale (up to 10,240 $\times $ 10,240 lattices) LES–LBM simulation becomes possible at a low cost, even using double-precision floating point calculation.  相似文献   

10.
Dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs by using NVIDIA’s Compute Unified Device Architecture (CUDA) in this paper. Data communication between each GPU is executed based on the POSIX thread. Compared with the single-GPU implementation, this implementation can provide faster computation speed and more storage space to perform simulations on a significant larger system. In benchmark, the performance of GPUs is compared with that of Material Studio running on a single CPU core. We can achieve more than 90x speedup by using three C2050 GPUs to perform simulations on an 80∗80∗80 system. This implementation is applied to the study on the dispersancy of lubricant succinimide dispersants. A series of simulations are performed on lubricant–soot–dispersant systems to study the impact factors including concentration and interaction with lubricant on the dispersancy, and the simulation results are agreed with the study in our present work.  相似文献   

11.
Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine responsible for computing the actions of a Dirac operator. This paper describes an experience in parallelizing this kernel routine. We explore parallelization granularities for this kernel routine on Graphical Processing Units (GPUs). We show that fine-grained parallelism can outperform coarse-grained parallelization, given that control-flow and communication effects are minimized. We propose two techniques for transforming control-flow-based code to control-free code. We also show how to reduce the communication effect by optimizing for commonly used sequences of calls to this routine. In our implementation on NVIDIA 8800 GTX, we were able to achieve an 8.3x speedup over an SSE2 optimized version on 2.8 GHz Intel Xeon CPU.  相似文献   

12.
Transient simulation in circuit simulation tools, such as SPICE and Xyce, depend on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grow, the prevalence of multicore architectures enable us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently, and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate irregular structure that arise in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to memory hierarchy found in multicore processors. We present an OpenMP based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to state-of-the-art solver KLU. Basker outperforms Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.  相似文献   

13.
A parallel implementation via CUDA of the dynamic programming method for the knapsack problem on NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for computational tests and processing times obtained with the parallel code are compared to the sequential one on a CPU with an Intel Xeon 3.0 GHz. The results show a speedup factor of 26 for large size problems. Furthermore, in order to limit the communication between the CPU and the GPU, a compression technique is presented which decreases significantly the memory occupancy.  相似文献   

14.
耗散粒子动力学(DPD)模拟是一种重要的研究流体动力学特性的计算模拟方法,基于Intel MIC平台设计实现了面向大规模耗散粒子动力学模拟,充分结合了DPD模拟本身的特性和MIC平台的特征。对DPD模拟中的近邻列表构建和短程作用力关键代码实现了向量化优化,在CPU和MIC协处理器之间采用任务计算负载平衡机制,支持MPI进程内线程数量负载平衡控制。分别在原型程序上和LAMMPS集成中做了性能对比分析,实验结果显示了引入相关优化技术的有效性,为进一步研究面向MIC众核平台的分子动力学相关工作奠定了基础。  相似文献   

15.
The popularity and availability of Internet connection has opened up the opportunity for network-centric collaborative work that was impossible a few years ago. Contending traffic flows in this collaborative scenario share different kinds of resources such as network links, buffers, and router CPU. The goal should hence be overall fairness in the allocation of multiple resources rather than a specific resource. In this paper, firstly, we present a novel QoS-aware resource scheduling algorithm called Weighted Composite Bandwidth and CPU Scheduler (WCBCS), which jointly allocates the fair share of the link bandwidth as well as processing resource to all competing flows. WCBCS also uses a simple and adaptive online prediction scheme for reliably estimating the processing times of the incoming data packets. Secondly, we present some analytical results, extensive NS-2 simulation work, and experimental results from our implementation on Intel IXP2400 network processor. The simulation and implementation results show that our low complexity scheduling algorithm can efficiently maximise the CPU and bandwidth utilisation while maintaining guaranteed Quality of Service (QoS) for each individual flow.  相似文献   

16.
As today’s standard screening methods frequently fail to detect breast cancer before metastases have developed, early diagnosis is still a major challenge. With the promise of high-quality volume images, three-dimensional ultrasound computer tomography is likely to improve this situation, but has high computational needs. In this work, we investigate the acceleration of the ray-based transmission reconstruction by a GPU-based implementation of the iterative numerical optimization algorithm TVAL3. We identified the regular and transposed sparse-matrix–vector multiply as the performance limiting operations. For accelerated reconstruction we propose two different concepts and devise a hybrid scheme as optimal configuration. In addition we investigate multi-GPU scalability and derive the optimal number of devices for our two primary use-cases: a fast preview mode and a high-resolution mode. In order to achieve a fair estimation of the speedup, we compare our implementation to an optimized CPU version of the algorithm. Using our accelerated implementation we reconstructed a preview 3D volume with 24,576 unknowns, a voxel size of (8 mm)3 and approximately 200,000 equations in 0.5 s. A high-resolution volume with 1,572,864 unknowns, a voxel size of (2mm)3 and approximately 1.6 million equations was reconstructed in 23 s. This constitutes an acceleration of over one order of magnitude in comparison to the optimized CPU version.  相似文献   

17.
Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years.At present,the common approach to programming GPU units is to write GPU specific cod...  相似文献   

18.
GPU上计算流体力学的加速   总被引:1,自引:0,他引:1  
本文将计算流体力学中的可压缩的纳维叶-斯托克斯(Navier-Stokes),不可压缩的Navier-Stokes和欧拉(Euler)方程移植到NVIDIA GPU上.模拟了3个测试例子,2维的黎曼问题,方腔流问题和RAE2822型的机翼绕流.相比于CPU,我们在GPU平台上最高得到了33.2倍的加速比.为了最大程度提...  相似文献   

19.
petaPar 粒子模拟程序面向千万亿次级计算,在统一框架下实现两种广受关注的粒子模拟算法:光滑粒子流体动力学(Smoothed Particle Hydrodynamics,SPH)和物质点法(Material Point Method,MPM)。代码支持多种材料模型、强度模型和失效模型,适合模拟大变形、高应变率和流固耦合问题。支持纯 MPI 和 MPI+X 混合两种并行模型。系统具有可容错性,支持无人值守变进程重启。在Titan 上测试表明,petaPar 可线性扩展到 26 万 CPU 核,SPH 和 MPM 算法并行效率相对 8 192 核分别为 87% 和 90%。  相似文献   

20.
《Parallel Computing》2007,33(4-5):314-327
CPU cycle sharing among distributed heterogeneous computers is the key function in large-scale volunteer computing and desktop grid applications. One important problem in large-scale distributed cycle sharing system is how to account for the amount of computation work performed by a CPU cycle provider, in a uniform and portable fashion across heterogeneous hardware and operating system platforms. Such an accounting mechanism is especially desirable when CPU resources are traded and a lack of uniform workload accounting will hinder the enforcement of market-driven CPU pricing/trading policies in distributed cycle sharing systems. Java Virtual Machine (JVM) has proved to be a good match for distributed cycle sharing because of its abilities to run applications on a wide variety of platforms without modification (portability) and to host untrusted applications (safety). In this paper, we present the design, implementation, and evaluation of an efficient, application-transparent virtual cycle accounting scheme integrated into JVM. Our scheme achieves portable workload accounting across heterogeneous computing platforms by accounting for JVM virtual instructions instead of real processor cycles. Different from the existing JVM CPU accounting mechanisms that involve bytecode rewriting, our scheme is transparent to applications and does not require visible changes to application and library code interfaces which would break applications that use Reflection API. Moreover, our scheme is efficient via the use of processor registers for accounting. Our experimental results demonstrate both high accounting accuracy and low runtime overhead of virtual cycle accounting.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号