Similar literature
1.
In this paper, we introduce a fast and consistent smoothed particle hydrodynamics (SPH) technique which is suitable for convection–diffusion simulations of incompressible fluids. We apply our temporal blending technique to reduce the number of particles in the simulation while smoothly changing quantity fields. Our approach greatly reduces the error introduced in the pressure term when changing particle configurations. Compared to other methods, this enables larger integration time‐steps in the transition phase. Our implementation is fully GPU‐based to take advantage of the parallel nature of particle simulations.

2.
We present a scalable dissipative particle dynamics simulation code, fully implemented on Graphics Processing Units (GPUs) using a hybrid CUDA/MPI programming model, which achieves a 10–30 times speedup on a single GPU over 16 CPU cores and almost linear weak scaling across a thousand nodes. A unified framework is developed within which the efficient generation of the neighbor list and the maintenance of particle data locality are addressed. Our algorithm generates strictly ordered neighbor lists in parallel, while the construction is deterministic and makes no use of atomic operations or sorting. Such a neighbor list leads to optimal data loading efficiency when combined with a two-level particle reordering scheme. A faster in situ generation scheme for Gaussian random numbers is proposed using precomputed binary signatures. We designed custom transcendental functions that are fast and accurate for evaluating the pairwise interaction. The correctness and accuracy of the code are verified through a set of test cases simulating Poiseuille flow and spontaneous vesicle formation. Computer benchmarks demonstrate the speedup of our implementation over the CPU implementation as well as strong and weak scalability. A large-scale simulation of spontaneous vesicle formation consisting of 128 million particles was conducted to further illustrate the practicality of our code in real-world applications.

3.
This paper discusses a parallel collision detection algorithm. Implemented using software executed on ubiquitous Graphics Processing Unit (GPU) cards, the algorithm demonstrates a speedup of two orders of magnitude over a state-of-the-art sequential implementation when handling multimillion object collision detection tasks. GPUs are composed of many (on the order of hundreds) scalar processors that can simultaneously execute an operation; this strength is leveraged in the proposed algorithm, which combines the use of multiple CPU cores with multiple GPUs. The software implementation of the algorithm can be used to detect collisions between five million objects in less than two seconds and was used to detect 1.4 billion contact events in less than 40 seconds. A spherical padding approach is used to represent surface geometries as large collections of spheres when dealing with collision detection between bodies with complex geometries. The proposed methodology is expected to be relevant in computational mechanics with applications in granular flow dynamics and smoothed particle hydrodynamics (SPH), where the number of contact events ranges from millions to billions.

4.
The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
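
As a rough illustration of the two-list idea named in the abstract above, the sketch below shows a sequential meet-in-the-middle subset-sum check in Python. It is not the paper's CUDA implementation: the function names `all_subset_sums` and `two_list_subset_sum` are hypothetical, the pruning stage is left out, and the iterative doubling loop only stands in for the general idea of replacing explicit recursion in the generation stage.

```python
def all_subset_sums(items):
    """Generation stage (sketch): build all subset sums without recursion.

    Each pass doubles the list by adding the next item to every existing sum.
    """
    sums = [0]
    for x in items:
        sums = sums + [s + x for s in sums]
    return sums


def two_list_subset_sum(items, target):
    """Return True if some subset of `items` sums to `target`."""
    half = len(items) // 2
    # Two lists: one sorted ascending, the other sorted descending.
    a = sorted(all_subset_sums(items[:half]))
    b = sorted(all_subset_sums(items[half:]), reverse=True)

    # Search stage (sketch): walk both lists with two pointers.
    i, j = 0, 0
    while i < len(a) and j < len(b):
        total = a[i] + b[j]
        if total == target:
            return True
        if total < target:
            i += 1  # need a larger sum: advance in the ascending list
        else:
            j += 1  # need a smaller sum: advance in the descending list
    return False
```

Splitting the input in half means each list holds about 2^(n/2) sums rather than 2^n, which is what makes the two-list formulation attractive before any parallelization is considered.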

5.
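To achieve real-time, realistic simulation of small-scale fluid scenes, the water body is modeled with the weakly compressible SPH method, and a hybrid CPU-GPU computing approach for fluid calculation is proposed. Because the neighbor-particle search limits the efficiency of the fluid computation, the whole simulation domain is partitioned into a uniform three-dimensional grid, and the neighbor search is implemented with a parallel prefix sum and a parallel counting sort. Finally, the fluid surface is extracted with a CUDA-accelerated Marching Cubes algorithm, and environment mapping is used to render the reflection and refraction of the fluid, which completes the surface shading. Experimental results show that the proposed modeling and simulation algorithms achieve real-time computation and rendering of small-scale fluids, reproducing dynamic effects such as waves, overturning water, and a wooden block rocking in the water; with 1,048,576 particles, the GPU parallel method reaches a speedup of 60.7 over the CPU method.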
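
The uniform-grid neighbor search described in the abstract above can be sketched as a counting sort keyed on the grid cell. The Python below is a sequential illustration only, not the paper's CUDA code: the function name `build_cell_table`, its parameters, and the cell-indexing order are assumptions, and the histogram, prefix sum, and scatter steps that would run in parallel on a GPU are written here as plain loops.

```python
from itertools import accumulate


def build_cell_table(positions, cell_size, grid_dim):
    """Sort particle indices by grid cell with a counting sort.

    positions: list of (x, y, z) tuples, each coordinate in
               [0, cell_size * grid_dim).
    Returns (cell_start, sorted_ids); the particles in cell c are
    sorted_ids[cell_start[c]:cell_start[c + 1]].
    """
    def cell_of(p):
        ix, iy, iz = (int(c // cell_size) for c in p)
        return (iz * grid_dim + iy) * grid_dim + ix

    n_cells = grid_dim ** 3
    cells = [cell_of(p) for p in positions]

    # Step 1: count particles per cell (a histogram).
    counts = [0] * n_cells
    for c in cells:
        counts[c] += 1

    # Step 2: an exclusive prefix sum gives each cell's start offset.
    cell_start = [0] + list(accumulate(counts))

    # Step 3: scatter particle indices into their sorted slots.
    cursor = cell_start[:-1].copy()
    sorted_ids = [0] * len(positions)
    for pid, c in enumerate(cells):
        sorted_ids[cursor[c]] = pid
        cursor[c] += 1
    return cell_start, sorted_ids
```

With the table built, the neighbors of a particle are found by scanning only the slices of `sorted_ids` that belong to its own cell and the adjacent cells, instead of testing every particle pair.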

6.
A parallel CUDA implementation of the dynamic programming method for the knapsack problem on an NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for computational tests, and the processing times obtained with the parallel code are compared with those of the sequential code on a 3.0 GHz Intel Xeon CPU. The results show a speedup factor of 26 for large problems. Furthermore, in order to limit the communication between the CPU and the GPU, a compression technique is presented which significantly decreases the memory occupancy.
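
To show why the dynamic programming recurrence lends itself to data parallelism, here is a minimal sequential sketch in Python. It assumes the 0/1 variant of the knapsack problem, which the abstract does not state explicitly; the function name `knapsack_max_value` is hypothetical, and the compression technique mentioned above is not modeled.

```python
def knapsack_max_value(weights, values, capacity):
    """0/1 knapsack by dynamic programming, one table row per item."""
    prev = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Every entry of `cur` depends only on `prev`, so on a GPU each
        # capacity c could be computed by its own thread.
        cur = [
            max(prev[c], prev[c - w] + v) if c >= w else prev[c]
            for c in range(capacity + 1)
        ]
        prev = cur
    return prev[capacity]
```

Only two rows are kept at a time, so the working set is proportional to the capacity rather than to the full item-by-capacity table.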

7.
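Driven by petaflop-scale computing power, numerical software has entered a historic turning point characterized by massive parallelism, and scalability and fault tolerance have become the two key technologies for large-scale numerical simulation. petaPar is a next-generation general-purpose simulation code developed for petascale computing; it takes meshfree methods, whose strengths complement those of traditional numerical techniques, as its starting point. Within a unified framework, petaPar implements smoothed particle hydrodynamics (SPH) and the material point method (MPM), two of the most mature and effective meshfree/particle algorithms, and supports a variety of strength models, failure models, and equations of state. Its MPM supports an improved contact algorithm that can handle discontinuous deformation and interaction among millions of discrete bodies. The system has the following features: 1) high scalability: computation and communication overlap completely even in the extreme case of one patch per core, and dynamic load balancing is supported; 2) fault tolerance: unattended restart with a changed number of processes is supported, so the computation need not be aborted when a local hardware fault occurs at run time; 3) adaptation to the trend toward heterogeneous hardware architectures: both the flat MPI and the MPI+Pthreads parallel models are supported. Full-system scalability tests on the Titan petaflop supercomputer show that the code scales linearly to 260,000 CPU cores, with parallel efficiencies of 100% and 96% for SPH and MPM, respectively.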

8.
Graphics processing units (GPUs) offer parallel computing power that usually requires a cluster of networked computers or a supercomputer to accomplish. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load on the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.

9.
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA), a parallel computing architecture, has been developed by NVIDIA to utilize this performance in general purpose computations. Here we show for the first time a possible application of GPUs for environmental studies serving as a basis for decision making strategies. A stochastic Lagrangian particle model has been developed on CUDA to estimate the transport and the transformation of the radionuclides from a single point source during an accidental release. Our results show that the parallel implementation achieves typical acceleration values on the order of 80-120 times compared to a single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very small differences have been found between the results obtained from GPU and CPU simulations, which are comparable with the effect of stochastic transport phenomena in the atmosphere. The relatively high speedup with no additional costs to maintain this parallel architecture could result in a wide usage of GPUs for diversified environmental applications in the near future.

10.
We introduce efficient, large‐scale fluid simulation on GPU hardware using the fluid‐implicit particle (FLIP) method over a sparse hierarchy of grids represented in NVIDIA® GVDB Voxels. Our approach handles tens of millions of particles within a virtually unbounded simulation domain. We describe novel techniques for parallel sparse grid hierarchy construction and fast incremental updates on the GPU for moving particles. In addition, our FLIP technique introduces sparse, work‐efficient parallel data gathering from particle to voxel, and a matrix‐free GPU‐based conjugate gradient solver optimized for sparse grids. Our results show that our method can achieve up to an order of magnitude faster simulations on the GPU as compared to FLIP simulations running on the CPU.

11.
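Plane-wave first-principles calculation is currently the most widely used method in materials science, but conventional CPU parallel computing has hit a scalability bottleneck and cannot improve the absolute time to solution. This paper systematically introduces Ultra-Mat, a large-scale first-principles materials simulation package developed with graphics processing unit (GPU) acceleration. The software provides a systematic algorithm design and implementation of the plane-wave first-principles method: (1) a parallelization scheme lets fast Fourier transforms (FFTs) be performed locally on the GPU; (2) a mixed-precision algorithm based on data compression significantly reduces the Message Passing Interface (MPI) communication in the electronic-structure calculation; (3) more than 90% of the code runs on the GPU, which minimizes intermediate steps and avoids the data transfers caused by switching between CPU and GPU, a widely recognized performance bottleneck in GPU applications. Tests show that Ultra-Mat performs well: for a 512-atom GaAs system, the electronic-structure calculation on 256 GPU cards is 18 times faster than on 4096 CPU cores.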

12.
In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, and 3G communications. However, the throughput of a Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases: the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we apply two optimization methods, single instruction multiple data and multithreading, to gain over 145× speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, a codeword packing scheme, and asynchronous data transition, achieving a throughput of 404.65 Mbps, a 12× speedup over the initial GPU version on an NVIDIA GeForce GTX580 card, and a 7× speedup over an Intel quad‐core CPU i5‐2300, with both devices from the same manufacturing year and both fully optimized. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). At a clock rate of 209.15 MHz, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming cost, and price. Copyright © 2013 John Wiley & Sons, Ltd.

13.
The simulation of electromagnetic (EM) wave propagation in dielectric media is presented using a Compute Unified Device Architecture (CUDA) implementation of the finite‐difference time‐domain (FDTD) method on a graphics processing unit (GPU). The FDTD formulation in dielectric media is derived in detail, and the GPU‐accelerated FDTD method based on the CUDA programming model is described in a flowchart. The accuracy and speedup of the presented CUDA‐implemented FDTD method are validated by the numerical simulation of EM waves propagating from free space into lossless and lossy dielectric media on the GPU, by comparison with the original FDTD method on the CPU. The comparison of the numerical results of the CUDA‐implemented FDTD method on the GPU and the original FDTD method on the CPU demonstrates that the CUDA‐implemented FDTD method on the GPU can obtain better application speedup performance with reasonable accuracy. © 2016 Wiley Periodicals, Inc. Int J RF and Microwave CAE 26:512–518, 2016.

14.
Parallel Computing, 2014, 40(8): 425-447
EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and an elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:
  • method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
  • method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
  • method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
  • approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.
Hybrid platforms tested in this study contain different numbers of CPUs and GPUs, from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors from different vendors are employed in these systems: both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

15.
We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi‐core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self‐collision detection. HPCCD takes advantage of hybrid multi‐core architectures, using the general‐purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock‐free parallel algorithm in the main loop of our BVH‐based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi‐core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU cores and two GPUs, compared to using a single CPU core. This improvement results in interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousands of triangles.

16.
Bayesian inference is one of the most important methods for estimating phylogenetic trees in bioinformatics. Due to the potentially huge computational requirements, several parallel algorithms of Bayesian inference have been implemented to run on CPU-based clusters, multicore CPUs, or small clusters of CPUs and GPUs. To the best of our knowledge, however, none of the existing methods is able to simultaneously and fully utilize both CPUs and GPUs for the computations, leaving idle either the CPU part or the GPU part of modern heterogeneous supercomputers. Aiming at an optimized utilization of heterogeneous computing resources, which is a promising hardware architecture for future bioinformatics applications, we present a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, which combines MPI, OpenMP, and CUDA programming. The novelty of our algorithm, denoted as oMC3, is its ability to use CPU cores simultaneously with GPUs for the computations, while ensuring a fair work division between the two types of hardware components. We have implemented oMC3 based on MrBayes, which is one of the most popular software packages for Bayesian phylogenetic inference. Numerical experiments show that oMC3 obtains a 2.5× speedup over nMC3, which is a cutting-edge GPU implementation of MrBayes, on a single server consisting of two GPUs and sixteen CPU cores. Moreover, oMC3 scales nicely when 128 GPUs and 1536 CPU cores are in use.

17.
In biological research, alignment of protein sequences by computer is often needed to find similarities between them. Although results can be computed in a reasonable time for the alignment of two sequences, massive sequence alignment problems such as protein database search still consume a great deal of central processing unit (CPU) time. In this paper, an optimized protein database search method is presented and tested with the Swiss-Prot database on graphics processing unit (GPU) devices; in addition, CPU multi-threaded computing is used to realize GPU-based heterogeneous parallelism. In our proposed method, a hybrid alignment approach is implemented by combining the Smith–Waterman local alignment algorithm with the Needleman–Wunsch global alignment algorithm, and parallel database search is realized with the compute unified device architecture (CUDA) parallel computing framework. In the experiment, the algorithm is tested on a lower-end and a higher-end personal computer equipped with GeForce GTX 750 Ti and GeForce GTX 1070 graphics cards, respectively. The results show that the parallel method proposed in this paper can achieve a speedup of up to 138.86 times over the serial counterpart, significantly improving the efficiency and convenience of protein database search.
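
For readers unfamiliar with the local alignment step named in the abstract above, the following Python sketch computes a Smith–Waterman score for two strings. It is a plain sequential illustration rather than the paper's hybrid CUDA method: the function name `smith_waterman_score` and the match, mismatch, and linear gap scores are assumptions chosen for readability, whereas real protein searches typically score residue pairs with a substitution matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for ca in a:
        cur = [0]  # column 0 is always 0 for local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the alignment
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

In a database search, one query is scored against many database sequences independently, which is the level at which such a computation is usually spread across GPU threads.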

18.
Wang Dongdong, Zhuang Lei. 《计算机应用》 (Journal of Computer Applications), 2009, 29(6): 1702-1710
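Flame fluids are simulated with the SPH method, which is based on particle interpolation. The GPU accelerates the computation of particle states, while the CPU computes particle neighbor relations in parallel and controls the particle emission rate. A vortex field is added to the SPH model at modest cost, which adds detail to the particle motion. For rendering, a chromaticity field, directed point spreading, and color sharpening are used to obtain a satisfactory continuous flame image from the discrete spatial distribution of particles. Because the method is a Lagrangian approach to fluid simulation, the flame is physically realistic, and because the computation runs mainly on the GPU with the CPU in a supporting role, the simulation runs in real time.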

19.
Design and implementation of a high-speed WPA/WPA2-PSK brute-force cracker
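To address the slow cracking speed of WPA/WPA2-PSK brute-force crackers that run on a single-core CPU, a high-speed distributed brute-force cracker combining multi-core CPUs and GPUs is proposed. Distributed techniques divide the key list sensibly among the machines; on each machine, the multi-core CPU and the GPU form multiple compute cores that crack in parallel, and the GPU's strength in compute-intensive parallel tasks raises the cracking speed. Experimental results show that the cracking speed of this brute-force cracker ...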

20.
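In practical engineering applications, numerical combustion simulation with conventional serial CPU computation often cannot meet the required simulation speed. Exploiting the greater computing power of the GPU relative to the CPU, this work discretizes the governing equations of combustion on a staggered grid, solves the discretized equations with the preconditioned biconjugate gradient stabilized method (PBiCGSTAB), and explores GPU-oriented parallel algorithms for matrix-vector multiplication and inverse-matrix-vector multiplication, yielding a feasible method for numerically solving laminar diffusion combustion on the GPU. Experimental results show that the GPU parallel program runs roughly 10 times faster or more than the serial CPU program and that the computed results agree with the actual behavior, so the proposed method is both feasible and efficient.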
