20 similar documents found; search time: 15 ms
1.
We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA® Tesla T20 series GPUs, we find double-precision speedups of up to 30× compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60× faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU.
2.
Trung Dac Nguyen, Carolyn L. Phillips, Joshua A. Anderson, Sharon C. Glotzer, Computer Physics Communications, 2011, (11): 2307–2313
Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton's equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two case studies, patchy spheres and tethered nanorods, are included to demonstrate the performance attained. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously possible even on large clusters.
3.
D.C. Rapaport, Computer Physics Communications, 2011, 182(4): 926–934
Design considerations for molecular dynamics algorithms capable of taking advantage of the computational power of a graphics processing unit (GPU) are described. Accommodating the constraints of scalable streaming-multiprocessor hardware necessitates a reformulation of the underlying algorithm. Performance measurements demonstrate the considerable benefit and cost-effectiveness of such an approach, which produces a factor of 2.5 speed improvement over previous work for the case of the soft-sphere potential.
4.
A neural network forward-propagation algorithm is implemented on the GPU using the CUDA architecture. The algorithm exploits the parallelism of the neuron computations within each layer of the network: one kernel function computes the values of all neurons in a layer in parallel, and each kernel is optimized for the characteristics of both the neural network and the CUDA architecture. Experiments show that the algorithm runs about 7 times faster than an ordinary CPU implementation. The results are a useful reference both for speeding up neural network computation and for identifying workloads suited to CUDA.
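The layer-wise parallel scheme described above can be sketched on the CPU with NumPy, whose vectorized matrix product stands in for the per-layer CUDA kernel. The weights, activation, and 3-4-2 layout below are invented for this illustration and are not taken from the paper.

```python
import numpy as np

def forward(x, layers):
    """Forward propagation, one layer at a time.

    In the paper, each layer is evaluated by one CUDA kernel that
    computes all of the layer's neurons in parallel; the vectorized
    matrix-vector product plays that role in this CPU sketch.
    `layers` is a list of (weights, bias) pairs (hypothetical layout).
    """
    a = x
    for W, b in layers:
        a = np.tanh(W @ a + b)  # all neurons of the layer at once
    return a

# Tiny 3-4-2 network with random fixed weights, for illustration only.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = forward(np.ones(3), layers)
```

One kernel launch per layer mirrors the data dependency between layers: neurons within a layer are independent, but a layer cannot start before the previous one finishes.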
5.
S. Bianchi, Computer Physics Communications, 2010, 181(8): 1444–1448
Holographic optical tweezers allow the three-dimensional, dynamic, multipoint manipulation of micron-sized objects using laser light. Exploiting the massively parallel architecture of modern GPUs, we can generate highly optimized holograms at video frame rate, allowing the precise interactive micro-manipulation of complex structures.
6.
Mikolaj Szydlarski, Pierre Esterie, Joel Falcou, Laura Grigori, Radek Stompor, Concurrency and Computation, 2014, 26(3): 683–711
Spherical harmonic transforms (SHT) are at the heart of many scientific and practical applications, ranging from climate modelling to cosmological observations. In many of these areas, new cutting-edge science goals have recently been proposed that require simulating and analysing experimental or observational data at very high resolution and of unprecedented volume. Both aspects pose a formidable challenge for existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures: multi-core processors and graphics processing units (GPU). It also discusses their performance, alone and embedded within a top-level, Message Passing Interface (MPI)-based parallelisation layer ported from the S2HAT library, in terms of accuracy, overall efficiency, and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs featuring the latest Compute Unified Device Architecture (CUDA) generation (Fermi) outperforms the state-of-the-art multi-core implementation executed on a current Intel Core i7-2600K. Furthermore, we show that an MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 units is as much as 3 times faster than the hybrid MPI/OpenMP version executed on the same number of quad-core Intel Nehalem processors, for problem sizes motivated by our target applications. Performance of the direct transforms is, however, found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps of the transform calculation, emphasise those with a major impact on overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations. Copyright © 2013 John Wiley & Sons, Ltd.
7.
A performance study of general-purpose applications on graphics processors using CUDA
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Kevin Skadron, Journal of Parallel and Distributed Computing, 2008
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
8.
9.
The finite-difference time-domain (FDTD) method is commonly used for electromagnetic field simulations. Recently, successful hardware accelerations using the graphics processing unit (GPU) have been reported for large-scale FDTD simulations. In this paper, we present a performance analysis of the three-dimensional (3D) FDTD on the GPU using the roofline model. We find that the theoretical predictions of maximum performance agree well with the experimental results. We also suggest suitable optimization methods for the best performance of FDTD on the GPU. In particular, the optimized 3D FDTD program on the GPU (NVIDIA GeForce GTX 480) is shown to be 64 times faster than a naively implemented program on the CPU (Intel Core i7 2600).
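The roofline prediction referred to above fits in a few lines: attainable performance is the minimum of the compute roof and the memory roof (bandwidth times arithmetic intensity). The peak and bandwidth figures below are approximate published GTX 480 specifications, and the FDTD arithmetic intensity is a hypothetical stand-in, not a value taken from the paper.

```python
def roofline(peak_gflops, bw_gbs, intensity):
    """Attainable performance (GFLOP/s) under the roofline model:
    min(compute roof, memory roof = bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * intensity)

# Illustrative numbers (approximate GTX 480 single-precision specs):
peak, bw = 1345.0, 177.0   # GFLOP/s, GB/s
fdtd_ai = 0.5              # flops per byte; stencil codes are memory-bound
print(roofline(peak, bw, fdtd_ai))  # prints 88.5 (memory roof wins)
```

Because a 3D FDTD stencil performs few flops per byte moved, the model predicts performance far below the compute peak, which is why bandwidth-oriented optimizations (coalescing, shared-memory tiling) dominate tuning.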
10.
Application of graphics processors to general-purpose computing
Based on the compute unified device architecture (CUDA) of the graphics processing unit (GPU), this paper describes the principles and methods of using the GPU for general-purpose computation. Matrix multiplication experiments were carried out on a GeForce 8800 GT. The results show that as the matrix order increases, both GPU and CPU processing slow down; however, after a 100-fold increase in data size, the GPU computation time grew by only a factor of 3.95, while the CPU computation time grew by a factor of 216.66.
11.
Exact set similarity join is a notoriously expensive operation, for which several solutions have been proposed. Recently, studies have presented comparative analyses using MapReduce or a non-parallel setting. Our contribution is to complement these works by conducting a thorough evaluation of the state-of-the-art GPU-enabled techniques. These techniques are highly diverse in their key features, and our experiments reveal the key strengths of each one. As we explain, in real-life applications there is no dominant solution. Depending on the specific dataset and query characteristics, each solution, even one not using the GPU at all, has its own sweet spot. All our work is repeatable and extensible.
12.
Offline processing of acoustic emission (AE) signal waveforms recorded during a long-term AE monitoring session is a challenging problem in AE testing. This is because today's AE systems can work with up to hundreds of channels and are able to process tens of thousands of AE events per second, so the amount of data recorded during a session is very high. This paper proposes a way to accelerate signal processing methods for acoustic emission, and to accelerate similarity calculation, using the graphics processing unit (GPU). GPU-based accelerators are an affordable high-performance computing (HPC) solution which can be used in any industrial workstation or laptop; they are therefore suitable for onsite AE monitoring. Our implementation, based on the Compute Unified Device Architecture (CUDA), shows that the GPU achieves 30 times faster processing than the CPU for AE signal preprocessing, while the similarity calculation is accelerated by up to 80 times. These results demonstrate that the GPU is a powerful and low-cost accelerator for AE signal processing algorithms.
13.
We propose a novel technique that has the potential to realize interrogation of surface plasmon resonance (SPR) sensors at very high speed. In contrast to the incoherent light source used in traditional wavelength interrogation schemes, a broadband coherent laser generating short optical pulses at a high repetition rate is used along with a highly dispersive optical element. The dispersion causes strong broadening of the optical pulses, and the temporal pulse shape can exactly resemble the spectral distribution of the pulses due to the induced linear chirp. Therefore, by measuring the changes in the pulse shapes with a single high-speed photodetector, the spectral response of the SPR sensor can be obtained for each input pulse, and the interrogation speed can reach the repetition rate of the pulse train. This could enable SPR measurements at speeds of tens of MHz or higher, well beyond those of other current SPR interrogation techniques. We experimentally demonstrate that, by measuring the variations in the pulse shapes of the chirped pulses, sensitive SPR measurements can be made. Implementing this scheme with a femtosecond fiber laser and other fiber optic components also shows the potential to realize more compact and integrated SPR systems.
14.
Real-time computation of high-precision, bidirectionally constrained Lucas-Kanade (LK) pyramid optical flow on an embedded computing platform is a key factor in whether the algorithm can be used in scenarios such as autonomous driving. To this end, a grid-based feature extraction method and a new bidirectional constraint method are proposed. A pyramid model with dynamic windows is then designed, which solves the load-imbalance problem in the optical flow computation. Finally, overall performance is further improved by reducing the computational bit width. Experimental results show that on a Jetson TX2, for 720P video of real-world scenes, the proposed method runs 4.1 times faster than the GPU version in OpenCV, exceeding 30 fps. A SLAM system using this method was successfully deployed in an in-vehicle scenario and tested in a real environment, where the system reached 28 fps. The new method effectively improves the accuracy of pose and point-cloud estimates and satisfies the real-time processing requirements of in-vehicle scenarios.
15.
Many engineering and scientific problems require solving boundary value problems for partial differential equations or systems of them. In most cases, the only practical way to obtain the solution with the desired precision and in acceptable time is to harness the power of parallel processing. In this paper, we present some effective applications of parallel processing based on a hybrid CPU/GPU domain decomposition method. Within the family of domain decomposition methods, the so-called optimized Schwarz methods have proven to have good convergence behaviour compared to classical Schwarz methods. The price for this feature is the need to transfer more physical information between subdomain interfaces. For solving the large systems of linear algebraic equations resulting from the finite element discretization of the subproblem for each subdomain, a Krylov method is often a good choice. Since the overall efficiency of such methods depends on effective calculation of the sparse matrix–vector product, approaches that use the graphics processing unit (GPU) instead of the central processing unit (CPU) for this task look very promising. In this paper, we discuss the effective implementation of algebraic operations for iterative Krylov methods on the GPU. In order to ensure good performance for the non-overlapping Schwarz method, we propose to use optimized conditions obtained by a stochastic technique based on the covariance matrix adaptation evolution strategy. The performance, robustness, and accuracy of the proposed approach are demonstrated for the solution of the gravitational potential equation for data acquired from the geological survey of the Chicxulub crater.
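The structure of the Krylov solvers discussed above can be sketched with a generic conjugate-gradient routine driven by a matvec callback; in a GPU implementation, the sparse matrix-vector product behind that callback is the operation that dominates the runtime. This is a textbook CG, not the authors' code, and the tiny dense system standing in for a finite element subdomain problem is invented for the example.

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=200):
    """Conjugate gradients for SPD systems, parameterized by a matvec
    callback so the matrix product can live on the GPU or the CPU."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # step length along search direction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new conjugate search direction
        rs = rs_new
    return x

# Tiny SPD system standing in for a subdomain problem (illustrative).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(lambda v: A @ v, b)
```

Everything except the matvec is cheap vector arithmetic, which is why porting only the sparse product to the GPU already captures most of the available speedup.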
16.
Miguel Lastra, José M. Mantas, Carlos Ureña, Manuel J. Castro, José A. García-Rodríguez, Mathematics and Computers in Simulation, 2009
This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver.
17.
M.A. Clark, R. Babich, K. Barros, R.C. Brower, C. Rebbi, Computer Physics Communications, 2010, 181(9): 1517–1528
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135, and 212 Gflops for double, single, and half precision, respectively, on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates, which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
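The precision split underlying such solvers can be illustrated with plain defect correction: the residual is kept in double precision while the inner solve runs in single. This sketch replaces the paper's Krylov inner solve (and its reliable-update refinement) with a direct solve, purely to show where each precision is used; the test matrix is invented.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-10, max_outer=20):
    """Defect-correction style mixed precision solve of A x = b.

    The outer loop forms residuals in double precision; the inner
    solve runs entirely in single precision (here a direct solve
    stands in for a single-precision Krylov sweep)."""
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                              # double-precision residual
        A32 = A.astype(np.float32)
        r32 = r.astype(np.float32)
        e = np.linalg.solve(A32, r32).astype(np.float64)
        x += e                                     # double-precision update
        if np.linalg.norm(b - A @ x) < tol * np.linalg.norm(b):
            break
    return x

# Small SPD system, for illustration only.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = mixed_precision_solve(A, b)
```

Each outer iteration recovers roughly single-precision accuracy in the correction, so a handful of cheap low-precision solves reaches full double-precision accuracy in the accumulated solution.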
18.
We implement, optimize, and validate the linear-scaling Kubo–Greenwood quantum transport simulation on graphics processing units by examining resonant scattering in graphene. We consider two practical representations of the Kubo–Greenwood formula: a Green–Kubo formula based on the velocity auto-correlation and an Einstein formula based on the mean square displacement. The code is fully implemented on graphics processing units with a speedup factor of up to 16 (using double precision) relative to our CPU implementation. We compare the kernel polynomial method and the Fourier transform method for the approximation of the Dirac delta function and conclude that the former is more efficient. In the ballistic regime, the Einstein formula can produce the correct quantized conductance of one-dimensional graphene nanoribbons except for an overshoot near the band edges. In the diffusive regime, the Green–Kubo and the Einstein formalisms are demonstrated to be equivalent. A comparison of the length-dependence of the conductance in the localization regime obtained by the Einstein formula with that obtained by the non-equilibrium Green’s function method reveals the challenges in defining the length in the Kubo–Greenwood formalism in the strongly localized regime.
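The kernel polynomial approximation of the Dirac delta function mentioned above can be sketched with the standard Jackson-damped Chebyshev expansion (a textbook construction, not the authors' GPU code): the Chebyshev moments of delta(x - e0) are simply T_m(e0) = cos(m arccos e0). The expansion order and grid below are illustrative choices.

```python
import numpy as np

def jackson(M):
    """Jackson damping coefficients g_m, m = 0..M-1, which suppress the
    Gibbs oscillations of a truncated Chebyshev series."""
    m = np.arange(M)
    q = np.pi / (M + 1)
    return ((M - m + 1) * np.cos(q * m) + np.sin(q * m) / np.tan(q)) / (M + 1)

def kpm_delta(x, e0, M=64):
    """KPM approximation of delta(x - e0) for x, e0 in (-1, 1)."""
    g = jackson(M)
    tx, te = np.arccos(x), np.arccos(e0)
    m = np.arange(1, M)[:, None]
    # T_m(cos t) = cos(m t); moments of the delta are T_m(e0).
    series = g[0] + 2.0 * np.sum(g[1:, None] * np.cos(m * tx) * np.cos(m * te),
                                 axis=0)
    return series / (np.pi * np.sqrt(1.0 - x ** 2))

x = np.linspace(-0.999, 0.999, 4001)
d = kpm_delta(x, 0.3)
```

The result is a sharply peaked function at e0 whose width shrinks as the expansion order M grows, and whose integral over (-1, 1) stays close to one.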
19.
In this paper we focus on two complementary approaches to significantly decrease pre-training time of a deep belief network (DBN). First, we propose an adaptive step size technique to enhance the convergence of the contrastive divergence (CD) algorithm, thereby reducing the number of epochs needed to train the restricted Boltzmann machine (RBM) that supports the DBN infrastructure. Second, we present a highly scalable graphics processing unit (GPU) parallel implementation of the CD-k algorithm, which boosts the training speed notably. Additionally, extensive experiments are conducted on the MNIST and the HHreco databases. The results suggest that the maximum useful depth of a DBN is related to the number and quality of the training samples. Moreover, it was found that the lower-level layer plays a fundamental role for building successful DBN models. Furthermore, the results contradict the pre-conceived idea that all the layers should be pre-trained. Finally, it is shown that by incorporating multiple back-propagation (MBP) layers, the DBN's generalization capability is remarkably improved.
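The CD algorithm being accelerated can be summarized in a few NumPy lines. This is a generic CD-1 weight update with a fixed, hypothetical learning rate and random toy data; the paper's contributions, an adaptive step size and a GPU-parallel CD-k, are not reproduced here.

```python
import numpy as np

def cd1_update(W, v0, lr, rng):
    """One CD-1 weight update for a binary RBM (biases omitted for brevity).

    Positive phase uses the data; negative phase uses a single Gibbs
    step. The step size `lr` is fixed here, whereas the paper adapts it."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h0 = sigmoid(v0 @ W)                          # p(h=1 | v0)
    h_sample = (rng.random(h0.shape) < h0) * 1.0  # sample hidden units
    v1 = sigmoid(h_sample @ W.T)                  # reconstruct visibles
    h1 = sigmoid(v1 @ W)                          # p(h=1 | v1)
    grad = (v0.T @ h0 - v1.T @ h1) / len(v0)      # <v h>_data - <v h>_model
    return W + lr * grad

# Toy usage: 6 visible and 4 hidden units, random binary batch of 32.
rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((6, 4))
batch = (rng.random((32, 6)) < 0.5) * 1.0
W_new = cd1_update(W, batch, lr=0.1, rng=rng)
```

The matrix products over the whole batch are exactly the operations that map well onto the GPU, which is what makes the CD-k implementation in the paper scale.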
20.
Control of autonomous systems subject to stochastic uncertainty is a challenging task. In guided airdrop applications, random wind disturbances play a crucial role in determining landing accuracy and terrain avoidance. This paper describes a stochastic parafoil guidance system which couples uncertainty propagation with optimal control to protect against wind and parameter uncertainty in the presence of impact area obstacles. The algorithm uses real-time Monte Carlo simulation performed on a graphics processing unit (GPU) to evaluate robustness of candidate trajectories in terms of delivery accuracy, obstacle avoidance, and other considerations. Building upon prior theoretical developments, this paper explores performance of the stochastic guidance law compared to standard deterministic guidance schemes, particularly with respect to obstacle avoidance. Flight test results are presented comparing the proposed stochastic guidance algorithm with a standard deterministic one. Through a comprehensive set of simulation results, key implementation aspects of the stochastic algorithm are explored including tradeoffs between the number of candidate trajectories considered, algorithm runtime, and overall guidance performance. Overall, simulation and flight test results demonstrate that the stochastic guidance scheme provides a more robust approach to obstacle avoidance while largely maintaining delivery accuracy.