20 similar documents found; search time: 15 ms
1.
We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA® Tesla T20 series GPUs, we find double-precision speedups of up to 30× compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60× faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU.
2.
Trung Dac Nguyen, Carolyn L. Phillips, Joshua A. Anderson, Sharon C. Glotzer, Computer Physics Communications, 2011, (11): 2307–2313
Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton's equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two case studies, patchy spheres and tethered nanorods, are included to demonstrate the performance attained. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously possible even on large clusters.
3.
D.C. Rapaport, Computer Physics Communications, 2011, 182(4): 926–934
Design considerations for molecular dynamics algorithms capable of taking advantage of the computational power of a graphics processing unit (GPU) are described. Accommodating the constraints of scalable streaming-multiprocessor hardware necessitates a reformulation of the underlying algorithm. Performance measurements demonstrate the considerable benefit and cost-effectiveness of such an approach, which produces a factor of 2.5 speed improvement over previous work for the case of the soft-sphere potential.
4.
A neural network forward-propagation algorithm is implemented on the GPU using the CUDA architecture. The algorithm exploits the parallelism of the neuron computations within each layer of the network: one kernel function computes the values of all neurons in a layer in parallel, and each kernel is optimized for the characteristics of both the neural network and the CUDA architecture. Experiments show that the algorithm runs about 7 times faster than an ordinary CPU implementation. The results are a useful reference both for speeding up neural network computation and for identifying workloads suited to CUDA.
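The layer-wise parallel scheme described above can be sketched on the CPU with NumPy, whose vectorized matrix product stands in for the per-layer CUDA kernel. The weights, activation, and 3-4-2 layout below are invented for this illustration and are not taken from the paper.

```python
import numpy as np

def forward(x, layers):
    """Forward propagation, one layer at a time.

    In the paper, each layer is evaluated by one CUDA kernel that
    computes all of the layer's neurons in parallel; the vectorized
    matrix-vector product plays that role in this CPU sketch.
    `layers` is a list of (weights, bias) pairs (hypothetical layout).
    """
    a = x
    for W, b in layers:
        a = np.tanh(W @ a + b)  # all neurons of the layer at once
    return a

# Tiny 3-4-2 network with random fixed weights, for illustration only.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = forward(np.ones(3), layers)
```

One kernel launch per layer mirrors the data dependency between layers: neurons within a layer are independent, but a layer cannot start before the previous one finishes.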
5.
S. Bianchi, Computer Physics Communications, 2010, 181(8): 1444–1448
Holographic optical tweezers allow the three-dimensional, dynamic, multipoint manipulation of micron-sized objects using laser light. Exploiting the massively parallel architecture of modern GPUs, we can generate highly optimized holograms at video frame rate, allowing the precise interactive micro-manipulation of complex structures.
6.
Mikolaj Szydlarski, Pierre Esterie, Joel Falcou, Laura Grigori, Radek Stompor, Concurrency and Computation, 2014, 26(3): 683–711
Spherical harmonic transforms (SHT) are at the heart of many scientific and practical applications, ranging from climate modelling to cosmological observations. In many of these areas, new cutting-edge science goals have recently been proposed that require simulating and analysing experimental or observational data at very high resolution and of unprecedented volume. Both aspects pose a formidable challenge for existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures: multi-core processors and graphics processing units (GPU). It also discusses their performance, alone and embedded within a top-level, Message Passing Interface (MPI)-based parallelisation layer ported from the S2HAT library, in terms of accuracy, overall efficiency, and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs featuring the latest Compute Unified Device Architecture (CUDA) generation (Fermi) outperforms the state-of-the-art multi-core implementation executed on a current Intel Core i7-2600K. Furthermore, we show that an MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 units is as much as 3 times faster than the hybrid MPI/OpenMP version executed on the same number of quad-core Intel Nehalem processors, for problem sizes motivated by our target applications. Performance of the direct transforms is, however, found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps of the transform calculation, emphasise those with a major impact on overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations. Copyright © 2013 John Wiley & Sons, Ltd.
7.
A performance study of general-purpose applications on graphics processors using CUDA
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Kevin Skadron, Journal of Parallel and Distributed Computing, 2008
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
8.
9.
The finite-difference time-domain (FDTD) method is commonly used for electromagnetic field simulations. Recently, successful hardware accelerations using the graphics processing unit (GPU) have been reported for large-scale FDTD simulations. In this paper, we present a performance analysis of the three-dimensional (3D) FDTD on the GPU using the roofline model. We find that the theoretical predictions of maximum performance agree well with the experimental results. We also suggest suitable optimization methods for the best performance of FDTD on the GPU. In particular, the optimized 3D FDTD program on the GPU (NVIDIA GeForce GTX 480) is shown to be 64 times faster than a naively implemented program on the CPU (Intel Core i7 2600).
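The roofline prediction referred to above fits in a few lines: attainable performance is the minimum of the compute roof and the memory roof (bandwidth times arithmetic intensity). The peak and bandwidth figures below are approximate published GTX 480 specifications, and the FDTD arithmetic intensity is a hypothetical stand-in, not a value taken from the paper.

```python
def roofline(peak_gflops, bw_gbs, intensity):
    """Attainable performance (GFLOP/s) under the roofline model:
    min(compute roof, memory roof = bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bw_gbs * intensity)

# Illustrative numbers (approximate GTX 480 single-precision specs):
peak, bw = 1345.0, 177.0   # GFLOP/s, GB/s
fdtd_ai = 0.5              # flops per byte; stencil codes are memory-bound
print(roofline(peak, bw, fdtd_ai))  # prints 88.5 (memory roof wins)
```

Because a 3D FDTD stencil performs few flops per byte moved, the model predicts performance far below the compute peak, which is why bandwidth-oriented optimizations (coalescing, shared-memory tiling) dominate tuning.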
10.
Application of graphics processors to general-purpose computing
Based on the compute unified device architecture (CUDA) of the graphics processing unit (GPU), this paper describes the principles and methods of using the GPU for general-purpose computation. Matrix multiplication experiments were carried out on a GeForce 8800 GT. The results show that as the matrix order increases, both GPU and CPU processing slow down; however, after a 100-fold increase in data size, the GPU computation time grew by only a factor of 3.95, while the CPU computation time grew by a factor of 216.66.
11.
Exact set similarity join is a notoriously expensive operation, for which several solutions have been proposed. Recently, studies have presented comparative analyses using MapReduce or a non-parallel setting. Our contribution is to complement these works by conducting a thorough evaluation of the state-of-the-art GPU-enabled techniques. These techniques are highly diverse in their key features, and our experiments reveal the key strengths of each one. As we explain, in real-life applications there is no dominant solution. Depending on the specific dataset and query characteristics, each solution, even one not using the GPU at all, has its own sweet spot. All our work is repeatable and extensible.
12.
Offline processing of acoustic emission (AE) signal waveforms recorded during a long-term AE monitoring session is a challenging problem in AE testing. This is because today's AE systems can work with up to hundreds of channels and are able to process tens of thousands of AE events per second, so the amount of data recorded during a session is very high. This paper proposes a way to accelerate signal processing methods for acoustic emission, and to accelerate similarity calculation, using the graphics processing unit (GPU). GPU-based accelerators are an affordable high-performance computing (HPC) solution which can be used in any industrial workstation or laptop; they are therefore suitable for onsite AE monitoring. Our implementation, based on the Compute Unified Device Architecture (CUDA), shows that the GPU achieves 30 times faster processing than the CPU for AE signal preprocessing, while the similarity calculation is accelerated by up to 80 times. These results demonstrate that the GPU is a powerful and low-cost accelerator for AE signal processing algorithms.
13.
We propose a novel technique that has the potential to realize interrogation of surface plasmon resonance (SPR) sensors at very high speed. In contrast to the incoherent light source used in traditional wavelength interrogation schemes, a broadband coherent laser generating short optical pulses at a high repetition rate is used along with a highly dispersive optical element. The dispersion causes strong broadening of the optical pulses, and the temporal pulse shape can exactly resemble the spectral distribution of the pulses due to the induced linear chirp. Therefore, by measuring the changes in the pulse shapes with a single high-speed photodetector, the spectral response of the SPR sensor can be obtained for each input pulse, and the interrogation speed can reach the repetition rate of the pulse train. This could enable SPR measurements at speeds of tens of MHz or higher, well beyond those of other current SPR interrogation techniques. We experimentally demonstrate that, by measuring the variations in the pulse shapes of the chirped pulses, sensitive SPR measurements can be made. Implementing this scheme with a femtosecond fiber laser and other fiber optic components also shows the potential to realize more compact and integrated SPR systems.
14.
Real-time computation of high-precision, bidirectionally constrained Lucas-Kanade (LK) pyramid optical flow on an embedded computing platform is a key factor in whether the algorithm can be used in scenarios such as autonomous driving. To this end, a grid-based feature extraction method and a new bidirectional constraint method are proposed. A pyramid model with dynamic windows is then designed, which solves the load-imbalance problem in the optical flow computation. Finally, overall performance is further improved by reducing the computational bit width. Experimental results show that on a Jetson TX2, for 720P video of real-world scenes, the proposed method runs 4.1 times faster than the GPU version in OpenCV, exceeding 30 fps. A SLAM system using this method was successfully deployed in an in-vehicle scenario and tested in a real environment, where the system reached 28 fps. The new method effectively improves the accuracy of pose and point-cloud estimates and satisfies the real-time processing requirements of in-vehicle scenarios.
15.
Many engineering and scientific problems require solving boundary value problems for partial differential equations or systems of them. In most cases, the only practical way to obtain the solution with the desired precision and in acceptable time is to harness the power of parallel processing. In this paper, we present some effective applications of parallel processing based on a hybrid CPU/GPU domain decomposition method. Within the family of domain decomposition methods, the so-called optimized Schwarz methods have proven to have good convergence behaviour compared to classical Schwarz methods. The price for this feature is the need to transfer more physical information between subdomain interfaces. For solving the large systems of linear algebraic equations resulting from the finite element discretization of the subproblem for each subdomain, a Krylov method is often a good choice. Since the overall efficiency of such methods depends on effective calculation of the sparse matrix–vector product, approaches that use the graphics processing unit (GPU) instead of the central processing unit (CPU) for this task look very promising. In this paper, we discuss the effective implementation of algebraic operations for iterative Krylov methods on the GPU. In order to ensure good performance for the non-overlapping Schwarz method, we propose to use optimized conditions obtained by a stochastic technique based on the covariance matrix adaptation evolution strategy. The performance, robustness, and accuracy of the proposed approach are demonstrated for the solution of the gravitational potential equation for data acquired from the geological survey of the Chicxulub crater.
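The structure of the Krylov solvers discussed above can be sketched with a generic conjugate-gradient routine driven by a matvec callback; in a GPU implementation, the sparse matrix-vector product behind that callback is the operation that dominates the runtime. This is a textbook CG, not the authors' code, and the tiny dense system standing in for a finite element subdomain problem is invented for the example.

```python
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=200):
    """Conjugate gradients for SPD systems, parameterized by a matvec
    callback so the matrix product can live on the GPU or the CPU."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # step length along search direction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new conjugate search direction
        rs = rs_new
    return x

# Tiny SPD system standing in for a subdomain problem (illustrative).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(lambda v: A @ v, b)
```

Everything except the matvec is cheap vector arithmetic, which is why porting only the sparse product to the GPU already captures most of the available speedup.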
16.
Miguel Lastra, José M. Mantas, Carlos Ureña, Manuel J. Castro, José A. García-Rodríguez, Mathematics and Computers in Simulation, 2009
This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver.
17.
M.A. Clark, R. Babich, K. Barros, R.C. Brower, C. Rebbi, Computer Physics Communications, 2010, 181(9): 1517–1528
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135, and 212 Gflops for double, single, and half precision, respectively, on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates, which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
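The precision split underlying such solvers can be illustrated with plain defect correction: the residual is kept in double precision while the inner solve runs in single. This sketch replaces the paper's Krylov inner solve (and its reliable-update refinement) with a direct solve, purely to show where each precision is used; the test matrix is invented.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-10, max_outer=20):
    """Defect-correction style mixed precision solve of A x = b.

    The outer loop forms residuals in double precision; the inner
    solve runs entirely in single precision (here a direct solve
    stands in for a single-precision Krylov sweep)."""
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                              # double-precision residual
        A32 = A.astype(np.float32)
        r32 = r.astype(np.float32)
        e = np.linalg.solve(A32, r32).astype(np.float64)
        x += e                                     # double-precision update
        if np.linalg.norm(b - A @ x) < tol * np.linalg.norm(b):
            break
    return x

# Small SPD system, for illustration only.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = mixed_precision_solve(A, b)
```

Each outer iteration recovers roughly single-precision accuracy in the correction, so a handful of cheap low-precision solves reaches full double-precision accuracy in the accumulated solution.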
18.
We implement, optimize, and validate the linear-scaling Kubo–Greenwood quantum transport simulation on graphics processing units by examining resonant scattering in graphene. We consider two practical representations of the Kubo–Greenwood formula: a Green–Kubo formula based on the velocity auto-correlation and an Einstein formula based on the mean square displacement. The code is fully implemented on graphics processing units with a speedup factor of up to 16 (using double precision) relative to our CPU implementation. We compare the kernel polynomial method and the Fourier transform method for the approximation of the Dirac delta function and conclude that the former is more efficient. In the ballistic regime, the Einstein formula can produce the correct quantized conductance of one-dimensional graphene nanoribbons except for an overshoot near the band edges. In the diffusive regime, the Green–Kubo and the Einstein formalisms are demonstrated to be equivalent. A comparison of the length-dependence of the conductance in the localization regime obtained by the Einstein formula with that obtained by the non-equilibrium Green’s function method reveals the challenges in defining the length in the Kubo–Greenwood formalism in the strongly localized regime.
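The kernel polynomial approximation of the Dirac delta function mentioned above can be sketched with the standard Jackson-damped Chebyshev expansion (a textbook construction, not the authors' GPU code): the Chebyshev moments of delta(x - e0) are simply T_m(e0) = cos(m arccos e0). The expansion order and grid below are illustrative choices.

```python
import numpy as np

def jackson(M):
    """Jackson damping coefficients g_m, m = 0..M-1, which suppress the
    Gibbs oscillations of a truncated Chebyshev series."""
    m = np.arange(M)
    q = np.pi / (M + 1)
    return ((M - m + 1) * np.cos(q * m) + np.sin(q * m) / np.tan(q)) / (M + 1)

def kpm_delta(x, e0, M=64):
    """KPM approximation of delta(x - e0) for x, e0 in (-1, 1)."""
    g = jackson(M)
    tx, te = np.arccos(x), np.arccos(e0)
    m = np.arange(1, M)[:, None]
    # T_m(cos t) = cos(m t); moments of the delta are T_m(e0).
    series = g[0] + 2.0 * np.sum(g[1:, None] * np.cos(m * tx) * np.cos(m * te),
                                 axis=0)
    return series / (np.pi * np.sqrt(1.0 - x ** 2))

x = np.linspace(-0.999, 0.999, 4001)
d = kpm_delta(x, 0.3)
```

The result is a sharply peaked function at e0 whose width shrinks as the expansion order M grows, and whose integral over (-1, 1) stays close to one.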
19.
In this paper we focus on two complementary approaches to significantly decrease pre-training time of a deep belief network (DBN). First, we propose an adaptive step size technique to enhance the convergence of the contrastive divergence (CD) algorithm, thereby reducing the number of epochs needed to train the restricted Boltzmann machine (RBM) that supports the DBN infrastructure. Second, we present a highly scalable graphics processing unit (GPU) parallel implementation of the CD-k algorithm, which boosts the training speed notably. Additionally, extensive experiments are conducted on the MNIST and the HHreco databases. The results suggest that the maximum useful depth of a DBN is related to the number and quality of the training samples. Moreover, it was found that the lower-level layer plays a fundamental role for building successful DBN models. Furthermore, the results contradict the pre-conceived idea that all the layers should be pre-trained. Finally, it is shown that by incorporating multiple back-propagation (MBP) layers, the DBN's generalization capability is remarkably improved.
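The CD algorithm being accelerated can be summarized in a few NumPy lines. This is a generic CD-1 weight update with a fixed, hypothetical learning rate and random toy data; the paper's contributions, an adaptive step size and a GPU-parallel CD-k, are not reproduced here.

```python
import numpy as np

def cd1_update(W, v0, lr, rng):
    """One CD-1 weight update for a binary RBM (biases omitted for brevity).

    Positive phase uses the data; negative phase uses a single Gibbs
    step. The step size `lr` is fixed here, whereas the paper adapts it."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h0 = sigmoid(v0 @ W)                          # p(h=1 | v0)
    h_sample = (rng.random(h0.shape) < h0) * 1.0  # sample hidden units
    v1 = sigmoid(h_sample @ W.T)                  # reconstruct visibles
    h1 = sigmoid(v1 @ W)                          # p(h=1 | v1)
    grad = (v0.T @ h0 - v1.T @ h1) / len(v0)      # <v h>_data - <v h>_model
    return W + lr * grad

# Toy usage: 6 visible and 4 hidden units, random binary batch of 32.
rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((6, 4))
batch = (rng.random((32, 6)) < 0.5) * 1.0
W_new = cd1_update(W, batch, lr=0.1, rng=rng)
```

The matrix products over the whole batch are exactly the operations that map well onto the GPU, which is what makes the CD-k implementation in the paper scale.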
20.
Control of autonomous systems subject to stochastic uncertainty is a challenging task. In guided airdrop applications, random wind disturbances play a crucial role in determining landing accuracy and terrain avoidance. This paper describes a stochastic parafoil guidance system which couples uncertainty propagation with optimal control to protect against wind and parameter uncertainty in the presence of impact area obstacles. The algorithm uses real-time Monte Carlo simulation performed on a graphics processing unit (GPU) to evaluate robustness of candidate trajectories in terms of delivery accuracy, obstacle avoidance, and other considerations. Building upon prior theoretical developments, this paper explores performance of the stochastic guidance law compared to standard deterministic guidance schemes, particularly with respect to obstacle avoidance. Flight test results are presented comparing the proposed stochastic guidance algorithm with a standard deterministic one. Through a comprehensive set of simulation results, key implementation aspects of the stochastic algorithm are explored including tradeoffs between the number of candidate trajectories considered, algorithm runtime, and overall guidance performance. Overall, simulation and flight test results demonstrate that the stochastic guidance scheme provides a more robust approach to obstacle avoidance while largely maintaining delivery accuracy.