Similar Documents (20 results)
1.
Global magnetohydrodynamic (MHD) models play a major role in investigating the solar wind–magnetosphere interaction, but the huge computational cost of global MHD simulations remains the main problem to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations more efficiently. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used to drive multiple GPUs: all data transfers and kernel launches are managed through the CUDA 4.0 API instead of MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards and show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.
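The key implementation point, managing all inter-GPU traffic through the CUDA 4.0 API with GPUDirect 2.0 rather than MPI or OpenMP, rests on peer-to-peer device copies. A minimal sketch of that mechanism follows, assuming a two-GPU exchange with illustrative buffer names; it is not code from the paper.

    #include <cuda_runtime.h>

    /* Hedged sketch: one-time peer-access setup, then a direct GPU-to-GPU
       copy (GPUDirect 2.0) with no host staging and no MPI. */
    void exchange_halo(const double *d_src_on_gpu0, double *d_dst_on_gpu1,
                       size_t bytes, cudaStream_t stream)
    {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);     /* let GPU 0 address GPU 1 */
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);     /* and vice versa (one-time calls) */

        cudaMemcpyPeerAsync(d_dst_on_gpu1, 1, /* dst pointer, dst device */
                            d_src_on_gpu0, 0, /* src pointer, src device */
                            bytes, stream);
        cudaStreamSynchronize(stream);
    }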

2.
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Because the memory of a single GPU is limited, distributed multi-GPU systems need to be explored for large-scale MHD simulations; however, data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem and enhance overall performance. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs within a single node and to reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of memory copies to speed up the fragmented data exchange arising in a three-dimensional (3D) domain decomposition. 3D decomposition is usually not preferred on distributed multi-GPU systems because of the low efficiency of this fragmented exchange; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS of a conventional MPI-only implementation with 2D decomposition. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.
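The second strategy, replacing many small memory copies with a CUDA packing kernel for the fragmented faces of the 3D decomposition, can be sketched as follows. The row-major layout (i + nx·(j + ny·k)) and all names are illustrative assumptions, not code from the MGPU–MHD program.

    /* Hedged sketch: gather one non-contiguous x-boundary face of a 3D field
       into a contiguous send buffer in a single kernel launch, instead of
       issuing ny*nz tiny cudaMemcpy calls. */
    __global__ void pack_x_face(const double *field, double *sendbuf,
                                int i_face, int nx, int ny, int nz)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < ny && k < nz)
            sendbuf[j + (size_t)ny * k] =
                field[i_face + (size_t)nx * (j + (size_t)ny * k)];
    }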

3.
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We compare the single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the processor's potential for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.
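Mixed-precision iterative refinement has a simple structure: perform the expensive solve in single precision while forming the residual and updating the solution in double precision. Below is a minimal host-side sketch; residual_double, solve_single, and norm2 are hypothetical placeholders standing in for the paper's CUBLAS-based routines.

    #include <stdlib.h>

    /* Hypothetical building blocks standing in for the paper's GPU routines. */
    void   residual_double(int n, const double *A, const double *x,
                           const double *b, double *r);  /* r = b - A*x, double */
    void   solve_single(int n, const double *A, const double *rhs,
                        double *d);            /* solve A*d = rhs in single */
    double norm2(int n, const double *v);

    /* Mixed-precision iterative refinement: cheap single-precision solves,
       double-precision residuals and updates recover full accuracy. */
    void refine(int n, const double *A, const double *b, double *x,
                int max_iter, double tol)
    {
        double *r = malloc(n * sizeof *r);
        double *d = malloc(n * sizeof *d);
        for (int it = 0; it < max_iter; ++it) {
            residual_double(n, A, x, b, r);
            if (norm2(n, r) < tol) break;   /* converged in double precision */
            solve_single(n, A, r, d);       /* correction from cheap solve */
            for (int i = 0; i < n; ++i)
                x[i] += d[i];               /* accumulate in double precision */
        }
        free(r);
        free(d);
    }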

4.
Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel "co-processors" to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA's Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU, and use Pthreads to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm for solving the incompressible fluid flow equations. Kernels are placed in different memory spaces on the GPU depending on their arithmetic intensity, and this memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and compare single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
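The projection step is typically dominated by an iterative pressure Poisson solve, which maps naturally onto one CUDA thread per grid point. The following sketch of a single Jacobi sweep on a uniform 2D grid illustrates the pattern under assumed naming and layout; the paper's actual kernels, memory-space placement, and boundary handling are not reproduced here.

    /* Hedged sketch: one Jacobi sweep for the pressure Poisson equation on a
       uniform 2D grid with spacing h; div holds the divergence of the
       intermediate velocity field. Row-major layout, one thread per point. */
    __global__ void jacobi_pressure(const float *p, float *p_new,
                                    const float *div, int nx, int ny, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int c = i + nx * j;
            p_new[c] = 0.25f * (p[c - 1] + p[c + 1] + p[c - nx] + p[c + nx]
                                - h * h * div[c]);
        }
    }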

5.
Zhang Zhe, 《微型机与应用》 (Microcomputer & Its Applications), 2012, 31(10): 85–88
For two-dimensional one-layer shallow-water systems on GPUs supporting NVIDIA's CUDA programming model, this paper shows how to accelerate the numerical solution of a well-balanced finite-volume scheme, and presents and implements algorithms that exploit the latent data parallelism under the CUDA model in both single and double floating-point precision. Numerical experiments show that the CUDA-based solver is more efficient than a parallel CPU implementation.

6.
Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes of up to about 1 million particles. Particular emphasis is put on numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single floating-point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to match a parallelized MD simulation using 64 distributed cores.
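Double-single emulation represents one value as an unevaluated sum of two floats so that accumulated quantities, such as the total energy, retain nearly double precision on single-precision hardware. The device function below sketches the standard Knuth/Dekker two-sum that underlies this technique (as in DSFUN90-style libraries); it is a sketch of the general method, not code from the paper, and must be compiled without value-unsafe floating-point optimizations.

    /* Hedged sketch of double-single addition: a value is carried as a
       float2 (hi, lo) pair; the rounding error of the high-order sum is
       recovered and folded into the low-order part. */
    __device__ float2 ds_add(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                     /* high-order sum */
        float e  = t1 - a.x;
        float t2 = ((b.x - e) + (a.x - (t1 - e))) /* rounding error of t1 */
                   + a.y + b.y;                   /* plus low-order parts */
        float2 r;
        r.x = t1 + t2;                            /* renormalize */
        r.y = t2 - (r.x - t1);
        return r;
    }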

7.
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracies of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method suitable for GPUs is proposed and demonstrated on inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs), though each with different evaluation tools. The low-level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high-level EA, which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.

8.
We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth onto NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited to the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25×.
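The coloring scheme removes race conditions from the summation over shared nodes: elements are partitioned into colors such that no two elements of one color touch the same node, so each color can be summed by a plain kernel launch without atomics. The sketch below shows the pattern only; the data structures are illustrative assumptions, not those of the seismic code.

    /* Hedged sketch: sum the contributions of the elements of ONE color into
       the global nodal array. Within a color no node is shared between
       elements, so the read-modify-write below needs no atomic operations.
       The host loops over colors, launching this kernel once per color. */
    __global__ void sum_color(const int *elem_ids, int n_elem_in_color,
                              const int *elem_nodes, int nodes_per_elem,
                              const float *elem_contrib, float *node_accel)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= n_elem_in_color) return;
        int elem = elem_ids[e];                 /* global element index */
        for (int a = 0; a < nodes_per_elem; ++a) {
            int node = elem_nodes[elem * nodes_per_elem + a];
            node_accel[node] += elem_contrib[elem * nodes_per_elem + a];
        }
    }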

9.
An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced and its portability to graphics processing units (GPUs) is explored. The computations are performed on the NVIDIA GTX480 GPU, and the results are compared with those obtained on a single core of an Intel Core i7-920 (2.67 GHz) in terms of calculation runtime. The BCR linear solver achieves a maximum speedup of 5.84× with a block size of 32 over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via an alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves a maximum total speedup of 7.45× for the piecewise quadratic solution over the CPU platform in double precision.
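Cyclic reduction parallelizes a tridiagonal solve by letting each even-numbered equation eliminate its two odd-numbered neighbors, halving the system at every level; in the block variant (BCR) the scalar coefficients become small blocks and the divisions become block solves. The kernel below is a hedged sketch of one scalar forward-reduction step with illustrative indexing, not the paper's blocked, optimized implementation.

    /* Hedged sketch: one forward-reduction level of scalar cyclic reduction
       for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]. Each thread updates
       one equation at the current stride; the host repeats with
       stride = 1, 2, 4, ... until one equation remains, then back-substitutes. */
    __global__ void cr_forward_step(double *a, double *b, double *c, double *d,
                                    int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x + 1) * 2 * stride - 1;
        if (i >= n) return;
        int im = i - stride, ip = i + stride;
        double k1 = a[i] / b[im];                       /* eliminate x[im] */
        double k2 = (ip < n) ? c[i] / b[ip] : 0.0;      /* eliminate x[ip] */
        b[i] -= c[im] * k1 + ((ip < n) ? a[ip] * k2 : 0.0);
        d[i] -= d[im] * k1 + ((ip < n) ? d[ip] * k2 : 0.0);
        a[i]  = -a[im] * k1;
        c[i]  = (ip < n) ? -c[ip] * k2 : 0.0;
    }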

10.
Sheet forming simulation is very important for vehicle body design. Owing to the increasing complexity and scale of CAE models, the trade-off between accuracy and efficiency has become a bottleneck in practice. Therefore, a parallel explicit finite element (FE) solver for sheet forming based on the graphics processing unit (GPU) architecture is developed. Implementation details with the Compute Unified Device Architecture (CUDA) are considered in this work. A pre-index strategy is suggested for parallelizing the assembly of nodal forces, and a parallel reduction method is introduced for calculating the global time step. To ensure the reliability and accuracy of the GPU-based program, double-precision floating-point arithmetic and intrinsic functions are used in the explicit FE computations. Simulations on a commercial NVIDIA GTX285 device obtain about a 27× speedup over an Intel Q8200 CPU, which demonstrates the efficiency of the parallel sheet forming simulation system.
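The stable global time step of an explicit FE code is the minimum over all per-element critical time steps, which is exactly a parallel min-reduction. The sketch below shows the standard shared-memory block stage and is an illustration of the pattern, not the paper's kernel; a second pass over block_min (or a host reduction) finishes the job.

    /* Hedged sketch: block-level min-reduction of per-element critical time
       steps. Launch as
         min_dt<<<nblocks, nthreads, nthreads * sizeof(double)>>>(...)
       with nthreads a power of two. */
    __global__ void min_dt(const double *elem_dt, double *block_min, int n)
    {
        extern __shared__ double sdt[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        sdt[tid] = (i < n) ? elem_dt[i] : 1.0e30;   /* pad with a huge value */
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && sdt[tid + s] < sdt[tid])
                sdt[tid] = sdt[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_min[blockIdx.x] = sdt[0];         /* per-block minimum */
    }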

11.
Dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs by using NVIDIA's Compute Unified Device Architecture (CUDA) in this paper. Data communication between the GPUs is executed based on POSIX threads. Compared with a single-GPU implementation, this implementation provides faster computation and more storage space, allowing simulations of significantly larger systems. In benchmarks, the performance of the GPUs is compared with that of Material Studio running on a single CPU core: we achieve more than a 90× speedup by using three C2050 GPUs to perform simulations of an 80×80×80 system. This implementation is applied to the study of the dispersancy of lubricant succinimide dispersants. A series of simulations is performed on lubricant–soot–dispersant systems to study the factors influencing dispersancy, including concentration and interaction with the lubricant, and the simulation results agree with the findings of our present work.
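The one-POSIX-thread-per-GPU pattern binds each host thread to one device, after which every CUDA call from that thread targets that device. A minimal sketch follows; run_dpd_partition is a hypothetical placeholder for the per-device simulation loop, and the count of three GPUs simply mirrors the benchmark above.

    #include <pthread.h>
    #include <cuda_runtime.h>

    void run_dpd_partition(int dev);    /* hypothetical per-device DPD loop */

    /* Hedged sketch: each host thread binds to one GPU with cudaSetDevice,
       so all subsequent CUDA calls in that thread run on that device. */
    static void *worker(void *arg)
    {
        int dev = *(int *)arg;
        cudaSetDevice(dev);
        run_dpd_partition(dev);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        int id[3] = {0, 1, 2};              /* e.g. three Tesla C2050 GPUs */
        for (int g = 0; g < 3; ++g)
            pthread_create(&t[g], NULL, worker, &id[g]);
        for (int g = 0; g < 3; ++g)
            pthread_join(t[g], NULL);       /* wait before host-side output */
        return 0;
    }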

12.
13.
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. In this paper, we develop a general-purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single precision floating point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction.

14.
Objective: In recent years, research in binocular stereo vision has increasingly shifted toward real-time strategies, and stereo cost aggregation is the most complex and time-consuming step in stereo vision. We therefore propose a near-real-time stereo cost aggregation algorithm based on general-purpose GPU computing (GPGPU). Method: A local method whose matching accuracy approaches that of global matching algorithms, linear stereo matching, is chosen as the cost aggregation strategy; following the principles of linear cost aggregation, the computation flows of its main steps (cost computation, mean filtering, coefficient solving, etc.) are parallelized and optimized in a targeted way. Results: For the same test samples, the proposed method computes the cost volume in less time on an NVIDIA GTX 780 platform; compared with the original CPU implementation, the efficiency of cost aggregation improves by tens of times on average. Conclusion: This near-real-time stereo cost aggregation method provides an efficient and reliable way to obtain high-quality stereo depth information in real time on ordinary PC platforms.
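Of the steps listed above, the mean filter maps particularly well onto the GPU. The sketch below shows a horizontal running-sum box filter with one thread per row (a vertical pass of the same form completes the 2D mean, assuming the filter radius is smaller than the image width); it illustrates the parallelization pattern only and is not the paper's implementation.

    /* Hedged sketch: horizontal box (mean) filter, one thread per row,
       using a sliding-window running sum so each output costs O(1). */
    __global__ void box_filter_rows(const float *in, float *out,
                                    int width, int height, int radius)
    {
        int y = blockIdx.x * blockDim.x + threadIdx.x;
        if (y >= height) return;
        const float *row = in + y * width;
        float *dst = out + y * width;
        float sum = 0.f;
        for (int x = 0; x <= radius; ++x)
            sum += row[x];                        /* prime the window at x = 0 */
        for (int x = 0; x < width; ++x) {
            int lo = x - radius - 1, hi = x + radius;
            if (hi < width && x > 0) sum += row[hi];   /* enter right sample */
            if (lo >= 0)             sum -= row[lo];   /* drop left sample */
            int n = min(hi, width - 1) - max(lo + 1, 0) + 1;  /* clamped size */
            dst[x] = sum / n;
        }
    }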

15.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the Shallow Water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

16.
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.

17.
We present an implementation approach for Marching Cubes (MC) on graphics hardware for OpenGL 2.0 or comparable graphics APIs. It currently outperforms all other known graphics processing unit (GPU)-based iso-surface extraction algorithms in direct rendering for sparse or large volumes, even those using the recently introduced geometry shader (GS) capabilities. To achieve this, we outfit the Histogram Pyramid (HP) algorithm, previously only used in GPU data compaction, with the capability for arbitrary data expansion. After reformulation of MC as a data compaction and expansion process, the HP algorithm becomes the core of a highly efficient and interactive MC implementation. For graphics hardware lacking GSs, such as mobile GPUs, the concept of HP data expansion is easily generalized, opening new application domains in mobile visual computing. Further, to serve recent developments, we present how the HP can be implemented in the parallel programming language CUDA (Compute Unified Device Architecture) by using a novel 1D chunk/layer construction.

18.
To improve the simulation efficiency of turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based lattice Boltzmann solution for large eddy simulation (LES) using multiple graphics processing units (GPUs) is proposed. Our solution adopts the "collision after propagation" lattice evolution order and folds the misaligned propagation phase into the global-memory read. The latest GPU platform allows a single CPU thread to control up to four GPUs running in parallel; to make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement the Smagorinsky model and the Vreman model respectively to verify our multi-GPU solution. These two LES models have different relaxation-time calculation behavior and lead to different CUDA implementation characteristics. The implementation based on the Smagorinsky model achieves a 190× speedup over the sequential implementation on the CPU, while the implementation based on the Vreman model achieves more than a 90× speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well on multiple GPUs. Large-scale LES–LBM simulation (up to 10,240 × 10,240 lattices) therefore becomes possible at low cost, even using double-precision floating-point calculation.
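In the "collision after propagation" (pull) ordering, each thread fetches the post-streaming populations from its neighbors during the global-memory read and then collides locally, confining the misaligned accesses to reads. The D2Q9/BGK kernel below is a hedged sketch of this ordering under an assumed array layout and periodic boundaries; the Smagorinsky and Vreman models of the paper would compute a local relaxation time in place of the constant tau.

    /* Hedged sketch: D2Q9 pull-scheme LBM step with BGK collision.
       Populations stored as f[q * nx * ny + y * nx + x]. */
    __global__ void lbm_pull_step(const float *f_in, float *f_out,
                                  int nx, int ny, float tau)
    {
        const int   cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        const int   cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
        const float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                             1.f/36, 1.f/36, 1.f/36, 1.f/36};
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;
        float f[9], rho = 0.f, ux = 0.f, uy = 0.f;
        for (int q = 0; q < 9; ++q) {     /* propagation: pull from upstream */
            int xs = (x - cx[q] + nx) % nx, ys = (y - cy[q] + ny) % ny;
            f[q] = f_in[q * nx * ny + ys * nx + xs];
            rho += f[q]; ux += cx[q] * f[q]; uy += cy[q] * f[q];
        }
        ux /= rho; uy /= rho;
        float usq = 1.5f * (ux * ux + uy * uy);
        for (int q = 0; q < 9; ++q) {     /* collision: BGK relaxation */
            float cu  = 3.f * (cx[q] * ux + cy[q] * uy);
            float feq = w[q] * rho * (1.f + cu + 0.5f * cu * cu - usq);
            f_out[q * nx * ny + y * nx + x] = f[q] - (f[q] - feq) / tau;
        }
    }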

19.
The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek efficient implementations of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or of ensembles of simulations used for sweeping parameters or providing statistically significant results.

Program summary

Program title: AESS
Catalogue identifier: AEJW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: University of Tennessee copyright agreement
No. of lines in distributed program, including test data, etc.: 10 861
No. of bytes in distributed program, including test data, etc.: 394 631
Distribution format: tar.gz
Programming language: C for processors, CUDA for NVIDIA GPUs
Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators.
Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS
Classification: 3, 16.12
Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME.
Solution method: The Accelerated Exact Stochastic Simulation (AESS) tool provides implementations of a wide variety of popular variations on the Gillespie method. Users can select the specific algorithm considered most appropriate. Comparisons between the methods and with other available implementations indicate that AESS provides the fastest known implementation of Gillespie's method for a variety of test models. Users may wish to execute ensembles of simulations to sweep parameters or to obtain better statistical results, so AESS supports acceleration of ensembles of simulations using parallel processing with MPI, SSE vector units on x86 processors, and/or NVIDIA GPUs with CUDA.
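For reference, one step of Gillespie's direct method, the core that AESS and its variants accelerate, can be sketched in plain C as follows. The propensities a[] are assumed precomputed, and rand() stands in for whatever random number generator an optimized implementation would actually use.

    #include <math.h>
    #include <stdlib.h>

    /* Hedged sketch of one step of Gillespie's direct method: draw the time
       to the next reaction from an exponential distribution with rate a0,
       then pick the firing reaction in proportion to its propensity.
       Returns the reaction index, or -1 if no reaction can fire. */
    int ssa_step(const double *a, int n_reactions, double *t)
    {
        double a0 = 0.0;
        for (int i = 0; i < n_reactions; ++i)
            a0 += a[i];                              /* total propensity */
        if (a0 <= 0.0) return -1;

        double r1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* uniform in (0,1) */
        double r2 = rand() / (double)RAND_MAX;

        *t += -log(r1) / a0;                         /* time of next event */

        double threshold = r2 * a0, partial = 0.0;   /* select reaction mu */
        int mu = 0;
        for (; mu < n_reactions - 1; ++mu) {
            partial += a[mu];
            if (partial >= threshold) break;
        }
        return mu;
    }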

20.
We present a highly general implementation of fast multipole methods on graphics processing units (GPUs). Our two-dimensional double precision code features an asymmetric type of adaptive space discretization leading to a particularly elegant and flexible implementation. All steps of the multipole algorithm are efficiently performed on the GPU, including the initial phase, which assembles the topological information of the input data. Through careful timing experiments, we investigate the effects of the various peculiarities of the GPU architecture.
