期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Genetic programming on graphics processing units 总被引：1，自引：0，他引：1

Denis Robilliard Virginie Marion-Poty Cyril Fonlupt 《Genetic Programming and Evolvable Machines》2009,10(4):447-471

The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. In a first work we have showed that this setup allows to develop fine grain parallelization schemes to evaluate several GP programs in parallel, while obtaining speedups for usual training sets and program sizes. Here we present another parallelization scheme and optimizations about program representation and use of GPU fast memory. This increases the computation speed about three times faster, up to 4 billion GP operations per second. The code has been developed within the well known ECJ library and is open source. 相似文献

2.

Application of numerical optimization methods to molecular docking on graphics processing units

M. A. Farkov A. I. Legalov 《Automatic Control and Computer Sciences》2016,50(7):471-476

This paper is devoted to analyzing numerical optimization methods for solving the problem of molecular docking. Some additional requirements for optimization methods that take into account certain architectural features of graphics processing units (GPUs) have been formulated. A promising optimization method for use on graphics processors has been selected, its implementation is described, and its efficiency and accuracy have been estimated. 相似文献

3.

High resolution topology optimization using graphics processing units (GPUs)

Vivien J. Challis Anthony P. Roberts Joseph F. Grotowski 《Structural and Multidisciplinary Optimization》2014,49(2):315-325

We present a Graphics Processing Unit (GPU) implementation of the level set method for topology optimization. The solution of three-dimensional topology optimization problems with millions of elements becomes computationally tractable with this GPU implementation and NVIDIA supercomputer-grade GPUs. We demonstrate this by solving the inverse homogenization problem for the design of isotropic materials with maximized bulk modulus. We trace the maximum bulk modulus optimization results to very high porosities to demonstrate the detail achievable with a high computational resolution. By utilizing a parallel GPU implementation rather than a sequential CPU implementation, similar increases in tractable computational resolution would be expected for other topology optimization problems. 相似文献

4.

Polymer field-theory simulations on graphics processing units

Kris T. Delaney Glenn H. Fredrickson 《Computer Physics Communications》2013

We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA^® Tesla T20 series GPUs, we find double-precision speedups of up to 30×

30 \times

compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60×

60 \times

faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU. 相似文献

5.

Line-by-line spectroscopic simulations on graphics processing units

Sylvain Collange David Defour 《Computer Physics Communications》2008,178(2):135-143

We report here on software that performs line-by-line spectroscopic simulations on gases. Elaborate models (such as narrow band and correlated-K) are accurate and efficient for bands where various components are not simultaneously and significantly active. Line-by-line is probably the most accurate model in the infrared for blends of gases that contain high proportions of H₂O and CO₂ as this was the case for our prototype simulation. Our implementation on graphics processing units sustains a speedup close to 330 on computation-intensive tasks and 12 on memory intensive tasks compared to implementations on one core of high-end processors. This speedup is due to data parallelism, efficient memory access for specific patterns and some dedicated hardware operators only available in graphics processing units. It is obtained leaving most of processor resources available and it would scale linearly with the number of graphics processing units in parallel machines. Line-by-line simulation coupled with simulation of fluid dynamics was long believed to be economically intractable but our work shows that it could be done with some affordable additional resources compared to what is necessary to perform simulations on fluid dynamics alone.

Program summary

Program title: GPU4RECatalogue identifier: ADZY_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADZY_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.htmlNo. of lines in distributed program, including test data, etc.: 62 776No. of bytes in distributed program, including test data, etc.: 1 513 247Distribution format: tar.gzProgramming language: C++Computer: x86 PCOperating system: Linux, Microsoft Windows. Compilation requires either gcc/g++ under Linux or Visual C++ 2003/2005 and Cygwin under Windows. It has been tested using gcc 4.1.2 under Ubuntu Linux 7.04 and using Visual C++ 2005 with Cygwin 1.5.24 under Windows XP.RAM: 1 gigabyteClassification: 21.2External routines: OpenGL (http://www.opengl.org)Nature of problem: Simulating radiative transfer on high-temperature high-pressure gases.Solution method: Line-by-line Monte-Carlo ray-tracing.Unusual features: Parallel computations are moved to the GPU.Additional comments: nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required.Running time: A few minutes. 相似文献

6.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. 相似文献

7.

Systolic neighborhood search on graphics processing units

Pablo Vidal Francisco Luna Enrique Alba 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2014,18(1):125-142

In this paper, we propose a parallel processing model based on systolic computing merged with concepts of evolutionary algorithms. The proposed model works over a Graphics Processing Unit using the structure of threads as cells that form a systolic mesh. Data passes through those cells, each one performing a simple computing operation. The systolic algorithm is implemented using NVIDIA’s compute unified device architecture. To investigate the behavior and performance of the proposed model we test it over a NP-complete problem. The study of systolic algorithms on GPU and the different versions of the proposal show that our canonical model is a competitive solver with efficacy and presents a good scalability behavior across different instance sizes. 相似文献

8.

A survey of graph processing on graphics processing units

Ha-Nguyen Tran Erik Cambria 《The Journal of supercomputing》2018,74(5):2086-2115

Graphics processing units (GPUs) have become popular high-performance computing platforms for a wide range of applications. The trend of processing graph structures on modern GPUs has also attracted an increasing interest in recent years. This article aims to review research works on adapting the massively parallel architecture of GPUs to accelerate the performance of fundamental graph operations. Despite their merits, some factors such as the unique architecture of GPUs, limited programming models, and irregular structures of graphs prevent GPU implementations from achieving high performance. Thus, this survey also discusses challenges and optimization techniques used by recent studies to fully utilize the GPU capability. A categorization of the existing research works is also presented based on the specific issues these attempted to solve. 相似文献

9.

Accelerating frequent itemset mining on graphics processing units

Fan Zhang Yan Zhang Jason D. Bakos 《The Journal of supercomputing》2013,66(1):94-117

In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while being constrained by the available memory capacity. In our approach, the vertical transaction lists are represented using a “bitset” representation and operated using wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observed a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU. 相似文献

10.

Special issue: General-purpose processing using graphics processing units

David R. Kaeli Miriam Leeser 《Journal of Parallel and Distributed Computing》2008

相似文献

11.

基于宇宙计算的图形处理器算法实现

郭祖华贾积身马世霞《计算机应用研究》2014,(11)

下一代观测望远镜将会产生数以亿计的星系测量数据值,这将导致使用中央处理器处理数据时效率低下、成本较高。为了解决这一问题,提出了基于宇宙计算的图形处理器算法。研究了两点式角相关函数以及孔径质量统计这两种宇宙学的计算方法,构建算法代码,并使用统一计算设备架构在图形处理器上实现了这两种算法;比较了算法在中央处理器和图形处理器上使用的运行速度。实验结果表明,与中央处理器相比,使用图形处理器的计算速度得到了显著提高。相似文献

12.

Massively parallel chemical potential calculation on graphics processing units

Kevin B. Daly Jay B. Benziger Pablo G. Debenedetti Athanassios Z. Panagiotopoulos 《Computer Physics Communications》2012,183(10):2054-2062

相似文献

13.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implemented these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author?s knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. Accuracy evaluation by comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement depending on the graphics card used, the problem size, and the precision when comparing to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support the visualization of the simulation results and thus the whole MHD simulation and visualization process can be performed entirely on GPUs. 相似文献

14.

Parallelizing the Cellular Potts Model on graphics processing units

José Juan Tapia Roshan M. D'Souza 《Computer Physics Communications》2011,182(4):857-865

The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally addresses computational speed due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementations handle most terms in the Hamiltonian, including cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine level checkerboards with lock mechanisms using atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of >¹⁰⁶ cells with lattice sizes of up to 256³ on a single graphics card. Benchmarks show that our implementation runs ∼80× faster than serial implementations, and ∼5× faster than previous parallel implementations on computing clusters consisting of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations and are expected to greatly broaden the scope of CPM applications. 相似文献

15.

Solving diffractive optics problems using graphics processing units

D. L. Golovashkin N. L. Kasanskiy 《Optical Memory & Neural Networks》2011,20(2):85-89

Techniques for applying graphics processing units (GPU) to the general-purpose nongraphics computations proposed in recent years by the companies ATI (AMD FireStream, 2006) and NVIDIA (CUDA: Compute Unified Device Architecture, 2007) have given an impetus to developing algorithms and software packages for solving problems of diffractive optics with the aid of the GPU. The computations based on the wide-spread Ray Tracing method were among the first to be implemented using the GPU. The method attracted the attention of the CUDA technology architects, who proposed its GPU-based implementation at the conference NVISION08 (2008). The potential of this direction is associated both with the research into the general issues of mapping of the Ray Tracing method onto the GPU architecture (involving the use of various grid domains and trees) and with developing dedicated software packages (RTE and Linzik projects). In this work, a special attention is given to the overview of techniques for the GPU-aided implementation of the FDTD (finite-difference time-domain) method, which offers an instrument for solving problems of micro- and nanooptics using the rigorous electromagnetic theory. The review of the related papers ranges from the initial research (based on the use of textures) to the complete software solutions (like FDTD Software and FastFDTD). 相似文献

16.

A block-asynchronous relaxation method for graphics processing units

Hartwig Anzt Stanimire Tomov Jack Dongarra Vincent Heuveline 《Journal of Parallel and Distributed Computing》2013

In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPUs computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs- or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the “subdomain” handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing. 相似文献

17.

Exact diagonalization of the Hubbard model on graphics processing units

T. Siro A. Harju 《Computer Physics Communications》2012,183(9):1884-1889

We solve the Hubbard model with the exact diagonalization method on a graphics processing unit (GPU). We benchmark our GPU program against a sequential CPU code by using the Lanczos algorithm to solve the ground state energy in two cases: a one-dimensional ring and a two-dimensional square lattice. In the one-dimensional case, we obtain speedups of over 100 and 60 in single and double precision arithmetic, respectively. In the two-dimensional case, the corresponding speedups are over 110 and 70. 相似文献

18.

Simulation of shallow-water systems using graphics processing units

Miguel Lastra José M. Mantas Carlos Ureña Manuel J. Castro José A. García-Rodríguez 《Mathematics and computers in simulation》2009

This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver. 相似文献

19.

Medical image segmentation with deformable models on graphics processing units

Rigo Alvarado Juan J. Tapia Julio C. Rolón 《The Journal of supercomputing》2014,68(1):339-364

In this work, the parallel implementation of a segmentation algorithm based on the gradient vector flow (GVF) deformable model in a graphics processing unit (GPU) is presented. The proposed implementation focuses on the parallelization of the computation of the GVF field. In order to make a performance comparison of the proposed GPU algorithm, an OpenMP-based implementation is presented too. We also present an analysis of the textures and global memory performance in the computing of the GVF field. To improve the efficiency and the performance of the active contour segmentation, a novel snaxel reallocation method is proposed. The main advantage of the reallocation process is the small linear system needed to perform the segmentation and its low computational load. To assure the convergence of the active contour deformation, we propose a stopping criterion based on the root mean square error for the iterative solution of the evolution equations. 相似文献

20.

Resolving small random symmetric linear systems on graphics processing units

L. A. Abbas-Turki S. Graillat 《The Journal of supercomputing》2017,73(4):1360-1386

This paper focuses on the resolution of a large number of small random symmetric linear systems and its parallel implementation in single precision on graphics processing units (GPUs). The computations involved by each linear system are independent from the others, and the number of unknowns does not exceed 64. For this purpose, we present the adaptation to our context of largely used methods that include: LDLt factorization, Householder reduction to a tridiagonal matrix, parallel cyclic reduction (PCR) that is not a power of two and the divide and conquer algorithm for tridiagonal eigenproblems. We not only detail the implementation and optimization of each method, but we also compare the sustainability of each solution and its performance which include both parallel complexity and cache memory occupation. In the context of solving a large number of small random linear systems on GPUs with no information about their conditioning, our research indicates that the best strategy requires the use of Householder tridiagonalization + PCR followed if necessary by a divide and conquer diagonalization. 相似文献