期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Massively parallel chemical potential calculation on graphics processing units

Kevin B. Daly Jay B. Benziger Pablo G. Debenedetti Athanassios Z. Panagiotopoulos 《Computer Physics Communications》2012,183(10):2054-2062

相似文献

2.

Medical image segmentation with deformable models on graphics processing units

Rigo Alvarado Juan J. Tapia Julio C. Rolón 《The Journal of supercomputing》2014,68(1):339-364

In this work, the parallel implementation of a segmentation algorithm based on the gradient vector flow (GVF) deformable model in a graphics processing unit (GPU) is presented. The proposed implementation focuses on the parallelization of the computation of the GVF field. In order to make a performance comparison of the proposed GPU algorithm, an OpenMP-based implementation is presented too. We also present an analysis of the textures and global memory performance in the computing of the GVF field. To improve the efficiency and the performance of the active contour segmentation, a novel snaxel reallocation method is proposed. The main advantage of the reallocation process is the small linear system needed to perform the segmentation and its low computational load. To assure the convergence of the active contour deformation, we propose a stopping criterion based on the root mean square error for the iterative solution of the evolution equations. 相似文献

3.

Special issue: General-purpose processing using graphics processing units

David R. Kaeli Miriam Leeser 《Journal of Parallel and Distributed Computing》2008

相似文献

4.

Genetic programming on graphics processing units 总被引：1，自引：0，他引：1

Denis Robilliard Virginie Marion-Poty Cyril Fonlupt 《Genetic Programming and Evolvable Machines》2009,10(4):447-471

The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. In a first work we have showed that this setup allows to develop fine grain parallelization schemes to evaluate several GP programs in parallel, while obtaining speedups for usual training sets and program sizes. Here we present another parallelization scheme and optimizations about program representation and use of GPU fast memory. This increases the computation speed about three times faster, up to 4 billion GP operations per second. The code has been developed within the well known ECJ library and is open source. 相似文献

5.

A survey of graph processing on graphics processing units

Ha-Nguyen Tran Erik Cambria 《The Journal of supercomputing》2018,74(5):2086-2115

Graphics processing units (GPUs) have become popular high-performance computing platforms for a wide range of applications. The trend of processing graph structures on modern GPUs has also attracted an increasing interest in recent years. This article aims to review research works on adapting the massively parallel architecture of GPUs to accelerate the performance of fundamental graph operations. Despite their merits, some factors such as the unique architecture of GPUs, limited programming models, and irregular structures of graphs prevent GPU implementations from achieving high performance. Thus, this survey also discusses challenges and optimization techniques used by recent studies to fully utilize the GPU capability. A categorization of the existing research works is also presented based on the specific issues these attempted to solve. 相似文献

6.

Polymer field-theory simulations on graphics processing units

Kris T. Delaney Glenn H. Fredrickson 《Computer Physics Communications》2013

We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA^® Tesla T20 series GPUs, we find double-precision speedups of up to 30×

30 \times

compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60×

60 \times

faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU. 相似文献

7.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. 相似文献

8.

Systolic neighborhood search on graphics processing units

Pablo Vidal Francisco Luna Enrique Alba 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2014,18(1):125-142

In this paper, we propose a parallel processing model based on systolic computing merged with concepts of evolutionary algorithms. The proposed model works over a Graphics Processing Unit using the structure of threads as cells that form a systolic mesh. Data passes through those cells, each one performing a simple computing operation. The systolic algorithm is implemented using NVIDIA’s compute unified device architecture. To investigate the behavior and performance of the proposed model we test it over a NP-complete problem. The study of systolic algorithms on GPU and the different versions of the proposal show that our canonical model is a competitive solver with efficacy and presents a good scalability behavior across different instance sizes. 相似文献

9.

Line-by-line spectroscopic simulations on graphics processing units

Sylvain Collange David Defour 《Computer Physics Communications》2008,178(2):135-143

We report here on software that performs line-by-line spectroscopic simulations on gases. Elaborate models (such as narrow band and correlated-K) are accurate and efficient for bands where various components are not simultaneously and significantly active. Line-by-line is probably the most accurate model in the infrared for blends of gases that contain high proportions of H₂O and CO₂ as this was the case for our prototype simulation. Our implementation on graphics processing units sustains a speedup close to 330 on computation-intensive tasks and 12 on memory intensive tasks compared to implementations on one core of high-end processors. This speedup is due to data parallelism, efficient memory access for specific patterns and some dedicated hardware operators only available in graphics processing units. It is obtained leaving most of processor resources available and it would scale linearly with the number of graphics processing units in parallel machines. Line-by-line simulation coupled with simulation of fluid dynamics was long believed to be economically intractable but our work shows that it could be done with some affordable additional resources compared to what is necessary to perform simulations on fluid dynamics alone.

Program summary

Program title: GPU4RECatalogue identifier: ADZY_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADZY_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.htmlNo. of lines in distributed program, including test data, etc.: 62 776No. of bytes in distributed program, including test data, etc.: 1 513 247Distribution format: tar.gzProgramming language: C++Computer: x86 PCOperating system: Linux, Microsoft Windows. Compilation requires either gcc/g++ under Linux or Visual C++ 2003/2005 and Cygwin under Windows. It has been tested using gcc 4.1.2 under Ubuntu Linux 7.04 and using Visual C++ 2005 with Cygwin 1.5.24 under Windows XP.RAM: 1 gigabyteClassification: 21.2External routines: OpenGL (http://www.opengl.org)Nature of problem: Simulating radiative transfer on high-temperature high-pressure gases.Solution method: Line-by-line Monte-Carlo ray-tracing.Unusual features: Parallel computations are moved to the GPU.Additional comments: nVidia GeForce 7000 or ATI Radeon X1000 series graphics processing unit is required.Running time: A few minutes. 相似文献

10.

Inferring physical units in formal models

Sebastian Krings Michael Leuschel 《Software and Systems Modeling》2017,16(1):25-47

Most state-based formal methods, like B, Event-B or Z, provide support for static typing. However, these methods and the associated tools lack support for annotating variables with (physical) units of measurement. There is thus no obvious way to reason about correct or incorrect usage of such units. We present a technique that analyzes the usage of physical units throughout B and Event-B machines infers missing units and notifies the user of incorrectly handled units. The technique combines abstract interpretation with classical animation, constraint solving and model checking and has been integrated into the ProB validation tool, both for classical B and for Event-B. It provides source-level feedback about errors detected in the models. We also describe how to extend our approach to TLA \(^+\), an untyped formal language. We provide an in-depth empirical evaluation and demonstrate that our technique scales up to real-life industrial models. 相似文献

11.

Grex: An efficient MapReduce framework for graphics processing units

Can Basaran Kyoung-Don Kang 《Journal of Parallel and Distributed Computing》2013

In this paper, we present a new MapReduce framework, called Grex, designed to leverage general purpose graphics processing units (GPUs) for parallel data processing. Grex provides several new features. First, it supports a parallel split method to tokenize input data of variable sizes, such as words in e-books or URLs in web documents, in parallel using GPU threads. Second, Grex evenly distributes data to map/reduce tasks to avoid data partitioning skews. In addition, Grex provides a new memory management scheme to enhance the performance by exploiting the GPU memory hierarchy. Notably, all these capabilities are supported via careful system design without requiring any locks or atomic operations for thread synchronization. The experimental results show that our system is up to 12.4× and 4.1× faster than two state-of-the-art GPU-based MapReduce frameworks for the tested applications. 相似文献

12.

A block-asynchronous relaxation method for graphics processing units

Hartwig Anzt Stanimire Tomov Jack Dongarra Vincent Heuveline 《Journal of Parallel and Distributed Computing》2013

In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPUs computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs- or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes–using multiple iterations within the “subdomain” handled by a GPU thread block–we manage to not only recover the loss of global convergence but often accelerate convergence of up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing. 相似文献

13.

Simulation of shallow-water systems using graphics processing units

Miguel Lastra José M. Mantas Carlos Ureña Manuel J. Castro José A. García-Rodríguez 《Mathematics and computers in simulation》2009

This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver. 相似文献

14.

Accelerating frequent itemset mining on graphics processing units

Fan Zhang Yan Zhang Jason D. Bakos 《The Journal of supercomputing》2013,66(1):94-117

In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while being constrained by the available memory capacity. In our approach, the vertical transaction lists are represented using a “bitset” representation and operated using wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observed a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU. 相似文献

15.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou Yong Dou Song Guo Rongchun Li Lin Deng 《Concurrency and Computation》2014,26(16):2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

16.

Solving diffractive optics problems using graphics processing units

D. L. Golovashkin N. L. Kasanskiy 《Optical Memory & Neural Networks》2011,20(2):85-89

Techniques for applying graphics processing units (GPU) to the general-purpose nongraphics computations proposed in recent years by the companies ATI (AMD FireStream, 2006) and NVIDIA (CUDA: Compute Unified Device Architecture, 2007) have given an impetus to developing algorithms and software packages for solving problems of diffractive optics with the aid of the GPU. The computations based on the wide-spread Ray Tracing method were among the first to be implemented using the GPU. The method attracted the attention of the CUDA technology architects, who proposed its GPU-based implementation at the conference NVISION08 (2008). The potential of this direction is associated both with the research into the general issues of mapping of the Ray Tracing method onto the GPU architecture (involving the use of various grid domains and trees) and with developing dedicated software packages (RTE and Linzik projects). In this work, a special attention is given to the overview of techniques for the GPU-aided implementation of the FDTD (finite-difference time-domain) method, which offers an instrument for solving problems of micro- and nanooptics using the rigorous electromagnetic theory. The review of the related papers ranges from the initial research (based on the use of textures) to the complete software solutions (like FDTD Software and FastFDTD). 相似文献

17.

Dynamic models of parallel computations for macropipelined programs

A. E. Doroshenko 《Cybernetics and Systems Analysis》1995,31(6):818-834

Conclusion We have considered dynamic models of parallel computations and transformation algorithms for macropipelined programs that increase the internal exchange asynchronism of the parallel components. The models generalize the well-known CSP synchronous exchange paradigm (the rendezvous mechanism) [18] and provide a theoretical justification for the development of a whole range of interconnected semantic models of asynchronous computation with increasing degree of exchange asynchronism. Some publications ensure asynchronous exchange only by buffering [15, 16]. The dynamic model approach developed in this study combines buffering and analysis of data dependences of the exchange operators. It not only reduces losses associated with synchronization of communicating parallel processes, but also ensures automatic resolution of some classes of data exchange deadlocks. Experience with dynamic models as a means of increasing the efficiency of macropipelined programs and multilevel memory in multiprocessor systems [12, 17] can be applied in other programming systems and parallel program design systems of both compiling (S1, S2) and interpreting (S3) type. Translated from Kibernetika i Sistemnyi Analiz, No. 6, pp. 45–65, November–December, 1995. 相似文献

18.

CFD-based analysis and two-level aerodynamic optimization on graphics processing units

I.C. Kampolis X.S. Trompoukis V.G. Asouti K.C. Giannakoglou 《Computer Methods in Applied Mechanics and Engineering》2010,199(9-12):712-722

This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids, from the CPU to the graphics processing unit (GPU; NVIDIA’s Ge-Force GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code.Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs) though with different evaluation tools each. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software. 相似文献

19.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implemented these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author?s knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. Accuracy evaluation by comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement depending on the graphics card used, the problem size, and the precision when comparing to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support the visualization of the simulation results and thus the whole MHD simulation and visualization process can be performed entirely on GPUs. 相似文献

20.

Parallelizing the Cellular Potts Model on graphics processing units

José Juan Tapia Roshan M. D'Souza 《Computer Physics Communications》2011,182(4):857-865

The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally addresses computational speed due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementations handle most terms in the Hamiltonian, including cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine level checkerboards with lock mechanisms using atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of >¹⁰⁶ cells with lattice sizes of up to 256³ on a single graphics card. Benchmarks show that our implementation runs ∼80× faster than serial implementations, and ∼5× faster than previous parallel implementations on computing clusters consisting of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations and are expected to greatly broaden the scope of CPM applications. 相似文献