1.

Boosting CUDA Applications with CPU–GPU Hybrid Computing

Changmin Lee Won Woo Ro Jean-Luc Gaudiot 《International journal of parallel programming》2014,42(2):384-404

This paper presents a cooperative heterogeneous computing framework that lets CUDA kernels, which are designed to run only on the GPU, also make efficient use of the available host CPU cores. The proposed system exploits coarse-grain thread-level parallelism across CPU and GPU at runtime, without any source recompilation. To this end, the paper describes three features: a work distribution module, a transparent memory space, and a global scheduling queue. With completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08$\times$ in the best case and 1.42$\times$ on average compared to the baseline GPU-only processing.
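The idea of a global scheduling queue shared by CPU and GPU can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `run_hybrid`, the device names, and the per-device chunk sizes are all hypothetical, and both "devices" are simulated here by ordinary Python threads that simply record which thread-block indices they took.

```python
import queue
import threading


def run_hybrid(num_blocks, workers):
    """Distribute block indices 0..num_blocks-1 over workers via one shared queue.

    `workers` maps a (hypothetical) device name to the number of blocks it
    grabs per dequeue. Returns {device name: list of block indices processed}.
    """
    work = queue.Queue()
    for block in range(num_blocks):
        work.put(block)

    done = {name: [] for name in workers}

    def worker(name, chunk):
        while True:
            grabbed = []
            try:
                for _ in range(chunk):
                    grabbed.append(work.get_nowait())
            except queue.Empty:
                pass  # queue drained; process whatever was grabbed
            if not grabbed:
                return
            # Placeholder for executing the kernel body on these blocks.
            done[name].extend(grabbed)

    threads = [
        threading.Thread(target=worker, args=(name, chunk))
        for name, chunk in workers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Assumed ratio: the "gpu" worker takes 8 blocks per dequeue, the "cpu" worker 1.
result = run_hybrid(64, {"gpu": 8, "cpu": 1})
```

Because every worker pulls from the same queue, a faster device naturally ends up with more blocks, and each block is processed exactly once regardless of how the threads interleave.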

2.

The 2D wavelet transform on emerging architectures: GPUs and multicores

Joaquín Franco Gregorio Bernabé Juan Fernández Manuel Ujaldón 《Journal of Real-Time Image Processing》2012,7(3):145-152

Because of the computational power of today's GPUs, they are increasingly being harnessed to assist CPUs in high-performance computing. In addition, a growing number of today's state-of-the-art supercomputers include commodity GPUs to deliver unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. In this work, we present a GPU implementation of an image processing application of growing popularity: the 2D fast wavelet transform (2D-FWT). Based on a pair of Quadrature Mirror Filters, a complete set of application-specific optimizations is developed from a CUDA perspective to achieve outstanding gains over a highly optimized version of 2D-FWT run on the CPU. An alternative approach based on the Lifting Scheme is also described in Franco et al. (Acceleration of the 2D wavelet transform for CUDA-enabled Devices, 2010). Then, we investigate hardware improvements such as multicores on the CPU side, and exploit them through thread-level parallelism using the OpenMP API and pthreads. Overall, the GPU exhibits better scalability and parallel performance on large-scale images, making it a solid alternative for computing the 2D-FWT compared with thread-level methods run on emerging multicore architectures.

3.

A matrix-free approach to efficient affine-linear image registration on CPU and GPU

Jan Rühaak Lars König Florian Tramnitzke Harald Köstler Jan Modersitzki 《Journal of Real-Time Image Processing》2017,13(1):205-225

This paper presents a generic approach to highly efficient image registration in two and three dimensions. Both monomodal and multimodal registration problems are considered. We focus on the important class of affine-linear transformations in a derivative-based optimization framework. Our main contribution is an explicit formulation of the objective function gradient and Hessian approximation that allows for very efficient, parallel derivative calculation with virtually no memory requirements. The flexible parallelism of our concept allows for direct implementation on various hardware platforms. Derivative calculations are fully matrix free and operate directly on the input data, thereby reducing the auxiliary space requirements from $\mathcal{O}(n)$ to $\mathcal{O}(1)$. The proposed approach is implemented on multicore CPU and GPU. Our GPU code outperforms a conventional matrix-based CPU implementation by more than two orders of magnitude, thus enabling usage in real-time scenarios. The computational properties of our approach are extensively evaluated, thereby demonstrating the performance gain for a variety of real-life medical applications.

4.

Predictive Modeling in a Polyhedral Optimization Space

Eunjung Park John Cavazos Louis-Noël Pouchet Cédric Bastoul Albert Cohen P. Sadayappan 《International journal of parallel programming》2013,41(5):704-750

High-level program optimizations, such as loop transformations, are critical for high performance on multi-core targets. However, complex sequences of loop transformations are often required to expose parallelism (both coarse-grain and fine-grain) and improve data locality. The polyhedral compilation framework has proved to be very effective at representing these complex sequences and restructuring compute-intensive applications, seamlessly handling perfectly and imperfectly nested loops. It models arbitrarily complex sequences of loop transformations in a unified mathematical framework, dramatically increasing the expressiveness (and expected effectiveness) of the loop optimization stage. Nevertheless, identifying the most effective loop transformations remains a major challenge: current state-of-the-art heuristics in polyhedral frameworks simply fail to expose good performance over a wide range of numerical applications. Their lack of effectiveness is mainly due to simplistic performance models that do not reflect the complexity of today's processors (CPU, cache behavior, etc.). We address the problem of selecting the best polyhedral optimizations with dedicated machine learning models, trained specifically on the target machine. We show that these models can quickly select high-performance optimizations with very limited iterative search. We decouple the problem of selecting good complex sequences of optimizations into two stages: (1) we narrow the set of candidate optimizations using static cost models to select the loop transformations that implement specific high-level optimizations (e.g., tiling, parallelism, etc.); (2) we predict the performance of each high-level complex optimization sequence with trained models that take as input a performance-counter characterization of the original program. Our end-to-end framework is validated using numerous benchmarks on two modern multi-core platforms. We investigate a variety of different machine learning algorithms and hardware counters, and we obtain performance improvements over production compilers ranging on average from $3.2\times$ to $8.7\times$, by running no more than 6 program variants from a polyhedral optimization space.

5.

Aircraft noise scattering prediction using different accelerator architectures

M. López-Portugués J. A. López-Fernández N. Díaz-Gracia R. G. Ayestarán José Ranilla 《The Journal of supercomputing》2014,70(2):612-622

In this work, we present a tool that exploits heterogeneous computing to calculate the noise scattered by an object from the pressure distribution over its surface and its normal derivative. The method mainly deals with a large Matrix–Vector Product where the matrix elements must be calculated on the fly in such a way that the problem fits in main memory. To prove the performance of the heterogeneous implementations, the tool is tested using one NVIDIA K20c GPU, one Intel Xeon Phi 5110P, and two Intel Xeon E5-2650 CPUs. The speedup of the accelerated implementations ranges from $3\times$ (Xeon Phi) to $8\times$ (Xeon Phi + K20c) when compared to our parallel CPU code with 32 threads. This work, combined with the authors' previous works for the computation of the acoustic pressure over the obstacle surface, results in a valuable toolset for noise control applications during aircraft design.

6.

A GPU implementation of a hybrid evolutionary algorithm: GPuEGO

J. M. García-Martínez E. M. Garzón P. M. Ortigosa 《The Journal of supercomputing》2014,70(2):684-695

The high computational requirements of global optimization algorithms, when used to solve real optimization problems, have led to different parallelization strategies on several parallel computing architectures. In this work, the Universal Evolutionary Global Optimizer is implemented in CUDA to run on GPU architectures (GPuEGO). This parallelization of the evolutionary multimodal optimization algorithm differs considerably from previous parallel implementations designed for shared or distributed memory processors. Because of the special characteristics of a GPU architecture, the original data structures are not valid, and it has been necessary to redefine them and all the functions that operate on them. With this approach, the acceleration factors achieved by GPuEGO range from 6.33$\times$ to 23.20$\times$, depending on the test function.

7.

GPU-accelerated simulations of mass-action kinetics models with cupSODA

Marco S. Nobile Paolo Cazzaniga Daniela Besozzi Giancarlo Mauri 《The Journal of supercomputing》2014,69(1):17-24

In recent years, graphics processing units (GPUs) have seen ever-growing use for a wide range of computational analyses in the life sciences. Despite its great potential, GPU computing risks remaining a niche for specialists because of the programming and optimization skills it requires. In this work we present cupSODA, a simulator of biological systems that exploits the remarkable memory bandwidth and computational capability of GPUs. cupSODA efficiently executes large numbers of simulations in parallel, which are usually required to investigate the emergent dynamics of a given biological system under different conditions. cupSODA works by automatically deriving the system of ordinary differential equations from a reaction-based mechanistic model, defined according to mass-action kinetics, and then applying the numerical integration algorithm LSODA. We show that cupSODA can achieve an 86$\times$ speedup on GPUs with respect to equivalent executions of LSODA on the CPU.

8.

Accelerating 2D orthogonal matching pursuit algorithm on GPU

Yuan Dai Dongjian He Yong Fang Long Yang 《The Journal of supercomputing》2014,69(3):1363-1381

The two-dimensional orthogonal matching pursuit (2D-OMP) algorithm is an extension of one-dimensional OMP (1D-OMP) with lower complexity and memory usage than 1D-OMP when applied to 2D sparse signal recovery. However, the major shortcoming of 2D-OMP remains its long computing time. To overcome this disadvantage, we develop a novel parallel design strategy for the 2D-OMP algorithm on a graphics processing unit (GPU). We first analyze the complexity of 2D-OMP and point out that the bottlenecks lie in matrix inversion and projection. After adopting a matrix inverse update strategy, which outperforms traditional methods, to reduce the complexity of the original matrix inversion, projection becomes the most time-consuming module. Hence, a parallel matrix–matrix multiplication based on a tiling strategy is used to accelerate the projection computation on the GPU. Moreover, a fast matrix–vector multiplication, a parallel reduction algorithm, and other parallel techniques are exploited to further boost the performance of 2D-OMP on the GPU. With a sensing matrix of size 128$\times$256 (176$\times$256, resp.) for a 256$\times$256 image, experimental results show that the parallel 2D-OMP achieves a 17$\times$ to 41$\times$ (24$\times$ to 62$\times$, resp.) speedup over the original C code compiled with the O2 optimization option. Higher speedups would be obtained for larger images.

9.

Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

Ayaz ul Hassan Khan Mayez Al-Mouhamed Allam Fatayer Nazeeruddin Mohammad 《International journal of parallel programming》2016,44(4):801-830

Many-core systems are basically designed for applications having large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on Strassen and Winograd algorithms (S-MM and W-MM) on many-core. A depth-first (DFS) traversal of a recursion tree is used where all cores work in parallel on computing each of the $N \times N$ sub-matrices, which are computed in sequence. DFS reduces storage at the cost of large data movement to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invoking an efficient library (CUBLAS 5.5) for basic functions, and (3) parameter tuning of a parametric kernel to improve resource occupancy. Evaluation of S-MM and W-MM is carried out on GPU and MIC (Xeon Phi). For GPU, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for arrays satisfying $N \ge 2048$ and $N \ge 3072$, respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to the NVIDIA SDK library, S-MM and W-MM achieve a speedup between 20$\times$ and 80$\times$ for the above arrays. For MIC, two-recursion S-MM with reordering is faster than the MKL library by 14–26% for $N \ge 1024$. The proposed implementations achieve 2.35 TFLOPS (67% of peak) on GPU and 0.5 TFLOPS (21% of peak) on MIC. Similar encouraging results are obtained for a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.
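For readers unfamiliar with limited-recursion Strassen multiplication, the following is a minimal sequential sketch. It is not the paper's many-core code: the helper names and the `levels` parameter are illustrative assumptions, and the plain triple-loop `mat_mul` stands in for the tuned library call (such as a BLAS GEMM) that would be used once the recursion budget is spent. Only the Strassen variant is shown; the Winograd variant rearranges the additions but is not reproduced here.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_mul(A, B):
    """Classical multiplication; stands in for an optimized library routine."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def split(M):
    """Split a square matrix of even order into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, levels=1):
    """Multiply square matrices using at most `levels` Strassen recursions."""
    n = len(A)
    if levels == 0 or n % 2 != 0:
        return mat_mul(A, B)  # recursion budget spent or odd order: fall back
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    def rec(X, Y):
        return strassen(X, Y, levels - 1)

    # Seven sub-products instead of the eight used by blocked multiplication.
    M1 = rec(mat_add(A11, A22), mat_add(B11, B22))
    M2 = rec(mat_add(A21, A22), B11)
    M3 = rec(A11, mat_sub(B12, B22))
    M4 = rec(A22, mat_sub(B21, B11))
    M5 = rec(mat_add(A11, A12), B22)
    M6 = rec(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = rec(mat_sub(A12, A22), mat_add(B21, B22))

    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


A = [[4 * i + j for j in range(4)] for i in range(4)]
B = [[i - j for j in range(4)] for i in range(4)]
C = strassen(A, B, levels=1)
```

Capping `levels` at one or two keeps the extra additions and temporary storage small, which matches the abstract's observation that only a few recursion levels are worthwhile.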

10.

Evaluating the SAT problem on P systems for different high-performance architectures

José M. Cecilia José M. García Ginés D. Guerrero Manuel Ujaldón 《The Journal of supercomputing》2014,69(1):248-272

Membrane computing is an emerging research area studying the behavior of living cells to define bio-inspired computing devices, also called P systems. Such devices provide polynomial time solutions to NP-complete problems by trading time for space. The efficient simulation of P systems poses three major challenges: an intrinsic massive parallelism of P systems, an exponential computational workspace, and a non-intensive floating point nature. This paper analyzes the simulation of a family of recognizer P systems with active membranes that solves the satisfiability problem in linear time on three different architectures: a shared memory multiprocessor, a distributed memory system, and a manycore graphics processing unit (GPU). For an efficient handling of the exponential workspace created by the P systems computation, we enable different data policies on those architectures to increase memory bandwidth and exploit data locality through tiling. Parallelism inherent to the target P system is also managed on each architecture to demonstrate that GPUs offer a valid alternative for high-performance computing at a considerably lower cost. Our results lead to execution time improvements exceeding 310$\times$ and 78$\times$, respectively, for a much cheaper high-performance alternative.

11.

An efficient parallel solution for Caputo fractional reaction–diffusion equation

Chunye Gong Weimin Bao Guojian Tang Bo Yang Jie Liu 《The Journal of supercomputing》2014,68(3):1521-1537

The computational complexity of the Caputo fractional reaction–diffusion equation is $O(MN^2)$, compared with $O(MN)$ for the traditional reaction–diffusion equation, where $M$ and $N$ are the number of time steps and grid points, respectively. An efficient parallel solution for the Caputo fractional reaction–diffusion equation with an explicit difference method is proposed. The parallel solution, which is implemented with the MPI parallel programming model, consists of three procedures: preprocessing, parallel solver, and postprocessing. The parallel solver involves parallel tridiagonal matrix-vector multiplication, vector-vector addition, and constant-vector multiplication. The sum of constant-vector multiplications is optimized. To the authors' knowledge, this is the first parallel solution for the Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution compares well with the analytic solution. The parallel solution on a single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on a single X5540 CPU core, and scales quite well on a distributed memory cluster system.

12.

Data Parallel Implementation of Belief Propagation in Factor Graphs on Multi-core Platforms

Nam Ma Yinglong Xia Viktor K. Prasanna 《International journal of parallel programming》2014,42(1):219-237

We investigate data parallel techniques for belief propagation in acyclic factor graphs on multi-core systems. Belief propagation is a key inference algorithm in factor graphs, a probabilistic graphical model that has found applications in many domains. In this paper, we explore data parallelism for basic operations over the potential tables in belief propagation. Data parallel techniques for these table operations are developed for shared memory platforms. We then propose a complete belief propagation algorithm using these table operations to perform exact inference in factor graphs. The proposed algorithms are implemented on state-of-the-art multi-socket multi-core systems with additional NUMA-aware optimizations. Our proposed algorithms exhibit good scalability using a representative set of factor graphs. On a four-socket Intel Westmere-EX system with 40 cores, we achieve a 39.5$\times$ speedup for the table operations and a 39$\times$ speedup for the complete algorithm using factor graphs with large potential tables.

13.

Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture

Jie Yan Guangming Tan Ninghui Sun 《The Journal of supercomputing》2014,69(3):1462-1490


14.

SCOPE: parallel databases meet MapReduce

Jingren Zhou Nicolas Bruno Ming-Chuan Wu Per-Ake Larson Ronnie Chaiken Darren Shakib 《The VLDB Journal The International Journal on Very Large Data Bases》2012,21(5):611-636


15.

High-Performance Computation of Bézier Surfaces on Parallel and Heterogeneous Platforms

Rafael Palomar Juan Gómez-Luna Faouzi A. Cheikh Joaquín Olivares-Bueno Ole J. Elle 《International journal of parallel programming》2018,46(6):1035-1062

Bézier surfaces are mathematical tools employed in a wide variety of applications. Some works in the literature propose parallelization strategies to improve performance for the computation of Bézier surfaces. These approaches, however, are mainly focused on graphics applications and often are not directly applicable to other domains. In this work, we propose a new method for the computation of Bézier surfaces, together with approaches to efficiently map the method onto different platforms (CPUs, discrete and integrated GPUs). Additionally, we explore CPU–GPU cooperation mechanisms for computing Bézier surfaces using two integrated heterogeneous systems with different characteristics. An exhaustive performance evaluation (including different data types, rendering, and several hardware platforms) is performed. The results show that our method achieves speedups as high as 3.12x (double-precision) and 2.47x (single-precision) on CPU, and 3.69x (double-precision) and 13.14x (single-precision) on GPU compared to other methods in the literature. In heterogeneous platforms, the CPU–GPU cooperation increases the performance up to 2.09x with respect to the GPU-only version. Our method and the associated parallelization approaches can be easily employed in domains other than computer graphics (e.g., image registration, bio-mechanical modeling and flow simulation), and extended to other Bézier formulations and Bézier constructions of higher order than surfaces.

16.

libMesh: a C++ library for parallel adaptive mesh refinement/coarsening simulations

Benjamin S. Kirk John W. Peterson Roy H. Stogner Graham F. Carey 《Engineering with Computers》2006,22(3-4):237-254

In this paper we describe the libMesh (http://libmesh.sourceforge.net) framework for parallel adaptive finite element applications. libMesh is an open-source software library that has been developed to facilitate serial and parallel simulation of multiscale, multiphysics applications using adaptive mesh refinement and coarsening strategies. The main software development is being carried out in the CFDLab (http://cfdlab.ae.utexas.edu) at the University of Texas, but, as with other open-source software projects, contributions are being made elsewhere in the US and abroad. The main goals of this article are: (1) to provide a basic reference source that describes libMesh and the underlying philosophy and software design approach; (2) to give sufficient detail and references on the adaptive mesh refinement and coarsening (AMR/C) scheme for applications analysts and developers; and (3) to describe the parallel implementation and data structures with supporting discussion of domain decomposition, message passing, and details related to dynamic repartitioning for parallel AMR/C. Other aspects related to C++ programming paradigms, reusability for diverse applications, adaptive modeling, physics-independent error indicators, and similar concepts are briefly discussed. Finally, results from some applications using the library are presented and areas of future research are discussed.

17.

A Speculative Parallel DFA Membership Test for Multicore, SIMD and Cloud Computing Environments

Yousun Ko Minyoung Jung Yo-Sub Han Bernd Burgstaller 《International journal of parallel programming》2014,42(3):456-489

We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into chunks, match chunks in parallel, and combine the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. Unlike previous approaches, our speculation is failure-free, i.e., (1) sequential semantics are maintained, and (2) speed-downs are avoided altogether. On architectures with a SIMD gather-operation for indexed memory loads, our matching operation is fully vectorized. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures such as cloud computing environments. We evaluated our speculative DFA membership test for a representative set of benchmarks from the Perl-compatible Regular Expression (PCRE) library and the PROSITE protein database. Evaluation was conducted on a 4 CPU (40 cores) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288 cores) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order of $\mathcal{O}\left(1+\frac{|P|-1}{|Q|\cdot\gamma}\right)$, where $|P|$ denotes the number of processors or SIMD units, $|Q|$ denotes the number of DFA states, and $0<\gamma\le 1$ represents a statically computed DFA property. For all observed cases, we found that $0.02<\gamma<0.47$. Actual speedups range from 2.3$\times$ to 38.8$\times$ for up to 512 DFA states for PCRE, and between 1.3$\times$ and 19.9$\times$ for up to 1,288 DFA states for PROSITE on a 40-core MTL node. Speedups on the EC2 computing cloud range from 5.0$\times$ to 65.8$\times$ for PCRE, and from 5.0$\times$ to 138.5$\times$ for PROSITE. Speedups of our C-based DFA matcher over the Perl-based ScanProsite scan tool range from 559.3$\times$ to 15079.7$\times$ on a 40-core MTL node. We show the scalability of our approach for input sizes of up to 10 GB.
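The partition, match, and combine steps can be illustrated with a small sketch. It is a simplified assumption-laden version, not the authors' algorithm: every chunk after the first is matched from all DFA states, whereas the paper exploits structural DFA properties to reduce that speculative work. The function names, the dictionary-based transition table, and the example automaton (even number of `a` characters) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chunk(delta, chunk, start_states):
    """Return {start state: end state} after consuming `chunk` from each start."""
    result = {}
    for start in start_states:
        state = start
        for symbol in chunk:
            state = delta[state][symbol]
        result[start] = state
    return result


def sequential_match(delta, start, accepting, text):
    return run_chunk(delta, text, [start])[start] in accepting


def parallel_match(delta, start, accepting, text, num_chunks):
    """Speculative membership test: chunks are matched independently, then chained."""
    size = max(1, -(-len(text) // num_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)] or [""]
    all_states = list(delta)
    with ThreadPoolExecutor() as pool:
        futures = [
            # Only the first chunk knows its true start state; later chunks
            # speculate by trying every state, so no result is ever discarded.
            pool.submit(run_chunk, delta, chunk,
                        [start] if index == 0 else all_states)
            for index, chunk in enumerate(chunks)
        ]
        maps = [future.result() for future in futures]
    state = start
    for mapping in maps:  # combine: follow the state maps in input order
        state = mapping[state]
    return state in accepting


# Example DFA over {a, b}: state 0 (accepting) means an even number of 'a's.
even_a = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}
accepted = parallel_match(even_a, 0, {0}, "abab" * 5, num_chunks=4)
rejected = parallel_match(even_a, 0, {0}, "aaba", num_chunks=3)
```

Trying all $|Q|$ states per chunk is what makes the speculation failure-free in this sketch, and it is also why the achievable speedup shrinks as the number of DFA states grows, in line with the $|Q|$ term in the abstract's speedup bound.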

18.

A Parallel Dynamic Binary Translator for Efficient Multi-Core Simulation

Oscar Almer Igor Böhm Tobias Edler von Koch Björn Franke Stephen Kyle Volker Seeker Christopher Thompson Nigel Topham 《International journal of parallel programming》2013,41(2):212-235

In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry standard SPLASH-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations.

19.

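Design and Implementation of a CUDA-Based Parallel Cuckoo Search Algorithm (total citations: 1; self-citations: 0; citations by others: 1)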

韦向远; 杨辉华; 谢谱模 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology) 2014,(6):665-673

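The cuckoo search (CS) algorithm is an intelligent metaheuristic developed in recent years that has been applied successfully to a variety of optimization problems. To address the excessive computation time of CS on big-data and large-scale complex problems, this paper proposes a parallel cuckoo search algorithm based on the compute unified device architecture (CUDA). The parallel implementation combines task parallelism with data parallelism: graphics processing unit (GPU) thread blocks are mapped to cuckoo individuals and threads to each dimension of an individual's data, so that the nest position update, individual fitness evaluation, nest rebuilding, and best-individual search operations of CS all run in parallel. The entire optimization iteration of CS is carried out on the GPU, which reduces the CPU-GPU communication overhead during computation. Simulation experiments on four classic benchmark functions show that, compared with the standard CS algorithm, the CUDA-based parallel CS algorithm achieves a speedup of up to 110 times while maintaining consistent convergence.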

20.

ParFUM: a parallel framework for unstructured meshes for scalable dynamic physics applications (total citations: 1; self-citations: 0; citations by others: 1)

Orion S. Lawlor Sayantan Chakravorty Terry L. Wilmarth Nilesh Choudhury Isaac Dooley Gengbin Zheng Laxmikant V. Kalé 《Engineering with Computers》2006,22(3-4):215-235

Unstructured meshes are used in many engineering applications with irregular domains, from elastic deformation problems to crack propagation to fluid flow. Because of their complexity and dynamic behavior, the development of scalable parallel software for these applications is challenging. The Charm++ Parallel Framework for Unstructured Meshes allows one to write parallel programs that operate on unstructured meshes with only minimal knowledge of parallel computing, while making it possible to achieve excellent scalability even for complex applications. Charm++'s message-driven model enables computation/communication overlap, while its run-time load balancing capabilities make it possible to react to the changes in computational load that occur in dynamic physics applications. The framework is highly flexible and has been enhanced with numerous capabilities for the manipulation of unstructured meshes, such as parallel mesh adaptivity and collision detection.