首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08 $\times $ in the best case and 1.42 $\times $ on average compared to the baseline GPU-only processing.  相似文献   

2.
Because of the computational power of today??s GPUs, they are starting to be harnessed more and more to help out CPUs on high-performance computing. In addition, an increasing number of today??s state-of-the-art supercomputers include commodity GPUs to bring us unprecedented levels of performance in terms of raw GFLOPS and GFLOPS/cost. In this work, we present a GPU implementation of an image processing application of growing popularity: The 2D fast wavelet transform (2D-FWT). Based on a pair of Quadrature Mirror Filters, a complete set of application-specific optimizations are developed from a CUDA perspective to achieve outstanding factor gains over a highly optimized version of 2D-FWT run in the CPU. An alternative approach based on the Lifting Scheme is also described in Franco et al. (Acceleration of the 2D wavelet transform for CUDA-enabled Devices, 2010). Then, we investigate hardware improvements like multicores on the CPU side, and exploit them at thread-level parallelism using the OpenMP API and pthreads . Overall, the GPU exhibits better scalability and parallel performance on large-scale images to become a solid alternative for computing the 2D-FWT versus those thread-level methods run on emerging multicore architectures.  相似文献   

3.
This paper presents a generic approach to highly efficient image registration in two and three dimensions. Both monomodal and multimodal registration problems are considered. We focus on the important class of affine-linear transformations in a derivative-based optimization framework. Our main contribution is an explicit formulation of the objective function gradient and Hessian approximation that allows for very efficient, parallel derivative calculation with virtually no memory requirements. The flexible parallelism of our concept allows for direct implementation on various hardware platforms. Derivative calculations are fully matrix free and operate directly on the input data, thereby reducing the auxiliary space requirements from \({\mathcal {O}}(n)\) to \({\mathcal {O}}(1)\). The proposed approach is implemented on multicore CPU and GPU. Our GPU code outperforms a conventional matrix-based CPU implementation by more than two orders of magnitude, thus enabling usage in real-time scenarios. The computational properties of our approach are extensively evaluated, thereby demonstrating the performance gain for a variety of real-life medical applications.  相似文献   

4.
High-level program optimizations, such as loop transformations, are critical for high performance on multi-core targets. However, complex sequences of loop transformations are often required to expose parallelism (both coarse-grain and fine-grain) and improve data locality. The polyhedral compilation framework has proved to be very effective at representing these complex sequences and restructuring compute-intensive applications, seamlessly handling perfectly and imperfectly nested loops. It models arbitrarily complex sequences of loop transformations in a unified mathematical framework, dramatically increasing the expressiveness (and expected effectiveness) of the loop optimization stage. Nevertheless identifying the most effective loop transformations remains a major challenge: current state-of-the-art heuristics in polyhedral frameworks simply fail to expose good performance over a wide range of numerical applications. Their lack of effectiveness is mainly due to simplistic performance models that do not reflect the complexity today’s processors (CPU, cache behavior, etc.). We address the problem of selecting the best polyhedral optimizations with dedicated machine learning models, trained specifically on the target machine. We show that these models can quickly select high-performance optimizations with very limited iterative search. We decouple the problem of selecting good complex sequences of optimizations in two stages: (1) we narrow the set of candidate optimizations using static cost models to select the loop transformations that implement specific high-level optimizations (e.g., tiling, parallelism, etc.); (2) we predict the performance of each high-level complex optimization sequence with trained models that take as input a performance-counter characterization of the original program. Our end-to-end framework is validated using numerous benchmarks on two modern multi-core platforms. We investigate a variety of different machine learning algorithms and hardware counters, and we obtain performance improvements over productions compilers ranging on average from $3.2\times $ to $8.7\times $ , by running not more than $6$ program variants from a polyhedral optimization space.  相似文献   

5.
In this work, we present a tool that exploits heterogeneous computing to calculate the noise scattered by an object from the pressure distribution over its surface and its normal derivative. The method mainly deals with a large Matrix–Vector Product where the matrix elements must be calculated on the fly in such a way that the problem fits in main memory. To prove the performance of the heterogeneous implementations, the tool is tested using one NVIDIA K20c GPU, one Intel Xeon Phi 5110P, and two Intel Xeon E5-2650 CPUs. The speedup of the accelerated implementations ranges from \(3\times \) (Xeon Phi) to \(8\times \) (Xeon Phi  \(+\)  K20c) when compared to our parallel CPU code with \(32\) threads. This work, combined with the authors’ previous works for the computation of the acoustic pressure over the obstacle surface, results in a valuable toolset for noise control applications during aircraft design.  相似文献   

6.
The high computation requirements of global optimization algorithms, when used to solve real optimization problems, have caused the appearance of different parallelization strategies using several parallel computing architectures. In this work, the Universal Evolutionary Global Optimizer is implemented in CUDA to be run on GPU architectures (GPuEGO). This parallelization of the referred evolutionary multimodal optimization algorithm is rather different from other previous parallel implementations designed to be executed into shared or distributed memory processors. In this case, due to the special characteristics of a GPU architecture, the original data structures are not valid and it has been necessary to redefine them and all the functions that operate with them. When this approach is applied the acceleration factors achieved by GPuEGO range from \({\times }\) 6.33 to \({\times }\) 23.20 depending on the test function.  相似文献   

7.
In the last years, graphics processing units (GPUs) witnessed ever growing applications for a wide range of computational analyses in the field of life sciences. Despite its large potentiality, GPU computing risks remaining a niche for specialists, due to the programming and optimization skills it requires. In this work we present cupSODA, a simulator of biological systems that exploits the remarkable memory bandwidth and computational capability of GPUs. cupSODA allows to efficiently execute in parallel large numbers of simulations, which are usually required to investigate the emergent dynamics of a given biological system under different conditions. cupSODA works by automatically deriving the system of ordinary differential equations from a reaction-based mechanistic model, defined according to the mass-action kinetics, and then exploiting the numerical integration algorithm, LSODA. We show that cupSODA can achieve a \(86 \times \) speedup on GPUs with respect to equivalent executions of LSODA on the CPU.  相似文献   

8.
Two-dimensional orthogonal matching pursuit (2D-OMP) algorithm is an extension of the one-dimensional OMP (1D-OMP), whose complexity and memory usage are lower than the 1D-OMP when they are applied to 2D sparse signal recovery. However, the major shortcoming of the 2D-OMP still resides in long computing time. To overcome this disadvantage, we develop a novel parallel design strategy of the 2D-OMP algorithm on a graphics processing unit (GPU) in this paper. We first analyze the complexity of the 2D-OMP and point out that the bottlenecks lie in matrix inverse and projection. After adopting the strategy of matrix inverse update whose performance is superior to traditional methods to reduce the complexity of original matrix inverse, projection becomes the most time-consuming module. Hence, a parallel matrix–matrix multiplication leveraging tiling algorithm strategy is launched to accelerate projection computation on GPU. Moreover, a fast matrix–vector multiplication, a parallel reduction algorithm, and some other parallel skills are also exploited to boost the performance of the 2D-OMP further on GPU. In the case of the sensing matrix of size 128 \(\times \) 256 (176 \(\times \) 256, resp.) for a 256 \(\times \) 256 scale image, experimental results show that the parallel 2D-OMP achieves 17 \(\times \) to 41 \(\times \) (24 \(\times \) to 62 \(\times \) , resp.) speedup over the original C code compiled with the O \(_2\) optimization option. Higher speedup would be further obtained with larger-size image recovery.  相似文献   

9.
Many-core systems are basically designed for applications having large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on Strassen and Winograd algorithms (S-MM and W-MM) on many-core. A depth first (DFS) traversal of a recursion tree is used where all cores work in parallel on computing each of the \(N \times N\) sub-matrices, which are computed in sequence. DFS reduces the storage to the detriment of large data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invoking efficient library (CUBLAS 5.5) for basic functions, and (3) using parameter-tuning of parametric kernel to improve resource occupancy. Evaluation of S-MM and W-MM is carried out on GPU and MIC (Xeon Phi). For GPU, W-MM and S-MM with one recursion level outperform CUBLAS 5.5 Library with up to twice as fast for arrays satisfying \(N \ge 2048\) and \(N \ge 3072\), respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to NVIDIA SDK library, S-MM and W-MM achieved a speedup between 20\(\times \) and 80\(\times \) for the above arrays. For MIC, two-recursion S-MM with reordering is faster than MKL library by 14–26 % for \(N \ge 1024\). Proposed implementations achieve 2.35 TFLOPS (67 % of peak) on GPU and 0.5 TFLOPS (21 % of peak) on MIC. Similar encouraging results are obtained for a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.  相似文献   

10.
Membrane computing is an emergent research area studying the behavior of living cells to define bio-inspired computing devices, also called P systems. Such devices provide polynomial time solutions to NP-complete problems by trading time for space. The efficient simulation of P systems poses three major challenging issues: an intrinsic massive parallelism of P systems, an exponential computational workspace, and a non-intensive floating point nature. This paper analyzes the simulation of a family of recognizer P systems with active membranes that solves the satisfiability problem in linear time on three different architectures: a shared memory multiprocessor, a distributed memory system, and a manycore graphics processing unit (GPU). For an efficient handling of the exponential workspace created by the P systems computation, we enable different data policies on those architectures to increase memory bandwidth and exploit data locality through tiling. Parallelism inherent to the target P system is also managed on each architecture to demonstrate that GPUs offer a valid alternative for high-performance computing at a considerably lower cost. Our results lead to execution time improvements exceeding 310 \(\times \) and 78 \(\times \) , respectively, for a much cheaper high-performance alternative.  相似文献   

11.
The computational complexity of Caputo fractional reaction–diffusion equation is \(O(MN^2)\) compared with \(O(MN)\) of traditional reaction–diffusion equation, where \(M\) , \(N\) are the number of time steps and grid points. A efficient parallel solution for Caputo fractional reaction–diffusion equation with explicit difference method is proposed. The parallel solution, which is implemented with MPI parallel programming model, consists of three procedures: preprocessing, parallel solver and postprocessing. The parallel solver involves the parallel tridiagonal matrix vector multiplication, vector vector addition and constant vector multiplication. The sum of constant vector multiplication is optimized. As to the authors’ knowledge, this is the first parallel solution for Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution compares well with the analytic solution. The parallel solution on single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on single X5540 CPU core, and scales quite well on a distributed memory cluster system.  相似文献   

12.
We investigate data parallel techniques for belief propagation in acyclic factor graphs on multi-core systems. Belief propagation is a key inference algorithm in factor graph, a probabilistic graphical model that has found applications in many domains. In this paper, we explore data parallelism for basic operations over the potential tables in belief propagation. Data parallel techniques for these table operations are developed for shared memory platforms. We then propose a complete belief propagation algorithm using these table operations to perform exact inference in factor graphs. The proposed algorithms are implemented on state-of-the-art multi-socket multi-core systems with additional NUMA-aware optimizations. Our proposed algorithms exhibit good scalability using a representative set of factor graphs. On a four-socket Intel Westmere-EX system with 40 cores, we achieve 39.5 $\times $ speedup for the table operations and 39 $\times $ speedup for the complete algorithm using factor graphs with large potential tables.  相似文献   

13.
14.
15.
Bézier surfaces are mathematical tools employed in a wide variety of applications. Some works in the literature propose parallelization strategies to improve performance for the computation of Bézier surfaces. These approaches, however, are mainly focused on graphics applications and often are not directly applicable to other domains. In this work, we propose a new method for the computation of Bézier surfaces, together with approaches to efficiently map the method onto different platforms (CPUs, discrete and integrated GPUs). Additionally, we explore CPU–GPU cooperation mechanisms for computing Bézier surfaces using two integrated heterogeneous systems with different characteristics. An exhaustive performance evaluation—including different data-types, rendering and several hardware platforms—is performed. The results show that our method achieves speedups as high as 3.12x (double-precision) and 2.47x (single-precision) on CPU, and 3.69x (double-precision) and 13.14x (single-precision) on GPU compared to other methods in the literature. In heterogeneous platforms, the CPU–GPU cooperation increases the performance up to 2.09x with respect to the GPU-only version. Our method and the associated parallelization approaches can be easily employed in domains other than computer-graphics (e.g., image registration, bio-mechanical modeling and flow simulation), and extended to other Bézier formulations and Bézier constructions of higher order than surfaces.  相似文献   

16.
In this paper we describe the libMesh (http://libmesh.sourceforge.net) framework for parallel adaptive finite element applications. libMesh is an open-source software library that has been developed to facilitate serial and parallel simulation of multiscale, multiphysics applications using adaptive mesh refinement and coarsening strategies. The main software development is being carried out in the CFDLab (http://cfdlab.ae.utexas.edu) at the University of Texas, but as with other open-source software projects; contributions are being made elsewhere in the US and abroad. The main goals of this article are: (1) to provide a basic reference source that describes libMesh and the underlying philosophy and software design approach; (2) to give sufficient detail and references on the adaptive mesh refinement and coarsening (AMR/C) scheme for applications analysts and developers; and (3) to describe the parallel implementation and data structures with supporting discussion of domain decomposition, message passing, and details related to dynamic repartitioning for parallel AMR/C. Other aspects related to C++ programming paradigms, reusability for diverse applications, adaptive modeling, physics-independent error indicators, and similar concepts are briefly discussed. Finally, results from some applications using the library are presented and areas of future research are discussed.  相似文献   

17.
We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into chunks, match chunks in parallel, and combine the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. Unlike previous approaches, our speculation is failure-free, i.e., (1) sequential semantics are maintained, and (2) speed-downs are avoided altogether. On architectures with a SIMD gather-operation for indexed memory loads, our matching operation is fully vectorized. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures such as cloud computing environments. We evaluated our speculative DFA membership test for a representative set of benchmarks from the Perl-compatible Regular Expression (PCRE) library and the PROSITE protein database. Evaluation was conducted on a 4 CPU (40 cores) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288 cores) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order of $\mathcal O \left( 1+\frac{|P|-1}{|Q|\cdot \gamma }\right) $ , where $|P|$ denotes the number of processors or SIMD units, $|Q|$ denotes the number of DFA states, and $0<\gamma \le 1$ represents a statically computed DFA property. For all observed cases, we found that $0.02<\gamma <0.47$ . Actual speedups range from 2.3 $\times $ to 38.8 $\times $ for up to 512 DFA states for PCRE, and between 1.3 $\times $ and 19.9 $\times $ for up to 1,288 DFA states for PROSITE on a 40-core MTL node. Speedups on the EC2 computing cloud range from 5.0 $\times $ to 65.8 $\times $ for PCRE, and from 5.0 $\times $ to 138.5 $\times $ for PROSITE. Speedups of our C-based DFA matcher over the Perl-based ScanProsite scan tool range from 559.3 $\times $ to 15079.7 $\times $ on a 40-core MTL node. We show the scalability of our approach for input-sizes of up to 10 GB.  相似文献   

18.
In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (Iss) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (Jit) dynamic binary translation (Dbt). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (Isa). We have evaluated our parallel simulation methodology against the industry standard Splash-2 and Eembc MultiBench benchmarks and demonstrate simulation speeds up to 25,307 Mips on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations.  相似文献   

19.
基于CUDA的并行布谷鸟搜索算法设计与实现   总被引:1,自引:0,他引:1  
布谷鸟搜索(cuckoo search,CS)算法是近几年发展起来的智能元启发式算法,已经被成功应用于多种优化问题中。针对CS算法在求解大数据、大规模复杂问题时,计算时间过长的问题,提出了一种基于统一计算设备架构(compute unified device architecture,CUDA)的并行布谷鸟搜索算法。该算法的并行实现采用任务并行与数据并行相结合的方式,利用图形处理器(graphic processing unit,GPU)线程块与线程分别映射布谷鸟个体与个体的每一维数据,并行实现CS算法中的鸟巢位置更新、个体适应度评估、鸟巢重建、寻找最优个体操作。整个CS算法的寻优迭代过程完全通过GPU实现,降低了算法计算过程中CPU与GPU的通信开销。对4个经典基准测试函数进行了仿真实验,结果表明,相比标准CS算法,基于CUDA架构的并行CS算法在求解收敛性一致的前提下,在求解速度上获得了高达110倍的计算加速比。  相似文献   

20.
Unstructured meshes are used in many engineering applications with irregular domains, from elastic deformation problems to crack propagation to fluid flow. Because of their complexity and dynamic behavior, the development of scalable parallel software for these applications is challenging. The Charm++ Parallel Framework for Unstructured Meshes allows one to write parallel programs that operate on unstructured meshes with only minimal knowledge of parallel computing, while making it possible to achieve excellent scalability even for complex applications. Charm++’s message-driven model enables computation/communication overlap, while its run-time load balancing capabilities make it possible to react to the changes in computational load that occur in dynamic physics applications. The framework is highly flexible and has been enhanced with numerous capabilities for the manipulation of unstructured meshes, such as parallel mesh adaptivity and collision detection.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号