期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

dOpenCL: Towards uniform programming of distributed heterogeneous multi-/many-core systems

Philipp Kegel Michel Steuwer Sergei Gorlatch 《Journal of Parallel and Distributed Computing》2013

Modern computer systems become increasingly distributed and heterogeneous by comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to use a combination of several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the system’s full performance potential. In this paper, we present dOpenCL (distributed OpenCL)—a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. We describe the challenges of implementing the OpenCL programming model for distributed systems, as well as its extension for running multiple applications concurrently. Using several example applications, we compare the performance of dOpenCL with MPI + OpenCL and standard OpenCL implementations. 相似文献

2.

Accelerating the SRP-PHAT algorithm on multi- and many-core platforms using OpenCL

Badía Jose M. Belloch Jose A. Cobos Maximo Igual Francisco D. Quintana-Ortí Enrique S. 《The Journal of supercomputing》2019,75(3):1284-1297

The Steered Response Power with Phase Transform (SRP-PHAT) algorithm is a well-known method for sound source localization due to its robust performance in noisy and reverberant environments. This algorithm is used in a large number of acoustic applications such as automatic camera steering systems, human–machine interaction, video gaming and audio surveillance. SPR-PHAT implementations require to handle a high number of signals coming from a microphone array and a huge search grid that influences the localization accuracy of the system. In this context, high performance in the localization process can only be achieved by using massively parallel computational resources. Different types of multi-core machines based either on multiple CPUs or on GPUs are commonly employed in diverse fields of science for accelerating a number of applications, mainly using OpenMP and CUDA as programming frameworks, respectively. This implies the development of multiple source codes which limits the portability and application possibilities. On the contrary, OpenCL has emerged as an open standard for parallel programming that is nowadays supported by a wide range of architectures. In this work, we evaluate an OpenCL-based implementations of the SRP-PHAT algorithm in two state-of-the-art CPU and GPU platforms. Results demonstrate that OpenCL achieves close-to-CUDA performance in GPU (considered as upper bound) and outperforms in most of the CPU configurations based on OpenMP.

相似文献

3.

An OpenCL micro-benchmark suite for GPUs and CPUs

Xin Yan Xiaohua Shi Lina Wang Haiyan Yang 《The Journal of supercomputing》2014,69(2):693-713

Open computing language (OpenCL) is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs. OpenCL is vendor independent and hence not specialized for any particular compute device. To develop efficient OpenCL applications for the particular platform, we still need a more profound understanding of architecture features on the OpenCL model and computing devices. For this purpose, we design and implement an OpenCL micro-benchmark suite for GPUs and CPUs. In this paper, we introduce the implementations of our OpenCL micro benchmarks, and present the measuring results of hardware and software features like performance of mathematical operations, bus bandwidths, memory architectures, branch synchronizations and scalability, etc., on two multi-core CPUs, i.e. AMD Athlon II X2 250 and Intel Pentium Dual-Core E5400, and two different GPUs, i.e. NVIDIA GeForce GTX 460se and AMD Radeon HD 6850. We also compared the measuring results with existing benchmarks to demonstrate the reasonableness and correctness of our benchmark suite. 相似文献

4.

Efficient Abstractions for GPGPU Programming

Mathias Bourgoin Emmanuel Chailloux Jean-Luc Lamotte 《International journal of parallel programming》2014,42(4):583-600

General purpose (GP)GPU programming demands to couple highly parallel computing units with classic CPUs to obtain a high performance. Heterogenous systems lead to complex designs combining multiple paradigms and programming languages to manage each hardware architecture. In this paper, we present tools to harness GPGPU programming through the high-level OCaml programming language. We describe the SPOC library that allows to handle GPGPU subprograms (kernels) and data transfers between devices. We then present how SPOC expresses GPGPU kernel: through interoperability with common low-level extensions (from Cuda and OpenCL frameworks) but also via an embedded DSL for OCaml. Using simple benchmarks as well as a real world HPC software, we show that SPOC can offer a high performance while efficiently easing development. To allow better abstractions over tasks and data, we introduce some parallel skeletons built upon SPOC as well as composition constructs over those skeletons. 相似文献

5.

Parallel computing of 3D smoking simulation based on OpenCL heterogeneous platform

Zhiyong Yuan Weixin Si Xiangyun Liao Zhaoliang Duan Yihua Ding Jianhui Zhao 《The Journal of supercomputing》2012,61(1):84-102

Open Computing Language (OpenCL) is an open royalty-free standard for general purpose parallel programming across Central Processing Units (CPUs), Graphic Processing Units (GPUs) and other processors. This paper introduces OpenCL to implement real-time smoking simulation in a virtual surgery training simulation system. Firstly, the Computational Fluid Dynamics (CFD) is adopted to construct the real-time smoking simulation model based on the Navier?CStokes (N-S) equations of an incompressible fluid under the condition of normal temperature and pressure. Then we propose a parallel computing technique based on OpenCL to accomplish the parallel computing of smoking simulation model on CPU and GPU, respectively. Finally, we render the smoke in real time by using a three-dimensional (3D) texture volume rendering method. Experimental results show that the parallel computing technique we have proposed achieve a satisfactory effect on image quality and rendering rate both on CPU and GPU. 相似文献

6.

SkelCL: a high-level extension of OpenCL for multi-GPU systems

Michel Steuwer Sergei Gorlatch 《The Journal of supercomputing》2014,69(1):25-33

Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL—a high-level programming approach for systems with multiple GPUs and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report experimental evaluation of our approach in terms of programming effort and performance. 相似文献

7.

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer Malte Friese Sebastian Albers Sergei Gorlatch 《International journal of parallel programming》2014,42(4):601-618

Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms, rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs) whose programming is complex and error-prone, because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications, ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU Systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90 % of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher as compared with the generic implementation and is competitive to the performance of a manually written optimized OpenCL code. 相似文献

8.

A uniform approach for programming distributed heterogeneous computing systems

《Journal of Parallel and Distributed Computing》2014,74(12):3228-3239

Large-scale compute clusters of heterogeneous nodes equipped with multi-core CPUs and GPUs are getting increasingly popular in the scientific community. However, such systems require a combination of different programming paradigms making application development very challenging.In this article we introduce libWater, a library-based extension of the OpenCL programming model that simplifies the development of heterogeneous distributed applications. libWater consists of a simple interface, which is a transparent abstraction of the underlying distributed architecture, offering advanced features such as inter-context and inter-node device synchronization. It provides a runtime system which tracks dependency information enforced by event synchronization to dynamically build a DAG of commands, on which we automatically apply two optimizations: collective communication pattern detection and device-host-device copy removal.We assess libWater’s performance in three compute clusters available from the Vienna Scientific Cluster, the Barcelona Supercomputing Center and the University of Innsbruck, demonstrating improved performance and scaling with different test applications and configurations. 相似文献

9.

Three‐dimensional thinning algorithms on graphics processing units and multicore CPUs

J. Jimnez J. Ruiz de Miras 《Concurrency and Computation》2012,24(14):1551-1571

Three‐dimensional curve skeletons are a very compact representation of three‐dimensional objects with many uses and applications in ﬁelds such as computer graphics, computer vision, and medical imaging. An important problem is that the calculation of the skeleton is a very time‐consuming process. Thinning is a widely used technique for calculating the curve skeleton because of the properties it ensures and the ease of implementation. In this paper, we present parallel versions of a thinning algorithm for eﬃcient implementation in both graphics processing units and multicore CPUs. The parallel programming models used in our implementations are Compute Uniﬁed Device Architecture (CUDA) and Open Computing Language (OpenCL). The speedup achieved with the optimized parallel algorithms for the graphics processing unit achieves 106.24x against the CPU single‐process version and more than 19x over the CPU multithreaded version. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

10.

Vector data flow analysis for SIMD optimizations on OpenCL programs

Yu‐Te Lin Jenq‐Kuen Lee 《Concurrency and Computation》2016,28(5):1629-1654

Multi‐core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever‐increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single‐instruction multiple‐data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adopted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on advanced micro devices GPUs. For the selected 15 benchmarks, 11 of them are improved on x86 CPUs, and six of them are improved on advanced micro devices GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

11.

Providing Source Code Level Portability Between CPU and GPU with MapCG

下载免费PDF全文

洪春涛陈德颢陈羽北陈文光郑纬民林海波《计算机科学技术学报》2012,27(1):42-56

Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years.At present,the common approach to programming GPU units is to write GPU specific cod... 相似文献

12.

Optimized OpenCL implementation of the Elastodynamic Finite Integration Technique for viscoelastic media

M. Molero-Armenta Ursula Iturrarán-Viveros S. Aparicio M.G. Hernández 《Computer Physics Communications》2014

Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To overcome this limitation we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model which has the gain of not requiring additional field variables. OpenCL performance can be in principle, improved once one can eliminate global memory access latency by using local memory. Our main contribution is the implementation of local memory and an analysis of performance of the local versus the global memory using eight different computing devices (including Kepler, one of the fastest and most efficient high performance computing technology) with various operating systems. The full implementation of the code is included. 相似文献

13.

Graphics Processing Units and Open Computing Language for parallel computing

Kyrylo PerelyginAuthor Vitae Shui LamXiaolong WuAuthor Vitae 《Computers & Electrical Engineering》2014

Graphics Processing Units (GPUs) have become increasingly powerful over the last decade. Programs taking advantage of this architecture can achieve large performance gains and almost all new solutions and initiatives in high performance computing are aimed in that direction. To write programs that can offload the computation onto the GPU and utilize its power, new technologies are needed. The recent introduction of Open Computing Language (OpenCL), a standard for cross-platform, parallel programming of modern processors, has made a step in the right direction. Code written with OpenCL can run on a wide variety of platforms, adapting to the underlying architecture. It is versatile yet easy to learn due to similarities with the C programming language. In this paper, we will review the current state of the art in the use of GPUs and OpenCL for parallel computations. We use an implementation of the n-body simulation to illustrate some important considerations in developing OpenCL programs. 相似文献

14.

基于Julia语言的并行计算方法初探

巩庆奎张常有张先轶张云泉《计算机科学》2015,42(1):44-46

Julia语言是一种在MIT许可证下免费的开发中脚本语言(beta 0.2.0),目标是降低并行程序的编程难度.基于Julia现有语法机制,逐步增强Julia语法特性,结合公交线路的平均走行时间统计案例,研究Julia并行编程框架和程序逐步精化的方法.Julia程序支持本地多核心/多CPU并行计算.为充分发挥实验平台的计算潜能,尝试了提高Julia程序计算性能的策略.对案例程序的实验分析表明,Julia并行程序在管理计算核心方面耗费了一定的工作时间,但随着问题规模的增大,其影响可逐渐忽略,从而可获得接近线性的加速比. 相似文献

15.

An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs

Cathal Ó Broin L.A.A. Nikolopoulos 《Computer Physics Communications》2012,183(10):2071-2080

Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have applied the code in the case of the Time-Dependent Schrödinger Equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device. The speedup in the benchmark tended towards a value of about 40 with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we also have achieved speedup values of around 75 by performing a slight optimization of the described algorithm. 相似文献

16.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores. 相似文献

17.

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Jesús Jiménez Juan Ruiz de Miras 《The Journal of supercomputing》2013,65(3):1327-1352

In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according the type of the target device and the results show an average speedup of up to 7.46× and 4×, when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation. 相似文献

18.

在Intel Knights Corner和NVIDIA Kepler架构上OpenACC的性能可移植性分析

王一超秦强施忠伟林新华《计算机科学》2015,42(1):75-78

OpenACC是一套基于指导语句方式的并行编程语言标准.编程者可以通过在代码中添加符合该标准的指导语句,经OpenACC编译器的编译,将串行代码并行化地移植到加速器或者协处理器上,进而获得异构加速器所带来的加速效果.OpenACC与CUDA和OpenCL这类异构并行编程技术的不同之处在于,它的目的是使编程者在应用移植过程中不需要考虑加速器或协处理器的底层硬件架构,从而降低编程难度.同时它也具有仅需维护一套代码便可在不同硬件平台上运行的优良跨平台性.因此,OpenACC是一个值得研究的并行编程标准.如今的异构加速硬件设备呈现出多元化趋势.在2013年11月的Top500榜单上排名第一的“天河二号”使用了48000块构建在IntelKnights Corner架构之上的协处理器.与此同时,发布不久的NVIDIA公司最新的Kepler架构GPU产品由于多年来的GPU市场积累也迅速形成了可观的用户群体.对于并非追求性能极限的应用移植者而言,寻求应用性能和移植简易性之间的平衡是相当重要的议题.只需要编写一套代码便可运行在这两种硬件平台上的OpenACC正迎合了用户在移植简易性上的需求.解决了移植的简易性之后,同一个应用在不同硬件平台上的性能表现便成了用户最想了解的问题.通过实验和构建性能模型向读者展示使用OpenACC移植的应用在Intel Knights Corner和NVIDIA Kepler架构硬件上的性能可移植性. 相似文献

19.

MPtostream:an OpenMP compiler for CPU-GPU heterogeneous parallel systems

《中国科学:信息科学(英文版)》2012,(9):1961-1971

In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone. 相似文献

20.

Swan: A tool for porting CUDA programs to OpenCL 总被引：1，自引：0，他引：1

M.J. Harvey G. De Fabritiis 《Computer Physics Communications》2011,(4):1093-1099

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: SwanCatalogue identifier: AEIH_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: GNU Public License version 2No. of lines in distributed program, including test data, etc.: 17 736No. of bytes in distributed program, including test data, etc.: 131 177Distribution format: tar.gzProgramming language: CComputer: PCOperating system: LinuxRAM: 256 MbytesClassification: 6.5External routines: NVIDIA CUDA, OpenCLNature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore ×86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.Solution method:Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.Restrictions: No support for CUDA C++ featuresRunning time: Nominal 相似文献