1.

OpenCL-based optimization methods for utilizing forward DCT and quantization of image compression on a heterogeneous platform

Nasser Alqudami Shin-Dug Kim 《Journal of Real-Time Image Processing》2016,12(2):219-235

Recent computer systems and handheld devices are equipped with high computing capability, such as general purpose GPUs (GPGPU) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for the real-time aspect. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform DCT and quantization computations. We demonstrate the capability of this design via two proposed working scenarios. The proposed approach also applies certain optimization techniques to improve the kernel execution time and data movements. We developed an optimal OpenCL kernel for a particular device using device-based optimization factors, such as thread granularity, work-items mapping, workload allocation, and vector-based memory access. We evaluated the performance in a heterogeneous environment, finding that the proposed parallel implementation was able to speed up the execution time of the DCT and quantization by factors of 7.97 and 8.65, respectively, obtained from 1024 × 1024 and 2084 × 2048 image sizes in 4:4:4 format.
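As a rough illustration of the two operations this abstract names, the following is a minimal sequential sketch of a forward 8 × 8 DCT followed by quantization. It is not the paper's OpenCL kernel: the function names and the flat quantization table are assumptions made for the example, and a real JPEG encoder would use the standard's tables and a fast DCT.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence (direct O(n^2) form, not a fast DCT)."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [[round(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]

# Hypothetical flat table; JPEG defines its own luminance/chrominance tables.
FLAT_TABLE = [[16] * 8 for _ in range(8)]
```

Each 8 × 8 block is independent of the others, which is what makes this stage a natural fit for mapping blocks (or rows within a block) onto parallel work-items.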

2.

Portable real-time DCT-based steganography using OpenCL

Ante Poljicak Guillermo Botella Carlos Garcia Luka Kedmenec Manuel Prieto-Matias 《Journal of Real-Time Image Processing》2018,14(1):87-99

In this paper, a steganographic method for real-time data hiding is proposed. The main goal of the research is to develop a steganographic method with increased robustness to unintentional image processing attacks. In addition, we prove the validity of the method in real-time applications. The method is based on a discrete cosine transform (DCT) where the values of the DCT coefficients are modified in order to hide data. This modification is invisible to a human observer. We extend the investigation by implementing the proposed method on different target architectures and analyzing their performance. Results show that the proposed method is very robust to image compression, scaling and blurring. In addition, modification of the image is imperceptible even though the number of embedded bits is high. The steganalysis of the method shows that the detection of the modification of the image is unreliable for a lower relative payload size embedded. Analysis of the OpenCL implementation of the proposed method on four different target architectures shows considerable speedups.

3.

SkelCL: a high-level extension of OpenCL for multi-GPU systems

Michel Steuwer Sergei Gorlatch 《The Journal of supercomputing》2014,69(1):25-33

Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL, a high-level programming approach for systems with multiple GPUs, and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report experimental evaluation of our approach in terms of programming effort and performance.

4.

Optimized OpenCL implementation of the Elastodynamic Finite Integration Technique for viscoelastic media

M. Molero-Armenta Ursula Iturrarán-Viveros S. Aparicio M.G. Hernández 《Computer Physics Communications》2014

Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To overcome this limitation we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved once global memory access latency is eliminated by using local memory. Our main contribution is the implementation of local memory and an analysis of the performance of local versus global memory using eight different computing devices (including Kepler, one of the fastest and most efficient high performance computing technologies) with various operating systems. The full implementation of the code is included.

5.

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer Malte Friese Sebastian Albers Sergei Gorlatch 《International journal of parallel programming》2014,42(4):601-618

Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms, rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs) whose programming is complex and error-prone, because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications, ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written optimized OpenCL code.
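The allpairs pattern described here can be sketched sequentially in a few lines. This is only an illustration of the pattern's semantics, not SkelCL's API: the names `allpairs`, `dot`, and `matmul` are assumptions, and the sketch assumes the skeleton applies a customizing function to every pair of rows taken from two matrices.

```python
def allpairs(f, a, b):
    """Apply f to every (row of a, row of b) pair: c[i][j] = f(a[i], b[j])."""
    return [[f(row_a, row_b) for row_b in b] for row_a in a]

def dot(x, y):
    """Customizing function: dot product of two equal-length rows."""
    return sum(p * q for p, q in zip(x, y))

def matmul(a, b):
    """Matrix multiplication as allpairs over rows of a and rows of b transposed."""
    b_t = [list(col) for col in zip(*b)]
    return allpairs(dot, a, b_t)
```

Because every output element depends only on one row from each input, all pairs can be evaluated independently, which is the property a parallel implementation exploits.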

6.

Application analysis of a fast image distortion correction method using OpenCL on heterogeneous platforms

韦博文李涛李广宇汪致恒何沐师悦龄刘路遥张瑞《计算机科学》2016,43(Z11):167-169, 196

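To address the increasingly prominent problems of low processing efficiency and computational bottlenecks in applications involving massive remote sensing data, this work uses OpenCL parallel processing on general-purpose graphics processing units to accelerate remote sensing data processing, aiming to provide a more efficient route for big-data processing of remote sensing imagery. Image distortion correction was parallelized on different graphics card platforms. The results show that OpenCL markedly improves the speed of image distortion correction, reaching a maximum speedup of 29.1 times; cross-validation against CUDA parallel processing further highlights the portability advantage of OpenCL when performing parallel processing on heterogeneous platforms.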

7.

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

《Microprocessors and Microsystems》2020

The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity that breaks down kernels into a separate data-path and memory-path (memory read/write), which work concurrently to overlap the computation of current threads with the memory access of future threads (memory pre-fetching at large scale). At the same time, this paper proposes an LLVM-based static analysis to detect the decouplable data for resolving the data dependency and maximizing concurrency across the kernels. The experimental results on eight Rodinia benchmarks on an Intel Stratix V FPGA demonstrate significant performance and energy improvement over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup, with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces the overall energy consumption by more than 40%.

8.

A Parallel Algorithm for 4×4 DCT

《Journal of Parallel and Distributed Computing》1999,57(2):257-269

By developing a generalized 1D approach and parallel computing algorithm, this paper presents a parallel algorithm design and hardware implementation for the computation of 4×4 DCT. This algorithm sorts all the 2D input pixel data into four groups. Each group is then forwarded to a 1D DCT arithmetic unit to complete the computation. After a few simple additions which are designed to follow the output of 1D DCTs, the computation of 2D DCT is implemented in parallel. Therefore, the efficiency of the algorithm is entirely dependent on the 1D DCT algorithm adopted, and all the existing fast algorithms for 1D DCT can be directly applied to further optimise the algorithm design. The development can also be further extended to compute general 2D DCT by a recursive procedure where the 4×4 DCT algorithm is used as the basic core.

9.

A fast N-body algorithm under the parallel time-space processing model

王伟曾栩鸿王福焕傅丽丽曾国荪《计算机科学与探索》2011,5(11):1006-1013

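Recent advances in graphics processing units (GPUs) make it possible to deliver high-performance general-purpose computing at low cost. The GPU programming models CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) give programmers ample C-like application programming interfaces (APIs) with which to exploit the parallel computing power of GPUs. Using graphics hardware for accelerated computation, this paper analyzes existing N-body implementations on GPUs through a new GPU processing model, the parallel time-space model, and on that basis proposes a new algorithm for fast simulation of the N-body problem on GPUs, implemented on an AMD HD Radeon 5850. Experimental results show a speedup of about 400 times over a CPU implementation and of 2 to 5 times over existing GPU implementations.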

10.

Graphics Processing Units and Open Computing Language for parallel computing

Kyrylo Perelygin Shui Lam Xiaolong Wu 《Computers & Electrical Engineering》2014

Graphics Processing Units (GPUs) have become increasingly powerful over the last decade. Programs taking advantage of this architecture can achieve large performance gains and almost all new solutions and initiatives in high performance computing are aimed in that direction. To write programs that can offload the computation onto the GPU and utilize its power, new technologies are needed. The recent introduction of Open Computing Language (OpenCL), a standard for cross-platform, parallel programming of modern processors, has made a step in the right direction. Code written with OpenCL can run on a wide variety of platforms, adapting to the underlying architecture. It is versatile yet easy to learn due to similarities with the C programming language. In this paper, we will review the current state of the art in the use of GPUs and OpenCL for parallel computations. We use an implementation of the n-body simulation to illustrate some important considerations in developing OpenCL programs.
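For readers unfamiliar with the n-body workload mentioned here, a direct-summation sketch is shown below. It is a plain sequential illustration under stated assumptions (gravitational interaction, unit gravitational constant by default, an optional softening term, and the hypothetical function name `accelerations`), not the implementation from the paper.

```python
def accelerations(positions, masses, g=1.0, softening=1e-9):
    """All-pairs gravitational acceleration on each body, O(n^2) direct sum.

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening is added to the squared distance to avoid division by zero.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):          # in a GPU version, one work-item per body i
        for j in range(n):
            if i == j:
                continue
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += g * masses[j] * d[k] * inv_r3
    return acc
```

The outer loop has no dependencies between iterations, so each body's acceleration can be computed by a separate thread; the inner loop is where memory access patterns typically decide performance.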

11.

An OpenCL-based point cloud segmentation method

范昱伶王美丽何东健《计算机工程与应用》2018,54(1):191-195

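Point cloud segmentation is one of the key techniques for model reconstruction in reverse engineering, but computing point cloud features is very time-consuming, so accelerating it with OpenCL heterogeneous computing is of real practical value. Taking scattered, unordered point clouds as the object of study, the point cloud segmentation algorithm is improved using OpenCL. The algorithm consists mainly of three steps: parallel computation of the eigenvalues of the point cloud data, and parallel computation of its normal vectors and curvature. In the computation, the data storage structure is optimized according to the parallel structure and hardware characteristics of the GPU, which improves data access efficiency and reduces algorithmic complexity. Experimental results show that the algorithm makes full use of OpenCL's parallel processing capability and runs 16 times as fast as a CPU-based implementation.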

12.

Fast biological sequence alignment based on GPGPU (total citations: 1; self-citations: 0; citations by others: 1)

马海晨韦刚吴百蜂《计算机工程》2012,38(4):241-244

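An efficient biological sequence alignment scheme is proposed for a heterogeneous CPU-GPU platform. The scheme exploits the parallel processing capability of the GPU and, by optimizing read latency, write latency, the recombination function, and data transfer, restructures the Smith-Waterman algorithm within the OpenCL framework to speed up biological sequence alignment. Experimental results show that, compared with the traditional serial algorithm on a CPU, the algorithm achieves a performance improvement of up to about 100 times.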
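As background for the Smith-Waterman algorithm named in this abstract, the following is a minimal serial sketch of its scoring recurrence. The scoring values (match 2, mismatch -1, linear gap -1) and the function name are assumptions chosen for the example; the paper's scoring scheme and its OpenCL restructuring are not reproduced here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    h[i][j] is the best score of any local alignment ending at a[i-1], b[j-1];
    clamping at 0 lets an alignment restart anywhere.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,     # gap in b
                          h[i][j - 1] + gap)     # gap in a
            best = max(best, h[i][j])
    return best
```

Each cell depends on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; that is the usual source of parallelism in GPU versions of this algorithm.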

13.

Parallel implementation and optimization of high definition video real-time dehazing

Huailiang Tan Xiaofei He Zijian Wang Gaoming Liu 《Multimedia Tools and Applications》2017,76(22):23413-23434

In some warning applications, such as aircraft take-off and landing, ship sailing, and traffic guidance in foggy weather, high definition (HD) and rapid dehazing of images and videos is increasingly necessary. Existing technologies for the dehazing of videos or images have not fully exploited the parallel computing capacity of modern multi-core CPUs and GPUs, which leads to long dehazing times or low video frame rates that cannot meet the real-time requirement. In this paper, we propose a parallel implementation and optimization method for the real-time dehazing of high definition videos based on a single image haze removal algorithm. Our optimization takes full advantage of the modern CPU+GPU architecture, which increases the parallelism of the algorithm, and greatly reduces the computational complexity and the execution time. The optimized OpenCL parallel implementation is integrated into FFmpeg as an independent module. The experimental results show that for a single image, the performance of the optimized OpenCL algorithm is improved by approximately 500% compared with the existing algorithm, and by approximately 153% over the basic OpenCL algorithm. The 1080p (1920 × 1080) high definition hazy video can also be processed at a real-time rate (more than 41 frames per second).

14.

A parallel DoG algorithm for feature point detection

朱超吴素萍《计算机工程与应用》2020,56(10):36-43

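Feature point detection is widely used in object recognition, tracking, 3D reconstruction, and related fields. Because feature point detection in 3D reconstruction algorithms is computationally heavy and time-consuming, the Difference-of-Gaussian (DoG) algorithm is improved and a parallel DoG algorithm for feature point detection is proposed. The parallel DoG feature point detection algorithm is designed and implemented for an OpenMP-based multi-core CPU environment and for CUDA- and OpenCL-based GPU environments. Comparative experiments on the hallFeng image set were run on different platforms. The results show that the OpenMP-based multi-core CPU algorithm scales well across cores, and that the CUDA- and OpenCL-based GPU algorithms achieve high speedups, up to 96.79, with good scalability across data sizes and platforms.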
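The Difference-of-Gaussian operation itself can be sketched on a 1D signal as below. This is a simplified illustration, not the paper's algorithm: it assumes a truncated Gaussian kernel of radius about 3σ, clamped borders, a scale ratio of 1.6, and hypothetical function names; image-based detectors apply the same idea in 2D across several scales.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized, truncated 1D Gaussian weights on [-radius, radius]."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def blur(signal, sigma):
    """Convolve the signal with a Gaussian, clamping indices at the borders."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):                       # each output sample is independent
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += weight * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma, k=1.6):
    """DoG response: blur at scale k*sigma minus blur at scale sigma."""
    narrow = blur(signal, sigma)
    wide = blur(signal, k * sigma)
    return [w - s for s, w in zip(narrow, wide)]
```

Every output sample is computed from a fixed neighbourhood of the input, so the convolution maps directly onto one thread per sample or per pixel.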

15.

OpenCL-based parallel implementation of automatic differentiation and its application

叶爱芬王环沈雁《计算机测量与控制》2019,27(5):155-159

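For large-scale optimization problems such as bundle adjustment, a parallel automatic differentiation scheme based on OpenCL is implemented. The more efficient reverse mode is adopted to compute derivatives of functions with many parameters. Within the OpenCL framework, the host builds the function in C/C++ form and generates the computation sequence by topological sorting, and the device evaluates the function values and derivatives in parallel following that sequence. Tests show that, when the implemented automatic differentiation is applied to computing the Jacobian matrix in bundle adjustment, it runs about 3.6 times faster than Ceres Solver using OpenMP.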
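The combination of reverse-mode differentiation and a topologically sorted computation sequence can be illustrated with a small sketch. It is a generic textbook-style example, not the paper's C/C++ and OpenCL design: the `Var` class, `topo_order`, and `backward` are hypothetical names, and only addition and multiplication are supported.

```python
class Var:
    """A node in the computation graph holding a value and its gradient."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def topo_order(root):
    """Depth-first topological sort: every node appears after its parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order

def backward(root):
    """Propagate gradients from the output back to the inputs."""
    root.grad = 1.0
    for node in reversed(topo_order(root)):
        for parent, local in node.parents:
            parent.grad += local * node.grad
```

For example, with `x = Var(3.0)`, `y = Var(4.0)` and `z = x * y + x`, calling `backward(z)` accumulates the partial derivatives of `z` into `x.grad` and `y.grad` in a single reverse sweep, which is why reverse mode suits functions with many inputs and few outputs.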

16.

An application-centric evaluation of OpenCL on multi-core CPUs

Jie Shen Jianbin Fang Henk Sips Ana Lucia Varbanescu 《Parallel Computing》2013

Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large number of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential that these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze the method to systematically reuse and adapt the OpenCL code from GPUs to CPUs. We claim that this work is a necessary step for enabling inter-platform performance portability in OpenCL.

17.

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Jesús Jiménez Juan Ruiz de Miras 《The Journal of supercomputing》2013,65(3):1327-1352

In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according to the type of the target device and the results show an average speedup of up to 7.46× and 4×, when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation.
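The idea behind box counting can be shown with a short sequential sketch. It deliberately uses a hash set to find occupied boxes rather than the sort-based fast algorithm the abstract discusses, and the function names and the least-squares fit are assumptions made for the example; points are assumed to have non-negative integer coordinates.

```python
import math

def box_count(points, box_size):
    """Number of boxes of side box_size that contain at least one point."""
    return len({tuple(c // box_size for c in p) for p in points})

def fractal_dimension(points, sizes):
    """Slope of log(count) against log(1 / box_size), fitted by least squares."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(points, s)) for s in sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A filled square should come out with dimension close to 2 and a straight line close to 1, which gives a quick sanity check before applying the estimate to real image data.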

18.

Recursive algorithm, architectures and FPGA implementation of the two-dimensional discrete cosine transform

An S. Wang C. 《Image Processing, IET》2008,2(6):286-294

A new recursive algorithm and two types of circuit architectures are presented for the computation of the two-dimensional discrete cosine transform (2D DCT). The new algorithm permits computing the 2D DCT by a simple procedure of 1D recursive calculations involving only cosine coefficients. The recursive kernel for the proposed algorithm contains a small number of operations. Also, it requires a smaller number of pre-computed data compared with many existing algorithms in the same category. The kernel can be easily implemented in a simple circuit block with a short critical delay path. In order to evaluate the performance improvement resulting from the new algorithm, an architecture for the 2D DCT designed by direct mapping from the computation structure of the proposed algorithm has been implemented in an FPGA board. The results show that the reduction of the hardware consumption can easily reach 25% and the clock frequency can increase by 17% compared with a system implementing a recently reported 2D DCT recursive algorithm. For a further reduction of the hardware, another architecture has been proposed for the same 2D DCT computation. Using one recursive computation block to perform different functions, this architecture needs only approximately one-half of the hardware that is required in the first architecture, which has been confirmed by an FPGA implementation.

19.

A parallel face contour extraction algorithm for multi-core CPUs and GPUs based on the Chan-Vese model

王丽娜史晓华《计算机应用》2014,34(11):3121-3125

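To address the heavy computation and slow segmentation of the Chan-Vese model in face contour extraction, a parallel algorithm accelerated by graphics processing units (GPUs) and multi-core CPUs is proposed using the Open Computing Language (OpenCL) parallel programming model. The algorithm first restructures the framework of the model to eliminate its data dependencies, and then uses OpenCL to parallelize the algorithm and apply corresponding optimizations. Experimental results show that, compared with the single-threaded algorithm, high speedups are achieved on the NVIDIA GTX660 and AMD FX-8530.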

20.

Enabling OpenCL support for GPGPU in Kernel‐based Virtual Machine

Tsan‐Rong Tien Yi‐Ping You 《Software》2014,44(5):483-510

The importance of heterogeneous multicore programming is increasing, and Open Computing Language (OpenCL) is an open industrial standard for parallel programming that provides a uniform programming model for programmers to write efficient, portable code for heterogeneous computing devices. However, OpenCL is not supported in the system virtualization environments that are often used to improve resource utilization. In this paper, we propose an OpenCL virtualization framework based on Kernel‐based Virtual Machine with API remoting to enable multiplexing of multiple guest virtual machines (guest VMs) over the underlying OpenCL resources. The framework comprises three major components: (i) an OpenCL library implementation in guest VMs for packing/unpacking OpenCL requests/responses; (ii) a virtual device, called virtio‐CL, that is responsible for the communication between guest VMs and the hypervisor (also called the VM monitor); and (iii) a thread, called CL thread, that is used for the OpenCL API invocation. Although the overhead of the proposed virtualization framework is directly affected by the amount of data to be transferred between the OpenCL host and devices because of the primitive nature of API remoting, experiments demonstrated that our virtualization framework has a small virtualization overhead (mean of 6.8%) for six common device‐intensive OpenCL programs and performs well when the number of guest VMs involved in the system increases. These results indirectly imply that the framework allows for effective resource utilization of OpenCL devices. Copyright © 2012 John Wiley & Sons, Ltd.