期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Highly parallel GEMV with register blocking method on GPU architecture

《Journal of Visual Communication and Image Representation》2014,25(7):1566-1573

GPUs can provide powerful computing ability especially for data parallel applications, such as video/image processing applications. However, the complexity of GPU system makes the optimization of even a simple algorithm difficult. Different optimization methods on a GPU often lead to different performances. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS or MAGMA are not efficient, especially for small or fat matrix. In this paper, we propose a novel register blocking method to optimize GEMV on GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of vector y so that the method can exploit the highly parallel GPU architecture. Second, the register blocking method is used to reduce the requirement of off-chip memory bandwidth. At last, the memory access order is elaborately arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes. The performance of the register blocking method with different block sizes is also evaluated in the experiment. Experiment results show that the new method can achieve very high speedup for small square matrices and fat matrices compared to CUBLAS or MAGMA, and can also achieve higher performance for large square matrices. 相似文献

2.

Efficient GPU and CPU-based LDPC decoders for long codewords

Stefan Gr?nroos Kristian Nybom Jerker Bj?rkqvist 《Analog Integrated Circuits and Signal Processing》2012,73(2):583-595

The next generation DVB-T2, DVB-S2, and DVB-C2 standards for digital television broadcasting specify the use of low-density parity-check (LDPC) codes with codeword lengths of up to 64800 bits. The real-time decoding of these codes on general purpose computing hardware is useful for completely software defined receivers, as well as for testing and simulation purposes. Modern graphics processing units (GPUs) are capable of massively parallel computation, and can in some cases, given carefully designed algorithms, outperform general purpose CPUs (central processing units) by an order of magnitude or more. The main problem in decoding LDPC codes on GPU hardware is that LDPC decoding generates irregular memory accesses, which tend to carry heavy performance penalties (in terms of efficiency) on GPUs. Memory accesses can be efficiently parallelized by decoding several codewords in parallel, as well as by using appropriate data structures. In this article we present the algorithms and data structures used to make log-domain decoding of the long LDPC codes specified by the DVB-T2 standard??at the high data rates required for television broadcasting??possible on a modern GPU. Furthermore, we also describe a similar decoder implemented on a general purpose CPU, and show that high performance LDPC decoders are also possible on modern multi-core CPUs. 相似文献

3.

SURF算法在通用GPU和OpenCL的优化与实现

王艳梅史晓华于湛麟《电子测试》2013,(12):51-55,42

Speeded Up Robust Feature（SURF）算法是在计算机视觉领域得到广泛应用的一种图像兴趣点检测和匹配方法。开放计算语言（OpenCL）提供了一个在异构体系结构上,包括GPU,CPU及其他类型处理器,编写并行程序的框架。本文介绍了如何在通用GPU和OpenCL平台上,对SURF算法进行优化与实现。本文对其中一些优化方法,例如kernel线程的配置,局部内存的使用方法等,进行了详细的对比和讨论。最终实现的OpenCL版本的算法在NVidiaGTX260平台上获得了比原始的CPU版本在IntelDual—CoreE54002．7G处理器上至少21倍的加速。相似文献

4.

Code and Data Placement for Embedded Processors with Scratchpad and Cache Memories

Yuriko Ishitobi Tohru Ishihara Hiroto Yasuura 《Journal of Signal Processing Systems》2010,60(2):211-224

This paper proposes a code placement problem, its ILP formulation, and a heuristic algorithm for reducing the total energy consumption of embedded processor systems including a CPU core, on-chip and off-chip memories. Our approach exploits a non-cacheable memory region for an effective use of a cache memory and as a result, reduces the number of off-chip accesses. Our algorithm simultaneously finds a code layout for a cacheable region, a scratchpad region, and the other non-cacheable region of the address space so as to minimize the total energy consumption of the processor system. Experiments using a commercial embedded processor and an off-chip SDRAM demonstrate that our algorithm reduces the energy consumption of the processor system by 23% without any performance degradation compared to the best result achieved by the conventional approach. 相似文献

5.

Implementation of a High Throughput 3GPP Turbo Decoder on GPU

Michael Wu Yang Sun Guohui Wang Joseph R. Cavallaro 《Journal of Signal Processing Systems》2011,65(2):171-183

Turbo code is a computationally intensive channel code that is widely used in current and upcoming wireless standards. General-purpose graphics processor unit (GPGPU) is a programmable commodity processor that achieves high performance computation power by using many simple cores. In this paper, we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of GPU to offer fast Turbo decoding throughput. Several techniques are used to improve the performance of the decoder. To fully utilize the computational resources on GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent multiple threads while keeping frequently used data local to keep memory access fast. To improve efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error correction performance and the decoder throughput. 相似文献

6.

TMS320C6X的SPLOOP技术

下载免费PDF全文

方志红常越《雷达科学与技术》2014,12(4):437-440

软件流水是一种实现循环迭代中指令级并行的指令调度技术。它可以克服多周期指令延迟对CPU处理性能的影响,保证循环核的运行效率最优。从C64X+开始,TMS320C6X系列DSP引入SPLOOP技术,软件上增加SPLOOP(D/W)、SPKERNEL等相关指令,硬件上增加软件流水缓存等专用模块,通过模调度软件流水模式,有效缩小了软件代码量,提升了执行代码效率。一般情况下,采用SPLOOP技术后机器编译输出的循环代码质量很高,编程人员无需再对代码进行进一步的手工优化。相似文献

7.

GPU Computing 总被引：9，自引：0，他引：9

Owens J.D. Houston M. Luebke D. Green S. Stone J.E. Phillips J.C. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2008,96(5):879-899

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications. 相似文献

8.

Accurate and efficient GPU ray‐casting algorithm for volume rendering of unstructured grid data

Gibeom Gu Duksu Kim 《ETRI Journal》2020,42(4):608-618

We present a novel GPU‐based ray‐casting algorithm for volume rendering of unstructured grid data. Our volume rendering system uses a ray‐casting method that guarantees accurate rendering results. We also employ the per‐pixel intersection list concept in the Bunyk algorithm to guarantee an accurate result for non‐convex meshes. For efficient memory access for the lists on the GPU, we represent the intersection lists for all faces as an array with our novel construction algorithm. With the intersection lists, we perform ray‐casting on a GPU, and a GPU thread handles each ray. To increase ray‐coherency in a thread block and improve memory access efficiency, we extend a prior image‐tile‐based work distribution method to fit modern GPU architectures. We also show that a prior approach using a per‐thread local buffer to reduce redundant computation is not appropriate for modern GPU architectures. Instead, we take an on‐demand calculation strategy that achieves better performance even though it allows duplicate computations. We applied our method to three unstructured grid datasets with different characteristics. With a GPU, our method achieved up to 36.5 times higher performance for the ray‐casting process and 19.7 times higher performance for the whole volume rendering process compared with the Bunyk algorithm using a CPU core. Also, our approach showed up to 8.2 times higher performance than a GPU‐based cell projection method while generating more accurate rendering results. These results demonstrate the efficiency and accuracy of our method. 相似文献

9.

Accelerated implementation of adaptive directional lifting-based discrete wavelet transform on GPU

Jiazhong Chen Zengwei Ju Cao Hua Bingpeng Ma Changnian Chen Leihua Qin Rong Li 《Signal Processing: Image Communication》2013,28(9):1202-1211

Because of the high data dependency between ADL (adaptive directional lifting) operations, such as interpolation, directional prediction and update, the existing CUDA-specific (Compute Unified Device Architecture) implementation of traditional rectilinear lifting-based transform is difficult to be used for ADL-based transform. This paper proposes a novel CUDA-specific method named Slice for implementation of the ADL-based wavelet transforms on GPU (Graphics Processing Unit). Compared with the previous CUDA-specific methods the proposed method makes each step handled by a different kernel to avoid unnecessary waiting time between lifting steps. Meanwhile the interpolation and decomposition including prediction and update are executed in an interleaving style for each filtered pixel. Moreover, the coalesced memory accesses are exploited to the greatest extent by coalesced reading a slice of data to the shared memory and coalesced writing them back to the global memory after being processed. The results show that the Slice method overcomes the limitation of high data dependency between the lifting steps and achieves more than 10 times speedup compared to the optimized CPU implementation for the ADL-based transform. 相似文献

10.

GPU Acceleration of a Configurable N-Way MIMO Detector for Wireless Systems

Michael Wu Bei Yin Guohui Wang Christoph Studer Joseph R. Cavallaro 《Journal of Signal Processing Systems》2014,76(2):95-108

Multiple-input multiple-output (MIMO) wireless is an enabling technology for high spectral efficiency and has been adopted in many modern wireless communication standards, such as 3GPP-LTE and IEEE 802.11n. However, (optimal) maximum a-posteriori (MAP) detection suffers from excessively high computational complexity, which prevents its deployment in practical systems. Hence, many algorithms have been proposed in the literature that trade-off performance versus detection complexity. In this paper, we propose a flexible N-Way MIMO detector that achieves excellent error-rate performance and high throughput on graphics processing units (GPUs). The proposed detector includes the required QR decomposition step and a tree-search detector, which exploits the massive parallelism available in GPUs. The proposed algorithm performs multiple tree searches in parallel, which leads to excellent error-rate performance at low computational complexity on different GPU architectures, such as Nvidia Fermi and Kepler. We highlight the flexibility of the proposed detector and demonstrate that it achieves higher throughput than existing GPU-based MIMO detectors while achieving the same or better error-rate performance. 相似文献

11.

Support OpenCL 2.0 Compiler on LLVM for PTX Simulators

Yang Chun-Chieh Wang Shao-Chung Hsu Min-Yi Chang Yuan-Ming Hwang Yuan-Shin Lee Jenq-Kuen 《Journal of Signal Processing Systems》2019,91(3-4):261-271

Heterogeneous systems that consist of multiple CPUs and GPUs for high-performance computing are becoming increasingly popular, and OpenCL (Open Computing Language) provides a framework for writing programs that can be executed across heterogeneous devices. Compared with OpenCL 1.2, the new features of OpenCL 2.0 provide developers with better expressive power for programming heterogeneous computing environments. Currently, gem5-gpu, which includes gem5 and GPGPU-Sim, can offer an experimental simulation environment for OpenCL. In gem5-gpu, gem5 only supports CUDA, although GPGPU-Sim can support OpenCL by compiling an OpenCL kernel code to PTX code using real GPU drivers. However, this compilation flow in GPGPU-Sim can only support up to OpenCL 1.2. OpenCL 2.0 provides new features such as workgroup built-in functions, extended atomic built-in functions, and device-side enqueue. To support OpenCL 2.0, the compiler must be extended to enable the compilation of OpenCL 2.0 kernel code to PTX code. In this paper, the proposed compiler is modified from the low level virtual machine (LLVM) compiler to extend such features to enhance the emulator to support OpenCL 2.0. The proposed compiler creates local buffers for each workgroup to enable workgroup built-in functions and adds atomic built-in functions with memory order and memory scope for OpenCL 2.0 in NVPTX. Furthermore, the APIs available in CUDA are utilized to implement the OpenCL 2.0 device-side enqueue kernel and compilation schemes in Clang are revised. The AMD APP SDK 3.0 and NTU OpenCL benchmarks are used to verify that the proposed compiler can support the features of OpenCL 2.0.

相似文献

12.

一种基于Kepler架构GPU的通信仿真加速方法

下载免费PDF全文

韩秉君黄诗铭杜滢《电信科学》2015,31(10):82-88

提出了一种在 Kepler 架构 GPU（graphics processing unit,图形处理器）上利用 CUDA（compute unified device architecture,统一计算设备架构）技术加速通信仿真中DFT（discrete Fourier transform,离散傅里叶变换）处理过程的方法。该方法的核心思想是利用线程级并行技术实现单条收发链路内部DFT运算的并行加速,并利用动态并行和Hyper-Q技术实现不同收发用户对之间链路处理过程的并行加速,从而最终达到加速仿真中DFT处理过程的目的。实验结果表明,相对单核单线程CPU程序和上一代Fermi架构GPU程序,该方法分别能够将DFT处理速度提升300倍和3倍,具有较好的加速效果。相似文献

13.

复杂网格模型仿射变换的GPU加速计算

钟庆华顺刚《光电技术应用》2011,26(1):59-62

为提高计算速度,复杂网格模型操作处理中的仿射变换计算被移植到具有可编程能力的GPU上实现.在并行计算中,每个线程计算一个顶点的k邻域权重值和坐标变换,多个线程同时执行.经过对线程结构安排、设备存储器分配等的优化,充分发挥GPU并行运算性能.实验结果表明,GPU加速计算对大规模网格顶点仿射变换的处理得到了较好的加速效果. 相似文献

14.

Large-Scale Pairwise Alignments on GPU Clusters: Exploring the Implementation Space

Huan Truong Da Li Kittisak Sajjapongse Gavin Conant Michela Becchi 《Journal of Signal Processing Systems》2014,77(1-2):131-149

Several problems in computational biology require the all-against-all pairwise comparisons of tens of thousands of individual biological sequences. Each such comparison can be performed with the well-known Needleman-Wunsch alignment algorithm. However, with the rapid growth of biological databases, performing all possible comparisons with this algorithm in serial becomes extremely time-consuming. The massive computational power of graphics processing units (GPUs) makes them an appealing choice for accelerating these computations. As such, CPU-GPU clusters can enable all-against-all comparisons on large datasets. In this work, we present four GPU implementations for large-scale pairwise sequence alignment: TiledDScan-mNW, DScan-mNW, RScan-mNW and LazyRScan-mNW. The proposed GPU kernels exhibit different parallelization patterns: we discuss how each parallelization strategy affects the memory accesses and the utilization of the underlying GPU hardware. We evaluate our implementations on a variety of low- and high-end GPUs with different compute capabilities. Our results show that all the proposed solutions outperform the existing open-source implementation from the Rodinia Benchmark Suite, and LazyRScan-mNW is the preferred solution for applications that require performing the trace-back operation only on a subset of the considered sequence pairs (for example, the pairs whose alignment score exceeds a predefined threshold). Finally, we discuss the integration of the proposed GPU kernels into a hybrid MPI-CUDA framework for deployment on CPU-GPU clusters. In particular, our proposed distributed design targets both homogeneous and heterogeneous clusters with nodes that differ amongst themselves in their hardware configuration. 相似文献

15.

A survey and measurement study of GPU DVFS on energy conservation

《Digital Communications & Networks》2017,3(2):89-100

Energy efficiency has become one of the top design criteria for current computing systems. The dynamic voltage and frequency scaling (DVFS) has been widely adopted by laptop computers, servers, and mobile devices to conserve energy, while the GPU DVFS is still at a certain early age. This paper aims at exploring the impact of GPU DVFS on the application performance and power consumption, and furthermore, on energy conservation. We survey the state-of-the-art GPU DVFS characterizations, and then summarize recent research works on GPU power and performance models. We also conduct real GPU DVFS experiments on NVIDIA Fermi and Maxwell GPUs. According to our experimental results, GPU DVFS has significant potential for energy saving. The effect of scaling core voltage/frequency and memory voltage/frequency depends on not only the GPU architectures, but also the characteristic of GPU applications. 相似文献

16.

面向CUDA程序的性能预测框架

下载免费PDF全文

曲海成于思淼刘万军王鑫源《电子学报》2020,48(4):654-661

为对CUDA并行程序内核性能进行分析和预测，从而指导并行程序设计及性能优化，提出一种性能预测框架.1）从GPU编程模型和设备架构细节入手，以线程束为研究单位，通过整合与GPU程序用时密切相关的软硬件基本特征，定义了并行空间闲置度、流处理器线程束负载、并行效应因子等高层次性能相关特征.2）基于上述特征，框架针对线程负载均衡型GPU程序，评估内核函数在不同问题规模以及执行配置下的执行时间.3）依据性能评估原理提出了内核函数执行配置参数的优化策略.验证实验结果表明，该框架在两种典型情境下对现有程序性能的平均预测准确率分别达到89%和94%，客观归纳了高层次特征与程序性能间的相关关系，且能定性分析并行算法性能水平. 相似文献

17.

IterML: Iterative Machine Learning for Intelligent Parameter Pruning and Tuning in Graphics Processing Units

Cui Xuewen Feng Wu-chun 《Journal of Signal Processing Systems》2021,93(4):391-403

With the rise of graphics processing units (GPUs), the parallel computing community needs better tools to productively extract performance from the GPU. While modern compilers provide flags to activate different optimizations to improve performance, the effectiveness of such automated optimization has been limited at best. As a consequence, extracting the best performance from an algorithm on a GPU requires significant expertise and manual effort to exploit both spatial and temporal sharing of computing resources. In particular, maximizing the performance of an algorithm on a GPU requires extensive hyperparameter (e.g., thread-block size) selection and tuning. Given the myriad of hyperparameter dimensions to optimize across, the search space of optimizations is extremely large, making it infeasible to exhaustively evaluate. This paper proposes an approach that uses statistical analysis with iterative machine learning (IterML) to prune and tune hyperparameters to achieve better performance. During each iteration, we leverage machine-learning models to guide the pruning and tuning for subsequent iterations. We evaluate our IterML approach on the GPU thread-block size across many benchmarks running on an NVIDIA P100 or V100 GPU. Our experimental results show that our automated IterML approach reduces search effort by 40% to 80% when compared to traditional (non-iterative) ML and that the performance of our (unmodified) GPU applications can improve significantly — between 67% and 95% — simply by changing the thread-block size.

相似文献

18.

基于GPU+CPU的CANNY算子快速实现

下载免费PDF全文

唐斌龙文《液晶与显示》2016,31(7):714-720

本文提出一种基于GPU+CPU的快速实现Canny算子的方法。首先将算子分为串行和并行两部分,高斯滤波、梯度幅值和方向计算、非极大值抑制和双阈值处理在GPU中完成,将二维高斯滤波分解为水平方向上和垂直方向上的两次一维滤波从而降低计算的复杂度;然后使用CUDA编程完成多线程并行计算以加快计算速度;最后使用共享存储器隐藏线程访问全局存储的延迟;在CPU中则使用队列FIFO完成边缘连接。仿真测试结果表明：对分辨率为1024×1024的8位图像的处理时间为122 ms,相对应单独使用CPU而言,加速比最高可达5.39倍,因此本文方法充分利用了GPU的并行性的特征和CPU的串行处理能力。相似文献

19.

基于GPU的Viterbi并行解码算法的设计与实现

李俊鹏余心乐徐伟掌《电视技术》2017,41(4)

针对GPU并行计算特征,对Viterbi解码自身做了并行处理探索,并提出使用Zero-Termination卷积码来实现基于GPU的Viterbi解码分块并行处理.设计的实现结果表明:Zero-Termination卷积码的简单而适用于GPU分块并行;误码率降低,特别是在信噪比低的情况下,Zero-Termination卷积码误码率比不损失码率的卷积码要低.同时,还实现了基于GPU的7,9,15三种不同约束长度的Viterbi解码,获得了良好的误码性能曲线及高吞吐率表现. 相似文献

20.

Systematic Application of Data Transfer and Storage Optimizing Code Transformations for Power Consumption and Execution Time Reduction in ACROPOLIS: A Pre-Compiler for Multimedia Applications

K. Masselos F. Catthoor C. E. Goutis H. De Man 《Design Automation for Embedded Systems》2003,8(1):51-86

相似文献