1.

H.264/AVC inter prediction for heterogeneous computing systems

Rafael Rodríguez-Sánchez José Luis Martínez Gerardo Fernández-Escribano José Manuel Claver José L. Sánchez 《The Journal of supercomputing》2013,64(1):79-88

H.264/AVC is the latest standard for video compression and is a significant advance, but at the expense of increasing computing needs. Recently, the progress of GPUs has attracted considerable attention because they are able to offer practical and acceptable solutions for speeding up graphics and non-graphics applications. In this paper, we present an implementation of H.264/AVC Motion Estimation running on an NVIDIA GTX285 using CUDA. The algorithm is divided into three steps that must be executed sequentially, but each step is itself executed in a highly parallel fashion on the GPU. The proposed motion estimation algorithm runs 53 times faster and reduces energy consumption by a factor of 9 compared with the JM reference encoder using a single CPU core.

2.

A fast N-body algorithm under the parallel time-space processing model

王伟 曾栩鸿 王福焕 傅丽丽 曾国荪 《计算机科学与探索》2011,5(11):1006-1013

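Recent advances in graphics processing units (GPUs) make high-performance general-purpose computing available at low cost. The GPU programming models CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) give programmers a rich C-like application programming interface (API) for exploiting the parallel computing power of GPUs. This paper uses graphics hardware to accelerate computation: it analyzes existing GPU implementations of the N-body problem through a new GPU processing model, the parallel time-space model, proposes a new algorithm for fast N-body simulation on GPUs, and implements it on an AMD Radeon HD 5850. Experimental results show a speedup of about 400 times over a CPU implementation and of 2 to 5 times over existing GPU implementations.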

3.

Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

Michał Czapiński Chris Thompson Stuart Barnes 《International journal of parallel programming》2014,42(6):1032-1047

The possibility of porting algorithms to graphics processing units (GPUs) has attracted significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing overhead on hybrid CPU–GPU platforms, including careful data layout and usage of GPU memory spaces, and use of non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for 2D Laplace’s Equation. Experiments carried out using various graphics hardware and types of connectivity have confirmed that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on NVIDIA GTX 480), and that the non-blocking communication significantly reduces overhead, allowing for almost linear speed-up, even when communication is carried out over relatively slow networks.
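
For readers unfamiliar with the solver mentioned above, the following is a minimal single-threaded sketch of a Jacobi iteration for the 2D Laplace equation. It is not the paper's hybrid CPU–GPU implementation; the names `jacobi_laplace` and `make_grid` and the fixed-boundary setup are assumptions made for illustration.

```python
def make_grid(n, top=1.0):
    """Hypothetical helper: n x n grid, zero everywhere except a fixed top edge."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [top] * n
    return grid


def jacobi_laplace(grid, iterations):
    """Run Jacobi sweeps; boundary cells are held fixed (Dirichlet condition)."""
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for _ in range(iterations):
        # Every interior cell is updated from the previous sweep only,
        # which is what makes each sweep trivially parallel.
        nxt = [row[:] for row in current]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                nxt[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j]
                                    + current[i][j - 1] + current[i][j + 1])
        current = nxt
    return current
```

Because each sweep reads only the previous sweep's values, a grid can be split into strips that exchange just their edge rows between sweeps; that exchange is the communication the abstract aims to reduce.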

4.

Fast Motion Estimation on Graphics Hardware for H.264 Video Encoding (cited by: 1 total, 0 self-citations, 1 by others)

《Multimedia, IEEE Transactions on》2009,11(1):1-10

The video coding standard H.264 supports video compression with a higher coding efficiency than previous standards. However, this comes at the expense of an increased encoding complexity, in particular for motion estimation, which becomes a very time-consuming task even for today's central processing units (CPUs). On the other hand, modern graphics hardware includes a powerful graphics processing unit (GPU) whose computing power remains idle most of the time. In this paper, we present a GPU-based approach to motion estimation for the purpose of H.264 video encoding. A small diamond search is adapted to the programming model of modern GPUs to exploit their available parallel computing power and memory bandwidth. Experimental results demonstrate a significant reduction of computation time and a competitive encoding quality compared to a CPU UMHexagonS implementation while enabling the CPU to process other encoding tasks in parallel.
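
As a rough illustration of the search pattern named above, here is a sequential sketch of a small diamond search for one block, using the sum of absolute differences (SAD) as the matching cost. The abstract does not describe the GPU mapping, so this is only the textbook pattern; the function names, the SAD cost, and the stopping rule are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and its displaced reference."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def small_diamond_search(cur, ref, bx, by, size, search_range):
    """Return ((dx, dy), cost) for the block whose top-left corner is (bx, by)."""
    height, width = len(ref), len(ref[0])

    def valid(dx, dy):
        # The displaced block must stay inside the search window and the frame.
        return (abs(dx) <= search_range and abs(dy) <= search_range
                and 0 <= bx + dx and bx + dx + size <= width
                and 0 <= by + dy and by + dy + size <= height)

    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, size)
    while True:
        centre = best
        # Test the four diamond neighbours of the current centre.
        for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            dx, dy = centre[0] + ox, centre[1] + oy
            if valid(dx, dy):
                cost = sad(cur, ref, bx, by, dx, dy, size)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
        if best == centre:  # no neighbour improved the cost: stop
            return best, best_cost
```

The search moves only to strictly cheaper positions, so it always terminates, but it can stop at a local minimum; that is the usual trade-off of fast search patterns against a full search.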

5.

Memory bandwidth analysis and DRAM controller design for an H.264 encoder

胡红旗 许家栋 孙景楠 《计算机工程与应用》2009,45(14):141-144

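Based on an analysis of the memory bandwidth requirements of H.264/AVC encoding, this paper proposes a DRAM controller architecture and implements controller designs with three scheduling policies: token ring, fixed priority, and preemptive. Combined with an existing memory address mapping method, the designs further improve memory bandwidth utilization by reducing the redundant cycles incurred during row changes and bank switching. Experimental results show that, of the three proposed designs, preemptive scheduling achieves the highest bandwidth utilization and can support real-time HDTV 1080p encoding at a clock frequency of 150 MHz.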

6.

Design and implementation of a parallel depth-first algorithm on GPU clusters

余莹 李肯立 郑光勇 《计算机科学》2015,42(1):82-85

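A naive execution of depth-first search (DFS) on large graphs in a GPU cluster leads to load imbalance among threads and uncoalesced memory accesses, which results in poor performance. To substantially improve performance on both a single GPU and multiple GPUs, the data are rearranged by a series of operations before processing. A new technique for mapping threads to data is proposed that achieves perfect load balance by means of prefix-sum and binary search operations. To reduce communication overhead, a pruning operation is applied to the set of edges that must be exchanged in each DFS branch. Experimental results show that the algorithm comes as close as possible to optimal parallelism on a single GPU and minimizes communication overhead in a multi-GPU environment. On a GPU cluster, it can efficiently perform distributed DFS on graphs with billions of nodes.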

7.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.

8.

Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms

Svetislav Momcilovic Nuno Roma Leonel Sousa 《Journal of Real-Time Image Processing》2016,11(3):571-587

9.

GPU implementation of a parallel two‐list algorithm for the subset‐sum problem

Lanjun Wan Kenli Li Jing Liu Keqin Li 《Concurrency and Computation》2015,27(1):119-145

The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
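
To make the stages above more concrete, here is a sequential sketch of the classic two-list (meet-in-the-middle) idea: an iterative generation stage that builds two sorted lists of subset sums, followed by a two-pointer search stage. The pruning stage of the parallel version and the paper's vector-based GPU mechanism are not reproduced; the function names are illustrative assumptions.

```python
from heapq import merge


def subset_sums(weights):
    """Generation stage: iteratively build the sorted list of all subset sums."""
    sums = [0]
    for w in weights:
        # Merging two sorted lists keeps the result sorted without recursion.
        sums = list(merge(sums, [s + w for s in sums]))
    return sums


def two_list_subset_sum(weights, target):
    """Search stage: return True if some subset of weights sums to target."""
    half = len(weights) // 2
    a = subset_sums(weights[:half])   # scanned in ascending order
    b = subset_sums(weights[half:])   # scanned in descending order
    i, j = 0, len(b) - 1
    while i < len(a) and j >= 0:
        total = a[i] + b[j]
        if total == target:
            return True
        if total < target:
            i += 1   # need a larger sum: advance in the ascending list
        else:
            j -= 1   # need a smaller sum: step back in the descending list
    return False
```

For n items, each list holds 2^(n/2) sums, so the search touches far fewer combinations than the 2^n subsets a brute-force scan would enumerate.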

10.

Value Prediction and Speculative Execution on GPU

Shaoshan Liu Christine Eisenbeis Jean-Luc Gaudiot 《International journal of parallel programming》2011,39(5):533-552

GPUs and CPUs have fundamentally different architectures. It is conventional wisdom that GPUs can accelerate only those applications that exhibit very high parallelism, especially vector parallelism such as image processing. In this paper, we explore the possibility of using GPUs for value prediction and speculative execution: we implement software value prediction techniques to accelerate programs with limited parallelism, and software speculation techniques to accelerate programs that contain runtime parallelism, which are hard to parallelize statically. Our experimental results show that, due to the relatively high overhead, mapping software value prediction techniques onto existing GPUs may not bring any immediate performance gain. On the other hand, although software speculation techniques introduce some overhead as well, mapping these techniques to existing GPUs can already bring some performance gain over the CPU. Based on these observations, we explore the hardware implementation of speculative execution operations on GPU architectures to reduce the software performance overheads. The results indicate that the hardware extensions yield an almost tenfold reduction in control-divergent sequential operations with only moderate hardware (5–8%) and power consumption (1–5%) overheads.

11.

Efficient breadth first search on multi-GPU systems

Enrico Mastrostefano Massimo Bernaschi 《Journal of Parallel and Distributed Computing》2013

Simple algorithms for Breadth First Search on large graphs, when run on clusters of GPUs, lead to load imbalance among threads and uncoalesced memory accesses, resulting in low performance. To obtain a significant improvement on a single GPU and to scale by using multiple GPUs, we resort to a suitable combination of operations to rearrange data before processing them. We propose a novel technique for mapping threads to data that achieves a perfect load balance by leveraging prefix-sum and binary search operations. To reduce the communication overhead, we perform a pruning operation on the set of edges that needs to be exchanged at each BFS level. The result is an algorithm that makes the best use of the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes.
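
One common way to combine prefix-sum and binary search for load balancing is to give every thread exactly one edge of the current frontier and let it find its owning vertex by binary search. The sketch below illustrates that general idea on the CPU; it is an assumption about the flavor of the technique, not the paper's actual kernel, and `build_offsets` and `locate` are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(degrees):
    """Exclusive prefix sum: offsets[i] is the global index of vertex i's first edge."""
    return [0] + list(accumulate(degrees))


def locate(thread_id, offsets):
    """Map one thread to (frontier vertex index, edge index within that vertex)."""
    # The last offset that is <= thread_id identifies the owning vertex.
    vertex = bisect_right(offsets, thread_id) - 1
    return vertex, thread_id - offsets[vertex]
```

With frontier degrees `[3, 0, 2]`, the offsets are `[0, 3, 3, 5]`, so five threads cover five edges and the zero-degree vertex receives no thread at all; each thread does the same amount of work regardless of how skewed the degree distribution is.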

12.

A fast H.264 deblocking filter algorithm on the CUDA architecture (cited by: 1 total, 0 self-citations, 1 by others)

刘虎 孙召敏 陈启美 《计算机应用》2010,30(12):3252-3254

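To address the high computational complexity and long running time of the deblocking filter in the H.264/AVC video coding standard, this paper proposes a fast parallel H.264 deblocking filter algorithm based on the NVIDIA Compute Unified Device Architecture (CUDA) platform. It describes the hardware architecture and software development flow of the CUDA platform. Exploiting the concurrent architecture of the graphics processing unit (GPU), the BS decision and the filtering computation are optimized for parallel execution, which lowers the algorithm's complexity, and shared memory is used to speed up data access, yielding a parallel deblocking filter. Experimental results show that, with essentially unchanged image quality, the GPU algorithm markedly increases processing speed, with an average speedup of about 20 times.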

13.

A fast mode selection algorithm based on motion complexity

何军球 赵欢 《计算机工程与应用》2009,45(7):79-81

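H.264/AVC is the latest video compression standard developed jointly by ITU-T and ISO/IEC. It adopts variable block-size motion estimation and rate-distortion optimized mode decision. H.264 must perform rate-distortion optimization for 10 modes to find the optimal coding mode of a macroblock, which greatly increases encoder complexity. To make mode selection more efficient, a fast mode selection algorithm based on motion complexity is proposed. Experimental results show that the algorithm substantially increases encoding speed with only a slight drop in PSNR.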

14.

Memory optimization of an H.264 encoder on embedded processors

徐宁 史册 陈梅丽 《计算机工程》2006,32(23):268-270

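Because the new H.264/AVC standard adopts many new techniques, an unoptimized implementation on programmable processors requires a very large amount of memory. This paper analyzes the memory complexity of the encoder and, on that basis, proposes macroblock-level filtering and interpolation algorithms. To ease implementation on embedded processors, an efficient memory management and scheduling strategy is proposed. Experimental results show that the optimizations greatly reduce memory complexity (cycle: 64.9%) while achieving a higher encoding speed (76.6%), with only a small loss in coding efficiency.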

15.

Decision-tree-based fast mode selection for P frames in H.264/AVC (HHME2013 paper No. 61)

王萍 张晓丹 张磊 《中国图象图形学报》2014,19(3)

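Objective: Inter-prediction coding in H.264/AVC computes and compares the rate-distortion cost of every possible coding mode, and the large number of mode types makes P-frame encoding computationally very expensive. This paper proposes a decision-tree-based algorithm for fast candidate mode selection in P frames. Method: After 16×16 inter-frame motion estimation of a macroblock, candidate modes are first selected directly for some macroblocks according to the number of all-zero 4×4 coefficient blocks in the residual macroblock; for the remaining macroblocks, the SATD values of the sixteen 4×4 blocks are fed to a decision-tree classifier to select candidate modes. Results: Because only the candidate modes need to be encoded, the encoder's computational complexity is reduced effectively. Experiments show that, compared with the original full-search encoding algorithm, the algorithm saves a fairly consistent amount of encoding time on video sequences with different degrees of motion, with only a small loss in average peak signal-to-noise ratio and a small increase in average bit rate. Conclusion: This paper proposes a new candidate mode selection algorithm for P-frame inter prediction that classifies the candidate mode set with a decision tree, using information from the residual macroblock after inter-frame motion estimation. Experiments show that the algorithm effectively reduces the computation in encoding and shortens encoding time while preserving video coding quality.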
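
The method above uses the SATD (sum of absolute transformed differences) of each 4×4 residual block as a classifier feature. The following is a minimal sketch of one common way to compute it, via an unnormalized 4×4 Hadamard transform. The abstract does not state the exact definition or scaling used, so treat this as an assumption; `H4`, `matmul4`, and `satd4x4` are illustrative names.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(residual):
    """Sum of absolute values of H4 * residual * H4^T (no normalization applied)."""
    h_transposed = [list(col) for col in zip(*H4)]
    coeffs = matmul4(matmul4(H4, residual), h_transposed)
    return sum(abs(c) for row in coeffs for c in row)
```

A macroblock would contribute sixteen such values, one per 4×4 block, which could then be passed to whatever classifier is used; reference encoders often apply an additional scaling factor that is omitted here.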

16.

HDS, a real-time multi-DSP motion estimator for MPEG-4 H.264 AVC high definition video encoding

Fabrice Urban Jean-François Nezan Mickaël Raulet 《Journal of Real-Time Image Processing》2009,4(1):23-31

The H.264 AVC video compression standard achieves high compression rates at the cost of high encoder complexity. Encoder performance is closely tied to the motion estimation operation, which requires high computation power and memory bandwidth. The high definition context magnifies the difficulty of a real-time implementation. EPZS and HME are two well-known motion estimation algorithms. Both EPZS and HME are implemented on a DSP and their performance is compared in terms of both quality and complexity. Based on these results, a new algorithm called HDS, for Hierarchical Diamond Search, is proposed. HDS motion estimation is integrated into an AVC encoder to measure the timings and the resulting video quality. A real-time DSP implementation of H.264 quarter-pixel accuracy motion estimation is proposed for SD and HD video formats. Furthermore, the characteristics of HDS make this algorithm well suited for H.264 SVC real-time encoding applications.

17.

Fast block mode decision algorithm in H.264/AVC using a filter bank of Kalman filters for high definition encoding

Jinwuk Seok Jeong-Woo Lee Chang-Sik Cho 《Multimedia Systems》2008,13(5-6):391-408

In this paper, we propose a fast mode decision algorithm using a filter bank of Kalman filters for H.264/AVC. For the highest coding efficiency in H.264/AVC, a macroblock can be coded with seven different block sizes for motion compensation in an inter mode and various spatial prediction modes in an intra mode. The conventional encoder employs a complex technique for mode decision based on the rate-distortion (RD) cost of all possible modes. Hence, to select the best block mode with the minimum RD cost, the conventional procedure imposes a heavy computational burden and requires a very complex encoding structure. In order to reduce the complexity, we propose a fast algorithm for mode decision based on Kalman filtering to estimate the RD cost of a specific block mode. Furthermore, we propose an optimized structure of the H.264/AVC encoder to implement the proposed algorithm. Computer simulations show that, using SIMD technology, the proposed methods are dramatically faster than the original JM 9.6 encoder without considerable performance degradation.

18.

Low-complexity heterogeneous architecture for H.264/HEVC video transcoding

Antonio Jesús Díaz-Honrubia Gabriel Cebrián-Márquez José Luis Martínez Pedro Cuenca José Miguel Puerta José Antonio Gámez 《Journal of Real-Time Image Processing》2016,12(2):311-327

High efficiency video coding (HEVC) was developed by the Joint Collaborative Team on Video Coding to replace the current H.264/AVC standard, which has been widely adopted over the last few years. Therefore, there is a lot of legacy content encoded with H.264/AVC, and an efficient conversion to HEVC is needed. This paper presents a hybrid transcoding algorithm which makes use of soft computing techniques as well as parallel processing. On the one hand, a fast quadtree level decision algorithm tries to exploit the information gathered at the H.264/AVC decoder to make faster decisions on coding unit splitting in HEVC using a Naïve–Bayes probabilistic classifier that is determined by a supervised data mining process. On the other hand, a parallel HEVC-encoding algorithm makes use of a heterogeneous platform composed of a multi-core central processing unit plus a graphics processing unit (GPU). In this way, from a coarse point of view, groups of frames or rows of a frame (both options are possible) are divided into threads to be executed on each core (each of which executes one of the aforementioned classifiers) and, from a finer point of view, all these threads work in a collaborative way on a single GPU to perform the motion estimation process on the co-processor. Experimental results show that the proposed transcoder can achieve a good tradeoff between coding efficiency and complexity compared with the anchor transcoder.

19.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

• a method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• a method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
• a method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
• an approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

20.

A GPU-based parallel decoding algorithm for H.264

陈鹏 曹剑炜 陈庆奎 《计算机工程》2014,(1):283-286

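To decode H.264 video streams in parallel, this paper proposes a cooperative CPU/GPU algorithm. Using the Compute Unified Device Architecture (CUDA) language as the GPU programming model, it accelerates the inverse DCT and intra prediction on the GPU. While maintaining high computational accuracy, it uses mixed CUDA programming to improve the computing performance of the system. With the CUDA language provided by NVIDIA, the inverse DCT and intra prediction are run in parallel on the GPU during decoding; the parallel algorithm is compared with a CPU-only implementation, and its speedup is verified with different numbers of video streams. Experimental results show that the algorithm substantially improves the encoding and decoding efficiency of video streams, with an average speedup of 10 times over the CPU-only implementation.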