Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
The variable block-size motion estimation (VBSME) process accounts for a major part of the computation in an H.264 encoder, and is usually accelerated by bit-parallel hardware architectures with a large I/O bit width to meet real-time constraints. However, such architectures increase the area overhead and pin count, and are therefore unsuitable for area-constrained consumer electronics designs such as small portable multimedia devices. This paper addresses the problem by proposing two area-efficient least significant bit (LSB) first bit-serial architectures with small pin counts. Both designs exploit data reuse, in different ways, for sum of absolute differences (SAD) computation and for reading reference pixels, leading to a considerable reduction in memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows, whereas the second design reuses the SADs of small blocks and has a reconfigurable reference buffer, giving better memory bandwidth when hardware parallelism is used. The proposed designs benefit from several optimization techniques, including an efficient serial absolute difference architecture, word length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly improve their performance in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full-search VBSME of 720 × 480 video at 30 frames per second (fps), with two reference frames and a [−16, 15] search range, at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.

2.
Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but the high computational complexity makes design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H.264/AVC based on the full-search block-matching algorithm (FSBMA). It uses an architecture with a configurable 2D systolic array to obtain high data reuse of the search area. This systolic array supports a three-direction scan format in which only one row of pixels changes between two adjacent subblocks, thus reducing memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SADs of the basic 4×4 subblocks, and a modified Lagrangian cost is used as the matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses a serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18 μm technology, resulting in a circuit with only 32.3 k gates and 6 RAMs (59 kbits of on-chip memory in total). Under typical working conditions (25 °C, 1.8 V), a clock frequency of 300 MHz is estimated, with a processing capacity for HDTV (1920×1088 @ 30 fps) and a search range of 32×32.

3.
Variable block size motion estimation (VBSME) is becoming the new coding technique in H.264/AVC. This article presents a low-power VLSI implementation of VBSME that employs a fast full-search block-matching algorithm to reduce power consumption while preserving the optimal motion vectors (MVs). The fast full-search algorithm compares the current minimum sum of absolute differences (SAD) with a conservative lower bound so that unnecessary SAD calculations can be eliminated. We first experimentally determine the specific conservative lower bound of the SAD and then implement the fast full-search algorithm in FPGA and 0.18 µm CMOS technology. To the best of our knowledge, this is the first time that a fast full-search block-matching algorithm has been explored to reduce power consumption in the context of VBSME and implemented in hardware. Experimental results show that the proposed design reduces power consumption by 45% compared with conventional VBSME designs that obtain the optimal MV with full-search algorithms.
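The idea of skipping SAD calculations with a conservative lower bound can be illustrated with a minimal Python sketch. The abstract does not state which bound the authors use, so this sketch substitutes the well-known triangle-inequality bound |sum(cur) − sum(ref)| ≤ SAD(cur, ref) as an assumption; the function names and the flattened-block representation are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def full_search_with_bound(cur, candidates):
    """Return (best_index, best_sad, skipped) over flattened candidate blocks.

    A candidate is skipped when a cheap lower bound on its SAD already
    reaches the current minimum, so the result matches plain full search.
    """
    cur_sum = sum(cur)
    best_idx, best_sad, skipped = -1, float("inf"), 0
    for idx, ref in enumerate(candidates):
        # Assumed bound (triangle inequality), not the paper's own bound:
        # |sum(cur) - sum(ref)| <= SAD(cur, ref).
        lower_bound = abs(cur_sum - sum(ref))
        if lower_bound >= best_sad:
            skipped += 1
            continue
        s = sad(cur, ref)
        if s < best_sad:
            best_idx, best_sad = idx, s
    return best_idx, best_sad, skipped


cur = [10, 12, 11, 9]
candidates = [[50, 50, 50, 50], [10, 12, 10, 9], [90, 90, 90, 90]]
result = full_search_with_bound(cur, candidates)
```

In this example the third candidate is rejected by the bound alone. In software, computing `sum(ref)` costs about as much as the SAD itself; a hardware design would typically keep block sums as running values, which is where the saving comes from.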

4.
In the Advanced Video Coding (AVC) standard, motion estimation (ME) adopts many new features to increase coding performance, such as the block matching algorithm (BMA), motion vector prediction (MVP) and variable block size motion estimation (VBSME). However, the VBSME used in the MPEG-4 AVC/H.264 standard leads to high computational complexity and data dependency, which make hardware implementation very complex.

5.
A VLSI architecture for variable block size video motion estimation   Cited by: 1 (self-citations: 0, citations by others: 1)
With the advent of new video standards such as MPEG-4 part-10 and H.264/H.26L, demands for advanced video coding, particularly in the area of variable block size video motion estimation (VBSME), are increasing. In this paper, we propose a new one-dimensional (1-D) very large-scale integration architecture for full-search VBSME (FSVBSME). The VBS sum of absolute differences (SAD) computation is performed by re-using the results of smaller sub-block computations. These are distributed and combined by incorporating a shuffling mechanism within each processing element. Whereas a conventional 1-D architecture can process only one motion vector (MV), this new architecture can process up to 41 MV sub-blocks (within a macroblock) in the same number of clock cycles.
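The reuse of smaller sub-block results can be sketched in Python as follows. This is only an illustration of the arithmetic, assuming the standard H.264 partition set (sixteen 4×4, eight 8×4, eight 4×8, four 8×8, two 16×8, two 8×16 and one 16×16, which gives 41 blocks); it does not model the shuffling mechanism or the processing elements, and all function names are hypothetical.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 macroblock."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def combine_sads(g):
    """Build all 41 partition SADs by summing the 4x4 SADs in grid g.

    Keys are "WxH" (width by height in pixels); values are in raster order.
    """
    s = {"4x4": [g[r][c] for r in range(4) for c in range(4)]}
    s["8x4"] = [g[r][c] + g[r][c + 1] for r in range(4) for c in (0, 2)]
    s["4x8"] = [g[r][c] + g[r + 1][c] for r in (0, 2) for c in range(4)]
    s["8x8"] = [g[r][c] + g[r][c + 1] + g[r + 1][c] + g[r + 1][c + 1]
                for r in (0, 2) for c in (0, 2)]
    b = s["8x8"]  # order: top-left, top-right, bottom-left, bottom-right
    s["16x8"] = [b[0] + b[1], b[2] + b[3]]  # top half, bottom half
    s["8x16"] = [b[0] + b[2], b[1] + b[3]]  # left half, right half
    s["16x16"] = [sum(b)]
    return s


# Illustrative data only: a gradient block matched against an all-zero block.
cur = [[(x + y) % 256 for x in range(16)] for y in range(16)]
ref = [[0] * 16 for _ in range(16)]
sads = combine_sads(sad_4x4_grid(cur, ref))
count = sum(len(v) for v in sads.values())
```

Only the sixteen 4×4 SADs touch pixel data; every larger partition is obtained by additions alone, which is why one candidate position can yield all 41 results in the same pass.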

6.
Execution latency and I/O bandwidth play essential roles in determining the effectiveness and the cost of a parallel hardware implementation for block-matching motion estimation algorithms. Unfortunately, almost all traditional architecture designs, e.g. the two-dimensional mesh-connected systolic array architecture (2DMCSA) and the tree-type structure (TTS), fail to take these two factors into account simultaneously. As a result, they suffer from either large execution latency or huge input bandwidth requirements. The authors propose a family of tree/linear architectures, which efficiently optimise the total implementation cost by combining the merits of the 2DMCSA and the TTS. Moreover, to facilitate hardware designs, the authors present the tree-cut techniques and the on-chip buffer design method to meet the computational demands of various video compression applications. The proposed architectures are capable of executing the exhaustive search and the search block-matching algorithms; they offer relatively flexible and cost-effective hardware solutions for a wide range of video coding systems, including CD-ROM, portable visual communications systems and high-definition TV.

7.
H.264/AVC is the latest video coding standard adopting variable block size motion estimation (VBS-ME), quarter-pixel accuracy, motion vector prediction and multi-reference frames for motion estimation. These new features result in much higher computation requirements than previous coding standards. In this paper we propose a novel most significant bit (MSB) first bit-serial architecture for full-search block matching VBS-ME, and compare it with systolic implementations. Since the nature of MSB-first processing enables early termination of the sum of absolute difference (SAD) calculation, the average hardware performance can be enhanced. Five different designs, one and two dimensional systolic and tree implementations along with bit-serial, are compared in terms of performance, pixel memory bandwidth, occupied area and power consumption.
Philip H. W. Leong (Corresponding author)
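The early-termination property of MSB-first processing can be illustrated with a small Python sketch. It is a software model only, under the assumption of 8-bit pixels: the absolute differences are computed up front rather than bit-serially as in hardware, and the function name and return convention are hypothetical. The accumulated partial sum never decreases as lower bit planes are added, so it is a lower bound on the final SAD and a candidate can be dropped once it reaches the current best.

```python
def msb_first_sad(cur, ref, best_so_far, bits=8):
    """Accumulate a SAD one bit plane at a time, most significant bit first.

    Returns (sad, planes_used); sad is None if the candidate was rejected
    early. Assumes pixel values fit in `bits` bits.
    """
    diffs = [abs(c - r) for c, r in zip(cur, ref)]
    partial = 0
    for plane, k in enumerate(range(bits - 1, -1, -1), start=1):
        # Weight of this plane: number of set bits times 2**k.
        partial += sum((d >> k) & 1 for d in diffs) << k
        # partial only grows, so it is a lower bound on the final SAD.
        if partial >= best_so_far:
            return None, plane
    return partial, bits


cur = [200, 10, 30, 40]
ref = [10, 12, 31, 44]
rejected = msb_first_sad(cur, ref, best_so_far=100)   # stops after 1 plane
accepted = msb_first_sad(cur, ref, best_so_far=1000)  # runs all 8 planes
```

With a tight current minimum, a poor candidate is rejected after the first bit plane, whereas an LSB-first scheme would have to process every plane before the comparison is meaningful.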

8.
We implemented the H.264/AVC variable block size motion estimation (VBSME) using a very long instruction word (VLIW)–single instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method, which has a regular structure, is chosen for VBSME not only to remove redundant sum of absolute difference (SAD) operations but also to utilize the instruction level parallelism (ILP) and data level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce the number of ‘compare and update’ operations and simplify the rate distortion optimization (RDO). The developed fast mode decision uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs. Several advanced software techniques, including software pipelining and packed-data processing, are employed. In particular, memory access overhead reduction schemes, including multi-block processing and inter-procedural scheduling, are used for the software optimization. To reduce ‘write buffer full’ events in the quarter-pixel ME, a 4-bit quantization scheme is developed, which increases the number of arithmetic operations but greatly reduces the stall cycles. The implemented variable block size ME for H.264/AVC requires an average of 9 Mcycles and 78 Mcycles per frame for QCIF and CIF size video sequences, respectively, on the TMS320C64x DSP architecture.
Wonyong Sung

9.
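For the system-on-chip design of an H.264/AVC video encoder, this work analyzes the data parallelism and hardware utilization of the fractional-pixel motion estimation (FME) module and explores a reusable structure for the fractional-pixel motion vector cost generator. Based on the specific design constraints of the FME module, it proposes a hardware structure that can be reused to generate both half-pixel and quarter-pixel motion vector costs, and verifies the design of the fractional-pixel motion vector cost generator on an FPGA development board.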

10.
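A hardware implementation architecture for the motion estimation algorithm in H.264   Cited by: 1 (self-citations: 1, citations by others: 0)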
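A hardware architecture is proposed for the variable block size motion estimation algorithm in the H.264 standard. The architecture uses a one-dimensional array of processing elements (PEs) to search for matching blocks, and computes the block matching error (SAD) of blocks of different sizes by reusing the SADs of smaller sub-blocks. Compared with a conventional architecture that processes a single motion vector, this architecture can process up to 41 motion vectors in a given number of clock cycles, and it offers small area and high speed.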

11.
In H.264/AVC, the motion estimation (ME) routine supports variable block size and involves highly parallel sum of absolute difference (SAD) computations. In this study, we introduce a bit-serial, hybrid-grained processing element (PE) based 2D architecture that has both early termination and intensive data reuse capabilities. PEs operate on most significant bit-first arithmetic for early termination, and the 2D architecture enables on-chip data reuse between neighboring PEs in a bit-by-bit pipelined fashion. Hybrid-grained PEs reduce the hardware overhead of the conventional adder tree structures used for implementing variable block size ME. Our design reduces the gate count by 7x compared to its ASIC counterpart, operates at a comparable frequency while sustaining 30 fps and 60 fps, and outperforms bit-parallel and bit-serial architectures in terms of throughput and performance per gate for various video formats.

12.
Block matching motion estimation is the heart of a video coding system. It leads to a high compression ratio, but it is time consuming and computationally intensive. Many fast search block matching motion estimation algorithms have been developed to minimize search positions and speed up computation, but they do not take into account how they can be effectively implemented in hardware. In this paper, we propose an efficient hardware architecture of the fast line diamond parallel search (LDPS) algorithm with variable block size motion estimation (VBSME) for the H.264/AVC video coding system. The design is described in VHDL and synthesized to an Altera Stratix III FPGA and to TSMC 0.18 μm standard cells. The throughput of the hardware architecture reaches a processing rate of up to 78 million pixels per second at a clock frequency of 83.5 MHz, and the design uses only 28 kgates when mapped to standard cells. Finally, a system on a programmable chip (SoPC) implementation and validation of the proposed design as an IP core is presented using the embedded video system.

13.
Motion estimation (ME) in high-definition H.264 video coding presents a significant design challenge for memory bandwidth, latency, and cost because of its large search range and various modes. To conquer this problem, this paper presents a low-latency and hardware-efficient ME design with three design techniques. The first technique on integer-pel ME (IME) adopts parallel instead of serial multiresolution search so that we can process 1080p @ 60 fps videos with ±128 search range within just 256 cycles, 5.95-KB buffers, and 213.7 K gates. The second technique on fractional-pel ME (FME) uses a single-iteration six-point search to reduce the cycle count by half with similar gate count and negligible quality loss. The third technique applies a mode-filtering approach to further reduce the bandwidth and cycles and share the buffer of IME and FME. The final ME implementation with 0.13-μm process can support processing of 1080p @ 60 fps with just 128.8 MHz, 282.6 K gates, and 8.54-KB buffer, which saves 60% gate count and 68.9% SRAM buffers when compared with the previous design.
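The relationship between clock frequency, frame rate and per-macroblock cycle count quoted above can be checked with a short calculation. This is a back-of-the-envelope sketch, not taken from the paper: it assumes 16×16 macroblocks and a coded frame height of 1088 lines (1080 padded to a multiple of 16), and the function name is hypothetical.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Clock cycles available per macroblock for real-time encoding."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return clock_hz / (mbs_per_frame * fps)


# Assumed coded size 1920x1088, 60 fps, 128.8 MHz (figures from the abstract).
budget = cycles_per_macroblock(1920, 1088, 60, 128.8e6)
```

Under these assumptions the budget comes out at roughly 263 cycles per macroblock, which is of the same order as the 256-cycle figure quoted for the IME stage.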

14.
Motion estimation is a highly computationally demanding operation in the video compression process and significantly affects the output quality of an encoded sequence. Special hardware architectures are required to achieve real-time compression performance. Many fast search block matching motion estimation (BMME) algorithms have been developed to minimize search positions and speed up computation, but they do not take into account how they can be effectively implemented in hardware. In this paper, we propose three new hardware architectures of a fast search block matching motion estimation algorithm using Line Diamond Parallel Search (LDPS) for the H.264/AVC video coding system. These architectures use pipeline and parallel processing techniques and offer minimum latency, maximum throughput and full utilization of hardware resources. The VHDL code has been tested and can work at high frequency in a Xilinx Virtex-5 FPGA circuit for the three proposed architectures.

15.
We present a full HD (1080p) H.264/AVC High Profile hardware encoder based on fast motion estimation (ME). Most processing cycles are occupied with ME and use external memory access to fetch samples, which degrades the performance of the encoder. A novel approach to fast ME which uses shared multibank memory can solve these problems. The proposed pixel subsampling ME algorithm is suitable for fast motion vector searches on high-resolution images. The proposed algorithm achieves an 87.5% reduction in computational complexity compared with the full search algorithm in the JM reference software, while sustaining the video quality without any conspicuous PSNR loss. The utilization of the shared multibank memory between the coarse ME and fine ME blocks is 93.6%, which saves external memory access cycles and speeds up ME. It is feasible to perform the algorithm at a 270 MHz clock speed for 30 frame/s real-time full HD encoding. Its total gate count is 872k, and the internal SRAM size is 41.8 kB.

16.
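The H.264 video coding standard introduces a quarter-pixel-accuracy interpolation algorithm, which greatly improves compression efficiency but also increases computational complexity and memory bandwidth. To address these problems from the motion estimation perspective, one-step interpolation and data reuse are adopted, reducing bandwidth by 26% and processing cycles by 45%. A corresponding hardware structure is designed: a five-stage pipeline implements the one-step interpolation algorithm, and an input buffer unit enables reuse of reference data. To handle the large volume of data generated during interpolation, a ping-pong buffering structure is used to ensure timely data transfer. The structure significantly reduces bandwidth and improves throughput, and is well suited to real-time encoders.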

17.
Modern video codecs such as MPEG2, MPEG4-ASP and H.264 depend on sub-pixel motion estimation to optimise rate-distortion efficiency. Sub-pixel motion estimation is implemented within these standards using interpolated values at 1/2 or 1/4 pixel accuracy. By using these interpolated values, the residual energy for each predicted macroblock is reduced. However, this leads to a significant increase in complexity at the encoder, especially for H.264, where the cost of an exhaustive set of macroblock segmentations needs to be estimated for optimal mode selection. This paper presents a novel scheme for sub-pixel motion estimation based on the whole-pixel SAD distribution. Both half-pixel and quarter-pixel searches are guided by a model-free estimation of the SAD surface using a two-dimensional kernel method. While giving an equivalent rate-distortion performance, this approach approximately halves the number of quarter-pixel search positions, giving an overall speed-up of approximately 10% compared to the EPZS quarter-pixel method (the state-of-the-art H.264 optimised sub-pixel motion estimator).

18.
Video applications are increasingly present in consumer electronic devices, which require low power and low energy consumption. Sum of Absolute Differences (SAD) is the most widely used distortion metric in video coding implementations and consumes a relatively large area in the motion estimation hardware. This paper presents the standard-cell synthesis and a comprehensive analysis of various parallel hardware architecture alternatives for SAD calculation, focusing on different design constraints such as high performance (maximum throughput) and the tradeoff between high performance and low power dissipation (namely an isoperformance target). Low-power techniques supported by commercial standard-cell tools are exercised in this design, such as clock gating, multi-threshold (VT) cells and a combination of slow and fast standard cells. We achieved significant power reduction for the architectures with lower frequencies and higher parallelism, slow cells and, mainly, only one pipeline stage.

19.
Data access usually accounts for more than 50% of the power cost in a modern signal processing system. To realize a low-power design, reducing memory access power is a critical issue. Data reuse (DR) is a technique that recycles the data read from memory and can be used to reduce memory access power. In this paper, a systematic method of DR exploration for low-power architecture design is presented. First, the signal processing algorithms are formulated as nested loop structures, and data locality is explored by means of loop analysis. Then, corresponding DR techniques are applied to reduce memory access power. The proposed design methodology is applied to the motion estimation (ME) algorithms of the H.264 video coding standard. After analyzing the ME algorithms, suitable parallel architectures and processing flows for the integer ME (IME) and fractional ME (FME) are proposed to achieve efficient DR. The amount of memory access is reduced to 0.91% and 4.37% in the proposed IME and FME designs, respectively, and thus substantial memory access power is saved. Finally, the design methodology is also beneficial for other signal processing systems with low-power requirements.
Liang-Gee Chen

20.
宋建斌, 李波, 李炜, 马丽. 《电子学报》 (Acta Electronica Sinica), 2007, 35(10): 1823-1827
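Multi-size block motion estimation in the H.264 standard significantly improves coding performance but greatly increases the amount of computation, posing a major challenge for the implementation of real-time H.264 encoders. This paper exploits the spatiotemporal correlation of video images and the motion similarity among blocks of different sizes and, based on the center-biased distribution of motion vectors, proposes a fast motion estimation algorithm. By effectively predicting the search starting point, adaptively selecting the search pattern, and applying a two-level search termination strategy, the algorithm achieves comparable coding performance while running motion estimation 95 to 247 times faster than the full search algorithm and 4.1 to 6.3 times faster than the fast algorithm recommended by H.264.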
