期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

王庆春何晓燕曹喜信《电视技术》2010,34(6)

根据H.264/AVC视频编码中分数像素运动估计(FME)的算法特点,针对视频编码系统的不同具体需求,提出了FME的4种VLSI实现结构,并对这些结构的硬件利用率和运算速度进行了对比分析. 相似文献

2.

VLSI Architecture Design of Fractional Motion Estimation for H.264/AVC

Yi-Hau Chen Tung-Chien Chen Shao-Yi Chien Yu-Wen Huang Liang-Gee Chen 《Journal of Signal Processing Systems》2008,53(3):335-347

The H.264/AVC Fractional Motion Estimation (FME) with rate-distortion constrained mode decision can improve the rate-distortion efficiency by 2–6 dB in peak signal-to-noise ratio. However, it comes with considerable computation complexity. Acceleration by dedicated hardware is a must for real-time applications. The main difficulty for FME hardware implementation is parallel processing under the constraint of the sequential flow and data dependency. We analyze seven inter-correlative loops extracted from FME procedure and provide decomposing methodologies to obtain efficient projection in hardware implementation. Two techniques, 4×4 block decomposition and efficiently vertical scheduling, are proposed to reuse data among the variable block size and to improve the hardware utilization. Besides, advanced architectures are designed to efficiently integrate the 6-taps 2D finite impulse response, residue generation, and 4×4 Hadamard transform into a fully pipelined architecture. This design is finally implemented and integrated into an H.264/AVC single chip encoder that supports realtime encoding of 720×480 30fps video with four reference frames at 81 MHz operation frequency with 405 K logic gates (41.9% area of the encoder).

Liang-Gee ChenEmail:

相似文献

3.

分数像素运动矢量代价产生器的VLSI设计

王庆春曹喜信何晓燕魏红雅《电视技术》2007,31(4):15-18

针对H．264／AVC视频编码器的系统芯片设计,通过分析分数像素运动估计（FME）模块的数据并行度和硬件利用率,探讨了分数像素运动矢量代价产生器的复用结构,再依据FME模块的具体设计约束,提出了可以复用产生1／2像素和1／4像素运动矢量代价的硬件实现结构,并且在FPGA开发板上进行了分数像素运动矢量代价产生器的设计验证。相似文献

4.

Interpolation-Free Fractional-Pixel Motion Estimation Algorithms with Efficient Hardware Implementation 总被引：1，自引：0，他引：1

Mohammed S. Sayed Wael Badawy Graham Jullien 《Journal of Signal Processing Systems》2012,67(2):139-155

This paper presents interpolation-free fractional-pixel motion estimation (FME) algorithms and efficient hardware prototype of one of the proposed FME algorithms. The proposed algorithms use a mathematical model to approximate the matching error at fractional-pixel locations instead of using the block matching algorithm to evaluate the actual matching error. Hence, no interpolation is required at fractional-pixel locations. The matching error values at integer-pixel locations are used to evaluate the mathematical model coefficients. The performance of the proposed algorithms has been compared with several FME algorithms including the full quarter-pixel search (FQPS) algorithm, which is used as part of the H.264 reference software. The computational cost and the performance analysis show that the proposed algorithms have about 90% less computational complexity than the FQPS algorithm with comparable reconstruction video quality (i.e., approximately 0.2 dB lower reconstruction PSNR values). In addition, a hardware prototype of one of the proposed algorithms is presented. The proposed architecture has been prototyped using the TSMC 0.18 μm CMOS technology. It has maximum clock frequency of 312.5 MHz, at which, the proposed architecture can process more than 70 HDTV 1080p fps. The architecture has only 13,650 gates. The proposed architecture shows superior performance when compared with several FME architectures. 相似文献

5.

Bandwidth Optimized and High Performance Interpolation Architecture in Motion Compensation for H.264/AVC HDTV Decoder

Yu Li Yun He 《Journal of Signal Processing Systems》2008,52(2):111-126

In this paper, we present high performance motion compensation architecture for H.264/AVC HDTV decoder. The bottleneck of efficient motion compensation implementation primarily rests on the high memory bandwidth demand and six-tap fractional interpolation complexity. To solve the bottleneck for H.264/AVC HD applications, three combined bandwidth optimization strategies are proposed to minimize the memory bandwidth for MB-based decoding process. To improve the interpolation hardware utilization and reduce the interpolation cycles, an interpolation classification scheme is proposed. By classifying the fifteen fractional pixels into five types and processing correspondingly, the interpolation cycles decrease significantly. A direct mapping memory cache characterized with circular addressing, byte-aligned addressing and horizontal and vertical parallel access is designed to support the proposed scheme. The hardware of proposed motion compensation is implemented at 100 M with 31.841 K logic gates, averagely 70–80% reduced memory bandwidth can be offered and the interpolation hardware can be fully utilized and interpolate one MB within 304 cycles, which can satisfy the real time constraint for H.264/AVC HD (1,920 × 1,088) 30 fps decoder. The design is implemented under UMC 0.18 μm technology, and the synthesis results and comparisons are shown.

Yu LiEmail:

相似文献

6.

Hardware design focusing in the tradeoff cost versus quality for the H.264/AVC fractional motion estimation targeting high definition videos

Gustavo Sanchez Marcel Corrêa Diego Noble Marcelo Porto Sergio Bampi Luciano Agostini 《Analog Integrated Circuits and Signal Processing》2012,73(3):931-944

This article presents an architecture for the fractional motion estimation (FME) of the H.264/AVC video coding standard focusing in a good tradeoff between the hardware cost and the video quality. The support to FME guarantees a high quality in the motion estimation process. The applied algorithmic simplifications together with the multiplierless implementation and with a well balanced pipeline allow a low cost and a high throughput solution. The architecture was also designed to avoid redundant external memory accesses when computing the FME. The design was divided in two main modules: integer motion estimation (with diamond search algorithm) and fractional refinement (half-pixel and quarter-pixel interpolation and search). The designed architecture was described in VHDL and synthesized to an Altera Stratix III FPGA. The architecture is able to reach 260 MHz when running in the target FPGA. In worst case scenario, this operation frequency allows a processing rate of 43 HD 1080p (1,920 × 1,080 pixels) frames per second, surpassing the requirements for real time processing. In comparison to related works, the developed architecture was able to achieve a good tradeoff among hardware costs, video quality and processing rate. 相似文献

7.

Design a Low-Power H.264/AVC Baseline Decoder at All Abstraction Levels—A Showcase

Ke Xu Min Zhang Chiu Sing Choy 《Journal of Signal Processing Systems》2012,67(3):317-330

In this paper, we propose a low-power VLSI implementation of H.264/AVC baseline decoder. A systematic methodology for power reduction is proposed and applied at various design abstraction levels. At the algorithm level, the computational complexity is optimized. At the architecture level, pipelining and parallelism are widely adopted to reduce the operating frequency; hierarchical memory organization optimizes power-hungry memory accesses; hardware sharing reduces the total switching capacitance. At the circuit level, the knowledge about signal statistics is exploited to reduce number of transitions; data dependent signal-gating and clock-gating are introduced which are dynamic techniques for power reduction; multiplications are reduced and optimized, while complex dividers are totally eliminated. At the physical level, cell sizing and layout are optimized for power efficiency. The VLSI implementation shows that with UMC 0.18 μm technology, the proposed design is able to decode realtime QCIF 30fps at 1.5 MHz. The decoder contains 169 k logic gates and 2.5 KB on-chip SRAM. The total chip area is 4.4 × 4.4 mm² in a CQFP 208 package. The measured power consumption is 973 μW @ 1.8 V and 293 μW @ 1.0 V. The low-power and realtime features make our design ideal for portable or mobile applications. 相似文献

8.

A Fast-Deblocking Boundary-strength Based Architecture Design of Deblocking Filter in H.264/AVC Applications

Chun-Lung Hsu Yu-Sheng Huang 《Journal of Signal Processing Systems》2008,52(3):211-229

This work presents an efficient architecture design for deblocking filter in H.264/AVC using a novel fast-deblocking boundary-strength (FDBS) technique. Based on the FDBS technique, the proposed architecture divides the deblocking process into three filtering modes, namely offset-based, standard-based and diagonal-based filtering modes, to reduce the blocking artifact and improve the video quality in H.264/AVC. The proposed architecture is designed in Verilog HDL, simulated with Quartus II and synthesized using 0.18 μm CMOS cells library with the Synopsys Design Compiler. Simulation results demonstrate good performance in PSNR improvement and bit-rate reduction. Additionally, verification results through physical chip design reveal that the proposed architecture design can support 1,280 × 720@30 Hz processing throughput while clocking at 100 MHz. Comparisons with other studies show the excellent properties of the proposed architecture in terms of gate count, memory size and clock-cycle/macroblock.

Chun-Lung HsuEmail:

相似文献

9.

A Highly Parallel Joint VLSI Architecture for Transforms in H.264/AVC 总被引：1，自引：0，他引：1

Yu Li Yun He Shunliang Mei 《Journal of Signal Processing Systems》2008,50(1):19-32

In H.264/AVC, the concept of adapting the transform size to the block size of motion-compensated prediction residue has proven to be an important coding tool. This paper presents highly parallel joint circuit architecture for 8 × 8 and 4 × 4 adaptive block-size transforms in H.264/AVC. By decomposing the 8 × 8 transform to basic 4 × 4 transforms, a unified architecture is designed for both 8 × 8 and 4 × 4 transform and the transform data-path can be efficiently reused for six kinds of transforms. i.e., 8 × 8 forward, 8 × 8 inverse, 4 × 4 forward, 4 × 4 inverse, forward-Hadamard, inverse-Hadamard transforms. Linear shift mapping is applied on the memory buffer to support parallel access both in row and column directions which eliminates the need for a transpose circuit. For reusable and configurable transform data-path, a multiple-stage pipeline is designed to reduce the critical path length and increase throughput. The design is implemented under UMC 0.18 um technology at 200 MHz with 13.651 K logic gates, which can support 1,920 × 1,088 30 fps H.264/AVC HDTV decoder.

Yu LiEmail:

相似文献

10.

A Hardware-Efficient H.264/AVC Motion-Estimation Design for High-Definition Video

Yu-Kun Lin Chia-Chun Lin Tzu-Yun Kuo Tian-Sheuan Chang 《IEEE transactions on circuits and systems. I, Regular papers》2008,55(6):1526-1535

Motion estimation (ME) in high-definition H.264 video coding presents a significant design challenge for memory bandwidth, latency, and cost because of its large search range and various modes. To conquer this problem, this paper presents a low-latency and hardware-efficient ME design with three design techniques. The first technique on integer-pel ME (IME) adopts parallel instead of serial multiresolution search so that we can process 1080 p @ 60 fps videos with $pm$128 search range within just 256 cycles, 5.95-KB buffers, and 213.7K gates. The second technique on fractional-pel ME (FME) uses a single-iteration six-point search to reduce the cycle count by half with similar gate count and negligible quality loss. The third technique applies a mode-filtering approach to further reduce the bandwidth and cycles and share the buffer of IME and FME. The final ME implementation with 0.13-$mu{hbox {m}}$ process can support processing of 1080 p @ 60 fps with just 128.8 MHz, 282.6 K gates, and 8.54-KB buffer, which saves 60% gate count, and 68.9% SRAM buffers when compared with the previous design. 相似文献

11.

A Fast H.264 Intra Frame Encoder with Serialized Execution of 4 × 4 and 16 × 16 Predictions and Early Termination

Jin-Su Jung Young-Joon Jo Hyuk-Jae Lee 《Journal of Signal Processing Systems》2011,64(1):161-175

This paper presents a fast H.264 intra frame encoder that processes a single macroblock of 1920 × 1080 size video in 334 cycles on average which is 20% faster than the previous best design. The speed-up is mainly achieved by early termination of either 4 × 4 intra prediction or 16 × 16 intra prediction. The executions of intra 4 × 4 and 16 × 16 predictions are serialized and the second prediction is terminated early by using the cost of the first prediction as the stop criterion. A simple and efficient algorithm by making use of spatial locality is proposed to select the mode that is processed first. To avoid the bubble cycles caused by this serialized execution of 4 × 4 and 16 × 16 predictions, the modified processing order presented in (Jung et al. 2008) is employed for intra 4 × 4 prediction in order to schedule dependent 4 × 4 blocks apart from each other. To further reduce the execution time of 4 × 4 prediction, neighboring pixels with the same value are grouped, and only one prediction mode in the group is evaluated. Experimental results show that the PSNR drop is 0.0619 dB and the bitrate increase is 0.842% when compared with the JM reference software. The additional hardware cost to support the proposed methods is less than eight thousand gates which are very small when compared with the hardware size of a whole intra frame encoder. 相似文献

12.

H.264去块滤波算法及硬件实现

颜开汉《通信技术》2010,43(8):242-243,246

H.264是ITU-T/ISO在2003年公布的最新的国际视频压缩编码标准,它大大提高了编码效率和图像质量,其中一个重要原因是在编解码环路中引入了去块滤波器。介绍了H.264视频编码标准中的去块滤波算法,并提出了一种可实现的去块滤波器硬件结构。该结构通过合理利用本地SRAM资源,大大减少了总线带宽需求,提高了硬件处理速度。仿真结果显示,通过该去块滤波器进行环路滤波,很大程度地消除了方块效应,图像质量得到明显改善。相似文献

13.

An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC

G.A. Ruiz J.A. Michell 《Signal Processing: Image Communication》2011,26(6):289-303

Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts the variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but the high computational complexity makes design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H264/AVC based on the full-search block-matching algorithm (FSBMA). It uses architecture with a configurable 2D systolic array to obtain a high data reuse of search area. This systolic array supports a three-direction scan format in which only one row of pixels is changed between the two adjacent subblocks, thus reducing the memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SAD of basic 4×4 subblocks and a modified Lagrangian cost is used as matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18 μm technology resulting in a circuit with only 32.3 k gates and 6 RAMs (total 59kBits on-chip memory). In typical working conditions (25 °C, 1.8 V), a clock frequency of 300 MHz can be estimated with a processing capacity for HDTV (1920×1088 @ 30 fps) and a search range of 32×32. 相似文献

14.

FPGA Implementation of Integer Transform and Quantizer for H.264 Encoder 总被引：1，自引：0，他引：1

Reeba Korah J. Raja Paul Perinbam 《Journal of Signal Processing Systems》2008,53(3):261-269

This paper deals with the process of Transformation and Quantization that is carried out on each inter-predicted residual block in a video encoding process and their reduced complexity hardware implementation. H.264/AVC utilizes 4 × 4 integer transform, which is derived from the 4 × 4 DCT. We propose, a reduced complexity algorithm and a pipelined structure for the Core forward integer transform module. A multiplier-less architecture is realized with less number of shifts and adds compared to existing works. The corresponding inverse transform is exactly reversible. Each of the transformed coefficients is quantized by a scalar quantizer. The quantization step size can be varied from macroblock to macroblock. The proposed unified pipelined architecture outperforms many recent implementations in terms of gate count and is capable of processing a 4 × 4 residual block in 4 clock cycles.

Reeba KorahEmail:

相似文献

15.

Architecture Design of Fine Grain Quality Scalable Encoder with CABAC for H.264/AVC Scalable Extension

Tzu-Der Chuang Yu-Jen Chen Yi-Hau Chen Shao-Yi Chien Liang-Gee Chen 《Journal of Signal Processing Systems》2010,60(3):363-375

In addition to coding efficiency, the scalable extension of H.264/AVC provides good functionality for video adaptation in heterogeneous environments. Fine grain scalability (FGS) is a technique to extract video bitstream at the finest quality level under the given bandwidth. In this paper, an architecture of FGS encoder with low external memory bandwidth and low hardware cost is proposed. Up to 99% of bandwidth reduction can be attained by the proposed scan bucket algorithm, early context modeling with context reduction, and first scan pre-encoding. The area-efficient hardware architecture is implemented by layer-wise hardware reuse. Besides, three design strategies for enhancement layer coder are explored so that the trade-off between external memory bandwidth and silicon area is allowed. The proposed hardware architecture can real-time encode HDTV 1920×1080 video with two FGS enhancement layers at 200 MHz working frequency, or HDTV 1280×720 video with three FGS enhancement layers at 130 MHz working frequency. 相似文献

16.

Algorithm and Software Optimization of Variable Block Size Motion Estimation for H.264/AVC on a VLIW–SIMD DSP

Wonchul Lee Hyojin Choi Wonyong Sung 《Journal of Signal Processing Systems》2008,51(3):289-302

We implemented the H.264/AVC variable block size motion estimation (VBSME) using a very long instruction word (VLIW)–single instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method which has a regular structure is chosen for VBSME not only to remove redundant sum of absolute difference (SAD) operations but also to utilize the instruction level parallelism (ILP) and data level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce the number of ‘compare and update’ operations and simplify the rate distortion optimization (RDO). The developed fast mode decision uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs. Several advanced software techniques that include software pipelining and packed-data processing are employed. Especially, memory access overhead reduction schemes including the multi-block processing and the inter-procedural scheduling are used for the software optimization. In order to reduce the ‘write buffer full’ in the quarter pixel ME, a 4 bit quantization scheme is developed, which increases the number of arithmetic operations but decreases the stall cycles very much. The implemented variable block size ME for H.264/AVC requires an average of 9 M and 78 Mcycles per frame for QCIF and CIF size video sequences, respectively, in the TMS320C64x DSP architecture.

Wonyong SungEmail:

相似文献

17.

多标准宏块预测与边界强度计算的VLSI架构及应用

陶昱良何卫锋王琴毛志刚李涌伟郑吉君《微电子学与计算机》2011,28(6)

提出一种支持H.264 High Profile 4.1和AVS JiZhun Profile 6.0的多标准宏块预测与边界滤波强度计算的VLSI架构,该架构根据解码器的算法特点,实现了H.264和AVS标准中控制占优的帧内模式预测、帧间运动矢量预测以及边界滤波强度计算算法,能应用于当前的可重构多媒体系统.对该架构进行实现后,采用TSMC 65nm工艺综合,工作频率可达到312 MHz,解码一个H.264和AVS宏块最大分别消耗351和189个时钟周期,能够满足H.264和AVS高清(1080p)实时处理的需求. 相似文献

18.

Low-Power High-Speed Hybrid Wave-Pipeline Architectures for Binary Morphological Dilation

Mahdiye Hajirahimi Abdolreza Nabavi Ehsanolah Kabir 《Journal of Signal Processing Systems》2012,68(3):391-399

Dilation and erosion are two fundamental operations of mathematical morphology for image processing. This paper presents three hybrid wave-pipeline (HWP) architectures for real-time binary dilation operator. With minor changes to the number and/or to the type of the basic gates, they can be employed as erosion operator. In the first HWP-architecture, each single cell utilizes the wave technique along with delay units for balancing the data paths. By minimizing the number of delay units, the second HWP-architecture with reduced power consumption and hardware complexity is obtained. The third HWP-architecture employs wave technique in each three cascaded cells. This architecture improves the above performance further, at the cost of slight reduction in maximum clock frequency and clock frequency range. Simulation results, using a 0.18 μm CMOS technology, indicate that the HWP architectures have higher speed, less hardware complexity, and lower power consumption compared to pipeline (P) architecture. Also, they are faster than wave-pipeline (WP) architecture, without the difficulty of balancing the delay of long signal paths. Simulation illustrates that the third HWP-architecture dilates a 1024 × 1024 image by a 21 × 21 structuring element (SE) in 214.64 μs. The maximum frequency of operation is 5 GHz for the power supply of 1.8 V. The power dissipation is 410 mW, and the chip area is 0.075 mm². 相似文献

19.

Design and Synthesis of a Multiprocessor System-on-Chip Architecture for Real-Time Biomedical Signal Processing in Gamma Cameras

Kai Sun Meng Wang Zili Shao Hui Liu Hongxing Wei Tianmiao Wang 《Journal of Signal Processing Systems》2010,59(1):71-83

MPSoC (Multi-Processor System-on-Chip) architecture is becoming increasingly used because it can provide designers much more opportunities to meet specific performance and power goals. In this paper, we propose an MPSoC architecture for implementing real-time signal processing in gamma camera. Based on a fully analysis of the characteristics of the application, we design several algorithms to optimize the systems in terms of processing speed, power consumption, and area costs etc. Two types of DSP core have been designed for the integral algorithm and the coordinate algorithm, the key parts of signal processing in a gamma camera. An interconnection synthesis algorithm is proposed to reduce the area cost of the Network-on-Chip. We implement our MPSoC architecture on FPGA, and synthesize DSP cores and Network-on-Chip using Synopsys Design Compiler with a UMC 0.18 \upmum\upmu\textrm m standard cell library. The results show that our technique can effectively accelerate the processing and satisfy the requirements of real-time signal processing for 256 × 256 image construction. 相似文献

20.

A high speed programmable focal-plane SIMD vision chip

Dominique Ginhac Jérôme Dubois Barthélémy Heyrman Michel Paindavoine 《Analog Integrated Circuits and Signal Processing》2010,65(3):389-398

A high speed analog VLSI image acquisition and low-level image processing system is presented. The architecture of the chip is based on a dynamically reconfigurable SIMD processor array. The chip features a massively parallel architecture enabling the computation of programmable mask-based image processing in each pixel. Each pixel include a photodiode, an amplifier, two storage capacitors, and an analog arithmetic unit based on a four-quadrant multiplier architecture. A 64 × 64 pixel proof-of-concept chip was fabricated in a 0.35 μm standard CMOS process, with a pixel size of 35 μm × 35 μm. The chip can capture raw images up to 10,000 fps and runs low-level image processing at a framerate of 2,000–5,000 fps. 相似文献