1.
2.
Yi-Hau Chen Tung-Chien Chen Shao-Yi Chien Yu-Wen Huang Liang-Gee Chen 《Journal of Signal Processing Systems》2008,53(3):335-347
The H.264/AVC Fractional Motion Estimation (FME) with rate-distortion constrained mode decision can improve the rate-distortion
efficiency by 2–6 dB in peak signal-to-noise ratio. However, it comes with considerable computational complexity. Acceleration
by dedicated hardware is a must for real-time applications. The main difficulty for FME hardware implementation is parallel
processing under the constraint of the sequential flow and data dependency. We analyze seven inter-correlative loops extracted
from FME procedure and provide decomposing methodologies to obtain efficient projection in hardware implementation. Two techniques,
4×4 block decomposition and efficient vertical scheduling, are proposed to reuse data among the variable block sizes and
to improve hardware utilization. In addition, advanced architectures are designed to efficiently integrate the 6-tap 2D finite
impulse response, residue generation, and 4×4 Hadamard transform into a fully pipelined architecture. This design is finally
implemented and integrated into an H.264/AVC single chip encoder that supports realtime encoding of 720×480 30fps video with
four reference frames at 81 MHz operation frequency with 405 K logic gates (41.9% area of the encoder).
Corresponding author: Liang-Gee Chen
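The residue generation and 4×4 Hadamard transform mentioned in the abstract above are the ingredients of a SATD (sum of absolute transformed differences) cost. The following is a minimal sketch of that cost for one 4×4 block, not the paper's pipelined architecture. The function names `hadamard4` and `satd4x4` are hypothetical, and no normalization factor is applied (reference encoders typically scale the result).

```python
def hadamard4(v):
    """1-D 4-point Hadamard transform using a two-stage add/subtract butterfly."""
    a, b, c, d = v
    s0, s1 = a + b, c + d
    d0, d1 = a - b, c - d
    # The output order is a permutation of the Hadamard rows; the order does
    # not matter for SATD because only absolute values are summed.
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]


def satd4x4(cur, ref):
    """Unnormalized SATD of two 4x4 blocks given as lists of four rows."""
    # Residue generation: current block minus reference block.
    diff = [[cur[r][c] - ref[r][c] for c in range(4)] for r in range(4)]
    # Transform the rows, then the columns of the row-transformed block.
    rows = [hadamard4(row) for row in diff]
    cols = [hadamard4([rows[r][c] for r in range(4)]) for c in range(4)]
    return sum(abs(x) for col in cols for x in col)
```

A hardware design would evaluate the same butterflies in parallel adders; the sketch only shows the arithmetic that each stage performs.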
3.
4.
Interpolation-Free Fractional-Pixel Motion Estimation Algorithms with Efficient Hardware Implementation (total citations: 1; self-citations: 0; citations by others: 1)
Mohammed S. Sayed Wael Badawy Graham Jullien 《Journal of Signal Processing Systems》2012,67(2):139-155
This paper presents interpolation-free fractional-pixel motion estimation (FME) algorithms and efficient hardware prototype
of one of the proposed FME algorithms. The proposed algorithms use a mathematical model to approximate the matching error
at fractional-pixel locations instead of using the block matching algorithm to evaluate the actual matching error. Hence,
no interpolation is required at fractional-pixel locations. The matching error values at integer-pixel locations are used
to evaluate the mathematical model coefficients. The performance of the proposed algorithms has been compared with several
FME algorithms including the full quarter-pixel search (FQPS) algorithm, which is used as part of the H.264 reference software.
The computational cost and the performance analysis show that the proposed algorithms have about 90% less computational complexity
than the FQPS algorithm with comparable reconstruction video quality (i.e., approximately 0.2 dB lower reconstruction PSNR
values). In addition, a hardware prototype of one of the proposed algorithms is presented. The proposed architecture has been
prototyped using the TSMC 0.18 μm CMOS technology. It has maximum clock frequency of 312.5 MHz, at which, the proposed architecture
can process more than 70 HDTV 1080p fps. The architecture has only 13,650 gates. The proposed architecture shows superior
performance when compared with several FME architectures.
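The abstract above does not state which mathematical model the authors use. As an illustration of the general idea only, the sketch below assumes a 1-D quadratic (parabolic) error model fitted through the matching errors at three neighbouring integer-pixel positions, applied separately per axis. The function name `subpel_offset` and the quarter-pixel rounding are assumptions, not the paper's algorithm.

```python
def subpel_offset(e_minus, e_zero, e_plus):
    """Estimate the fractional offset of the error minimum along one axis.

    e_minus, e_zero, e_plus are matching errors (e.g. SADs) at integer
    positions -1, 0 and +1 around the best integer-pixel match.
    Returns an offset rounded to quarter-pixel precision.
    """
    # Second difference of the three samples (curvature of the parabola).
    denom = e_minus - 2 * e_zero + e_plus
    if denom <= 0:
        # Flat or non-convex error surface: stay at the integer position.
        return 0.0
    # Vertex of the parabola through the three samples.
    offset = (e_minus - e_plus) / (2 * denom)
    # The vertex lies within half a pixel when e_zero is the minimum.
    offset = max(-0.5, min(0.5, offset))
    return round(offset * 4) / 4
```

Because only the integer-position errors are read, no fractional-pixel samples need to be interpolated to obtain the estimate.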
5.
In this paper, we present a high-performance motion compensation architecture for an H.264/AVC HDTV decoder. The bottleneck of
efficient motion compensation implementation primarily rests on the high memory bandwidth demand and six-tap fractional interpolation
complexity. To solve the bottleneck for H.264/AVC HD applications, three combined bandwidth optimization strategies are proposed
to minimize the memory bandwidth for MB-based decoding process. To improve the interpolation hardware utilization and reduce
the interpolation cycles, an interpolation classification scheme is proposed. By classifying the fifteen fractional pixels
into five types and processing correspondingly, the interpolation cycles decrease significantly. A direct mapping memory cache
characterized with circular addressing, byte-aligned addressing and horizontal and vertical parallel access is designed to
support the proposed scheme. The proposed motion compensation hardware runs at 100 MHz with 31.841 K logic gates,
reduces memory bandwidth by 70–80% on average, and keeps the interpolation hardware fully utilized, interpolating
one MB within 304 cycles, which satisfies the real-time constraint of an H.264/AVC HD (1,920 × 1,088) 30 fps decoder. The
design is implemented in UMC 0.18 μm technology, and the synthesis results and comparisons are shown.
Corresponding author: Yu Li
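For orientation, the six-tap interpolation referred to above uses the H.264 luma half-sample filter with taps (1, −5, 20, 20, −5, 1). The sketch below shows that filter for a single half-sample position, together with the cycle budget implied by the quoted figure of 304 cycles per macroblock. The function names are hypothetical, 8-bit samples are assumed, and the code does not model the paper's classification scheme or memory cache.

```python
def half_pel(samples):
    """H.264 luma half-sample value from six consecutive integer samples.

    The half-sample position lies between samples[2] and samples[3].
    """
    e, f, g, h, i, j = samples
    # Taps (1, -5, 20, 20, -5, 1), rounding offset 16, divide by 32.
    value = (e - 5 * f + 20 * (g + h) - 5 * i + j + 16) >> 5
    # Clip to the 8-bit sample range.
    return max(0, min(255, value))


def cycles_per_second(width, height, fps, cycles_per_mb):
    """Cycles per second needed when every 16x16 macroblock costs cycles_per_mb."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks * fps * cycles_per_mb
```

With 1,920 × 1,088 at 30 fps and 304 cycles per macroblock, `cycles_per_second` gives about 74.4 million cycles per second, which fits within a 100 MHz clock if the clock figure quoted in the abstract is read as megahertz.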
6.
Gustavo Sanchez Marcel Corrêa Diego Noble Marcelo Porto Sergio Bampi Luciano Agostini 《Analog Integrated Circuits and Signal Processing》2012,73(3):931-944
This article presents an architecture for the fractional motion estimation (FME) of the H.264/AVC video coding standard, focusing on a good tradeoff between hardware cost and video quality. Support for FME guarantees high quality in the motion estimation process. The applied algorithmic simplifications, together with a multiplierless implementation and a well-balanced pipeline, allow a low-cost, high-throughput solution. The architecture was also designed to avoid redundant external memory accesses when computing the FME. The design was divided into two main modules: integer motion estimation (with the diamond search algorithm) and fractional refinement (half-pixel and quarter-pixel interpolation and search). The designed architecture was described in VHDL and synthesized to an Altera Stratix III FPGA. The architecture reaches 260 MHz when running on the target FPGA. In the worst-case scenario, this operating frequency allows a processing rate of 43 HD 1080p (1,920 × 1,080 pixels) frames per second, surpassing the requirements for real-time processing. In comparison to related works, the developed architecture achieves a good tradeoff among hardware cost, video quality and processing rate.
7.
In this paper, we propose a low-power VLSI implementation of H.264/AVC baseline decoder. A systematic methodology for power
reduction is proposed and applied at various design abstraction levels. At the algorithm level, the computational complexity
is optimized. At the architecture level, pipelining and parallelism are widely adopted to reduce the operating frequency;
hierarchical memory organization optimizes power-hungry memory accesses; hardware sharing reduces the total switching capacitance.
At the circuit level, knowledge of signal statistics is exploited to reduce the number of transitions; data-dependent signal gating
and clock gating are introduced as dynamic techniques for power reduction; multiplications are reduced and optimized,
while complex dividers are eliminated entirely. At the physical level, cell sizing and layout are optimized for power efficiency.
The VLSI implementation shows that with UMC 0.18 μm technology, the proposed design is able to decode QCIF at 30 fps in real time
at 1.5 MHz. The decoder contains 169 k logic gates and 2.5 KB of on-chip SRAM. The total chip area is 4.4 × 4.4 mm² in a CQFP 208 package. The measured power consumption is 973 μW @ 1.8 V and 293 μW @ 1.0 V. The low-power and real-time features
make our design ideal for portable or mobile applications.
8.
This work presents an efficient architecture design for deblocking filter in H.264/AVC using a novel fast-deblocking boundary-strength
(FDBS) technique. Based on the FDBS technique, the proposed architecture divides the deblocking process into three filtering
modes, namely offset-based, standard-based and diagonal-based filtering modes, to reduce the blocking artifact and improve
the video quality in H.264/AVC. The proposed architecture is designed in Verilog HDL, simulated with Quartus II and synthesized
using a 0.18 μm CMOS cell library with the Synopsys Design Compiler. Simulation results demonstrate good performance in PSNR
improvement and bit-rate reduction. Additionally, verification results through physical chip design reveal that the proposed
architecture design can support 1,280 × 720@30 Hz processing throughput while clocking at 100 MHz. Comparisons with other
studies show the excellent properties of the proposed architecture in terms of gate count, memory size and clock-cycle/macroblock.
Corresponding author: Chun-Lung Hsu
9.
In H.264/AVC, the concept of adapting the transform size to the block size of motion-compensated prediction residue has proven
to be an important coding tool. This paper presents a highly parallel joint circuit architecture for 8 × 8 and 4 × 4 adaptive
block-size transforms in H.264/AVC. By decomposing the 8 × 8 transform into basic 4 × 4 transforms, a unified architecture is
designed for both 8 × 8 and 4 × 4 transforms, and the transform data-path can be efficiently reused for six kinds of transforms,
i.e., 8 × 8 forward, 8 × 8 inverse, 4 × 4 forward, 4 × 4 inverse, forward-Hadamard, and inverse-Hadamard transforms. Linear shift
mapping is applied to the memory buffer to support parallel access in both row and column directions, which eliminates the
need for a transpose circuit. For the reusable and configurable transform data-path, a multiple-stage pipeline is designed to
reduce the critical path length and increase throughput. The design is implemented in UMC 0.18 μm technology at 200 MHz
with 13.651 K logic gates and can support a 1,920 × 1,088 30 fps H.264/AVC HDTV decoder.
Corresponding author: Yu Li
10.
Yu-Kun Lin Chia-Chun Lin Tzu-Yun Kuo Tian-Sheuan Chang 《IEEE transactions on circuits and systems. I, Regular papers》2008,55(6):1526-1535
11.
This paper presents a fast H.264 intra frame encoder that processes a single macroblock of 1920 × 1080 video in 334 cycles
on average, which is 20% faster than the previous best design. The speed-up is mainly achieved by early termination of either
4 × 4 intra prediction or 16 × 16 intra prediction. The executions of intra 4 × 4 and 16 × 16 predictions are serialized, and
the second prediction is terminated early by using the cost of the first prediction as the stop criterion. A simple and efficient
algorithm that exploits spatial locality is proposed to select the mode that is processed first. To avoid the bubble cycles
caused by this serialized execution of 4 × 4 and 16 × 16 predictions, the modified processing order presented in (Jung et
al. 2008) is employed for intra 4 × 4 prediction in order to schedule dependent 4 × 4 blocks apart from each other. To further reduce
the execution time of 4 × 4 prediction, neighboring pixels with the same value are grouped, and only one prediction mode in
the group is evaluated. Experimental results show that the PSNR drop is 0.0619 dB and the bitrate increase is 0.842% when
compared with the JM reference software. The additional hardware cost to support the proposed methods is less than eight thousand
gates, which is very small compared with the hardware size of a whole intra frame encoder.
12.
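H.264 is the latest international video compression coding standard, published by ITU-T/ISO in 2003. It greatly improves coding efficiency and image quality, and one important reason is the introduction of a deblocking filter into the coding and decoding loop. This paper introduces the deblocking filter algorithm of the H.264 video coding standard and proposes an implementable hardware architecture for the deblocking filter. By making sensible use of local SRAM resources, the architecture greatly reduces the bus bandwidth requirement and increases the hardware processing speed. Simulation results show that in-loop filtering with this deblocking filter removes blocking artifacts to a large extent and noticeably improves image quality.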
13.
Motion estimation (ME) is the most critical component of a video coding standard. H.264/AVC adopts variable block size motion estimation (VBSME) to obtain excellent coding efficiency, but the high computational complexity makes design difficult. This paper presents an effective processor chip for integer motion estimation (IME) in H.264/AVC based on the full-search block-matching algorithm (FSBMA). It uses an architecture with a configurable 2D systolic array to obtain high data reuse of the search area. This systolic array supports a three-direction scan format in which only one row of pixels is changed between two adjacent subblocks, thus reducing memory accesses and saving clock cycles. A computing array of 64 PEs calculates the SAD of basic 4×4 subblocks, and a modified Lagrangian cost is used as the matching criterion to find the best 41 variable-size blocks by means of a tree pipeline parallel architecture. Finally, a mode decision module uses serial data flow to find the best mode by comparing the total minimum Lagrangian costs. The IME processor chip was designed in UMC 0.18 μm technology, resulting in a circuit with only 32.3 k gates and 6 RAMs (59 kbits of on-chip memory in total). In typical working conditions (25 °C, 1.8 V), a clock frequency of 300 MHz can be estimated, with a processing capacity for HDTV (1920×1088 @ 30 fps) and a search range of 32×32.
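The 41 variable-size blocks of one macroblock are 16 of size 4×4, 8 of 8×4, 8 of 4×8, 4 of 8×8, 2 of 16×8, 2 of 8×16 and 1 of 16×16, and their SADs can all be built by adding up the sixteen basic 4×4 SADs. The sketch below illustrates this adder tree for one candidate position. It is an illustration under stated assumptions, not the chip's tree pipeline: the function name `merge_sads` is hypothetical, block sizes are named width×height, and the Lagrangian motion-vector cost is left out.

```python
def merge_sads(sad4):
    """Build the SADs of all 41 H.264 partitions from sixteen 4x4 SADs.

    sad4[row][col] is the SAD of the 4x4 sub-block at that position in the
    16x16 macroblock. Returns a dict keyed by "WIDTHxHEIGHT".
    """
    out = {"4x4": [sad4[r][c] for r in range(4) for c in range(4)]}
    # Horizontal pairs (8 wide, 4 high) and vertical pairs (4 wide, 8 high).
    out["8x4"] = [sad4[r][c] + sad4[r][c + 1]
                  for r in range(4) for c in (0, 2)]
    out["4x8"] = [sad4[r][c] + sad4[r + 1][c]
                  for r in (0, 2) for c in range(4)]
    # Quadrants, in the order top-left, top-right, bottom-left, bottom-right.
    out["8x8"] = [sad4[r][c] + sad4[r][c + 1] + sad4[r + 1][c] + sad4[r + 1][c + 1]
                  for r in (0, 2) for c in (0, 2)]
    tl, tr, bl, br = out["8x8"]
    out["16x8"] = [tl + tr, bl + br]   # top half, bottom half
    out["8x16"] = [tl + bl, tr + br]   # left half, right half
    out["16x16"] = [tl + tr + bl + br]
    return out
```

Each larger SAD is derived from already computed smaller ones, so the pixel differences are evaluated only once per candidate position.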
14.
This paper deals with the transformation and quantization carried out on each inter-predicted residual
block in a video encoding process, and with their reduced-complexity hardware implementation. H.264/AVC uses a 4 × 4 integer transform,
which is derived from the 4 × 4 DCT. We propose a reduced-complexity algorithm and a pipelined structure for the core forward
integer transform module. A multiplier-less architecture is realized with fewer shifts and adds than existing
works. The corresponding inverse transform is exactly reversible. Each of the transformed coefficients is quantized by a scalar
quantizer. The quantization step size can be varied from macroblock to macroblock. The proposed unified pipelined architecture
outperforms many recent implementations in terms of gate count and is capable of processing a 4 × 4 residual block in 4 clock
cycles.
Corresponding author: Reeba Korah
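To make the multiplier-less idea concrete, the sketch below implements the standard H.264 4 × 4 forward core transform with additions, subtractions and one-bit left shifts only. It is the well-known textbook butterfly, not the reduced-complexity algorithm proposed in the paper, and the function names are hypothetical. Quantization and the associated scaling are omitted.

```python
def core_1d(x):
    """1-D H.264 forward core transform of four values (adds and shifts only)."""
    x0, x1, x2, x3 = x
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return [
        s0 + s1,           # row [1,  1,  1,  1]
        (d0 << 1) + d1,    # row [2,  1, -1, -2]
        s0 - s1,           # row [1, -1, -1,  1]
        d0 - (d1 << 1),    # row [1, -2,  2, -1]
    ]


def forward_core_4x4(block):
    """2-D forward core transform of a 4x4 residual block (list of four rows)."""
    # Transform each row, then each column of the intermediate result.
    rows = [core_1d(row) for row in block]
    cols = [core_1d([rows[r][c] for r in range(4)]) for c in range(4)]
    # cols[c][r] holds coefficient (r, c); transpose back to row-major order.
    return [[cols[c][r] for c in range(4)] for r in range(4)]
```

The butterfly needs eight additions and two shifts per 1-D transform, which is why the transform maps naturally onto adder-only hardware.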
15.
Tzu-Der Chuang Yu-Jen Chen Yi-Hau Chen Shao-Yi Chien Liang-Gee Chen 《Journal of Signal Processing Systems》2010,60(3):363-375
In addition to coding efficiency, the scalable extension of H.264/AVC provides good functionality for video adaptation in
heterogeneous environments. Fine grain scalability (FGS) is a technique to extract video bitstream at the finest quality level
under the given bandwidth. In this paper, an architecture of FGS encoder with low external memory bandwidth and low hardware
cost is proposed. Up to 99% of bandwidth reduction can be attained by the proposed scan bucket algorithm, early context modeling
with context reduction, and first scan pre-encoding. The area-efficient hardware architecture is implemented by layer-wise
hardware reuse. Besides, three design strategies for enhancement layer coder are explored so that the trade-off between external
memory bandwidth and silicon area is allowed. The proposed hardware architecture can real-time encode HDTV 1920×1080 video
with two FGS enhancement layers at 200 MHz working frequency, or HDTV 1280×720 video with three FGS enhancement layers at
130 MHz working frequency.
16.
We implemented the H.264/AVC variable block size motion estimation (VBSME) using a very long instruction word (VLIW)–single
instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method which has a regular structure is chosen
for VBSME not only to remove redundant sum of absolute difference (SAD) operations but also to utilize the instruction level
parallelism (ILP) and data level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce
the number of ‘compare and update’ operations and simplify the rate distortion optimization (RDO). The developed fast mode
decision uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs.
Several advanced software techniques, including software pipelining and packed-data processing, are employed. In particular,
memory access overhead reduction schemes, including multi-block processing and inter-procedural scheduling, are used
for the software optimization. In order to reduce ‘write buffer full’ stalls in the quarter-pixel ME, a 4-bit quantization scheme
is developed, which increases the number of arithmetic operations but greatly decreases the stall cycles. The implemented
variable block size ME for H.264/AVC requires an average of 9 M and 78 M cycles per frame for QCIF and CIF size video sequences,
respectively, on the TMS320C64x DSP architecture.
Corresponding author: Wonyong Sung
17.
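This paper proposes a multi-standard VLSI architecture for macroblock prediction and boundary filtering strength calculation that supports H.264 High Profile 4.1 and AVS JiZhun Profile 6.0. Based on the algorithmic characteristics of the decoder, the architecture implements the control-dominated intra mode prediction, inter motion vector prediction and boundary filtering strength calculation algorithms of the H.264 and AVS standards, and it can be applied to current reconfigurable multimedia systems. The implemented architecture was synthesized in a TSMC 65 nm process and reaches an operating frequency of 312 MHz. Decoding one H.264 macroblock and one AVS macroblock takes at most 351 and 189 clock cycles, respectively, which meets the requirements of real-time high-definition (1080p) processing for H.264 and AVS.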
18.
Mahdiye Hajirahimi Abdolreza Nabavi Ehsanolah Kabir 《Journal of Signal Processing Systems》2012,68(3):391-399
Dilation and erosion are two fundamental operations of mathematical morphology for image processing. This paper presents three hybrid wave-pipeline (HWP) architectures for a real-time binary dilation operator. With minor changes to the number and/or the type of the basic gates, they can be employed as an erosion operator. In the first HWP-architecture, each single cell utilizes the wave technique along with delay units for balancing the data paths. By minimizing the number of delay units, the second HWP-architecture with reduced power consumption and hardware complexity is obtained. The third HWP-architecture employs the wave technique in each three cascaded cells. This architecture improves the above performance further, at the cost of a slight reduction in maximum clock frequency and clock frequency range. Simulation results, using a 0.18 μm CMOS technology, indicate that the HWP architectures have higher speed, less hardware complexity, and lower power consumption compared to the pipeline (P) architecture. Also, they are faster than the wave-pipeline (WP) architecture, without the difficulty of balancing the delay of long signal paths. Simulation illustrates that the third HWP-architecture dilates a 1024 × 1024 image by a 21 × 21 structuring element (SE) in 214.64 μs. The maximum frequency of operation is 5 GHz for a power supply of 1.8 V. The power dissipation is 410 mW, and the chip area is 0.075 mm².
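As a software reference for the operation that these architectures accelerate, the sketch below performs binary dilation of a 0/1 image by a 0/1 structuring element. It is a plain nested-loop illustration, unrelated to the wave-pipeline circuits themselves. The function name `dilate` is hypothetical, the origin of the structuring element is assumed to be its centre, and pixels outside the image are ignored.

```python
def dilate(image, se):
    """Binary dilation of a 0/1 image by a 0/1 structuring element.

    image and se are lists of rows. The origin of se is assumed to be at
    its centre element.
    """
    height, width = len(image), len(image[0])
    se_h, se_w = len(se), len(se[0])
    origin_y, origin_x = se_h // 2, se_w // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            # Stamp the structuring element onto every foreground pixel.
            for j in range(se_h):
                for i in range(se_w):
                    if se[j][i]:
                        yy, xx = y + j - origin_y, x + i - origin_x
                        if 0 <= yy < height and 0 <= xx < width:
                            out[yy][xx] = 1
    return out
```

For example, dilating a single foreground pixel by a 3 × 3 structuring element of ones produces a 3 × 3 square of ones around that pixel.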
19.
Kai Sun Meng Wang Zili Shao Hui Liu Hongxing Wei Tianmiao Wang 《Journal of Signal Processing Systems》2010,59(1):71-83
MPSoC (Multi-Processor System-on-Chip) architecture is increasingly used because it gives designers many more
opportunities to meet specific performance and power goals. In this paper, we propose an MPSoC architecture for implementing
real-time signal processing in a gamma camera. Based on a full analysis of the characteristics of the application, we design
several algorithms to optimize the system in terms of processing speed, power consumption, and area cost. Two types
of DSP core have been designed for the integral algorithm and the coordinate algorithm, the key parts of signal processing
in a gamma camera. An interconnection synthesis algorithm is proposed to reduce the area cost of the Network-on-Chip. We implement
our MPSoC architecture on an FPGA, and synthesize the DSP cores and Network-on-Chip using Synopsys Design Compiler with a UMC 0.18
μm standard cell library. The results show that our technique can effectively accelerate the processing and satisfy the requirements
of real-time signal processing for 256 × 256 image construction.
20.
Dominique Ginhac Jérôme Dubois Barthélémy Heyrman Michel Paindavoine 《Analog Integrated Circuits and Signal Processing》2010,65(3):389-398
A high speed analog VLSI image acquisition and low-level image processing system is presented. The architecture of the chip
is based on a dynamically reconfigurable SIMD processor array. The chip features a massively parallel architecture enabling
the computation of programmable mask-based image processing in each pixel. Each pixel includes a photodiode, an amplifier,
two storage capacitors, and an analog arithmetic unit based on a four-quadrant multiplier architecture. A 64 × 64 pixel proof-of-concept
chip was fabricated in a 0.35 μm standard CMOS process, with a pixel size of 35 μm × 35 μm. The chip can capture raw images
up to 10,000 fps and runs low-level image processing at a framerate of 2,000–5,000 fps.