
1.

Scalable Unified Transform Architecture for Advanced Video Coding Embedded Systems

Tiago Dias Sebastián López Nuno Roma Leonel Sousa 《International journal of parallel programming》2013,41(2):236-260

A novel high-throughput and scalable unified architecture for the computation of the transform operations in video codecs for advanced standards is presented in this paper. This structure can be used as a hardware accelerator in modern embedded systems to efficiently compute all the two-dimensional 4 × 4 and 2 × 2 transforms of the H.264/AVC standard. Moreover, its highly flexible design and hardware efficiency allow it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-5 FPGA demonstrated the superior performance and hardware efficiency levels provided by the proposed structure, which presents a throughput per unit of area relatively higher than other similar recently published designs targeting the H.264/AVC standard. Such results also showed that, when integrated in a multi-core embedded system, this architecture provides speedup factors of about 120× over pure software implementations of the transform algorithms, therefore allowing the computation, in real time, of all the above-mentioned transforms for Ultra High Definition Video (UHDV) sequences (4,320 × 7,680 @ 30 fps).
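For readers unfamiliar with the transforms named in this abstract, the following is a minimal software sketch of the H.264/AVC 4 × 4 forward core transform and the 2 × 2 Hadamard transform used for chroma DC coefficients. It is not the paper's hardware architecture: the function names are hypothetical, the code only illustrates the add-and-shift butterfly structure commonly used for these transforms, and the scaling and quantization stages are omitted.

```python
# Sketch of the H.264/AVC 4x4 forward core transform, Y = Cf * X * Cf^T, with
# Cf = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]].
# Only additions, subtractions and shifts are used (no multipliers).
# Function names are illustrative assumptions, not taken from the paper.

def transform_1d(a, b, c, d):
    """One 4-point butterfly stage, equivalent to multiplying by Cf."""
    s0, s1 = a + d, b + c      # sums of the outer and inner pairs
    d0, d1 = a - d, b - c      # differences of the outer and inner pairs
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def core_transform_4x4(block):
    """Apply the 1-D stage to every row, then to every column."""
    rows = [transform_1d(*row) for row in block]
    cols = [transform_1d(*col) for col in zip(*rows)]
    # cols[j] holds output column j; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


def hadamard_2x2(block):
    """2x2 Hadamard transform, H * X * H with H = [[1, 1], [1, -1]]."""
    (a, b), (c, d) = block
    return [[a + b + c + d, a - b + c - d],
            [a + b - c - d, a - b - c + d]]
```

A flat block of ones concentrates all of its energy in the DC coefficient, which is a quick sanity check for `core_transform_4x4`.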

2.

A reduced memory bandwidth and high throughput HDTV motion compensation decoder for H.264/AVC High 4:2:2 profile

Bruno Zatt Leandro M. de L. Silva Arnaldo Azevedo Luciano Agostini Altamiro Susin Sergio Bampi 《Journal of Real-Time Image Processing》2013,8(1):127-140

This article presents HP422-MoCHA, an optimized Motion Compensation hardware architecture for the High 4:2:2 profile of the H.264/AVC video coding standard. The proposed design focuses on real-time decoding for HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-widths (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide an enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter-sample accuracy, bi-prediction and weighted prediction. HP422-MoCHA also includes a hardwired Motion Vector Predictor, supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D Cache reduces frame memory accesses, providing, on average, reductions of 62% in bandwidth and 80% in clock cycles. The design was implemented in a Xilinx Virtex-II PRO FPGA, and also in an ASIC with a TSMC 0.18 μm standard cells technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area. It presents a power consumption of 130 mW. Both implementations reach a maximum operation frequency of ~100 MHz, being able to motion compensate 37 bi-predictive or 69 predictive frames per second. The minimum required frequency to ensure real-time decoding for HD1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first Motion Compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA was used for comparison purposes, showing the highest throughput among all presented works. However, the HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile.

3.

A low energy intra prediction hardware for high efficiency video coding

Ercan Kalali Yusuf Adibelli Ilker Hamzaoglu 《Journal of Real-Time Image Processing》2018,15(2):221-234

The intra prediction algorithm in the recently developed high efficiency video coding (HEVC) standard has very high computational complexity. Therefore, in this paper, we propose pixel equality and pixel similarity based techniques for reducing the amount of computation performed by the HEVC intra prediction algorithm and, therefore, reducing the energy consumption of HEVC intra prediction hardware. The proposed techniques significantly reduce the amount of computation performed by 4 × 4 and 8 × 8 luminance angular prediction modes with a small comparison overhead. The pixel equality based technique does not affect the PSNR and bit rate. The pixel similarity based technique increases the PSNR slightly for some video frames and decreases it slightly for others. We also designed and implemented a low energy HEVC intra prediction hardware for 4 × 4 and 8 × 8 angular prediction modes including the proposed techniques using Verilog HDL. The proposed techniques significantly reduce the energy consumption of this HEVC intra prediction hardware.

4.

3D high definition video coding on a GPU-based heterogeneous system

Rafael Rodríguez-Sánchez José Luis Martínez Jan De Cock Gerardo Fernández-Escribano Bart Pieters José L. Sánchez José M. Claver Rik Van de Walle 《Computers & Electrical Engineering》2013

5.

A novel hardware/software partitioning for SIMD-based real-time AVS video decoder

Liwei Chen Ming Cong Jing Huang Ling Li Hongwei Liu Cheng Qian 《Multimedia Tools and Applications》2014,71(3):1651-1671

Decoding high-quality videos in real time is becoming more and more difficult with increasing resolution. In this paper, a novel hardware/software (HW/SW) partitioning is proposed with powerful SIMD (single instruction multiple data) instructions for the real-time AVS video decoder. Since most key functions that need large amounts of computation are optimized by SIMD instead of hardware, the distribution of workload between hardware and software can be balanced, and the performance of the video decoder is improved. Besides, the generality and programmability are also maintained. The proposed method is implemented on a 32-bit dual-issue RISC processor with a 256-bit vector extension. The experimental results on AVS conformance test sequences show that the video decoder system can support the real-time decoding of AVS 1080p videos at 30 fps, and improves performance by more than 100 times compared to the original processor without the proposed method. Moreover, this approach could be easily applied to other video decoders, such as H.264 and VC-1.

6.

Architecture design of the high-throughput compensator and interpolator for the H.265/HEVC encoder

Grzegorz Pastuszak Maciej Trochimiuk 《Journal of Real-Time Image Processing》2016,11(4):663-673

This paper presents the architecture of the high-throughput compensator and the interpolator used in the motion estimation of the H.265/HEVC encoder. The architecture can process 8×8 blocks in each clock cycle. The design allows the random order of checked coding blocks and motion vectors. This feature makes the architecture suitable for different search algorithms. The interpolator embeds 64 multiplierless reconfigurable filter cores to support computations for different fractional-pel positions. Synthesis results show that the design can operate at 200 and 400 MHz when implemented in FPGA Arria II and TSMC 90 nm, respectively. The computational scalability enables the proposed architecture to trade the throughput for the compression efficiency. If 2160p@30fps video is encoded, the design clocked at 400 MHz can check about 100 motion vectors for 8×8 blocks.
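The figure of about 100 motion vectors per 8×8 block can be reproduced with simple arithmetic. The sketch below assumes that "2160p" means 3840 × 2160 luma samples and that one 8×8 candidate block is checked per clock cycle; both are assumptions read into the abstract rather than statements taken from the paper, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the motion-vector budget quoted in the abstract.
# Assumptions (not from the paper): 2160p = 3840 x 2160, one 8x8 candidate
# block is processed per clock cycle.

def cycles_per_block(width, height, fps, block, clock_hz):
    """Clock cycles available for each block when encoding in real time."""
    blocks_per_second = (width // block) * (height // block) * fps
    return clock_hz / blocks_per_second


budget = cycles_per_block(3840, 2160, 30, 8, 400_000_000)
print(f"{budget:.1f} cycles (candidate motion vectors) per 8x8 block")
```

Under these assumptions the budget comes out slightly above 100 cycles per block, which is consistent with the abstract's claim.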

7.

Implementation of a cost-shared transform architecture for multiple video codecs

Muhammad Martuza Khan A. Wahid 《Journal of Real-Time Image Processing》2015,10(1):151-162

The paper presents a cost-shared architecture to compute the integer discrete cosine transforms (Int-DCT) of four video codecs: AVS, H.264/AVC, VC-1 and HEVC (under development). Based on the symmetric structure of the matrices and the similarity in matrix operation, we develop a generalized “decompose and share” algorithm to compute both 4 × 4 and 8 × 8 Int-DCT. The algorithm is later applied to the video codecs. The hardware-sharing approach ensures maximum circuit reuse during the computation. The architecture is designed with only adders and shifters to reduce the hardware cost significantly. The design is implemented on FPGA and later synthesized in CMOS 0.18 μm technology.

8.

FPGA-based architecture for real time segmentation and denoising of HD video

M. Genovese E. Napoli 《Journal of Real-Time Image Processing》2013,8(4):389-401

The identification of moving objects is a basic step in computer vision. The identification begins with the segmentation and is followed by a denoising phase. This paper proposes the FPGA hardware implementation of a segmentation and denoising unit. The segmentation is conducted using the Gaussian mixture model (GMM), a probabilistic method for the segmentation of the background. The denoising is conducted by implementing the morphological operators of erosion, dilation, opening and closing. The proposed circuit is optimized to perform real-time processing of HD video sequences (1,920 × 1,080 @ 20 fps) when implemented on FPGA devices. The circuit uses an optimized fixed-width representation of the data and implements high performance arithmetic circuits. The circuit is implemented on Xilinx and Altera FPGAs. Implemented on an xc5vlx50 Virtex5 FPGA, it can process 24 fps of an HD video using 1,179 Slice LUTs and 291 Slice Registers; the dynamic power dissipation is 0.46 mW/MHz. Implemented on an EP2S15F484C3 StratixII, it provides a maximum working frequency of 44.03 MHz employing 5,038 Logic Elements and 7,957 flip-flops with a dynamic power dissipation of 4.03 mW/MHz.
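The four morphological operators named in this abstract can be illustrated in a few lines of software. The sketch below works on a binary foreground mask with a 3 × 3 structuring element and treats pixels outside the frame as background; the structuring element, the border handling and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
# Minimal sketch of binary erosion, dilation, opening and closing.
# Assumptions (not from the paper): 3x3 structuring element, pixels outside
# the frame are treated as background (0).

def _neighbourhood(mask, y, x):
    """Yield the 3x3 neighbourhood of (y, x), padding with 0 at the borders."""
    h, w = len(mask), len(mask[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            yield mask[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0


def erode(mask):
    """A pixel survives only if its whole neighbourhood is foreground."""
    return [[int(all(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def dilate(mask):
    """A pixel is set if any pixel in its neighbourhood is foreground."""
    return [[int(any(_neighbourhood(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def opening(mask):
    """Erosion followed by dilation: removes small isolated foreground specks."""
    return dilate(erode(mask))


def closing(mask):
    """Dilation followed by erosion: fills small holes in the foreground."""
    return erode(dilate(mask))
```

Opening is the operator that suppresses isolated false-positive pixels left by a segmentation stage, while closing fills small gaps inside detected objects.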

9.

Scalable hardware architecture for disparity map computation and object location in real-time

Pedro Miguel Santos João Canas Ferreira José Silva Matos 《Journal of Real-Time Image Processing》2016,11(3):473-485

We present the disparity map computation core of a hardware system for isolating foreground objects in stereoscopic video streams. The operation is based on the computation of dense disparity maps using block-matching algorithms and two well-known metrics: sum of absolute differences and Census transform. Two sets of disparity maps are computed by taking each of the images as reference so that a consistency check can be performed to identify occluded pixels and eliminate spurious foreground pixels. Taking advantage of parallelism, the proposed architecture is highly scalable and provides numerous degrees of adjustment to different application needs, performance levels and resource usage. A version of the system for 640 × 480 images and a maximum disparity of 135 pixels was implemented in a system based on a Xilinx Virtex II-Pro FPGA and two cameras with a frame rate of 25 fps (less than the maximum supported frame rate of 40 fps on this platform). Implementation of the same system on a Virtex-5 FPGA is estimated to achieve 80 fps, while a version with increased parallelism is estimated to run at 140 fps (which corresponds to the calculation of more than 5.9 × 10⁹ disparity-pixels per second).
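As a rough illustration of the Census metric mentioned in this abstract, the sketch below computes a 3 × 3 Census code for a pixel and uses the Hamming distance between two codes as the matching cost for a candidate disparity. The window size, the bit ordering, the comparison direction and the function names are assumptions for illustration; the abstract does not give these details.

```python
# Minimal sketch of Census-based matching cost for stereo disparity.
# Assumptions (not from the paper): 3x3 window, bit set when neighbour < centre,
# raster-scan bit order, interior pixels only (no border handling).

def census_3x3(img, y, x):
    """Return the 8-bit Census code of the interior pixel (y, x)."""
    centre = img[y][x]
    code = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # the centre pixel is not compared with itself
            code = (code << 1) | int(img[y + dy][x + dx] < centre)
    return code


def hamming(a, b):
    """Number of differing bits between two Census codes."""
    return bin(a ^ b).count("1")


def census_cost(left, right, y, x, d):
    """Matching cost of left pixel (y, x) against right pixel (y, x - d)."""
    return hamming(census_3x3(left, y, x), census_3x3(right, y, x - d))
```

Because the code depends only on the ordering of intensities inside the window, the cost is insensitive to brightness offsets between the two cameras, which is one reason the metric is popular in hardware designs.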

10.

Efficient hardware implementation of 8 × 8 integer cosine transforms for multiple video codecs

Khan A. Wahid Muhammad Martuza Mousumi Das Carl McCrosky 《Journal of Real-Time Image Processing》2013,8(4):403-410

The current trend of digital convergence creates the need for a video decoder that supports multiple video standards such as H.264/AVC, JPEG, MPEG-2/4, VC-1, and AVS on a single platform. In this paper, we present a cost-sharing architecture of multiple transforms to support all five popular video codecs. The architecture is based on a new multi-dimensional delta mapping. Here the inverse transform matrix of the Discrete Cosine Transform (DCT) of AVS, which has the lowest computational unit, is taken as the base to compute the inverse DCT matrices of the other four codecs. The proposed architecture uses only adders and shifters on a shared basis to reduce the hardware cost significantly. The shared architecture is implemented on FPGA and later synthesized in CMOS 0.18 μm technology. The results show that the proposed design satisfies the requirement of all five codecs with a maximum decoding capability of 60 fps of a full HD video. The scheme is also suitable for low-cost implementation in modern multi-codec systems.

11.

A real-time global stereo-matching on FPGA

《Microprocessors and Microsystems》2016

An improved global stereo matching algorithm is implemented on a single FPGA for real-time applications. Stereo matching is widely used in stereo vision systems, e.g. object detection and autonomous vehicles. Global algorithms produce much more accurate results than local algorithms, but they are generally not implemented on FPGAs because they rely on high-end hardware resources. In this implementation the stereo pairs are divided into blocks, and the hardware resources are reduced by processing one block at a time. The hardware implementation is based on a Xilinx Kintex 7 FPGA. Experimental results show that the proposed implementation produces accurate results for the Middlebury benchmarks and achieves 30 frames per second (fps) @1920 × 1680.

12.

Adaptive mode decision for multiview video coding based on macroblock position constraint model

Yue Li Gaobo Yang Yapei Zhu Can Liu Kai Liu 《Journal of Real-Time Image Processing》2016,12(3):575-582

Multiview video coding (MVC) exploits mode decision, motion estimation and disparity estimation to achieve a high compression ratio, which results in extensive computational complexity. This paper presents an efficient mode decision approach for MVC using a macroblock (MB) position constraint model (MPCM). The proposed approach reduces the number of candidate modes by utilizing the mode correlation and rate distortion cost (RD cost) in the previously encoded frames/views. Specifically, the mode correlations both in the temporal-spatial domain and between views are modeled with MPCM. Then, MPCM is exploited to select the optimal prediction direction for the MB currently being encoded. Finally, the inter mode is determined early in the optimal prediction direction. Experimental results show that the proposed method can save 86.03 % of encoding time compared with the exhaustive mode decision used in the reference software of joint multiview video coding, with only 0.077 dB loss in Bjontegaard delta peak signal-to-noise ratio (BDPSNR) and a 2.29 % increase in the total Bjontegaard delta bit rate (BDBR), which is superior to the performance of state-of-the-art approaches.

13.

High performance architecture for real-time HDTV broadcasting

Yasser Ismail Wael El-Medany Hessa Al-Junaid Ahmed Abdelgawad 《Journal of Real-Time Image Processing》2016,11(4):633-644

A novel full search motion estimation co-processor architecture design is presented in this paper. The proposed architecture efficiently reuses search area data to minimize memory I/O while fully utilizing the hardware resources. A smart processing element (PE) and an efficient simple internal memory are the main components of the proposed co-processor. An efficient algorithm is used for loading both the current block and the search area inside the PE array. The search area data flow horizontally while the current block data are stationary. As a result, the speed of the co-processor is improved in terms of the throughput and the operating frequency compared to the state-of-the-art techniques. A smart local memory and PE design guarantees a simple and regular data flow. The design of the local memory is implemented using only registers and a simple counter. This simplifies the design by avoiding the use of complicated addressing to write to or read from the local memory. The proposed architecture is implemented using both the FPGA and the ASIC flow design tools. For a search range of 32 × 32 and block size of 16 × 16, the architecture can perform motion estimation for 30 fps of HDTV video at 350 MHz and easily outperforms many fast full search architectures.
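For context, full search motion estimation evaluates every candidate displacement inside the search range and keeps the one with the lowest matching cost. The sketch below uses the sum of absolute differences (SAD) as that cost, which is a common choice but an assumption here, since the abstract does not name the metric. It is a plain software reference with hypothetical function names and does not model the paper's PE array or data flow.

```python
# Minimal software sketch of full search block matching.
# Assumptions (not from the paper): SAD matching cost, frames stored as lists
# of rows, candidates that fall outside the reference frame are skipped.

def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, by, bx, n, search_range):
    """Return (dy, dx, cost) of the best match for the block at (by, bx)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = by + dy, bx + dx
            if ry < 0 or rx < 0 or ry + n > h or rx + n > w:
                continue  # candidate block lies outside the reference frame
            cost = sad(cur, ref, by, bx, ry, rx, n)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best
```

The doubly nested loop over `dy` and `dx` is what makes full search expensive: a ±R range requires (2R + 1)² SAD evaluations per block, which is why dedicated hardware with heavy data reuse is attractive.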

14.

Buffer structure optimized VLSI architecture for efficient hierarchical integer pixel motion estimation implementation

Haibing Yin Dong Sun Park Xiao Yun Zhang 《Journal of Real-Time Image Processing》2016,11(3):507-525

Integer pixel motion estimation (IME) is a crucial, high-complexity module in high-definition video encoders. Efficient joint design of algorithm and architecture must trade off multiple target parameters, including throughput capacity, logic gate count, on-chip SRAM size, memory bandwidth, and rate distortion performance. Data organization and on-chip buffer structure are crucial factors in IME architecture design, as they determine the tradeoff among these performance targets. In this work, we combine global hierarchical search and local full search to propose a hardware-efficient IME algorithm, and then propose a hardware VLSI architecture with an optimized on-chip buffer structure. The major contributions of this work are: (1) an improved hierarchical IME algorithm with presearch and deliberate data organization, (2) a multistage on-chip reference pixel buffer structure with high data reuse between integer and fractional pixel motion estimation, and (3) a highly reused and reconfigurable processing element structure. The optimized data organization and buffer structure achieve nearly 70 % buffer saving, with a PSNR degradation of less than 0.08 dB on average and 0.12 dB in the worst case, compared with a full search based architecture. At a hardware cost of 336 and 382 K logic gates and 20 kB of SRAM, the proposed architecture achieves a throughput of 384 and 272 cycles per macroblock, at system frequencies of 95 and 264 MHz, for 1080p and QFHD @30fps video coding, respectively.

15.

An efficient low-cost FPGA implementation of a configurable motion estimation for H.264 video coding

Wajdi Elhamzi Julien Dubois Johel Miteran Mohamed Atri 《Journal of Real-Time Image Processing》2014,9(1):19-30

Despite the diversity of video compression standards, motion estimation remains a key process that is used in most of them. Moreover, the required coding performance (bit-rate, PSNR, image spatial resolution, etc.) depends on the application, the environment and the network communication. The motion estimation can therefore be adapted to fit these requirements. Meanwhile, real-time encoding is required in many applications. To reach this goal, we propose in this paper a flexible hardware implementation of the motion estimator which enables the integer motion search algorithms to be modified and the fractional search as well as the variable block size to be selected and adjusted. Hence, this novel architecture, especially designed for FPGA targets, provides high-speed processing for a configuration which supports variable block sizes and quarter-pel refinement, as described in H.264. The proposed low-cost architecture based on a Virtex 6 FPGA can process integer motion estimation on 1080 HD video streams at 13 fps using a full search strategy (108k Macroblocks/s) and at up to 223 fps using diamond search (1.8M Macroblocks/s). Moreover, subpel refinement in quarter-pel mode is performed at 232k Macroblocks/s.
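The frame rates and macroblock rates quoted in this abstract can be related with a short calculation. The sketch below assumes that "1080 HD" means 1920 × 1080 luma samples coded as 16 × 16 macroblocks, with the height padded up to a whole number of macroblocks; these are assumptions, not details given in the abstract, and the function name is hypothetical.

```python
# Rough conversion between frame rate and macroblock rate.
# Assumptions (not from the paper): 1920 x 1080 frames, 16x16 macroblocks,
# frame dimensions rounded up to a whole number of macroblocks.

MB_SIZE = 16


def macroblocks_per_second(width, height, fps):
    """Macroblocks that must be processed per second at the given frame rate."""
    mbs_wide = (width + MB_SIZE - 1) // MB_SIZE    # ceiling division
    mbs_high = (height + MB_SIZE - 1) // MB_SIZE
    return mbs_wide * mbs_high * fps


for label, fps in (("full search", 13), ("diamond search", 223)):
    print(label, macroblocks_per_second(1920, 1080, fps))
```

Under these assumptions a frame holds 8,160 macroblocks, so 13 fps and 223 fps correspond to roughly 106k and 1.82M macroblocks per second, in line with the rounded figures in the abstract.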

16.

Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms

Svetislav Momcilovic Nuno Roma Leonel Sousa 《Journal of Real-Time Image Processing》2016,11(3):571-587

17.

Stereo image watermarking scheme for authentication with self-recovery capability using inter-view reference sharing

Ting Luo Gangyi Jiang Xiaodong Wang Mei Yu Feng Shao Zongju Peng 《Multimedia Tools and Applications》2014,73(3):1077-1102

Advances in three-dimensional video are a strong stimulus for research into the authentication of stereo images to prevent illegal modification. In this paper, a stereo image watermarking scheme is proposed for authentication with self-recovery capability using inter-view reference sharing. A mechanism of inter-view reference sharing in stereo image pairs is designed to reduce the bits needed for recovery reference generation compared with independent references. Discrete wavelet transform coefficients are employed to generate the references, and two reference copies of each block, embedded in two different mapping blocks, are prepared for tamper recovery. Moreover, detail information from high frequency coefficients is also embedded so as to improve the quality of tamper recovery. For the purpose of resisting collage attacks and increasing the probability of tamper detection, disparities between pairs of matched blocks are checked to conduct tamper detection. Experimental results show that the proposed scheme can detect tampered blocks with probabilities of more than 99 % after a collage attack. When stereo images are cropped from 10 to 70 % with random tampering, they are recovered without losing the main visual information, and the quality of recovery is better than that of existing monocular image watermarking schemes extended to stereo images.

18.

FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi Shrinivas Pundlik Gang Luo 《Journal of Real-Time Image Processing》2016,11(4):751-767

Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphics processing units (GPU), field programmable gate arrays (FPGA), and digital signal processors (DSP). Previous approaches addressed the problem using accelerators together with a general purpose processor, such as Acorn RISC Machines (ARM). In this work, we present a co-processing architecture using FPGA and DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) for VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over ARM + DSP and DSP only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolutions.

19.

Adaptive Census Transform: A novel hardware-oriented stereovision algorithm

Stefania Perri Pasquale Corsonello Giuseppe Cocorullo 《Computer Vision and Image Understanding》2013,117(1):29-41

This paper presents a new hardware-oriented approach for the extraction of disparity maps from stereo images. The proposed method is based on the herein named Adaptive Census Transform that exploits adaptive support weights during the image transformation; the adaptively weighted sum of SADs is then used as the dissimilarity metric. Quality tests show that the proposed method reaches significantly better accuracy than alternative hardware-oriented approaches. To demonstrate the practical hardware feasibility, a specific architecture has been designed and its implementation has been carried out using a single FPGA chip. Such a VLSI implementation allows a frame rate up to 68 fps to be reached for 640 × 480 stereo images, using just 80,000 slices and 32 RAM blocks of a Virtex6 chip.

20.

An effective prediction method for integer DCT coefficients in H.264

Lei Yafeng Gao Peng Wu Feng 《Computer Engineering and Applications》2008,44(23):71-74

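A new and effective method for reducing the computational cost of the DCT transform and quantization in H.264 is proposed. Through theoretical analysis, the dynamic distribution of the coefficients of three transforms (Normal4×4, LumaDC4×4 and ChromaDC2×2) is studied, and three sufficient conditions for DCT coefficients to be quantized to zero are derived, one for each of the three types of transform and quantization. Compared with other methods in the literature, the proposed method is more effective and more accurate. Theoretical analysis and experimental results show that it outperforms other methods in terms of computational complexity reduction, encoded video quality, false acceptance rate, and false rejection rate.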