Similar Documents
20 similar documents found.
1.
Integer motion estimation (IME), a key component of a video encoder, removes temporal redundancy by searching for the best integer motion vectors for the dynamically partitioned blocks of a macro-block (MB). Huge memory-bandwidth requirements and heavy computational demands are the two key bottlenecks in IME engine design, especially for large search windows (SW). In this paper, a three-level pipelined VLSI architecture is proposed that efficiently integrates reference data sharing search (RDSS) into a multi-resolution motion estimation algorithm (MMEA). First, a hardware-friendly MMEA is mapped onto a three-level pipelined architecture with negligible coding-quality loss. Second, sub-sampled RDSS coupled with Level C+ data reuse is adopted to reduce on-chip memory and bandwidth at the coarsest and middle levels. Data sharing between IME and fractional motion estimation (FME) is achieved by loading only a local predictive SW at the finest level. Finally, the three levels are parallelized and pipelined to guarantee the gradual refinement of MMEA and high hardware utilization. Experimental results show that the proposed architecture reaches a good balance among complexity, on-chip memory, bandwidth, and data-flow regularity. Only 320 processing elements (PE) within 550 cycles are required for the IME search with a 256 × 256 SW. The architecture achieves 1080p@30 fps real-time processing at a working frequency of 134.6 MHz, with 135 K gates and 8.93 KB of on-chip memory.
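For intuition, here is a minimal Python sketch of the coarse-to-fine idea behind a multi-resolution search; the three downsampling factors, the SAD metric, and the search radii are illustrative assumptions, not the paper's exact MMEA data flow.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur, ref, bx, by, bs, cx, cy, radius):
    """Exhaustive search in a (2*radius+1)^2 window centred on (cx, cy);
    returns the displacement of the best match relative to (bx, by)."""
    block = cur[by:by + bs, bx:bx + bs]
    h, w = ref.shape
    best_cost, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = cx + dx, cy + dy
            if 0 <= x <= w - bs and 0 <= y <= h - bs:
                cost = sad(block, ref[y:y + bs, x:x + bs])
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (x - bx, y - by)
    return best_mv

def multiresolution_me(cur, ref, bx, by, bs=16):
    """Three-level coarse-to-fine search: estimate on 1/4- and 1/2-resolution
    frames, then refine the up-scaled vector at full resolution."""
    mv = (0, 0)
    for level in (4, 2, 1):                       # coarsest -> finest
        c, r = cur[::level, ::level], ref[::level, ::level]
        lbx, lby = bx // level, by // level
        mv = full_search(c, r, lbx, lby, max(bs // level, 4),
                         lbx + mv[0], lby + mv[1], radius=3)
        if level > 1:
            mv = (mv[0] * 2, mv[1] * 2)           # scale up for the next level
    return mv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.integers(0, 256, (128, 128), dtype=np.uint8)
    cur = np.roll(ref, shift=(4, -8), axis=(0, 1))    # synthetic global motion
    print(multiresolution_me(cur, ref, bx=48, by=48))  # expect roughly (8, -4)
```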

2.
This article presents HP422-MoCHA, an optimized motion compensation hardware architecture for the High 4:2:2 profile of the H.264/AVC video coding standard. The proposed design targets real-time decoding of HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-widths (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide an enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter-sample accuracy, bi-prediction, and weighted prediction. HP422-MoCHA also includes a hardwired motion vector predictor supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D cache reduces frame-memory accesses, providing, on average, 62% bandwidth and 80% clock-cycle reduction. The design was implemented in a Xilinx Virtex-II Pro FPGA and also as an ASIC in TSMC 0.18 μm standard-cell technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area and consumes 130 mW. Both implementations reach a maximum operating frequency of about 100 MHz and can motion-compensate 37 bi-predictive or 69 predictive frames per second. The minimum frequency required to ensure real-time decoding of HD 1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first motion compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA, which shows the highest throughput among all presented works, was used for comparison. The HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile.
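The interpolator described above implements the standard's sub-sample filters; below is a small Python sketch of the well-known H.264/AVC 6-tap luma half-sample filter and quarter-sample averaging (1-D case only, 8-bit samples assumed; bi-prediction, weighted prediction, and chroma interpolation are not shown).

```python
def clip8(x):
    """Clip to the 8-bit sample range assumed in this example."""
    return max(0, min(255, x))

def halfpel(row, x):
    """H.264/AVC 6-tap half-sample interpolation at position x + 1/2 of a
    1-D row of full samples; filter coefficients are (1, -5, 20, 20, -5, 1)."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[x + k - 2] for k, t in enumerate(taps))
    return clip8((acc + 16) >> 5)

def quarterpel(row, x):
    """Quarter-sample value as the rounded average of the nearest full and
    half samples, as done for luma quarter-pel positions."""
    return (row[x] + halfpel(row, x) + 1) >> 1

if __name__ == "__main__":
    row = [10, 20, 40, 80, 120, 160, 200, 220]
    print(halfpel(row, 3), quarterpel(row, 3))   # half- and quarter-pel samples
```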

3.
A novel full-search motion estimation co-processor architecture is presented in this paper. The proposed architecture efficiently reuses search-area data to minimize memory I/O while fully utilizing the hardware resources. A smart processing element (PE) and an efficient, simple internal memory are the main components of the proposed co-processor. An efficient algorithm is used for loading both the current block and the search area into the PE array. The search-area data flow horizontally while the current-block data are stationary. As a result, the speed of the co-processor is improved in terms of throughput and operating frequency compared to state-of-the-art techniques. The smart local memory and PE design guarantees a simple and regular data flow. The local memory is implemented using only registers and a simple counter, which simplifies the design by avoiding complicated addressing for writing to or reading from the local memory. The proposed architecture is implemented using both FPGA and ASIC design flows. For a search range of 32 × 32 and a block size of 16 × 16, the architecture can perform motion estimation for 30 fps HDTV video at 350 MHz and easily outperforms many fast full-search architectures.
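The bandwidth benefit of keeping the search area on chip can be illustrated with simple arithmetic; the block and search-range sizes below match the abstract, but the read-count model is an assumption, not the paper's measured figures.

```python
def me_bandwidth(block=16, search=32):
    """Count external-memory pixel reads for one macroblock of full-search ME.

    naive  : re-fetch a block x block reference patch for every candidate.
    reused : fetch the whole search window once and keep it on chip, which is
             the kind of search-area reuse the co-processor relies on.
    """
    candidates = search * search              # e.g. 32 x 32 candidate positions
    window = (search + block - 1) ** 2        # pixels covering all candidates
    naive = candidates * block * block
    return naive, window, naive / window

if __name__ == "__main__":
    naive, reused, gain = me_bandwidth()
    print(f"naive reads : {naive} pixels")
    print(f"window reads: {reused} pixels")
    print(f"reduction   : {gain:.1f}x")
```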

4.
Conventional connected component analysis (CCA) algorithms perform poorly in real-time embedded applications because they need multiple passes to resolve label equivalences. As this fundamental task becomes crucial for stream processing, single-pass algorithms were introduced to enable stream-oriented hardware designs. However, most single-pass CCA algorithms in the literature fall short of the maximum streaming throughput because additional time, such as the horizontal blanking period, is required to resolve label equivalences. This paper proposes a novel single-pass CCA algorithm that uses a combination of linked-list and run-length-based techniques to label pixels, resolve equivalences, and extract object features in a single raster scan. The proposed algorithm includes a label-recycling scheme that keeps the memory requirement low. Experimental results show that the implementation of the proposed CCA achieves a throughput of one cycle per pixel and surpasses the most memory-efficient state-of-the-art work with up to a 25% reduction in memory usage for 7680 × 4320-pixel images.
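A simplified software model of single-pass, run-based labelling with on-the-fly feature extraction is sketched below (8-connectivity, union-find for equivalences); the paper's linked-list bookkeeping and label-recycling scheme are not reproduced.

```python
def single_pass_cca(image):
    """Single raster-scan connected components on a binary image, merging runs
    of adjacent rows (8-connectivity) and accumulating per-label features
    (area and bounding box) as pixels stream in."""
    parent = {}                   # union-find over provisional labels
    feats = {}                    # label -> [area, min_x, min_y, max_x, max_y]

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return ra
        parent[rb] = ra
        fa, fb = feats[ra], feats.pop(rb)
        feats[ra] = [fa[0] + fb[0], min(fa[1], fb[1]), min(fa[2], fb[2]),
                     max(fa[3], fb[3]), max(fa[4], fb[4])]
        return ra

    next_label = 0
    prev_runs = []                # (start, end, label) runs of the previous row
    for y, row in enumerate(image):
        runs, x, w = [], 0, len(row)
        while x < w:
            if row[x]:
                start = x
                while x < w and row[x]:
                    x += 1
                end = x - 1
                label = None      # merge with every overlapping previous run
                for ps, pe, pl in prev_runs:
                    if ps <= end + 1 and pe >= start - 1:
                        pl = find(pl)
                        label = pl if label is None else union(label, pl)
                if label is None:                 # new provisional label
                    label = next_label
                    next_label += 1
                    parent[label] = label
                    feats[label] = [0, start, y, end, y]
                f = feats[find(label)]            # update features on the fly
                f[0] += end - start + 1
                f[1], f[3] = min(f[1], start), max(f[3], end)
                f[2], f[4] = min(f[2], y), max(f[4], y)
                runs.append((start, end, find(label)))
            else:
                x += 1
        prev_runs = runs
    return {find(l): feats[find(l)] for l in parent}

if __name__ == "__main__":
    img = [[0, 1, 1, 0, 0, 1],
           [1, 1, 0, 0, 1, 1],
           [0, 0, 0, 0, 0, 0],
           [1, 0, 0, 1, 1, 0]]
    for label, (area, x0, y0, x1, y1) in single_pass_cca(img).items():
        print(f"component {label}: area={area}, bbox=({x0},{y0})-({x1},{y1})")
```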

5.
This paper presents a detailed architecture and a reconfigurable-logic-based hardware design of the SCAN algorithm. The architecture can be used to encrypt high-resolution images in real time. Although the SCAN algorithm is a block cipher with arbitrarily large blocks, the present design uses 64 × 64-pixel blocks in order to provide real-time image-encryption throughput. The architecture was initially targeted at the Xilinx XCV-1000 FPGA, for which design and performance results are presented in the paper.

6.
Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video/imaging. To tackle it, we propose new block and row access modes for a parallel on-chip memory subsystem that enable higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit the spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and to merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a form convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word and a previously proposed shifted scheme for arranging pixels in the banks. We analytically and experimentally demonstrate the advantages of this work on a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time while clocked at 600 MHz. Compared to implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon-area and off-chip memory-bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., energy consumption, area, and off-chip bandwidth. This higher efficiency results from the new access modes, which reduce the number of on-chip memory accesses by 1.6–2.1 times, and from the cost-efficient architecture.

7.
The intra prediction algorithm of the recently developed High Efficiency Video Coding (HEVC) standard has very high computational complexity. Therefore, in this paper, we propose pixel-equality and pixel-similarity based techniques for reducing the amount of computation performed by the HEVC intra prediction algorithm and, therefore, the energy consumption of HEVC intra prediction hardware. The proposed techniques significantly reduce the amount of computation performed by the 4 × 4 and 8 × 8 luminance angular prediction modes with only a small comparison overhead. The pixel-equality based technique does not affect PSNR or bit rate. The pixel-similarity based technique increases PSNR slightly for some video frames and decreases it slightly for others. We also designed and implemented a low-energy HEVC intra prediction hardware for the 4 × 4 and 8 × 8 angular prediction modes, including the proposed techniques, using Verilog HDL. The proposed techniques significantly reduce the energy consumption of this HEVC intra prediction hardware.
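The pixel-equality shortcut can be illustrated in a few lines: if every reference sample is identical, all angular modes produce the same flat block, so the per-pixel filtering can be skipped. The angular predictor below is a toy stand-in, not the HEVC filter.

```python
def angular_prediction_4x4(ref_top, ref_left, mode_offset):
    """Toy stand-in for an angular prediction of a 4x4 block; the real HEVC
    predictor interpolates along 33 angles, this just mixes top and left
    references to keep the example short."""
    pred = [[0] * 4 for _ in range(4)]
    for y in range(4):
        for x in range(4):
            idx = min(3, max(0, x + (y * mode_offset) // 4))
            pred[y][x] = (ref_top[idx] + ref_left[y] + 1) // 2
    return pred

def predict_with_equality_shortcut(ref_top, ref_left, mode_offset):
    """If all reference pixels are equal, every angular mode yields the same
    flat block, so all per-pixel computation can be skipped."""
    first = ref_top[0]
    if all(p == first for p in ref_top) and all(p == first for p in ref_left):
        return [[first] * 4 for _ in range(4)]        # zero-computation path
    return angular_prediction_4x4(ref_top, ref_left, mode_offset)

if __name__ == "__main__":
    print(predict_with_equality_shortcut([128] * 4, [128] * 4, 2))
    print(predict_with_equality_shortcut([100, 110, 120, 130], [90, 95, 100, 105], 2))
```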

8.
The paper presents a cost-shared architecture to compute the multiple integer discrete cosine transforms (Int-DCT) of four video codecs: AVS, H.264/AVC, VC-1, and HEVC (under development at the time). Based on the symmetric structure of the matrices and the similarity of the matrix operations, we develop a generalized "decompose and share" algorithm to compute both the 4 × 4 and 8 × 8 Int-DCT. The algorithm is then applied to the four video codecs. The hardware-sharing approach ensures maximum circuit reuse during the computation. The architecture is designed with only adders and shifters to reduce the hardware cost significantly. The design is implemented on an FPGA and later synthesized in CMOS 0.18 μm technology.
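As an illustration of the "decompose and share" idea, here is the H.264/AVC 4 × 4 forward integer transform written as a butterfly that needs only additions and 1-bit shifts; the shared datapaths for the other codecs and the 8 × 8 transforms are not shown.

```python
def fwd4x4_1d(s):
    """One 1-D pass of the H.264/AVC 4x4 forward integer transform, written as
    a butterfly using only additions and a 1-bit shift (x << 1 == 2*x)."""
    a, b, c, d = s
    p0, p1 = a + d, b + c          # butterfly stage 1
    q0, q1 = a - d, b - c
    return [p0 + p1,               # DC term
            (q0 << 1) + q1,        # 2*q0 + q1
            p0 - p1,
            q0 - (q1 << 1)]        # q0 - 2*q1

def fwd4x4(block):
    """Row pass followed by column pass: Y = C * X * C^T."""
    rows = [fwd4x4_1d(r) for r in block]
    out_cols = [fwd4x4_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*out_cols)]

if __name__ == "__main__":
    x = [[58, 64, 51, 58],
         [52, 64, 56, 66],
         [62, 63, 61, 64],
         [59, 51, 63, 69]]
    for row in fwd4x4(x):
        print(row)
```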

9.
王官军  简春莲  向强 《计算机应用》2022,42(10):3184-3190
Convolutional neural network (CNN) based single-image dehazing models are difficult to deploy on mobile/embedded devices and are ill-suited to real-time video dehazing. To address this, a hardware-reconstruction acceleration method for dehazing models based on a Zynq system-on-chip (SoC) is proposed. First, a quantization-dequantization algorithm is proposed and applied to two representative dehazing models. Second, using a video-stream memory architecture, hardware/software co-design, pipelining, and high-level synthesis (HLS) tools, the quantized dehazing models are reconstructed in hardware to generate hardware IP cores with the high-performance AXI4 bus interface. Experimental results show that, while preserving the dehazing quality, the model parameters can be quantized from float32 to int5 (5-bit), saving about 84.4% of the storage space. The generated hardware IP core reaches a maximum pixel clock of 182 Mpixel/s, enabling 1080P@60 frame/s video dehazing; dehazing a single 640×480 hazy frame takes only 2.4 ms with an on-chip power consumption of only 2.25 W. Generating hardware IP cores with standard bus interfaces also eases cross-platform porting and deployment, broadening the range of applications of such dehazing models.
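A minimal sketch of a symmetric quantize/dequantize step to signed 5-bit integers is given below; it reproduces the roughly 84.4% storage saving implied by 5-bit versus 32-bit storage, but the scaling scheme is an assumption, not the paper's exact algorithm.

```python
import numpy as np

def quantize(weights, num_bits=5):
    """Symmetric uniform quantization of float32 weights to signed num_bits
    integers, returning the codes and the scale needed to dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                       # 15 for int5
    peak = float(np.abs(weights).max())
    scale = peak / qmax if peak > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to float for accuracy checking."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.default_rng(1).normal(0, 0.1, size=(3, 3, 16, 16)).astype(np.float32)
    q, scale = quantize(w, num_bits=5)
    err = np.abs(w - dequantize(q, scale)).mean()
    saved = 1 - 5 / 32                                   # int5 vs float32 storage
    print(f"mean abs error: {err:.5f}, storage saved: {saved:.1%}")
```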

10.
A novel high-throughput, scalable, unified architecture for computing the transform operations of advanced video coding standards is presented in this paper. The structure can be used as a hardware accelerator in modern embedded systems to efficiently compute all the two-dimensional 4 × 4 and 2 × 2 transforms of the H.264/AVC standard. Moreover, its highly flexible design and hardware efficiency allow it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-5 FPGA demonstrate the superior performance and hardware-efficiency levels provided by the proposed structure, which offers a higher throughput per unit of area than other similar recently published designs targeting the H.264/AVC standard. The results also show that, when integrated in a multi-core embedded system, this architecture provides speedups of about 120× over pure software implementations of the transform algorithms, therefore allowing real-time computation of all the above-mentioned transforms for Ultra High Definition Video (UHDV) sequences (7,680 × 4,320 @ 30 fps).

11.
This paper presents a new spatio-temporal motion estimation algorithm and its VLSI architecture for video coding, based on an algorithm/architecture co-design methodology. The algorithm combines a new spatio-temporal motion vector prediction strategy, a modified one-at-a-time search scheme, and multiple update paths derived from optimization theory. The hardware is specified for high-definition video coding. We applied the ME algorithm to the H.264 reference software; it surpasses recently published work and achieves performance close to full search. The VLSI implementation demonstrates the low-cost nature of the algorithm. The algorithm/architecture co-design concept is strongly emphasized in this paper, and we provide quantitative examples to show why such co-design is necessary.

12.
Fast search algorithms (FSA) used for variable-block-size motion estimation follow irregular search (data-access) patterns, which is the main challenge in designing hardware architectures for them. In this study, we build a baseline architecture for fast search algorithms using state-of-the-art components available in academia. We improve its performance by introducing: (1) a super two-dimensional (2-D) random access memory architecture that reads regular and interleaved pairs of rows or columns, as opposed to the single-row or single-column access of the state of the art; (2) a 2-D processing element array with a tuned interconnect that supports the neighborhood connections required by conventional fast search algorithms and exploits on-chip data reuse. Results show that our design increases system throughput by up to 85.47% and reduces power by up to 13.83%, with a worst-case area increase of up to 65.53% compared to the baseline architecture.

13.
Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate the algorithms with dedicated hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and digital signal processors (DSPs). Previous approaches addressed the problem using accelerators together with a general-purpose processor such as an ARM (Acorn RISC Machine). In this work, we present a co-processing architecture using an FPGA and a DSP. A portable platform for motion estimation based on sparse feature-point detection and tracking is developed for real-time embedded systems and smart video-sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm on a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) at VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers speedups of about 20 times and 3 times over the ARM + DSP and DSP-only configurations, respectively. A comparison of the Harris feature detection algorithm's performance on different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolution.
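A software reference for the Harris corner response computed by such an IP core might look as follows; the gradient operator, window size, and k value are illustrative choices, not the paper's pipeline parameters.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, with the structure
    tensor M built from central-difference gradients and a 3x3 box window."""
    img = img.astype(np.float32)
    ix = np.zeros_like(img)
    iy = np.zeros_like(img)
    ix[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0     # horizontal gradient
    iy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0     # vertical gradient
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def box3(a):                                       # 3x3 box filter (zero pad)
        p = np.pad(a, 1)
        return sum(p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
                   for dy in range(3) for dx in range(3))

    sxx, syy, sxy = box3(ixx), box3(iyy), box3(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

if __name__ == "__main__":
    img = np.zeros((32, 32), dtype=np.float32)
    img[8:24, 8:24] = 255.0                            # a bright square
    r = harris_response(img)
    ys, xs = np.unravel_index(np.argsort(r.ravel())[-4:], r.shape)
    # the four strongest responses should lie near the square's corners
    print(sorted(zip(ys.tolist(), xs.tolist())))
```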

14.
Evolving video applications require encoders and decoders that can be implemented at low cost while achieving real-time performance. To meet this demand, and targeting especially applications with tight VLSI area requirements, this paper describes a real-time VLSI H.264/AVC encoder architecture. The encoder uses a pipelined architecture, and all modules have been optimized with respect to VLSI cost. The encoder design complies with the standard's reference software encoder, follows the Baseline profile at level 3.0, and constitutes an IP core and/or an efficient stand-alone solution. The architecture operates at a maximum frequency of 100 MHz and achieves a maximum throughput of 30 frames/s at a frame size of 1,024 × 768. Results and performance measurements of the entire encoder have been validated on an FPGA and in 0.18 μm VLSI technology, occupying a total area of 3.9 mm².

15.
To address the long latency and high cost of conventional video image compression algorithms, a new lossless/near-lossless video compression algorithm is proposed. The algorithm consists of a rate controller and an entropy coder: the rate controller determines the number of bits available to the current macroblock by analyzing already available information (the context), and the residuals are then encoded with an efficient Huffman code table derived from extensive experiments, combined with a bit-plane coder. Experimental results show that the proposed algorithm can operate at 300 MHz with a worst-case throughput of 1.3 pixel/cycle, while using only a single 120*720 SRAM to store the previous line of pixels, thus effectively addressing the latency and cost issues.
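To make the prediction-plus-entropy-coding idea concrete, the sketch below predicts each pixel from its left neighbour and counts the bits of a Golomb-Rice code for the residuals; this code stands in for the experimentally derived Huffman table and bit-plane coder described in the abstract, and the predictor seed and Rice parameter are assumptions.

```python
def signed_to_unsigned(v):
    """Interleave signed residuals into non-negative codes: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * v - 1 if v > 0 else -2 * v

def rice_code_length(u, k=2):
    """Bit length of a Golomb-Rice code with parameter k (unary quotient + k bits)."""
    return (u >> k) + 1 + k

def encode_row(row):
    """Left-neighbour prediction of one pixel row plus per-residual code lengths;
    returns (residuals, total bits) for a quick compression-ratio estimate."""
    bits, residuals, prev = 0, [], 128          # mid-grey seed for the first pixel
    for p in row:
        r = p - prev
        residuals.append(r)
        bits += rice_code_length(signed_to_unsigned(r))
        prev = p
    return residuals, bits

if __name__ == "__main__":
    row = [120, 121, 119, 122, 180, 181, 183, 180]
    res, bits = encode_row(row)
    print(res, f"{bits} bits vs {8 * len(row)} bits raw")
```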

16.
This paper describes MOSI (MultiOp Splitting Issue), a simultaneous multithreading (SMT) microarchitecture for very long instruction word (VLIW) processors. MOSI dynamically issues the instructions within a single MultiOp and uses a write-back buffer to guarantee that results are written back in the order expected by the compiler, thereby solving the key problems of SMT at low cost. The paper describes the structure and algorithm of the write-back buffer in detail, presents a hardware model for multiple threads, and finally discusses the number of hardware-supported threads and the cache organization. Experimental results show that a dual-threaded processor based on the MOSI architecture improves throughput by 40%.

17.
The emerging intra-coding tools of the High Efficiency Video Coding (HEVC) standard can achieve up to 36% bit-rate reduction compared to H.264/AVC, but with a significant increase in complexity. Design challenges such as data dependency and computational complexity make it difficult to implement a hardware encoder for real-time applications. In this paper, the data dependencies in HEVC intra-mode decision are first fully analyzed; they are caused by the reconstruction loop, the Most Probable Mode, the context adaptation during Context-based Adaptive Binary Arithmetic Coding based rate estimation, and the Chroma derived mode. Then, several fast algorithms are proposed to remove the data dependencies and to reduce the computational complexity, including source-signal based Rough Mode Decision, coarse-to-fine rough mode search, Prediction Mode Interlaced RDO mode decision, parallelized context adaptation, and Chroma-free Coding Unit (CU)/Prediction Unit (PU) decision. Finally, a parallelized VLSI architecture with CU-reordering and Chroma-reordering scheduling is proposed to improve the throughput. The experimental results demonstrate that the proposed intra-mode decision achieves 41.6% complexity reduction with a 4.3% Bjontegaard Delta Rate (BDR) increase on average compared to the reference software, HM-13.0. The intra-mode decision scheme is implemented with a 1571.7 K gate count in 55 nm CMOS technology. The implementation results show that our design can achieve 1080p@60 fps real-time processing at a 294 MHz operating frequency.
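Rough mode decision is typically driven by a low-cost SATD metric; the sketch below ranks a few candidate intra modes by the sum of absolute Hadamard-transformed differences. The candidate set, reference samples, and the number of modes kept are illustrative, not the paper's configuration.

```python
def hadamard4(block):
    """4x4 Hadamard transform (rows then columns) used for SATD."""
    h = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
    def t(m):
        return [[sum(h[i][k] * m[k][j] for k in range(4)) for j in range(4)]
                for i in range(4)]
    return t([list(r) for r in zip(*t(block))])

def satd(orig, pred):
    """Sum of absolute Hadamard-transformed differences between block and prediction."""
    diff = [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]
    return sum(abs(v) for row in hadamard4(diff) for v in row)

if __name__ == "__main__":
    orig = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
    top, left = [50, 56, 60, 64], [54, 68, 62, 65]     # neighbouring reference pixels
    predictors = {
        "DC":         [[(sum(top) + sum(left) + 4) // 8] * 4 for _ in range(4)],
        "horizontal": [[left[y]] * 4 for y in range(4)],
        "vertical":   [top[:] for _ in range(4)],
    }
    costs = sorted((satd(orig, pred), mode) for mode, pred in predictors.items())
    print([mode for _, mode in costs[:2]])             # the two modes kept for full RDO
```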

18.
We present a multimodal registration algorithm between images in the visible, short-wave infrared, and long-wave infrared spectra. The algorithm works with two reference-objective image pairs and operates in two stages: (1) a calibration phase between static frames that estimates the transformation parameters using histograms of oriented gradients and the Chi-square distance; (2) a frame-by-frame mapping with these parameters that uses a projective transformation and bilinear interpolation to map the objective video stream onto the coordinate system of the reference video stream. We present a distributed heterogeneous architecture that combines a programmable processor core and a custom hardware accelerator in each node. The software performs the calibration phase, whereas the hardware computes the frame-by-frame mapping. We implemented our design using a Xilinx Zynq XC7Z020 system-on-a-chip for each node. The prototype uses 2.38 W of power, 25% of the logic resources, and 65% of the available on-chip memory per node. Running at 100 MHz, the core can register 640 × 512-pixel frames in 4 ms after the initial calibration, which allows our module to operate at up to 250 frames per second.
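The calibration phase compares gradient-orientation statistics across spectra; a bare-bones sketch of an orientation histogram and the Chi-square distance is given below (no block normalisation, so it is only a stand-in for full HOG, and the bin count is an assumption).

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Gradient-orientation histogram of an image patch, magnitude-weighted,
    over [0, 180) degrees; a minimal stand-in for one HOG cell."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, 180.0), weights=mag)
    return hist / (hist.sum() + 1e-9)

def chi_square(h1, h2, eps=1e-9):
    """Chi-square distance between two normalised histograms, the similarity
    measure used in the calibration phase."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

if __name__ == "__main__":
    y, x = np.mgrid[0:16, 0:16].astype(np.float32)
    vertical_edges = np.sin(x)            # dominant horizontal gradients
    similar = 0.8 * np.sin(x + 0.3)       # same dominant orientation
    rotated = np.sin(y)                   # dominant vertical gradients
    h_ref = orientation_histogram(vertical_edges)
    print(chi_square(h_ref, orientation_histogram(similar)),   # small distance
          chi_square(h_ref, orientation_histogram(rotated)))   # large distance
```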

19.
Low-Density Parity-Check (LDPC) codes, with their excellent error-correction capability, have been widely used in both data communication and storage to build reliable cyber-physical systems that are resilient to real-world noise. Fast prototyping of field-programmable gate array (FPGA)-based decoders is essential to achieve high decoding performance while accelerating the development process. This paper proposes a three-level parallel architecture, TLP-LDPC, that achieves high throughput by fully exploiting the characteristics of both LDPC decoding and the underlying hardware, while scaling effectively to large FPGA platforms. The three levels are a low-level decoding unit, a mid-level multi-unit decoding core, and a high-level multi-core decoder. The low-level decoding unit is a basic LDPC computation component that combines the features of the LDPC algorithm with FPGA-specific structures (e.g., Look-Up Tables, LUTs) and eliminates potential data conflicts. The mid-level decoding core integrates the input/output and multiple decoding units in a well-balanced, pipelined fashion. The top-level multi-core architecture makes full use of board-level resources to improve the overall throughput. We develop LDPC C++ code with dedicated pragmas and leverage HLS tools to implement the TLP-LDPC architecture. Experimental results show that TLP-LDPC achieves 9.63 Gbps end-to-end decoding throughput on a Xilinx Alveo U50 platform, 3.9× higher than existing HLS-based FPGA implementations.
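For illustration, here is a hard-decision bit-flipping decoder on a tiny (7,4) Hamming parity-check matrix; it is far simpler than the layered min-sum schedules usually implemented in hardware, but it shows the check-node/variable-node interplay that the three-level architecture parallelizes.

```python
import numpy as np

def bit_flip_decode(H, received, max_iters=20):
    """Hard-decision bit-flipping decoding: repeatedly flip the bits that
    participate in the largest number of unsatisfied parity checks."""
    bits = received.copy()
    for _ in range(max_iters):
        syndrome = H.dot(bits) % 2                 # one entry per parity check
        if not syndrome.any():
            return bits, True                      # all checks satisfied
        # per bit, count how many failing checks it is involved in
        fail_count = syndrome.dot(H)
        bits = np.where(fail_count == fail_count.max(), bits ^ 1, bits)
    return bits, False

if __name__ == "__main__":
    # (7,4) Hamming parity-check matrix: a tiny stand-in for the long LDPC
    # codes targeted by the decoder.
    H = np.array([[1, 0, 1, 0, 1, 0, 1],
                  [0, 1, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)
    codeword = np.array([0, 0, 1, 0, 1, 1, 0], dtype=np.uint8)
    assert not (H.dot(codeword) % 2).any()         # sanity: valid codeword
    noisy = codeword.copy()
    noisy[2] ^= 1                                  # single bit error
    decoded, ok = bit_flip_decode(H, noisy)
    print(ok, decoded, bool((decoded == codeword).all()))
```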

20.
Dynamic Adaptive Streaming over HTTP (DASH) lets users select a suitable video quality according to their terminal's display capability and channel conditions, and it is the direction in which networked video services are evolving. How to adapt the video bitrate to variations in network throughput so as to obtain the best quality of experience (QoE) has not been well solved in existing DASH systems. This paper proposes an adaptive transmission algorithm based on fuzzy control: it takes the amount of buffered video and the mismatch between the requested video bitrate and the network throughput as inputs, and the expected change in buffer occupancy as output, using fuzzy logic to achieve the following control goals: (1) keep the buffer within a safe interval; (2) maximize the average quality of the transmitted video; (3) avoid playback interruptions caused by bandwidth fluctuations. Finally, performance tests were carried out in two virtual and two real network environments; the experimental results show that, compared with existing algorithms, the proposed algorithm delivers better QoE to users.
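A compact sketch of such a fuzzy controller is given below: buffered seconds and the rate mismatch are fuzzified, a small rule base produces a desired buffer change, and the result is turned into a bitrate choice. All membership breakpoints, rule weights, and the 10-second segment assumption are illustrative, not the paper's design.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over the support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_buffer_target(buffer_s, mismatch):
    """Inputs: buffered video (seconds) and requested-rate / throughput ratio.
    Output: desired change of the buffer over the next segment (seconds)."""
    low = tri(buffer_s, -1, 5, 15)
    ok = tri(buffer_s, 5, 15, 25)
    high = tri(buffer_s, 15, 25, 60)
    under = tri(mismatch, 0.0, 0.6, 1.0)
    match = tri(mismatch, 0.6, 1.0, 1.4)
    over = tri(mismatch, 1.0, 1.4, 3.0)
    # rule base: grow the buffer when it is low or we are over-requesting,
    # drain it when it is high and the requested rate is safely below throughput
    rules = [
        (min(low, under),  +2.0), (min(low, match),  +3.0), (min(low, over),  +4.0),
        (min(ok,  under),  -1.0), (min(ok,  match),   0.0), (min(ok,  over),  +2.0),
        (min(high, under), -3.0), (min(high, match), -1.0), (min(high, over),  0.0),
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0

def pick_bitrate(levels_kbps, throughput_kbps, buffer_s, last_kbps):
    """Turn the buffer-change target into a rate choice: spend throughput plus
    whatever the controller lets us drain from (or must add to) the buffer."""
    delta = fuzzy_buffer_target(buffer_s, last_kbps / throughput_kbps)
    budget = throughput_kbps * (1.0 - delta / 10.0)      # 10 s segments assumed
    return max([r for r in levels_kbps if r <= budget], default=levels_kbps[0])

if __name__ == "__main__":
    levels = [350, 700, 1500, 3000, 6000]
    print(pick_bitrate(levels, throughput_kbps=2500, buffer_s=22, last_kbps=1500))
    print(pick_bitrate(levels, throughput_kbps=2500, buffer_s=4, last_kbps=3000))
```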
