Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Integer-pixel motion estimation (IME) is a crucial, high-complexity module in high-definition video encoders. Joint algorithm and architecture design must trade off multiple target parameters, including throughput, logic gate count, on-chip SRAM size, memory bandwidth, and rate-distortion performance. Data organization and on-chip buffer structure are crucial factors in IME architecture design because they drive this multi-target tradeoff. In this work, we combine global hierarchical search and local full search into a hardware-efficient IME algorithm, and then propose a VLSI architecture with an optimized on-chip buffer structure. The major contributions of this work are: (1) an improved hierarchical IME algorithm with presearch and deliberate data organization, (2) a multistage on-chip reference pixel buffer structure with high data reuse between integer- and fractional-pixel motion estimation, and (3) a highly reused and reconfigurable processing element structure. The optimized data organization and buffer structure achieve nearly 70% buffer saving with an average PSNR degradation of less than 0.08 dB (0.12 dB in the worst case) compared with a full-search-based architecture. At a hardware cost of 336 K and 382 K logic gates and 20 kB of SRAM, the proposed architecture achieves a throughput of 384 and 272 cycles per macroblock at system frequencies of 95 and 264 MHz for 1080p and QFHD @30fps video coding, respectively.
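The cycle and frequency figures in this abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration and rests on two assumptions that the abstract does not state: macroblocks are 16×16 pixels (so 1080 rows are padded to 1088), and QFHD means 3840×2160. The function name `required_mhz` is hypothetical.

```python
def required_mhz(width, height, fps, cycles_per_mb, mb_size=16):
    """Clock frequency in MHz needed to process every macroblock in real time."""
    # Frame dimensions are rounded up to whole macroblocks (ceiling division).
    mbs_per_row = -(-width // mb_size)
    mbs_per_col = -(-height // mb_size)
    mbs_per_second = mbs_per_row * mbs_per_col * fps
    return mbs_per_second * cycles_per_mb / 1e6


# Assumed frame sizes: 1080p = 1920x1080, QFHD = 3840x2160.
print(required_mhz(1920, 1080, 30, 384))
print(required_mhz(3840, 2160, 30, 272))
```

Under these assumptions the two results come out at roughly 94 MHz and 264 MHz, which is in line with the 95 and 264 MHz quoted above.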

2.
Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video/imaging. To tackle it, we propose new block and row access modes for a parallel on-chip memory subsystem, which enable higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit spatial overlaps of the blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and to merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows in a way that is convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word and a previously proposed shifted scheme for arranging pixels in the banks. We analytically and experimentally demonstrate the advantages of this work in a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time when clocked at 600 MHz. Compared with implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon area and off-chip memory bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., energy consumption, area, and off-chip bandwidth. This higher efficiency results from the new access modes, which reduce the number of on-chip memory accesses by 1.6–2.1 times, and from the cost-efficient architecture.

3.
A novel full-search motion estimation co-processor architecture is presented in this paper. The proposed architecture efficiently reuses search area data to minimize memory I/O while fully utilizing the hardware resources. A smart processing element (PE) and a simple, efficient internal memory are the main components of the proposed co-processor. An efficient algorithm is used for loading both the current block and the search area into the PE array. The search area data flow horizontally while the current block data are stationary. As a result, the co-processor improves on state-of-the-art techniques in both throughput and operating frequency. The smart local memory and PE design guarantees a simple and regular data flow. The local memory is implemented using only registers and a simple counter, which simplifies the design by avoiding complicated addressing for local memory reads and writes. The proposed architecture is implemented using both FPGA and ASIC design flows. For a search range of 32 × 32 and a block size of 16 × 16, the architecture can perform motion estimation for HDTV video at 30 fps when clocked at 350 MHz, and it easily outperforms many fast full-search architectures.
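For readers unfamiliar with the baseline, full-search block matching can be sketched in software as follows. This is a plain sequential illustration, not the co-processor's data flow. The function names are hypothetical, and reading the "32 × 32 search range" as displacements in [-16, 16) on each axis is an assumption.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between an n x n current block and a reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, bx, by, n=16, search=16):
    """Test every displacement in [-search, search) and return (best_sad, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search):
        for dx in range(-search, search):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate costs n × n absolute differences, so the defaults need 32 × 32 × 256 operations per block. That workload is what motivates the data reuse and PE arrays described in these abstracts.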

4.
A motion estimation architecture that can execute a variety of block-matching search techniques is presented in this paper. The ability to choose the most efficient search technique, both for speeding up the process and for locating the best-matching target block, improves the quality of service and the performance of video encoding. The proposed architecture is pipelined to efficiently support a large set of currently used block-matching algorithms, including Diamond Search, 3-step, MVFAST and PMVFAST. The design executes the algorithms through a set of instructions common to all block-matching algorithms plus a few instructions that accommodate the specific actions of each technique. Moreover, the architecture supports the use of different search techniques at the block level. The results and performance measurements of the architecture have been validated on an FPGA, supporting a maximum throughput of 30 frames/s at a frame size of 1,024 × 768.

5.
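In motion estimation for video coding, full-search architectures are the most widely used. However, conventional full-search 1-D and 2-D systolic architectures and tree architectures often suffer from high I/O bandwidth or low computational efficiency. To address these problems, a new data flow and a corresponding two-dimensional systolic array architecture are proposed. By exploiting the overlap between the search areas of adjacent current blocks, the design minimizes I/O bandwidth while maintaining high performance. Results show that the proposed architecture computes the 41 motion vectors of a macroblock within 256 cycles and requires only 3 data inputs.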
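The figure of 41 motion vectors per macroblock corresponds to the H.264 variable block sizes: 16 blocks of 4×4, 8 of 4×8, 8 of 8×4, 4 of 8×8, 2 of 8×16, 2 of 16×8 and 1 of 16×16. A common way to obtain all 41 SADs for one candidate position is to compute the sixteen 4×4 SADs once and sum them into the larger partitions. The sketch below illustrates that reuse in software only. It does not reproduce the systolic data flow of the paper, and all function names are hypothetical.

```python
def sads_4x4(cur, ref):
    """SAD of each 4x4 sub-block of two 16x16 blocks, returned as a 4x4 grid."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def merge(grid, bh, bw):
    """Sum the 4x4 SAD grid over partitions of bh x bw cells (one cell = 4x4 pixels)."""
    return [sum(grid[r + i][c + j] for i in range(bh) for j in range(bw))
            for r in range(0, 4, bh) for c in range(0, 4, bw)]


def all_41_sads(cur, ref):
    """Return the SADs of every partition, keyed by 'width x height' in pixels."""
    g = sads_4x4(cur, ref)
    # (cells high, cells wide) for each partition shape.
    shapes = {"4x4": (1, 1), "4x8": (2, 1), "8x4": (1, 2), "8x8": (2, 2),
              "8x16": (4, 2), "16x8": (2, 4), "16x16": (4, 4)}
    return {name: merge(g, bh, bw) for name, (bh, bw) in shapes.items()}
```

Because the larger partitions are plain sums of the 4×4 results, the pixel-level absolute differences are computed only once per candidate position.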

6.
Fast search algorithms (FSA) used for variable-block-size motion estimation follow irregular search (data access) patterns. This poses the main challenge in designing hardware architectures for them. In this study, we build a baseline architecture for fast search algorithms using state-of-the-art components available in academia. We improve its performance by introducing: (1) a super 2-dimensional (2-D) random access memory architecture for reading regular and interleaved two rows or two columns, as opposed to the one-row or one-column accessibility of the state of the art; (2) a 2-D processing element array with a tuned interconnect to support the neighborhood connections required by conventional fast search algorithms and to exploit on-chip data reuse. Results show that, compared to the baseline architecture, our design increases system throughput by up to 85.47% and reduces power by up to 13.83%, with a worst-case area increase of up to 65.53%.

7.
The synergistic processor element is a new architecture oriented toward multimedia and streaming processing. In this architecture, the memory is not a cache but a private, or scratch pad, memory. Such a memory is simple, but it must offer high frequency and large capacity at low power. This design uses an 11 fan-out-of-four (11FO4), six-cycle, fully pipelined, embedded 256-Kbyte SRAM for this purpose. The design's memory is not one hard macro, but a group of custom macros physically distributed to optimize the pipeline.

8.
Graph computation problems that exhibit irregular memory access patterns are known to perform poorly on multiprocessor architectures. Although recent studies use FPGA technology to tackle the memory wall problem of graph computation by adopting a massively multi-threaded architecture, the performance is still far below optimal memory performance because of the long memory access latency. In this paper, we propose a comprehensive reconfigurable computing approach to address the memory wall problem. First, we present an extended edge-streaming model with massive partitions to provide better load balance while taking advantage of the streaming bandwidth of external memory in processing large graphs. Second, we propose a two-level shuffle network architecture to significantly reduce the on-chip memory requirement while providing high processing throughput that matches the bandwidth of the external memory. Third, we introduce a compact storage design based on graph compression schemes and propose the corresponding encoding and decoding hardware to reduce the data volume transferred between the processing engines and external memory. We validate the effectiveness of the proposed architecture by implementing three frequently used graph algorithms on an ML605 board, showing an improvement of up to 3.85× in performance-to-bandwidth ratio over previously published FPGA-based implementations.

9.
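Based on an analysis of the MVFAST algorithm, this paper proposes an improvement to its search window. The improved search window increases the search speed of the algorithm without affecting image quality. An architecture and processing elements supporting the algorithm are designed, together with a flexible data flow structure and search strategy. Experiments were carried out on a large number of video sequences with different resolutions and different degrees of motion. The extensive experimental results show that the designed architecture achieves a clearly higher processing speed and greatly reduced memory access bandwidth, with image quality comparable to that of the full search algorithm.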

10.
This paper describes the development of efficient hardware/software (HW/SW) neuro-fuzzy systems. The model used in this work consists of an adaptive neuro-fuzzy inference system modified for efficient HW/SW implementation. The design of two different on-chip approaches is presented: a high-performance parallel architecture for offline training and a pipelined architecture suitable for online parameter adaptation. Details of important aspects concerning the design of HW/SW solutions are given. The proposed architectures have been implemented using a system-on-a-programmable-chip. The device contains an embedded-processor core and a large field programmable gate array (FPGA). The processor provides flexibility and high precision to implement the learning algorithms, while the FPGA allows the development of high-speed inference architectures for real-time embedded applications.

11.
12.
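Motion estimation based on block matching is a key technique in image and video applications. The SAD (sum of absolute differences) operation is the dominant computation in motion estimation and has very high computational complexity and data transfer bandwidth requirements. This paper proposes a configurable SAD accelerator architecture that uses a 16×1 PE array and an adder tree to speed up SAD computation. The pipelines of the PE array and the adder tree are finely partitioned, which effectively raises the operating frequency. The accelerator uses a DMA event mechanism so that most data transfers proceed in parallel with SAD computation, reducing the performance loss caused by data transfer latency. Experimental results show that the architecture needs only 4102 cycles to search a 16×16 search window. Synthesized with an SMIC 0.13 μm CMOS standard-cell process, the architecture reaches a maximum operating frequency of 750 MHz, with an area of about 16.8k gates plus 3.5 KB of on-chip memory.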
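The combination of a 16×1 PE array and an adder tree can be pictured with a small behavioural model: each of the 16 PEs produces one absolute difference per row, and a four-level adder tree reduces the 16 values to a row sum that is accumulated over the 16 rows of a block. The sketch below is a software analogy under that assumption. It does not model the pipelining or the DMA mechanism, and the function names are hypothetical.

```python
def adder_tree(values):
    """Reduce a power-of-two-length list pairwise, one tree level per pass."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def sad_16x16(cur, ref):
    """Accumulate the SAD of a 16x16 block one row at a time."""
    total = 0
    for cur_row, ref_row in zip(cur, ref):
        # Models the 16 PEs, each producing one absolute difference.
        diffs = [abs(c - r) for c, r in zip(cur_row, ref_row)]
        total += adder_tree(diffs)
    return total
```

One possible reading of the 4102-cycle figure, which the abstract does not confirm, is 256 candidate positions × 16 rows = 4096 cycles plus a few cycles of pipeline latency.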

13.
This article presents HP422-MoCHA, an optimized Motion Compensation hardware architecture for the High 4:2:2 profile of the H.264/AVC video coding standard. The proposed design focuses on real-time decoding of HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-widths (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide an enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter-sample accuracy, bi-prediction and weighted prediction. HP422-MoCHA also includes a hardwired Motion Vector Predictor, supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D Cache reduces frame memory access, cutting bandwidth by 62% and clock cycles by 80% on average. The design was implemented in a Xilinx Virtex-II PRO FPGA, and also in an ASIC with a TSMC 0.18 μm standard-cell technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area and consumes 130 mW. Both implementations reach a maximum operating frequency of ~100 MHz and can motion compensate 37 bi-predictive or 69 predictive frames per second. The minimum frequency required to ensure real-time decoding of HD1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first Motion Compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA was used for comparison purposes, showing the highest throughput among all presented works. However, the HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile.

14.
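To improve H.264 video encoding efficiency, a parallel full-search motion estimation algorithm based on the Compute Unified Device Architecture (CUDA) is proposed. It uses the computing power of the GPU and CUDA's optimized memory hierarchy to accelerate motion estimation in H.264 encoding. Unlike traditional methods that raise motion estimation performance at the expense of video quality, the algorithm preserves video quality and, given that motion estimation is computation-intensive, fully exploits the parallelism of the CUDA architecture to speed it up, thereby increasing real-time encoding speed. Experiments on a GTX280 platform show that the algorithm achieves a speedup of up to 70× over an optimized CPU implementation.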

15.
This paper presents the architecture of a high-throughput compensator and interpolator used in the motion estimation of an H.265/HEVC encoder. The architecture can process one 8×8 block in each clock cycle. The design allows coding blocks and motion vectors to be checked in random order, which makes the architecture suitable for different search algorithms. The interpolator embeds 64 multiplierless reconfigurable filter cores to support computations for different fractional-pel positions. Synthesis results show that the design can operate at 200 and 400 MHz when implemented on an Arria II FPGA and in TSMC 90 nm technology, respectively. The computational scalability enables the proposed architecture to trade throughput for compression efficiency. If 2160p@30fps video is encoded, the design clocked at 400 MHz can check about 100 motion vectors for each 8×8 block.

16.
The H.264 video-coding standard is a great improvement on its predecessor in that it is able to save 50% of the bit-rate while maintaining the same quality as MPEG-4. However, its high computational complexity means the standard consumes large amounts of energy to process a video sequence, especially during motion estimation (ME) searches. To overcome this problem, a low-energy ME architecture is proposed in this paper that utilizes a quadrant-based multi-octagon search algorithm as one of its fast-search motion-estimation techniques. The proposed architecture reduces the clock cycles needed to perform the search by 42% compared to the original conventional algorithm. This clock cycle reduction reduces energy consumption by up to 43%.

17.
Biologically inspired, packet-switched network-on-chip (NoC) based hardware spiking neural network (SNN) architectures have been proposed as an embedded computing platform for classification, estimation and control applications. Storing the large synaptic connectivity (SNN topology) information in SNNs requires large distributed on-chip memory, which poses serious challenges for compact hardware implementation of such architectures. Based on the structured neural organisation observed in the human brain, a modular neural network (MNN) design strategy partitions complex application tasks into smaller subtasks executing on distinct neural network modules, and integrates intermediate outputs in higher-level functions. This paper proposes a hardware modular neural tile (MNT) architecture that reduces the SNN topology memory requirement of NoC-based hardware SNNs by using a combination of fixed and configurable synaptic connections. The proposed MNT contains a 16:16 fully connected feed-forward SNN structure and integrates into a mesh-topology NoC communication infrastructure. The SNN topology memory requirement is 50% of that of the monolithic NoC-based hardware SNN implementation. The paper also presents a lookup-table-based SNN topology memory allocation technique, which further increases memory utilisation efficiency. Overall, the area requirement of the architecture is reduced by an average of 66% for practical SNN application topologies. The paper presents micro-architecture details of the proposed MNT and digital neuron circuit. The proposed architecture has been validated on a Xilinx Virtex-6 FPGA and synthesised using 65 nm low-power CMOS technology. The evolvable capability of the proposed MNT and its suitability for executing subtasks within an MNN execution architecture are demonstrated by successfully evolving benchmark SNN application tasks representing classification and non-linear control functions. The paper addresses hardware modular SNN design and implementation challenges and contributes to the development of a compact hardware modular SNN architecture suitable for embedded applications.

18.
A novel motion estimation (ME) architecture based on an improved normalized partial distortion search is proposed to meet three primary requirements of real-time video encoding: low power, low bandwidth and high area utilization efficiency. The ME engine supports both normalized partial distortion search and adaptive search window adjustment. The former reduces the computational complexity of ME to save power and area; the latter avoids unnecessary accesses to external memory to lower the data bandwidth. The proposed engine has been implemented in UMC 90 nm CMOS technology. The implementation results show that, compared with traditional engines, the engine achieves significant improvements in hardware efficiency and power efficiency with a small throughput compromise.
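The general idea behind a normalized partial distortion search can be sketched as follows: the block's pixels are split into groups, the distortion is accumulated group by group, and a candidate is abandoned as soon as its partial distortion, scaled to the full block, already exceeds the best distortion found so far. The sketch below uses block rows as the groups, which is an assumption made for brevity. Published variants typically use interleaved pixel patterns, and the "improved" algorithm of this paper is not described in the abstract. The function name is hypothetical.

```python
def npds_sad(cur, ref, best_so_far):
    """Return the full SAD of a candidate, or None if it is rejected early.

    cur and ref are equally sized blocks given as lists of rows; each row is
    treated as one pixel group (an assumption of this sketch).
    """
    groups = len(cur)
    partial = 0
    for p, (cur_row, ref_row) in enumerate(zip(cur, ref), start=1):
        partial += sum(abs(c - r) for c, r in zip(cur_row, ref_row))
        # Normalized test: compare partial / p with best_so_far / groups,
        # written without division.
        if partial * groups > p * best_so_far:
            return None
    return partial
```

Because the test extrapolates from a partial sum, it can reject a candidate that would have turned out to be the true minimum. That trade of accuracy for fewer operations is what lets such schemes save power and area.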

19.
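To meet the three requirements of real-time video encoding (low power, low bandwidth, and low resource usage), this paper proposes a novel motion estimation hardware architecture based on an improved normalized partial distortion search algorithm. The new architecture supports both normalized partial distortion search and adaptive search region adjustment. The former reduces the computational complexity of motion estimation, which addresses the low-power and low-resource requirements; the latter avoids unnecessary external memory accesses, which lowers data bandwidth. Implementation results in a UMC 90 nm CMOS process show that, compared with the best results of traditional architectures, the proposed architecture trades a 6.2% throughput loss for improvements of 425.5% in area efficiency and 397.5% in power efficiency.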

20.
Fast Motion Estimation on Graphics Hardware for H.264 Video Encoding   Cited by: 1 (self-citations: 0, citations by others: 1)
The video coding standard H.264 supports video compression with a higher coding efficiency than previous standards. However, this comes at the expense of increased encoding complexity, in particular for motion estimation, which becomes a very time-consuming task even for today's central processing units (CPUs). On the other hand, modern graphics hardware includes a powerful graphics processing unit (GPU) whose computing power remains idle most of the time. In this paper, we present a GPU-based approach to motion estimation for H.264 video encoding. A small diamond search is adapted to the programming model of modern GPUs to exploit their available parallel computing power and memory bandwidth. Experimental results demonstrate a significant reduction in computation time and a competitive encoding quality compared to a CPU UMHexagonS implementation, while enabling the CPU to process other encoding tasks in parallel.
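A small diamond search repeatedly evaluates the four neighbours of the current best position (left, right, up, down) and moves to the best one until no neighbour improves the cost. The sketch below is a sequential CPU illustration of that search pattern only. It does not show how the paper maps the search to the GPU, and the function names and the `max_steps` limit are hypothetical.

```python
def block_sad(cur, ref, bx, by, dx, dy, n):
    """SAD of the n x n block at (bx, by) against ref displaced by (dx, dy)."""
    h, w = len(ref), len(ref[0])
    rx, ry = bx + dx, by + dy
    # Candidates outside the reference frame are never selected.
    if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
        return float("inf")
    return sum(abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
               for y in range(n) for x in range(n))


def small_diamond_search(cur, ref, bx, by, n=16, max_steps=32):
    """Greedy descent over the 4-neighbourhood; returns (best_sad, dx, dy)."""
    dx = dy = 0
    best = block_sad(cur, ref, bx, by, 0, 0, n)
    for _ in range(max_steps):
        candidates = [
            (block_sad(cur, ref, bx, by, dx + ox, dy + oy, n), dx + ox, dy + oy)
            for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1))
        ]
        cost, cx, cy = min(candidates)
        if cost >= best:
            break  # no neighbour improves on the current position
        best, dx, dy = cost, cx, cy
    return best, dx, dy
```

The search evaluates only a handful of candidates per step, but it can stop in a local minimum. That is why its encoding quality is compared against stronger CPU searches such as UMHexagonS.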


Copyright©北京勤云科技发展有限公司  京ICP备09084417号