Similar Documents
 20 similar documents found.
1.
Variable block-size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design. In this paper, we use inter-/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures. Furthermore, we propose two hardware architectures that can support traditional fixed block-size motion estimation as well as VBSME with less chip-area overhead than previous approaches. By broadcasting reference pixel rows and propagating partial sums of absolute differences (SADs), the first design uses fewer reference pixel registers and has a shorter critical path. The second design utilizes a two-dimensional distortion array and one adder tree with a reference buffer that maximizes data reuse between successive search candidates. The first design is suitable for low resolutions or a small search range, while the second design has the advantage of supporting a high degree of parallelism and VBSME. Finally, we propose an eight-parallel SAD tree with a shared reference buffer for H.264/AVC integer motion estimation (IME). Its processing capability is eight times that of a single SAD tree, but the reference buffer size is only doubled. Moreover, the most critical issue of H.264 IME, its huge memory bandwidth requirement, is overcome: we save 99.9% of off-chip memory bandwidth and 99.22% of on-chip memory bandwidth. We demonstrate a 720p, 30-fps solution at 108 MHz with a 330.2k gate count and 208 kbits of on-chip memory.
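As background for this and several of the following entries, the sketch below shows the full-search integer motion estimation workload that SAD-tree hardware accelerates. It is a minimal software model; the block size, search range, and border handling are illustrative assumptions, not details of the paper's design.

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur, ref, bx, by, n=16, sr=16):
    """Exhaustive integer-pel motion search for the n x n block at (bx, by).

    Every displacement in [-sr, sr]^2 is evaluated, so the returned motion
    vector is globally optimal for this block -- the computation a SAD tree
    evaluates in parallel rather than candidate by candidate.
    """
    block = cur[by:by + n, bx:bx + n]
    best_mv, best_sad = (0, 0), None
    for dy in range(-sr, sr + 1):
        for dx in range(-sr, sr + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + n <= ref.shape[0] and x + n <= ref.shape[1]:
                cost = sad(block, ref[y:y + n, x:x + n])
                if best_sad is None or cost < best_sad:
                    best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad
```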

2.
Variable block size motion estimation (VBSME) is becoming the new coding technique in H.264/AVC. This article presents a low-power VLSI implementation of VBSME, which employs a fast full-search block-matching algorithm to reduce power consumption while preserving the optimal motion vectors (MVs). The fast full-search algorithm is based on comparing the current minimum sum of absolute differences (SAD) to a conservative lower bound so that unnecessary SAD calculations can be eliminated. We first experimentally determine the specific conservative lower bound of the SAD and then implement the fast full-search algorithm in FPGA and in 0.18 µm CMOS technology. To the best of our knowledge, this is the first time a fast full-search block-matching algorithm has been explored to reduce power consumption in the context of VBSME and implemented in hardware. Experimental results show that the proposed design reduces power consumption by 45% compared to conventional VBSME designs that produce the optimal MV using the full-search algorithm.
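The paper determines its lower bound experimentally and the abstract does not spell it out. A classic conservative bound of this kind is the successive-elimination inequality SAD(C, R) >= |sum(C) - sum(R)|, which follows from the triangle inequality; it is used below purely as an illustrative stand-in for whatever bound the design adopts:

```python
import numpy as np

def sea_lower_bound(cur_block, ref_block):
    """|sum(cur) - sum(ref)| <= SAD(cur, ref), by the triangle inequality."""
    return abs(int(cur_block.sum()) - int(ref_block.sum()))

def eliminating_full_search(block, candidates):
    """Full search that skips the SAD whenever the bound cannot beat the best.

    Because the bound never exceeds the true SAD, skipped candidates could
    not have won, so the optimal motion vector is preserved.
    """
    best_idx, best_sad = -1, None
    for i, ref_block in enumerate(candidates):
        if best_sad is not None and sea_lower_bound(block, ref_block) >= best_sad:
            continue  # SAD computation eliminated for this candidate
        cost = int(np.abs(block.astype(np.int32) - ref_block.astype(np.int32)).sum())
        if best_sad is None or cost < best_sad:
            best_idx, best_sad = i, cost
    return best_idx, best_sad
```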

3.
A VLSI architecture for variable block size video motion estimation
With the advent of new video standards such as MPEG-4 Part 10 and H.264/H.26L, demands for advanced video coding, particularly in the area of variable block size video motion estimation (VBSME), are increasing. In this paper, we propose a new one-dimensional (1-D) very large-scale integration architecture for full-search VBSME (FSVBSME). The VBS sum of absolute differences (SAD) computation is performed by reusing the results of smaller sub-block computations, which are distributed and combined by incorporating a shuffling mechanism within each processing element. Whereas a conventional 1-D architecture can process only one motion vector (MV), this new architecture can process up to 41 MV sub-blocks (within a macroblock) in the same number of clock cycles.
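The count of 41 comes from the H.264 partitions of a 16×16 macroblock: sixteen 4×4, eight 8×4, eight 4×8, four 8×8, two 16×8, two 8×16, and one 16×16. The sketch below is a software model of the reuse idea (not of the shuffling hardware): once the sixteen 4×4 SADs exist, all 41 SADs follow by addition alone.

```python
import numpy as np

def all_subblock_sads(cur_mb: np.ndarray, ref_mb: np.ndarray) -> dict:
    """Derive all 41 H.264 sub-block SADs of a 16x16 macroblock from 4x4 SADs."""
    diff = np.abs(cur_mb.astype(np.int32) - ref_mb.astype(np.int32))
    s4 = diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))  # 4x4 grid of 4x4-block SADs
    sads = {}
    for r in range(4):                               # sixteen 4x4 blocks
        for c in range(4):
            sads[("4x4", r, c)] = int(s4[r, c])
    for r in range(4):                               # eight 8x4 blocks (8 wide, 4 tall)
        for c in range(2):
            sads[("8x4", r, c)] = int(s4[r, 2*c] + s4[r, 2*c + 1])
    for r in range(2):                               # eight 4x8 blocks
        for c in range(4):
            sads[("4x8", r, c)] = int(s4[2*r, c] + s4[2*r + 1, c])
    for r in range(2):                               # four 8x8 blocks
        for c in range(2):
            sads[("8x8", r, c)] = int(s4[2*r:2*r + 2, 2*c:2*c + 2].sum())
    for r in range(2):                               # two 16x8 blocks
        sads[("16x8", r)] = int(s4[2*r:2*r + 2, :].sum())
    for c in range(2):                               # two 8x16 blocks
        sads[("8x16", c)] = int(s4[:, 2*c:2*c + 2].sum())
    sads[("16x16",)] = int(s4.sum())                 # one 16x16 block
    return sads                                      # len(sads) == 41
```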

4.
In H.264/AVC, the motion estimation (ME) routine supports variable block sizes and involves highly parallel sum of absolute differences (SAD) computations. In this study, we introduce a bit-serial, hybrid-grained processing element (PE) based 2D architecture that has both early-termination and intensive data-reuse capabilities. PEs operate on most-significant-bit-first arithmetic for early termination, and the 2D architecture enables on-chip data reuse between neighboring PEs in a bit-by-bit pipelined fashion. Hybrid-grained PEs reduce the hardware overhead of the conventional adder-tree structures used to implement variable block size ME. Our design reduces the gate count by 7x compared to its ASIC counterpart, operates at a comparable frequency while sustaining 30 fps and 60 fps, and outperforms bit-parallel and bit-serial architectures in terms of throughput and performance per gate for various video formats.

5.
H.264/AVC is the latest video coding standard, adopting variable block size motion estimation (VBS-ME), quarter-pixel accuracy, motion vector prediction, and multi-reference frames for motion estimation. These new features result in much higher computation requirements than previous coding standards. In this paper we propose a novel most-significant-bit (MSB) first bit-serial architecture for full-search block-matching VBS-ME and compare it with systolic implementations. Since the nature of MSB-first processing enables early termination of the sum of absolute differences (SAD) calculation, the average hardware performance can be enhanced. Five designs (one- and two-dimensional systolic and tree implementations, along with the bit-serial one) are compared in terms of performance, pixel memory bandwidth, occupied area, and power consumption.
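A software model of the MSB-first early-termination idea used here and in the previous entry: after seeing only the top bits of every pixel, a conservative lower bound on the final SAD is already available, so losing candidates can be dropped before their low-order bits are processed. The truncation bound below is a standard one, assumed for illustration; the hardware computes this incrementally bit by bit, whereas this sketch re-scans the pixels on each pass.

```python
def msb_first_sad(cur, ref, best_so_far, bits=8):
    """Model of MSB-first SAD with early termination.

    With pixels truncated to their top (bits - s) bits, writing a = ah*2^s + al,
    each |a - b| >= |ah - bh| * 2^s - (2^s - 1), so
        SAD >= (truncated SAD << s) - n_pixels * (2^s - 1)
    is a valid lower bound, and the candidate can be discarded as soon as it
    exceeds the best SAD found so far.
    """
    n = len(cur)
    for s in range(bits - 1, -1, -1):          # reveal one more bit each pass
        sad_hi = sum(abs((a >> s) - (b >> s)) for a, b in zip(cur, ref))
        lower_bound = (sad_hi << s) - n * ((1 << s) - 1)
        if lower_bound > best_so_far:
            return None                        # terminated early: cannot win
    return sum(abs(a - b) for a, b in zip(cur, ref))
```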

6.
We implemented the H.264/AVC variable block size motion estimation (VBSME) on a very long instruction word (VLIW)–single instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method, which has a regular structure, is chosen for VBSME not only to remove redundant sum of absolute differences (SAD) operations but also to exploit the instruction-level parallelism (ILP) and data-level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce the number of 'compare and update' operations and to simplify the rate-distortion optimization (RDO); it uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs. Several advanced software techniques are employed, including software pipelining and packed-data processing. In particular, memory-access overhead reduction schemes, such as multi-block processing and inter-procedural scheduling, are used for the software optimization. In order to reduce 'write buffer full' stalls in the quarter-pixel ME, a 4-bit quantization scheme is developed, which increases the number of arithmetic operations but greatly decreases the stall cycles. The implemented variable block size ME for H.264/AVC requires an average of 9 and 78 Mcycles per frame for QCIF and CIF size video sequences, respectively, on the TMS320C64x DSP architecture.

7.
A hardware implementation architecture for the motion estimation algorithm in H.264
This paper proposes a hardware implementation architecture suited to the variable block size motion estimation algorithm of the H.264 standard. The architecture uses a one-dimensional processing element (PE) array to carry out the block-matching search of motion estimation, and computes the matching distortion (SAD) of blocks of different sizes by reusing the SADs of smaller sub-blocks. Compared with a conventional architecture that processes a single motion vector, this architecture can process up to 41 motion vectors within the same number of clock cycles, while offering small area and high speed.

8.
This paper presents a new bit-plane-parallel coder architecture for the Embedded Block Coding with Optimized Truncation (EBCOT) entropy encoder used in JPEG2000. In this architecture, the coding information of every bit plane is obtained simultaneously and processed in parallel. Compared with other architectures, it offers high parallelism and wastes no clock cycles on single-point processing. The experimental results show that it reduces processing time by about 86% relative to the bit-plane-sequential scheme. A Field Programmable Gate Array (FPGA) prototype chip has been designed, and simulation results show that it can process 512×512 grayscale images at more than 30 frames per second at 52 MHz.
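For context, EBCOT codes each wavelet code-block one magnitude bit plane at a time; having all planes available at once is what a bit-plane-parallel architecture exploits. A minimal sketch of the plane decomposition (illustrative only, signs handled separately as in JPEG2000):

```python
import numpy as np

def bit_planes(coeffs: np.ndarray, n_bits: int = 8):
    """Split a block of coefficient magnitudes into bit planes, MSB first.

    EBCOT's three coding passes then scan each plane; extracting every
    plane up front lets all of them be coded concurrently.
    """
    mags = np.abs(coeffs)
    return [((mags >> p) & 1).astype(np.uint8) for p in range(n_bits - 1, -1, -1)]
```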

9.
This paper presents a memory-efficient, partially parallel decoder architecture suited to high-rate quasi-cyclic low-density parity-check (QC-LDPC) codes, using the (modified) min-sum algorithm for decoding. In general, over 30% of memory can be saved relative to conventional partially parallel decoder architectures. Efficient techniques have been developed to reduce the computation delay of the node processing units and to minimize the hardware overhead of parallel processing. The proposed decoder architecture can linearly increase the decoding throughput with a small percentage of extra hardware. Consequently, it facilitates the application of LDPC codes in area- and power-sensitive high-speed communication systems.
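The check-node update is the core of the node processing units mentioned above. A software sketch of the normalized min-sum rule follows (the 0.75 scaling factor is an illustrative choice, not taken from the paper). Note why min-sum is hardware-friendly: per check node, only the two smallest magnitudes, the index of the smallest, and the sign product need to be stored, rather than one message per edge.

```python
import math

def min_sum_check_node(llrs, scale=0.75):
    """Normalized min-sum check-node update for one parity check.

    Each outgoing message carries the product of the signs and the minimum
    magnitude of all *other* incoming LLRs; 'scale' is the normalization
    factor that compensates min-sum's overestimation.
    """
    signs = [1 if v >= 0 else -1 for v in llrs]
    sign_prod = math.prod(signs)
    mags = [abs(v) for v in llrs]
    order = sorted(range(len(mags)), key=mags.__getitem__)
    min1, min2 = mags[order[0]], mags[order[1]]
    return [scale * sign_prod * signs[i] * (min2 if i == order[0] else min1)
            for i in range(len(llrs))]
```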

10.
We evaluate the validity of the fundamental assumption behind application-specific programmable processors: that applications differ from each other in exploitable key parameters, such as the available instruction-level parallelism (ILP), the demand on various hardware resources, and the desired mix of function units. Following the tradition of the CAD community, we develop an accurate chip-area estimate and a set of aggressive hardware optimization algorithms; following the tradition of the architecture community, we use comprehensive real-life benchmarks and production-quality tools. This combination enables us to build a unique framework for system-level synthesis and to gain valuable insights about the design and use of application-specific programmable processors for modern applications. We explore the application-specific programmable processor (ASSP) design space to understand the relationship between performance and area. The architecture model we used is the Hewlett-Packard PA-RISC with single-level caches. The system, including all memory and bus latencies, is simulated, and no other specialized ALU or memory structures are used. The experimental results reveal a number of important characteristics of the ASSP design space. For example, we found that in most cases a single programmable architecture performs similarly to a set of architectures tuned to individual applications; a notable exception is highly cost-sensitive designs, which we observe need a small number of specialized, smaller-area architectures. It is also clear that there is enough parallelism in typical media and communication applications to justify the use of a large number of function units. We found that the framework introduced in this paper can be very valuable for early design decisions, such as the tradeoff between area and architectural configuration, the tradeoff between cache size and issue width under an area constraint, and the choice of the number of branch units and the issue width.

11.
《Microelectronics Journal》2014,45(11):1429-1437
In-memory computation is one of the most promising features of memristive memory arrays. In this paper, we propose an array architecture that supports in-memory computation based on a logic array first proposed in 1972 by Sheldon Akers. The Akers logic array satisfies this objective since this array can realize any Boolean function, including bit sorting. We present a hardware version of a modified Akers logic array, where the values stored within the array serve as primary inputs. The proposed logic array uses memristors, which are nonvolatile memory devices with noteworthy properties. An Akers logic array with memristors combines memory and logic operations, where the same array stores data and performs computation. This combination opens opportunities for novel non-von Neumann computer architectures, while reducing power and enhancing memory bandwidth.  相似文献   
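To illustrate the bit-sorting capability mentioned above, the sketch below sorts a bit vector with a grid of two-input cells, each mapping (a, b) to (a AND b, a OR b), i.e. (min, max) for single bits. This is a generic illustration of array-based bit sorting, not the exact Akers cell function used in the paper.

```python
def sort_bits(bits):
    """Sort a list of 0/1 values using only AND/OR compare-exchange cells.

    Each adjacent cell writes (a & b, a | b), the (min, max) of two bits;
    n full sweeps of adjacent cells (bubble-sort order) sort the vector.
    """
    bits = list(bits)
    n = len(bits)
    for _ in range(n):
        for i in range(n - 1):
            bits[i], bits[i + 1] = bits[i] & bits[i + 1], bits[i] | bits[i + 1]
    return bits

print(sort_bits([1, 0, 1, 1, 0, 0, 1]))  # [0, 0, 0, 1, 1, 1, 1]
```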

12.
Floating-point division is among the most complex of the floating-point operations, and the divider tends to dominate both area and performance. This paper presents double-precision floating-point division architectures for FPGA platforms. The designs are area-optimized, fully pipelined, and run at high clock speeds with low latency. The proposed architectures are based on the well-known Taylor series expansion and use a relatively small amount of hardware in terms of memory (the initial look-up table), multiplier blocks, and slices. Two architectures are presented with different trade-offs among area, memory, and accuracy. The designs use partial block multipliers in order to reduce hardware usage while minimizing the loss of accuracy. All implementations have been targeted and optimized separately for different Xilinx FPGAs to exploit their specific resources efficiently. Compared to previously reported work, the proposed architectures require less area and lower latency while delivering higher performance. The accuracy of the designs has been both theoretically analyzed and validated using random test cases.
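The underlying numerical idea: with the divisor normalized and its leading bits used to index a small reciprocal table, the residual factor is close to 1, so a few Taylor (geometric-series) terms recover the full reciprocal. A minimal sketch, with the table width and term count as illustrative assumptions rather than the paper's double-precision parameters:

```python
import math

def taylor_divide(a: float, b: float, lut_bits: int = 8, terms: int = 3) -> float:
    """Approximate a / b with a truncated Taylor (geometric) series.

    Write b = m * 2**e with m in [0.5, 1) and split m = m_h * (1 + x), where
    m_h keeps only the top 'lut_bits' fractional bits of m (the role of the
    initial look-up table). Then 1/m = (1/m_h) * (1 - x + x**2 - ...), and a
    few terms suffice because |x| < 2**(1 - lut_bits).
    """
    assert b > 0, "sketch handles positive divisors only"
    m, e = math.frexp(b)                       # b = m * 2**e
    scale = 1 << lut_bits
    m_h = math.floor(m * scale) / scale        # truncated mantissa ~ LUT entry
    x = (m - m_h) / m_h
    recip_m = sum((-x) ** k for k in range(terms + 1)) / m_h
    return a * math.ldexp(recip_m, -e)         # a / b = a * (1/m) * 2**(-e)

print(taylor_divide(1.0, 3.1), 1 / 3.1)        # the two values agree closely
```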

13.
The sum of absolute differences (SAD) is generally adopted as the cost function in motion estimation (ME) and temporal error concealment (TEC) algorithms owing to its efficiency. The SAD hardware also accounts for considerable power dissipation in a video codec chip, so switching-activity analysis of SAD is essential from both algorithm and architecture perspectives. This work develops probability-theoretic estimation formulas for the switching activity of a SAD engine. The experimental results reveal that the probability error rate (PER) of the SAD engine is as low as 5.61%. Consequently, this leads to a precise switching-activity estimate for SAD-based algorithms and architectures for video signal processing.
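Switching activity on a bus is the average number of bit toggles between consecutive cycles; analytic estimates like the paper's are typically validated against a simulated count of exactly that quantity. A small illustrative model (the bus width and random stimulus are assumptions, not the paper's setup):

```python
import random

def toggle_rate(trace, width):
    """Average number of bit toggles per cycle on a 'width'-bit bus."""
    mask = (1 << width) - 1
    flips = sum(bin((a ^ b) & mask).count("1") for a, b in zip(trace, trace[1:]))
    return flips / (len(trace) - 1)

# Stimulus: a SAD accumulator summing |cur - ref| over 256 random pixel pairs.
random.seed(0)
acc, trace = 0, [0]
for _ in range(256):
    acc += abs(random.randrange(256) - random.randrange(256))
    trace.append(acc)
print(f"avg toggles/cycle on the accumulator: {toggle_rate(trace, width=18):.2f}")
```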

14.
This paper presents novel very large scale integration (VLSI) architectures in support of an efficient implementation of Leighton's well-known Columnsort. The designs take advantage of reconfigurable bus architectures enhanced with simple shift switches. Our first main contribution is to show that Columnsort can be partitioned into two components: a hardware scheme for the task of sorting small arrays, and a hardware or software scheme for simple data-movement tasks. Our second main contribution is to demonstrate that the dynamically reconfigurable mesh architecture can be exploited to obtain a small and efficient hardware sorter. The resulting architectures feature highly regular circuitry, a simple control structure, and adaptability. Both theoretical analyses and simulation tests have shown that the proposed VLSI architectures for sorting are superior to existing designs in the context of sorting small and moderate-size arrays.
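Columnsort is compact enough to state as code. The sketch below (plain software, integer inputs, sizes satisfying the standard constraints r % s == 0 and r >= 2(s-1)^2) makes the partition visible: four of the eight steps are column sorts, which the paper maps onto small hardware sorters, and the other four are pure data movement.

```python
import numpy as np

def columnsort(a: np.ndarray) -> np.ndarray:
    """Leighton's Columnsort on an r x s integer matrix.

    The result is sorted when read in column-major order.
    """
    r, s = a.shape
    assert r % s == 0 and r >= 2 * (s - 1) ** 2, "Columnsort size constraints"
    a = np.sort(a, axis=0)                                  # 1. sort columns
    a = a.flatten(order="F").reshape(r, s)                  # 2. "transpose": col-major pickup, row-major layout
    a = np.sort(a, axis=0)                                  # 3. sort columns
    a = a.flatten(order="C").reshape((r, s), order="F")     # 4. untranspose
    a = np.sort(a, axis=0)                                  # 5. sort columns
    half = r // 2                                           # 6. shift columns down
    lo = np.full(half, np.iinfo(a.dtype).min, a.dtype)      #    by r/2, padding with
    hi = np.full(half, np.iinfo(a.dtype).max, a.dtype)      #    -inf / +inf sentinels
    ext = np.concatenate([lo, a.flatten(order="F"), hi]).reshape((r, s + 1), order="F")
    ext = np.sort(ext, axis=0)                              # 7. sort columns
    flat = ext.flatten(order="F")[half:half + r * s]        # 8. unshift
    return flat.reshape((r, s), order="F")

m = np.random.default_rng(1).integers(0, 100, size=(8, 2))
print(columnsort(m).flatten(order="F"))                     # fully sorted sequence
```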

15.
Video applications are increasingly present in consumer electronic devices, which require low power and energy consumption. The sum of absolute differences (SAD) is the distortion metric most used in video coding implementations, and it consumes a relatively large area in motion estimation hardware. This paper presents the standard-cell synthesis and a comprehensive analysis of various parallel hardware architecture alternatives for SAD calculation, focusing on different design constraints: high performance (maximum throughput) and the tradeoff between high performance and low power dissipation (an isoperformance target). Low-power techniques supported by commercial standard-cell tools are exercised in this design, such as clock gating, multi-threshold-voltage (multi-VT) cells, and a combination of slow and fast standard cells. We achieved significant power reductions for the architectures with lower frequencies and higher parallelism, slow cells, and, above all, a single pipeline stage.

16.
A novel static random access memory (SRAM)-based hardware architecture for longest prefix match (LPM) search is proposed in this paper. The key concept of the architecture is to store the IP prefixes virtually in the forwarding table. The architecture reduces memory consumption by using a two-tier hierarchical SRAM-based memory structure for maintaining the next-hop port information: next-hop addresses are kept in a shared global memory, the next hop global memory (NHGM), and links to them are maintained in another memory, the next hop link memory (NHLM). This reduces memory consumption by approximately 50–62.5% compared to existing SRAM-based schemes. The proposed architecture takes a single memory write cycle to store an IP prefix and a single memory read cycle for an LPM search; finding the next-hop information, however, incurs two memory read cycles due to the hierarchical next-hop memory structure. The proposed scheme performs an LPM lookup in 1.05–1.31 ns for IPv4 and in 1.05–1.6 ns for IPv6, corresponding to an LPM search throughput of 760–950 million lookups per second (MLPS) for IPv4 and 620–950 MLPS for IPv6; the average search throughput is roughly 850 MLPS for IPv4 and 780 MLPS for IPv6. The numerical results show that the architecture significantly reduces the memory requirement, power consumption, and transistor-count-per-bit requirement.
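As a functional reference for what a single-cycle hardware lookup computes, here is a linear-scan longest prefix match in software (the route table is an illustrative example, not from the paper):

```python
import ipaddress

def longest_prefix_match(dest, table):
    """Linear-scan longest prefix match over a {prefix: next_hop} table.

    Among all prefixes containing the destination address, the one with
    the greatest prefix length wins -- the LPM rule the SRAM architecture
    resolves in one memory read cycle.
    """
    addr = ipaddress.ip_address(dest)
    best_len, best_hop = -1, None
    for prefix, hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, hop
    return best_hop

routes = {"0.0.0.0/0": "port0", "10.0.0.0/8": "port1", "10.1.0.0/16": "port2"}
print(longest_prefix_match("10.1.2.3", routes))   # port2: the longest match wins
```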

17.
The authors present two systolic architectures to speed up the computation of modular multiplication in RSA cryptosystems. In the double-layer architecture, the main operation of Montgomery's algorithm is partitioned into two parallel operations after precomputing the quotient bit. In the non-interlaced architecture, the one-clock-cycle gap between iterations is eliminated by pairing off the double-layer architecture. We compare our architectures with previously proposed Montgomery-based systolic architectures on the basis of both modular multiplication and modular exponentiation. The comparisons indicate that our architectures offer the highest speed together with lower hardware complexity and lower power consumption.
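For reference, the iteration that such systolic arrays pipeline is the radix-2 Montgomery multiplication below: one add-and-halve step per multiplier bit, with the quotient bit deciding whether the modulus is added. A minimal software sketch (the parameter values are illustrative):

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Radix-2 Montgomery multiplication: returns a * b * 2**(-k) mod n.

    One loop iteration per bit of 'a'; requires n odd and a, b < n < 2**k.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b        # conditionally add the multiplicand
        if s & 1:                      # quotient bit q_i = s mod 2: make s
            s += n                     #   even so the halving is exact mod n
        s >>= 1                        # divide by 2 (multiply by 2**-1 mod n)
    return s if s < n else s - n       # result lies in [0, 2n): one subtract

# a * b mod n is recovered with a second Montgomery step using R**2 mod n:
a, b, n, k = 17, 23, 101, 7            # 2**7 = 128 > n
t = montgomery_multiply(a, b, n, k)    # = a * b * 2**-k mod n
r2 = pow(2, 2 * k, n)
print(montgomery_multiply(t, r2, n, k), (a * b) % n)   # both print 88
```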

18.
The microprocessor industry's trend towards many-core architectures has introduced the necessity of devising appropriately scalable applications. When implementing software-based video decoding, the main challenges are the optimized partitioning of decoder operations, efficient tracking of dependencies, and resource synchronization across multiple parallel units. The same applies to hardware implementations of video decoders, where monolithic approaches limit the scalability of the design and the reusability of already implemented core components.

In this paper, we propose an intermediate data stream format (Meta Format Stream) suited to an architectural decomposition of video decoding, replacing the conventional monolithic decoder architecture with a pipelined structure. The Meta Format is forward-oriented, self-contained, and multi-standard capable, so that processing of Meta Streams is independent of the originating bit stream. Our approach does not require special coding settings and is applicable to accelerated decoding of any standards-compliant bit stream. An H.264/AVC multiprocessing proposal is presented as a case study of the potential of our concept; it combines coarse-grained frame-level parallel decoding of the bit stream with fine-grained macroblock-level parallelism in the image processing stage.

The proposed H.264 decoder achieved speedup factors of up to 7.6 on an 8-core machine with 2-way SMT. We report actual decoding speeds of up to 150 frames per second at 2160p resolution.
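Macroblock-level parallelism of the kind mentioned above is commonly exploited as a wavefront: under the standard H.264 intra-prediction and deblocking dependencies, MB(x, y) can start once MB(x-1, y) and MB(x+1, y-1) are done, so all macroblocks on the same anti-diagonal are independent. A generic scheduling sketch (not the paper's exact pipeline):

```python
def wavefront_schedule(mb_cols, mb_rows):
    """Group macroblocks into waves that can be decoded concurrently.

    MB(x, y) depends on MB(x-1, y) and MB(x+1, y-1), so all MBs that share
    the value x + 2*y belong to one wave and have no mutual dependencies.
    """
    waves = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[k] for k in sorted(waves)]

for i, wave in enumerate(wavefront_schedule(mb_cols=5, mb_rows=3)):
    print(f"wave {i}: {wave}")
```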

19.
A novel multisampling time-domain architecture for CMOS imagers with synchronous readout and wide dynamic range is proposed. The proposed multisampling architecture requires only a single bit of memory per pixel instead of the 8 bits typical of time-domain active-pixel architectures. The goal is a time-domain imager with high dynamic range that needs fewer transistors per pixel in order to achieve a higher fill factor. The maximum frame rate is analyzed as a function of the number of bits and the array size; the analysis shows that it is possible to achieve high frame rates and operate in video mode with 10-bit pixel data resolution. We also present an analysis of the impact of comparator offset voltage on the fixed-pattern noise. The architecture was implemented in an imager prototype with a 32 × 32 pixel array fabricated in 0.35 μm AMS CMOS and was characterized for sensitivity, noise, and color response. The pixel measures 30 μm × 26 μm and comprises an n+/psub photodiode, a comparator, and a D flip-flop, with a 16% fill factor.

20.
Three-dimensional discrete wavelet transform architectures
The three-dimensional (3-D) discrete wavelet transform (DWT) suits compression applications well, allowing better compression of 3-D data than two-dimensional (2-D) methods. This paper describes two architectures for the 3-D DWT, called 3DW-I and 3DW-II. The first architecture (3DW-I) is based on folding, whereas the 3DW-II architecture is block-based. Potential applications include high-definition television (HDTV) and medical data compression, such as magnetic resonance imaging (MRI). The 3DW-I architecture implements the 3-D DWT similarly to folded 1-D and 2-D designs. It evenly distributes the processing load onto three sets of filters, with each set performing the calculations for one dimension. The control for this design is very simple, since the data are operated on in a row-column-slice fashion, and due to pipelining all filters are utilized 100% of the time, apart from the start-up and wind-down periods. The 3DW-II architecture uses block inputs to reduce the on-chip memory requirement and has a central control unit that selects which coefficients to pass on to the lowpass and highpass filters; the on-chip memory remains small compared with the input size, since it depends solely on the filter sizes. The 3DW-I and 3DW-II architectures are compared in terms of memory requirements, number of clock cycles, and frames processed per second. The two architectures described are the first 3-D DWT architectures.
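The separable row-column-slice data flow that 3DW-I folds onto three filter sets is easy to state in software. A single-level sketch using Haar filters (an illustrative choice; any lowpass/highpass pair slots into the same structure):

```python
import numpy as np

def haar_1d(x, axis):
    """One Haar DWT level along one axis: lowpass half, then highpass half."""
    x = np.moveaxis(x, axis, 0).astype(np.float64)
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return np.moveaxis(np.concatenate([lo, hi]), 0, axis)

def dwt3(volume):
    """Single-level separable 3-D DWT in row-column-slice order.

    Each pass corresponds to one of the 3DW-I filter sets, and pipelining
    the three passes is what keeps all filters busy in the architecture.
    """
    out = volume
    for axis in range(3):
        out = haar_1d(out, axis)
    return out

coeffs = dwt3(np.arange(64.0).reshape(4, 4, 4))   # eight 2x2x2 subbands
```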
