Similar literature
1.
In this paper, we present efficient VLSI architectures for the full-search block-matching motion estimation (BMME) algorithm. Given a search range, we partition it into sub-search arrays called tiles. By fully exploiting the data dependency within a tile, efficient VLSI architectures can be obtained. Using the proposed VLSI architectures, all the block matchings in a tile can be processed in parallel. The tiles within a search range can be processed serially or concurrently, depending on the requirements. The optimal tile size for a specific video application is analyzed in terms of processing speed, hardware cost, and I/O bandwidth. By partitioning a search range into tiles of appropriate size, flexible VLSI designs with different throughputs can be obtained. In this way, cost-effective VLSI designs can be achieved for a wide range of video applications, from H.261 to HDTV.
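The tiling idea in this abstract can be illustrated in software. The following is a minimal sketch, not the authors' architecture: the function names (`sad`, `tiles`, `full_search_tiled`), the square tile shape, and the default block, search-range, and tile sizes are all assumptions made for illustration. It only shows that visiting the search range tile by tile covers the same candidates as an ordinary full search; in hardware, the inner loop over a tile would run in parallel.

```python
import numpy as np

def sad(block, candidate):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block.astype(np.int32) - candidate.astype(np.int32)).sum())

def tiles(search_range, tile_size):
    """Yield the displacement lists of square tiles covering [-R, R-1] x [-R, R-1]."""
    offsets = range(-search_range, search_range, tile_size)
    for ty in offsets:
        for tx in offsets:
            yield [(dy, dx)
                   for dy in range(ty, min(ty + tile_size, search_range))
                   for dx in range(tx, min(tx + tile_size, search_range))]

def full_search_tiled(cur, ref, top, left, n=8, search_range=4, tile_size=2):
    """Return (best_sad, (dy, dx)) for the n x n block of cur at (top, left)."""
    block = cur[top:top + n, left:left + n]
    best = None
    for tile in tiles(search_range, tile_size):   # tiles are visited serially here
        for dy, dx in tile:                       # a tile's candidates are independent
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue                          # candidate falls outside the frame
            cost = sad(block, ref[y:y + n, x:x + n])
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

A larger `tile_size` means more candidates evaluated at once (more hardware, fewer passes), which is the speed/cost trade-off the abstract refers to.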

2.
Motion estimation is a computationally demanding operation in the video compression process and significantly affects the output quality of an encoded sequence. Special hardware architectures are required to achieve real-time compression performance. Many fast-search block-matching motion estimation (BMME) algorithms have been developed to minimize the number of search positions and speed up computation, but they do not take into account how they can be implemented effectively in hardware. In this paper, we propose three new hardware architectures for a fast-search block-matching motion estimation algorithm using Line Diamond Parallel Search (LDPS) for H.264/AVC video coding systems. These architectures use pipelining and parallel processing techniques and offer minimum latency, maximum throughput, and full utilization of hardware resources. The VHDL code for the three proposed architectures has been tested and can run at high frequency on a Xilinx Virtex-5 FPGA.

3.
Variable block-size motion estimation (VBSME) accounts for a major part of the computation in an H.264 encoder and is usually accelerated by bit-parallel hardware architectures with large I/O bit widths to meet real-time constraints. However, such architectures increase the area overhead and pin count, and are therefore not suitable for area-constrained consumer electronics designs such as small portable multimedia devices. This paper addresses this problem by proposing two area-efficient least significant bit (LSB) first bit-serial architectures with small pin counts. Both designs exploit data reuse, in different ways, for sum of absolute differences (SAD) computation and for reading reference pixels, leading to a considerable reduction in memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows, whereas the second design reuses the SADs of small blocks and has a reconfigurable reference buffer, leading to better memory bandwidth when hardware parallelism is used. The proposed designs benefit from several optimization techniques, including an efficient serial absolute difference architecture, word length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly improve their performance in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full-search VBSME of 720 × 480 video at 30 frames per second (fps), with two reference frames and a [−16, 15] search range, at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.
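Reusing the SADs of small blocks, as the second design is described as doing, can be sketched in software. This is an illustrative sketch under stated assumptions, not the paper's bit-serial datapath: the function names `sad_4x4_grid` and `merge_sads` are hypothetical, and partition labels follow the usual H.264 width × height convention. The point is that every larger partition's SAD is a sum of 4 × 4 SADs, so no pixel difference is computed twice.

```python
import numpy as np

def sad_4x4_grid(cur_mb, ref_mb):
    """Return a 4 x 4 array with the SAD of each 4 x 4 sub-block of a 16 x 16 macroblock."""
    diff = np.abs(cur_mb.astype(np.int32) - ref_mb.astype(np.int32))
    # Axes become (block row, pixel row, block col, pixel col); sum over the pixel axes.
    return diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))

def merge_sads(s4):
    """Build the SADs of all larger partitions by adding 4 x 4 SADs only."""
    s8 = s4.reshape(2, 2, 2, 2).sum(axis=(1, 3))   # four 8 x 8 blocks
    return {
        "4x4": s4,
        "8x4": s4[:, 0::2] + s4[:, 1::2],   # 8 wide, 4 high: horizontal neighbours
        "4x8": s4[0::2, :] + s4[1::2, :],   # 4 wide, 8 high: vertical neighbours
        "8x8": s8,
        "16x8": s8.sum(axis=1),             # top and bottom halves
        "8x16": s8.sum(axis=0),             # left and right halves
        "16x16": int(s8.sum()),
    }
```

For one candidate position this yields 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 partition SADs from a single pass over the 256 pixel differences.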

4.
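Based on the algorithmic characteristics of fractional-pixel motion estimation (FME) in H.264/AVC video coding, and targeting the differing specific requirements of video coding systems, four VLSI implementation architectures for FME are proposed, and their hardware utilization and computation speed are compared and analyzed.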

5.
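A VLSI architecture for efficient implementation of motion estimation algorithms   Total citations: 2, self-citations: 2, citations by others: 0
This paper proposes a novel low-latency, high-throughput, programmable VLSI tree architecture that implements the FSA and TSSA motion estimation algorithms very effectively. It uses one third fewer processing elements (PEs) than other tree architectures, and the delay of each PE is halved. A distinctive ME window buffer structure greatly reduces the I/O bandwidth and I/O pin count, and an interleaved pipelining technique allows hardware utilization to reach 100%. These features make the architecture well suited to VLSI implementation.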

6.
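This paper presents an improved multi-resolution telescopic search (MRTlcS) algorithm for block-matching motion estimation. It replaces the conventional telescopic search with a reverse telescopic search, which effectively reduces the on-chip memory capacity and bandwidth required for VLSI implementation. In addition, motion tracking and an adaptive search window are used to reduce the computational complexity of motion estimation. Suitability for low-cost, low-power VLSI implementation is the distinguishing feature of the new algorithm. Simulation results show that the new algorithm requires on average only about 30% of the computation of the MRTlcS algorithm while still delivering similar decoded video quality. The paper also compares the hardware cost and power consumption of the new algorithm and the MRTlcS algorithm when implemented in VLSI.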

7.
A high-performance network architecture for a PA-RISC workstation   Total citations: 1, self-citations: 0, citations by others: 1
With current low-cost high-performance workstations, application-to-application throughput is limited more by host memory bandwidth than by the cost of protocol processing. Conventional network architectures are inefficient in their use of this memory bandwidth, because data is copied several times between the application and the network. As network speeds increase further, network architectures must be developed that reduce the demands on host memory bandwidth. The authors discuss the design of a single-copy network architecture, where data is copied directly between the application buffer and the network interface. Protocol processing is performed by the host, and transport layer buffering is provided on the network interface. They describe a prototype implementation for the HP Apollo Series 700 workstation family that consists of an FDDI network interface and a modified 4.3BSD TCP/IP protocol stack, and report some early results that demonstrate twice the throughput of a conventional network architecture and significantly lower latency.

8.
Memory control and its VLSI optimization in an HDTV SoC platform   Total citations: 2, self-citations: 0, citations by others: 2
邱琳  郑世宝  王涛  王峰 《电视技术》2005,(11):41-44
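Based on an analysis of the hardware implementation requirements of video decoding standards, design strategies are proposed for the SoC system architecture and the SDRAM interface controller, including conflict scheduling and optimizations aimed at improving bandwidth utilization. A two-level request buffer pool, combined with a fixed-priority policy, resolves bus conflicts among devices sharing the bus. A bank interleaving method is proposed to hide read/write latency and thereby improve bandwidth utilization. In addition, idle states are merged to make the hardware reusable.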

9.
With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on a single silicon die, thus enabling highly pipelined operations such as those required for multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single-processor and memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as the Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006) architectures, emerged as promising solutions for future system-on-chip communication architecture designs. However, these existing architectures all suffer from certain problems, including high area cost and communication latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored temporarily in the caches, thus alleviating the effect of head-of-line blocking and deadlock problems and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity and improved system performance compared to the existing network-on-chip architectures.

10.
Emerging architectures, technologies and strategies applied to the development of systems for video processing will alleviate the restrictions imposed by limited bandwidth in present communication channels. This article presents the use of GaAs technology, together with different techniques applied to existing motion estimation architectures from the literature. Among the several possible search methods for computing the block-matching algorithm (BMA), the full-search BMA (FBMA) has attracted great interest from the scientific community due to its regularity, optimal solution and low control overhead, which simplifies its VLSI realization. On the other hand, its main drawback is that it demands an enormous amount of computation. There are different ways of overcoming this drawback; the one adopted in this article is the use of an advanced technology, gallium arsenide (GaAs), together with different techniques to reduce area overhead. The implementation of a 270 MHz processing element (PE) for an FBMA scheme is presented in this paper. From the results obtained for this basic module, an implementation for MPEG applications is proposed, leading to an architecture running at 145 MHz with a power dissipation of 3.48 W and an area of 11.5 mm².

11.
The authors present a new fast block-matching algorithm for video coding. In the algorithm, the partial distance search technique is performed in the wavelet domain to eliminate undesired blocks. The algorithm can reduce computation time without sacrificing the average distortion for encoding. In addition, the algorithm has a lower arithmetic complexity than other existing fast block-matching algorithms.
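Partial distance search itself can be sketched briefly. The example below is a simplification under stated assumptions: it operates on plain pixel blocks rather than in the wavelet domain used by the authors, it uses the sum of absolute differences as the distortion measure, and the names `partial_distance_sad` and `search` are hypothetical. Because the running sum never decreases, a candidate can be dropped as soon as its partial sum reaches the best distortion found so far, without changing the final result.

```python
import numpy as np

def partial_distance_sad(block, candidate, best_so_far):
    """Accumulate the SAD row by row; return None once it cannot beat best_so_far."""
    total = 0
    for row_b, row_c in zip(block.astype(np.int32), candidate.astype(np.int32)):
        total += int(np.abs(row_b - row_c).sum())
        if total >= best_so_far:
            return None            # candidate eliminated before all rows are read
    return total

def search(block, candidates):
    """Return (best_sad, index) over a list of candidate blocks."""
    best, best_idx = float("inf"), -1
    for idx, cand in enumerate(candidates):
        cost = partial_distance_sad(block, cand, best)
        if cost is not None:       # strictly better than every earlier candidate
            best, best_idx = cost, idx
    return best, best_idx
```

The saving grows when a good match is found early, since later candidates are then rejected after only a few rows.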

12.
This brief presents a very low complexity hardware interleaver implementation for turbo code in wideband CDMA (W-CDMA) systems. Algorithmic transformations are extensively exploited to reduce the computation complexity and latency. Novel VLSI architectures are developed. The hardware implementation results show that an entire turbo interleave pattern generation unit consumes only 4 k gates, which is an order of magnitude smaller than conventional designs.

13.
Dong-Ho Lee 《Electronics letters》1999,35(19):1622-1623
A new fast block-matching algorithm, the modified four-step search (MFSS) algorithm, is presented. Simulation results and hardware implementation results from a VHDL design show that MFSS performs better and is more suitable for hardware realisation than the existing fast algorithms.

14.
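As one of the most computation-intensive modules, motion compensation accounts for about 70% of the bandwidth between the decoder and off-chip data memory and is the bottleneck in achieving ultra-high-definition video decoding. The Cache-based HEVC motion compensation module designed here effectively reduces bandwidth consumption by 80% while sustaining the data throughput needed for real-time decoding. First, an interpolation module built from reusable filters and a 2D Cache are used to design a motion compensation module with parallel, pipelined data processing, meeting the high data throughput demanded by the computation. Second, an efficient internal RAM structure is designed, and an effective solution for reducing on-chip Cache power consumption is proposed. Finally, the data correlation among reference frames is exploited to reorder the interpolation sequence, reducing the hardware cost of the Cache by 87.5%. Experimental results on the HEVC standard test video sequences, based on HM9.0, show that the design significantly reduces bandwidth consumption and hardware cost.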

15.
We present parallel algorithms and array architectures for pyramid vector quantization (PVQ) [1] for use in image coding in low-power wireless systems. PVQ presents an alternative to other quantization methods which is especially suitable for symmetric peer-to-peer communications like video-conferencing. However, both the encoding and decoding algorithms have data-dependent iteration bounds and data-dependent dependencies, which prevent efficient parallelization of the algorithms for either hardware or software implementations. We perform an algorithmic transformation [2] to convert the data-dependent regular algorithms to equivalent data-independent algorithms. The resulting regular algorithms exhibit modular and regular structures with minimal control overhead; hence, they are well suited for VLSI array implementation in ASIC or FPGA technologies. Based on our parallel algorithms and systematic design methodologies [3], we develop linear array architectures. Both encoder and decoder architectures consist of L identical processors with local interconnections and provide O(L) speed-up over a sequential implementation, where L is the dimension of a vector. The architectures achieve 100% processor utilization and permit power savings through early completion. A combined encoder-decoder architecture is also presented.

16.
Block-matching motion estimation is the heart of a video coding system. It leads to a high compression ratio, but it is time-consuming and computation-intensive. Many fast-search block-matching motion estimation algorithms have been developed to minimize the number of search positions and speed up computation, but they do not take into account how they can be implemented effectively in hardware. In this paper, we propose an efficient hardware architecture for the fast line diamond parallel search (LDPS) algorithm with variable block size motion estimation (VBSME) for H.264/AVC video coding systems. The design is described in VHDL and synthesized to an Altera Stratix III FPGA and to TSMC 0.18 μm standard cells. The hardware architecture reaches a throughput of up to 78 million pixels per second at a clock frequency of 83.5 MHz and uses only 28 kgates when mapped to standard cells. Finally, a system on a programmable chip (SoPC) implementation and validation of the proposed design as an IP core is presented using the embedded video system.

17.
18.
Efficient Implementations for AES Encryption and Decryption   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes two efficient architectures for hardware implementation of the Advanced Encryption Standard (AES) algorithm. The composite field arithmetic for implementing the SubBytes (S-box) and InvSubBytes (Inverse S-box) transformations, investigated by several authors, is used as the basis for deriving the proposed architectures. The first architecture is based on an optimized S-box followed by a bit-wise implementation of MixColumns and AddRoundKey for encryption, and on an optimized Inverse S-box followed by a bit-wise implementation of InvMixColumns and AddMixRoundKey for decryption. The proposed S-box and Inverse S-box used in this architecture are designed as a cascade of three blocks. In the second proposed architecture, block III of the proposed S-box is combined with the MixColumns and AddRoundKey transformations, forming an integrated unit for encryption. An integrated unit for decryption, combining block III of the proposed InvSubBytes with InvMixColumns and AddMixRoundKey, is formed along similar lines. The delays of the proposed architectures for VLSI implementation are found to be the shortest compared to the state-of-the-art implementations of AES operating in non-feedback mode. Iterative and fully unrolled sub-pipelined designs, including the key schedule, are implemented using FPGA and ASIC. The proposed designs are efficient in terms of the Kgates/Giga-bits per second ratio compared with a few recent state-of-the-art ASIC (0.18-μm CMOS standard cell) based designs, and in terms of throughput per area (TPA) for FPGA implementations.

19.
Fractional Motion Estimation (FME) in high-definition H.264 presents a significant design challenge in terms of memory bandwidth, latency and area cost, as there are various modes and a complex mode decision flow, which account for over 45% of the computational complexity of the H.264 encoding process. In this paper, a new high-performance VLSI architecture for FME in H.264/AVC based on the full-search algorithm is presented. This architecture is made up of three different pipeline processors to establish a trade-off between processing time and hardware utilization. The computing scheme, based on a 4-pixel interpolation unit with a 10-pixel input bandwidth, is capable of processing a macroblock (MB) in 870 clock cycles. The final VLSI implementation requires only 11.4 k gates and 4.4 kBytes of RAM in a standard 180 nm CMOS technology operating at 290 MHz. Our design generates the residual image and the best MVs and mode in a high-throughput, low-area-cost architecture while achieving enough processing capacity for 1080HD (1920 × 1088@30fps) real-time video streams.
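The real-time claim can be checked with simple arithmetic from the figures quoted in the abstract (870 cycles per macroblock, 290 MHz, 1920 × 1088 at 30 fps). The helper name `required_clock_hz` is hypothetical, and the 16 × 16 macroblock size is the standard H.264 value rather than something stated in the abstract.

```python
def required_clock_hz(width, height, fps, cycles_per_mb, mb_size=16):
    """Clock frequency needed to process every macroblock of every frame in real time."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)   # 120 * 68 for 1920 x 1088
    return mbs_per_frame * fps * cycles_per_mb

need = required_clock_hz(1920, 1088, 30, 870)
print(need / 1e6)      # required clock in MHz
print(need / 290e6)    # fraction of the 290 MHz clock that is used
```

With 8160 macroblocks per frame, the requirement is 212,976,000 cycles per second, roughly 213 MHz or about 73% of the 290 MHz clock, which is consistent with the stated real-time capacity.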

20.
This paper presents an integrated systolic array design for implementing full-search block matching, the 2-D discrete wavelet transform, and full-search vector quantization on the same VLSI architecture. These functions are the prime components of video compression and require a great amount of computation. To meet real-time application requirements, many systolic array architectures have been proposed, each performing one of those functions individually. However, these functions involve similar computational procedures, and their matrix-vector product forms are quite analogous. After extracting the common computation component, we design an integrated one-dimensional systolic array that can perform the three aforementioned functions. The proposed architecture can efficiently perform three typical functions: (1) full-search block matching with a block size of 16 × 16 and a search area from −8 to 7; (2) the 2-D 2-level Haar transform with a block size of 8 × 8; and (3) full-search vector quantization with an input vector size of 2 × 2. A utilization rate of 97% to 100% is achieved when executing full-search block matching and full-search vector quantization. When performing the 2-D discrete wavelet transform, the utilization rate is about 32%. The proposed integrated architecture lowers the hardware cost and simplifies the hardware structure. It is suitable for VLSI implementation in video/image compression applications.
