
1.

Cost Effective VLSI Architectures for Full-Search Block-Matching Motion Estimation Algorithm

Zhong L. He Ming L. Liou 《The Journal of VLSI Signal Processing》1997,17(2-3):225-240

In this paper, we present efficient VLSI architectures for the full-search block-matching motion estimation (BMME) algorithm. Given a search range, we partition it into sub-search arrays called tiles. By fully exploiting the data dependency within a tile, efficient VLSI architectures can be obtained. Using the proposed VLSI architectures, all the block-matchings in a tile can be processed in parallel. All the tiles within a search range can be processed serially or concurrently, depending on the requirements. The optimal tile size for a specific video application is analyzed in terms of processing speed, hardware cost, and I/O bandwidth. By partitioning a search range into tiles of appropriate size, flexible VLSI designs with different throughputs can be obtained. In this way, cost-effective VLSI designs can be achieved for a wide range of video applications, from H.261 to HDTV.
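The tile idea can be illustrated in software. The following is a minimal sketch, not the authors' architecture: it assumes a square search range of displacements [-p, p-1] in each direction, the sum of absolute differences (SAD) as the matching cost, and square tiles. The names `sad`, `tiles`, and `full_search_tiled` are hypothetical. In hardware, the candidates inside one tile would be evaluated in parallel; here they are simply looped over.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and the block
    displaced by (dx, dy) in ref. Frames are lists of rows."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def tiles(p, tile):
    """Yield the top-left displacement of each tile covering the
    assumed search range [-p, p-1] x [-p, p-1]."""
    for ty in range(-p, p, tile):
        for tx in range(-p, p, tile):
            yield tx, ty


def full_search_tiled(cur, ref, bx, by, n, p, tile):
    """Full search, visited tile by tile. Returns (cost, dx, dy).
    The caller must ensure every displaced block lies inside ref."""
    best = None
    for tx, ty in tiles(p, tile):
        # All candidates of one tile: parallel in hardware, serial here.
        for dy in range(ty, min(ty + tile, p)):
            for dx in range(tx, min(tx + tile, p)):
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best
```

Because every candidate is still visited, the result does not depend on the tile size; the tile size only changes how much work is grouped together, which is the knob the abstract trades against hardware cost and I/O bandwidth.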

2.

FPGA architecture of the LDPS Motion Estimation for H.264/AVC Video Coding

Moez Kthiri Hassen Loukil Ahmed Ben Atitallah Patrice Kadionik Dominique Dallet Nouri Masmoudi 《Journal of Signal Processing Systems》2012,68(2):273-285

Motion estimation is a computationally demanding operation in the video compression process and significantly affects the output quality of an encoded sequence. Special hardware architectures are required to achieve real-time compression performance. Many fast search block matching motion estimation (BMME) algorithms have been developed to minimize the number of search positions and speed up computation, but they do not take into account how effectively they can be implemented in hardware. In this paper, we propose three new hardware architectures for a fast search block matching motion estimation algorithm using Line Diamond Parallel Search (LDPS) for an H.264/AVC video coding system. These architectures use pipelining and parallel processing techniques and offer minimum latency, maximum throughput and full utilization of hardware resources. The VHDL code for the three proposed architectures has been tested and can run at high frequency on a Xilinx Virtex-5 FPGA.

3.

Analysis and Design of Low-Cost Bit-Serial Architectures for Motion Estimation in H.264/AVC

Mohammad R. H. Fatemi Hasan Ates Rosli Salleh 《Journal of Signal Processing Systems》2013,71(2):111-121

The variable block-size motion estimation (VBSME) process accounts for a major part of the computation of an H.264 encoder and is usually accelerated by bit-parallel hardware architectures with a large I/O bit width to meet real-time constraints. However, such architectures increase the area overhead and pin count, and are therefore unsuitable for area-constrained consumer electronics designs such as small portable multimedia devices. This paper addresses this problem by proposing two area-efficient least significant bit (LSB) first bit-serial architectures with small pin counts. Both designs exploit data reuse in different ways for sum of absolute differences (SAD) computation and for reading reference pixels, leading to a considerable reduction in memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows, whereas the second design reuses the SADs of small blocks and has a reconfigurable reference buffer, leading to better memory bandwidth when hardware parallelism is used. The proposed designs benefit from several optimization techniques, including an efficient serial absolute difference architecture, word length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly enhance their performance in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full search VBSME of 720 × 480 video at 30 frames per second (fps), with two reference frames and a [−16, 15] search range, at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.
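The operating point quoted above can be turned into a rough cycle budget. The sketch below is a back-of-the-envelope calculation only: it assumes 16 × 16 macroblocks and that every integer displacement in the search range is evaluated once per reference frame, and it ignores the subsampling and mode filtering mentioned in the abstract. The function name `candidates_per_second` is hypothetical.

```python
def candidates_per_second(width, height, fps, ref_frames,
                          search_lo, search_hi, mb=16):
    """Number of macroblock-level search positions per second for an
    exhaustive search (assumption: mb x mb macroblocks, square range)."""
    mbs_per_frame = (width // mb) * (height // mb)
    positions = (search_hi - search_lo + 1) ** 2
    return mbs_per_frame * fps * ref_frames * positions


# Figures taken from the abstract: 720 x 480, 30 fps, two reference
# frames, [-16, 15] search range, 414 MHz clock.
rate = candidates_per_second(720, 480, 30, 2, -16, 15)
cycles_per_candidate = 414e6 / rate

print(rate)                             # 82944000
print(round(cycles_per_candidate, 2))   # 4.99
```

Under these assumptions the clock leaves roughly five cycles per search position per macroblock, which gives a sense of how much parallelism and data reuse a bit-serial datapath has to recover.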

4.

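VLSI Architecture Design for Fractional-Pixel Motion Estimation

王庆春 何晓燕 曹喜信《电视技术》2010,34(6)

Based on the algorithmic characteristics of fractional-pixel motion estimation (FME) in H.264/AVC video coding, and targeting the differing requirements of video coding systems, four VLSI implementation structures for FME are proposed, and their hardware utilization and computation speed are compared and analyzed.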

5.

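An Efficient VLSI Architecture for Implementing Motion Estimation Algorithms

舒清明 徐葭生《电子学报》1995,23(5):12-16

This paper proposes a novel low-latency, high-throughput, programmable VLSI tree architecture that implements the FSA and TSSA motion estimation algorithms very efficiently. Compared with other tree architectures, it uses one third fewer processing elements (PEs), and the delay of each PE is halved. A distinctive ME window buffer structure greatly reduces the I/O bandwidth and the number of I/O pins, and an interleaved pipelining technique allows hardware utilization to reach 100%. These features make the architecture well suited to VLSI implementation.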

6.

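A Block-Matching Motion Estimation Algorithm Suited to Low-Cost, Low-Power VLSI Implementation

张武健 邱晓海 周润德 陈弘毅《信号处理》2001,17(1):21-26

This paper presents an improved multiresolution telescopic search (MRTlcS) algorithm for block-matching motion estimation. It replaces the conventional telescopic search with a reverse telescopic search, a change that effectively lowers the on-chip memory capacity and bandwidth required for VLSI implementation. In addition, motion tracking and an adaptive search window are used to reduce the computational complexity of motion estimation. Suitability for low-cost, low-power VLSI implementation is the distinguishing feature of the new algorithm. Simulation results show that the new algorithm requires on average only about 30% of the computation of the MRTlcS algorithm while still achieving similar decoded video quality. The paper also compares the hardware cost and power consumption of the new algorithm and the MRTlcS algorithm when implemented in VLSI.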

7.

A high-performance network architecture for a PA-RISC workstation (cited by: 1; self-citations: 0; citations by others: 1)

Banks D. Prudence M. 《Selected Areas in Communications, IEEE Journal on》1993,11(2):191-202

With current low-cost high-performance workstations, application-to-application throughput is limited more by host memory bandwidth than by the cost of protocol processing. Conventional network architectures are inefficient in their use of this memory bandwidth, because data is copied several times between the application and the network. As network speeds increase further, network architectures must be developed that reduce the demands on host memory bandwidth. The authors discuss the design of a single-copy network architecture, where data is copied directly between the application buffer and the network interface. Protocol processing is performed by the host, and transport layer buffering is provided on the network interface. They describe a prototype implementation for the HP Apollo Series 700 workstation family that consists of an FDDI network interface and a modified 4.3BSD TCP/IP protocol stack, and report some early results that demonstrate twice the throughput of a conventional network architecture and significantly lower latency.

8.

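Memory Control and Its VLSI Optimization in an HDTV SoC Platform (cited by: 2; self-citations: 0; citations by others: 2)

邱琳 郑世宝 王涛 王峰《电视技术》2005,(11):41-44

Based on an analysis of the hardware implementation requirements of video decoding standards, an SoC system architecture and a design strategy for the SDRAM interface controller are proposed, including conflict scheduling and optimizations aimed at improving bandwidth utilization. A two-level request buffer pool, combined with a fixed-priority policy, resolves bus conflicts among shared devices. A bank-interleaving method is proposed to hide read/write wait times and thereby improve bandwidth utilization. In addition, merging idle states is used to make the hardware reusable.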

9.

PMCNOC: A Pipelining Multi-channel Central Caching Network-on-chip Communication Architecture Design

N. Wang A. Sanusi P. Y. Zhao M. Elgamel M. A. Bayoumi 《Journal of Signal Processing Systems》2010,60(3):315-331

With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on a single silicon die, thus enabling highly pipelined operations such as those required for multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single processor and memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006), emerged as promising solutions for future system-on-chip communication architecture designs. However, these existing architectures all suffer from certain problems, including high area cost and communication latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored temporarily in the caches, thus alleviating the effect of head-of-line blocking and deadlock problems and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity and system performance improvement compared to the existing network-on-chip architectures.

10.

Design of a 270 MHz/340 mW processing element for high performance motion estimation systems application

J. Fco. López P. Cortés S. López R. Sarmiento 《Microelectronics Journal》2002,33(12):1123-1134

Emerging architectures, technologies and strategies applied to the development of video processing systems will alleviate the restrictions imposed by the limited bandwidth of present communication channels. This article presents the use of GaAs technology together with different techniques applied to existing motion estimation architectures from the literature. Among the several possible search methods for computing the block-matching algorithm (BMA), the full-search BMA (FBMA) has attracted great interest from the scientific community because of its regularity, optimal solution and low control overhead, which simplify its VLSI realization. Its main drawback is the enormous amount of computation it demands. Of the different ways of overcoming this drawback, the one adopted in this article is the use of an advanced technology, gallium arsenide (GaAs), together with different techniques to reduce area overhead. The implementation of a 270 MHz processing element (PE) for an FBMA scheme is presented. From the results obtained for this basic module, an implementation for MPEG applications is proposed, leading to an architecture running at 145 MHz with a power dissipation of 3.48 W and an area of 11.5 mm².

11.

Fast block-matching algorithm for video coding

Wen-Jyi Hwang Yeong-Cherng Lu Yi-Chong Zeng 《Electronics letters》1997,33(10):833-835

The authors present a new fast block-matching algorithm for video coding. In the algorithm, the partial distance search technique is performed in the wavelet domain to eliminate undesired blocks. The algorithm can reduce computation time without sacrificing the average distortion for encoding. In addition, the algorithm has a lower arithmetic complexity than other existing fast block-matching algorithms.
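Partial distance search can be sketched generically as follows. This is an illustration of the early-termination idea only, under stated assumptions: it works on plain flattened vectors with squared error as the distance and does not model the wavelet transform used in the paper. The names `partial_distance` and `pds_best_match` are hypothetical.

```python
def partial_distance(block, cand, best_so_far):
    """Accumulate the squared error term by term and give up (return None)
    as soon as the running total can no longer beat best_so_far."""
    total = 0
    for a, b in zip(block, cand):
        total += (a - b) ** 2
        if total >= best_so_far:
            return None  # candidate rejected before all terms were summed
    return total


def pds_best_match(block, candidates):
    """Return (index, distance) of the closest candidate, using early
    termination. Ties keep the earliest candidate."""
    best_index, best_dist = -1, float("inf")
    for i, cand in enumerate(candidates):
        d = partial_distance(block, cand, best_dist)
        if d is not None:
            best_index, best_dist = i, d
    return best_index, best_dist
```

A candidate is rejected only once its partial sum already reaches the best complete distance found so far, so the winner is the same as in an exhaustive comparison; only the amount of arithmetic changes, which is consistent with the claim that distortion is not sacrificed.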

12.

Very Low-Complexity Hardware Interleaver for Turbo Decoding

Zhongfeng Wang Qingwei Li 《Circuits and Systems II: Express Briefs, IEEE Transactions on》2007,54(7):636-640

This brief presents a very low complexity hardware interleaver implementation for turbo code in wideband CDMA (W-CDMA) systems. Algorithmic transformations are extensively exploited to reduce the computation complexity and latency. Novel VLSI architectures are developed. The hardware implementation results show that an entire turbo interleave pattern generation unit consumes only 4 k gates, which is an order of magnitude smaller than conventional designs.

13.

Modified four-step block-matching algorithm efficient for hardware implementation (cited by: 1; self-citations: 0; citations by others: 1)

Dong-Ho Lee 《Electronics letters》1999,35(19):1622-1623

A new fast block-matching algorithm, the modified four-step search (MFSS) algorithm, is presented. Simulation results and results from a hardware implementation designed in VHDL show that MFSS performs better and is more suitable for hardware realisation than existing fast algorithms.

14.

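Cache-Based Bandwidth Optimization Design for HEVC Motion Compensation

郭铮言《电视技术》2014,38(15)

As one of the most computation-intensive modules, motion compensation accounts for about 70% of the bandwidth between the decoder and off-chip data memory, making it a bottleneck for ultra-high-definition video decoding. The cache-based HEVC motion compensation module designed here cuts bandwidth consumption by 80% while sustaining the data throughput needed for real-time decoding. First, an interpolation unit built from reusable filters and a 2D cache are used to design a motion compensation module with parallel, pipelined data processing, meeting the high data throughput demanded by the computation. Second, an efficient internal RAM structure is designed, and an effective solution for reducing on-chip cache power consumption is proposed. Finally, the data correlation of reference frames is exploited to reorder the interpolation sequence, reducing the cache hardware overhead by 87.5%. Experimental results on HEVC standard test video sequences based on HM9.0 show that the design significantly reduces bandwidth consumption and hardware overhead.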

15.

VLSI Array Architectures for Pyramid Vector Quantization

Bongjin Jung Wayne P. Burleson 《The Journal of VLSI Signal Processing》1998,18(2):141-154

We present parallel algorithms and array architectures for pyramid vector quantization (PVQ) [1] for use in image coding in low-power wireless systems. PVQ presents an alternative to other quantization methods which is especially suitable for symmetric peer-to-peer communications like video-conferencing. However, both the encoding and decoding algorithms have data-dependent iteration bounds and data-dependent dependencies, which prevent efficient parallelization of the algorithms for either hardware or software implementations. We perform an algorithmic transformation [2] to convert the data-dependent regular algorithms to equivalent data-independent algorithms. The resulting regular algorithms exhibit modular and regular structures with minimal control overhead; hence, they are well suited for VLSI array implementation in ASIC or FPGA technologies. Based on our parallel algorithms and systematic design methodologies [3], we develop linear array architectures. Both encoder and decoder architectures consist of L identical processors with local interconnections and provide O(L) speed-up over a sequential implementation, where L is the dimension of a vector. The architectures achieve 100% processor utilization and permit power savings through early completion. A combined encoder-decoder architecture is also presented.

16.

Hardware implementation and validation of the fast variable block size motion estimation architecture for H.264/AVC

A. Ben Atitallah S. Arous H. Loukil N. Masmoudi 《AEUE-International Journal of Electronics and Communications》2012,66(8):701-710

Block matching motion estimation is the heart of a video coding system. It leads to a high compression ratio, but it is time consuming and computation intensive. Many fast search block matching motion estimation algorithms have been developed to minimize the number of search positions and speed up computation, but they do not take into account how effectively they can be implemented in hardware. In this paper, we propose an efficient hardware architecture for the fast line diamond parallel search (LDPS) algorithm with variable block size motion estimation (VBSME) for an H.264/AVC video coding system. The design is described in VHDL and synthesized to an Altera Stratix III FPGA and to TSMC 0.18 μm standard cells. The throughput of the hardware architecture reaches a processing rate of up to 78 million pixels per second at a clock frequency of 83.5 MHz, and the design uses only 28 kgates when mapped to standard cells. Finally, a system on a programmable chip (SoPC) implementation and validation of the proposed design as an IP core is presented using the embedded video system.

17.

Hardware architectures of adaptive equalizers for the HDTV receiver

Seung Soo Chae Sung Bum Pan Gi Hun Lee Rae-Hong Park Byung-Uk Lee 《Signal Processing, IEEE Transactions on》1998,46(2):391-404


18.

Efficient Implementations for AES Encryption and Decryption (cited by: 1; self-citations: 0; citations by others: 1)

Rashmi Ramesh Rachh P. V. Ananda Mohan B. S. Anami 《Circuits, Systems, and Signal Processing》2012,31(5):1765-1785

This paper proposes two efficient architectures for hardware implementation of the Advanced Encryption Standard (AES) algorithm. The composite field arithmetic for implementing the SubBytes (S-box) and InvSubBytes (Inverse S-box) transformations, investigated by several authors, is used as the basis for deriving the proposed architectures. The first architecture is based, for encryption, on an optimized S-box followed by a bit-wise implementation of MixColumns and AddRoundKey and, for decryption, on an optimized Inverse S-box followed by a bit-wise implementation of InvMixColumns and AddMixRoundKey. The proposed S-box and Inverse S-box used in this architecture are designed as a cascade of three blocks. In the second proposed architecture, block III of the proposed S-box is combined with the MixColumns and AddRoundKey transformations, forming an integrated unit for encryption. An integrated unit for decryption, combining block III of the proposed InvSubBytes with InvMixColumns and AddMixRoundKey, is formed along similar lines. The delays of the proposed architectures for VLSI implementation are found to be the shortest among state-of-the-art implementations of AES operating in non-feedback mode. Iterative and fully unrolled sub-pipelined designs, including the key schedule, are implemented using FPGA and ASIC. The proposed designs are efficient in terms of the Kgates/Giga-bits per second ratio compared with a few recent state-of-the-art ASIC (0.18-μm CMOS standard cell) based designs, and in terms of throughput per area (TPA) for FPGA implementations.

19.

Dynamically adaptive real-time disparity estimation hardware using iterative refinement

Abdulkadir Akin Ipek Baz Alexandre Schmid Yusuf Leblebici 《Integration, the VLSI Journal》2014

The computational complexity of disparity estimation algorithms and the need for large, high-bandwidth external and internal memory make real-time disparity estimation challenging, especially for High Resolution (HR) images. This paper proposes a hardware-oriented adaptive window size disparity estimation (AWDE) algorithm and its real-time reconfigurable hardware implementation, which targets HR video with high quality disparity results. Moreover, an enhanced version of the AWDE implementation that uses iterative refinement (AWDE-IR) is presented. The AWDE and AWDE-IR algorithms dynamically adapt the window size to the local texture of the image to increase the disparity estimation quality. The proposed reconfigurable hardware architectures of the AWDE and AWDE-IR algorithms can handle 60 frames per second on a Virtex-5 FPGA at 1024×768 XGA video resolution for a 128 pixel disparity range.

20.

An Efficient VLSI Architecture of Fractional Motion Estimation in H.264 for HDTV (cited by: 1; self-citations: 0; citations by others: 1)

G. A. Ruiz J. A. Michell 《Journal of Signal Processing Systems》2011,62(3):443-457

Fractional Motion Estimation (FME) in high-definition H.264 presents a significant design challenge in terms of memory bandwidth, latency and area cost, as there are various modes and a complex mode decision flow, which require over 45% of the computational complexity of the H.264 encoding process. In this paper, a new high-performance VLSI architecture for FME in H.264/AVC based on the full-search algorithm is presented. This architecture is made up of three different pipeline processors to establish a trade-off between processing time and hardware utilization. The computing scheme, based on a 4-pixel interpolation unit with a 10-pixel input bandwidth, is capable of processing a macroblock (MB) in 870 clock cycles. The final VLSI implementation requires only 11.4 k gates and 4.4 kBytes of RAM in a standard 180 nm CMOS technology operating at 290 MHz. Our design generates the residual image and the best MVs and mode in a high-throughput, low-area-cost architecture while achieving enough processing capacity for 1080HD (1920 × 1088@30fps) real-time video streams.
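The real-time claim can be checked with simple arithmetic from the figures in the abstract. The sketch below assumes 16 × 16 macroblocks and that the 870 cycles per macroblock is the only cost that matters, ignoring any per-frame overhead; the function name `required_clock_hz` is hypothetical.

```python
def required_clock_hz(width, height, fps, cycles_per_mb, mb=16):
    """Clock rate needed to process every macroblock of every frame,
    assuming mb x mb macroblocks and a fixed cycle cost per macroblock."""
    mbs_per_frame = (width // mb) * (height // mb)
    return mbs_per_frame * fps * cycles_per_mb


# Figures from the abstract: 1920 x 1088 at 30 fps, 870 cycles per MB,
# and a 290 MHz operating frequency.
needed = required_clock_hz(1920, 1088, 30, 870)
utilization = needed / 290e6

print(needed)                  # 212976000
print(round(utilization, 3))   # 0.734
```

Under these assumptions about 213 MHz would suffice, so the stated 290 MHz operating point leaves roughly a quarter of the cycles as headroom.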