期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Concurrency techniques for arithmetic coding in JPEG2000

Dyer M. Taubman D. Nooshabadi S. Gupta A.K. 《IEEE transactions on circuits and systems. I, Regular papers》2006,53(6):1203-1213

JPEG2000 is a recently standardized image compression algorithm. The heart of this algorithm is the coding scheme known as embedded block coding with optimal truncation (EBCOT). This contributes the majority of processing time to the compression algorithm. The EBCOT scheme consists of a bit-plane coder coupled to a MQ arithmetic coder. Recent bit-plane coder architectures are capable of producing symbols at a higher rate than the existing MQ arithmetic coders can absorb. Thus, there is a requirement for a high throughput MQ arithmetic coder. We examine the existing MQ arithmetic coder architectures and develop novel techniques capable of absorbing the high symbol rate from high performance bit-plane coders, as well as providing flexible design choices. 相似文献

2.

并行通道编码EBCOT中MQ编码器的硬件设计

李萱郭炜《信息技术》2007,31(5):51-53,57

提出了一种适用于JPEG2000标准中并行通道编码的Embedded Block Coding with Optimized Truncation （EBCOT）高速MQ编码器的硬件架构。首先对JPEG2000标准流程的标码流程选择和字节输出等流程进行改进,使之更适应于硬件实现,并提出一种区间重整时对前导零位数的更简洁的判断方法和电路实现,充分利用硬件并行性,提高了编码速度。进而提出了四级流水的MQ编码器硬件架构,有效提高了MQ编码速率,充分满足并行通道编码的要求。相似文献

3.

A high throughput pass parallel block decoder architecture for JPEG 2000 that prevents stalling in the decoding process

《Integration, the VLSI Journal》2020

The Block Decoder (BD) which is an indispensable component of the JPEG 2000 image compression standard has the highest computational complexity and determines the speed of the overall decoder system. This paper proposes a high throughput pass parallel BD architecture, which can decode more than one bit per clock cycle. In BD, the dependency between context generation and arithmetic decoding unit incorporates stalling and reduces the throughput of the decoding process. The proposed selective byte input and synchronous sample skipping techniques are used to prevent stalling in the decoding process. The proposed architecture achieves 86% more throughput with 50% increment in the hardware cost than that of the best available serial BD architecture. In comparison with the best available pass parallel architecture, throughput improves almost 8.2 times with 61% increment in the hardware cost. Incorporation of the speed up techniques in the design is the main reason for more hardware consumption. The Figure of Merit of the proposed design, which is the ratio of throughput and hardware cost, is more than that of the available BD architectures for typical code block (CB) size of 32 × 32. The ASIC implementation of the proposed design consumes 66 mW power at maximum operating frequency. 相似文献

4.

JPEG2000算术编码器的改进算法与实现

谢慧王娇许磊《电子科技》2010,23(8):15-17

作为新一代静止图像压缩标准的JPEG2000标准拥有压缩比高,支持多分辨率等优点。JPEG2000的编码方式采用了嵌入式码块编码(EBCOT)方式,在编码过程中采用了MQ算术编码。文中分析了它对内容单一、信息量少图像编解码的不足,针对这些不足提出了一种对MQ算术编码器流程的改进算法。这种算法提高了JPEG2000对简单图像压缩的PSNR值,使解码后的图像更加清晰。相似文献

5.

一种基于流水线的MQ编码器FPGA设计

下载免费PDF全文

陆燕王超李杰曹鹏《电子器件》2007,30(4):1314-1317

提出了一种应用于JPEG2000标准的4级流水线MQ编码器设计方案.采用状态超前更新,前导0位超前检测和字节输出缓冲策略,解决了在上下文(CX)状态表更新、归一化及字节输出过程中的反馈和循环等问题,提高了编码效率.同时,对关键路径处算法进行优化,提高了系统工作的时钟频率.该设计使用VHDL语言在RTL级描述,并在FPGA上对其进行了仿真验证.实验表明,在Altera的StratixⅡ EP2S601020C4上,编码器的工作效率可以达到1CxD/cycle,最高工作时钟频率可达99.66 MHz. 相似文献

6.

A high performance MQ encoder architecture in JPEG2000

Kai Liu Author Vitae Yu Zhou Author Vitae Author Vitae Jian Feng Ma Author Vitae 《Integration, the VLSI Journal》2010,43(3):305-317

In this paper, a novel architecture for an MQ arithmetic coder with high throughput is proposed. The architecture can process two symbols in parallel. The main characteristics are eight process elements for the prediction of probability interval A, the combination of calculation units for the code register C with the Byteout&Flush procedure, and the use of a dedicated probability estimation table to decrease the internal memory. From FPGA synthesis results, the architecture’s throughput can reach 96.60 M context symbols per second with an internal memory size of 1509 bits, which is comparable to that of other architectures and suitable for chip implementation. 相似文献

7.

Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme.

Chengyi Xiong Jinwen Tian Jian Liu 《IEEE transactions on image processing》2007,16(3):607-614

Novel architectures for 1-D and 2-D discrete wavelet transform (DWT) by using lifting schemes are presented in this paper. An embedded decimation technique is exploited to optimize the architecture for 1-D DWT, which is designed to receive an input and generate an output with the low- and high-frequency components of original data being available alternately. Based on this 1-D DWT architecture, an efficient line-based architecture for 2-D DWT is further proposed by employing parallel and pipeline techniques, which is mainly composed of two horizontal filter modules and one vertical filter module, working in parallel and pipeline fashion with 100% hardware utilization. This 2-D architecture is called fast architecture (FA) that can perform J levels of decomposition for N * N image in approximately 2N2(1 - 4(-J))/3 internal clock cycles. Moreover, another efficient generic line-based 2-D architecture is proposed by exploiting the parallelism among four subband transforms in lifting-based 2-D DWT, which can perform J levels of decomposition for N * N image in approximately N2(1 - 4(-J))/3 internal clock cycles; hence, it is called high-speed architecture. The throughput rate of the latter is increased by two times when comparing with the former 2-D architecture, but only less additional hardware cost is added. Compared with the works reported in previous literature, the proposed architectures for 2-D DWT are efficient alternatives in tradeoff among hardware cost, throughput rate, output latency and control complexity, etc. 相似文献

8.

一种用于并行H.264编码器的语法元素级分组并行算术编码器体系结构的评估

下载免费PDF全文

陈胜刚陈书明谷会涛刘尧《电子学报》2012,40(2):400-405

设计了一种语法元素指令流驱动的全流水CABAC(Context-based Adaptive Binary Arithmetic Coding)熵编码VLSI结构,并对提出的语法元素级分组并行算术编码器的体系结构进行了设计和开销评估.该并行方法可以与现有符号级并行算法正交,可同时使用,适合大规模片上并行视频编码器;相比标准CABAC,增加约55%的晶体管即可实现2倍以上的符号处理加速比和>1Gbin/s的吞吐率. 相似文献

9.

JPEG2000中DWT-EBCOT联合的高效低存储VLSI结构

郭杰李云松吴成柯刘凯王柯俨《电子与信息学报》2009,31(3):731-735

针对JPEG2000硬件实现中小波变换与编码之间占用大量存储的问题,该文提出一种基于码块的存储方案。通过对码块大小片内存储最大程度的复用以及对其高效简单的调度控制,从面积和功耗两方面减小了硬件实现的开销。在实现中,采用基于行的提升变换结构和比特平面并行的编码方式,提高了效率,确保整个过程的实时处理。实验结果表明:在实时编码要求下,对分辨率为512512的图像分片进行四级9/7或者5/3小波分解,码块大小为3232,采用本文结构所用的存储量与直接使用外部存储器的方法相比可减少80%以上。整个结构已通过FPGA验证,且系统时钟可以工作在100MHz。相似文献

10.

Two-Symbol FPGA Architecture for Fast Arithmetic Encoding in JPEG 2000

Nandini Ramesh Kumar Wei Xiang Yafeng Wang 《Journal of Signal Processing Systems》2012,69(2):213-224

JPEG 2000 is one of the most popular image compression standards offering significant performance advantages over previous image standards. High computational complexity of the JPEG 2000 algorithms makes it necessary to employ methods that overcomes the bottlenecks of the system and hence an efficient solution is imperative. One such crucial algorithms in JPEG 2000 is arithmetic coding and is completely based on bit level operations. In this paper, an efficient hardware implementation of arithmetic coding is proposed which uses efficient pipelining and parallel processing for intermediate blocks. The idea is to provide a two-symbol coding engine, which is efficient in terms of performance, memory and hardware. This architecture is implemented in Verilog hardware definition language and synthesized using Altera field programmable gate array. The only memory unit used in this design is a FIFO (first in first out) of 256 bits to store the CX-D pairs at the input, which is negligible compared to the existing arithmetic coding hardware designs. The simulation and synthesis results show that the operating frequency of the proposed architecture is greater than 100 MHz and it achieves a throughput of 212 Msymbols/sec, which is double the throughput of conventional one-symbol implementation and enables at least 50% throughput increase compared to the existing two-symbol architectures. 相似文献

11.

A Scalable Architecture for MPEG-4 Wavelet Quantization 总被引：3，自引：0，他引：3

Bart Vanhoof Mercedes Peón Gauthier Lafruit Jan Bormans Lode Nachtergaele Ivo Bolsens 《The Journal of VLSI Signal Processing》1999,23(1):93-107

Wavelet-based image compression has been adopted in MPEG-4 for visual texture coding. All wavelet quantization schemes in MPEG-4—Single Quantization (SQ), Multiple Quantization (MQ) and Bi-level Quantization—use Embedded Zero Tree (EZT) coding followed by an adaptive arithmetic coder for the compression and quantization of a wavelet image. This paper presents the OZONE chip, a dedicated hardware coprocessor for EZT and arithmetic coding. Realized in a 0.5 m CMOS technology and operating at 32 MHz, the EZT coder is capable of processing up to 25.6 Mega pixel-bitplanes per second. This is equivalent to the lossless compression of 31.6 8-bit grayscale CIF images (352 × 288) per second. The adaptive arithmetic coder processes up to 10 Mbit per second. The combination of the performance of the EZT coder and the arithmetic coder allows the OZONE to perform visual-lossless compression of more than 30 CIF images per second. Due to its novel and scalable architecture, parallel operation of multiple OZONEs is supported. The OZONE functionality is demonstrated on a PC-based compression system. 相似文献

12.

Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Packet Transform

Wang C. Gan W. S. 《Circuits and Systems II: Express Briefs, IEEE Transactions on》2007,54(5):422-426

This brief presents a novel very large-scale integration (VLSI) architecture for discrete wavelet packet transform (DWPT). By exploiting the in-place nature of the DWPT algorithm, this architecture has an efficient pipeline structure to implement high-throughput processing without any on-chip memory/first-in first out access. A folded architecture for lifting-based wavelet filters is proposed to compute the wavelet butterflies in different groups simultaneously at each decomposition level. According to the comparison results, the proposed VLSI architecture is more efficient than the previous proposed architectures in terms of memory access, hardware regularity and simplicity, and throughput. The folded architecture not only achieves a significant reduction in hardware cost but also maintains both the hardware utilization and high-throughput processing with comparison to the direct mapped tree-structured architecture 相似文献

13.

Three-Parallel Reed-Solomon Decoder Using S-DCME for High-Speed Communications

Jae Do Lee Myung Hoon Sunwoo 《Journal of Signal Processing Systems》2012,66(1):15-24

This paper proposes a high-speed and area-efficient three-parallel Reed-Solomon (RS) decoder using the simplified degree computationless modified Euclid (S-DCME) algorithm for the key equation solver (KES) block. To achieve a high throughput rate, the inner signals, such as the syndrome, error locator and error value polynomials, are computed in parallel. In addition, the key equations are solved by using the S-DCME algorithm to reduce the hardware complexity. To handle the many problems caused by applying the S-DCME algorithm to the KES block, we modify the architectures of some of the blocks in the three-parallel RS decoder. The proposed RS architecture can reduce the hardware complexity by about 80% with respect to the KES block. In addition, the proposed RS architecture has an approximately 25% shorter latency than the conventional parallel RS architectures. 相似文献

14.

Automatic Synthesis of Motion Estimation Processors Based on a New Class of Hardware Architectures

Nuno Roma Leonel Sousa 《The Journal of VLSI Signal Processing》2003,34(3):277-290

A new class of fully parameterizable multiple array architectures for motion estimation in video sequences based on the Full-Search Block-Matching algorithm is proposed in this paper. This class is based on a new and efficient AB2 single array architecture with minimum latency, maximum throughput and full utilization of the hardware resources. It provides the ability to configure the target processor within the boundary values imposed for the configuration parameters concerning the algorithm setup, the processing time and the circuit area. With this purpose, a software configuration tool has been implemented to determine the set of possible configurations which fulfill the requisites of a given video coder. Experimental results using both FPGA and ASIC technologies are presented. In particular, the implementation of a single array processor configuration on a single-chip is illustrated, evidencing the ability to estimate motion vectors in real-time. 相似文献

15.

Cost Effective VLSI Architectures for Full-Search Block-Matching Motion Estimation Algorithm

Zhong L. He Ming L. Liou 《The Journal of VLSI Signal Processing》1997,17(2-3):225-240

In this paper, we present efficient VLSI architectures for full-search block-matching motion estimation (BMME) algorithm. Given a search range, we partition it into sub-search arrays called tiles. By fully exploiting data dependency within a tile, efficient VLSI architectures can be obtained. Using the proposed VLSI architectures, all the block-matchings in a tile can be processed in parallel. All the tiles within a search range can be processed serially or concurrently depending on various requirements. With the consideration of processing speed, hardware cost, and I/O bandwidth, the optimal tile size for a specific video application is analyzed. By partitioning a search range into tiles with appropriate size, flexible VLSI designs with different throughput can be obtained. In this way, cost effective VLSI designs for a wide range of video applications, from H.261 to HDTV, can be achieved. 相似文献

16.

Efficient Error Control Decoder Architectures for Noncoherent Random Linear Network Coding

Jun Lin Hongmei Xie Zhiyuan Yan 《Journal of Signal Processing Systems》2014,76(2):195-209

Random linear network coding is an efficient technique for disseminating information in networks, but it is highly susceptible to errors. Kötter-Kschischang (KK) codes and Mahdavifar-Vardy (MV) codes are two important families of subspace codes that provide error control in noncoherent random linear network coding. List decoding has been used to decode MV codes beyond half distance. Existing hardware implementations of the rank metric decoder for KK codes suffer from limited throughput, long latency and high area complexity. The interpolation-based list decoding algorithm for MV codes still has high computational complexity, and its feasibility for hardware implementations has not been investigated. In this paper we propose efficient decoder architectures for both KK and MV codes and present their hardware implementations. Two serial architectures are proposed for KK and MV codes, respectively. An unfolded decoder architecture, which offers high throughput, is also proposed for KK codes. The synthesis results show that the proposed architectures for KK codes are much more efficient than rank metric decoder architectures, and demonstrate that the proposed decoder architecture for MV codes is affordable. 相似文献

17.

JPEG2000中小波滤波器的定点分析及其VLSI实现

朱珂华林周晓方章倩苓《固体电子学研究与进展》2004,24(4):466-471,504

对JPEG2 0 0 0中推荐的 5 /3整数滤波器和 9/7实数滤波器进行了硬件实现时所需要的有限精度分析 ;确定了小波变换过程中各个参数的最佳数据宽度 ,还确定了整个变换系统的数据通路的数据宽度。基于lifting的小波变换的特点结合嵌入式延拓算法提出了两种小波变换———折叠结构和长流水线结构 ;对两种结构进行了分析比较。最后 ,对折叠结构和相关的其它结构在所需存储单元的数量、存储单元的访问次数、处理能力以及功耗等方面进行了分析比较 ,可以看出文中提出的结构在性能上有明显优点。相似文献

18.

An area-efficient and low-power 64-point pipeline Fast Fourier Transform for OFDM applications

《Integration, the VLSI Journal》2017

In an orthogonal frequency division multiplexing (OFDM) based wireless systems, Fast Fourier Transform (FFT) is a critical block as it occupies large area and consumes more power. In this paper, we present an area-efficient and low power 16-bit word-width 64-point radix-2² and radix-2³ pipelined FFT architectures for an OFDM-based IEEE 802.11a wireless LAN baseband. The designs are derived from radix-2^k algorithm and adopt a Single-Path Delay Feedback (SDF) architecture for hardware implementation. To eliminate the complex multipliers and read-only memory (ROM) which is used for internal storage of twiddle factor coefficients, the proposed 64-point FFT employs a Canonical Signed Digit (CSD) complex constant multiplier using adders, multiplexers and shifters. The complex constant multiplier (CCM) is modified using common sub-expression sharing block that reduces the area of the design. The proposed radix-2² and radix-2³ pipelined FFT architectures are modeled and implemented using TSMC 180 nm CMOS technology with a supply voltage of 1.8 V. The implementation results show that the proposed architectures significantly reduces the hardware cost and power consumption in comparison to existing 64-point FFT architectures. 相似文献

19.

An Efficient VLSI Architecture of Fractional Motion Estimation in H.264 for HDTV 总被引：1，自引：0，他引：1

G. A. Ruiz J. A. Michell 《Journal of Signal Processing Systems》2011,62(3):443-457

Fractional Motion Estimation (FME) in high-definition H.264 presents a significant design challenge in terms of memory bandwidth, latency and area cost as there are various modes and complex mode decision flow, which require over 45% of the computation complexity in the H.264 encoding process. In this paper, a new high-performance VLSI architecture for Fractional Motion Estimation (FME) in H.264/AVC based on the full-search algorithm is presented. This architecture is made up of three different pipeline processors to establish a trade-off between processing time and hardware utilization. The computing scheme based on a 4-pixel interpolation unit with a 10-pixel input bandwidth is capable of processing a macroblock (MB) in 870 clock cycles. The final VLSI implementation only requires 11.4 k gates and 4.4kBytes of RAM in a standard 180 nm CMOS technology operating at 290 MHz. Our design generates the residual image and the best MVs and mode in a high throughput and low area cost architecture while achieving enough processing capacity for 1080HD (1920 × 1088@30fps) real-time video streams. 相似文献

20.

Parallel, Pipelined and Folded Architectures for Computation of 1-D and 2-D DCT in Image and Video Codec

Shen-Fu Hsiao Jian-Ming Tseng 《The Journal of VLSI Signal Processing》2001,28(3):205-220

Several parallel, pipelined and folded architectures with different throughput rates are presented for computation of DCT, one of the fundamental operations in image/video coding. This paper begins with a new decomposition algorithm for the 1-D DCT coefficient matrix. Then the 2-D DCT problem is converted into the corresponding 1-D counterpart through a regular index mapping technique. Afterward, depending on the trade-off between hardware complexity and speed performance, the derived decomposition algorithm is transformed into different parallel-pipelined and folded architectures that realize the butterfly operations and the post-processing operations. Compared to other DCT processor, our proposed parallel-pipelined architectures, without any intermediate transpose memory, have the features of modularity, regularity, locality, scalability, and pipelinability, with arithmetic hardware cost proportional to the logarithm of the transform length. 相似文献