Found 20 similar documents (search time: 31 ms)
1.
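A low-power 2-D DCT/IDCT processor is presented. To reduce power, the design uses a row-column decomposition structure and Loeffler's fast DCT/IDCT algorithm, together with zero-input bypass, clock gating, and truncation, lowering system power while still meeting the design requirements. The constant-coefficient multiplier is a key component of the processor. The paper designs a new low-power constant-coefficient multiplier based on a parallel multiplier structure; it uses CSD encoding and the Wallace tree multiplication algorithm, combined with truncation and variable-correction optimizations, which considerably improves the overall performance of the 2-D DCT/IDCT processor. The design runs at a clock frequency of 100 MHz, which is sufficient for real-time MPEG-2 MP@HL decoding. Synthesized in the SMIC 0.18 μm process, the 2-D DCT/IDCT processor occupies 341 212 μm² and consumes 14.971 mW. Analysis and comparison with 2-D DCT/IDCT processors of other structures shows that the design achieves lower power while supporting real-time MPEG-2 MP@HL decoding.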
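The CSD (canonical signed digit) encoding mentioned in the abstract above can be illustrated with a minimal sketch. It is not the multiplier from the paper: the function names `to_csd` and `csd_multiply` are hypothetical, the constants are arbitrary, and the Wallace tree, truncation, and correction steps are not modeled. The sketch only shows how a constant is recoded into digits from {-1, 0, +1} with no two adjacent nonzero digits, so that multiplying by it needs fewer shift-and-add terms.

```python
def to_csd(value: int) -> list[int]:
    """Return CSD digits (each -1, 0, or +1), least-significant digit first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative constants only")
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives digit +1; value mod 4 == 3 gives digit -1,
            # which forces the next digit to be 0 (non-adjacent form).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x: int, digits: list[int]) -> int:
    """Multiply x by the CSD-encoded constant using only shifts, adds, and subtracts."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc
```

For example, the constant 7 (binary 111, three nonzero bits) is recoded as 8 - 1, so `csd_multiply(x, to_csd(7))` uses one subtraction instead of two additions.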
2.
3.
New matrix formulation for two-dimensional DCT/IDCT computation and its distributed-memory VLSI implementation  Cited by: 1 (self-citations: 0, citations by others: 1)
A direct method for the computation of 2-D DCT/IDCT on a linear-array architecture is presented. The 2-D DCT/IDCT is first converted into its corresponding 1-D DCT/IDCT problem through proper input/output index reordering. Then, a new coefficient matrix factorisation is derived, leading to a cascade of several basic computation blocks. Unlike other previously proposed high-speed 2-D N × N DCT/IDCT processors that usually require intermediate transpose memory and have computation complexity O(N³), the proposed hardware-efficient architecture with distributed memory structure has computation complexity O(N² log₂ N) and requires only log₂ N multipliers. The new pipelinable and scalable 2-D DCT/IDCT processor uses storage elements local to the processing elements and thus does not require any address generation hardware or global memory-to-array routing.
4.
Johnson D. Akella V. Stott B. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1998,6(4):731-740
We describe the design and implementation of an asynchronous discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) processor core compliant with the CCITT recommendation H.261. First, a micropipelined implementation with level-sensitive latches is shown. This is improved by replacing the level-sensitive latches with dual-edge triggered flip-flops to save power and using completion-detection adders in the critical stage of the pipeline to exploit the data-dependent processing delay. Gate-level simulation of extracted layouts indicates that the performance of asynchronous implementations is comparable with that of a synchronous implementation based on an identical architecture. This is because part of the penalty introduced by handshaking circuitry in an asynchronous pipeline can be recovered by exploiting data-dependent processing delays with completion-detection circuitry. In pipelines with significant arithmetic processing such as the DCT/IDCT processor, this is easily accomplished. Our results are encouraging because asynchronous designs do not employ global clocking. In the near future when clock generation, clock distribution, and the power consumed in the clock circuitry become limiting factors in the design of large synchronous application specific integrated circuits (ASICs), asynchronous implementation methodology could be pursued as a real alternative.
5.
Hsiao S.-F. Shiue W.-R. Tseng J.-M. 《Vision, Image and Signal Processing, IEE Proceedings -》2000,147(5):400-408
A new linear-array architecture for computation of both the discrete cosine transform (DCT) and the inverse DCT (IDCT) is derived from the heterogeneous dependence graphs representing the factorised coefficient matrices in the matrix formulation of the recursive algorithm. Using the Kronecker product representation of the order-recursive algorithm, it is observed that the kernel operations of the DCT and IDCT can be merged together by proper input/output data reordering. The processor, containing only O(log₂ N) stages, is fully pipelinable and easily scalable to compute longer DCT/IDCTs whose transform length N is a power of two. Owing to the systematic matrix formulation and the corresponding efficient architectural design, the new DCT/IDCT processor has the advantages of high throughput and low hardware cost. Furthermore, the power consumption can be reduced significantly by turning off the operation of the arithmetic units whenever possible.
6.
Sungjoo Suh Seong Soo Chun Myung-Hee Lee Sanghoon Sull 《Electronics letters》2003,39(6):514-515
An efficient image downconversion algorithm in the compressed domain for mixed field/frame-mode macroblocks is proposed. A 16 × 16 field/frame-mode macroblock is converted into an 8 × 8 reduced block in the discrete cosine transform (DCT) domain using a modified inverse DCT (IDCT) kernel. Experimental results show that the proposed algorithm provides downconverted image quality similar to that of an existing method at significantly lower computational cost.
7.
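An area-optimized 2-D DCT/IDCT implementation with dynamic precision matching  Cited by: 1 (self-citations: 0, citations by others: 1)
An ASIC implementation of a modified Loeffler algorithm for the 2-D DCT/IDCT recommended by the JPEG standard is presented. The design reuses hardware, sharing one arithmetic circuit between the forward and inverse transforms, to optimize area. It also pre-examines the coefficients of the input data, which effectively increases processing speed and reduces power for particular inputs. In addition, following the JPEG architecture, dynamic precision matching is established between the DCT and the quantizer, preserving image quality and power efficiency across different compression ratios. The circuit is used in a JPEG image-processing ASIC for a 1.4-megapixel digital camera and has passed FPGA verification and tape-out testing.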
8.
This paper presents a new algorithm for the fast computation of multidimensional (m-D) discrete cosine transform (DCT) with size N₁ × N₂ × ··· × N_m, where N_i is a power of 2 and N_i ≤ 256, by using the tensor product decomposition of the transform matrix. It is shown that the m-D DCT or inverse discrete cosine transform (IDCT) on these small sizes can be computed using only one-dimensional (1-D) DCTs and additions and shifts. If all the dimensional sizes are the same, the total number of multiplications required for the algorithm is only 1/m of that required for the conventional row-column method. We also introduce approaches for computing scaled DCTs in which the number of multiplications is considerably reduced.
9.
Yoichi Katayama Toshiaki Kitsuki Yasushi Ooi 《The Journal of VLSI Signal Processing》1999,22(1):59-64
This paper describes a block processing unit in a single-chip MPEG-2 MP@ML video encoder LSI. The block processing unit executes algorithms such as a discrete cosine transform (DCT), a quantization, an inverse quantization, and an inverse discrete cosine transform (IDCT). A double-block pipeline scheme has been introduced to execute DCT and IDCT operations on the shared circuits. Using a time-multiplexed DCT/IDCT architecture, we achieve processing performance of 2.0 clk/pel. This architecture has 21% fewer transistors and 30% less power dissipation than a conventional one. The block processing unit contains 240 kTr, which is 7.7% of the chip total. By controlling the clock signal supply, power dissipation can be reduced to 43%, which is about 400 mW at 3.3 V using a 0.35 μm triple-layer metal CMOS cell-base technology at 54 MHz.
10.
11.
Bouguezel S. Ahmad M.O. Swamy M.N.S. 《IEEE transactions on circuits and systems. I, Regular papers》2006,53(2):306-315
In this paper, new three-dimensional (3-D) radix-(2 × 2 × 2)/(4 × 4 × 4) and radix-(2 × 2 × 2)/(8 × 8 × 8) decimation-in-frequency (DIF) fast Fourier transform (FFT) algorithms are developed and their implementation schemes discussed. The algorithms are developed by introducing the radix-2/4 and radix-2/8 approaches in the computation of the 3-D DFT using the Kronecker product and appropriate index mappings. The butterflies of the proposed algorithms are characterized by simple closed-form expressions facilitating easy software or hardware implementations of the algorithms. Comparisons between the proposed algorithms and the existing 3-D radix-(2 × 2 × 2) FFT algorithm are carried out showing that significant savings in terms of the number of arithmetic operations, data transfers, and twiddle factor evaluations or accesses to the lookup table can be achieved using the radix-(2 × 2 × 2)/(4 × 4 × 4) DIF FFT algorithm over the radix-(2 × 2 × 2) FFT algorithm. It is also established that further savings can be achieved by using the radix-(2 × 2 × 2)/(8 × 8 × 8) DIF FFT algorithm.
12.
Ahmed Patrice Fahmi Patrice Nouri Herve 《AEUE-International Journal of Electronics and Communications》2007,61(9):605-620
In this paper, we present an efficient HW/SW codesign architecture for an H.263 video encoder and its FPGA implementation. Each module of the encoder is investigated to determine whether HW or SW is the better approach for achieving real-time processing speed as well as flexibility. The hardware portions include the Discrete Cosine Transform (DCT), inverse DCT (IDCT), quantization (Q) and inverse quantization (IQ). The remaining parts are realized in software executed by the NIOS II softcore processor. This paper also introduces efficient design methods for HW and SW modules. In hardware, an efficient architecture for the 2-D DCT/IDCT is suggested to reduce the chip size. NIOS II custom instruction logic is used to implement Q/IQ. Software optimization is also explored by using a fast block-matching algorithm for motion estimation (ME). The whole design is described in VHDL, verified in simulations and implemented in a Stratix II EP2S60 FPGA. Finally, the encoder has been tested on the Altera NIOS II development board and can work up to 120 MHz. Implementation results show that when HW/SW codesign is used, a 15.8-16.5 times improvement in coding speed is obtained compared to the software-based solution.
13.
《Solid-State Circuits, IEEE Journal of》1982,17(4):638-647
Multiplication is frequently the speed-limiting function in digital signal processing systems. High-speed hardware multiplier ICs can therefore greatly enhance the throughput and bandwidth of many digital systems. In this paper, the design, fabrication, and performance of GaAs parallel multipliers are discussed. The largest of these circuits, an 8 × 8 bit multiplier, has 1008 gates, and is by far the most complex GaAs IC demonstrated today. This multiplier forms the 16 bit product of two 8 bit input numbers in 5.25 ns. This corresponds to an equivalent gate propagation delay of 150 ps/gate. The power dissipation ranges between 0.6 and 2 mW/gate.
14.
A new algorithm to compute the DCT and its inverse  Cited by: 2 (self-citations: 0, citations by others: 2)
A novel algorithm to convert the discrete cosine transform (DCT) to skew-circular convolutions is presented. The motivation for developing such an algorithm is the fact that VLSI implementation of distributed arithmetic is very efficient for computing convolutions. It is also shown that the inverse DCT (IDCT) can be computed using the same building blocks which are used for computing the DCT. A DCT/IDCT processor can be designed to compute either the DCT or the IDCT depending on a 1-b control signal.
15.
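A new FPGA architecture for the 2-D DCT and IDCT is proposed. A fast row-column algorithm decomposes the 2-D transform into two 1-D transforms, each implemented as a parallel pipeline that processes 8 data items per clock cycle, which greatly increases the circuit's data throughput and computation speed. Simulation with ModelSim confirms the functional correctness of the design. One 8 × 8 block 2-D DCT takes only 16 clock cycles, which meets the real-time requirements of image and video processing.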
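The row-column decomposition described in the abstract above can be sketched in software. This is a minimal floating-point illustration, not the FPGA design: the function names are hypothetical, the 1-D transform is the direct O(N²) orthonormal DCT-II rather than a fast algorithm, and nothing here models the pipeline or the 16-cycle timing. It shows only that applying a 1-D DCT to every row and then to every column yields the 2-D DCT.

```python
import math


def dct_1d(vec):
    """Orthonormal 1-D DCT-II computed directly from its definition."""
    n = len(vec)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            vec[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def transpose(block):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """2-D DCT as two 1-D passes: first over rows, then over columns."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)
```

In hardware, the transpose between the two passes is typically what a transpose memory provides; here it is simply a list operation.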
16.
《Solid-State Circuits, IEEE Journal of》1986,21(6):976-982
An 8-bit high-speed A/D converter has been developed in a 1.5-μm bulk CMOS double-polysilicon process technology. The design, process technology, and performance of the converter are described. In order to achieve high speed and low power, a fine-pattern process technology and a novel capacitor structure have been introduced and the transistor sizes of a chopper-type comparator have been optimized. High speed (30 MS/s) and low power consumption (60 mW) have been obtained. Computerized evaluations such as the histogram test and the fast Fourier transform test have been used to measure dynamic performance. The linearity error in dynamic operation is less than ±1 LSB. Signal-to-peak-noise ratio is 40 dB at a sampling rate of 14.32 MS/s and an input frequency of 1.42 MHz.
17.
18.
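The MPEG audio/video standards use the DCT (discrete cosine transform) and IDCT (inverse discrete cosine transform) to compress data. In a digital audio/video MP3 decoding circuit, the IDCT is the most computation-intensive and time-consuming part of the decoding process, so the speed of the IDCT is critical to the speed of MP3 decoding as a whole. Among the many ways to implement the decoding process, a chip implementation is the fastest. This design is described in RTL-level Verilog, synthesized into a gate-level circuit with Synplify Pro, verified by simulation in ModelSim, and then downloaded to a Xilinx Virtex FPGA. The results show that the circuit works correctly and reliably and is fast enough for real-time MP3 playback.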
19.
Fast 2-dimensional 4 × 4 forward integer transform implementation for H.264/AVC  Cited by: 1 (self-citations: 0, citations by others: 1)
Chih-Peng Fan 《Circuits and Systems II: Express Briefs, IEEE Transactions on》2006,53(3):174-177
In this paper, a novel two-dimensional (2-D) fast algorithm for realization of the 4 × 4 forward integer transform in H.264 is proposed. Based on matrix operations with Kronecker product and direct sum, the efficient fast 2-D 4 × 4 forward integer transform can be derived from the proposed one-dimensional fast 4 × 4 forward integer transform through matrix decompositions. The proposed fast 2-D 4 × 4 forward integer transform design does not need transpose memory for direct parallel pipelined architecture. The fast 2-D 4 × 4 forward integer transform requires fewer latency delays than the state-of-the-art methods. With regular modularity, the proposed fast algorithm is suitable for VLSI implementation to achieve real-time H.264/advanced video coding (AVC) signal processing.
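For orientation, the 4 × 4 forward integer transform that the abstract above refers to can be sketched as follows. This is the conventional row-column butterfly form, not the direct 2-D algorithm proposed in the paper, and it does use a transposition step between passes. The function names are hypothetical, and the post-scaling that H.264 folds into quantization is omitted. The core matrix `CF` is the standard H.264 forward transform matrix, and the matrix version Y = CF · X · CFᵀ is included only as a cross-check.

```python
# Standard H.264 4x4 forward core transform matrix.
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def forward_1d(x):
    """1-D butterfly: computes CF applied to x using adds, subtracts, and shifts."""
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def forward_4x4(block):
    """Row pass, then column pass (row-column form, with a transpose in between)."""
    rows = [forward_1d(list(r)) for r in block]
    cols = [forward_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def matmul(a, b):
    """Plain 4x4 integer matrix product, used only for the reference check."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def forward_4x4_matrix(block):
    """Reference: Y = CF * X * CF^T by direct matrix multiplication."""
    cf_t = [list(r) for r in zip(*CF)]
    return matmul(matmul(CF, block), cf_t)
```

The butterfly needs 8 additions and 2 shifts per 1-D pass and no multiplications, which is why this transform suits hardware implementation.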
20.
A configurable architecture for performing image transform algorithms is presented that provides a better tradeoff between low complexity and algorithm flexibility than either software-programmable processors or dedicated ASICs. The configurable processor unit requires only 110 K transistors and can execute several image transform algorithms. By emulating the signal flow of the algorithms in hardware, rather than software, complexity is reduced by an order of magnitude compared with current software programmable video signal processors, while providing more flexibility than single function ASICs. The processor has been fabricated in 1.2-μm CMOS and has been successfully used to execute the discrete cosine transform/inverse discrete cosine transform (DCT/IDCT), subband coding, vector quantization, and two-dimensional filtering algorithms at pixel rates up to 25 MPixels/s.