Similar literature
1.
The discrete wavelet transform (DWT) provides a new method for signal/image analysis where high frequency components are studied with finer time resolution and low frequency components with coarser time resolution. It decomposes a signal or an image into localized contributions for multiscale analysis. In this paper, we present a parallel pipelined VLSI array architecture for 2D dyadic separable DWT. The 2D data array is partitioned into non-overlapping groups of rows. All rows in a partition are processed in parallel, and consecutive partitions are pipelined. Moreover, multiple wavelet levels are computed in the same pipeline, and multiple DWT problems can also be pipelined. The whole computation requires a single scan of the image data array. Thus, it is suitable for on-line real-time applications. For an N×N image, an m-level DWT can be computed in time units on a processor costing no more than , where q is the partition size, p is the length of the corresponding 1D DWT filters, Cm and Ca are the costs of a parallel multiplier and a parallel adder respectively, and a time unit is the time for a multiplication and an addition. For q=N m, the computing time reduces to . When a large number of DWT problems are pipelined, the computing time is about per problem.
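The separable row-then-column structure that the abstract relies on can be illustrated with a minimal Python sketch. It assumes an unnormalized Haar filter pair (so p = 2) and plain sequential loops; the function names `haar_1d` and `dwt2_level` are hypothetical, and nothing here models the partitioning or pipelining of the proposed array.

```python
def haar_1d(v):
    """One level of an unnormalized Haar analysis: averages, then differences."""
    lo = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    hi = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return lo + hi


def dwt2_level(image):
    """One level of a separable 2D DWT: filter every row, then every column."""
    rows = [haar_1d(r) for r in image]               # row pass
    cols = [haar_1d(list(c)) for c in zip(*rows)]    # column pass on the transpose
    return [list(r) for r in zip(*cols)]             # transpose back


out = dwt2_level([[1, 2], [3, 4]])
```

Because each row is filtered independently, the row pass is the part that a hardware design can run in parallel across a partition of rows.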

2.
Dobson  J.M. Blair  G.M. 《Electronics letters》1995,31(20):1721-1722
The design by Srinivas and Parhi (1992) which used redundant-number adders for fast two's complement addition is re-examined. The underlying mechanism is revealed and improvements are presented which lead to a static-logic binary-tree carry generator to support high speed adder implementations with a delay of [log2(N)]+2 gates.
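To show why a binary-tree carry generator has logarithmic depth, here is a sketch of a generic parallel-prefix (Kogge-Stone style) carry computation in Python. It is an illustration under that assumption, not the circuit described in the letter; the name `prefix_add` and the returned level count are hypothetical.

```python
def prefix_add(a: int, b: int, n: int = 8) -> tuple[int, int]:
    """Add two n-bit unsigned integers; return (sum, number of tree levels)."""
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    d, levels = 1, 0
    while d < n:
        # One tree level: merge each position with the group d bits below it.
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
        G, P = G_next, P_next
        d *= 2
        levels += 1
    # G[i] is now the carry out of bit i, so the carry into bit i is G[i - 1].
    carries = [0] + G[:-1]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total | (G[n - 1] << n), levels
```

For n = 8 the loop runs three times, which matches the ceil(log2 N) tree depth; the signal setup and final sum stages are outside the tree.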

3.
An energy efficient adder design based on a hybrid carry computation is proposed. Addition takes place by considering the carry as propagating forwards from the LSB and backwards from the MSB. The incidence at a midpoint significantly accelerates the addition. This acceleration, together with combining low-cost ripple-carry and carry-chain circuits, yields energy efficiency compared to other adder architectures. The optimal midpoint is analytically formulated and its closed-form expression is derived. To avoid the quadratic RC delay growth in a long carry chain, it is optimally repeated. The adder is enhanced in a tree-like structure for further acceleration. 32, 64 and 128-bit adders targeting 500 MHz and 1 GHz clock frequencies were designed in 65 nm technology. They consumed 11–18% less energy compared to adders generated by a state-of-the-art EDA synthesis tool.

4.
A VLSI architecture of JPEG2000 encoder
This paper proposes a VLSI architecture of JPEG2000 encoder, which functionally consists of two parts: discrete wavelet transform (DWT) and embedded block coding with optimized truncation (EBCOT). For DWT, a spatial combinative lifting algorithm (SCLA)-based scheme with both 5/3 reversible and 9/7 irreversible filters is adopted to reduce multiplication computations by 50% and 42%, respectively, compared with the conventional lifting-based implementation (LBI). For EBCOT, a dynamic memory control (DMC) strategy of Tier-1 encoding is adopted to reduce the scale of the on-chip wavelet coefficient storage by 60%, and a subband parallel-processing method is employed to speed up the EBCOT context formation (CF) process; an architecture of Tier-2 encoding is presented to reduce the scale of on-chip bitstream buffering from full-tile size down to three-code-block size and considerably eliminate the iterations of the rate-distortion (RD) truncation.

5.
Agrawal  D.P. 《Electronics letters》1974,10(15):312-313
A carry-look-ahead negabinary adder is proposed in this letter. It is shown that these adders make the design of a fast negabinary multiplier feasible.
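For readers unfamiliar with base -2 arithmetic, the following Python sketch shows the carry rule that any negabinary adder has to resolve. It is a plain ripple-carry model, not the carry-look-ahead scheme of the letter, and the helper names are hypothetical. Digits are stored least significant first.

```python
def to_negabinary(value: int) -> list[int]:
    """Return the base -2 digits of value, least significant first."""
    if value == 0:
        return [0]
    digits = []
    while value != 0:
        value, rem = divmod(value, -2)
        if rem < 0:          # force the digit into {0, 1}
            value += 1
            rem += 2
        digits.append(rem)
    return digits


def from_negabinary(digits: list[int]) -> int:
    return sum(d * (-2) ** i for i, d in enumerate(digits))


def negabinary_add(x: list[int], y: list[int]) -> list[int]:
    """Digit-serial addition; the carry can be -1, 0 or +1."""
    result, carry, i = [], 0, 0
    while i < len(x) or i < len(y) or carry != 0:
        total = carry
        total += x[i] if i < len(x) else 0
        total += y[i] if i < len(y) else 0
        digit = total & 1
        # A surplus of 2 at weight (-2)^i equals -1 at weight (-2)^(i+1).
        carry = -((total - digit) // 2)
        result.append(digit)
        i += 1
    while len(result) > 1 and result[-1] == 0:
        result.pop()         # drop leading zeros
    return result
```

The signed carry is what distinguishes negabinary from ordinary binary addition, and it is the quantity a look-ahead network would need to predict.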

6.
The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed by I.S. Reed et al. (1990) for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25% of that used in the direct method.

7.
A very-large-scale integration architecture for Reed-Solomon (RS) decoding is presented that is scalable with respect to the throughput rate. This architecture enables given system specifications to be matched efficiently independent of a particular technology. The scalability is achieved by applying a systematic time-sharing technique. Based on this technique, new regular, multiplexed architectures have been derived for solving the key equation and performing finite field divisions. In addition to the flexibility, this approach leads to a small silicon area in comparison with several decoder implementations published in the past. The efficiency of the proposed architecture results from a fine granular pipeline scheme throughout each of the RS decoder components and a small overhead for the control circuitry. Implementation examples show that due to the pipeline strategy, data rates up to 1.28 Gbit/s are reached in a 0.5 μm CMOS technology.

8.
A method for screening out poor-quality metallizations from VLSI fabrication lines by wafer-level probing is proposed. Theoretical analysis suggests a linear dependence of the metal line conductance on the square of the current density, at thermal equilibrium. The limit to this linearity for ideally perfect metallizations occurs at the metal melting point, at which there is a sudden decrease in the conductance value to zero. In real interconnects, nonidealities such as localized defects or nonuniform surrounding dielectric at isolated points could lead to a deviation of the conductance from ideal expectations. Using this as a diagnostic, a universal methodology for assessing metal quality, independently of the physical parameters of the metal line, is described. Qualitative correlation with electromigration lifetime results is used to validate the method.

9.
We present a simple recursive algorithm for multiplying two binary N-bit numbers in parallel O(log N) time. The simplicity of the design allows for a regular layout. The area requirement of this algorithm is comparable with that of much slower designs classically used in monolithic multipliers and in signal processing chips, hence the construction has definite practical impact.

10.
A zerotree coding architecture with bit-plane parallel processing
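A zerotree coding architecture with bit-plane parallel processing is proposed. Based on the zerotree structure of embedded coding, it is shown that the coding information of every bit plane can be obtained simultaneously, which leads to a parallel zerotree coding architecture. Compared with existing architectures, the proposed one offers a high degree of parallelism and needs no intermediate buffering. Experimental results show that the processing speed is markedly improved and that the image quality meets the requirements of most applications.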

11.
We propose an architecture that performs the forward and inverse discrete wavelet transform (DWT) using a lifting-based scheme for the set of seven filters proposed in JPEG2000. The architecture consists of two row processors, two column processors, and two memory modules. Each processor contains two adders, one multiplier, and one shifter. The precision of the multipliers and adders has been determined using extensive simulation. Each memory module consists of four banks in order to support the high computational bandwidth. The architecture has been designed to generate an output every cycle for the JPEG2000 default filters. The schedules have been generated by hand and the corresponding timings listed. Finally, the architecture has been implemented in behavioral VHDL. The estimated area of the proposed architecture in 0.18-μm technology is 2.8 mm square, and the estimated frequency of operation is 200 MHz.
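As an illustration of what a lifting-based scheme computes, the sketch below implements one level of the reversible 5/3 integer transform in Python as a predict step followed by an update step. It assumes an even-length 1D input and mirror extension at the borders; the function names are hypothetical, and the code models only the arithmetic, not the row/column processors or the memory banks of the proposed architecture.

```python
def lift_53_forward(x: list[int]) -> tuple[list[int], list[int]]:
    """One level of the reversible 5/3 lifting transform (even-length input)."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict: high-pass = odd sample minus the floor-average of its even neighbours.
    d = [odd[n] - ((even[n] + even[min(n + 1, half - 1)]) >> 1)
         for n in range(half)]
    # Update: low-pass = even sample plus a rounded quarter of neighbouring details.
    s = [even[n] + ((d[max(n - 1, 0)] + d[n] + 2) >> 2)
         for n in range(half)]
    return s, d


def lift_53_inverse(s: list[int], d: list[int]) -> list[int]:
    """Undo the lifting steps in reverse order with the signs flipped."""
    half = len(s)
    even = [s[n] - ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(half)]
    odd = [d[n] + ((even[n] + even[min(n + 1, half - 1)]) >> 1)
           for n in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


samples = [3, 7, 1, 8, 2, 9, 4, 4]
low, high = lift_53_forward(samples)
restored = lift_53_inverse(low, high)
```

The 5/3 steps need only additions and shifts, and the inverse reuses the same expressions, which is why a single datapath can serve both the forward and inverse transform.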

12.
A modified two-dimensional (2-D) discrete periodized wavelet transform (DPWT) based on the homeomorphic high-pass filter and the 2-D operator correlation algorithm is developed in this paper. The advantages of this modified 2-D DPWT are that it can reduce the multiplication counts and the complexity of boundary data processing in comparison to other conventional 2-D DPWT for perfect reconstruction. In addition, a parallel-pipeline architecture of the nonseparable computation algorithm is also proposed to implement this modified 2-D DPWT. This architecture has properties of noninterleaving input data, short bus width request, and short latency. The analysis of the finite precision performance shows that nearly half of the bit length can be saved by using this nonseparable computation algorithm. The operation of the boundary data processing is also described in detail. In the three-stage decomposition of an N×N image, the latency is found to be N²+2N+18.

13.
In this paper, we propose a fast block-matching algorithm based on search center prediction and search early termination, called center-prediction and early-termination based motion search algorithm (CPETS). The CPETS offers high performance and lends itself to efficient VLSI implementation. It makes use of the spatial and temporal correlation in motion vector (MV) fields and the feature of all-zero blocks to accelerate the searching process. This paper describes the CPETS with three levels. At the coarsest level, which happens when center prediction fails, the search area is defined to enclose all of the original search range. At the middle level, the search area is defined as a 7×7-pel square area around the predicted center. At the finest level, a 5×5-pel search area around the predicted center is adopted. At each level, a uniformly allocated 9-point search pattern is adopted. The experimental results show that the CPETS is able to achieve a reduction of 95.67% in encoding time on average compared with the full-search scheme, with a negligible peak signal-to-noise ratio (PSNR) loss and bitrate increase. Also, the CPETS clearly outperforms some popular fast algorithms such as three-step search, new three-step search, and four-step search. This paper also describes an efficient four-way pipelined VLSI architecture based on the CPETS for H.264/AVC coding. The proposed architecture divides the current block and the search area into four sub-regions, respectively, with 4:1 sub-sampling and processes them in parallel. Also, each sub-region is processed by a pipelined structure to ensure the search for nine candidate points is performed simultaneously. By adopting the search early-termination strategy, the architecture can compute one MV for a 16×16 block in 81 clock cycles in the best case and 901 clock cycles in the poorest case. The architecture has been designed and simulated in VHDL. Simulation results show that the proposed architecture achieves a high performance for real-time motion estimation. Only 47.3 K gates and 1624×8 bits of on-chip RAM are needed for a search range of (−15, +15) with three reference frames and four candidate block modes by using 36 processing elements.
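The idea of a uniformly spaced 9-point pattern around a predicted center, combined with early termination, can be sketched in Python as follows. This is a generic illustration, not the CPETS itself: the spacing `step`, the `threshold` test on the sum of absolute differences (SAD), and all function names are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and a displaced reference block."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def nine_point_search(cur, ref, bx, by, center, step, size, threshold):
    """Test a 3x3 grid of candidates around center; stop once SAD <= threshold."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for oy in (-step, 0, step):
        for ox in (-step, 0, step):
            dx, dy = center[0] + ox, center[1] + oy
            # Skip candidates whose reference block falls outside the frame.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
            if cost <= threshold:      # early termination
                return best_mv, best_cost
    return best_mv, best_cost


# Synthetic frames: the current frame is the reference shifted left by two pels.
ref = [[(7 * x + 13 * y) % 31 for x in range(8)] for y in range(8)]
cur = [[ref[y][x + 2] if x + 2 < 8 else 0 for x in range(8)] for y in range(8)]
mv, cost = nine_point_search(cur, ref, bx=2, by=2, center=(0, 0),
                             step=2, size=4, threshold=0)
```

With `step=2` the nine candidates span a 5×5 area, which corresponds to the finest level described above under the assumption that the points sit on the corners, edge midpoints, and center of that area.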

14.
A novel Parallel-Based Lifting Algorithm (PBLA) for Discrete Wavelet Transform (DWT), exploiting the parallelism of arithmetic operations in all lifting steps, is proposed in this paper. It reduces the critical path latency of computation as well as the complexity of hardware implementation. The detailed derivation of the proposed algorithm, as well as the resulting Very Large Scale Integration (VLSI) architecture, is introduced, taking the 9/7 DWT as an example but without loss of generality. In comparison with the Conventional Lifting Algorithm Based Implementation (CLABI), the critical path latency of the proposed architecture is reduced by more than half, from 4Tm + 8Ta to Tm + 4Ta, and is competitive with that of the Convolution-Based Implementation (CBI), but the new implementation will save significantly in hardware. The experimental results demonstrate that the proposed architecture has good performance in both increasing working frequency and reducing area.

15.
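This paper proposes a new VLSI architecture for a low-power hierarchical motion estimator that supports the advanced prediction modes of low-bit-rate video coders such as H.263 and MPEG-4. To reduce chip size and power consumption, the same basic search unit (BSU) is used at all search levels. In addition, through efficient control of the data flow, in advanced prediction mode the motion vectors of the four 8×8 sub-blocks in each macroblock are obtained together with the macroblock motion vector. Experimental results show that the architecture uses few gates, effectively reduces power consumption, and achieves coding performance similar to that of the full-search block-matching algorithm (FSBMA), so it can be widely applied in the low-power video coders required for wireless video communication.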

16.
17.
Brown  C.I. Yates  R.B. 《Electronics letters》1996,32(10):891-893
A new VLSI architecture for sparse matrices reduces both the I/O and the number of trivial multiplications. A two pipeline 30 MHz processor has been fabricated. This device performs 60 million MACs per second and reduces the time complexity for matrix multiplication by several orders of magnitude for most applications.

18.
A custom VLSI architecture for implementing the CCITT G.722 64-kb/s (7-kHz) wideband audio coding standard is presented. By tailoring the architecture to the algorithm, an architecture was designed that is capable of processing a full duplex channel in less than 625 cycles. That is 71-73% fewer cycles than are required by the reported general-purpose DSP implementations. In a 1.5-μ technology with a 100-ns cycle time, it is estimated that the architecture would consume 95000 mL2 of silicon and support two full duplex channels on a single chip. The authors wrote a behavioral simulation of the architecture and its implicit microcode. This simulates the architecture's behavior at the bit level. The simulation passes the CCITT G.722 test vectors, demonstrating that the implementation conforms to the standard.

19.
A bit-serial algorithm for the multiplication of elements in the vector space of finite dimension is presented. Based on the algorithm, a VLSI architecture of quasicyclic (QC) encoders is shown. Compared with that of conventional QC encoders, the proposed architecture is more regular, simpler and programmable. It also offers designers more flexibility for choosing available VLSI techniques. In addition, it can easily be changed to accommodate any QC coding strategies.

20.
As neural network systems are scaled up in size, it will become extremely difficult, if not impossible, to maintain full connectivity. A digital architecture which exhibits hierarchical connectivity similar to that observed in many biological neural networks is described. At the lowest level, clusters of fully connected neurons correspond to subnetworks. These subnetworks are then sparsely connected to form the complete neural network system. The architecture exploits the inherent density and large bandwidth of on-chip RAM and can use either a large number of bit-serial processors or a reduced number of bit-parallel processors. A prototype chip which implements a complete subnetwork has been fabricated in 3-μm CMOS and is fully functional.
