Similar literature
 Found 20 similar documents (search time: 734 ms)
1.
In this brief, an efficient folded architecture (EFA) for the lifting-based discrete wavelet transform (DWT) is presented. The proposed EFA is based on a novel form of the lifting scheme that is given in this brief. Due to this form, the conventional serial operations of the lifting data flow can be optimized into parallel ones by employing parallel and pipeline techniques. The corresponding optimized architecture (OA) has short critical path latency and is repeatable. Further, utilizing this repeatability, the EFA is derived from the OA by employing the fold technique. For the proposed EFA, hardware utilization achieves 100%, and the number of required registers is reduced. Additionally, the shift-add operation is adopted to optimize the multiplication; thus, the proposed architecture is more suitable for hardware implementation. Performance comparisons and field-programmable gate array (FPGA) implementation results indicate that the proposed EFA performs better in critical path latency, hardware cost, and control complexity.

2.
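Implementation of a novel lifting-based two-dimensional discrete wavelet transform architecture   Cited by: 1 (self-citations: 0, citations by others: 1)
孟军 (Meng Jun), 魏同立 (Wei Tongli). 《电路与系统学报》 (Journal of Circuits and Systems), 2003, 8(6): 139-142, 128
Based on an analysis of the principles of the lifting algorithm, a two-dimensional discrete wavelet transform architecture using the lifting algorithm is designed. It departs from the traditional row-first, column-second order of lifting computation by combining the row and column operations. Compared with the traditional architecture, and with essentially no additional hardware units, the transform time is reduced to about 75% of the original, which improves hardware efficiency.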

3.
In this paper we formalize a novel multirate folding transformation, a tool used to systematically synthesize control circuits for pipelined VLSI architectures that implement multirate algorithms. Although multirate algorithms contain decimators and expanders which change the effective sample rate of a discrete-time signal, multirate folding time-multiplexes the multirate algorithm to hardware in such a manner that the resulting synchronous architecture requires only a single clock signal. Multirate folding equations are derived, and these equations are used to address two related issues. The first issue is memory requirements in folded architectures. We derive expressions for the minimum number of registers required by a folded architecture which implements a multirate algorithm. The second issue is retiming. Based on the noble identities of multirate signal processing, we derive retiming-for-folding constraints which indicate how a multirate data-flow graph must be retimed for a given schedule to be feasible. The techniques introduced in this paper can be used to synthesize architectures for a wide variety of digital signal processing applications which are based on multirate algorithms, such as signal analysis and coding based on subband decompositions and wavelet transforms.

4.
A new cell architecture for high performance digit-serial computation is presented. The design of this cell is based on the feed forward of the carry digit, which allows a high level of pipelining to increase the throughput rate with minimum latency. This will give designers greater flexibility in finding the best trade-off between hardware cost and throughput rate. A twin-pipe architecture to double the throughput rate of digit-serial/parallel multipliers is also presented. The effects of the number of pipelining levels and the twin architecture on the throughput rate and hardware cost are presented. A two's complement digit-serial/parallel multiplier which can operate on both negative and positive numbers is also presented.

5.
A distributed arithmetic (DA)-based decision feedback equalizer architecture for IEEE 802.11b PHY scenarios is presented. As the transmission data rate increases, the hardware complexity of the decision feedback equalizer increases due to the requirement for a large number of taps in the feed-forward and feedback filters. DA, an efficient technique that uses memories to compute the inner product of two vectors, has been used since DA-based realization of filters can lead to great computational savings. For higher-order filters, the memory-size requirement in DA would be high, and so ROM decomposition has been employed. The speed is further increased by employing digit-serial input operation. Two architectures have been presented, namely the direct-memory architecture and the reduced-memory architecture, where the latter is derived from the former. A third architecture has also been presented where the offset-binary coding scheme is employed along with the ROM decomposition and digit-serial variants of DA. Synthesis results on an Altera Cyclone III EP3C55F484C6 FPGA show that the proposed DA-based implementations are free of hardware multipliers and use fewer hardware resources than the multiply-and-accumulate-based implementation.
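
As a rough illustration of the distributed-arithmetic idea named in this abstract (a memory lookup replacing multipliers when computing an inner product with fixed coefficients), the following minimal sketch may help. It is not the architecture of the paper: the function names `build_lut` and `da_inner_product` are hypothetical, inputs are assumed to be integers in two's complement with a known word length, and ROM decomposition, offset-binary coding and digit-serial input are deliberately left out.

```python
def build_lut(coeffs):
    """Precompute, for every bit pattern `addr`, the sum of the coefficients
    whose position k has bit k set in `addr` (2**len(coeffs) entries)."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, xs, bits):
    """Bit-serial inner product sum(c * x) using only lookups, shifts and adds.

    Assumption: every x fits in `bits`-bit two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input sample into one LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


if __name__ == "__main__":
    print(da_inner_product([3, -1, 4, 2], [5, -3, 7, -8], bits=4))
```

The table grows as 2^N with the number of taps N, which is the memory cost that the abstract says ROM decomposition is used to contain.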

6.
This brief presents a novel very large-scale integration (VLSI) architecture for the discrete wavelet packet transform (DWPT). By exploiting the in-place nature of the DWPT algorithm, this architecture has an efficient pipeline structure to implement high-throughput processing without any on-chip memory/first-in first-out access. A folded architecture for lifting-based wavelet filters is proposed to compute the wavelet butterflies in different groups simultaneously at each decomposition level. According to the comparison results, the proposed VLSI architecture is more efficient than the previously proposed architectures in terms of memory access, hardware regularity and simplicity, and throughput. The folded architecture not only achieves a significant reduction in hardware cost but also maintains both the hardware utilization and high-throughput processing in comparison with the direct mapped tree-structured architecture.

7.
A folded very large scale integration (VLSI) architecture is presented for the implementation of the two-dimensional discrete wavelet transform, without constraints on the choice of the wavelet-filter bank. The proposed architecture is dedicated to flexible block-oriented image processing, such as adaptive vector quantization used in wavelet image coding. We show that reading the image along a two-dimensional (2-D) pseudo-fractal scan creates a very modular and regular data flow and, therefore, considerably reduces the folding complexity and memory requirements for VLSI implementation. This leads to significant area savings for on-chip storage (up to a factor of two) and reduces the power consumption. Furthermore, data scheduling and memory management remain very simple. The end result is an efficient VLSI implementation with a reduced area cost compared to the conventional approaches, which read the input data line by line.

8.
A systolic-like modular architecture is presented for hardware-efficient implementation of the two-dimensional (2-D) discrete wavelet transform (DWT). The overall computation is decomposed into two distinct stages, where column processing is performed in stage-1, while row processing is performed in stage-2. Using a new data-access scheme and a novel folding technique, the computation of both stages is performed concurrently for transposition-free implementation of 2-D DWT. The proposed design can offer nearly the same throughput rate, and requires the same number of adders and multipliers as, or fewer than, the best of the existing structures. The storage space is found to occupy most of the area in the existing 2-D DWT structures, but the proposed structure does not require any on-chip or off-chip storage of input samples or storage/transposition of intermediate output. The proposed one, therefore, involves considerably less hardware complexity compared with the existing structures. Apart from that, it has a shorter cycle period than the existing structures, and has a latency of cycles while all the existing structures have latency of cycles, the filter order being small compared to the input size .

9.
A new high-performance systolic architecture for calculating the discrete Fourier transform (DFT) is described which is based on two levels of transform factorization. One level uses an index remapping that converts the direct transform into structured sets of arithmetically simple four-point transforms. Another level adds a row/column decomposition of the DFT. The architecture supports transform lengths that are not powers of two or based on products of coprime numbers. Compared to previous systolic implementations, the architecture is computationally more efficient and uses less hardware. It provides low latency as well as high throughput, and can do both one- and two-dimensional DFTs. An automated computer-aided design tool was used to find latency and throughput optimal designs that matched the target field programmable gate array structure and functionality.

10.
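FPGA implementation of the lifting-based discrete wavelet transform   Cited by: 1 (self-citations: 0, citations by others: 1)
吴志林 (Wu Zhilin), 王超 (Wang Chao), 李杰 (Li Jie), 卜爱国 (Bu Aiguo). 《电子器件》, 2007, 30(1): 290-293
The discrete wavelet transform underlies many current image processing and compression techniques and is widely used. Taking the 4th-order Daubechies wavelet as an example, this paper explains the principle of the lifting-based discrete wavelet transform and presents a hardware architecture for it. The architecture is then simulated and the simulation results are compared with those of a Matlab software implementation; the results show that the hardware and software implementations are essentially consistent. Compared with an implementation based on the traditional convolution method, the hardware architecture reduces the implementation area, and inserting pipeline registers shortens the critical path and increases the computation speed.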

11.
12.
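A design method is proposed for an efficient parallel VLSI architecture for the lifting-based two-dimensional discrete 5/3 wavelet transform (DWT). The method lets the row and column filters operate simultaneously and uses a pipelined design; at the same precision, it greatly reduces the amount of computation, increases the transform speed and saves hardware resources. The method has passed Verilog HDL behavioral-level simulation and can be used as a standalone IP core in JPEG2000 image encoder and decoder chips. The architecture can be extended to the 9/7 wavelet lifting structure.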
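
For readers unfamiliar with the 5/3 lifting steps that entry 12 builds on, the following is a minimal one-dimensional software sketch, not the parallel VLSI design of the paper. It assumes the common reversible integer form of the 5/3 transform (one predict step and one update step with floor rounding), an even-length input, and mirrored samples at the borders; the names `fwd53` and `inv53` are hypothetical.

```python
def fwd53(x):
    """Forward reversible 5/3 lifting on an even-length list of integers.

    Returns (s, d): the low-pass and high-pass coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "this sketch assumes an even-length input"
    even, odd = x[0::2], x[1::2]
    half = n // 2
    # Predict: subtract the floor-average of the two neighbouring even samples
    # (the right neighbour is mirrored at the right border).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update: add a rounded quarter of the two neighbouring details
    # (the left neighbour is mirrored at the left border).
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def inv53(s, d):
    """Inverse transform: undo the update step, then the predict step."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


if __name__ == "__main__":
    samples = [10, 12, 14, 13, 9, 7, 8, 8]
    low, high = fwd53(samples)
    print(low, high, inv53(low, high) == samples)
```

Every operation is an integer add or a shift, which is one reason lifting maps well to hardware; a 2-D transform applies the same steps along rows and columns.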

13.
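A finite-precision analysis, as required for hardware implementation, is carried out for the 5/3 integer filter and the 9/7 real filter recommended in JPEG2000. The optimal data width of each parameter in the wavelet transform is determined, as is the data width of the data path of the whole transform system. Based on the characteristics of the lifting-based wavelet transform, combined with an embedded extension algorithm, two wavelet transform architectures are proposed: a folded architecture and a long-pipeline architecture. The two architectures are analyzed and compared. Finally, the folded architecture is compared with other related architectures in terms of the number of memory units required, the number of memory accesses, processing capability and power consumption; the comparison shows that the proposed architecture has clear performance advantages.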

14.
The discrete wavelet transform (DWT) is an upcoming compression technique that has been selected for MPEG-4 and JPEG 2000, because it has no blocking effects and it efficiently determines the frequency property of the temporary signals. In this paper, we propose a low-complexity, low-power bit-serial DWT architecture, employing a two-channel lattice-based quadrature mirror filter (QMF). The filter consists of four lattices (filter length = 8), and we determine the quantization bit for the coefficients using a fixed-length peak signal-to-noise ratio analysis and propose the architecture of the bit-serial multiplier with a fixed coefficient. The canonical signed digit encoding for the coefficients is applied to minimize the number of nonzero bits, thus reducing the hardware complexity. The proposed folded one-dimensional DWT architecture processes the other resolution levels during idle periods by decimations, and it provides efficient scheduling. The proposed architecture requires only flip-flops and full adders. This architecture has been designed and verified in Verilog HDL and synthesized using the Synopsys Design Compiler with the DongbuAnam 0.18 μm Standard Cell Library. The maximum throughput is 393 Mbps at 450 MHz with a latency of 16 clocks, and the gate count is about 5K in equivalent two-input NAND gates. The dynamic power is 7.02 mW at 1.8 V. The data scheduling using a data dependency graph, and the performance, power, and required hardware cost are discussed.
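
The canonical signed digit (CSD) encoding mentioned in this abstract can be sketched as follows. This is an illustrative example only: it encodes a non-negative integer constant rather than the quantized fractional filter coefficients of the paper, and the names `to_csd` and `csd_multiply` are hypothetical.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, least significant
    digit first. Each digit is -1, 0 or +1 and no two adjacent digits are
    both nonzero."""
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1; n mod 4 == 3 -> digit -1,
            # which forces the next digit to be zero.
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in `digits` using only shifts,
    additions and subtractions (one add/sub per nonzero digit)."""
    acc = 0
    for k, digit in enumerate(digits):
        if digit == 1:
            acc += x << k
        elif digit == -1:
            acc -= x << k
    return acc


if __name__ == "__main__":
    print(to_csd(7))                    # 7 = 8 - 1
    print(csd_multiply(5, to_csd(7)))   # 5 * 7 via one shift and one subtract
```

Plain binary 7 (`111`) would need three partial products, whereas the CSD form needs two nonzero digits; fewer nonzero digits means fewer adders in a fixed-coefficient multiplier.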

15.
This paper investigates efficient hardware architectures for implementation of 1-D and 2-D discrete wavelet transforms (DWTs). The architectures are based on the lifting scheme. We propose a general structure to minimize the number of multipliers and adders for 1-D DWTs. Compared to previous conventional architectures, the architecture presented here is more efficient in terms of the required arithmetic units. Moreover, we describe a new frame scan method for a block-based 2-D DWT structure which provides a flexible trade-off between the required internal memory size and external memory access. In contrast, other 2-D DWT structures require a fixed memory size.

16.
In this paper, the design and implementation of a hardware architecture optimized for speed and memory requirements are described; the architecture computes the tile-based 2D forward discrete wavelet transform for the JPEG2000 image compression standard. The proposed architecture is based on a well-known architecture template for calculating the 2D forward discrete wavelet transform. This architecture is derived by replacing the filtering units by our previously published throughput-optimized ones and by developing a scheduling algorithm suited to the special features of our filtering units. The architecture exhibits high-performance characteristics due to the throughput-optimized filters. Also, the extra clock cycles required due to the tile-based version of the discrete wavelet transform are partially compensated by the proper scheduling of the filters. The developed scheduling algorithm results in reduced memory requirements compared with existing architectures.

17.
This paper presents a VLSI implementation of a novel hybrid architecture that computes three 8-point 1-D transforms (the discrete cosine transform, the discrete Fourier transform, and the Haar wavelet transform) on a single chip. The architecture is developed on matrix factorization and row permutation algorithms, where the basis forward transformation matrices are decomposed into common submatrices which are then shared among the transforms. A two-level hardware mapping has been employed which is parallel, pipelined, and multiplexed. The hybrid architecture, the first of its kind, is implemented using 0.18 μm CMOS technology. The estimated die size of the hybrid processor is 0.203 sq. mm, the frequency of operation is 100 MHz, the gate count is 20,400, and the power consumption is 15.38 mW. Compared to the existing designs, the proposed hybrid scheme has less power density and higher frequency of operation, making it very suitable for modern multimedia-based transcoding applications.

18.
We propose an efficient hardware-oriented method for evaluating complex polynomials. The method is based on solving iteratively a system of linear equations. The solutions are obtained digit-by-digit on simple and highly regular hardware. The operations performed are defined over the reals. We describe a complex-to-real transform, a complex polynomial evaluation algorithm, the convergence conditions, and a corresponding design and implementation. The latency and the area are estimated for the radix-2 case. The main features of the method are: the latency of about m cycles for an m-bit precision; the cycle time independent of the precision; a design consisting of identical modules; and digit-serial connections between the modules. The number of modules, each roughly corresponding to serial-parallel multiplier without a carry-propagate adder, is 2(n + 1) for evaluating an n-th degree complex polynomial. The method can also be used to compute all successive integer powers of the complex argument with the same latency and a similar implementation cost. The design allows straightforward tradeoffs between latency and cost: a factor k decrease in cost leads to a factor k increase in latency. A similar tradeoff between precision, latency and cost exists. The proposed method is attractive for programmable platforms because of its regular and repetitive structure of simple hardware operators.

19.
A cost-effective VLSI architecture with separate data-paths and their corresponding filter structure is proposed for performing a two-dimensional discrete wavelet transform (2D DWT). Compared with the conventional 2D DWT VLSI architectures, the proposed semi-recursive 2D DWT VLSI architecture has minimum hardware cost, and optimised data-bus utilisation, scheduling control overhead and storage size.

20.
In this paper we propose a generalized technique to count the required number of registers in a schedule which supports overlapped scheduling and can be applied to the case where a general digit-serial data format is used. This technique is integrated into an integer linear programming (ILP) model for time-constrained scheduling. In the ILP model, appropriate processors of certain data formats are chosen from a library of processors, and data format converters are automatically inserted between processors of different data formats if necessary. Then the required number of registers for each data format is evaluated correctly by the proposed technique. Hence an optimal architecture for a given digital signal processing algorithm is synthesized where the cost of registers as well as the cost of processors and data format converters are minimized. It is shown that including the cost of registers in the synthesis task, as proposed in this paper, leads to savings of up to 12.8% in the total cost of the synthesized architecture when compared with synthesis performed without including the register cost in the total cost.
