首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In this paper, we have analyzed the register complexity of direct-form and transpose-form structures of FIR filter and explored the possibility of register reuse. We find that direct-form structure involves significantly less registers than the transpose-form structure, and it allows register reuse in parallel implementation. We analyze further the LUT consumption and other resources of DA-based parallel FIR filter structures, and find that the input delay unit, coefficient storage unit and partial product generation unit are also shared besides LUT words when multiple filter outputs are computed in parallel. Based on these finding, we propose a design approach, and used that to derive a DA-based architecture for reconfigurable block-based FIR filter, which is scalable for larger block-sizes and higher filter-lengths. Interestingly, the number of registers of the proposed structure does not increase proportionately with the block-size. This is a major advantage for area-delay and energy efficient high-throughput implementation of reconfigurable FIR filters of higher block-sizes. Theoretical comparison shows that the proposed structure for block-size 8 and filter-length 64 involves 60% more flip-flops, 6.2 times more adders, 3.5 times more AND-OR gates, and offers 8 times higher throughput. ASIC synthesis result shows that the proposed structure for block-size 8 and filter-length 64 involves 1.8 times less area-delay product (ADP) and energy per sample (EPS) than the existing design, and it can support 8 times higher throughput. The proposed structure for block sizes 4 and 8, respectively, consumes 38% and 50% less power than the exiting structure for the same throughput rates on average for different supply voltages.  相似文献   

2.
A MIMD based multiprocessor architecture for real-time video processing applications consisting of identical bus connected processing elements has been developed. Each processing element contains a RISC processor for controlling and data-dependent tasks and a Low Level Coprocessor for fast processing of convolution-type video processing tasks. To achieve efficient parallel processing of video input signals, the architecture supports independent processing of overlapping image segments. Running at a clock rate of 40 MHz, a single processing element provides a peak performance of 640 Mega arithmetic operations per second (MOPS). For the real-time processing of basic video processing tasks like 3×3 FIR-filter, 8×8 2D-DCT and motion estimation, a single processing element provides a sufficient computational rate for video signals with Common Intermediate Format (CIF) at a frame rate up to 30 Hz. For hybrid source coding of CIF video signals at a frame rate of 30 Hz a multiprocessor system consisting of six processing elements is required. A linear speedup of the multiprocessor system compared to a single processing element is achieved. A VLSI implementation of a processing element in 0.8 µm CMOS technology is under development.  相似文献   

3.
Image feature separation is a crucial step for image segmentation in computer vision systems. One efficient and powerful approach is the unsupervised clustering of the resulting data set; however, it is a very computationally intensive task. This paper presents a high-performance architecture for unsupervised data clustering. This architecture is suitable for VLSI implementations. It exploits paradigms of massive connectivity like those inspired by neural networks, and parallelism and functionality integration that can be afforded by emerging nanometer semiconductor technologies. By utilizing a "global-quasi-systolic, local-hyper-connected" architectural approach, the hardware can process real-time DVD-quality video at the highest rate allowed by the MPEG-2 standard. The architecture is a realization of the histogram peak-climbing clustering algorithm, and it is the first special-purpose architecture that has been proposed for this important problem. The architecture has also been prototyped using a Xilinx field programmable gate array (FPGA) development environment. Although this paper discusses a computer vision application, the architecture presented can be utilized in the acceleration of the clustering process of any type of high-dimensionality data.  相似文献   

4.
The paper presents a new architecture composed of bit plane-parallel coder for Embedded Block Coding with Optimized Truncation (EBCOT) entropy encoder used in JPEG2000. In the architecture, the coding information of each bit plane can be obtained simultaneously and processed parallel. Compared with other architectures, it has advantages of high parallelism, and no waste clock cycles for a single point. The experimental results show that it reduces the processing time about 86% than that of bit plane sequential scheme. A Field Programmable Gate Array (FPGA) prototype chip is designed and simulation results show that it can process 512×512 gray-scaled images with more than 30 frames per second at 52MHz.  相似文献   

5.
A family of multiprocessor architectures implementing the Viterbi algorithm is presented. The family of architectures is shown to be capable of achieving an increase in throughput that is directly proportional to the number of processors when the number of processors is smaller than the constraint length v of the code. The hardware utilization and the depth of the pipelining available inside each processor are also shown. An architecture with v-1 processors is found to be particularly advantageous, since it results in the maximum speedup and simplest interconnection structure  相似文献   

6.
Brown  C.I. Yates  R.B. 《Electronics letters》1996,32(10):891-893
A new VLSI architecture for sparse matrices reduces the I/O and reduces the number of trivial multiplications. A two pipeline 30 MHz processor has been fabricated. This device performs 60 million MACs per second and reduces the time complexity for matrix multiplication by several orders of magnitude for most applications  相似文献   

7.
The Viterbi (1967) algorithm (VA) is known to be an efficient method for the realization of maximum-likelihood (ML) decoding of convolutional codes. The VA is characterized by a graph, called a trellis, which defines the transitions between states. To define an area efficient architecture for the VA is equivalent to obtaining an efficient mapping of the trellis. We present a methodology that permits the efficient hardware mapping of the VA onto a processor network of arbitrary size. This formal model is employed for the partitioning of the computations among an arbitrary number of processors in such a way that the data are recirculated, optimizing the use of the PEs and the communications. Therefore, the algorithm is mapped onto a column of processing elements and an optimal design solution is obtained for a particular set of area and/or speed constraints. Furthermore, the management of the surviving path memory for its mapping and distribution among the processors was studied. As a result, we obtain a regular and modular design appropriate for its VLSI implementation in which the only necessary communications between processors are the data recirculations between stages  相似文献   

8.
High-speed VLSI architecture for parallel Reed-Solomon decoder   总被引:3,自引:0,他引:3  
This paper presents high-speed parallel Reed-Solomon (RS) (255,239) decoder architecture using modified Euclidean algorithm for the high-speed multigigabit-per-second fiber optic systems. Pipelining and parallelizing allow inputs to be received at very high fiber-optic rates and outputs to be delivered at correspondingly high rates with minimum delay. A parallel processing architecture results in speed-ups of as much as or more than 10 Gb, since the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. The parallel RS decoders have been designed and implemented with the 0.13-/spl mu/m CMOS standard cell technology in a supply voltage of 1.1 V. It is suggested that a parallel RS decoder, which can keep up with optical transmission rates, i.e., 10 Gb/s and beyond, could be implemented. The proposed channel = 4 parallel RS decoder operates at a clock frequency of 770 MHz and has a data processing rate of 26.6 Gb/s.  相似文献   

9.
A single chip, 128 coefficient, asynchronous echo canceller is presented. Cancellation is performed by an FIR filter whose coefficients are adapted using the power-of-two modified LMS algorithm. The pipelined circuit updates all coefficients and generates the filtered output every cycle while allowing a sampling rate >206.5 kHz  相似文献   

10.
王刚毅  金炎胜  任广辉  刘通 《红外与激光工程》2018,47(9):926001-0926001(9)
交通标志检测是驾驶辅助系统的重要功能,但对实时性极高的要求使其非常具有挑战性。提出了一种高性能禁令标志检测模块的VLSI结构,并在FPGA平台上完成了实现和验证。该结构的基本原理是同时利用颜色与形状特征,在图像的红色边缘位图中采用圆霍夫变换检测圆形。通过挖掘圆霍夫变换的局部特性,所提出的结构在内存占用方面显著低于常规结构。所有半径同时投票的设计使FPGA的逻辑单元和内存的并行性得以充分发挥。该结构在Altera公司的EP3C55F484C6型FPGA上进行了验证,其最大可运行频率达到122 MHz,且资源占用在可接受范围内。实验结果表明:该结构的吞吐量达到115 M像素/s,且对低光照条件、局部遮挡、多标志相连和相似背景颜色等不利条件具有良好的适应能力。  相似文献   

11.
We propose a new VLSI architecture for an FFT processor. Our architecture uses few processing elements and can be laid out in a mesh-interconnected pattern. We show how to compute the discrete Fourier transform at n points with an optimal speed-up as long as the memory is large enough. The control is shown to be simple and easily implementable in VLSI.  相似文献   

12.
An architecture for performing fixed-point, high-speed, two's-complement, bit-parallel addition by using the carry-free property of redundant arithmetic and a fast parallel redundant-to-binary conversion scheme is presented. The internal numbers are represented in radix-2 redundant digit form, and the inputs and the output of the adder are represented in two's-complement binary form. The adder operands are added first in a radix-2 redundant adder to produce the result in radix-2 digit (-1, 0, 1) form. This result is converted to two's-complement binary form using the parallel conversion scheme. The high-speed conversion for long words is achieved through the use of a novel sign-select operation. The proposed adder, referred to as the sign-select conversion adder, is faster than all previous high-speed two's-complement binary adders for large word lengths. The implementation is highly regular with repeated modules and is very well suited for VLSI implementation  相似文献   

13.
An n-p-n bipolar transistor structure with the emitter region self-aligned to the polysilicon base contact is described. The self-alignment results in an emitter-to-base contact separation less than 0.4 µm and a collector-to-emitter area ratio about 3:1 for a two-sided base contact. This ratio can be less than 2:1 for a base contacted only on one side. The vertical doping profile can be optimized independently for high-performance and/or high-density and low-power-delay circuit applications. The technology, using recessed oxide isolation, was evaluated using 13-stage nonthreshold logic (NTL) and 11-stage merged-transition logic (MTL) ring-oscillator circuits designed with 2.5 µm design rules. For transistors with 200-nm emitter junction depth the common-emitter current gain for polysilicon emitter contact is typically 2-4 times that for Pd2Si emitter contact. There is no observable circuit performance degradation attributable to the polysilicon emitter contact. Typical observed per-stage delays were 190 ps at 1.3 mW and 120 ps at 2.3 mW for the NTL (FI = FO = 1) circuits and 1.3 ns at 0.15 mA for the MTL (FO = 4) circuits.  相似文献   

14.
王雪静  叶凡  任俊彦 《通信学报》2006,27(5):115-119
对以太网物理层中的判决反馈均衡器进行了VLSI优化设计,从硬件实现角度提出了两点改进措施,一是采用混合结构实现,二是系数调整单元复用。在0.18μm CMOS工艺下,改进后的判决反馈均衡器相比转置结构实现的判决反馈均衡器,速度提高16%,面积减少36%,功耗降低了39%。  相似文献   

15.
The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed by I.S. Reed et al. (1990) for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25% of that used in the direct method  相似文献   

16.
Due to the evolution of increasingly high performantdsp algorithms for bit rate reduction of speech signals in telecommunications,vlsi implementations of these applications are becoming more and more complex. The solutions currently being used for these applications are general purpose digital signal processors or dsp cores which are never fully adapted to the application in terms ofvlsi architecture, i.e. silicon area and power consumption. We propose an alternative to these solutions, based on a parametrable Harvard architecture, and a C compiler which gives an optimized microcode suited to this architecture and to the application. Finally we present two examples of audio applications implemented using this solution.  相似文献   

17.
《Microelectronics Journal》2002,33(5-6):417-427
In this paper, the design of a very large scale integration (VLSI) architecture for low-power H.263/MPEG-4 video codec is addressed. Starting from a high-level system modelling, a profiling analysis indicates a hardware–software (HW–SW) partitioning assuming power consumption, flexibility and circuit complexity as main cost functions. The architecture is based on a reduced instruction set computer engine, enhanced by dedicated hardware processing, with a memory hierarchy organisation and direct memory access-based data transfers. To reduce the system power consumption two main strategies have been adopted. The first consists in the design of a low-power high-efficiency motion estimator specifically targeted to low bit-rate applications. Exploiting the correlation of video motion field it attains the same high coding efficiency of the full-search approach for a computational burden lower than about two orders of magnitude. Combining the decreased algorithm complexity with low-power VLSI design techniques the motion estimator power consumption is scaled down to few mW. The second consists in the implementation of a proper buffer hierarchy to reduce memory and bus power consumption in the HW–SW communication. The effectiveness of the proposed architecture has been validated through performance measurements on a prototyping platform.  相似文献   

18.
The direct provision of connectionless service in BISDN calls for servers that are connected to or are part of an ATM network to provide the routing function at input speeds up to 622 Mb/s. Routing is achieved in such a server by changing the VCI/VPI headers in the ATM cells; actual switching is done by existing switches in the ATM network. The paper presents an architecture capable of executing all the functions of a server at input speeds up to 622 Mb/s, scalable to multiple inputs at that speed, making use of processors and special hardware that are available today. To avoid storing large quantities of data, the architecture routes data packets by examining routing information in the initial cell of the packet and routing subsequent cells as they arrive rather than waiting until the complete packet has arrived. It is capable of handling packets that have been multiplexed at the SAR sublayer using AAL Type 3/4 and, with minor modifications, could also handle Type 5 traffic. Arguments are also presented for the use of AAL Type 5 for the direct connectionless service  相似文献   

19.
The application of fine grain pipelining techniques in the design of high performance Wave Digital Filters (WDFs) is described. It is shown that significant increases in the sampling rate of bit parallel circuits can be achieved using most significant bit (msb) first arithmetic. A novel VLSI architecture for implementing two-port adaptor circuits is described which embodies these ideas. The circuit in question is highly regular, uses msb first arithmetic and is implemented using simple carry-save adders.  相似文献   

20.
A very-large-scale integration architecture for Reed-Solomon (RS) decoding is presented that is scalable with respect to the throughput rate. This architecture enables given system specifications to be matched efficiently independent of a particular technology. The scalability is achieved by applying a systematic time-sharing technique. Based on this technique, new regular, multiplexed architectures have been derived for solving the key equation and performing finite field divisions. In addition to the flexibility, this approach leads to a small silicon area in comparison with several decoder implementations published in the past. The efficiency of the proposed architecture results from a fine granular pipeline scheme throughout each of the RS decoder components and a small overhead for the control circuitry. Implementation examples show that due to the pipeline strategy, data rates up to 1.28 Gbit/s are reached in a 0.5 μm CMOS technology  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号