Similar literature
1.
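A pipelined Coordinate Rotation Digital Computing (CORDIC) unit with a reconfigurable data word length is designed for use as the core of the processing element in a reconfigurable DSP array processing engine. First, the modulus (scale-factor) correction of the pipelined CORDIC is reworked so that the number of pipeline stages is reduced and the placement of the correction stages favors a word-length-reconfigurable design. Then, through horizontal and vertical reconfigurable connections between adjacent 8-bit pipelined CORDIC units, adjacent groups of (2×2)/(3×3)/(4×4) basic units can be combined into a CORDIC unit with a data word length of 16/24/32 bits.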
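As a reminder of what a CORDIC unit computes, the following is a minimal floating-point sketch of rotation-mode CORDIC with a final scale correction. It is not the reconfigurable fixed-point hardware described in the abstract; the function name `cordic_rotate` and the `iterations` parameter are hypothetical, and the sketch assumes the input angle lies within [-π/2, π/2].

```python
import math

def cordic_rotate(angle, iterations=16):
    """Rotation-mode CORDIC sketch: return (cos(angle), sin(angle)).

    Assumes |angle| <= pi/2. Uses floats for clarity; real hardware
    would use fixed-point shifts and adds.
    """
    # Elementary rotation angles atan(2^-i), one per iteration (pipeline stage).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of all micro-rotations; it is removed at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    # Scale (modulus) correction: divide out the CORDIC gain.
    return x / gain, y / gain
```

Each loop iteration corresponds to one shift-and-add stage, which is why the structure pipelines naturally; the closing division stands in for the correction stages that the abstract describes redistributing.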

2.
We present an efficient systolic implementation for 3-D IIR digital filters. The systolic implementation is obtained by using an algebraic mapping technique. This new mapping technique lets us mix pipelined variables and broadcast variables. Through the mapping method we also determine the buffer sizes, the directions of variable propagation, and the data feeding and extraction points. The resulting systolic array implementation is a modular structure composed of 2-D filter modules connected by simple buffers. This new systolic implementation is regular, modular, and amenable to VLSI implementation.

3.
The authors propose a systolic block Householder transformation (SBHT) approach to implement the HT on a systolic array and also propose its application to the RLS (recursive least squares) algorithm. Since the data are fetched in a block manner, vector operations are in general required for the vectorized array. However, a modified HT algorithm permits a two-level pipelined implementation of the SBHT systolic array at both the vector and word levels. The throughput rate can be as fast as that of the Givens rotation method. The present approach makes the HT amenable for VLSI implementation as well as applicable to real-time high-throughput applications of modern signal processing. The constrained RLS problem using the SBHT RLS systolic array is also considered.

4.
Using a simple input-regeneration approach and index-transformation techniques, a new formulation is presented in this paper for computing an N-point prime-length discrete sine transform (DST) through two pairs of [(N-1)/4]-point cyclic convolutions, where [(N-1)/4] is an odd number. The cyclic convolution-based algorithm is used further to obtain a simple, regular, and locally connected linear systolic array for concurrent pipelined implementation of the DST. It is shown that the proposed systolic structure involves significantly less area-time complexity compared with that of the existing structures.

5.
A family of Schur-type spatial least-squares algorithms is presented for solving the spatial LS estimation problem, in which the correlation matrix is neither Toeplitz nor near-Toeplitz, by order recursion. Normalized spatial Levinson- and Schur-type algorithms are also derived. Highly pipelined architectures are designed to realize these recursions. The reflection coefficients are first computed using the spatial Schur-type recursions. Then, the forward and backward filter parameters are calculated by the spatial Levinson-type recursions. A pyramid systolic array is demonstrated to calculate not only the filter parameters but also the LDU decomposition of the inverse cross-correlation matrix at every clock phase. This pyramid array can be mapped onto a two-dimensional systolic array which has a simpler structure. A square systolic array is developed to implement the Levinson- and Schur-type temporal recursive LS (RLS) algorithms. A highly concurrent architecture which exploits the parallelism of the spatial Schur-type recursions is illustrated to perform the LDU decomposition of the cross-correlation matrix.

6.
An improvement of the majority gate algorithm suitable for grey-scale morphological operations is presented in this Letter. The redundancy of temporal signals leads to a simplified hardware implementation. It is shown that max/min operators can be computed by the same circuit. A new pipelined systolic array architecture based on this circuit is illustrated for dilation/erosion operations.

7.
Recently, the power consumption of integrated circuits has been attracting increasing attention. Many techniques have been studied to improve the power efficiency of digital signal processing units such as fast Fourier transform (FFT) processors, which are popularly employed in both traditional research fields, such as satellite communications, and thriving consumer electronics, such as wireless communications. This paper presents solutions based on parallel architectures for high-throughput and power-efficient FFT cores. Different combinations of hybrid low-power techniques are exploited to reduce power consumption, such as multiplierless units which replace the complex multipliers in FFTs, low-power commutators based on an advanced interconnection, and parallel-pipelined architectures. A number of FFT cores are implemented and evaluated for their power/area performance. The results show that up to 38% and 55% power savings can be achieved by the proposed pipelined FFTs and parallel-pipelined FFTs respectively, compared to the conventional pipelined FFT processor architectures.

8.
Introduces efficient pipeline architectures for the recursive morphological operations. The standard morphological operation is applied directly to the original input image and produces an output image. The order of image scanning in which the operator is applied to the input pixels is irrelevant. The recursive morphological operations, by contrast, feed the output at the current scanning pixel back to overwrite its corresponding input pixel, so that it enters the computation at subsequent scanning pixels. The output image produced by recursive morphology therefore inherently depends on the image scanning sequence. Two pipelined implementations of the recursive morphological operations are presented. The design of an application-specific systolic array is first introduced. The systolic array uses 3n cells to process an n×n image in 6n-2 cycles. The cell utilization rate is 100%. Second, a parallel program implementing the recursive morphological operations and running on distributed-memory multicomputers is described. Performance of the program can be finely tuned by choosing appropriate partition parameters.
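The difference between standard and recursive morphology described above can be illustrated with a small 1-D sketch. The 3-point flat structuring element, the function names, and the sample signal are all assumptions chosen for illustration; the abstract itself concerns 2-D images and does not specify a structuring element.

```python
def erode_standard(signal):
    """3-point erosion (min filter); every output reads only original inputs."""
    n = len(signal)
    return [min(signal[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]

def erode_recursive(signal, reverse=False):
    """Same 3-point min, but each output overwrites its input in place,
    so later positions in the scan see already-updated neighbours."""
    work = list(signal)
    n = len(work)
    order = range(n - 1, -1, -1) if reverse else range(n)
    for i in order:
        work[i] = min(work[max(i - 1, 0):min(i + 2, n)])
    return work

signal = [1, 9, 9, 9]
standard = erode_standard(signal)                 # independent of scan order
forward = erode_recursive(signal)                 # left-to-right scan
backward = erode_recursive(signal, reverse=True)  # right-to-left scan
```

In the left-to-right scan the minimum at index 0 propagates across the whole signal, whereas the right-to-left scan leaves the trailing samples untouched, which is the scan-order dependence the abstract refers to.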

9.
Pipelined systolic architectures for DLMS adaptive filtering
This work reports two new pipelined, systolic architectures for delayed least mean squares (DLMS) adaptive filtering. In contrast to existing systolic architectures, which introduce a tracking delay that increases linearly with filter order, those presented here do not. They support the same sampling rate as the fastest such architecture reported so far, even when unpipelined. Our designs use significantly less hardware (i.e., multiply-accumulate modules and registers) with minimal control logic requirement on account of the algebraic projection techniques that we employ, implying a net gain in terms of the silicon area utilized and the dynamic power dissipated. Further, one of these architectures introduces only half the adaptation delay that is conventionally used for systolization; the other requires the normal adaptation delay, but compensates by using considerably reduced control logic. The sampling rates supported by our architectures are further increased by pipelining the processor modules to the level of a 4:2 compressor. This requires only small adaptation and tracking delays, which are independent of filter order, and is possible without requiring a modification of the basic algorithm (in terms of introducing a lookahead in the adaptation), all in contrast with the only pipelined DLMS architecture reported so far. We propose and implement a scheme in our architectures for computing a normalized step size for delayed adaptation, in the general context of a nonstationary real-time environment. The simulation studies performed with our architectures indicate remarkably improved convergence properties over those of previously reported architectures.
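For readers unfamiliar with DLMS, the sketch below shows the basic delayed update in software: the weights are corrected with an error and regressor that are `delay` samples old. It illustrates the generic algorithm only, not the systolic architectures or the normalized step-size scheme of the paper; the function names, the 3-tap test system, and the values of `mu` and `delay` are assumptions.

```python
import random

def dlms_identify(x, d, taps, mu, delay):
    """Delayed LMS: w(n+1) = w(n) + mu * e(n-delay) * u(n-delay)."""
    w = [0.0] * taps
    pending = []  # (error, regressor) pairs waiting out the adaptation delay
    for n in range(taps - 1, len(x)):
        u = [x[n - k] for k in range(taps)]          # current regressor
        y = sum(wk * uk for wk, uk in zip(w, u))     # filter output
        pending.append((d[n] - y, u))
        if len(pending) > delay:
            e_old, u_old = pending.pop(0)            # delayed error and data
            w = [wk + mu * e_old * uk for wk, uk in zip(w, u_old)]
    return w

def demo():
    """Identify a hypothetical 3-tap FIR system from noiseless data."""
    rng = random.Random(0)
    h = [0.5, -0.3, 0.2]
    x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
    d = [sum(h[k] * x[n - k] for k in range(len(h)) if n >= k)
         for n in range(len(x))]
    return h, dlms_identify(x, d, taps=3, mu=0.05, delay=4)
```

Setting `delay=0` reduces the loop to ordinary LMS; the adaptation delay is what gives a hardware implementation room to pipeline the error feedback path.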

10.
A comparative analysis is presented of major methods for reducing power consumption in pipelined ADCs. They are classified into two groups: (1) methods for the structural or parametric optimization of the conventional architecture and (2) original circuit configurations designed to minimize the power consumption by individual units.

11.
The residue number system (RNS) is inherently suited to high-speed computations using custom-tailored VLSI systems. In this paper, an algorithm for residue addition, based on a novel, non-unique number representation scheme, is implemented by a systolic array embedded in a VLSI chip. The pipelined cells are implemented using a true single-phase clock dynamic circuit structure with computer-synthesized minimized trees (switching trees). The array may be easily programmed by the user to accept any arbitrary modulus. Important applications of this array are in residue decoding and fault-tolerant computation requiring the use of the Chinese Remainder Theorem, where the modulus for addition is relatively large.

12.
A new hardware scheme is proposed to resolve data and control hazards and ensure precise exceptions under out-of-order execution in a microarchitecture with multiple pipelined functional units. The core of the proposed hardware is a register file called CARF, which is made of CAM (content-addressable memory), with an efficient state-transition mechanism for precise exception handling and prompt branch misprediction recovery.

13.
In this paper, a fully pipelined linear systolic array based on an adjusted Montgomery's algorithm is presented to perform modular multiplication at extremely high speed. The processing element (PE) consists of only 4 full-adders and 14 flip-flops. The three-stage internally pipelined PE results in a very short critical path with only a one-bit full-adder delay. Thus, it can run at a very high clock rate. The total execution time for an n-bit modular multiplication is 2n + 11 cycles with only (n/2 + 2) PEs. A modular exponentiation based on it takes (3n + 16.5)n cycles on average. Compared with most published VLSI modular multipliers, the hardware complexity is greatly reduced while very high throughput is maintained. It is therefore a good candidate for the arithmetic units used in many public-key cryptosystems, e.g. RSA and elliptic curve schemes, especially for embedded applications concerning information security.
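The following sketch shows the textbook bit-serial (radix-2) Montgomery multiplication that designs of this kind build on. It is not the adjusted algorithm or the PE structure of the paper; the function name and parameter names are hypothetical, and the sketch assumes an odd modulus below 2^nbits with both operands already reduced below the modulus.

```python
def montgomery_multiply(a, b, modulus, nbits):
    """Return a * b * 2^(-nbits) mod modulus.

    Assumes modulus is odd, modulus < 2**nbits, and 0 <= a, b < modulus.
    """
    acc = 0
    for i in range(nbits):
        acc += ((a >> i) & 1) * b   # add b when bit i of a is set
        if acc & 1:                 # make the accumulator even ...
            acc += modulus          # ... by adding the odd modulus
        acc >>= 1                   # exact division by 2
    if acc >= modulus:              # single final conditional subtraction
        acc -= modulus
    return acc

# Hypothetical small example: 7-bit operands modulo 97.
product = montgomery_multiply(45, 76, 97, 7)
```

Every iteration needs only an addition and a one-bit shift, with no trial division, which is what makes the recurrence attractive for a systolic array of full-adder-based PEs.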

14.
The design of higher-performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher-performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors to the more aggressive wide-issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described. Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle, covering the evolution of the mechanisms surrounding the instruction cache and the compiler optimizations used to better exploit them.

15.
Based on the microprocessor structure, an RSA coprocessor for an improved Montgomery algorithm has been designed. The functional units of this coprocessor operate concurrently, and up to three instructions can be issued in one cycle. A mixed form of three-stage and two-stage pipelined structure is used for instruction execution, and the coprocessor and CPU core can share a common RAM through a set of controlled switches. The structure of the coprocessor can be expanded to contain more than one multiplier-accumulator unit for higher performance.

16.
The authors believe that special-purpose architectures for digital signal processing (DSP) real-time applications will use closely coupled processing elements as array processor modules to implement the various portions of the new algorithms, and several such modules will cooperate in a pipelined manner to implement complete algorithms. Such an architecture, based upon systolic modules, for the MUSIC algorithm is presented. The architecture is suitable for VLSI implementation. The throughput of the pipelined approach is O(N), whereas the sequential approach is O(N^3).

17.
This paper describes a scalable pipelined RAM system (SPRAMS) for packet switching. The SPRAMS consists of a two-dimensional array of small memory blocks which are fully pipelined and communicate with adjacent blocks in three directions. The maximum delay of a small memory block becomes the cycle time of the chip. The array configuration is scalable for large memory size without the cycle time variation. It has an initial latency of N+3 cycles with an N×N array configuration. We have designed an experimental 200 MHz 4 kbit static RAM chip with the 4×4 array configuration of 256 bit SRAM blocks. It was fabricated in 0.8 μm single-poly double-metal CMOS technology. Experimental results describe the advantages of SPRAMS.

18.
Systolic Kalman filter (SKF) designs based on a triangular array (triarray) configuration are presented. A least squares formulation, which is an expanded matrix representation of the state space iteration, is adopted to develop an efficient iterative QR triangularization and consecutive data prewhitening formulations. This formulation has advantages in both numerical accuracy and processor utilization efficiency. Moreover, it leads naturally to pipelined architectures such as systolic or wavefront arrays. For an n-state and m-measurement dynamic system, the SKF triarray design uses n(n+3)/2 processors and requires only 4n+m timesteps to complete one iteration of the prewhitened Kalman filtering system. This means a speedup factor of approximately n^2/4 when compared with a sequential processor. Also proposed for the colored noise case are data prewhitening triarrays which offer compatible speedup performance for the preprocessing stage. Based on a comparison of several competing alternatives, the proposed array processor may be considered a most efficient systolic or wavefront design for Kalman filtering.

19.
A multiwavelength laser (MWL) is fabricated by means of selective area growth (SAG) with metal organic vapour phase epitaxy (MOVPE). The MWL consists of an array of amplifiers monolithically integrated with a transmissive (de-)multiplexer and, to the authors' knowledge, is the first device of its kind realised with only two growth steps making use of SAG MOVPE.

20.
Per-flow queueing with sophisticated scheduling is one of the methods for providing advanced quality of service (QoS) guarantees. The hardest and most interesting scheduling algorithms rely on a common computational primitive, implemented via priority queues. To support such scheduling for a large number of flows at OC-192 (10 Gb/s) rates and beyond, pipelined management of the priority queue is needed. Large priority queues can be built using either calendar queues or heap data structures; heaps feature smaller silicon area than calendar queues. We present heap management algorithms that can be gracefully pipelined; they constitute modifications of the traditional ones. We discuss how to use pipelined heap managers in switches and routers and their cost-performance tradeoffs. The design can be configured to any heap size, and, using 2-port 4-wide SRAMs, it can support initiating a new operation on every clock cycle, except that an insert operation or one idle (bubble) cycle is needed between two successive delete operations. We present a pipelined heap manager implemented in synthesizable Verilog form, as a core integratable into ASICs, along with cost and performance analysis information. For a 16 K entry example in 0.13-μm CMOS technology, silicon area is below 10 mm^2 (less than 8% of a typical ASIC chip) and performance is a few hundred million operations per second. We have verified our design by simulating it against three heap models of varying abstraction.
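One way a heap insert can be made to flow in the same root-to-leaf direction as a delete, and so share a pipeline with it, is to walk down the path to the next free slot while carrying the larger value along. The sketch below illustrates that idea in software. It is an assumption about the kind of modification meant; the abstract does not spell out its algorithms, and the function name and 1-indexed array layout are hypothetical.

```python
def insert_top_down(heap, value):
    """Insert into a 1-indexed array min-heap (heap[0] is an unused placeholder),
    visiting levels from the root downward instead of sifting up."""
    heap.append(None)                 # reserve the new last slot
    target = len(heap) - 1            # index of that slot
    # The bits of `target` below its leading 1 spell the root-to-slot path
    # (0 = go to left child, 1 = go to right child).
    depth = target.bit_length() - 1
    node = 1
    carried = value
    for level in range(depth - 1, -1, -1):
        if carried < heap[node]:
            # Keep the smaller value here and carry the larger one downward.
            heap[node], carried = carried, heap[node]
        node = 2 * node + ((target >> level) & 1)
    heap[node] = carried

queue = [None]
for key in [5, 3, 8, 1, 9, 2, 7]:
    insert_top_down(queue, key)
```

Because each step touches one heap level and then moves strictly downward, a hardware version could let a second operation enter the top level while the first is still working further down, which is the property a pipelined heap manager relies on.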
