
1.

FPGA implementation of high-performance, resource-efficient Radix-16 CORDIC rotator based FFT algorithm

《Integration, the VLSI Journal》2020

The fast Fourier transform (FFT) is an algorithm widely used to compute the discrete Fourier transform (DFT) in real-time digital signal processing. High performance with fewer resources is highly desirable for any real-time application. The proposed work presents the implementation of the radix-2 decimation-in-frequency (R2DIF) FFT algorithm based on a modified feed-forward double-path delay commutator (DDC) architecture on an FPGA device. The need for a complex multiplier to carry out the multiplication by complex twiddle factors, and for a large memory to store the twiddle factors, are the main concerns for FFT implementation. The proposed work aims to address these issues. In this work, a high-performance radix-16 COordinate Rotational DIgital Computer (CORDIC) algorithm based rotator is proposed to carry out the complex twiddle factor multiplication. Further, CORDIC needs only rotation angles to carry out complex multiplication, which reduces the need for a large memory to store the twiddle factors. To compute the total rotation for n-bit precision, the proposed radix-16 CORDIC algorithm takes n/4 iterations, compared with n iterations for the radix-2 CORDIC algorithm. The proposed R2DIF architecture is implemented on a Virtex-7 series FPGA. Further, a detailed comparison is presented between the proposed FFT implementation and other recently proposed FFT implementations. Experimental results suggest that the proposed implementation has lower latency and hardware utilization than recently proposed implementations.
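
The idea of replacing a complex twiddle-factor multiplier with a CORDIC rotator can be sketched as follows. This is a minimal floating-point sketch of the conventional radix-2 CORDIC (one micro-rotation per bit of precision), not the radix-16 rotator of the paper; the function name `cordic_rotate` and its parameters are assumptions made for illustration.

```python
import math

def cordic_rotate(x, y, angle, n_iter=16):
    """Rotate (x, y) by `angle` radians using radix-2 CORDIC.

    Hypothetical helper; valid for |angle| up to about 1.74 rad,
    the sum of all elementary angles.
    """
    # Elementary angles atan(2^-i): the only constants that need storing
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    # Constant gain accumulated by n_iter micro-rotations
    gain = 1.0
    for i in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        # Shift-and-add micro-rotation by d * atan(2^-i)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x / gain, y / gain                # scale-factor compensation

# Multiplying by the twiddle factor W_8^1 = exp(-j*2*pi/8) is a rotation by -pi/4
re, im = cordic_rotate(1.0, 0.0, -2.0 * math.pi / 8)
```

Only the rotation angle is passed in, which is why a CORDIC-based design does not need a table of complex twiddle values.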

2.

Efficient arithmetic using self-timing

Ramachandran R. Shih-Lien Lu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1996,4(4):445-454

Recent advances in VLSI technology have facilitated high levels of integration and the implementation of faster circuits on a chip. Most of the improvements in the performance of digital systems have been brought about by such faster technologies. However, these improvements in technology have brought along with them a host of other constraints. In the faster deep submicron technologies, the wire delays constitute a significant portion of the overall delay of the system and hence some of the advantages of faster technologies are lost. The high level of integration necessitates clock distribution schemes which minimize skew across the die. These result in area penalties and adversely affect the level of integration possible at the chip level. Hence, changes in the basic architecture of the computing elements of a system which, when implemented in silicon, introduce reduced interconnect delays and simpler clock distribution networks will result in more effective performance improvements. The work presented here examines the implementation of the most basic element in any datapath: an adder. The adder, a carry elimination adder (CEA), uses self-timing at both the algorithmic and implementation levels and presents a minimal-hardware, high-speed addition mechanism. The adder exploits the nature of the input operands dynamically, which results in an average-case convergence time approaching that of the ubiquitous carry lookahead adder (CLA) with the hardware complexity of a carry ripple adder (CRA). Use of self-timing results in the elimination of a global clock and hence clock-skew
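
The operand-dependent completion time described in this abstract can be illustrated with the classic iterate-until-no-carry addition scheme. The sketch below is an assumption chosen for illustration, not the CEA circuit itself; `self_timed_add` and its `width` parameter are hypothetical names.

```python
def self_timed_add(a, b, width=16):
    """Add a and b modulo 2**width; return (sum, number of carry-elimination steps).

    Hypothetical model: each step forms half-adder sums and shifts the
    carries left, and the addition completes as soon as no carries remain.
    """
    mask = (1 << width) - 1
    s, c = a & mask, b & mask
    steps = 0
    while c:                                  # completion detection: carry word is zero
        s, c = s ^ c, ((s & c) << 1) & mask   # half-adder sum, carries shifted left
        steps += 1
    return s, steps

# Few steps for typical operands, `width` steps in the worst case
print(self_timed_add(5, 2))        # (7, 1)
print(self_timed_add(0xFFFF, 1))   # (0, 16)
```

The step count depends on the longest carry chain in the operands, which is short on average, so a circuit that signals its own completion can finish early for most inputs.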

3.

Memristor based N-bits redundant binary adder

《Microelectronics Journal》2015,46(3):207-213

This paper introduces a memristor based N-bit redundant binary adder architecture for canonic signed digit code (CSDC) as a step towards a memristor based multilevel ALU. New possible solutions for multi-level logic designs can be established by utilizing the memristor dynamics as a basis in the circuit realization. The proposed memristor-based redundant binary adder circuit tries to achieve the theoretical advantages of the redundant binary system, and to eliminate the carry (borrow) propagation using signed digit representation. The advantage of carry elimination in the addition process is that it makes the speed independent of the operand length, which speeds up all arithmetic operations. One memristor is sufficient both for the addition process and for storing the final result as a memory cell. The adder operation has been validated via different cases of 1-bit and 3-bit addition using the HP memristor model and PSPICE simulation results.

4.

Fast and gate-count efficient arithmetic logic unit

Yong Surk Lee Joh P. Jae Hee You Kyu Tae Park 《Electronics letters》1996,32(23):2126-2127

A CMOS arithmetic logic unit is presented with a minimum number of transistors and high-speed arithmetic operations. Multiple carry chain adders and a novel 1-bit adder are used in a carry select adder. The carry chain adder has a high degree of shared gates with a low propagation delay.

5.

Multi-mode parallel and folded VLSI architectures for 1D-fast Fourier transform

《Integration, the VLSI Journal》2016

Modern real-time applications such as orthogonal frequency division multiplexing demand high-performance fast Fourier transform (FFT) designs with less area and fewer clock cycles. This paper proposes efficient FFT VLSI architectures using folded/parallel implementation. In the proposed folded FFT architecture, the number of cycles required to complete the operation is less than in single-path delay feedback (SDF) and multi-path delay commutator (MDC) architectures. In the proposed parallel FFT architecture, an N-point FFT is implemented using one N/2-point FFT without much extra hardware. Both proposed architectures are implemented for radix-2, 2², and 4 using a 45 nm technology library. The proposed parallel architecture achieves 56.7% and 40.6% area reduction compared with the existing parallel architecture based 16-point radix-2 and radix-2² DIF FFTs, respectively. The proposed folded architecture achieves 65.5%, 51.1%, and 35.8% worst-path delay reduction compared with the existing SDF based 16-point radix-2, radix-2², and radix-4 DIF FFTs, respectively.

6.

Low complexity VLSI implementation of CORDIC-based exponent calculation for neural networks

Supriya Aggarwal Kavita Khare 《International Journal of Electronics》2013,100(11):1471-1488

This article presents a low-hardware-complexity design for exponent calculations based on CORDIC. The proposed CORDIC algorithm is designed to overcome major drawbacks (scale-factor compensation, low range of convergence and optimal selection of micro-rotations) of the conventional CORDIC in the hyperbolic mode of operation. The micro-rotations are identified using leading-one bit detection with uni-directional rotations to eliminate redundant iterations and improve throughput. The efficiency and performance of the processor are independent of the probability of rotation angles being known prior to implementation. The eight-stage pipelined architecture implementation requires an 8 × N ROM in the pre-processing unit for storing the initial coordinate values; it no longer requires the ROM for storing the elementary angles. It provides an area-time efficient design for VLSI implementation for calculating exponents in activation functions and Gaussian Potential Functions (GPF) in neural networks. The proposed CORDIC processor requires 32.68% fewer adders and 72.23% fewer registers than the conventional design. The proposed design, when implemented on a Virtex 2P (2vp50ff1148-6) device, dissipates 55.58% less power and has 45.09% less total gate count and 16.91% less delay than the Xilinx CORDIC Core. The detailed algorithm design, along with the FPGA implementation and area and time complexities, is presented.
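
For context, the drawbacks listed in this abstract can be seen in a sketch of the conventional hyperbolic CORDIC, which the paper sets out to improve. The code below is a minimal floating-point sketch of that conventional scheme, not the proposed design; `cordic_exp` and `n_iter` are hypothetical names.

```python
import math

def cordic_exp(t, n_iter=24):
    """Approximate exp(t) for |t| up to about 1.1 with conventional hyperbolic CORDIC."""
    # Iteration indices start at 1; indices 4, 13, 40, ... must be repeated
    # for the hyperbolic iterations to converge.
    indices = []
    i, next_repeat = 1, 4
    while len(indices) < n_iter:
        indices.append(i)
        if i == next_repeat and len(indices) < n_iter:
            indices.append(i)
            next_repeat = 3 * next_repeat + 1
        i += 1
    # Scale factor of the hyperbolic micro-rotations (needs compensation)
    gain = 1.0
    for k in indices:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * k))
    # Pre-scaling x by 1/gain compensates the scale factor up front
    x, y, z = 1.0 / gain, 0.0, t
    for k in indices:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * math.atanh(2.0 ** -k)       # elementary angles atanh(2^-k)
    return x + y                             # cosh(t) + sinh(t) = exp(t)

print(abs(cordic_exp(0.5) - math.exp(0.5)) < 1e-5)   # True
```

The sketch shows the scale factor, the table of elementary angles, and the limited convergence range (roughly |t| ≤ 1.1) that the abstract names as drawbacks of the conventional approach.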

7.

A GaAs low-power normally-on 4-bit ripple carry adder

《Solid-State Circuits, IEEE Journal of》1983,18(3):365-369

The realization and performance of a low-power buffered FET logic (lp-BFL) 4-bit ripple carry adder is reported. Performance measurements indicate a critical path average propagation delay of 1.9 ns at a total power dissipation of 45 mW, output buffers included (27 mW without). This corresponds to an average propagation delay of 380 ps/gate (FI/FO = 5/3), an average power consumption of 1.56 mW/gate, and a power-delay product of 0.6 pJ. Best speed performance biasing conditions yield a 1.25 ns critical path average propagation delay at a total power dissipation of 180 mW (180 mW excluding buffers), which corresponds to an average gate delay, power consumption and power-delay product of 250 ps, 6 mW, and 1.5 pJ, respectively. Standard cell layout techniques yield an average gate density of 200 gates/mm², interconnection wiring included.

8.

CORDIC algorithm for vectoring mode without constant scaling factors

Tso Bing Juang Hsiu Feng Lin 《Electronics letters》1999,35(12):971-972

A new CORDIC algorithm is presented that can be used for the vectoring mode without requiring constant scaling factors. The algorithm can also be used to carry out complete transformation from rectangular co-ordinates (x, y) to polar co-ordinates (ρ, θ) in each iteration. The exponent difference of x and y is computed so as to speed up convergence. This new CORDIC algorithm has an average of 0.75n iterations for n-bit input data and can achieve > 94.78% 23-bit accuracy. It is also suitable for VLSI chip implementation due to the regular architecture required.

9.

Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse

Ching-Hsien Chang Chin-Liang Wang Yu-Tai Chang 《Signal Processing, IEEE Transactions on》2000,48(11):3206-3216

In this paper, we propose two new VLSI architectures for computing the N-point discrete Fourier transform (DFT) and its inverse (IDFT) based on a radix-2 fast algorithm, where N is a power of two. The first part of this work presents a linear systolic array that requires log₂ N complex multipliers and is able to provide a throughput of one transform sample per clock cycle. Compared with other related systolic designs based on direct computation or a radix-2 fast algorithm, the proposed one has the same throughput performance but involves less hardware complexity. This design is suitable for high-speed real-time applications, but it would not be easily realized in a single chip when N gets large. To balance the chip area and the processing speed, we further present a new reduced-complexity design for the DFT/IDFT computation. The alternative design is a memory-based architecture that consists of one complex multiplier, two complex adders, and some special memory units. The new design has the capability of computing one transform sample every log₂ N+1 clock cycles on average. In comparison with the first design, the second design reaches a lower throughput with less hardware complexity. For N = 512, the chip area required for the memory-based design is about 5742×5222 μm², and the corresponding throughput can attain a rate as high as 4M transform samples per second under 0.6 μm CMOS technology. Such area-time performance makes this design very competitive for use in long-length DFT applications, such as asymmetric digital subscriber lines (ADSL) and orthogonal frequency-division multiplexing (OFDM) systems.
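
The radix-2 fast algorithm that such architectures map onto hardware can be sketched in software. The following is a minimal recursive decimation-in-frequency sketch under the assumption that the input length is a power of two; it illustrates the algorithm only, not either of the proposed architectures, and `fft_dif` and `dft` are hypothetical names.

```python
import cmath

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterflies: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    top = [x[k] + x[k + half] for k in range(half)]
    bottom = [(x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
              for k in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(top)
    out[1::2] = fft_dif(bottom)
    return out

def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [1, 2, 3, 4, 0, -1, -2, -3]
print(all(abs(a - b) < 1e-9 for a, b in zip(fft_dif(samples), dft(samples))))  # True
```

Each recursion level corresponds to one of the log₂ N butterfly stages, with one twiddle multiplication per butterfly.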

10.

An adaptive equalization algorithm based on orthogonal multiwavelets

Wang Junfeng, Song Guoxiang 《Journal of Electronics & Information Technology (电子与信息学报)》2002,24(11):1525-1529

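This paper proposes representing the equalizer with orthogonal multiwavelets. Because multiwavelets can simultaneously possess orthogonality, compact support, and linear phase, the sparsified estimate of the signal correlation matrix obtained after the multiwavelet transform has fewer nonzero elements and reduced boundary effects compared with the single-wavelet transform. On this basis, a Newton-LMS-type adaptive equalization algorithm in the orthogonal multiwavelet transform domain is given, whose computational complexity can be further reduced to O(N log N) by the preconditioned conjugate gradient method. Simulation results show that the algorithm converges quickly and is easy to implement in real time.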

11.

Modulated Lapped Transform Domain Acoustic Echo Canceller with Double Talk Detector

Kyusik Park Sujin Paek Younho Jo Kiman Kim 《Analog Integrated Circuits and Signal Processing》2005,45(1):99-108

A complete acoustic echo cancellation system with double talk detection capability is presented in this paper. The proposed system includes a new acoustic echo canceller (AEC) based on the modulated lapped transform (MLT) domain adaptive structure and a robust two-stage double talk detector (DTD) to cope with the MLT domain AEC. The proposed AEC achieves better signal decorrelation via an orthogonal MLT of size 2N × N rather than an N × N square orthogonal transform such as the DCT or DFT. Both the input signal and the desired response are modulated lapped transformed in order to reduce the adaptation error between them, so that the signal adaptation is operated purely in the MLT domain. As a complement to this, a two-stage DTD is developed to stabilize the operation of the AEC. The proposed DTD has a robust algorithm structure and allows faster switching according to changes in the talker state. Several simulation results with synthetic and real speech are presented to demonstrate the performance of the proposed AEC and DTD. The proposed MLT based AEC proves to be very useful for echo cancellation applications requiring high convergence speed and good echo attenuation. It achieves a convergence rate more than twice that of the traditional DCT based AEC, with an additional advantage of 10–15 dB ERLE improvement. In addition, the proposed two-stage DTD is shown to react quickly to both the onset and the end of double-talk with reasonably high accuracy.

12.

A combined equalizer based on the wavelet packet transform and a decision-feedback RBF network

Zhu Zhiyu 《Signal Processing (信号处理)》2008,24(6)

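This paper proposes the structure and algorithm of a combined nonlinear equalizer based on the wavelet packet transform and a decision-feedback RBF network. The signal is first decomposed by wavelet packets, and the decomposed signal components are then fed into an RBF neural network with a decision-feedback structure for equalization. On the one hand, wavelet packets have strong decorrelation ability, which improves the equalizer's convergence speed; on the other hand, the RBF neural network has strong nonlinear pattern classification ability, which lowers the equalizer's mean square error. In simulation experiments addressing the intersymbol interference (ISI) caused by multipath effects and channel fading in wireless digital signal transmission, the equalization performance of the least mean square (LMS) algorithm and the combined equalizer algorithm is compared. The results show that the combined equalization algorithm converges faster and has a lower bit error rate.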

13.

Architectural Synthesis of Computational Engines for Subband Adaptive Filtering

S. Ramanathan V. Visvanathan S.K. Nandy 《The Journal of VLSI Signal Processing》1999,22(3):173-195

Architectural synthesis of low-power computational engines (hardware accelerators) for a subband-based adaptive filtering algorithm is presented. The full-band least mean square (LMS) adaptive filtering algorithm, widely used in various applications, is confronted by two problems, viz., slow convergence when the input correlation matrix is ill-conditioned, and increased computational complexity for applications involving use of large adaptive filter orders. Both of these problems can be overcome by the use of a subband-based normalized LMS (NLMS) adaptive filtering algorithm. Since this algorithm is not amenable to pipelining, delayed coefficient adaptation in the NLMS updation is used, which provides the required delays for pipelining. However, the convergence speed of this subband-based delayed NLMS (DNLMS) algorithm degrades with increase in the adaptation delay. We first present a pipelined subband DNLMS adaptive filtering architecture with minimal adaptation delay for any given sampling rate. The architecture is synthesized by using a number of function preserving transformations on the signal flow graph (SFG) representation of the subband DNLMS algorithm. With the use of carry-save arithmetic, the pipelined architecture can support high sampling rates limited only by the delay of two full adders and a 2-to-1 multiplexer. We then extend this synthesis methodology to synthesize a pipelined subband DNLMS architecture whose power dissipation meets a specified budget. This low-power architecture exploits the parallelism in the subband DNLMS algorithm to meet the required computational throughput. The architecture exhibits a novel tradeoff between algorithmic performance (convergence speed) and power dissipation. 
Finally, we incorporate configurability for filter order, sample period, power reduction factor, number of subbands and decimation/interpolation factor in the low-power architecture, thus resulting in a low-power subband computational engine for adaptive filtering.

14.

High Speed Error Tolerant Adder for Multimedia Applications

S. Geetha P. Amritvalli 《Journal of Electronic Testing》2017,33(5):675-688

In this paper, a 1-bit modified full adder (MFA) cell is proposed. This eliminates the carry propagation during the addition by allowing errors in the carry bit. Using the proposed MFA, a 16-bit high speed error tolerant adder (HSETA) circuit is designed with a conventional carry select adder (CSLA) structure for the higher order bits and an MFA based structure for the lower order bits. The performance of the HSETA is compared with existing adders in terms of accuracy, gate count, delay and power dissipation. The gate count of the HSETA is reduced by 23% and speed is improved by 43% compared to a conventional 16-bit adder structure. Further, implementation on a Spartan 6 FPGA shows that the HSETA uses 53% fewer LUTs and 63% fewer slices than the conventional adder. An image blending application is used to evaluate the performance of the HSETA. In addition, to perform extensive error analysis, an analytical model is developed for the HSETA and tested for varying bit widths and input probabilities. The analytical model is validated through simulation.
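
The split into an exact upper part and a carry-free, error-tolerant lower part can be illustrated in software. The abstract does not describe the MFA cell, so the sketch below substitutes a different, well-known approximation (a bitwise OR for the lower bits, as in a lower-part OR adder); `approx_add`, `width`, and `low_bits` are hypothetical names.

```python
def approx_add(a, b, width=16, low_bits=6):
    """Approximate addition of two width-bit operands: exact upper part, carry-free lower part.

    Assumption: the lower `low_bits` bits are combined with a bitwise OR
    in place of the paper's MFA cell, so no carry crosses into the upper part.
    """
    low_mask = (1 << low_bits) - 1
    low = (a | b) & low_mask                                # no carry chain here
    high = ((a >> low_bits) + (b >> low_bits)) << low_bits  # exact adder, carry-in 0
    return (high | low) & ((1 << (width + 1)) - 1)

print(hex(approx_add(0x1200, 0x0340)))   # 0x1540, exact because the low bits are zero
print(approx_add(3, 1))                  # 3, while the exact sum is 4
```

With this substitution the error equals `a & b` restricted to the lower bits, so it is always non-negative and smaller than `2**low_bits`, which is why such adders suit error-tolerant workloads like image blending.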

15.

Combined trellis coding and DFE through Tomlinson precoding

Aman A.K. Cupo R.L. Zervos N.A. 《Selected Areas in Communications, IEEE Journal on》1991,9(6):876-884

The authors propose and evaluate a receiver architecture which combines the power of a decision feedback equalizer (DFE) with trellis coding, while allowing for minimal decoding delay in such a way that the total gain of the system is additive. The system is based on a structure that transposes the feedback filter of the DFE into the transmitter and, for high-order constellations, provides negligible increase in transmitter power. The first known hardware realization of a high bit rate digital subscriber line (HDSL) system that achieves the coding gain provided by a trellis code in addition to the equalization gain provided by the DFE is presented. A system whose complexity of implementation is comparable to that of a typical DFE and an independent Viterbi decoder is proposed.

16.

A New VLSI Architecture of Parallel Multiplier–Accumulator Based on Radix-2 Modified Booth Algorithm

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2010,18(2):201-208

In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator, which has the largest delay in the MAC, was merged into the CSA, the overall performance was elevated. The proposed CSA tree uses the 1's-complement-based radix-2 modified Booth's algorithm (MBA) and has a modified array for the sign extension in order to increase the bit density of the operands. The CSA propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the form of sum and carry bits instead of the output of the final adder, which made it possible to optimize the pipeline scheme to improve the performance. The proposed architecture was synthesized with 250, 180 and 130 μm, and 90 nm standard CMOS libraries. Based on theoretical and experimental estimation, we analyzed results such as the amount of hardware resources, delay, and pipelining scheme. We used Sakurai's alpha power law for the delay modeling. The proposed MAC showed properties superior to the standard design in many ways, and twice the performance of the previous research at a similar clock frequency. We expect that the proposed MAC can be adapted to various fields requiring high performance, such as signal processing.

17.

Adaptive frequency-domain equalization and diversity combining for broadband wireless communications

Clark M.V. 《Selected Areas in Communications, IEEE Journal on》1998,16(8):1385-1395

We introduce a new kind of adaptive equalizer that operates in the spatial-frequency domain and uses either least mean square (LMS) or recursive least squares (RLS) adaptive processing. We simulate the equalizer's performance in an 8-Mb/s quaternary phase-shift keying (QPSK) link over a frequency-selective Rayleigh fading multipath channel with ~3 μs RMS delay spread, corresponding to 60 symbols of dispersion. With the RLS algorithm and two diversity branches, our results show rapid convergence and channel tracking for a range of mobile speeds (up to ~100 mi/h). With a mobile speed of 40 mi/h, for example, the equalizer achieves an average bit error rate (BER) of 10⁻⁴ at a signal-to-noise ratio (SNR) of 15 dB, falling short of optimum linear receiver performance by about 4 dB. Moreover, it requires only ~50 complex operations per detected bit, i.e., ~400 M operations per second, which is close to achievable with state-of-the-art digital signal processing technology. An equivalent time-domain equalizer, if it converged at all, would require orders-of-magnitude more processing.

18.

Efficient FPGA implementation of bit-stream multipliers

Ng C.W. Wong N. Ng T.S. 《Electronics letters》2007,43(9):496-497

A four-input adder structure for the FPGA implementation of a sigma-delta bit-stream multiplier is proposed. Conventional bit-stream multiplier implementations involve two-input adder circuits. It is shown that the four-input adder structure is more resource-efficient (over 40% hardware savings) and faster (over 20% higher clock frequency) when implemented using state-of-the-art FPGA architecture featuring six-input look-up tables.

19.

Design and implementation of an SVD module based on a low-hardware-complexity, high-speed CORDIC

Zhang Xiaofan, Li Guangjun 《Acta Electronica Sinica (电子学报)》2015,43(4):738-742

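To reduce the hardware complexity and computation latency of SVD for high-order matrices, this paper improves the CORDIC iteration structure and designs a low-hardware-complexity, high-speed CORDIC computing unit for SVD. Taking a 2×2 matrix as an example, an SVD module using the optimized CORDIC computing unit is designed and implemented on the Xilinx Virtex6 hardware platform, achieving a throughput of 25.9 Gbps at a 19-bit word width. Compared with the equivalent module in the Xilinx IP core, the proposed design saves 27.6% of registers and 27.7% of look-up tables, and improves real-time performance by 14%. For high-order matrices, the paper gives resource-consumption trend curves, which show that the optimized CORDIC computing unit can reduce the hardware complexity of an SVD module for a 16th-order matrix by about 40%.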

20.

An enhanced high-speed multi-digit BCD adder using quantum-dot cellular automata

D. Ajith K. V. Ramanaiah V. Sumalatha 《Journal of Semiconductors (半导体学报)》2017,38(2):024002-9

High-performance, low-power digital circuits can be developed using an emerging nanodevice technology called quantum-dot cellular automata (QCA). Even though many efficient arithmetic circuits have been designed using QCA, implementing high-speed circuits in an optimized manner remains a challenge. Among these circuits, one of the essential structures is a parallel multi-digit decimal adder unit with significant speed, which is very attractive for future environments. To achieve high speed, a new correction logic formulation method is proposed for single- and multi-digit BCD adders. The proposed enhanced single-digit BCD adder (ESDBA) is 26% faster than the carry flow adder (CFA)-based BCD adder. The multi-digit operations are also performed using the proposed ESDBA, which is cascaded innovatively. The enhanced multi-digit BCD adder (EMDBA) performs two 4-digit and two 8-digit BCD additions 50% faster than the CFA-based BCD adder with a nominal area overhead. The EMDBA performs two 4-digit BCD additions 24% faster with a 23% decrease in area; similarly, for the 8-digit operation the EMDBA achieves a 36% increase in speed with 21% less area compared to the existing carry look ahead (CLA)-based BCD adder design. The proposed multi-digit adder produces a significantly lower delay of (N-1)+3.5 clock cycles compared to the N*One digit BCD adder delay required by the conventional BCD adder method. To the authors' knowledge, this is the first proposal for multi-digit BCD addition using QCA.
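
The role of the correction logic in a BCD adder can be shown with the textbook digit-serial scheme that such designs aim to speed up. This is a minimal software sketch of conventional BCD addition (add 6 when a digit sum exceeds 9), not the proposed QCA correction logic; `bcd_add` is a hypothetical name.

```python
def bcd_add(a_digits, b_digits):
    """Add two equal-length BCD numbers given as digit lists, most significant first.

    Returns (carry_out, sum_digits).
    """
    result, carry = [], 0
    for da, db in zip(reversed(a_digits), reversed(b_digits)):
        s = da + db + carry            # binary sum of two BCD digits plus carry-in
        if s > 9:                      # correction: add 6 and keep the low 4 bits
            s = (s + 6) & 0xF
            carry = 1
        else:
            carry = 0
        result.append(s)
    return carry, result[::-1]

print(bcd_add([4, 7, 9, 5], [6, 8, 3, 7]))   # (1, [1, 6, 3, 2]), i.e. 4795 + 6837 = 11632
```

In this conventional scheme each digit waits for the carry of the digit below it, which gives the N-times-one-digit delay that the abstract contrasts with the proposed (N-1)+3.5 clock cycles.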