首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 125 毫秒
1.
浮点乘法器是高动态范围(HDR)图像处理、无线通信等系统中的关键运算单元,其相比于定点乘法器动态范围更广,但复杂度更高。近似计算作为一种新兴范式,在受限的精度损失范围内,可大幅降低硬件资源和功耗开销。该文提出一种16 bit半精度近似浮点乘法器(App-Fp-Mul),针对浮点乘法器中的尾数乘法模块,根据其部分积阵列中出现1的概率,提出一种对输入顺序不敏感的近似4-2压缩器及低位或门压缩方法,在精度损失较小的条件下有效降低了浮点乘法器资源及功耗。相较于精确设计,所提近似浮点乘法器在归一化平均错误距离(NMED)为0.0014时,面积及功耗延时积方面分别降低20%及58%;相较于现有近似设计,在近似位宽相同时具有更高的精度及更小的功耗延时积。最后将该文所提近似浮点乘法器应用于高动态范围图像处理,相比现有主流方案,峰值信噪比和结构相似性分别达到83.16 dB 和 99.9989%,取得了显著的提升。  相似文献   

2.
IEEE-754 rounding support increases the critical delay for floating-point multipliers. Except round-to-zero mode all IEEE rounding modes test the (n???2) least significant product bits for one. The result of the test is indicated by the sticky-bit. Since fast generation of the sticky-bit is critical for performance, various sticky-bit generation designs are developed. This paper presents a comparison of previous fast sticky-bit generation designs and proposes a novel design that is independent from the multiplier’s hardware. Thus, the proposed design can be used in any floating-point multiplier or any floating-point multiply-accumulate circuit. The proposed method is one of the fastest among all methods and it uses the minimum hardware resources among the designs that use the same idea.  相似文献   

3.
Finite impulse response (FIR) filtering can be expressed as multiplications of vectors by scalars. We present high-speed designs for FIR filters based on a computation sharing multiplier which specifically targets computation re-use in vector-scalar products. The performance of the proposed implementation is compared with implementations based on carry-save and Wallace tree multipliers in 0.35-/spl mu/m technology. We show that sharing multiplier scheme improves speed by approximately 52 and 33% with respect to the FIR filter implementations based on the carry-save multiplier and Wallace tree multiplier, respectively. In addition, sharing multiplier scheme has a relatively small power delay product than other multiplier schemes. Using voltage scaling, power consumption of the FIR filter based on computation sharing multiplier can be reduced to 41% of the FIR filter based on the Wallace tree multiplier for the same frequency of operation.  相似文献   

4.
This brief presents a very low complexity hardware interleaver implementation for turbo code in wideband CDMA (W-CDMA) systems. Algorithmic transformations are extensively exploited to reduce the computation complexity and latency. Novel VLSI architectures are developed. The hardware implementation results show that an entire turbo interleave pattern generation unit consumes only 4 k gates, which is an order of magnitude smaller than conventional designs.  相似文献   

5.
刘飞  黎海涛 《信号处理》2012,28(3):397-403
在多元低密度奇偶校验码(NB-LDPC)的扩展最小和译码算法(EMS)中,由于消息向量的递归计算和校验/变量节点信息之间的迭代交换,导致译码器存在较大延迟。针对此问题本文提出了一种新型译码器结构,它优化了校验节点更新单步运算单元,根据前向后向算法规则,以3路单步运算单元完成校验节点更新,硬件资源消耗略有增加,但所需时钟周期约降为一般结构的1/3;并采用全并行运算的变量节点信息更新单元,无需利用前向后向算法将更新过程分解为多个单步运算,消除了变量节点更新的递归计算,且具有低复杂度低延时等优点,并在现场可编程门阵列(FPGA)Xilinx Virtex-4 (XC4VLX200)平台上对一个GF(16)域上(480,360)的准循环多元LDPC码进行了综合仿真。仿真结果证明,设计的译码器在较小资源消耗条件下能成倍提高吞吐量。   相似文献   

6.
Recently, emerging “edge computing” moves data and services from the cloud to nearby edge servers to achieve short latency and wide bandwidth, and solve privacy concerns. However, edge servers, often embedded with GPU processors, highly demand a solution for power-efficient neural network (NN) training due to the limitation of power and size. Besides, according to the nature of the broad dynamic range of gradient values computed in NN training, floating-point representation is more suitable. This paper proposes to adopt a logarithm-approximate multiplier (LAM) for multiply-accumulate (MAC) computation in neural network (NN) training engines, where LAM approximates a floating-point multiplication as a fixed-point addition, resulting in smaller delay, fewer gates, and lower power consumption. We demonstrate the efficiency of LAM in two platforms, which are dedicated NN training hardware, and open-source GPU design. Compared to the NN training applying the exact multiplier, our implementation of the NN training engine for a 2-D classification dataset achieves 10% speed-up and 2.3X efficiency improvement in power and area, respectively. LAM is also highly compatible with conventional bit-width scaling (BWS). When BWS is applied with LAM in five test datasets, the implemented training engines achieve more than 4.9X power efficiency improvement, with at most 1% accuracy degradation, where 2.2X improvement originates from LAM. Also, the advantage of LAM can be exploited in processors. A GPU design embedded with LAM executing an NN-training workload, which is implemented in an FPGA, presents 1.32X power efficiency improvement, and the improvement reaches 1.54X with LAM + BWS. Finally, LAM-based training in deeper NN is evaluated. Up to 4-hidden layer NN, LAM-based training achieves highly comparable accuracy as that of the accurate multiplier, even with aggressive BWS.  相似文献   

7.
熊承义  田金文  柳健 《信号处理》2006,22(5):703-706
模乘运算在剩余数值系统、数字信号处理系统及其它领域都具有广泛的应用,模乘法器的硬件实现具有重要的作用。提出了一种改进的模(2~n 1)余数乘法器的算法及其硬件结构,其输入为通常的二进制表示,因此无需另外的输人数据转换电路而可直接用于数字信号处理应用。通过利用模(2~n 1)运算的周期性简化其乘积项并重组求和项,以及采用改进的进位存储加法器和超前进位加法器优化结构以减少路径延时和硬件复杂度。比较其它同类设计,新的结构具有较好的面积、延时性能。  相似文献   

8.
张小妍  邵杰 《电子工程师》2009,35(11):24-27
运用流水线技术对单精度浮点乘法和加法运算单元进行了优化设计。浮点加法器采用了改进的双路径结构,重点对移位单元和前导1检测单元的结构进行了优化。浮点乘法器在对被乘数进行Booth编码后,采用改进的4-2压缩器构成Wallace树,在简化逻辑的同时,提高了系统的吞吐率。经过仿真验证,在Virtex-4系列FPGA(现场可编程门阵列)上,浮点加法器的最高运行速率达到405MHz,浮点乘法器的最高运行速率达到429MHz。  相似文献   

9.
The elaborate design of folded finite-impulse response (FIR) filters based on pipelined multiplier arrays is presented in this paper. The design is considered at the bit-level and the internal delays of the pipelined multiplier array are fully exploited in order to reduce hardware complexity. Both direct and transposed FIR filter forms are considered. The carry-save and the carry-propagate multiplier arrays are studied for the filter implementations. Partially folded architectures are also proposed which are implemented by cascading a number of folded FIR filters. The proposed schemes are compared as to the aspect of hardware complexity with a straightforward implementation of a folded FIR filter based on the pipelined Wallace Tree multiplier. The comparison reveals that the proposed schemes require 20%-30% less hardware. Finally, efficient implementation of partially folded FIR filter circuits is presented when constraints in area, power consumption and clock frequency are given.  相似文献   

10.
文章提出了一种有效的模(2n-1)余数乘法器实现的算法及其VLSI结构, 其输入为通常的二进制表示, 因此无需另外的输入数据转换电路即可用于数字信号处理.通过利用模(2n-1)运算的周期性,简化其乘积项,并重组求和项优化结构,以减少路径延时和硬件复杂度.较之同类设计, 该结构更加规则,且具有更好的面积和速度性能.  相似文献   

11.
Multiplication-accumulation operations described by represent the fundamental computation involved in many digital signal processing algorithms. For high speed signal processing, one obvious approach to realize the above computation in VLSI is to employm discrete multipliers working in parallel. However, a more area efficient approach is offered by the merged multiplication technique [5]. But the principal drawback of the conventional merged technique is its longer latency than the former discrete approach. This work proposes a hardware algorithm for merged array multiplication which eliminates this drawback and achieves significant improvement in latency when compared with the conventional scheme for merged multiplication. The proposed algorithm utilizes multiple wave front computation as opposed to the traditional approach where computation in an array multiplier is carried out by a single wave front. The improvement in latency by the proposed approach is greater than 40% (form>2) when compared with a conventional approach to merged multiplication. The consequent cost in the form of additional requirement of VLSI area is found to be rather small. In this paper, we provide a thorough analytic discussion on the proposed algorithm and support it by experimental results.  相似文献   

12.
In this paper, we propose two new VLSI architectures for computing the N-point discrete Fourier transform (DFT) and its inverse (IDFT) based on a radix-2 fast algorithm, where N is a power of two. The first part of this work presents a linear systolic array that requires log2 N complex multipliers and is able to provide a throughput of one transform sample per clock cycle. Compared with other related systolic designs based on direct computation or a radix-2 fast algorithm, the proposed one has the same throughput performance but involves less hardware complexity. This design is suitable for high-speed real-time applications, but it would not be easily realized in a single chip when N gets large. To balance the chip area and the processing speed, we further present a new reduced-complexity design for the DFT/IDFT computation. The alternative design is a memory-based architecture that consists of one complex multiplier, two complex adders, and some special memory units. The new design has the capability of computing one transform sample every log2 N+1 clock cycles on average. In comparison with the first design, the second design reaches a lower throughput with less hardware complexity. As N=512, the chip area required for the memory-based design is about 5742×5222 μm2, and the corresponding throughput can attain a rate as high as 4M transform samples per second under 0.6 μm CMOS technology. Such area-time performance makes this design very competitive for use in long-length DFT applications, such as asymmetric digital subscriber lines (ADSL) and orthogonal frequency-division multiplexing (OFDM) systems  相似文献   

13.
高质量0.6 Kb/s声码器的TMS320VC55x实现   总被引:1,自引:0,他引:1  
给出了一种编码速率为600b/s的高质量声码器算法及基于DSP芯片的硬件实现。介绍了语音编解码算法原理、声码器系统的硬件结构、工作流程以及软件实现与代码优化。针对C55xDSP芯片的结构特点,采用C与汇编混合编程,汇编指令优化等方法,大大降低了算法的存储复杂度和运算复杂度,达到了实时性要求。  相似文献   

14.
高速数据传输中的LDPC码译码算法研究   总被引:2,自引:2,他引:0  
在对LDPC码(低密度奇偶校验码)不同译码算法的原理进行深入研究的基础上,提出了一种改进型的最小和译码算法。通过分析不同译码算法的计算复杂度,验证不同算法的译码性能,表明改进最小和译码算法降低了运算复杂度,减少了平均迭代次数,改善了算法的收敛特性,是一种低运算量高性能的译码算法,有利于提高吞吐量,在高速数据传输上具有很大应用价值。  相似文献   

15.
Recently, it has become possible to implement floating-point cores on field-programmable gate arrays (FPGAs) to provide acceleration for the myriad applications that require high-performance floating-point arithmetic. To achieve high clock rates, floating-point cores for FPGAs must be deeply pipelined. This deep pipelining makes it difficult to reuse the same floating-point core for a series of dependent computations. However, floating-point cores use a great deal of area, so it is important to use as few of them in an architecture as possible. In this paper, we describe area-efficient architectures and algorithms for arithmetic expression evaluation. Such expression evaluation is necessary in applications from a wide variety of fields, including scientific computing and cognition. The proposed designs effectively hide the pipeline latency of the floating-point cores and use at most two floating-point cores for each type of operator in the expression. While best-suited for particular classes of expressions, the proposed designs can evaluate general expressions as well. Additionally, multiple expressions can be evaluated without reconfiguration. Experimental results show that the areas of our designs increase linearly with the number of types of operations in the expression and that our designs occupy less area and achieve higher throughput than designs generated by a commercial hardware compiler.  相似文献   

16.
The fast Fourier transform (FFT) is an algorithm widely used to compute the discrete Fourier transform (DFT) in real-time digital signal processing. High-performance with fewer resources is highly desirable for any real-time application. Our proposed work presents the implementation of the radix-2 decimation-in-frequency (R2DIF) FFT algorithm based on the modified feed-forward double-path delay commutator (DDC) architecture on FPGA device. Need for a complex multiplier to carry out the multiplication of complex twiddle factors and large memory to store the twiddle factors are the main concerns for FFT implementation. Propose work aims to address these issues. In this work, a high-performance radix-16 COordinate Rotational DIgital Computer (CORDIC) algorithm based rotator is proposed to carry out the complex twiddle factor multiplication. Further, CORDIC needs only rotational angles to carry out complex multiplication, which reduces the need for large memory to store the twiddle factors. To compute the total rotation for n-bit precision, our proposed radix-16 CORDIC algorithm takes n/4 iteration as compared to n iteration of the radix-2 CORDIC algorithm. Our proposed architecture of the radix-2 decimation-in-frequency (R2DIF) algorithm is implemented on a Virtex−7 series FPGA. Further, the detailed comparison is presented between our proposed FFT implementation and other recently proposed FFT implementations. Experimental results suggest that proposed implementation has less latency and hardware utilization as compared to recently proposed implementations.  相似文献   

17.
This paper presents a new modular multiplication algorithm that allows one to implement modular multiplications efficiently. It proposes a systematic approach for maximizing a level of parallelism when performing a modular multiplication. The proposed algorithm effectively integrates three different existing algorithms, a classical modular multiplication based on Barrett reduction, the modular multiplication with Montgomery reduction and the Karatsuba multiplication algorithms in order to reduce the computational complexity and increase the potential of parallel processing. The algorithm is suitable for both hardware implementations and software implementations in a multiprocessor environment. To show the effectiveness of the proposed algorithm, we implement several hardware modular multipliers and compare the area and performance results. We show that a modular multiplier using the proposed algorithm achieves a higher speed comparing to the modular multipliers based on the previously proposed algorithms.  相似文献   

18.
This investigation proposes a novel radix-42 algorithm with the low computational complexity of a radix-16 algorithm but the lower hardware requirement of a radix-4 algorithm. The proposed pipeline radix-42 single delay feedback path (R42SDF) architecture adopts a multiplierless radix-4 butterfly structure, based on the specific linear mapping of common factor algorithm (CFA), to support both 256-point fast Fourier transform/inverse fast Fourier transform (FFT/IFFT) and 8times8 2D discrete cosine transform (DCT) modes following with the high efficient feedback shift registers architecture. The segment shift register (SSR) and overturn shift register (OSR) structure are adopted to minimize the register cost for the input re-ordering and post computation operations in the 8times8 2D DCT mode, respectively. Moreover, the retrenched constant multiplier and eight-folded complex multiplier structures are adopted to decrease the multiplier cost and the coefficient ROM size with the complex conjugate symmetry rule and subexpression elimination technology. To further decrease the chip cost, a finite wordlength analysis is provided to indicate that the proposed architecture only requires a 13-bit internal wordlength to achieve 40-dB signal-to-noise ratio (SNR) performance in 256-point FFT/IFFT modes and high digital video (DV) compression quality in 8 times 8 2D DCT mode. The comprehensive comparison results indicate that the proposed cost effective reconfigurable design has the smallest hardware requirement and largest hardware utilization among the tested architectures for the FFT/IFFT computation, and thus has the highest cost efficiency. The derivation and chip implementation results show that the proposed pipeline 256-point FFT/IFFT/2D DCT triple-mode chip consumes 22.37 mW at 100 MHz at 1.2-V supply voltage in TSMC 0.13-mum CMOS process, which is very appropriate for the RSoCs IP of next-generation handheld devices.  相似文献   

19.
A new cell architecture for high performance digit-serial computation is presented. The design of this cell is based on the feed forward of the carry digit, which allows a high level of pipelining to increase the throughput rate with minimum latency. This will give designers greater flexibility in finding the best trade-off between hardware cost and throughput rate. A twin-pipe architecture to double the throughput rate of digit-serial/parallel multipliers is also presented. The effects of the number of pipelining levels and the twin architecture on the throughput rate and hardware cost are presented. A two's complement digit-serial/parallel multiplier which can operate on both negative and positive numbers is also presented.  相似文献   

20.
《Microelectronics Journal》2014,45(6):775-780
Decimal multiplication is a frequent operation with inherent complexity in implementation. Commercial and financial applications require working with decimal numbers while it has been shown that if we convert decimal number to binary ones, this will negatively influence the preciseness required for these applications. Existing research works on parallel decimal multipliers have mainly focused on latency and area as two major factors to be improved. However, energy/power consumption is another prominent issue in today׳s digital systems. While the energy consumption of parallel decimal multipliers has not been addressed in previous works, in this paper we present a comparative study of parallel decimal multipliers, considering energy/power consumption, leakage and dynamic power consumption, beside latency and area. This study can provide some guidelines for EDA tools and hardware designers to choose proper multiplier based on given applications and design constraints. All designs in were implemented using VHDL and synthesized in Design-Compiler toolbox with TSMC 45 nm technology file.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号