1.

Design of a cycle-efficient 64-b/32-b integer divisor using a table-sharing algorithm

Chua-Chin Wang Po-Ming Lee Jun-Jie Wang Chenn-Jung Huang 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(4):737-740

In new generations of microprocessors, the superscalar architecture is widely adopted to increase the number of instructions executed in one cycle. The division instruction needs more cycles than other instructions, e.g., addition and multiplication. This makes the division instruction an important factor in the cycles-per-instruction figure of modern microprocessors. In this paper, a radix-16/8/4/2 divisor is proposed, which uses a variety of techniques, including operand scaling, table partitioning, and, particularly, table sharing, to increase performance without the cost of increasing complexity. A physical chip using the proposed method is implemented in 0.35-μm single-poly four-metal (1P4M) CMOS technology. Test measurements show that the chip can execute signed 64-b/32-b integer division in 3 to 13 cycles with an 80-MHz operating clock.

2.

Radix-8 division with over-redundant digit set

Paolo Montuschi Luigi Ciminiera 《The Journal of VLSI Signal Processing》1994,7(3):259-270

We present a radix-8 divider that uses an over-redundant digit set for the quotient in order to obtain simple digit selection rules. We show that the proposed enlarged set of values for the quotient digit increases neither the complexity nor the delay of the adder required to update the remainder, with respect to similar solutions, since the values allowed for the quotient digit have been selected carefully. The digit selection process is subdivided into two concurrent steps, each one making reference to a secondary digit set, and the resulting implementation can be cheaper and faster than other units which do not use over-redundant digit sets. A performance analysis estimates a speed improvement from 25% to 35% with respect to a radix-8 architecture by Fandrianto, and from 21% to 30% with respect to a radix-4 architecture with prescaling, presented by Ercegovac and Lang. As required by the IEEE 754 floating point standard, the proposed algorithm features the correct remainder of the division.

3.

An efficient maximum-redundancy radix-8 SRT division and square-root method

Hobson R.F. Fraser M.W. 《Solid-State Circuits, IEEE Journal of》1995,30(1):29-38

A new approach to integrating hardware multiplication, division, and square-root is presented. We use a fully integrated control path which simultaneously reduces part of the redundant partial-remainder and performs a truncated multiplication of the next quotient or square-root digit by the divisor or square-root value. A separate (parallel) full precision iterative multiplier is used for partial remainder production. Strategic details of a radix-8 implementation are discussed. It is shown that a maximally redundant digit set is a viable choice for high performance in this case.

4.

Design of a low-hardware-overhead, low-power 64-point FFT processor for OFDM applications

Yu Jian 《电讯技术》 (Telecommunication Engineering) 2020,(3):338-343

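In wireless systems based on orthogonal frequency division multiplexing (OFDM), the fast Fourier transform (FFT) is a key module that consumes a large amount of hardware resources. To address this, a low-hardware-overhead, low-power pipelined FFT processor based on the radix-2^4 algorithm is proposed for the wireless LAN baseband technology of the IEEE 802.11a standard. In hardware, a single-path delay feedback (SDF) pipeline architecture is adopted; to reduce hardware resource consumption, canonical signed digit (CSD) constant multipliers, built on a new improved butterfly architecture, replace Booth multipliers for all complex multiplications. The design was developed with the QUARTUS PRIME tool and targets Cyclone 10 LP series devices; compilation results show that, compared with other existing designs, it saves at least 25% in hardware cost and reduces power consumption by 18%.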
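
As a rough illustration of why a CSD constant multiplier is cheap, the sketch below recodes a constant into canonical signed digits (no two adjacent non-zero digits) and multiplies by it with shifts, additions and subtractions only. The function names `to_csd` and `csd_multiply` are hypothetical and are not taken from the paper; the paper's butterfly architecture is not modelled here.

```python
def to_csd(n: int) -> list[int]:
    """Return the CSD digits (-1, 0, 1) of a non-negative integer, least significant first."""
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x: int, c: int) -> int:
    """Multiply x by the non-negative constant c using only shifts, adds and subtracts."""
    acc = 0
    for i, d in enumerate(to_csd(c)):
        if d == 1:
            acc += x << i
        elif d == -1:
            acc -= x << i
    return acc
```

For example, the constant 7 (binary 111, three additions) becomes 8 - 1 in CSD, which needs a single subtraction.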

5.

A new carry-free division algorithm and its application to a single-chip 1024-b RSA processor

Vandemeulebroecke A. Vanzieleghem E. Denayer T. Jespers P.G.A. 《Solid-State Circuits, IEEE Journal of》1990,25(3):748-756

A carry-free division algorithm is described. It is based on the properties of redundant signed digit (RSD) arithmetic to avoid carry propagation and uses the minimum hardware per bit, i.e. one full adder. Its application to a 1024-b RSA (Rivest, Shamir, and Adleman) cryptographic chip is presented. The features of this new algorithm allowed high performance (8 kb/s for 1024-b words) to be obtained for relatively small area and power consumption (80 mm² in a 2-μm CMOS process and 500 mW at 25 MHz).

6.

A fast VLSI adder architecture

Srinivas H.R. Parhi K.K. 《Solid-State Circuits, IEEE Journal of》1992,27(5):761-767

An architecture for performing fixed-point, high-speed, two's-complement, bit-parallel addition by using the carry-free property of redundant arithmetic and a fast parallel redundant-to-binary conversion scheme is presented. The internal numbers are represented in radix-2 redundant digit form, and the inputs and the output of the adder are represented in two's-complement binary form. The adder operands are added first in a radix-2 redundant adder to produce the result in radix-2 digit (-1, 0, 1) form. This result is converted to two's-complement binary form using the parallel conversion scheme. The high-speed conversion for long words is achieved through the use of a novel sign-select operation. The proposed adder, referred to as the sign-select conversion adder, is faster than all previous high-speed two's-complement binary adders for large word lengths. The implementation is highly regular with repeated modules and is very well suited for VLSI implementation.
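
To make the redundant-to-binary step concrete, here is a minimal sketch of the textbook conversion: the digits equal to 1 and the digits equal to -1 are collected into two ordinary binary words, which are then subtracted. This is an assumption-level illustration only; it does not implement the sign-select scheme of the paper, and the names `sd_to_int` and `to_twos_complement` are hypothetical.

```python
def sd_to_int(digits: list[int]) -> int:
    """Convert radix-2 signed digits (-1, 0, 1), most significant first, to an integer."""
    pos = 0  # binary word built from the +1 digits
    neg = 0  # binary word built from the -1 digits
    for d in digits:
        pos = (pos << 1) | (1 if d == 1 else 0)
        neg = (neg << 1) | (1 if d == -1 else 0)
    # One carry-propagating subtraction; this is the step fast converters try to speed up.
    return pos - neg


def to_twos_complement(value: int, width: int) -> int:
    """Return the width-bit two's-complement bit pattern of value as a non-negative integer."""
    return value & ((1 << width) - 1)
```

For instance, the digit string (1, -1, 0, 1) represents 8 - 4 + 1 = 5, and (-1, 0, 1) represents -3, whose 4-bit two's-complement pattern is 1101.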

7.

Minimizing the complexity of SRT tables

Oberman S.F. Flynn M.J. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1998,6(1):141-149

This paper presents an analysis of the complexity of quotient digit selection tables in SRT division implementations. SRT dividers are widely used in VLSI systems to compute floating-point quotients. These dividers use a fixed number of partial remainder and divisor bits to consult a table to select the next quotient-digit in each iteration. This analysis derives the allowable divisor and partial remainder truncations for radix 2 through radix 32, and it quantifies the relationship between table parameters and the complexity of the tables. Several techniques are presented for further minimizing table complexity. By mapping the tables to a library of standard-cells, delay and area values were measured and are presented for table configurations through radix 32. Several conclusions are drawn from these data that affect optimized SRT divider designs.
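
The digit-recurrence idea behind SRT division can be sketched for the simplest case, radix 2 with quotient digits in {-1, 0, 1}. The sketch below is an illustration under stated assumptions, not the table-driven hardware analysed in the paper: it compares the exact shifted remainder against ±1/2 instead of consulting a table indexed by truncated remainder and divisor bits, and the name `srt_radix2` is hypothetical.

```python
from fractions import Fraction


def srt_radix2(x: Fraction, d: Fraction, n: int):
    """Radix-2 SRT recurrence w <- 2w - q*d with q in {-1, 0, 1}.

    Assumes 1/2 <= d < 1 and |x| <= d. Returns (digits, final remainder w),
    where x == Q*d + w * 2**-n and Q = sum(q_j * 2**-j for j = 1..n).
    """
    half = Fraction(1, 2)
    w = Fraction(x)
    digits = []
    for _ in range(n):
        shifted = 2 * w
        if shifted >= half:
            q = 1
        elif shifted < -half:
            q = -1
        else:
            q = 0
        w = shifted - q * d  # remainder stays within [-d, d]
        digits.append(q)
    return digits, w
```

Because the selection only needs to know whether the shifted remainder is above 1/2, below -1/2, or in between, a hardware implementation can decide from a few leading bits, which is what makes small selection tables possible.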

8.

Radix-4 Vectoring CORDIC Algorithm and Architectures

J. Villalba E.L. Zapata E. Antelo J.D. Bruguera 《The Journal of VLSI Signal Processing》1998,19(2):127-147

In this work we extend the radix-4 CORDIC algorithm to the vectoring mode (the radix-4 CORDIC algorithm was proposed recently by the authors for the rotation mode). The extension to the vectoring mode is not straightforward, since the digit selection function is more complex in the vectoring case than in the rotation case; as in the rotation mode, the scale factor is not constant. Although the radix-4 CORDIC algorithm in vectoring mode has a recurrence similar to that of the radix-4 division algorithm, there are specific issues concerning the vectoring algorithm that demand dedicated study. We present the digit selection for nonredundant and redundant arithmetic (following two different approaches: arithmetic comparisons and table look-up), the computation and compensation of the scale factor, and the implementation of the algorithm (with both types of digit selection) in a word-serial architecture. When compared with conventional radix-2 (redundant and non-redundant) architectures, the radix-4 algorithms present a significant speed up for angle calculation. For the computation of the magnitude the speed up is very slight, due to the nonconstant scale factor in the radix-4 algorithm.
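
For orientation, the conventional radix-2 vectoring mode that the paper uses as its baseline can be sketched as follows. This is a floating-point illustration with a hypothetical function name `cordic_vectoring`; it is not the radix-4 algorithm of the paper. In radix 2 the scale factor is a constant that depends only on the iteration count, which is why it can be compensated with a single division here.

```python
import math


def cordic_vectoring(x: float, y: float, iterations: int = 24):
    """Radix-2 CORDIC in vectoring mode for x > 0: returns (magnitude, angle in radians)."""
    z = 0.0
    gain = 1.0
    for i in range(iterations):
        t = 2.0 ** -i
        if y >= 0:
            # rotate clockwise to drive y toward zero, accumulate the angle
            x, y, z = x + y * t, y - x * t, z + math.atan(t)
        else:
            x, y, z = x - y * t, y + x * t, z - math.atan(t)
        gain *= math.sqrt(1.0 + t * t)  # constant scale factor in radix 2
    return x / gain, z
```

Calling `cordic_vectoring(3.0, 4.0)` yields a magnitude close to 5 and an angle close to `math.atan2(4, 3)`.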

9.

Design of a high-performance parallel fully redundant decimal multiplier

Zhang Liu Cui Xiaoping Dong Wenwen 《电子学报》 (Acta Electronica Sinica) 2018,46(6):1519-1523

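Demand for high-precision computation in fields such as commercial computing and financial analysis places ever higher requirements on hardware decimal arithmetic. In existing fully redundant decimal multipliers, the complex structure of the fully redundant adder has become a bottleneck to performance improvement. This paper optimizes the design of a fully redundant adder based on the overloaded decimal digit set (ODDS) to reduce its complexity, and designs a new decimal compression tree module based on this adder. In the partial product generation module, signed radix-10 encoding and redundant binary-coded decimal (BCD) encoding are used to generate the decimal partial products quickly. In the final product generation module, an optimized code conversion circuit is used to generate the BCD-8421 product quickly. Experimental results show that the designed parallel fully redundant decimal multiplier is fast and has a small area.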

10.

Design of a pipelined Booth multiplier with a novel structure

Li Feixiong Jiang Lin 《电子科技》 (Electronic Science and Technology) 2013,26(8):46-48,67

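Building on a study of the traditional Booth multiplier, this paper introduces a pipelined Booth multiplier with a novel structure. It is implemented with radix-4 Booth encoding, a Wallace tree compression structure, and a 64-bit Kogge-Stone prefix adder; four levels of pipeline registers are inserted into the segmented 64-bit Kogge-Stone prefix adder, achieving fast 32 bit × 32 bit multiplication of unsigned and signed numbers. The multiplier is designed in a hardware description language, verified on a field programmable gate array (FPGA), and synthesized with the SMIC 0.18 μm CMOS standard cell process. Synthesis results show a critical path delay of 3.6 ns, a chip area below 0.134 mm², and power consumption below 32.69 mW.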
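
The radix-4 Booth encoding mentioned above can be sketched in software as follows. This is a behavioural illustration under stated assumptions (signed multiplier, even bit width), with hypothetical function names; it does not model the Wallace tree, the Kogge-Stone adder, or the pipeline of the paper.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a bits-wide two's-complement multiplier into radix-4 digits in {-2..2}.

    Digits are returned least significant first; bits must be even.
    """
    y &= (1 << bits) - 1
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int) -> int:
    """Multiply x by the signed bits-wide value y using bits/2 partial products."""
    # In hardware each d * x is a selection among 0, +/-x and +/-2x (a shift and a negate).
    return sum(d * (x << (2 * i)) for i, d in enumerate(booth_radix4_digits(y, bits)))
```

The recoding halves the number of partial products (16 instead of 32 for a 32-bit multiplier), which is what shortens the compression tree.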

11.

Design and ASIC implementation of a high-speed divider (cited by: 3 total; 0 self-citations; 3 by others)

Huang Xiusun Ye Qing Qiu Yulin 《微电子学与计算机》 (Microelectronics & Computer) 2008,25(2):133-135

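To increase the speed of division, a new high-speed divider algorithm based on radix-16 is proposed and implemented using an application-specific integrated circuit design methodology. Compared with the divider used in the MIPS processor, the maximum circuit delay is reduced by 27%, the number of clock cycles required is reduced by 68%, and speed performance is improved by about 77%. Other performance figures of the circuit are also given. The circuit is suitable for applications that demand very high computation speed.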

12.

Hybrid low-latency serial-parallel multiplier architecture

Al-Besher B. Bouridane A. Ashur A.S. Crookes D. 《Electronics letters》1998,34(2):141-143

A novel low latency, most significant digit-first, signed digit multiplier architecture is presented. The design of the multiplier is based on a new 2-bit adder cell. Judicious deployment of latches in the circuit ensures that the multiplier operates on two coefficients of the multiplicand at the same time and produces one 2n digit product every 2n+3 cycles with an initial delay (latency) of three cycles. Comparison with existing multipliers has shown a superior performance of the proposed architecture.

13.

A hardware architecture for the Montgomery modular multiplication algorithm

Wang Dili Bai Guoqiang Chen Hongyi 《微电子学与计算机》 (Microelectronics & Computer) 2010,27(5)

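Based on the binary multi-word Montgomery modular multiplication algorithm, a regular systolic-array hardware architecture with flexibly configurable parameters is proposed, and Montgomery modular multiplication of different bit widths is implemented on an FPGA using this architecture. Without adding extra circuitry or execution cycles, the architecture confines the critical path of the systolic array to the adder inside each processing element. Hardware implementation results show that the architecture achieves a higher circuit frequency, smaller circuit area, and shorter algorithm execution time.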
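
The binary (radix-2) Montgomery recurrence that such hardware builds on can be sketched as below. This is a single-word, bit-serial illustration under stated assumptions (odd modulus, operands already reduced); it does not reproduce the multi-word scheduling or the systolic array of the paper, and the name `montgomery_multiply` is hypothetical.

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2**(-k) mod n for odd n < 2**k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b  # add b if bit i of a is set
        if s & 1:
            s += n               # make s even so the halving is exact (n is odd)
        s >>= 1
    # s < 2n holds after every iteration, so one conditional subtraction suffices
    if s >= n:
        s -= n
    return s
```

Each iteration uses only additions and a one-bit shift, with no trial division by the modulus, which is why the recurrence maps well onto a regular array of adders.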

14.

High-speed VLSI arithmetic processor architectures using hybrid number representation

H. R. Srinivas Keshab K. Parhi 《The Journal of VLSI Signal Processing》1992,4(2-3):177-198

This paper addresses the design of high speed architectures for fixed-point, two's-complement, bit-parallel division, square-root, and multiplication operations. These architectures make use of hybrid number representations (i.e. the input and output numbers are represented using two's complement representation, and the internal numbers are represented using radix-2 redundant representation). We propose new shifted remainder conditioning and sign multiplexing techniques in combination with novel circuit architecture approaches to obtain efficient divider and square-root architectures. Our divider exploits full dynamic range of operands and eliminates the need for on-line or off-line conversion of the result to binary (this is because our nonrestoring division and square-root operators output binary quotient). Furthermore, since the binary input set is a subset of the redundant digit set, no binary-to-redundant number conversion is necessary at the input of the divider and square-root operators. We also present a fast, new conversion scheme for converting radix-2 redundant numbers to two's complement binary numbers, and use this to design a bit-parallel multiplier. This multiplier architecture requires fewer pipelining latches than conventional two's complement multipliers, and reduces the latency of the multiplication operation from (2W–1) to about W (where W is the word-length), when pipelined at the bit-level. This research was supported by the Office of Naval Research under contract number N00014-J-91-1008.

15.

Cellular-array modular multiplier for fast RSA public-key cryptosystem based on modified Booth's algorithm

Jin-Hua Hong Cheng-Wen Wu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):474-484

We propose a radix-4 modular multiplication algorithm based on Montgomery's algorithm, and a fast radix-4 modular exponentiation algorithm for Rivest, Shamir, and Adleman (RSA) public-key cryptosystem. By modifying Booth's algorithm, a radix-4 cellular-array modular multiplier has been designed and simulated. The radix-4 modular multiplier can be used to implement the RSA cryptosystem. Due to reduced number of iterations and pipelining, our modular multiplier is four times faster than a direct radix-2 implementation of Montgomery's algorithm. The time to calculate a modular exponentiation is about n² clock cycles, where n is the word length, and the clock cycle is roughly the delay time of a full adder. The utilization of the array multiplier is 100% when we interleave consecutive exponentiations. Locality, regularity, and modularity make the proposed architecture suitable for very large scale integration implementation. High-radix modular-array multipliers are also discussed, at both the bit level and digit level. Our analysis shows that, in terms of area-time product, the radix-4 modular multiplier is the best choice.

16.

Fast radix-2 division with quotient-digit prediction

Miloš D. Ercegovac Tomas Lang 《The Journal of VLSI Signal Processing》1989,1(3):169-180

An implementation of a radix-2 division unit is presented that uses prediction of the quotient digit. This prediction allows the concurrent computation of the quotient digit and the partial remainder. To achieve a simple quotient-digit selection, resulting in a step time roughly half of that of SRT division (without prediction), a simple estimate of the partial remainder is used, which requires that the divisor be scaled close to unity. This prescaling is simple to implement and increases the execution time by two cycles. We estimate a speed-up of 1.5 with respect to SRT division with redundant remainders.

17.

A Radix 2 Shared Division/Square Root Algorithm and its VLSI Architecture

Hosahalli R. Srinivas Keshab K. Parhi 《The Journal of VLSI Signal Processing》1999,21(1):37-60

This paper presents the VLSI architecture of a shared division/square root operator that operates on the mantissas (23-b in length) of single precision IEEE 754-1985 standard floating point numbers. The division and square root algorithms used in this operator are based on radix 2 signed digit representations and operate in a digit-by-digit manner. These two algorithms perform quotient and root digit selection using the two most-significant digits of the partial remainder. Previously proposed shared division/square-root algorithms required more than two most-significant digits of the partial remainder to be observed during quotient or root digit selection. The fewer the digits observed for quotient or root digit selection, the faster the operation. Because of this, the algorithms proposed in this scheme are faster than previous schemes. This architecture has been laid out using a 1.2 micron 5.0 V CMOS 2 metal process and requires 14.82 mm² area. This design requires 15 ns (@ 5.0 V) to generate a digit of the quotient/root. It requires 29 cycles per divide/square root operation from the time the operands are provided at its pin inputs.

18.

Decimal Division Algorithms: The Issue of Partial Remainders

Amir Kaivani Seok-Bum Ko 《Journal of Signal Processing Systems》2013,73(2):181-188

The efficiency of decimal digit-recurrence division algorithms is totally affected by the number representations of the quotient, the divisor and the partial remainders participating in quotient digit selection (QDS). This paper establishes general rules and conditions for QDS with operands represented in the generalized signed-digit format. As a result of this generalization, a universal convergence condition is introduced which obviates the unnecessary conservatism of previous algorithms and hence paves the way for more correct and efficient decimal division hardware designs. It is also concluded that keeping the partial remainders in minimally redundant symmetric signed-digit representation (with digit-set [−5,6]) and applying into QDS the divisor represented in minimally asymmetric non-redundant signed-digit format (with digit-set [−4,5]) lead to the smallest minimum precision required, of the divisor and the partial remainder, for QDS and thus a faster and simpler division algorithm. Moreover, it is shown that even when non-redundant partial remainders are used (for the sake of lower area cost), minimally asymmetric signed-digit representation brings about more efficiency. The suggested representations are applied to the fastest previous decimal digit-recurrence divider and a 10% speed-up is achieved while keeping the area cost approximately unaltered.

19.

Decimal Square Root: Algorithm and Hardware Implementation

Adel Hosseiny Ghassem Jaberipur 《Circuits, Systems, and Signal Processing》2016,35(12):4195-4219

We propose a new digit recurrence decimal square root (DSR) design and provide its ASIC implementation. The interim square root digits are in [−5, 5]. The proposed architecture generally follows that of a previous radix-10 divider. However, it provides novel solutions to a few DSR-specific challenges. For example, complex error analysis shows that only four (out of sixteen) digits of the partial square root are sufficient to estimate the partial remainders that are required for the more complicated square root digit selection. This design performs about 10% faster and consumes 28% less area than the previously reported ASIC digit recurrence decimal square rooter.

20.

Memristor based N-bits redundant binary adder

《Microelectronics Journal》2015,46(3):207-213

This paper introduces a memristor based N-bits redundant binary adder architecture for canonic signed digit code (CSDC) as a step towards a memristor based multilevel ALU. New possible solutions for multi-level logic designs can be established by utilizing the memristor dynamics as a basis in the circuit realization. The proposed memristor-based redundant binary adder circuit tries to achieve the theoretical advantages of the redundant binary system, and to eliminate the carry (borrow) propagation using signed digit representation. The advantage of carry elimination in the addition process is that it makes the speed independent of the operand length, which speeds up all arithmetic operations. One memristor is sufficient for both the addition process and for storing the final result as a memory cell. The adder operation has been validated via different cases for 1-bit and 3-bits addition using the HP memristor model and PSPICE simulation results.