
1.

VLSI Implementation of Double-Precision Floating-Point Multiplier Using Karatsuba Technique

Manish Kumar Jaiswal Ray C. C. Cheung 《Circuits, Systems, and Signal Processing》2013,32(1):15-27

Double-precision floating-point arithmetic, and multiplication in particular, is widely used in scientific and signal processing applications. In general, a double-precision floating-point multiplier requires a large 53×53 mantissa multiplication to produce the final result. This mantissa multiplication limits both the area and the performance of the operation. This paper presents a novel way to reduce this large multiplication, using less multiplication hardware than the traditional method. The multiplication is performed using the Karatsuba technique. The design specifically targets Field Programmable Gate Array (FPGA) platforms and has also been evaluated in an ASIC flow. The proposed module gives excellent performance with efficient use of resources. The design is fully compatible with the IEEE standard precision. The proposed module performs better than the best multipliers reported in the literature.
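As a rough software illustration of the Karatsuba technique named in this abstract (not the authors' hardware design), the Python sketch below splits each operand into two halves and replaces four half-width multiplications with three. The function name `karatsuba`, the 18-bit base-case `cutoff`, and the choice of split point are assumptions made for illustration only.

```python
def karatsuba(x: int, y: int, bits: int = 53, cutoff: int = 18) -> int:
    """Multiply two non-negative integers of at most `bits` bits each."""
    if bits <= cutoff:
        # Base case: stands in for one small multiplier block (assumed width).
        return x * y

    half = bits // 2              # width of the low part
    high_bits = bits - half       # width of the high part
    mask = (1 << half) - 1

    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half

    lo = karatsuba(x_lo, y_lo, half, cutoff)
    hi = karatsuba(x_hi, y_hi, high_bits, cutoff)
    # The sums can be one bit wider than the wider of the two parts.
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi, high_bits + 1, cutoff)
    mid = cross - lo - hi         # equals x_lo*y_hi + x_hi*y_lo

    return (hi << (2 * half)) + (mid << half) + lo


# Two 53-bit significands with the implicit leading 1 set.
a = (1 << 52) | 0x000F_1234_5678_9
b = (1 << 52) | 0x000A_BCDE_F012_3
print(karatsuba(a, b) == a * b)
```

Each level trades one multiplication for a few extra additions and subtractions, which is why the approach can reduce multiplier hardware at the cost of more adder logic.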

2.

A CMOS floating-point vector-arithmetic unit

Timmermann D. Rix B. Hahn H. Hosticka B.J. 《Solid-State Circuits, IEEE Journal of》1994,29(5):634-639

This work describes a floating-point arithmetic unit based on the CORDIC algorithm. The unit computes a full set of high-level arithmetic and elementary functions: multiplication, division, (co)sine, hyperbolic (co)sine, square root, natural logarithm, inverse (hyperbolic) tangent, vector norm, and phase. The chip has been integrated in 1.6 μm double-metal n-well CMOS technology and achieves a normalized peak performance of 220 MFLOPS.
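To give a sense of how CORDIC evaluates sine and cosine, here is a minimal rotation-mode sketch in Python. It is a generic textbook formulation, not the chip's datapath: it uses floating-point numbers for readability, whereas hardware would use shifts, adds, and a small arctangent table. The function name `cordic_sin_cos` and the iteration count are assumptions.

```python
import math


def cordic_sin_cos(theta: float, iterations: int = 32):
    """Rotation-mode CORDIC; returns (cos(theta), sin(theta)) for |theta| <= pi/2."""
    # Elementary rotation angles atan(2^-i); a lookup table in hardware.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector; pre-divide by the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward zero residual angle
        # Multiplying by 2^-i is a right shift in fixed-point hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y


c, s = cordic_sin_cos(0.5)
print(abs(c - math.cos(0.5)) < 1e-8, abs(s - math.sin(0.5)) < 1e-8)
```

The same iteration, run with different modes and coordinate systems, yields the other functions listed in the abstract, which is what makes CORDIC attractive for a unified arithmetic unit.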

3.

2.44-GFLOPS 300-MHz floating-point vector-processing unit for high-performance 3D graphics computing

Ide N. Hirano M. Endo Y. Yoshioka S. Murakami H. Kunimatsu A. Sato T. Kamei T. Okada T. Suzuoki M. 《Solid-State Circuits, IEEE Journal of》2000,35(7):1025-1033

A vector unit for high-performance three-dimensional graphics computing has been developed. We implement four floating-point multiply-accumulate units, which execute multiply-add operations with one throughput; one floating-point divide/square root unit, which executes division and square-root operations with six cycles at 300 MHz; and one vector general-purpose register file, which has 128 bits×32 words. The parallel execution of all units delivers a peak performance of 2.44 GFLOPS at 300 MHz.

4.

A novel IEEE rounding algorithm for high-speed floating-point multipliers

Mustafa 《Integration, the VLSI Journal》2007,40(4):549-560

Modern floating-point multipliers perform rounding in compliance with the IEEE 754 standard. Since rounding is on the critical path, high-speed rounding algorithms are used to increase the performance for floating-point multiplication. To achieve high performance with minimum increase in hardware, existing rounding algorithms generate two consecutive values in parallel, and compute the rounded product using these values. This paper presents a novel IEEE rounding algorithm which generates two nonconsecutive values in parallel to compute the rounded product. Synthesis results for double precision operands show that the proposed algorithm has approximately 24–41% less delay than previous high-speed rounding algorithms presented elsewhere. The verification of the new algorithm is also presented in a simple and straightforward manner.
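The "two consecutive values" scheme that the abstract attributes to existing algorithms can be sketched in software as follows. This is only a behavioral model of round-to-nearest-even on a significand product, written under the assumption that both inputs are normalized `p`-bit significands; it does not reproduce the paper's novel nonconsecutive-value algorithm, and the function name `round_product_rne` is hypothetical.

```python
def round_product_rne(ma: int, mb: int, p: int = 53):
    """Round the product of two normalized p-bit significands to p bits.

    ma and mb must lie in [2^(p-1), 2^p). Returns (significand, exp_adjust),
    where exp_adjust is the amount to add to the sum of the exponents.
    """
    prod = ma * mb                          # 2p-1 or 2p bits wide
    shift = p - 1 if prod < (1 << (2 * p - 1)) else p
    exp_adjust = shift - (p - 1)            # 1 if the product is in [2, 4)

    trunc = prod >> shift
    # The two consecutive candidates, produced in parallel in hardware.
    candidate_lo, candidate_hi = trunc, trunc + 1

    rem = prod & ((1 << shift) - 1)         # discarded bits
    half = 1 << (shift - 1)
    round_up = rem > half or (rem == half and bool(trunc & 1))

    result = candidate_hi if round_up else candidate_lo
    if result == (1 << p):                  # rounding overflowed the significand
        result >>= 1
        exp_adjust += 1
    return result, exp_adjust


# Cross-check against the host's IEEE 754 double multiplication.
a, b = 1.1, 1.3
ma, mb = int(a * 2**52), int(b * 2**52)
print(round_product_rne(ma, mb) == (int(a * b * 2**52), 0))
```

Selecting between precomputed candidates keeps the increment off the critical path, which is the motivation the abstract gives for this family of algorithms.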

5.

An 80-MFLOPS (peak) 64-b microprocessor for parallel computer

Nakano H. Nakajima M. Nakakura Y. Yoshida T. Goi Y. Nakai Y. Segawa R. Kishida T. Kadota H. 《Solid-State Circuits, IEEE Journal of》1992,27(3):365-372

An 80-MFLOPS (peak) 64-b microprocessor that employs superscalar architecture to execute two instructions simultaneously in one 25-ns cycle, including the combination of 64-b floating-point add and multiply instructions, is described. The processor implemented in a 0.8-μm CMOS technology contains 1300 K transistors. The processor also employs a RISC architecture and Harvard-style bus organization. The authors provide an overview of the processor, especially focusing on processor architecture, floating-point hardware, and performance.

6.

RAPID PROTOTYPING - Field programmable gate array based parallel matrix multiplier for 3D affine transformations

Bensaali F. Amira A. 《Vision, Image and Signal Processing, IEE Proceedings -》2006,153(6):739-746

3D graphics performance is increasing faster than that of any other computing application. Almost all PC systems now include 3D graphics accelerators for games, computer-aided design, or visualisation applications. This article investigates the suitability of field programmable gate array devices as accelerators for implementing 3D affine transformations. The proposed solutions, based on large matrix multiplication, have been implemented for large 3D models on the RC1000 Celoxica board-based development platform using Handel-C. Outstanding results have been obtained for the acceleration of 3D transformations using fixed- and floating-point arithmetic.

7.

A highly integrated 40-MIPS (peak) 64-b RISC microprocessor

Miyake J. Maeda T. Nishimichi Y. Katsura J. Taniguchi T. Yamaguchi S. Edamatsu H. Watari S. Takagi Y. Tsuji K. Kuninobu S. Cox S. Duschatko D. MacGregor D. 《Solid-State Circuits, IEEE Journal of》1990,25(5):1199-1206

A 1-million transistor 64-b microprocessor has been fabricated using 0.8-μm double-metal CMOS technology. A 40-MIPS (million instructions per second) and 20-MFLOPS (million floating-point operations per second) peak performance at 40 MHz is realized by a self-clocked register file and two translation lookaside buffers (TLBs) with word-line transition detection circuits. The processor contains an integer unit based on the SPARC (scalable processor architecture) RISC (reduced instruction set computer) architecture, a floating-point unit (FPU) which executes IEEE-754 single- and double-precision floating-point operations, a 6-KB three-way set-associative physical instruction cache, a 2-KB two-way set-associative physical data cache, a memory management unit that has two TLBs, and a bus control unit with an ECC (error-correcting code) circuit.

8.

An Embedded Stream Processor Core Based on Logarithmic Arithmetic for a Low-Power 3-D Graphics SoC

《Solid-State Circuits, IEEE Journal of》2009,44(5):1554-1570

A low-power and high-performance 4-way 32-bit stream processor core is developed for handheld low-power 3-D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By exploiting logarithmic arithmetic and the proposed adaptive number conversion scheme, a 4-way arithmetic unit achieves single-cycle throughput for all these operations except matrix-vector multiplication, which takes 2 cycles per result compared with 4 cycles in the conventional approach. The processor, which features this functional unit and several proposed architectural schemes including embedded register index calculations, functional unit reconfiguration, and operand forwarding in the logarithmic domain, achieves a 19.1% cycle count reduction for the OpenGL transformation and lighting (TnL) operation relative to the latest work.

9.

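Design and Implementation of a High-Speed Floating-Point Adder

唐世庆 (Tang Shiqing) 尹勇生 (Yin Yongsheng) 刘聪 (Liu Cong) 《微电子学与计算机》(Microelectronics & Computer) 2003,20(8):163-166

The floating-point adder is the core arithmetic unit of a coprocessor and the basis for implementing the operations of floating-point instructions; optimizing its design is a key route to improving the speed and accuracy of floating-point computation. This article presents a design method from the perspectives of the floating-point adder algorithm and its circuit implementation. It proposes a carry-chain design that combines dynamic and static logic, as well as a method for trading off area against speed in leading-zero prediction. The combined dynamic and static carry-chain design effectively reduces power consumption, increases speed, and improves performance. The adder has been embedded in a coprocessor design and has been successfully taped out and tested.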

10.

Improving Floating-Point Performance in Less Area: Fractured Floating Point Units (FFPUs)

Neil Hockert Katherine Compton 《Journal of Signal Processing Systems》2012,67(1):31-46

Embedded systems designers often use fixed-point instead of floating-point due to the performance and area overhead of floating-point units. If the range of floating-point representation is required, the system may use a software-based floating-point library on an integer-only processor to save area, at the cost of much lower performance. Instead, we propose a Fractured Floating Point Unit (FFPU), a hybrid solution that uses a set of custom hardware instructions to accelerate software-based floating-point emulation. An FFPU is intended as a compromise between software libraries and full FPUs in terms of both area and performance. We present four potential 32-bit FFPU designs for a Nios II soft processor. We compare their performance and area to the baseline Nios II, as well as a Nios II with a complete FPU. We show that an FFPU can improve various floating-point operations, including improving addition and subtraction performance by 24 to 52 percent over the baseline. This performance comes at a resource cost of only an 11 to 29 percent ALM increase, and no increase in DSP blocks.

11.

A 16-bit LSI minicomputer

《Solid-State Circuits, IEEE Journal of》1976,11(5):696-702

A 16-bit LSI minicomputer, using n-channel MOS technology, has been developed. The instruction set contains 126 instructions including floating-point arithmetic and is fully compatible with commercially available minicomputers such as the TOSBAC-40 and the Interdata 70. An execution speed of 2 μs is obtained for register to register (RR) instructions. All the central processing unit (CPU) functions are implemented on a single board. An external microprogram ROM and short-single address microinstructions are used to realize high-system performance and reduce the chip area and the package pin numbers. Two LSI chips for the system, a single-chip processor, and a bit-sliced bus controller, are fabricated by a new n-channel MOS technology named the gate oxidation method (GOM) which provides a high-packing density, high speed, and a simplified process.

12.

FFT algorithms for prime transform sizes and their implementations on VAX, IBM 3090 VF, and IBM RS/6000

Lu C. Cooley J.W. Tolimieri R. 《Signal Processing, IEEE Transactions on》1993,41(2):638-648

Variants of the Winograd fast Fourier transform (FFT) algorithm for prime transform size that offer options as to operational counts and arithmetic balance are derived. Their implementations on VAX, IBM 3090 VF, and IBM RS/6000 are discussed. For processors that perform floating-point addition, floating-point multiplication, and floating-point multiply-add with the same time delay, variants of the FFT algorithm have been designed such that all floating-point multiplications can be overlapped by using multiply-add. The use of a tensor product formulation, throughout, gives a means for producing variants of algorithms matching computer architectures.

13.

A 450-MHz RISC microprocessor with enhanced instruction set and copper interconnect

Nicoletta C. Alvarez J. Barkin E. Chai-Chin Chao Johnson B.R. Lassandro F.M. Patel P. Reid D. Sanchez H. Seigel J. Snyder M. Sullivan S. Taylor S.A. Minh Vo 《Solid-State Circuits, IEEE Journal of》1999,34(11):1478-1491

This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect.

14.

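Application of an FPGA-Based Floating-Point Algorithm in Image Processing

曾祥萍 (Zeng Xiangping) 杨涛 (Yang Tao) 《光电技术应用》(Electro-Optic Technology Application) 2006,21(1):43-46

Based on the characteristics of digital images, an FPGA-based floating-point computation method is proposed. Exploiting the fact that both the coordinates and the gray values of pixels in a digital image are positive integers, the method implements floating-point operations using fixed-point multiplication, addition and subtraction, and shift operations, all of which are relatively easy to implement on an FPGA. The method overcomes serious shortcomings of conventional floating-point arithmetic: complex structure, long delay, and difficulty in guaranteeing real-time results. The algorithm has been successfully applied in a real-time image derotation system that uses the XC2S200-5PQ208 as its core processor, and it has been simulated with ModelSim SE. Experimental results show that the algorithm is simple in principle, fast, and adjustable in precision, and that it is suitable for real-time image processing.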
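The abstract does not spell out the arithmetic, so the following Python sketch only illustrates the general idea it describes: computing with fractional coefficients by using integer multiplication, addition and subtraction, and shifts on integer pixel coordinates. The rotation example, the 14-bit fraction width `FRAC_BITS`, and the helper names `to_fixed` and `rotate_pixel` are assumptions, not details from the paper.

```python
import math

FRAC_BITS = 14  # assumed number of fractional bits for the coefficients


def to_fixed(value: float) -> int:
    """Quantize a real coefficient to an integer with FRAC_BITS fractional bits."""
    return int(round(value * (1 << FRAC_BITS)))


def rotate_pixel(x: int, y: int, angle_rad: float):
    """Rotate integer pixel coordinates using only integer multiply, add, shift."""
    # In hardware these two constants would come from a precomputed table.
    c = to_fixed(math.cos(angle_rad))
    s = to_fixed(math.sin(angle_rad))

    half = 1 << (FRAC_BITS - 1)                # added before the shift to round
    xr = (x * c - y * s + half) >> FRAC_BITS   # arithmetic shift drops the fraction
    yr = (x * s + y * c + half) >> FRAC_BITS
    return xr, yr


print(rotate_pixel(100, 0, math.pi / 2))
```

Raising or lowering `FRAC_BITS` trades multiplier width against accuracy, which is one way to read the "adjustable precision" mentioned in the abstract.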

15.

A floating-point cell library and a 100-Mflops image signal processor

Fujii H. Hori C. Takada T. Hatanaka N. Demura T. Ootomo G. 《Solid-State Circuits, IEEE Journal of》1992,27(7):1080-1088


16.

A 200-MHz 64-b dual-issue CMOS microprocessor

Dobberpuhl D.W. Witek R.T. Allmon R. Anglin R. Bertucci D. Britton S. Chao L. Conrad R.A. Dever D.E. Gieseke B. Hassoun S.M.N. Hoeppner G.W. Kuchler K. Ladd M. Leary B.M. Madden L. McLellan E.J. Meyer D.R. Montanaro J. Priore D.A. Rajagopalan V. Samudrala S. Santhanam S. 《Solid-State Circuits, IEEE Journal of》1992,27(11):1555-1567

A 400-MIPS/200-MFLOPS (peak) custom 64-b VLSI CPU is described. The chip is fabricated in a 0.75-μm CMOS technology utilizing three levels of metalization and optimized for 3.3-V operation. The die size is 16.8 mm×13.9 mm and contains 1.68 M transistors. The chip includes separate 8-kbyte instruction and data caches and a fully pipelined floating-point unit (FPU) that can handle both IEEE and VAX standard floating-point data types. It is designed to execute two instructions per cycle among scoreboarded integer, floating-point, address, and branch execution units. Power dissipation is 30 W at 200-MHz operation.

17.

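Implementation of Custom Floating-Point Instructions for NIOS  Cited by: 1 (self-citations: 1, citations by others: 0)

陈鹏 (Chen Peng) 蔡雪梅 (Cai Xuemei) 《现代电子技术》(Modern Electronics Technique) 2011,34(10):166-168

To improve the floating-point computation efficiency of NIOS systems, functional modules for single-precision floating-point addition, subtraction, and multiplication were implemented in Verilog, and their function was verified through waveforms. Following the Nios II specification for custom instructions, this functionality was added to SOPC Builder, extending the instruction set with new hardware-based floating-point instructions that can be used in the NIOS software environment. A comparison of results and execution time between the Nios II's own software floating-point computation and the new hardware instructions confirms the advantage of the hardware instructions and provides a more efficient option for floating-point arithmetic on NIOS.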

18.

Efficient weighted least-squares algorithm for the design of FIR filters

Jou Y.-D. Hsieh C.-H. Kou C.-M. 《Vision, Image and Signal Processing, IEE Proceedings -》1997,144(4):244-248

The weighted least-squares (WLS) technique has been widely used for the design of digital FIR filters. In the conventional WLS, the filter coefficients are obtained by performing a matrix inverse operation, which needs computation of O(N³). The authors present a new WLS algorithm that introduces an extra frequency response including implicitly the weight function. In the new algorithm, the filter coefficients can be solved just by a matrix vector multiplication. It reduces the computational complexity from O(N³) to O(N²).

19.

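Implementation of a High-Quality Vocoder on the TMS320VC33  Cited by: 1 (self-citations: 0, citations by others: 1)

吴平 (Wu Ping) 赵铭 (Zhao Ming) 崔慧娟 (Cui Huijuan) 唐昆 (Tang Kun) 《电声技术》(Audio Engineering) 2003,(10):40-43

This paper presents an implementation of the SELP high-quality vocoder on the TMS320VC33 DSP. Through partial conversion to fixed-point arithmetic, assembly-instruction optimization, and optimization of mathematical operations, the algorithm's storage requirement fits within the chip's on-chip memory, and its computational complexity uses only 1/4 of the chip's processing capacity, meeting the real-time requirement.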

20.

A fully pipelined single-precision floating-point unit in the synergistic processor element of a CELL processor

Hwa-Joon Oh Mueller S.M. Jacobi C. Tran K.D. Cottier S.R. Michael B.W. Nishikawa H. Totsuka Y. Namatame T. Yano N. Machida T. Dhong S.H. 《Solid-State Circuits, IEEE Journal of》2006,41(4):759-771

The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two different latencies, six-cycle and seven-cycle, with 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply-add operations. Since exact rounding, exceptions, and de-norm number handling are not important to multimedia applications, IEEE correctness on the single-precision floating-point numbers is sacrificed for performance and simple design. It employs fine-grained clock gating for power saving. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operations have been observed up to 5.6 GHz with 1.4 V and 56°C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are codesigned to meet the performance, power, and area goals.