Similar literature
1.
A CMOS pipelined floating-point processing unit (FPU) for superscalar processors is described. It is fabricated using a 0.5 μm CMOS triple-metal-layer technology on a 61 mm² die. The FPU has two execution modes to meet the needs of precise scientific computation and real-time applications. It can start two FPU operations in each cycle, which gives a peak performance of 160 MFLOPS in double or single precision with an 80 MHz clock. Furthermore, the original computation mode, twin single-precision computation, doubles the peak performance and delivers 320 MFLOPS in single precision. Its full bypass reduces the latency of operations, including load and store, and achieves an effective throughput even in nonvectorizable computations. Out-of-order completion is provided by using a new exception prediction method and a pipeline stall technique.
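The peak figures quoted in this abstract follow from simple arithmetic, sketched below. The function name `peak_mflops` and the parameter `results_per_op` are hypothetical names introduced only for this illustration; the sketch assumes each started operation counts as one floating-point operation, or two in the twin single-precision mode.

```python
def peak_mflops(clock_mhz: float, ops_per_cycle: int, results_per_op: int = 1) -> float:
    """Peak rate in MFLOPS = clock (MHz) x operations started per cycle x results per operation."""
    return clock_mhz * ops_per_cycle * results_per_op


# Normal mode: two operations started per cycle at 80 MHz.
normal = peak_mflops(80, 2)

# Twin single-precision mode: each operation is assumed to carry two single-precision results.
twin_single = peak_mflops(80, 2, results_per_op=2)

print(normal, twin_single)
```

These are peak numbers only; sustained throughput depends on how often two operations can actually be issued per cycle.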

2.
A 135K transistor, uniformly pipelined 50-MHz CMOS 64-bit floating-point arithmetic processor chip is described. The execution unit is capable of sustaining pipelined performance of one 32-bit or 64-bit result every 20 ns for all operations except double-precision multiply (40 ns) and divide. The chip employs an exponent difference prediction scheme and a unified leading-one and sticky-bit computation logic for the addition and subtraction operations. A hardware multiplier using a radix-8 modified Booth algorithm and a divider using a radix-2 SRT algorithm are employed.

3.
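This paper proposes a new mixed-radix reconfigurable FFT processor, composed of a new reconfigurable butterfly unit that supports radix-2/3 FFT and a multi-path parallel conflict-free memory, which together provide multi-path data parallelism and continuous operation throughout the FFT. The design reaches a maximum frequency of 1.06 GHz in the TSMC 28 nm process, and a hardware test system for the mixed-radix FFT processor was built on a Xilinx XC7V2000T FPGA. The FPGA hardware test results show that the design supports radix-2, radix-3, and mixed radix-2/3 FFT, that its execution speed reaches the theoretical cycle count for the given number of butterfly multipliers, and that for single-precision floating-point numbers the processor provides a result accuracy of 10⁻⁵.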
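As a software illustration of what a radix-2/3 mixed-radix FFT computes, the following is a minimal recursive decimation-in-time sketch for lengths of the form 2^a · 3^b. It is a textbook formulation, not the processor's butterfly or memory architecture, and the function name `mixed_radix_fft` is an assumption made for this example.

```python
import cmath


def mixed_radix_fft(x):
    """Recursive decimation-in-time FFT for lengths of the form 2**a * 3**b."""
    n = len(x)
    if n == 1:
        return list(x)
    # Pick the radix for this stage: 2 if it divides n, otherwise 3.
    for r in (2, 3):
        if n % r == 0:
            break
    else:
        raise ValueError("length must be of the form 2**a * 3**b")
    m = n // r
    # Transform the r interleaved subsequences x[j], x[j + r], x[j + 2r], ...
    subs = [mixed_radix_fft(x[j::r]) for j in range(r)]
    out = [0j] * n
    for k in range(n):
        # Radix-r butterfly: combine sub-transforms with twiddle factors W_n^(j*k).
        out[k] = sum(
            cmath.exp(-2j * cmath.pi * j * k / n) * subs[j][k % m]
            for j in range(r)
        )
    return out
```

For example, a 12-point input is split by one radix-2 stage into two 6-point transforms, each of which is split again by radix 2 and then radix 3.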

4.
《Microelectronics Journal》2015,46(7):637-655
This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demanding increased performance from hardware. VVSHP merges VLIW and vector processing techniques into a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within VLIW and vector instructions on unified parallel execution datapaths. Another key point is to reduce the complexity of VVSHP by designing a two-part register file: (1) a shared scalar-vector part of 64×32-bit registers (64 scalar or 16×4 vector registers) with eight read ports and four write ports for storing scalar/vector data, and (2) a vector part of 48 vector registers with two read ports and one write port, each storing 4×32-bit vector data. Moreover, processing vector data with lengths varying from 1 to 256 is a key point for reducing loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle, processing a set of operands in parallel and producing up to four results to be written back into the VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The proposed VVSHP processor is implemented in VHDL targeting the Xilinx Virtex-5 FPGA, and its performance is evaluated.

5.
A general-purpose programmable digital signal processor (DSP) has been implemented in 1.5-μm (L_eff) NMOS technology using full-custom circuit design for high performance. The DSP has a 32-bit instruction set, 32-bit data path, and full-hardware 32-bit floating-point arithmetic. The architecture is described section by section, and an overview of the instruction set is presented. The extensive design verification process applied to the DSP is also described.

6.
A 1-million transistor 64-b microprocessor has been fabricated using 0.8-μm double-metal CMOS technology. A 40-MIPS (million instructions per second) and 20-MFLOPS (million floating-point operations per second) peak performance at 40 MHz is realized by a self-clocked register file and two translation lookaside buffers (TLBs) with word-line transition detection circuits. The processor contains an integer unit based on the SPARC (scalable processor architecture) RISC (reduced instruction set computer) architecture, a floating-point unit (FPU) which executes IEEE-754 single- and double-precision floating-point operations, a 6-KB three-way set-associative physical instruction cache, a 2-KB two-way set-associative physical data cache, a memory management unit that has two TLBs, and a bus control unit with an ECC (error-correcting code) circuit.

7.
Embedded systems designers often use fixed-point instead of floating-point due to the performance and area overhead of floating-point units. If the range of floating-point representation is required, the system may use a software-based floating-point library on an integer-only processor to save area, at the cost of much lower performance. Instead, we propose a Fractured Floating Point Unit (FFPU), a hybrid solution that uses a set of custom hardware instructions to accelerate software-based floating-point emulation. An FFPU is intended as a compromise between software libraries and full FPUs in terms of both area and performance. We present four potential 32-bit FFPU designs for a Nios II soft processor. We compare their performance and area to the baseline Nios II, as well as a Nios II with a complete FPU. We show that an FFPU can improve various floating-point operations, including improving addition and subtraction performance by 24 to 52 percent over the baseline. This performance comes at a resource cost of only an 11 to 29 percent ALM increase, and no increase in DSP blocks.

8.
This paper presents multi-functional double-precision and quadruple-precision floating-point multiply-add fused (FPMAF) designs. The double-precision FPMAF design can execute a double-precision floating-point multiply-add, or two single-precision floating-point multiplications, or a single-precision floating-point dot product. The quadruple-precision FPMAF can perform similar operations with quadruple-, double-, and single-precision operands. These architectures can perform a dot-product operation at least twice as fast as a basic FPMAF design. The presented multi-functional designs are compared with basic double-precision and quadruple-precision FPMAF designs by ASIC synthesis. The synthesis results show that the proposed double-precision implementation has 8% more area than a standard double-precision FPMAF implementation, and the proposed quadruple-precision design has 12.5% more area than a standard quadruple-precision FPMAF. Both of the proposed designs have one more pipeline stage than the basic designs.

9.
The first two members in a family of 64-bit superscalar microprocessors are presented. The 130-nm processor, which was introduced first, offers 5-way instruction dispatch, support for 4-way integer and floating-point single-instruction multiple-data (SIMD) operations, a 512-kB second level (L2) cache, and a high-speed external bus. The 90-nm processor is a technology remap of the 130-nm design. It retains the features of the 130-nm processor and adds others, including a new power management facility. The architecture, device characteristics, power management, and thermal details of these two processors are described. In addition, the dataflow layout, aspects of the circuit design, clocking, and timing are discussed.

10.
This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is the core fusion of two processors: two independent cores can dynamically morph into a larger processing unit, or they can be used as distinct processing elements, to achieve both high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for executing instruction streams of lower data width, i.e., each core can perform 16-bit operations individually. Performance is improved in this mode by the parallel execution of instructions in both cores, at the cost of area. In mode 2, both processing cores are coupled and behave like a single processing unit of higher data width, i.e., one that can perform 32-bit operations. Additional core-to-core communication is needed to realise this mode. The mode can be switched dynamically; therefore, this processor can provide multiple functions with a single design. Design and verification of the processor have been done successfully using Verilog on the Xilinx 14.1 platform. The processor is verified in both simulation and synthesis with the help of test programs. The design is aimed at implementation on a Xilinx Spartan 3E XC3S500E FPGA.

11.
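High-throughput, flexibly reconfigurable floating-point fast Fourier transform (FFT) processors can meet the needs of applications such as real-time imaging in advanced radar and high-precision scientific computing. Compared with fixed-point FFT, floating-point arithmetic is more complex, which makes the conflict between the throughput of a floating-point FFT and its implementation area and power consumption especially acute. To reduce the computational complexity, large FFTs are first decomposed into several cascaded small-point radix-2^k sub-stages, and optimized mixed-radix algorithms are proposed for 128/256/512/1024/2048-point FFTs. In addition, building on a proposed new fused add-subtract and dot-product arithmetic unit that supports both a single-channel single-precision mode and a dual-channel half-precision mode, a high-throughput dual-mode floating-point variable-length FFT processor architecture is proposed for the first time, and it is designed and implemented in a 28 nm standard CMOS process. Experimental results show that the throughput and the average output signal-to-quantization-noise ratio are 3.478 GSample/s and 135 dB in single-channel single-precision mode, and 6.957 GSample/s and 60 dB in dual-channel half-precision mode. The normalized throughput-to-area ratio is improved by about 12 times compared with other existing floating-point FFT implementations.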

12.
This work developed a modified direct form based on the radix-4 Booth algorithm to realize a finite impulse response (FIR) architecture with programmable dynamic ranges of input data and filter coefficients. This architecture comprises a preprocessing unit, data latches, configurable connection units, double Booth decoders, coefficient registers, a path control unit, and a postprocessing unit. The programmable dynamic range of the input data can be any positive even number, set using the configurable connection units, and that of the filter coefficients can be any multiple of the word length of the coefficient registers, set using the path control unit. In particular, the proposed architecture employs only data-path controls to accomplish programmable operations, without changing the word lengths or components of the data latches and filter taps. A practical 8-bit and 16-bit FIR processor has also been implemented using the TSMC 5 V 0.6 μm CMOS technology. It is suitable for asymmetric, symmetric, and anti-symmetric filters at 64, 63, 32, 31, and 16 taps, and its functional units have been explored thoroughly for optimization. The proposed processor has throughput rates of 50 M and 25 M samples/s for 8-bit and 16-bit input data, respectively, across various filter applications.
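The radix-4 Booth algorithm named in this abstract can be illustrated with a short software sketch. This is the textbook recoding only, not the paper's FIR datapath; the function names `booth_radix4_digits` and `booth_multiply` are assumptions made for this example, and the even word length is required by the recoding itself.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a signed, bits-wide multiplier into radix-4 Booth digits in {-2, ..., 2}."""
    if bits <= 0 or bits % 2:
        raise ValueError("bits must be a positive even number")
    y &= (1 << bits) - 1  # view y as a bits-wide two's-complement word
    digits = []
    prev = 0  # implicit bit y[-1] = 0 to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        # Digit for the overlapping triple (y[i+1], y[i], y[i-1]).
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by a signed, bits-wide y by summing the Booth partial products."""
    return sum(d * x * 4**k for k, d in enumerate(booth_radix4_digits(y, bits)))
```

An 8-bit multiplier needs only four partial products instead of eight, each of which is 0, ±x, or ±2x, so hardware can form them with shifts and negation alone.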

13.
A low-power and high-performance 4-way 32-bit stream processor core is developed for handheld low-power 3-D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By exploiting logarithmic arithmetic and the proposed adaptive number conversion scheme, a 4-way arithmetic unit achieves single-cycle throughput for all these operations except matrix-vector multiplication, which takes 2 cycles per result compared with 4 cycles in the conventional approach. The processor, which features this functional unit and several proposed architectural schemes including embedded register index calculations, functional unit reconfiguration, and operand forwarding in the logarithmic domain, achieves a 19.1% cycle count reduction for the OpenGL transformation and lighting (TnL) operation compared with the latest work.

14.
A 36 mm² graphics processor with a fixed-point programmable vertex shader is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics applications. The graphics processor contains an ARM-10 compatible 32-bit RISC processor, a 128-bit programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader, a low-power rendering engine, and a programmable frequency synthesizer (PFS). Unlike conventional graphics hardware, the proposed graphics processor implements an ARM-10 co-processor architecture with dual operations, so that user-programmable vertex shading is possible for advanced graphics algorithms and various streaming multimedia processing in mobile applications. The circuits and architecture of the graphics processor are optimized for fixed-point operations and achieve low power consumption with the help of instruction-level power management of the vertex shader and pixel-level clock gating of the rendering engine. The PFS, with a fully balanced voltage-controlled oscillator (VCO), controls the clock frequency from 8 MHz to 271 MHz continuously and adaptively for low-power modes by software. The chip shows 50 Mvertices/s and 200 Mtexels/s peak graphics performance, dissipating 155 mW in a 0.18-μm 6-metal standard CMOS logic process.

15.
A 32-b single-chip processor has been developed that is user object-code compatible with members of the 68000 processor family. The 14.4-mm × 15.5-mm device contains over 1.2 million transistors and is fabricated with a double-layer-metal CMOS process. The processor integrates three major functional units: an integer processor, a floating-point processor, and a Harvard-style memory unit. Each major unit is described, and the implementation techniques that were employed and selected circuit issues that were confronted in the design are discussed.

16.
This paper demonstrates how rounding compliant with the IEEE 754 floating-point standard can be merged with carry-propagate addition in floating-point unit (FPU) designs by using a novel adaptation of the prefix adder. The paper considers add/subtract, multiply, and SRT divide operations and demonstrates that in every case a generic rounding architecture based on a prefix adder with a small amount of additional logic is sufficient to cover all the rounding modes. Critical path analysis shows that the proposed architecture is compatible with contemporary pipelined FPU design practice, while using significantly less logic.

17.
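The design of an asynchronous ACS (add-compare-select) unit for Viterbi decoders is presented. It uses asynchronous handshake signals in place of the global clock of a synchronous circuit. Circuits for an asynchronous add unit, an asynchronous compare unit, and an asynchronous select unit are given for the asynchronous structure. An asynchronous 4-bit ACS was designed using a full-custom design methodology and verified by fabrication in a 0.6 μm CMOS process. Measurements show that the chip consumes 75.5 mW at a supply voltage of 5 V and an operating frequency of 20 MHz. Because of the asynchronous control, the chip consumes no dynamic power while idle in the "sleep" state. The chip's average response time is 19.18 ns, only 82% of the worst-case response time of 23.37 ns. A comparison with simulated power and performance results for a synchronous 4-bit ACS in the same process shows that the asynchronous ACS has the advantage.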
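For readers unfamiliar with the add-compare-select step of a Viterbi decoder, here is a minimal functional sketch of a single ACS operation. It models only the arithmetic, not the asynchronous handshaking described in the abstract. The function name `acs`, the saturating addition, and the tie-breaking rule are assumptions made for this illustration.

```python
def acs(pm0: int, bm0: int, pm1: int, bm1: int, width: int = 4) -> tuple[int, int]:
    """Add-compare-select for one trellis state.

    pm0, pm1: path metrics of the two predecessor states.
    bm0, bm1: branch metrics of the two incoming transitions.
    Returns (survivor path metric, decision bit).
    """
    limit = (1 << width) - 1
    # Add: extend each candidate path, saturating at the word width (an assumption).
    cand0 = min(pm0 + bm0, limit)
    cand1 = min(pm1 + bm1, limit)
    # Compare and select: keep the smaller metric; ties go to path 0 (an assumption).
    if cand1 < cand0:
        return cand1, 1
    return cand0, 0
```

For instance, `acs(3, 2, 1, 1)` compares candidates 5 and 2 and keeps the second path. Real decoders often use modular metric normalization rather than saturation.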

18.
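To meet the needs of audio solutions based on embedded processors, a high-precision, multi-function fixed-point operation unit (FPU) for embedded processors is proposed. The FPU consists of three parts: shift, rounding, and saturation. Implementation and verification show that the FPU can significantly increase the speed of fixed-point operations on embedded processors.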
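The three stages named in this abstract (shift, round, saturate) can be sketched in software as follows. The abstract does not state the rounding mode or word widths, so the round-half-up behaviour, the 16-bit default, and the function name `shift_round_saturate` are all assumptions made for this illustration.

```python
def shift_round_saturate(value: int, shift: int, out_bits: int = 16) -> int:
    """Right-shift with rounding, then clamp to a signed out_bits-wide range."""
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if shift > 0:
        # Shift + round: add half an output LSB, then arithmetic shift (round half up).
        value = (value + (1 << (shift - 1))) >> shift
    # Saturate: clamp instead of wrapping around on overflow.
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, value))


# Typical use: scale a Q15 x Q15 product (a Q30 value) back to Q15.
half = 0x4000                    # 0.5 in Q15
quarter = shift_round_saturate(half * half, 15)
print(hex(quarter))              # 0.25 in Q15
```

Saturation matters in audio because wrap-around on overflow produces a loud full-scale click, whereas clamping only clips the peak.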

19.
A novel processor with a micro-pipelined architecture is proposed for latch-type Josephson logic devices. The processor is segmented into several operating stages activated by a multi-phase power system. Independent register groups are allocated to each stage in order to support pipeline processing of several instruction streams. This architecture makes it possible to build a processor with a fine pipeline pitch that is capable of MIMD processing. A 12-bit micro-pipelined Josephson processor containing an ALU, a multiplier, and 16 registers is described. Driven by a 3-phase AC power system, it is able to process 4 instruction streams simultaneously. A pipeline pitch of 3.3 GHz is expected using conventional Josephson device technology. A 4-bit processor design for 12-bit data length is also discussed.

20.
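Design of a three-operand leading-one prediction algorithm for floating-point multiply-add units
A three-operand leading-one prediction (LOP) algorithm for efficient floating-point multiply-add units is proposed. An efficient floating-point multiply-add unit requires an LOP circuit that operates on three operands. Conventional LOP algorithms cannot handle three operands directly, and handling them indirectly lengthens the critical path and increases circuit area. The three-operand LOP algorithm addresses this limitation of conventional LOP algorithms: it shortens the delay and reduces the area of the LOP circuit, and thereby shortens the delay of the whole multiply-add unit. Taking the 106-bit LOP circuit in the floating-point multiply-add unit of the Godson-2 (龙芯2号) general-purpose CPU as an example, the circuit was implemented with both the conventional LOP algorithm and the three-operand LOP algorithm. Experimental results show that, relative to the conventional algorithm, the three-operand LOP algorithm reduces delay by about 16.67% and total area by about 19.63%.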
