
1.

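何军, 黄永勤, 朱英. 《计算机科学》, 2013, 40(12): 15-18, 51

Reducing the hardware cost and latency of quadruple-precision floating-point arithmetic is an important open problem. To cut the hardware cost of a quadruple-precision multiply-adder, a new quadruple-precision floating-point fused multiply-adder (QPFMA) is designed and implemented on top of a double-precision floating-point SIMD FMA unit that supports 64 bits × 4. It supports four kinds of floating-point multiply-add operations as well as multiplication, addition/subtraction, and comparison, with an operation latency of 7 cycles. The 113-bit × 113-bit quadruple-precision significand multiplier is decomposed into four 57-bit × 57-bit multipliers so that it can share the 53-bit × 53-bit multipliers of the double-precision SIMD FMA unit, which significantly reduces the hardware cost of the QPFMA. Logic synthesis in a 65 nm process shows that the QPFMA reaches 1.1 GHz and occupies 42.71% of the area of a conventional QPFMA design, roughly the area of a single double-precision floating-point multiply-adder. Compared with existing QPFMA designs at comparable process and frequency, its latency is 3 cycles lower and its gate count is 65.96% lower.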
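The decomposition of the 113-bit × 113-bit significand product into four 57-bit × 57-bit products can be sketched in software. This is only an illustration of the arithmetic identity; the names `split_multiply`, `LIMB_BITS`, and `MANT_BITS` are hypothetical, and the abstract does not say how the paper maps the partial products onto the shared 53-bit × 53-bit multipliers.

```python
LIMB_BITS = 57    # operand width of each sub-multiplier (from the abstract)
MANT_BITS = 113   # quadruple-precision significand width, hidden bit included


def split_multiply(a: int, b: int) -> int:
    """Multiply two 113-bit significands via four 57x57-bit partial products."""
    assert 0 <= a < (1 << MANT_BITS) and 0 <= b < (1 << MANT_BITS)
    mask = (1 << LIMB_BITS) - 1

    # Split each operand into a 57-bit low half and a (at most) 56-bit high half.
    a_lo, a_hi = a & mask, a >> LIMB_BITS
    b_lo, b_hi = b & mask, b >> LIMB_BITS

    # Every product below has operands no wider than 57 bits.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi

    # Recombine the partial products with the appropriate weights.
    return p_ll + ((p_lh + p_hl) << LIMB_BITS) + (p_hh << (2 * LIMB_BITS))


if __name__ == "__main__":
    x = (1 << 113) - 1
    y = (1 << 112) + 12345
    print(split_multiply(x, y) == x * y)
```

In hardware the four partial products would be summed in a compressor tree rather than with full-width additions; the sketch only checks that the split loses no bits.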

2.

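Design and Implementation of a Fast SIMD Floating-Point Multiply-Adder (total citations: 2; self-citations: 0; citations by others: 2)

吴铁彬, 刘衡竹, 杨惠, 张剑锋, 侯申. 《计算机工程与科学》, 2012, 34(1): 69-73

This paper designs and implements a five-stage, fully pipelined SIMD floating-point multiply-adder that supports multiplication and multiply-accumulate (multiply-subtract) operations on one double-precision or two single-precision operands. The RTL code was tested and verified with Modelsim and NC Verilog, and the hardware was synthesized in a 65 nm process with Synopsys Design Compiler, reaching an operating frequency of 714.286 MHz. Compared with the classic low-latency multiply-add structure in the paper's reference [3], under the same synthesis conditions, performance improves by 17.89%, area increases by 6.61%, and power consumption drops by 25.08%.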

3.

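Design and Implementation of a Reconfigurable Floating-Point Mixed/Consecutive Multiply-Adder

《计算机工程》, 2014, (7)

Floating-point operations such as consecutive multiply-add, mixed multiply-add, and three-operand addition are increasingly common in scientific computing. To build a multifunction floating-point unit that supports all three, a reconfigurable floating-point mixed/consecutive multiply-adder is proposed; configuring its control bits selects among several floating-point data operations. The multiply-adder uses an 8-stage pipeline and achieves single-cycle floating-point multiply-accumulate, which substantially raises data-processing throughput; it also supports three-operand addition and the accumulation of two-operand sums. The design was verified by simulation in Modelsim SE 6.6f, and the results show that it can be implemented on a Xilinx Virtex-6 FPGA using 2,631 LUTs at up to 250 MHz, indicating that the design has considerable practical value.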
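The abstract does not define its two multiply-add modes. Assuming they correspond to the usual cascaded operation (round after the multiply, then again after the add) and fused operation (a single rounding of the exact result), the numerical difference can be sketched as follows. The function names are hypothetical, and the fused variant is emulated with exact rational arithmetic rather than a hardware instruction.

```python
from fractions import Fraction


def cascaded_mul_add(a: float, b: float, c: float) -> float:
    """a*b is rounded to double first, then the sum is rounded again."""
    return a * b + c


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # one correctly rounded conversion


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
    # so the cascaded form loses the small term that the fused form keeps.
    print(cascaded_mul_add(a, b, c))
    print(fused_mul_add(a, b, c))
```

A unit that offers both behaviours lets software choose between bit-compatibility with separate multiply and add instructions and the extra accuracy of a single rounding.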

4.

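Design and Implementation of a Split-Path Floating-Point Multiply-Adder

何军, 黄永勤, 朱英. 《计算机科学》, 2013, 40(8): 28-33

A conventional floating-point fused multiply-adder increases the latency of standalone floating-point addition, subtraction, and multiplication. To address this, a split-path floating-point multiply-adder (SPFMA) is first designed and implemented. By separating the multiply and add paths, it keeps the fused multiply-add latency at 6 cycles while cutting the latency of standalone multiplication and addition from 6 cycles to 4. Logic synthesis with a dedicated process cell library then shows that the SPFMA can run above 1.2 GHz with an area of 60,779.44 μm². Finally, running the SPEC CPU2000 floating-point benchmarks on a hardware emulation accelerator platform shows that every floating-point benchmark speeds up, by up to 5.25% and by 1.61% on average, demonstrating that the SPFMA further improves floating-point performance.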

5.

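Design of a Floating-Point Multiply-Add Unit Based on AltiVec Technology

赵明亮, 樊晓桠, 黄小平, 姚涛. 《计算机测量与控制》, 2010, 18(1)

AltiVec is an extension of the PowerPC instruction set architecture that improves PowerPC's vector processing capability, and the floating-point multiply-add unit is a main component of the vector processing unit. This work designs a vector floating-point multiply-add unit based on AltiVec. Building on a basic floating-point multiply-adder, it proposes pre-normalization of denormalized numbers in Java mode. The design uses a semi-parallel structure that saves half the hardware area of a conventional fully parallel structure. At a clock frequency of 266 MHz, an operation completes in 5 cycles in Java mode and in 4 cycles in non-Java mode.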

6.

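An Ultra-Low-Cost VLSI Implementation of the AES Algorithm

赵佳, 曾晓洋, 韩军, 陈俊. 《小型微型计算机系统》, 2007, 28(8): 1512-1515

An ultra-low-cost VLSI implementation of the Advanced Encryption Standard (AES) is proposed. To minimize hardware cost, each round's 128-bit encryption/decryption operation is split into four 32-bit operations and implemented with a two-stage pipeline. Module reuse, an optimized operation order, and in particular a low-cost key expansion structure deliver high performance at very small hardware cost. The design uses the HHNEC 0.25 μm standard CMOS process and occupies only about 12k equivalent gates; at a 100 MHz operating frequency, the throughput for 128-bit encryption reaches 256 Mbps.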
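The reported figures imply a cycle count per block that the abstract does not state. A minimal sketch of that arithmetic, assuming 1 Mbps means 10^6 bits per second and using a hypothetical helper name:

```python
def cycles_per_block(block_bits: int, clock_hz: int, throughput_bps: int) -> float:
    """Clock cycles spent per block, implied by clock rate and throughput."""
    blocks_per_second = throughput_bps / block_bits
    return clock_hz / blocks_per_second


if __name__ == "__main__":
    # Figures from the abstract: 128-bit blocks, 100 MHz, 256 Mbps.
    print(cycles_per_block(128, 100_000_000, 256_000_000))  # 50.0
```

If a 128-bit key (10 rounds) is assumed, four 32-bit operations per round account for 40 of those roughly 50 cycles, leaving the remainder for the initial key addition and pipeline overhead. That split is an inference, not a figure from the paper.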

7.

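Research and Implementation of a 128-Bit High-Precision Floating-Point Multiply-Add Unit (total citations: 2; self-citations: 0; citations by others: 2)

张峰, 黎铁军, 徐炜遐. 《计算机工程与科学》, 2009, 31(2)

High-performance, high-precision floating-point processing has long been a goal of scientific computing. This paper studies and implements a 128-bit floating-point fused multiply-add unit. The multiplication module uses block multiplication and reuses a 57-bit multiplier block, which reduces the data width. A three-input leading-one anticipation technique simplifies pre-encoding, shortens the delay of the prediction circuit, and reduces area. The unit is implemented in Verilog and synthesized with Design Compiler; in a SMIC 0.13 μm process it reaches 202 MHz, with a critical-path delay of 4.93 ns and an area of about 191,000 gates.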

8.

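A Study of the Impact of Floating-Point Multiply-Add Unit Latency on Floating-Point Performance

何军, 田增, 郭勇, 陈诚. 《计算机工程》, 2013, 39(7)

A floating-point fused multiply-add unit increases the latency of standalone floating-point addition, subtraction, and multiplication. To overcome this drawback, the paper studies how floating-point performance changes when the latency of standalone multiplication and addition/subtraction in the multiply-add unit is cut from 6 cycles to 4. Starting from a Chinese-designed processor that supports multiply-add operations, the relevant RTL design code is modified and the SPEC CPU2000 floating-point benchmarks are evaluated on a hardware emulation accelerator platform. The experiments show that this latency optimization improves floating-point performance, by up to 5.25% and by 1.61% on average.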

9.

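Design and Implementation of a High-Performance Quadruple-Precision Floating-Point Multiply-Adder

何军, 黄永勤, 朱英. 《计算机工程》, 2014, (2): 294-299

High-precision, high-performance floating-point units are an important part of high-performance microprocessor design. Based on a study of the conventional double-precision floating-point multiply-add algorithm and the characteristics of the quadruple-precision floating-point format, a high-performance quadruple-precision floating-point fused multiply-adder (QPFMA) is designed and implemented. It supports multiple floating-point operations, has a latency of 7 cycles, and is fully pipelined. A dual-path adder improves the algorithm structure, and the leading-zero prediction and normalization shift logic are optimized to reduce latency and hardware cost. A parameterized design verification method enables efficient correctness verification. Logic synthesis shows that, in a 65 nm process, the QPFMA reaches 1.2 GHz, with 3 fewer cycles of latency and about 11.63% higher frequency than existing QPFMA designs.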

10.

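A Design Method for a Multifunction Array Multiplier

胡正伟, 仲顺安. 《计算机工程》, 2007, 33(22): 23-25

To let multiplications in different number formats share hardware resources, a design method is proposed for a multifunction array multiplier that handles IEEE 754 64-bit double-precision and 32-bit single-precision floating point, 32-bit integers, and 16-bit fixed point. Carry-lookahead addition and pipelining are used to improve multiplier performance. A multiplication unit compatible with the TMS320C6701 multiply instructions is designed, and simulation results confirm the correctness of the design.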

11.

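Design and Implementation of a High-Performance Subword-Parallel Arithmetic Unit

董冕, 吴丹, 饶金理, 黄威, 戴葵, 邹雪城. 《计算机工程》, 2012, 38(16): 249-252

A high-performance subword-parallel arithmetic unit is implemented through hardware sharing. The unit is pipelined and, in one cycle, can perform one 64-bit, two 32-bit, four 16-bit, or eight 8-bit fixed-point operations, or one double-precision or two single-precision floating-point operations. It is designed in Verilog HDL, implemented with a 0.18 μm standard CMOS process library, and evaluated on real multimedia applications on the ESCA system. The experiments show that the unit strikes a good balance between hardware cost and performance.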
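Subword parallelism on a shared 64-bit datapath can be illustrated with a software model of lane-wise wraparound addition. This is a generic SIMD-within-a-register sketch, not the paper's circuit; `lane_masks` and `subword_add` are hypothetical names, and only fixed-point addition is modelled.

```python
MASK64 = (1 << 64) - 1


def lane_masks(lane_bits: int) -> tuple[int, int]:
    """Return (mask of each lane's top bit, mask of all remaining bits)."""
    assert lane_bits in (8, 16, 32, 64)
    high = 0
    for shift in range(0, 64, lane_bits):
        high |= 1 << (shift + lane_bits - 1)
    return high, MASK64 ^ high


def subword_add(x: int, y: int, lane_bits: int) -> int:
    """Add two 64-bit words as independent lanes with wraparound per lane."""
    high, low = lane_masks(lane_bits)
    # Add the low bits of every lane; a carry can reach a lane's top bit
    # but can never cross into the neighbouring lane.
    partial = (x & low) + (y & low)
    # Fold in the top bits without generating a carry out of the lane.
    return (partial ^ ((x ^ y) & high)) & MASK64


if __name__ == "__main__":
    a = 0x01FF_7F80_0010_FFFE
    b = 0x0101_0180_00F0_0103
    print(hex(subword_add(a, b, 8)))
    print(hex(subword_add(a, b, 16)))
```

The same adder logic serves every lane width; only the mask that blocks carries changes, which is the kind of hardware sharing the abstract describes.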

12.

Internet Streaming SIMD Extensions

Thakkar S., Huff T. 《Computer》, 1999, 32(12): 26-34

Because floating-point computation is the heart of 3D geometry, speeding up floating-point computation is vital to overall 3D performance. To produce a visually perceptible difference in graphics applications, Intel's 32-bit processors, based on the IA-32 architecture, required an increase of 1.5 to 2 times the native floating-point performance. One path to better performance involves studying how the system uses data. Today's 3D applications can execute much faster by differentiating between data used repeatedly and streaming data, that is, data used only once and then discarded. The Pentium III's new floating-point extension lets programmers designate data as streaming and provides instructions that handle this data efficiently. The authors designed the Internet Streaming SIMD Extensions (ISSE) to enable a new level of visual computing on the volume PC platform. They discuss their results in terms of boosting the performance of 3D and video applications.

13.

A novel hardware/software partitioning for SIMD-based real-time AVS video decoder

Liwei Chen, Ming Cong, Jing Huang, Ling Li, Hongwei Liu, Cheng Qian. 《Multimedia Tools and Applications》, 2014, 71(3): 1651-1671

Decoding high-quality video in real time is becoming more and more difficult as resolution increases. In this paper, a novel hardware/software (HW/SW) partitioning is proposed that uses powerful SIMD (single instruction, multiple data) instructions for a real-time AVS video decoder. Since most key functions that need large amounts of computation are optimized with SIMD instead of hardware, the workload is balanced between hardware and software, and the performance of the video decoder improves. Generality and programmability are also maintained. The proposed method is implemented on a 32-bit dual-issue RISC processor with a 256-bit vector extension. Experimental results on AVS conformance test sequences show that the video decoder system supports real-time decoding of AVS 1080p video at 30 fps and improves performance by more than 100 times over the original processor without the proposed method. Moreover, this approach could easily be applied to other video decoders, such as H.264 and VC-1.

14.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》, 2013, 39(10): 586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed for high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation, as well as efficient SIMD hardware for the real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). It includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long-vector capabilities built upon existing SIMD hardware, with a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP outperforms conventional SIMD techniques by an average of 39% for multimedia kernels and 12% for applications.

15.

The Intel i860 64-bit processor: a general-purpose CPU with 3D graphics capabilities

Grimes J.D., Kohn L., Bharadhwaj R. 《Computer Graphics and Applications, IEEE》, 1989, 9(4): 85-94

16.

The UCSC Kestrel parallel processor

Di Blas A., Dahle D.M., Diekhans M., Grate L., Hirschberg J., Karplus K., Keller H., Kendrick M., Mesa-Martinez F.J., Pease D., Rice E., Schultz A., Speck D., Hughey R. 《Parallel and Distributed Systems, IEEE Transactions on》, 2005, 16(1): 80-92

The architectural landscape of high-performance computing stretches from superscalar uniprocessors to explicitly parallel systems to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance, and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.

17.

Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system

Bronson E.C., Casavant T.L., Jamieson L.H. 《Parallel and Distributed Systems, IEEE Transactions on》, 1990, 1(2): 195-205

An experimental analysis of the architecture of an SIMD/MIMD parallel processing system is presented. Detailed implementations of parallel fast Fourier transform (FFT) programs were used to examine the performance of the prototype of the PASM (Partitionable SIMD/MIMD) parallel processing system. Detailed execution-time measurements using specialized timing hardware were made for the complete FFT and for components of SIMD, MIMD, and barrier-synchronized MIMD implementations. The component measurements isolated the effects of floating-point arithmetic operations, interconnection network transfer operations, and program control overhead. The measurements allow an accurate extrapolation of the execution time, speedup, and efficiency of the MIMD, SIMD, and barrier-synchronized MIMD programs to a full 1024-processor PASM system. This constitutes one of the first results of this kind, in which controlled experiments on fixed hardware were used to compare these fundamental modes of computing. Overall, the experimental results demonstrate the value of mixed-mode SIMD/MIMD computing and its suitability for computationally intensive algorithms such as the FFT.

18.

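Design of an Injection-Type Infrared Imaging Simulation System

黄勇, 吴根水, 李睿. 《测控技术》, 2012, 31(2): 123-126

To meet the simulation test requirements of infrared imaging systems, a 128 × 128 pixel injection-type infrared imaging simulation system is designed and implemented. The paper details the system's design scheme, discusses the key technologies and implementation of the hardware, and describes the software implementation and the system debugging method. Test results show that the simulated images reach a frame rate of 100 Hz, with a per-pixel grayscale resolution of 16 bits and an accuracy of 14 bits. Practical use shows that the system meets its design specifications.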

19.

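Design and Implementation of a 64-Bit Floating-Point Multiply-Adder (total citations: 2; self-citations: 0; citations by others: 2)

靳战鹏, 白永强, 沈绪榜. 《计算机工程与应用》, 2006, 42(18): 95-98

Multiply-add is a basic operation in many scientific and engineering applications, and floating-point multiply-adders are widely used, especially in graphics accelerators and DSPs. Targeting the PowerPC 603e microprocessor system and using the SMIC 0.25 μm 1P5M CMOS process with a forward full-custom circuit and layout design flow, this paper designs and implements a 64-bit floating-point multiply-adder that conforms to IEEE 754 and combines a modified Booth algorithm, a Wallace tree built from balanced 4-2 compressors, and a carry-lookahead adder.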
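The abstract names a modified Booth algorithm without giving its radix. Assuming the common radix-4 variant, the recoding step that halves the number of partial products can be sketched as below. The function names are hypothetical, and the partial products are summed with ordinary integer addition rather than a Wallace tree of 4-2 compressors.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a `bits`-wide two's-complement multiplier into digits in -2..2."""
    assert bits % 2 == 0
    y = (multiplier & ((1 << bits) - 1)) << 1  # append the implicit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triple = (y >> i) & 0b111              # bits y[i+1], y[i], y[i-1]
        y_prev = triple & 1
        y_cur = (triple >> 1) & 1
        y_next = (triple >> 2) & 1
        digits.append(y_prev + y_cur - 2 * y_next)
    return digits


def booth_multiply(a: int, b: int, bits: int) -> int:
    """Multiply a by the signed `bits`-wide value b using bits/2 partial products."""
    digits = booth_radix4_digits(b, bits)
    # Each digit selects 0, +/-a, or +/-2a, weighted by 4**k.
    return sum((d * a) << (2 * k) for k, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_radix4_digits(11, 8))
    print(booth_multiply(13, -7, 8))
```

For an unsigned significand, the operand would be extended with a zero sign bit before recoding so that it is not interpreted as negative.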