1.

He Tingting, Peng Yuanxi, Lei Yuanwu 《计算机应用》2015,35(7):1854-1857

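To address the complexity and long latency of double-precision floating-point division, a method is proposed for designing a high-performance double-precision floating-point divider that supports the IEEE-754 standard, based on the Goldschmidt algorithm. First, the division procedure of the Goldschmidt algorithm and the error produced by its iterations are analyzed; then, a method for controlling the error is proposed; next, an area-efficient double lookup table method is used to determine the initial value of the iteration, and the iteration unit uses a parallel multiplier structure to speed up iteration; finally, the pipeline stages are partitioned appropriately and the iteration process is controlled so that floating-point divisions can be pipelined, further raising the divider's operating speed. Experimental results show that, in a 40 nm process, the double-precision floating-point divider with a pipelined structure and a 14-bit initial iteration value has a synthesized cell area of 84902.2618 μm² and runs at up to 2.2 GHz; compared with a pipelined structure using an 8-bit initial iteration value, speed improves by 32.73% while area increases by 5.05%. The latency of a single double-precision floating-point division is 12 clock cycles, and under pipelined execution the average latency per division is 3 clock cycles. Compared with double-precision floating-point dividers based on the SRT algorithm in other processors, data throughput improves by a factor of 3 to 7; compared with double-precision floating-point dividers based on the Goldschmidt algorithm in other processors, it improves by a factor of 2 to 3.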
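As a rough illustration of the iteration this abstract refers to, the sketch below runs Goldschmidt division in exact rational arithmetic. The function name `goldschmidt_divide`, the truncated-reciprocal seed (standing in for the paper's double lookup table), and the default iteration count are assumptions made for illustration, not the authors' design.

```python
from fractions import Fraction


def goldschmidt_divide(n, d, seed_bits=14, iterations=3):
    """Approximate n / d for a divisor normalized to [1, 2).

    Hypothetical sketch: the seed is 1/d truncated to `seed_bits`
    fractional bits, which imitates an initial-value table lookup.
    """
    n, d = Fraction(n), Fraction(d)
    if not 1 <= d < 2:
        raise ValueError("divisor must be normalized to [1, 2)")

    # Seed = floor(2^seed_bits / d) / 2^seed_bits (assumed stand-in for a table).
    scale = 1 << seed_bits
    seed = Fraction((scale * d.denominator) // d.numerator, scale)

    # Scale both operands so the denominator starts close to 1.
    num, den = n * seed, d * seed
    for _ in range(iterations):
        factor = 2 - den   # correction factor F = 2 - D
        num *= factor      # N <- N * F
        den *= factor      # D <- D * F, so D converges to 1
    return num             # N converges to n / d


if __name__ == "__main__":
    q = goldschmidt_divide(Fraction(3, 2), Fraction(5, 4))
    print(float(q))
```

If the seed leaves the denominator at 1 - e, each pass squares e, so the number of correct bits roughly doubles per iteration. That is why a more accurate initial value needs fewer multiplier passes to reach a given precision.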

2.

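Design and implementation of a SIMD floating-point divider based on the SRT-8 algorithm

Deng Ziye, Chen Shuming, Peng Yuanxi, Lei Yuanwu 《计算机工程与科学》2014,36(5):797-803

Division is one of the commonly used basic operations in applications such as scientific computing, digital signal processing, communications, and image processing. Based on the SRT-8 division algorithm, a SIMD-structured IEEE 754 floating-point divider is designed that can perform either one double-precision floating-point division or two parallel single-precision floating-point divisions on the same hardware platform. By optimizing the SRT-8 iterative division structure, parallel processing of quotient selection and remainder addition is proposed, and quotient-digit storage is used to reduce the computation delay of the iterative division and raise the frequency. A reuse strategy is also adopted to reduce hardware resource overhead and save area. Experiments show that, in a 40 nm process, the design has a synthesized cell area of 18601.9681 μm² and runs at up to 2.5 GHz, with the critical delay reduced by 23.81% relative to a conventional SRT-8 implementation.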

3.

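Circuit design of a floating-point square root unit. Total citations: 2, self-citations: 0, citations by others: 2

Xia Hong, Li Xiaoying, Wang Gongben 《计算机工程与应用》2001,37(11):39-41,87

This article proposes a circuit design for a floating-point square root unit based on the digit-by-digit recurrence square root algorithm, with one square-root step per four bits ("四位一开方"), which reduces the gate depth of the loop iteration that limits the cycle time to 14 levels. Taking the delay of 14 gate levels as the cycle time, computing the square root of an IEEE single- or double-precision floating-point number takes 15 and 29 cycles, respectively. The article also describes the two main classes of square root algorithms currently in use, the digit-by-digit recurrence algorithm and the Newton-Raphson iterative algorithm, including topics such as redundant number representation.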
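For orientation, here is a minimal sketch of the digit-by-digit recurrence idea on integers. It is the basic variant that resolves one root bit per step from two radicand bits, not the four-bits-per-step circuit of the article; the function name `digit_by_digit_sqrt` is hypothetical.

```python
def digit_by_digit_sqrt(x):
    """Return (root, remainder) with root*root + remainder == x.

    Basic restoring recurrence: each step consumes two bits of the
    radicand and decides one bit of the root.
    """
    if x < 0:
        raise ValueError("x must be non-negative")

    bits = x.bit_length()
    if bits % 2:
        bits += 1  # process the radicand in aligned 2-bit groups

    root = 0
    remainder = 0
    for shift in range(bits - 2, -1, -2):
        # Bring down the next two radicand bits.
        remainder = (remainder << 2) | ((x >> shift) & 0b11)
        # Trial subtrahend 4*root + 1 tests whether the next root bit is 1.
        trial = (root << 2) | 1
        root <<= 1
        if remainder >= trial:
            remainder -= trial
            root |= 1
    return root, remainder


if __name__ == "__main__":
    print(digit_by_digit_sqrt(10))  # (3, 1) because 3*3 + 1 == 10
```

Each step needs only a shift, a compare, and a subtract, which is what makes this family of algorithms attractive for a short hardware critical path.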

4.

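Design and implementation of a fast SIMD floating-point multiply-add unit. Total citations: 2, self-citations: 0, citations by others: 2

Wu Tiebin, Liu Hengzhu, Yang Hui, Zhang Jianfeng, Hou Shen 《计算机工程与科学》2012,34(1):69-73

This paper designs and implements a five-stage fully pipelined SIMD floating-point multiply-add unit that supports double-precision and dual single-precision floating-point multiplication and multiply-accumulate (multiply-subtract) operations. The RTL code was tested and verified with Modelsim and NC Verilog, and the hardware implementation was synthesized with Synopsys Design Compiler in a 65 nm process, reaching an operating frequency of 714.286 MHz. Results show that, compared with the classic low-latency multiply-add structure in reference [3] and under the same synthesis conditions, performance improves by 17.89%, area increases by 6.61%, and power consumption decreases by 25.08%.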

5.

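Research and design of an FPU based on the RISC-V floating-point instruction set

Pan Shupeng, Liu Youyao, Jiao Jiye, Li Zhao 《计算机工程与应用》2021,57(3):80-86

To address the problems that software floating-point arithmetic is slow, cannot meet the real-time requirements of embedded processors, and supports only a limited set of operations, a floating-point processor based on the RISC-V instruction set is proposed. It can perform addition, subtraction, multiplication, division, square root, multiply-accumulate, and comparison operations, and it fully complies with the IEEE 754-2008 standard. The floating-point processor was functionally verified in the VCS simulation environment, and all modules met the correctness requirements. The processor was integrated with the open-source processor core 蜂鸟E203 (Hummingbird E203), logic synthesis was completed with the SMIC 0.18 process library, and the design was tested on an FPGA. Results show that the floating-point processor uses only 24 200 logic gates and reaches a throughput of 150 MFLOPS; compared with designs in the published literature, hardware area is reduced by 7% and 1.5%, respectively. The synthesized operating frequency reaches 100 MHz.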

6.

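Implementation of a single- and double-precision floating-point divider. Total citations: 1, self-citations: 1, citations by others: 0

Wang Chenxu, Zhu Shilin, Wang Xinsheng 《微处理机》2009,30(5):20-23

Based on a study of division algorithms, a three-stage pipeline is adopted and the redundant region of SRT is carefully selected, which simplifies the hardware design without reducing arithmetic precision. Single- and double-precision floating-point division modules were implemented in a hardware description language (Verilog) and verified with random test vectors; compared with a reference machine, the error does not exceed 2^(-64). If the design is implemented with the SMIC 0.18 μm CMOS process library, the division unit can operate at about 455 MHz while occupying a chip area of 168173 μm².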

7.

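Design of a high-speed deeply pipelined floating-point addition unit

《微型机与应用》2015,(20):15-17

In the X87 execution environment, a parallel deep-pipeline optimization algorithm based on the Two-Path algorithm is used to design a high-speed floating-point adder that conforms to the IEEE-754 standard, handles single-precision, double-precision, extended double-precision, and integer data, and has a controllable rounding mode. The adder uses a parallel deeply pipelined design. Verification shows that it meets the functional requirements, and synthesis with the TSMC 65 nm process library gives an operating frequency of up to 900 MHz.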

8.

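Hardware implementation of adaptive modulation and coding based on an OFDM baseband

Zhao Mingjia, Huang Fan 《计算机与数字工程》2011,39(4):170-173

Because little literature, in China or abroad, covers hardware implementation of adaptive modulation and coding in an OFDM baseband, a hardware implementation scheme for adaptive modulation and coding based on an OFDM baseband is proposed, using an improved simple grouped-bit algorithm (简单分组比特算法). The designed adaptive modulation and coding system was synthesized at 20 MHz in SMIC's 0.18 μm CMOS process. Simulation results show that the scheme has small area and low power consumption, which makes it particularly suitable for use in an OFDM baseband.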

9.

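Design and implementation of a reconfigurable JH algorithm based on FPGA

Zhou Quan, Wang Yi, Li Renfa 《计算机工程》2012,38(11):208-210

To address the low throughput of existing hardware implementations of the reconfigurable JH algorithm, the S-box is optimized with a lookup-table method so that the improved JH algorithm is fast and compact when implemented on a field-programmable gate array, and a reconfigurable scheme is proposed on this basis. Experimental results show that the scheme reaches a maximum clock frequency of 322.81 MHz and occupies 1 405 slices, with low resource usage, good performance parameters, and low power consumption.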

10.

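A new area-optimized two-dimensional IDCT processor

Yu Baodong, Zou Xuecheng 《微处理机》2005,26(5):86-88

This paper proposes an 8×8 two-dimensional inverse discrete cosine transform (IDCT) processor based on the row-column decomposition algorithm. The conventional input registers for holding the input column vectors and the parallel-to-serial conversion registers are no longer needed, which reduces both chip area and processing latency. The one-dimensional discrete cosine transform is implemented with a lookup table, and the ROM used as the lookup table is also much smaller than the ROM of the conventional distributed arithmetic approach. The proposed two-dimensional IDCT processor is area-optimized, low-latency, and high-throughput, and it has a regular, fully pipelined structure, so it is well suited to VLSI and FPGA implementation.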
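Row-column decomposition, as named in this abstract, computes a 2-D IDCT by applying a 1-D IDCT to every row and then to every column of the intermediate result. The sketch below shows that structure in software. It assumes the orthonormal DCT-II/DCT-III scaling and evaluates cosines directly rather than through a lookup-table ROM as in the paper; the function names are hypothetical.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1-D inverse DCT (DCT-III) of a coefficient list."""
    n = len(coeffs)
    samples = []
    for x in range(n):
        total = 0.0
        for u, c in enumerate(coeffs):
            # Assumed orthonormal scaling: sqrt(1/n) for DC, sqrt(2/n) otherwise.
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        samples.append(total)
    return samples


def idct_2d(block):
    """2-D inverse DCT by row-column decomposition."""
    # Pass 1: 1-D IDCT along every row.
    rows = [idct_1d(row) for row in block]
    # Pass 2: 1-D IDCT along every column of the intermediate result.
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [row][column].
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    block = [[0.0] * 8 for _ in range(8)]
    block[0][0] = 8.0  # DC-only input gives a flat output block
    print(idct_2d(block)[0][:4])
```

The separable form replaces one 64-term sum per output sample with two 8-term sums, which is the property that hardware designs of this kind build on.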

11.

Division and square root: choosing the right implementation

Soderquist P. Leeser M. 《Micro, IEEE》1997,17(4):56-66

Floating-point support has become a mandatory feature of new microprocessors due to the prevalence of business, technical, and recreational applications that use these operations. Spreadsheets, CAD tools, and games, for instance, typically feature floating-point-intensive code. Over the past few years, the leading architectures have incorporated several generations of floating-point units (FPUs). However, while addition and multiplication implementations have become increasingly efficient, support for division and square root has remained uneven. The design community has reached no consensus on the type of algorithm to use for these two functions, and quality and performance of the implementations vary widely. This situation originates in skepticism about the importance of division and square root and an insufficient understanding of the design alternatives. Quantifying what constitutes good performance is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications. Even if we accept this doctrine at face value, implementing division, and square root, involves much more than relative latencies. We must also consider area, throughput, complexity, and the interaction with other operations. This article explores the various trade-offs involved and illuminates the consequences of different design choices, thus enabling designers to make informed decisions.

12.

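Implementation of the Rijndael encryption algorithm on a low-cost FPGA

Shen Hanfei, Gan Meng 《计算机工程与应用》2004,40(22):116-119,134

This paper introduces different hardware implementations of the Rijndael encryption algorithm. To balance hardware resources and circuit performance, and in view of the internal structure of Xilinx FPGAs, the design adopts an inner-round pipeline structure and uses the FPGA's built-in RAM and abundant register resources, achieving very high encryption speed while consuming few resources.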

13.

The secure wavelet transform

Amit Pande Joseph Zambreno 《Journal of Real-Time Image Processing》2012,7(2):131-142

There has been an increasing concern for the security of multimedia transactions over real-time embedded systems. Partial and selective encryption schemes have been proposed in the research literature, but these schemes significantly increase the computation cost leading to tradeoffs in system latency, throughput, hardware requirements and power usage. In this paper, we propose a light-weight multimedia encryption strategy based on a modified discrete wavelet transform (DWT) which we refer to as the secure wavelet transform (SWT). The SWT provides joint multimedia encryption and compression by two modifications over the traditional DWT implementations: (a) parameterized construction of the DWT and (b) subband re-orientation for the wavelet decomposition. The SWT has rational coefficients which allow us to build a high throughput hardware implementation on fixed point arithmetic. We obtain a zero-overhead implementation on custom hardware. Furthermore, a look-up table based reconfigurable implementation allows us to allocate the encryption key to the hardware at run-time. Direct implementation on Xilinx Virtex FPGA gave a clock frequency of 60 MHz while a reconfigurable multiplier based design gave an improved clock frequency of 114 MHz. The pipelined implementation of the SWT achieved a clock frequency of 240 MHz on a Xilinx Virtex-4 FPGA and met the timing constraint of 500 MHz on a standard cell realization using 45 nm CMOS technology.

14.

FPGA implementation and performance evaluation of a high throughput crypto coprocessor

Mostafa I. Soliman Ghada Y. Abozaid 《Journal of Parallel and Distributed Computing》2011,71(8):1075-1084

This paper describes the FPGA implementation of FastCrypto, which extends a general-purpose processor with a crypto coprocessor for encrypting/decrypting data. Moreover, it studies the trade-offs between FastCrypto performance and design parameters, including the number of stages per round, the number of parallel Advanced Encryption Standard (AES) pipelines, and the size of the queues. Besides, it shows the effect of memory latency on the FastCrypto performance. FastCrypto is implemented in VHDL on a Xilinx Virtex V FPGA. A throughput of 222 Gb/s at 444 MHz can be achieved on four parallel AES pipelines. To reduce the power consumption, the frequency of the four parallel AES pipelines is reduced to 100 MHz while the other components are running at 400 MHz. In this case, our results show a FastCrypto performance of 61.725 bits per clock cycle (b/cc) when 128-bit single-port L2 cache memory is used. However, increasing the memory bus width to 256 bits or using 128-bit dual-port memory improves the performance to 112.5 b/cc (45 Gb/s at 400 MHz), which represents 88% of the ideal performance (128 b/cc).
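The closing figures of this abstract follow from simple unit arithmetic, which the short sketch below reproduces. The helper names `throughput_gbps` and `efficiency_percent` are made up for this check; the input numbers are taken from the abstract.

```python
def throughput_gbps(bits_per_cycle, clock_mhz):
    """Convert bits per clock cycle at a clock rate in MHz to Gb/s."""
    return bits_per_cycle * clock_mhz / 1000.0


def efficiency_percent(bits_per_cycle, ideal_bits_per_cycle=128):
    """Share of the ideal bits-per-cycle figure, as a percentage."""
    return 100.0 * bits_per_cycle / ideal_bits_per_cycle


if __name__ == "__main__":
    # 112.5 b/cc at 400 MHz, as reported for the wider or dual-port memory.
    print(throughput_gbps(112.5, 400))        # 45.0
    print(round(efficiency_percent(112.5)))   # 88
    # The single-port case (61.725 b/cc) for comparison.
    print(throughput_gbps(61.725, 400))
```

The same conversion can be applied to the single-port result to see how much of the ideal 128 b/cc the memory bottleneck costs.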

15.

FPGA implementation of full-search vector quantization based on partial distance search

Wen-Jyi Hwang Wen-Kang Wei Yao-Jung Yeh 《Microprocessors and Microsystems》2007,31(8):516-528

This paper presents a novel algorithm for field programmable gate array (FPGA) realization of vector quantizer (VQ) encoders using partial distance search (PDS). In most applications, the PDS is adopted as a software approach for attaining moderate codeword search acceleration. In this paper, a novel PDS algorithm well suited for hardware realization is proposed. The algorithm employs subspace search, bitplane reduction, and multiple-coefficient accumulation techniques for the effective reduction of the area complexity and computation latency. Concurrent encoding of different input vectors for further computation acceleration is also allowed by the employment of multiple-module PDS. The proposed implementation has been embedded in a softcore CPU for physical performance measurement. Experimental results show that the implementation provides a cost-effective solution to the FPGA realization of VQ encoding systems where both high throughput and high fidelity are desired.
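For readers unfamiliar with partial distance search, the sketch below shows the baseline software form that the abstract contrasts with its hardware variant: the squared distance to a codeword is accumulated one coefficient at a time and abandoned as soon as it reaches the best distance found so far. It does not include the paper's subspace search, bitplane reduction, or multiple-coefficient accumulation; the function name `pds_encode` is hypothetical.

```python
def pds_encode(vector, codebook):
    """Return (index, squared_distance) of the nearest codeword.

    Full search with partial-distance early termination: the result is
    the same as an exhaustive search, but hopeless codewords are
    rejected before all of their coefficients are examined.
    """
    best_index = -1
    best_dist = float("inf")
    for index, codeword in enumerate(codebook):
        dist = 0.0
        for x, c in zip(vector, codeword):
            dist += (x - c) ** 2
            if dist >= best_dist:
                break  # partial distance already too large; skip this codeword
        else:
            # Loop finished without a break, so this codeword is the new best.
            best_index, best_dist = index, dist
    return best_index, best_dist


if __name__ == "__main__":
    codebook = [[0, 0], [10, 10], [3, 4]]
    print(pds_encode([3, 5], codebook))  # (2, 1.0)
```

Because early termination only skips work that cannot change the answer, the encoder keeps full-search fidelity while doing fewer accumulations on average.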

16.

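Design of a reconfigurable serial Advanced Encryption Standard encryption/decryption circuit

Xie Huimin, Guo Donghui 《计算机应用》2013,33(2):450-459

To further improve the hardware resource efficiency of the Advanced Encryption Standard (AES) algorithm on field-programmable gate arrays (FPGAs), an implementation of a serial AES encryption/decryption circuit that supports 128/192/256-bit key lengths is proposed. The design uses composite-field transformation to implement byte multiplicative inversion, shares resources between the column-mixing and inverse column-mixing operations, and shares key expansion among the three AES variants. The circuit is implemented on a Xilinx Virtex-Ⅴ series FPGA and consumes 1871 slices and 4 RAMs. Results show that, at the maximum operating frequency of 173.904 MHz, AES encryption/decryption throughput for 128/192/256-bit keys reaches 2119/1780/1534 Mb·s^(-1), respectively. The design has a high ratio of throughput to hardware resources and is suitable for supporting Gigabit Ethernet.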

17.

64-bit Block ciphers: hardware implementations and comparison analysis

P. N. M.D. O. 《Computers & Electrical Engineering》2004,30(8):593-604

A performance comparison of FPGA hardware implementations of the 64-bit block ciphers Triple-DES, IDEA, CAST-128, MISTY1, and KHAZAD is given in this paper. All these ciphers are under consideration for the ISO/IEC 18033-3 standard in order to provide an international encryption standard for 64-bit block ciphers. Two basic architectures are implemented for each cipher. For the non-feedback cipher modes, pipelining between the rounds is used, and the achieved throughput ranges from 3.0 Gbps for IDEA to 6.9 Gbps for Triple-DES. For the feedback cipher modes, the basic iterative architecture is considered, and the achieved throughput ranges from 115 Mbps for Triple-DES to 462 Mbps for KHAZAD. The throughput, throughput per slice, latency, and area requirement results are provided for all the cipher implementations. Our study is an effort to determine the most suitable algorithm for hardware implementation with FPGA devices.

18.

Efficient FPGA implementation of corrected reversible contrast mapping algorithm for video watermarking

《Microprocessors and Microsystems》2020

This paper analyses and rectifies the shortcomings of reversible contrast mapping (RCM) algorithm for invisible watermarking. The proposed corrected RCM algorithm is tested by taking a gray scaled video input. The quality of services and structural similarity index matrix (SSIM) of each frame of the input video are tested in software environment. The video is obtained by OV7670 camera through Zed-board in fully field programmable gate array (FPGA) based hardware environment. FPGA devices based corrected high level synthesis of the proposed algorithm is presented. The suggested system engages pipeline structure and practices parallelism to achieve high performance. The quality of services and SSIM are also tested using FPGA devices and the comparative results with software implementations are explained. To process thirty (640 × 480) image blocks with 150 MHz clock we obtain a latency of 876.626 ns with throughput 62.328 Mbps. The critical path for single cycle is 5.992 ns. The number of resources essential is similar for watermark decoding with an improved schedule. The results acquired after implementation on Xilinx Virtex 7-XC7V2000T and programmable system-on-chip (Zynq - XC7Z030) FPGA devices confirm the practicability of real-time use with low cost and great speed.

19.

Evolutionary circuit design for fast FPGA-based classification of network application protocols

《Applied Soft Computing》2016

The evolutionary design can produce fast and efficient implementations of digital circuits. It is shown in this paper how evolved circuits, optimized for the latency and area, can increase the throughput of a manually designed classifier of application protocols. The classifier is intended for high speed networks operating at 100 Gbps. Because a very low latency is the main design constraint, the classifier is constructed as a combinational circuit in a field programmable gate array (FPGA). The classification is performed using the first packet carrying the application payload. The improvements in latency (and area) obtained by Cartesian genetic programming are validated using a professional FPGA design tool. The quality of classification is evaluated by means of real network data. All results are compared with commonly used classifiers based on regular expressions describing application protocols.