Similar literature
 20 similar documents found (search time: 31 ms)
1.
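Huang Zhaowei  Wang Lianming 《Application Research of Computers (计算机应用研究)》2020,37(9):2762-2765,2771
To address the low throughput and low resource utilization of FPGA floating-point units designed to the IEEE 754 floating-point standard, this paper proposes a parallel floating-point vector multiplication unit whose arithmetic precision and number of arithmetic units are configurable. Making the exponent and mantissa widths of the floating-point units configurable improves system resource utilization, and combining pipelining with a parallel structure improves data throughput. On an EP4CE115 FPGA test platform configured with 10 FP14 units, the system occupies about 4.2% of the logic resources and reaches a peak throughput of 4.5 GFLOPS. The results show that the proposed floating-point vector multiplication unit effectively improves FPGA resource utilization and arithmetic throughput, is highly portable and general, and is suitable for accelerating vector multiplication on FPGAs.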

2.
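This paper proposes WIEC, an intra-frame error concealment interpolation algorithm based on a Wiener interpolation model. The algorithm uses the nearest neighboring points as interpolation reference points and improves on the traditional point-selection method by using already-recovered pixels as estimates, so that it recovers important image information such as edge features well. To address error accumulation and computational complexity, a spiral interpolation order is also proposed; it both improves interpolation accuracy and reduces the maximum interpolation error. Simulation experiments show that the algorithm conceals errors well.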

3.
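High-precision, high-performance floating-point units are an important part of high-performance microprocessor design. Based on a study of the traditional double-precision floating-point multiply-add algorithm and the characteristics of the quadruple-precision floating-point data format, a high-performance quadruple-precision floating-point multiply-add unit (QPFMA) is designed and implemented. The unit supports multiple floating-point operations, has a latency of 7 cycles, and is fully pipelined. A dual-path adder improves the algorithm structure, and the leading-zero prediction and normalization shift logic are optimized, reducing latency and hardware cost. A parameterized design verification method provides efficient correctness verification. Logic synthesis results show that, in a 65 nm process, the QPFMA reaches 1.2 GHz, with 3 fewer cycles of latency and about 11.63% higher frequency than existing QPFMA designs.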

4.
Ahmet   《Journal of Systems Architecture》2008,54(12):1129-1142
Most modern microprocessors provide multiple identical functional units to increase performance. This paper presents dual-mode floating-point adder architectures that support one higher precision addition and two parallel lower precision additions. A double precision floating-point adder implemented with the improved single-path algorithm is modified to design a dual-mode double precision floating-point adder that supports both one double precision addition and two parallel single precision additions. A similar technique is used to design a dual-mode quadruple precision floating-point adder that implements the two-path algorithm. The dual-mode quadruple precision floating-point adder supports one quadruple precision and two parallel double precision additions. To estimate area and worst-case delay, double, quadruple, dual-mode double, and dual-mode quadruple precision floating-point adders are implemented in VHDL using the improved single-path and the two-path floating-point addition algorithms. The correctness of all the designs is tested and verified through extensive simulation. Synthesis results show that dual-mode double and dual-mode quadruple precision adders designed with the improved single-path algorithm require roughly 26% more area and 10% more delay than double and quadruple precision adders designed with the same algorithm. Synthesis results obtained for adders designed with the two-path algorithm show that dual-mode double and dual-mode quadruple precision adders require 33% and 35% more area and 13% and 18% more delay than double and quadruple precision adders, respectively.

5.
The size of geometric data sets in scientific and industrial applications is constantly increasing. Storing surface or volume meshes in standard uncompressed formats results in large files that are expensive to store and slow to load and transmit. Scientists and engineers often refrain from using mesh compression because currently available schemes modify the mesh data. While connectivity is encoded in a lossless manner, the floating-point coordinates associated with the vertices are quantized onto a uniform integer grid to enable efficient predictive compression. Although a fine enough grid can usually represent the data with sufficient precision, the original floating-point values will change, regardless of grid resolution. In this paper we describe a method for compressing floating-point coordinates with predictive coding in a completely lossless manner. The initial quantization step is omitted and predictions are calculated in floating-point. The predicted and the actual floating-point values are broken up into sign, exponent, and mantissa and their corrections are compressed separately with context-based arithmetic coding. As the quality of the predictions varies with the exponent, we use the exponent to switch between different arithmetic contexts. We report compression results using the popular parallelogram predictor, but our approach will work with any prediction scheme. The achieved bit-rates for lossless floating-point compression nicely complement those resulting from uniformly quantizing with different precisions.
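The field-splitting step this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' coder: it assumes IEEE 754 double precision (the abstract does not state the width), the function names are hypothetical, and the correction scheme shown (XOR for the sign, integer differences for exponent and mantissa) is one simple lossless choice rather than the paper's. The arithmetic-coding stage is omitted.

```python
import struct

MANTISSA_BITS = 52          # assumption: IEEE 754 double precision
EXPONENT_MASK = 0x7FF


def split_fields(x):
    """Split a double into (sign, biased exponent, mantissa) integers."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> MANTISSA_BITS) & EXPONENT_MASK
    mantissa = bits & ((1 << MANTISSA_BITS) - 1)
    return sign, exponent, mantissa


def join_fields(sign, exponent, mantissa):
    """Inverse of split_fields: rebuild the double from its three fields."""
    bits = (sign << 63) | (exponent << MANTISSA_BITS) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def corrections(predicted, actual):
    """Per-field corrections that turn the prediction into the actual value."""
    ps, pe, pm = split_fields(predicted)
    a_s, ae, am = split_fields(actual)
    return a_s ^ ps, ae - pe, am - pm


def reconstruct(predicted, corr):
    """Apply the corrections to the prediction; recovers the exact bits."""
    ps, pe, pm = split_fields(predicted)
    ds, de, dm = corr
    return join_fields(ps ^ ds, pe + de, pm + dm)


if __name__ == "__main__":
    actual, predicted = 1.2345, 1.2344
    corr = corrections(predicted, actual)
    print(corr, reconstruct(predicted, corr) == actual)
```

A good prediction leaves a zero sign and exponent correction and a small mantissa correction, which is what makes the separate streams compressible.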

6.
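Based on an in-depth analysis of the floating-point arithmetic precision of CRAY-class supercomputers, this paper designs a higher-precision scheme for controlling floating-point arithmetic precision in pipelined vector machines.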

7.
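This paper proposes a design for an embedded CNC system with two-level control that combines a 486SX-class x86 microprocessor with a CPLD programmable logic device, and describes the system's hardware interface circuits. It proposes a direct NURBS curve interpolation algorithm based on improved S-curve acceleration/deceleration, which plans the feedrate over the acceleration and deceleration segments of the interpolated curve while meeting the limits on maximum chord error, maximum normal acceleration, and maximum feedrate. Finally, an embedded CNC system using this interpolation algorithm machined a five-pointed-star NURBS curve on a semicircular blank, verifying the feasibility and effectiveness of the designed system, which has some value for engineering applications.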

8.
Recursive procedures used for sequential calculations of polynomial basis coefficients in discrete orthogonal moments produce unreliable results for high moment orders as a result of error accumulation. This paper demonstrates accurate reconstruction of arbitrary-size images using full-order (orders as large as the image size) Tchebichef and Krawtchouk moments by calculating polynomial coefficients directly from their definition formulas in hypergeometric functions and by creating lookup tables of these coefficients off-line. An arbitrary precision calculator is used to achieve greater numerical range and precision than is possible with software using standard 64-bit IEEE floating-point arithmetic. This reconstruction scheme is content and noise independent.

9.
A new method of inter-neuron communication called incremental communication is presented. In the incremental communication method, instead of communicating the whole value of a variable, only the increment or decrement of its previous value is sent on a communication link. The incremental value may be either a fixed-point or a floating-point value. Multilayer feedforward network architecture is used to illustrate the effectiveness of the proposed communication scheme. The method is applied to three different learning problems and the effect of the precision of incremental input-output values of the neurons on the convergence behavior is examined. It is shown through simulation that for some problems even four-bit precision in fixed- and/or floating-point representations is sufficient for the network to converge. With 8-12 bit precision, almost the same results are obtained as with conventional communication using 32-bit precision. The proposed method of communication can lead to significant savings in the intercommunication cost for implementations of artificial neural networks on parallel computers as well as the interconnection cost of direct hardware realizations. The method can be incorporated into most of the current learning algorithms in which inter-neuron communications are required. Moreover, it can be used along with the other limited precision strategies for representation of variables suggested in the literature.

10.
Choosing an internal floating-point representation for a binary computer with given word-length is influenced by two factors: the size of the range of admissible numbers and the precision of the respective floating-point arithmetic. In this paper “precision” is defined by a statistical model of rounding errors. According to this definition, base 4 floating-point arithmetic on average produces smaller rounding errors than all other floating-point arithmetics with a base 2^k, provided that the ranges of numbers have equal size.

11.
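Gong Mingqing  Ye Huang  Zhang Jian  Lu Xingjing  Chen Wei 《Journal of Computer Applications (计算机应用)》2019,39(6):1557-1562
To address the low efficiency of neural network inference on mobile smart devices with ARM processors, a set of optimizations for the single-precision floating-point general matrix multiplication (SGEMM) algorithm on the ARMv8 architecture is proposed. First, it is determined that the computational efficiency of SGEMM on ARMv8 processors is limited by how the vector units are used, by the instruction pipeline, and by the probability of cache misses. Second, three optimization techniques address these three causes: inline assembly with vector instructions, data rearrangement, and data prefetching. Finally, test experiments are designed around three matrix patterns common in speech neural networks and run on the RK3399 hardware platform. The results show a single-core speed of 10.23 GFLOPS in the square-matrix pattern, 78.2% of the measured floating-point peak; 6.35 GFLOPS in the tall-and-skinny-matrix pattern, 48.1% of the measured peak; and 2.53 GFLOPS in the consecutive-small-matrix pattern, 19.2% of the measured peak. When the optimized SGEMM algorithm was deployed in a speech recognition neural network program, the program's actual recognition speed improved significantly.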

12.
Single-precision floating-point computations may yield an arbitrary false result due to cancellation and rounding errors. This is true even for very simple, structured arithmetic expressions such as Horner's scheme for polynomial evaluation. A simple procedure will be presented for fast calculation of the value of an arithmetic expression to least significant bit accuracy in single precision computation. For this purpose in addition to the floating-point arithmetic only a precise scalar product (cf. [2]) is required. If the initial floating-point approximation is not too bad, the computing time of the new algorithm is approximately the same as for usual floating-point computation. If not, the essential progress of the presented algorithm is that the inaccurate approximation is recognized and corrected. The algorithm achieves high accuracy, i.e. between the left and the right bound of the result there is at most one more floating-point number. A rigorous estimation of all rounding errors introduced by floating-point arithmetic is given for general triangular linear systems. The theorem is applied to the evaluation of arithmetic expressions.
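The cancellation problem in Horner's scheme mentioned above can be reproduced with a short sketch. It does not implement the paper's correction procedure or its precise scalar product; it only contrasts ordinary floating-point evaluation with an exact rational reference. Assumptions: Python floats are double precision rather than the single precision discussed in the abstract, and the test polynomial (x - 2)^9 in expanded form is an illustrative choice.

```python
from fractions import Fraction
from math import comb


def horner(coeffs, x):
    """Evaluate a polynomial by Horner's scheme.

    coeffs are ordered from the highest degree down to the constant term.
    The arithmetic follows the type of x (float or Fraction).
    """
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def expanded_power_coeffs(root, degree):
    """Integer coefficients of (x - root)**degree, highest degree first."""
    return [comb(degree, k) * (-root) ** k for k in range(degree + 1)]


if __name__ == "__main__":
    coeffs = expanded_power_coeffs(2, 9)
    x = 2.0001
    approx = horner(coeffs, x)                  # rounded at every step
    exact = horner(coeffs, Fraction(x))         # exact rational arithmetic
    print("floating-point:", approx)
    print("exact         :", float(exact))
```

Near the multiple root the exact value is tiny while the intermediate terms are large, so the floating-point result can be dominated by rounding error.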

13.
Recent advances in the parallelizability of fast N-body algorithms, and the programmability of graphics processing units (GPUs) have opened a new path for particle based simulations. For the simulation of turbulence, vortex methods can now be considered as an interesting alternative to finite difference and spectral methods. The present study focuses on the efficient implementation of the fast multipole method and pseudo-particle method on a cluster of NVIDIA GeForce 8800 GT GPUs, and applies this to a vortex method calculation of homogeneous isotropic turbulence. The results of the present vortex method agree quantitatively with those of the reference calculation using a spectral method. We achieved a maximum speed of 7.48 TFlops using 64 GPUs, and the cost performance was near $9.4/GFlops. The calculation of the present vortex method on 64 GPUs took 4120 s, while the spectral method on 32 CPUs took 4910 s.

14.
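Che Wenbo  Liu Hengzhu  Tian Tian 《Journal of Computer Applications (计算机应用)》2016,36(8):2213-2218
To meet the performance, area, and power requirements for floating-point arithmetic in the high-performance M-type digital signal processor (M-DSP), the overall M-DSP architecture and the characteristics of its floating-point instructions were analyzed, and a high-performance, low-power floating-point multiply-accumulator (FMAC) was designed and implemented. The unit uses a main structure with separate single- and double-precision paths, executes in six pipeline stages, and reuses key modules such as the multiplier and the alignment shifter. It supports double- and single-precision floating-point multiplication, multiply-accumulate, multiply-subtract, single-precision dot product, and complex arithmetic. The unit was fully verified, and the code was synthesized with Synopsys Design Compiler in a 45 nm process. Synthesis results show an operating frequency of up to 1 GHz and a cell area of 36,856 μm²; compared with the multiply-add unit in FT-XDSP, area is reduced by 12.95% and the critical path is 2.17% shorter.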

15.
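This paper describes the design and implementation of the floating-point unit in the independently designed Longteng (龙腾) C2 microprocessor. The processor is compatible with the Intel 80486DX4 instruction set and supports basic and transcendental floating-point functions in IEEE 754 extended precision. The paper introduces the structure of the floating-point unit, analyzes the flow of the high-precision CORDIC algorithm used to implement the transcendental functions, discusses the datapath and control-path structures for floating-point transcendental function operations, and gives simulation and accuracy evaluation results. Simulation and analysis show that the floating-point unit design meets the design requirements of the Longteng C2 microprocessor.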
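For readers unfamiliar with CORDIC, the textbook rotation-mode iteration for sine and cosine can be sketched as below. This is a generic software model under stated assumptions, not the Longteng C2 design: the abstract gives no details of its high-precision variant, the function name and iteration count are illustrative, and a hardware version would replace the multiplications by powers of two with shifts and read the angles from a lookup table.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Approximate (sin(theta), cos(theta)) with rotation-mode CORDIC.

    Assumes theta lies in [-pi/2, pi/2], which is inside the
    convergence range of the basic iteration.
    """
    # Elementary rotation angles atan(2^-i); a lookup table in hardware.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector; pre-scale to compensate.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s - math.sin(0.5), c - math.cos(0.5))
```

Each iteration contributes roughly one more bit of accuracy, which is why the iteration count is tied to the target precision.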

16.
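Research on an FPGA-Based High-Precision Scientific Computing Accelerator   Cited: 2 in total (self-citations: 0, citations by others: 2)
Lei Yuanwu  Dou Yong  Guo Song 《Chinese Journal of Computers (计算机学报)》2012,35(1):112-122
This paper explores the capability and flexibility of FPGA platforms for accelerating high-precision scientific computing applications. First, it studies the most common operation in scientific computing, the vector inner product, and proposes an exact vector inner product algorithm based on fixed-point operations. Taking quadruple-precision floating-point arithmetic of the IEEE 754-2008 standard as an example, a fully pipelined quadruple-precision floating-point multiply-accumulate unit (QPMAC) based on a fully unrolled method is designed on an FPGA platform: a two-level storage strategy stores the multiply-accumulate sum exactly; a carry-save accumulation strategy reduces the fixed-point adder width, simplifies carry handling, and optimizes the critical path; and an accumulator partitioning strategy achieves pipelined throughput. Finally, prototype LU decomposition and MGS-QR decomposition accelerators are designed on an XC5VLX330 FPGA to verify QPMAC performance. Experimental results show that, compared with OpenMP-based parallel algorithms running on an Intel quad-core processor, an accelerator integrating 4 QPMAC units achieves a 42x to 97x performance improvement, with higher result accuracy and lower energy consumption.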
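The idea of an exact inner product, in which every product is accumulated without rounding and only the final result is rounded, can be illustrated in software. The sketch below is an assumption-laden stand-in: it uses Python's arbitrary-precision `Fraction` in place of the paper's long fixed-point accumulator, works on double-precision inputs rather than quadruple precision, and the function names are hypothetical.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Ordinary floating-point inner product: rounds after every operation."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def exact_dot(xs, ys):
    """Inner product accumulated exactly, rounded once at the end.

    Fraction(x) converts a float to the exact rational it represents,
    so no product or partial sum loses information.
    """
    total = Fraction(0)
    for x, y in zip(xs, ys):
        total += Fraction(x) * Fraction(y)
    return float(total)


if __name__ == "__main__":
    xs = [1e16, 1.0, -1e16]
    ys = [1.0, 1.0, 1.0]
    print(naive_dot(xs, ys), exact_dot(xs, ys))
```

In the example, the small term is absorbed by the large partial sum in the naive loop but survives in the exact accumulator.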

17.
A parallel implementation of the Chebyshev method is presented for the B-spline surface interpolation problem. The algorithm finds the control points of a uniform bicubic B-spline surface that interpolates m × n data points on an m × n mesh-connected processor array in constant time. Hence it is optimal. Due to its numerical stability, the algorithm can successfully be used in finite precision floating-point arithmetic.

18.
In high-speed free-form surface machining, real-time motion planning and interpolation is a challenging task. This paper presents the design and implementation of a dedicated processor for the interpolation task in computerized numerical control (CNC) machine tools. The jerk-limited look-ahead motion planning and interpolation algorithm has been integrated in the interpolation processor to achieve smooth motion in high-speed machining. The processor features a compactly designed floating-point parallel computing architecture, which employs a 3-stage pipelined reduced instruction set computer (RISC) core and a very long instruction word (VLIW) floating-point arithmetic unit. A new asynchronous execution mechanism has been employed in the processor to allow multi-cycle instructions to be performed in parallel. The proposed processor has been verified on a low-cost field programmable gate array (FPGA) chip in a prototype controller. Experimental results have demonstrated a significant improvement in computing performance with the interpolation processor in free-form surface machining.

19.
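The contours of parts machined on CNC machine tools can generally be approximated by straight lines or circular arcs. Interpolation is the technique by which a CNC system describes the workpiece contour from basic input data. To address the poor flexibility and low precision of interpolation implemented in hardware circuits, this article explores implementing interpolation as a soft IP core. Taking the commonly used point-by-point comparison interpolation algorithm as an example, it discusses the whole VHDL modeling process in detail and gives a fairly comprehensive timing analysis. Finally, the feasibility of the scheme is verified on the EDA6000 experimental development platform. The soft IP interpolator is easy to upgrade and low in cost.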
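The point-by-point comparison method named in this abstract can be sketched for a straight line as follows. This is the textbook first-quadrant form written as a Python software model, not the article's VHDL; the function name and the restriction to a line from the origin with xe > 0 and ye >= 0 are assumptions made for brevity.

```python
def line_interpolate(xe, ye):
    """Point-by-point comparison interpolation of a line from (0, 0) to (xe, ye).

    The deviation f = xe*y - ye*x tells which side of the ideal line the
    tool is on: f >= 0 (on or above the line) steps in +x, f < 0 steps in +y.
    Returns the list of visited grid points, one unit step at a time.
    """
    if xe <= 0 or ye < 0:
        raise ValueError("sketch covers the first quadrant with xe > 0 only")

    x = y = 0
    f = 0
    path = [(x, y)]
    for _ in range(xe + ye):      # the endpoint is reached in xe + ye steps
        if f >= 0:
            x += 1
            f -= ye               # f(x + 1, y) = f(x, y) - ye
        else:
            y += 1
            f += xe               # f(x, y + 1) = f(x, y) + xe
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(line_interpolate(4, 3))
```

Only additions, subtractions, and a sign test are needed per step, which is why the method maps naturally onto simple hardware.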

20.
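In a DCS, the data storage resources of the main control unit module are limited, so to save storage space, algorithm blocks are developed with low-precision data types wherever possible. Because floating-point results that exceed the representable precision must be approximated or rounded, errors arise; for algorithms involving complex computation, insufficient data precision can sometimes cause incorrect results. Algorithm testing should therefore include tests of data precision.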
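A precision test of the kind recommended here might start from a small experiment like the one below. It is only a sketch under assumptions: 32-bit IEEE 754 floats are taken as the "low-precision data type" (the abstract does not name one), single-precision rounding is emulated with the standard `struct` module, and the helper names are hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_float32(a, b):
    """Add two values with the result rounded to single precision."""
    return to_float32(to_float32(a) + to_float32(b))


def accumulate_float32(step, count):
    """Add step to a running total count times, rounding after every add."""
    total = 0.0
    for _ in range(count):
        total = add_float32(total, step)
    return total


if __name__ == "__main__":
    # 0.1 is not exactly representable, so the stored value already differs.
    print(to_float32(0.1))
    # Rounding after every addition lets the error build up.
    print(accumulate_float32(0.1, 1000))
    # Above 2**24 a single-precision total can no longer count by ones.
    print(add_float32(16777216.0, 1.0))
```

Comparing such results against a double-precision reference is one simple way to check whether an algorithm block tolerates the chosen data type.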
