首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
黄兆伟  王连明 《计算机应用研究》2020,37(9):2762-2765,2771
针对目前采用IEEE 754浮点标准设计的FPGA浮点运算器中吞吐率与资源利用率低等问题,提出一种运算精度与运算器数量可配置的并行浮点向量乘法运算单元。通过浮点运算器的指数、尾数位数可配置化设计,提高系统资源利用率,并将流水线技术与并行结构结合,提高数据吞吐率。以EP4CE115型FPGA为测试平台,当配置10组FP14运算器时,系统的逻辑资源占用约为4.2%,峰值吞吐率可达4.5 GFLOPS。结果表明,提出的浮点向量乘法单元有效提高了FPGA资源利用率与运算吞吐率,同时具有高度的可移植性与通用性,适用于FPGA向量乘法运算的加速。  相似文献   

2.
基于DNN的低资源语音识别特征提取技术   总被引:1,自引:0,他引:1  
秦楚雄  张连海 《自动化学报》2017,43(7):1208-1219
针对低资源训练数据条件下深层神经网络(Deep neural network,DNN)特征声学建模性能急剧下降的问题,提出两种适合于低资源语音识别的深层神经网络特征提取方法.首先基于隐含层共享训练的网络结构,借助资源较为丰富的语料实现对深层瓶颈神经网络的辅助训练,针对BN层位于共享层的特点,引入Dropout,Maxout,Rectified linear units等技术改善多流训练样本分布不规律导致的过拟合问题,同时缩小网络参数规模、降低训练耗时;其次为了改善深层神经网络特征提取方法,提出一种基于凸非负矩阵分解(Convex-non-negative matrix factorization,CNMF)算法的低维高层特征提取技术,通过对网络的权值矩阵分解得到基矩阵作为特征层的权值矩阵,然后从该层提取一种新的低维特征.基于Vystadial 2013的1小时低资源捷克语训练语料的实验表明,在26.7小时的英语语料辅助训练下,当使用Dropout和Rectified linear units时,识别率相对基线系统提升7.0%;当使用Dropout和Maxout时,识别率相对基线系统提升了12.6%,且网络参数数量相对其他系统降低了62.7%,训练时间降低了25%.而基于矩阵分解的低维特征在单语言训练和辅助训练的两种情况下都取得了优于瓶颈特征(Bottleneck features,BNF)的识别率,且在辅助训练的情况下优于深层神经网络隐马尔科夫识别系统,提升幅度从0.8%~3.4%不等.  相似文献   

3.
张爱英 《计算机科学》2018,45(9):308-313
利用多语言信息可以提高资源稀缺语言识别系统的性能。但是,在利用多语言信息提高资源稀缺目标语言识别系统的性能时,并不是所有语言的语音数据对资源稀缺目标语言语音识别系统的性能提高都有帮助。文中提出利用长短时记忆递归神经网络语言辨识方法 选择 多语言数据以提高资源稀缺目标语言识别系统的性能;选出更加有效的多语言数据用于训练多语言深度神经网络和深度Bottleneck神经网络。通过跨语言迁移学习获得的深度神经网络和通过深度Bottleneck神经网络获得的Bottleneck特征都对 提高 资源稀缺目标语言语音识别系统的性能有很大的帮助。与基线系统相比,在插值的Web语言模型解码条件下,所提系统的错误率分别有10.5%和11.4%的绝对减少。  相似文献   

4.
Pipelining is now a standard technique for increasing the speed of computers, particularly for floating-point arithmetic. Single-chip, pipelined floating-point functional units are available as “off the shelf” components. Addressing arithmetic can be done concurrently with floating-point operations to construct a fast processor that can exploit fine-grain parallelism. This paper describes a metric to estimate the optimal execution time of DO loops on particular processors. This metric is parameterized by the memory bandwidth and peak floating-point rate of the processor, as well as the length of the pipelines used in the functional units. Data dependence analysis provides information about the execution order constraints of the operations in the DO loop and is used to estimate the amount of pipeline interlock required by a loop. Several transformations are investigated to determine their impact on loops under this metric.  相似文献   

5.
D. W. Lozier  P. R. Turner 《Computing》1992,48(3-4):239-257
In this paper we consider the parallel computation of vector norms and inner products in floating-point and a proposed new form of computer arithmetic, the symmetric level-index system. The vector norms provide an illuminating example of the contrast between the two arithmetic systems under discussion in terms of the ability to program for (complete) robustness and parallelizability. The conflict between robustness of the computation—in the sense of the dual requirements of accuracy and freedom from overflow and underflow—and easy parallelization of the algorithms within a floating-point environment is made plain. It is seen that this conflict disappears if the symmetric level-index system of arithmetic is used. The freedom from overflow and underflow offered by this system allows the programming of the straightforward definitions in a way which is simple, robust and immediately parallelizable. Numerical results are given to illustrate the fact that the symmetric level-index system yields results of comparable accuracy to those of floating-point in cases where the latter system works and still yields results of high accuracy when the floating-point system fails altogether.  相似文献   

6.
董冕  吴丹  饶金理  黄威  戴葵  邹雪城 《计算机工程》2012,38(16):249-252
通过硬件共享的方式实现一套高性能子字并行运算单元,运算单元采用流水线设计,可以一个周期进行1个64-bit、2个32-bit、4个16-bit或8个8-bit定点运算,1个双精度或2个单精度浮点运算。运算单元采用Verilog HDL设计,在0.18 μm 标准CMOS工艺库下实现,并针对实际多媒体应用程序基于ESCA系统进行性能评测。实验结果表明,该运算单元可以在硬件开销和性能上获得较好的平衡。  相似文献   

7.
Single-precision floatingpoint computations may yield an arbitrary false result due to cancellation and rounding errors. This is true even for very simple, structured arithmetic expressions such as Horner's scheme for polynomial evaluation. A simple procedure will be presented for fast calculation of the value of an arithmetic expression to least significant bit accuracy in single precision computation. For this purpose in addition to the floating-point arithmetic only a precise scalar product (cf. [2]) is required. If the initial floatingpoint approximation is not too bad, the computing time of the new algorithm is approximately the same as for usual floating-point computation. If not, the essential progress of the presented algorithm is that the inaccurate approximation is recognized and corrected. The algorithm achieves high accuracy, i.e. between the left and the right bound of the result there is at most one more floating-point number. A rigorous estimation of all rounding errors introduced by floating-point arithmetic is given for general triangular linear systems. The theorem is applied to the evaluation of arithmetic expressions.  相似文献   

8.
Dr. G. Bohlender 《Computing》1980,24(2-3):149-160
In numerical computations mainly real and complex numbers, intervals as well as matrices and vectors with such components occur. It is well known that the arithmetic operations with real numbers, complex numbers etc. can be carried over to real floating-point numbers, complex floating-point numbers etc. using roundings. This proceeding results in agreeable arithmetic-, order- and compatibility-properties for an abundance of numerical data types and the accompanying arithmetic operations. Most programming languages however only provide real floating-point numbers; all the other data types and operations have to be simulated, e. g. in the form of arrays and procedure calls, which often causes loss of accuracy and arithmetic properties. Furthermore the complicate notation makes programs difficult to read. Therefore in this article an extension of PASCAL is presented which serves as an example for the way these numerical data types can be embedded into the syntax of a programming language.  相似文献   

9.
The development of the first two members in a family of scalable-processor-architecture (Sparc)-compatible parts is described. With varying frequency and latency performance, the chips work with the first two integer unit (IU) implementations from other Sparc vendors. These are the first Sparc chips to integrate all floating-point controller functions, floating-point register files, and 64-b ALU (arithmetic and logic unit), multiplier, and divide/square-root units in one die. A strong relationship with original equipment manufacturers in system behavioral-level modeling and a short time to production were key factors in the product development plan. Implementation goals, bus organization, overall processor operation, and the operation of the ALU, multiplier, and divide/square-root units are discussed  相似文献   

10.
Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 108 MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores.  相似文献   

11.
Some processors designed for consumer applications, such as graphics processing units (CPUs) and the CELL processor, promise outstanding floating-point performance for scientific applications at commodity prices. However, IEEE single precision is the most precise floating-point data type these processors directly support in hardware. Pairs of native floating-point numbers can be used to represent a base result and a residual term to increase accuracy, but the resulting order of magnitude slowdown dramatically reduces the price/performance advantage of these systems. By adding a few simple microarchitectural features, acceptable accuracy can be obtained with relatively little performance penalty. To reduce the cost of native-pair arithmetic, a residual register is used to hold information that would normally have been discarded after each floating-point computation. The residual register dramatically simplifies the code, providing both lower latency and better instruction-level parallelism.  相似文献   

12.
Choosing an internal floating-point representation for a binary computer with given word-length is influenced by two factors: the size of the range of admissible numbers and the precision of the respective floating-point arithmetic. In this paper “precision” is defined by a statistical model of rounding errors. According to this definition base 4 floating-point arithmetic on an average produces smaller rounding errors than all other floating-point arithmetics with a base 2k, provided that the ranges of numbers have equal size.  相似文献   

13.
沈洁  龙标  姜浩  黄春 《计算机研究与发展》2020,57(12):2610-2620
得益于单指令多数据(single instruction multiple data, SIMD)向量化技术,处理器浮点计算能力获得了成倍的提升,然而当前SIMD向量部件和指令集仅支持加、减、乘、除、逻辑运算等基本操作,对浮点超越函数没有提供直接的支持.作为浮点计算中最耗时的一类函数,如何提高其性能成为底层数学库优化工作的一个重点.面向超越函数中的三角函数,提出一种利用SIMD向量部件设计、实现与优化向量三角函数的方法.该方法结合标量数学库分段计算与向量数学库向量化实现的优势,增加和优化了向量三角函数中的分支处理,既减少了函数实现中的冗余计算,又提高了分支情况下向量部件的利用率.在飞腾处理器上的实验表明:所提优化方法既保证了向量三角函数的精度,同时有效提高了函数性能,与原始向量三角函数相比平均性能加速比为2.04倍.  相似文献   

14.
The IEEE 754 and 854 standards regulate the behaviour of real floating-point arithmetic, as implemented in most current hard- and software systems. Although a myriad of libraries for complex floating-point arithmetic is available and in use, there is no general consensus on their implementation. The International C Standard describes in its Annex G guidelines for the implementation of complex arithmetic, in order to achieve a similar behaviour of complex floating-point arithmetic across C-language compliant implementations. In Section 2 we summarize its recommendations and outline the problems inherent to this approach. In Section 3 we describe how the lack of reliability, when computing certain complex-valued expressions, can be overcome. Throughout the discussion the rounding mode is assumed to be round-to-nearest, as in Annex G.  相似文献   

15.
In this paper, we present two versions of a hardware processing architecture for modeling large networks of leaky-integrate-and-fire (LIF) neurons; the second version provides performance enhancing features relative to the first. Both versions of the architecture use fixed-point arithmetic and have been implemented using a single field-programmable gate array (FPGA). They have successfully simulated networks of over 1000 neurons configured using biologically plausible models of mammalian neural systems. The neuroprocessor has been designed to be employed primarily for use on mobile robotic vehicles, allowing bio-inspired neural processing models to be integrated directly into real-world control environments. When a neuroprocessor has been designed to act as part of the closed-loop system of a feedback controller, it is imperative to maintain strict real-time performance at all times, in order to maintain integrity of the control system. This resulted in the reevaluation of some of the architectural features of existing hardware for biologically plausible neural networks (NNs). In addition, we describe a development system for rapidly porting an underlying model (based on floating-point arithmetic) to the fixed-point representation of the FPGA-based neuroprocessor, thereby allowing validation of the hardware architecture. The developmental system environment facilitates the cooperation of computational neuroscientists and engineers working on embodied (robotic) systems with neural controllers, as demonstrated by our own experience on the Whiskerbot project, in which we developed models of the rodent whisker sensory system.  相似文献   

16.
Approximation by interval Bezier curves   总被引:5,自引:0,他引:5  
  相似文献   

17.
We present a system, ESOLID, that performs exact boundary evaluation of low-degree curved solids in reasonable amounts of time. ESOLID performs accurate Boolean operations using exact representations and exact computations throughout. The demands of exact computation require a different set of algorithms and efficiency improvements than those found in a traditional inexact floating-point based modeler. We describe the system architecture, representations, and issues in implementing the algorithms. We also describe a number of techniques that increase the efficiency of the system based on lazy evaluation, use of floating-point filters, arbitrary floating-point arithmetic with error bounds, and lower-dimensional formulation of subproblems.ESOLID has been used for boundary evaluation of many complex solids. These include both synthetic datasets and parts of a Bradley Fighting Vehicle designed using the BRL-CAD solid modeling system. It is shown that ESOLID can correctly evaluate the boundary of solids that are very hard to compute using a fixed-precision floating-point modeler. In terms of performance, it is about an order of magnitude slower as compared to a floating-point boundary evaluation system on most cases.  相似文献   

18.
19.
Hardware Support for Interval Arithmetic   总被引:1,自引:0,他引:1  
A hardware unit for interval arithmetic (including division by an interval that contains zero) is described in this paper. After a brief introduction an instruction set for interval arithmetic is defined which is attractive from the mathematical point of view. These instructions consist of the basic arithmetic operations and comparisons for intervals including the relevant lattice operations. To enable high speed, the case selections for interval multiplication (9 cases) and interval division (14 cases) are done in hardware. The lower bound of the result is computed with rounding downwards and the upper bound with rounding upwards by parallel units simultaneously. The rounding mode must be an integral part of the arithmetic operation. Also the basic comparisons for intervals together with the corresponding lattice operations and the result selection in more complicated cases of multiplication and division are done in hardware. There they are executed by parallel units simultaneously. The circuits described in this paper show that with modest additional hardware costs interval arithmetic can be made almost as fast as simple floating-point arithmetic.  相似文献   

20.
并行浮点加法器架构与核心算法的研究   总被引:1,自引:0,他引:1  
考虑到浮点运算在图形处理中的重要作用,依据速度和面积的优化原理,文章从两个方面对FAU结构中最复杂的双精度浮点加法进行了研究。其一:在结构上采用了三条相互并行的主线,设计了一种尽可能并行处理的三级浮点流水结构,极大地提高了运算的速度,节约了芯片资源;其二:对结构中制约浮点加法速度的关键运算——尾加和移位操作进行了创新设计与实现,并就设计的先进性和高速性与传统设计进行了参数比较和综合分析。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号