
1.

徐睿马胜郭阳黄友李艺煌《计算机工程与科学》2019,41(9):1557-1566

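As convolutional neural networks become ever more widely used, custom hardware accelerators for their complex computations are receiving increasing attention and study. However, most current custom accelerators use the conventional convolution algorithm and lack support for network sparsity, forfeiting room to further improve the hardware and raise its performance. This paper redesigns a convolutional neural network accelerator based on the sparse Winograd algorithm, which has been shown to effectively reduce the computational complexity of convolutional neural networks and to adapt well to sparse networks. By implementing this algorithm in hardware, the design achieves considerable computational efficiency while reducing hardware resources. Experiments show that, compared with the conventional algorithm, the accelerator raises computation speed by nearly 4.15 times; in terms of multiplier utilization, it improves utilization by up to nearly 9 times over other existing designs.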
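
As a rough illustration of the Winograd idea this abstract relies on, the sketch below implements the standard one-dimensional F(2,3) minimal filtering step, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The function names are hypothetical, and skipping zero-valued transformed weights is only an assumption about how sparsity might be exploited; this is not the authors' hardware design.

```python
def direct_conv_1d(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with at most 4 multiplications."""
    # Filter transform (can be precomputed once per filter).
    u = (g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2])
    # Input transform: additions and subtractions only.
    v = (d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3])
    # Element-wise products; a zero transformed weight needs no multiplier
    # (assumed way of exploiting sparsity in the Winograd domain).
    m = [0.0 if ui == 0 else ui * vi for ui, vi in zip(u, v)]
    # Output transform: additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]
```

With `d = [1, 2, 3, 4]` and `g = [1, 0, -1]`, two of the four transformed weights are zero, so only two multiplications remain while the result still matches the direct computation.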

2.

FPGA-based sparse convolutional neural network accelerator

狄新凯杨海钢《计算机工程》2021,47(7):189-195,204

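To eliminate the invalid operations that arise from the sparsity of model parameters during forward computation of convolutional neural networks, a dataflow and parallel accelerator for sparse neural network models is designed on a field-programmable gate array (FPGA). Dedicated logic modules select the non-zero points in the feature-map matrix and the convolution-filter matrix along the input-channel direction and pass the valid data to an array of digital signal processors (DSPs) for multiply-accumulate operations. On this basis, all related intermediate results pass through an adder tree to produce the final output feature-map points, while coarse-grained parallelism is applied along the feature-map width, height, and output-channel directions and the best design parameters are sought. Experiments on a Xilinx device show that the design reaches an overall performance of 678.2 GOPS on the VGG16 convolutional layers with a performance-to-power ratio of 69.45 GOPS/W, a considerable improvement in performance and power over FPGA-based dense-network and sparse-network accelerators.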
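
The zero-skipping step described in this abstract can be sketched in software as follows. This is a minimal model under assumed names (`sparse_mac`, `adder_tree`): it filters out operand pairs with a zero along the input-channel axis, multiplies the rest, and sums the products with a pairwise reduction that mimics an adder tree. It does not model the paper's actual dataflow or parallelism.

```python
def adder_tree(values):
    """Sum values by pairwise reduction, the way a hardware adder tree would."""
    if not values:
        return 0
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def sparse_mac(activations, weights):
    """Dot product along the input-channel axis that skips zero operands.

    Returns (result, number of multiplications actually issued).
    """
    # Keep only pairs where both the activation and the weight are non-zero.
    pairs = [(a, w) for a, w in zip(activations, weights) if a != 0 and w != 0]
    products = [a * w for a, w in pairs]
    return adder_tree(products), len(products)
```

For `activations = [3, 0, 2, 5, 0, 1]` and `weights = [0, 4, 2, 0, 7, -3]`, only two of the six multiplications are issued and the result equals the dense dot product.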

3.

Design of an energy-efficient bit-sparsity accelerator for convolutional neural networks

肖航许浩博王颖李佳骏王郁杰韩银和《计算机辅助设计与图形学学报》2023,(7):1122-1131

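To address the performance bottleneck of current bit-sparsity architectures, an energy-efficient bit-sparsity accelerator design is proposed. First, an activation encoding method and a corresponding circuit are proposed to increase the bit sparsity of convolutional neural networks; combined with a bit-serial circuit, zero bits of the activations are skipped at run time to accelerate network computation. Then, a column-sharing synchronization mechanism is proposed to solve the synchronization problem of bit-sparsity architectures, substantially improving their computing performance at a small area and power overhead. The energy efficiency of different bit-sparsity architectures on convolutional neural networks is evaluated in an SMIC 40 nm process at 1 GHz. Experimental results show that, compared with the non-sparse accelerator VAA and the bit-sparsity accelerator LS-PRA, the proposed accelerator AS-PRA improves energy efficiency by 544% and 179%, respectively.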
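
To make the notion of skipping zero bits concrete, here is a minimal sketch of a generic bit-serial multiplication: the weight is shifted and added once per set bit of an unsigned activation, so zero bits cost no cycle. The function name and the cycle count are illustrative assumptions; the paper's activation encoding and column-sharing synchronization are not modeled.

```python
def bit_serial_mul(activation, weight):
    """Multiply by adding a shifted weight once per set bit of the activation.

    The activation is assumed to be a non-negative integer.
    Returns (product, cycles), where cycles counts only the non-zero bits.
    """
    if activation < 0:
        raise ValueError("activation must be non-negative")
    acc, cycles, shift = 0, 0, 0
    while activation:
        if activation & 1:           # only set bits do useful work
            acc += weight << shift
            cycles += 1
        activation >>= 1
        shift += 1
    return acc, cycles
```

An activation of 10 (binary 1010) needs two cycles, while 255 (binary 11111111) needs eight, which is why raising bit sparsity shortens the computation.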

4.

FPGA accelerator architecture design for convolutional neural networks

李炳剑秦国轩朱少杰裴智慧《计算机科学与探索》2020,14(3):437-448

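With the rapid development of artificial intelligence, convolutional neural networks (CNNs) play an increasingly important role in many fields. After analyzing existing CNN models, a CNN accelerator based on a field-programmable gate array (FPGA) is designed. Parallel computation is implemented along four dimensions of the convolution operation; a parameterized architecture is proposed that, under three parameter settings, completes 512, 1024, and 2048 multiply-accumulate operations per clock cycle, respectively; an on-chip double-buffer structure is designed to reduce off-chip memory accesses while enabling effective data reuse; and pipelining is used to implement the complete single-layer computation of the network, improving efficiency. In comparative experiments against a CPU, a GPU, and related FPGA acceleration schemes, the proposed design reaches a computing speed of 560.2 GOP/s, 8.9 times that of an i7-6850K CPU. Its performance-to-power ratio is 3.0 times that of an NVIDIA GTX 1080Ti GPU. Compared with related work, the accelerator achieves a high performance-to-power ratio on mainstream CNNs while remaining general-purpose.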

5.

Accelerating sparse convolutional neural networks on a coarse-grained dataflow architecture

吴欣欣欧焱李文明王达张浩范东睿《计算机研究与发展》2021,58(7):1504-1517

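Convolutional neural networks (CNNs) achieve very good performance in image processing, speech recognition, natural language processing, and other fields. Large-scale neural network models often run into limits on computing and storage resources, and the emergence of sparse neural networks effectively eases these demands. Although existing domain-specific accelerators can process sparse networks effectively, they achieve high energy efficiency through tight coupling of algorithm and structure and thereby lose structural flexibility. A coarse-grained dataflow architecture can implement different neural network applications through flexible instruction scheduling. On this architecture, the regular computation of dense convolution lets different channels share the same set of instructions. In sparse networks, however, weight sparsity means that these instructions include invalid ones associated with zero values, and the existing instruction execution scheme cannot skip them automatically, so invalid computation results. In addition, when executing irregular sparse networks, existing instruction mapping methods leave the computing array with an unbalanced load. These problems hinder performance improvement for sparse networks. On the premise that different channels share one set of instructions, an instruction control unit is added, based on the data and instruction characteristics of sparse networks, to detect and skip instructions associated with zero-valued weights, and a load-balanced instruction mapping algorithm is used to resolve unbalanced instruction execution in sparse networks. Experiments show that, compared with dense networks, sparse networks achieve an average 1.55 times performance improvement and a 63.77% reduction in energy consumption. The design is also 2.39 times (AlexNet) and 2.28 times (VGG16) faster than a GPU (cuSparse), and 1.14 times (AlexNet) and 1.23 times (VGG16) faster than Cambricon-X, on sparse networks.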

6.

Convolutional network accelerator based on heterogeneous FPGA

周锡雄钟胜张伟俊王建辉《模式识别与人工智能》2019,32(10):927-935

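Neural-network-based methods usually involve a very large amount of computation, which limits their use in embedded scenarios. To address this problem, a convolutional network accelerator based on a heterogeneous field-programmable gate array is proposed. Sliding-window parallelism accelerates the convolution computation and can process the convolutions of different input and output channels simultaneously. An 8-bit fixed-point accelerator is designed in combination with network quantization to reduce the use of computing resources. Experiments show that the fixed-point accelerator is fast, consumes little power, and incurs little loss in algorithm performance.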
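
The abstract does not say which quantization scheme is used, so the following is only a sketch of one common option, symmetric linear quantization to signed 8-bit integers with a single scale factor. The helper names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [x * scale for x in q]
```

Each recovered value differs from the original by at most half a quantization step, which is the kind of small accuracy loss the abstract refers to.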

7.

Parallel acceleration design of convolutional neural networks based on FPGA

龚豪杰周海冯水春《计算机工程与设计》2022,43(7):1872-1878

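To improve the speed and energy efficiency of deep convolutional network algorithms running on resource- and power-constrained embedded platforms, a parallel convolution acceleration scheme based on a field-programmable gate array (FPGA) is proposed. Fusing the convolutional layer with the batch normalization (BN) layer reduces computational complexity; data tiling reduces on-chip memory consumption; data reuse and parallel computation raise computing speed and reduce system hardware overhead; and design space exploration finds the computing parallelism that best fits the hardware resource constraints. Experimental results show that, at an operating frequency of 100 MHz, the accelerator reaches a peak performance of 52.56 GFLOPS, 4.1 times the performance of a CPU, with energy consumption only 9.9% of that of a GPU, and its overall performance improves somewhat on other FPGA schemes.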
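
The fusion of a convolutional layer with batch normalization mentioned above follows from the fact that BN at inference time is an affine map, so it can be folded into the convolution's weights and bias. The sketch below shows this for a single output channel; the function names and the default `eps` are assumptions, not details taken from the paper.

```python
import math


def conv_point(x, weights, bias):
    """One output point of a convolution: a dot product plus a bias."""
    return sum(a * w for a, w in zip(x, weights)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization of a single value."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the convolution so that one layer replaces two."""
    s = gamma / math.sqrt(var + eps)      # per-channel scale
    return [w * s for w in weights], (bias - mean) * s + beta
```

After fusion, `conv_point(x, fused_w, fused_b)` gives the same value as `batch_norm(conv_point(x, w, b), ...)` up to floating-point rounding, so the separate BN computation disappears at inference time.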

8.

Research and design of a multi-data-stream parallel convolution acceleration engine

马佳利朱智强戴乐育郭松辉向建安《计算机工程与设计》2020,41(12):3557-3562

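To address the long run time and computational complexity of convolution in convolutional neural networks, the data routing of the convolution operation is analyzed, a multi-data-stream parallel convolution method is proposed, and a convolution acceleration engine is designed. Experimental verification on an FPGA shows that the design outputs correct convolution results; compared with existing accelerator designs, it requires 30.6% fewer registers, saving logic resources, shortens the latency caused by data transfer, and raises computing speed by 7.37%, effectively accelerating convolution.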

9.

Hardware acceleration method for lightweight convolutional neural networks

吕文浩支小莉童维勤《计算机工程与设计》2024,(3):699-706

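To improve the resource utilization and inference speed of lightweight convolutional neural networks on hardware platforms, a lightweight CNN accelerator for FPGA platforms is proposed based on software-hardware co-optimization, with a hardware architecture designed specifically for the characteristics of the network structure. A unified convolutional-layer computing unit is designed in combination with a multi-level parallel strategy. To reduce model storage cost and raise accelerator throughput, a selective shift quantization scheme based on a differentiable threshold is proposed, allowing the computing unit to perform computation in a hardware-friendly form. Experimental results show that a MobileNetV2 accelerator deployed on an Arria 10 FPGA platform reaches an inference speed of 311 fps, a speedup of about 9.3 times over the CPU version and about 3 times over the GPU version. In terms of throughput, the accelerator achieves 98.62 GOPS.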
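
The abstract gives no details of its selective, threshold-based scheme, so the sketch below only illustrates the general idea behind shift quantization: a weight rounded to a signed power of two can be applied to an integer activation with a bit shift instead of a multiplier. Both helper names are hypothetical, and the selection and differentiable-threshold parts are not modeled.

```python
import math


def to_power_of_two(w):
    """Return (sign, exponent) so that sign * 2**exponent approximates w."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    # Round in the log domain to the nearest integer exponent.
    return sign, round(math.log2(abs(w)))


def shift_multiply(x, sign, exponent):
    """Multiply integer x by sign * 2**exponent using shifts only."""
    if sign == 0:
        return 0
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted
```

For instance, a weight of 0.24 maps to 2^-2, so multiplying an activation of 40 by it becomes a right shift by two bits.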

10.

Hardware architecture design of convolutional neural networks based on FPGA dynamic reconfiguration

《微型机与应用》2019,(3):77-81

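To address resource limits in the hardware implementation of convolutional neural networks, a CNN accelerator design based on FPGA dynamic reconfiguration is proposed. First, the overall parallel strategy and VLSI architecture of the accelerator are designed, and the functional modules of the network are pipelined. Second, the accelerator is designed for dynamic reconfiguration: the dynamically reconfigurable region and the functional partitioning of its modules are established, the configuration files are stored in BPI Flash, and the reconfigurable region is configured dynamically by reading the configuration files through the internal configuration port. Experimental results show that, for the LeNet-5 handwriting recognition network, the accelerator based on dynamic reconfiguration uses 44%, 27.8%, and 71% fewer Slice LUTs, Slice Registers, and DSP resources, respectively, than the corresponding static design. Compared with an implementation on a software platform, system execution time is greatly reduced. However, because of the bandwidth limit of the internal configuration port, the reconfiguration time lengthens the execution time of the whole network.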

11.

FAQ-CNN: a scalable acceleration framework on embedded FPGAs for quantized convolutional neural networks

谢坤鹏卢冶靳宗明刘义情龚成陈新伟李涛《计算机研究与发展》2022,59(7):1409-1427

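Quantizing convolutional neural network (CNN) models can effectively compress model size and improve CNN computing efficiency. However, accelerator designs for CNN quantization algorithms usually face problems such as differing algorithms, poor reuse of code modules, inefficient data exchange, and underused resources. To address this, FAQ-CNN, an embedded FPGA acceleration framework for quantized CNNs, is proposed; it jointly optimizes computation, communication, and storage, and supports rapid deployment of quantized CNN models in the form of a software tool. First, components oriented to quantization algorithms are designed that separate a quantization algorithm's own arithmetic operations from its numerical mapping process; optimization techniques such as operator fusion, double buffering, and pipelining are combined to improve parallel execution efficiency within CNN inference tasks. Then, hierarchical encoding and bit-width-independent encoding rules, together with a parallel decoding method, are proposed to support efficient batch transfer and parallel computation of low-bit-width data. Finally, a resource allocation optimization model is built and converted into an integer nonlinear programming problem, which is solved with a heuristic pruning strategy to shrink the design space. Experimental results show that FAQ-CNN can implement various quantized CNN accelerators efficiently and flexibly. With 16-bit activations and weights, the computing performance of the FAQ-CNN accelerator is 1.4 times that of Caffeine; with 8-bit activations and weights, FAQ-CNN reaches up to 1.23 TOPS.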

12.

Local similarity prediction recommendation model based on improved CNN

吴国栋宋福根涂立静史明哲《计算机工程与科学》2019,41(6):1071-1077

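To alleviate data sparsity in recommender systems, LSPCNN, a local similarity prediction recommendation model based on an improved convolutional neural network (CNN), is proposed; it exploits the CNN's strength at capturing local features and adds an adjustment layer. The new model iteratively adjusts the initial user-item rating matrix so that user interest preferences take on local features, then uses a CNN to predict the missing ratings and thus deliver personalized recommendations. Experimental results show that, across different levels of data sparsity, the MAE of LSPCNN is on average 4% lower than that of traditional recommendation methods, effectively alleviating data sparsity and improving recommender system performance.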

13.

Computation optimization and performance analysis of convolutional neural networks based on linear systolic arrays


刘勤让刘崇阳周俊王孝龙《网络与信息安全学报》2018,4(12):16-24

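Because most convolutional neural network (CNN) accelerator designs on FPGAs fail to exploit sparsity effectively, two improved CNN computation optimization schemes based on linear systolic arrays are proposed with bandwidth and energy consumption in mind. First, convolution is converted into matrix multiplication to exploit sparsity. Second, to solve the large I/O demand of the conventional parallel matrix multiplier, the design is improved with a linear systolic array. Finally, the advantages and disadvantages of the conventional parallel matrix multiplier and the two improved linear systolic arrays for CNN acceleration are compared and analyzed. Theoretical proof and analysis show that, compared with the parallel matrix multiplier, both improved linear systolic arrays make full use of sparsity and have the advantages of lower energy consumption and lower I/O bandwidth usage.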
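
The first step in this abstract, rewriting convolution as matrix multiplication, is commonly done with an "im2col" unrolling, sketched below for a single-channel image and one kernel. The function names are hypothetical, and iterating only over non-zero kernel entries is an assumed, simplified stand-in for how sparsity could be exploited; the systolic-array designs themselves are not modeled. As in most CNN code, the kernel is applied without flipping (cross-correlation).

```python
def im2col(image, k):
    """Unroll every k x k window of a 2D image (list of rows) into one row."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1) for j in range(w - k + 1)]


def conv_as_matmul(image, kernel):
    """Convolution as (unrolled windows) x (flattened kernel), skipping zero weights."""
    k = len(kernel)
    flat = [v for row in kernel for v in row]
    nonzero = [(idx, v) for idx, v in enumerate(flat) if v != 0]
    out = [sum(window[idx] * v for idx, v in nonzero) for window in im2col(image, k)]
    out_w = len(image[0]) - k + 1
    # Reshape the flat result back into output rows.
    return [out[r * out_w:(r + 1) * out_w] for r in range(len(image) - k + 1)]
```

With a 3 x 3 image and the 2 x 2 kernel `[[1, 0], [0, -1]]`, each output needs only two multiplications instead of four because half of the kernel is zero.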

14.

1D convolutional neural networks for chart pattern classification in financial time series

Liu Liying Si Yain-Whar 《The Journal of supercomputing》2022,78(12):14191-14214

This paper proposes a novel deep learning-based approach for financial chart patterns classification. Convolutional neural networks (CNNs) have made notable achievements in image recognition and computer vision applications. These networks are usually based on two-dimensional convolutional neural networks (2D CNNs). In this paper, we describe the design and implementation of one-dimensional convolutional neural networks (1D CNNs) for the classification of chart patterns from financial time series. The proposed 1D CNN model is compared against support vector machine, extreme learning machine, long short-term memory, rule-based and dynamic time warping. Experimental results on synthetic datasets reveal that the accuracy of 1D CNN is highest among all the methods evaluated. Results on real datasets also reveal that chart patterns identified by 1D CNN are also the most recognized instances when they are compared to those classified by other methods.


15.

Zernike-CNNs for image preprocessing and classification in printed register detection

Wang Sheng Lv Lin-Tao Yang Hong-Cai Lu Di 《Multimedia Tools and Applications》2021,80(21-23):32409-32421

In register detection for the printing field, a new approach based on Zernike-CNNs is proposed. The edge features of the image are extracted by Zernike moments (ZMs), and a recursive algorithm for ZMs called the Kintner method is derived. Improved convolutional neural networks (CNNs) are investigated to improve classification accuracy. Based on the classic convolutional neural network (CNN), the improved CNNs adopt a parallel CNN to enhance local features and an auxiliary classification part to modify the classification-layer weights. A printed image is trained with 7 × 400 samples and tested with 7 × 100 samples, and the method in this paper is then compared with other methods. In image processing, Zernike is compared with the Sobel method, the Laplacian of Gaussian (LoG) method, the Smallest Univalue Segment Assimilating Nucleus (SUSAN) method, the Finite Impulse Response (FIR) method, and the Multi-scale Morphological Gradient (MMG) method. In image classification, the improved CNNs are compared with the classical CNN. The experimental results show that Zernike-CNNs have the best performance: the mean square error (MSE) of the training samples reaches 0.0143, and the detection accuracy on training samples and test samples reaches 91.43% and 94.85%, respectively. The experiments reveal that Zernike-CNNs are a feasible approach for register detection.


16.

A diversity measure for ensembles of convolutional neural networks


汤礼颖贺利乐何林屈东东《智能系统学报》2021,16(6):1030-1038

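Diversity among classifier models is an important performance indicator for classifier ensembles. Most current diversity measures are computed from the 0/1 outputs (i.e., Oracle outputs) of the base classifier models; for the probability-vector outputs of convolutional neural networks, the outputs must still be converted to Oracle form before being measured, which fails to make full use of the rich information in the probability vectors. To address this problem, a diversity measure for ensembles based on the probability-vector outputs of convolutional neural networks is proposed, exploiting the output characteristics of multi-class CNN models; several CNN base models with different structures are built and evaluated on the CIFAR-10 and CIFAR-100 datasets. Experimental results show that, compared with the double-fault measure, the disagreement measure, and the Q-statistic diversity measure, the proposed method better reflects the diversity among models and offers better guidance for selecting models for an ensemble.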
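
For reference, the three baseline measures named in this abstract can be computed from the Oracle (0/1) outputs of two classifiers as sketched below, using their usual pairwise definitions. The function name and return layout are assumptions, and the paper's own probability-vector measure is not shown because the abstract does not define it.

```python
def oracle_diversity(a, b):
    """Pairwise diversity of two classifiers from 0/1 (wrong/correct) outputs.

    Returns (disagreement, double_fault, q_statistic).
    """
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # both correct
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)  # both wrong
    n10 = sum(1 for x, y in zip(a, b) if x and not y)      # only a correct
    n01 = sum(1 for x, y in zip(a, b) if not x and y)      # only b correct
    n = len(a)
    disagreement = (n10 + n01) / n
    double_fault = n00 / n
    denom = n11 * n00 + n01 * n10
    # Q is undefined when the denominator is zero.
    q = (n11 * n00 - n01 * n10) / denom if denom else float("nan")
    return disagreement, double_fault, q
```

All three values depend only on whether each prediction was right or wrong, which is the information loss the abstract's probability-vector approach aims to avoid.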

17.

Multi-label classification algorithm based on an improved convolutional neural network


余鹰王乐为吴新念伍国华张远健《智能系统学报》2019,14(3):566-574

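Good feature representation is key to improving model performance. In multi-label learning, however, feature representation still relies on manual design; the extracted features are not very abstract and carry insufficient discriminative information. To address this problem, ML_DCCNN, a multi-label classification model based on convolutional neural networks, is proposed; it uses the strong feature extraction ability of CNNs to learn automatically the features that capture the essence of the data. To address the fact that deep CNNs predict accurately but are costly to train, ML_DCCNN uses transfer learning to shorten training time and modifies the CNN's fully connected layer by introducing dual-channel neurons that reduce the number of fully connected parameters. Experiments show that, compared with traditional multi-label classification algorithms and existing deep-learning-based multi-label classification models, ML_DCCNN maintains high classification accuracy while effectively improving classification efficiency, and has both theoretical and practical value.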

18.

An algorithm/hardware co-optimized method to accelerate CNNs with compressed convolutional weights on FPGA

Jiangwei Shang Zhan Zhang Kun Zhang Chuanyou Li Lei Qian Hongwei Liu 《Concurrency and Computation》2024,36(11):e8011

Convolutional neural networks (CNNs) have shown remarkable advantages in a wide range of domains at the expense of huge parameters and computations. Modern CNNs still tend to be more complex and larger to achieve better inference accuracy. However, the complex and large structures of CNNs could slow down the inference speed. Recently, compressing the convolutional weights to be sparse by pruning the unimportant parameters has been demonstrated as an efficient way to reduce the computations of CNNs. On the other hand, field-programmable gate arrays (FPGAs) have been a popular hardware platform to accelerate CNN inference. In this paper, we propose an algorithm/hardware co-optimized method for accelerating CNN inference on FPGAs. For the algorithm, we take advantage of unstructured and structured parameter sparsifying methods to achieve high sparsity and keep the regularity of convolutional weights. Correspondingly, hardware-friendly index representations of sparse convolutional weights are proposed. For the hardware architecture, we propose row-wise input-stationary dataflow, which is tightly coupled with the algorithm. A row-wise computing engine (RConv Engine) is proposed, which is based on the dataflow. Inside the RConv Engine, the scalar-vector structure is applied to implement the basic processing elements (PEs). To flexibly calculate the feature map with various sizes, the PEs are organized in a 2D structure with two work modes. The experimental results demonstrate that our co-optimized method implements high sparsity of convolutional weights, and the computing engine achieves high computation efficiency. Compared with other accelerators, our co-optimized method implements a 10.9× speedup on FPS at most with the highest sparsity of convolutional weights and negligible accuracy loss.

19.

Feature Analysis of Unsupervised Learning for Multi-task Classification Using Convolutional Neural Network

Jonghong Kim Waqas Bukhari Minho Lee 《Neural Processing Letters》2018,47(3):783-797

This study analyzes the characteristics of unsupervised feature learning using a convolutional neural network (CNN) to investigate its efficiency for multi-task classification and compare it to supervised learning features. We keep the conventional CNN structure and introduce modifications into the convolutional auto-encoder design to accommodate a subsampling layer and make a fair comparison. Moreover, we introduce non-maximum suppression and dropout for a better feature extraction and to impose sparsity constraints. The experimental results indicate the effectiveness of our sparsity constraints. We also analyze the efficiency of unsupervised learning features using the t-SNE and variance ratio. The experimental results show that the feature representation obtained in unsupervised learning is more advantageous for multi-task learning than that obtained in supervised learning.

20.

Adaptive Convolutional Neural Network and Its Application in Face Recognition (cited by: 1; self-citations: 0; citations by others: 1)

Yuanyuan Zhang Dong Zhao Jiande Sun Guofeng Zou Wentao Li 《Neural Processing Letters》2016,43(2):389-399

The convolutional neural network (CNN) has more and more applications in image recognition. However, the structure of a CNN is often determined after a performance comparison among CNNs with different structures, which impedes the further development of CNNs. In this paper, an adaptive convolutional neural network (ACNN) is proposed, which can determine the structure of the CNN without performance comparison. The final structure of the ACNN is determined by automatic expansion according to the performance requirement. First, the network is initialized with a one-branch structure. The system average error and the recognition rate on the training samples are set to control the expansion of the structure. That is, the network is extended by global expansion until the system average error meets the requirement; once it does, the local network is expanded until the recognition rate meets the requirement. Finally, the structure of the CNN is determined automatically. In addition, incremental learning for new samples can be achieved by adding new branches while keeping the original network unchanged. The experimental results of face recognition on the ORL face database show that the ACNN achieves a better tradeoff between training time and recognition rate.