Funding: National Natural Science Foundation of China (61574099); Science and Technology Development Fund of Tianjin Transportation Commission (2017b-40)
Received: 2018-12-25

A highly parallel design method for convolutional neural network accelerators
XU Xin, LIU Qiang, WANG Shaojun. A highly parallel design method for convolutional neural network accelerators[J]. Journal of Harbin Institute of Technology, 2020, 52(4): 31-37.
Authors: XU Xin  LIU Qiang  WANG Shaojun
Affiliation: Key Laboratory of Imaging and Sensing Microelectronic Technology, Tianjin University, Tianjin 300072, China; School of Electronic and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
Abstract: To achieve highly parallel data transmission and computation in convolutional neural network acceleration and to generate efficient hardware accelerator designs, a hardware design and exploration method based on data alignment and multi-filter parallel computing was proposed. To improve data transmission and computation speed and to adapt to various input image sizes, the method first aligns the data according to the input image size, achieving highly parallel transmission and computation at the data level. The method also employs multi-filter parallel computing, so that different filters convolve the input image simultaneously, achieving parallelism at the filter level. Based on this method, mathematical models of hardware resources and performance were formulated and solved numerically to obtain a neural network hardware architecture co-optimized for performance and resources. The proposed design method was applied to the single shot multibox detector (SSD) network; results show that the accelerator, implemented on a Xilinx Zynq XC7Z045 with 16-bit fixed-point arithmetic at a 175 MHz clock frequency, achieved a throughput of 44.59 frames/s, a board power consumption of 9.72 W, and an energy efficiency of 31.54 GOP/(s·W). The accelerator consumed 85.1% and 93.9% less power than the central processing unit (CPU) and graphics processing unit (GPU) implementations of the same network, respectively. Compared with existing designs, the energy efficiency of the proposed design increased by 20% to 60%, making the method well suited to low-power embedded applications.
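The filter-level parallelism described in the abstract — several filters consuming the same input window in the same cycle — can be sketched in software as one dot product per filter over a shared window. This is an illustrative toy (2x2 patch, made-up weights), not the paper's actual architecture; on the FPGA each dot product would map to its own parallel MAC array.

```python
def conv_window_all_filters(window, filters):
    """Apply every filter to one shared input window.

    window: a flattened K*K list of pixel values; filters: list of flattened
    K*K weight lists. In hardware, each dot product runs in a separate
    parallel MAC array, so all filters finish in the same cycle.
    """
    return [sum(w * x for w, x in zip(f, window)) for f in filters]

window = [1, 2, 3, 4]        # a flattened 2x2 input patch (toy example)
filters = [[1, 0, 0, 0],     # filter 0 picks out the first pixel
           [0, 0, 0, 1],     # filter 1 picks out the last pixel
           [1, 1, 1, 1]]     # filter 2 sums the whole patch
print(conv_window_all_filters(window, filters))  # -> [1, 4, 10]
```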
Keywords: field programmable gate array (FPGA)  convolutional neural network  parallel processing  hardware structure optimization  single shot multibox detector (SSD) network
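The resource/performance co-optimization mentioned in the abstract can be illustrated with a toy design-space exploration: enumerate candidate data-level and filter-level parallelism factors, estimate resource usage and cycle count from an analytical model, and keep the fastest configuration that fits the device. The cost model below (one DSP per parallel multiply, an idealized cycle count ignoring memory stalls) and the layer/budget numbers are assumptions for illustration, not the paper's actual equations.

```python
import math

def explore(dsp_budget, layer, candidates):
    """Pick unrolling factors (Tn, Tm) minimizing cycles under a DSP budget.

    layer: dict with N (input channels), M (filters), R, C (output size),
    K (kernel size). Tn = data-level (input-channel) parallelism,
    Tm = filter-level parallelism.
    """
    best = None
    for Tn, Tm in candidates:
        dsps = Tn * Tm                 # one multiplier per parallel MAC (assumed)
        if dsps > dsp_budget:
            continue                   # configuration does not fit the device
        # Idealized cycles: remaining serial tile loops x MACs per output window.
        cycles = (math.ceil(layer["N"] / Tn) * math.ceil(layer["M"] / Tm)
                  * layer["R"] * layer["C"] * layer["K"] ** 2)
        if best is None or cycles < best[2]:
            best = (Tn, Tm, cycles, dsps)
    return best

# Hypothetical SSD-like layer and device budget (not the paper's numbers).
layer = {"N": 64, "M": 128, "R": 38, "C": 38, "K": 3}
print(explore(512, layer, [(n, m) for n in (2, 4, 8, 16) for m in (4, 8, 16, 32)]))
# -> (16, 32, 207936, 512)
```

A real exploration would also model BRAM and bandwidth constraints, but the structure — enumerate, cost, prune against a budget — is the same.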
This article is indexed in Wanfang Data and other databases.