Funding: National Natural Science Foundation of China (61672526); University Pre-research Fund (ZK17 03 06)
Received: 2018-11-23
Revised: 2019-06-25

A configurable convolutional neural network accelerator based on tiling dataflow
LI Yihuang, MA Sheng, GUO Yang, CHEN Guilin, XU Rui. A configurable convolutional neural network accelerator based on tiling dataflow[J]. Computer Engineering & Science, 2019, 41(6): 963-972.
Authors: LI Yihuang, MA Sheng, GUO Yang, CHEN Guilin, XU Rui
Affiliation: School of Computer, National University of Defense Technology, Changsha 410073, China
Abstract: Convolutional neural networks (CNNs) are widely recognized as the leading algorithms for deep learning, and they are broadly applied in image recognition, machine translation and advertising recommendation. As neural networks grow in scale, with ever more neurons and synapses, using dedicated accelerator hardware to exploit CNN parallelism has become a popular choice. In hardware design, the classic tiling dataflow achieves high performance, but its processing element (PE) utilization is very low. As deep learning applications demand higher hardware performance, accelerators face increasingly strict PE utilization requirements. To achieve higher utilization on the tiling dataflow, the scheduling order can be changed to exploit parallelism across input feature maps and output channels. However, as neural network workloads demand more hardware performance, the PE array inevitably grows larger; beyond a certain size, a single parallelization scheme causes utilization to gradually decline. The hardware therefore needs to exploit additional dimensions of neural network parallelism to keep PEs from idling. At the same time, adapting to different network structures requires the hardware array to be configurable for neural network operations, but full configurability greatly increases hardware overhead and data-scheduling difficulty. We propose a neural network accelerator with configurable parallelism based on the tiling dataflow. To reduce hardware complexity, we propose a partial-configuration technique, which improves PE utilization for large arrays while keeping the extra hardware overhead as small as possible. When the PE array size exceeds 512, utilization is maintained at 82% to 90% on average, and accelerator performance scales almost linearly with the number of PEs.
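The utilization argument in the abstract can be illustrated with a minimal back-of-the-envelope model. This is not the paper's simulator: the batch and output-channel counts below are assumed for illustration, and the model simply assumes that each scheduling pass maps as many independent (feature map, output channel) computations onto the PE array as will fit.

```python
import math

def pe_utilization(array_size, batch, out_channels):
    """Estimate PE utilization when parallelism comes only from the
    batch (input feature maps) and output-channel dimensions.
    Hypothetical model for illustration, not the paper's simulator."""
    parallel_work = batch * out_channels             # independent computations per step
    passes = math.ceil(parallel_work / array_size)   # sequential passes over the array
    return parallel_work / (passes * array_size)     # busy PE-slots / provisioned PE-slots

# Utilization falls once the array outgrows the available parallelism:
for p in (128, 256, 512, 1024):
    print(p, round(pe_utilization(p, batch=4, out_channels=96), 3))
```

With batch 4 and 96 output channels there are only 384 independent units of work, so a 1024-PE array under this single parallelization scheme leaves most PEs idle; exposing further parallelism dimensions (as the proposed configurable design does) is what recovers utilization.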
Keywords: CNN; tiling dataflow; configurable; PE utilization; parallelism
Indexed by Wanfang Data and other databases.