Efficient Hardware Optimization Strategies for Deep Neural Network Acceleration Chips
Citation: Meng ZHANG, Jingwei ZHANG, Guoqing LI, Ruixia WU, Xiaoyang ZENG. Efficient Hardware Optimization Strategies for Deep Neural Network Acceleration Chips[J]. Journal of Electronics & Information Technology, 2021, 43(6): 1510-1517.
Authors: Meng ZHANG, Jingwei ZHANG, Guoqing LI, Ruixia WU, Xiaoyang ZENG
Affiliation: 1. National ASIC Engineering Center, School of Electronic Science and Engineering, Southeast University, Nanjing 210096, China; 2. State Key Laboratory of ASIC and System, Fudan University, Shanghai 200433, China
Foundation Items: The National Key Research and Development Program of China (2018YFB2202703), The Natural Science Foundation of Jiangsu Province (BK20201145)
Abstract: Lightweight neural networks deployed on low-power platforms are effective solutions in Artificial Intelligence (AI) and Internet of Things (IoT) domains such as Unmanned Aerial Vehicle (UAV) detection and autonomous driving. However, with limited resources, it is very challenging to build a Deep Neural Network (DNN) accelerator that achieves both high accuracy and low latency. This paper proposes a series of efficient hardware optimization strategies for this problem: a stackable shared Processing Engine (PE) is constructed to balance the inconsistent data-reuse and memory-access patterns of different convolutions; adjustable loop parallelism and channel augmentation are proposed to effectively widen the access bandwidth between the accelerator and external memory and to improve the computational efficiency of shallow DNN layers; and a preloaded workflow is applied to raise the overall parallelism of the heterogeneous system. Verified on a Xilinx Ultra96 V2 board, these hardware optimization strategies effectively improve DNN acceleration chip designs such as iSmart3-SkyNet and SkrSkr-SkyNet. The results show that the optimized accelerator processes 78.576 frames per second while consuming 0.068 J per image.
Keywords: Deep neural network; Object detection; Neural network accelerator; Low power; Hardware optimization
Received: 2021-01-04

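The channel-augmentation idea mentioned in the abstract can be illustrated with a space-to-depth repacking, a common way to widen shallow convolution layers: folding spatial blocks into channels lets a layer with few input channels (e.g. an RGB image) fill more parallel PE lanes per memory burst. This is a hedged NumPy sketch of the general technique; the exact repacking scheme used by the paper's accelerator is not specified here.

```python
import numpy as np

def channel_augment(x, r=2):
    """Fold each r x r spatial block of a (H, W, C) feature map into the
    channel dimension, producing (H//r, W//r, r*r*C). A shallow layer with
    C=3 input channels then presents 12 channels to the accelerator,
    making better use of the parallel datapath and DMA bandwidth."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    # (h, w, c) -> (h//r, r, w//r, r, c) -> (h//r, w//r, r, r, c) -> merge
    x = x.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h // r, w // r, r * r * c)

img = np.arange(4 * 4 * 3).reshape(4, 4, 3)  # toy 4x4 RGB-like input
out = channel_augment(img)
print(out.shape)  # (2, 2, 12)
```

Each output pixel now carries a 2x2 neighborhood of the input, so the convolution over it sees the same receptive field while the hardware streams four times as many channels per access.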
