Similar Literature
20 similar documents found
1.
To meet the low-latency, small-footprint, and high-throughput requirements that practical applications place on convolutional neural network (CNN) inference, an accelerator was designed with the following optimizations: to cope with limited external memory bandwidth, loop tiling factors are determined through design space exploration to maximize data reuse; to exploit the high computational density of CNNs, loop unrolling is used to extract four forms of computational parallelism; and memory pooling, ping-pong buffering, and dynamic data quantization are used to manage on-chip and off-chip storage. The accelerator generation flow is packaged as a CNN acceleration framework. An accelerator generated with this framework was used to implement AlexNet; simulation results show a peak performance of up to 1,493.4 GOPS, up to 24.2 times that of the compared works, with a DSP efficiency at least 1.2 times higher than the other design methods, enabling rapid CNN deployment with high development efficiency and excellent acceleration performance.
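As a rough illustration of the loop tiling and unrolling ideas mentioned in this abstract (not the paper's actual design), the following C sketch tiles the output-channel, input-channel, row, and column loops of a convolution and leaves inner loops that a hardware flow could unroll onto parallel multipliers; the tile factors TM/TN/TR/TC and the data layout are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Tiled, partially unrolled convolution written as plain C. Tile factors
 * TM/TN/TR/TC, the CHW-style layout, stride 1, and no padding are
 * illustrative assumptions, not the paper's design-space-exploration
 * results. out is assumed zero-initialized; H >= R+K-1, W >= C+K-1.      */
#define TM 4   /* output-channel tile */
#define TN 4   /* input-channel tile  */
#define TR 8   /* output-row tile     */
#define TC 8   /* output-column tile  */

void conv_tiled(const float *in,  size_t N, size_t H, size_t W,  /* N x H x W     */
                const float *wgt, size_t M, size_t K,            /* M x N x K x K */
                float *out,       size_t R, size_t C)            /* M x R x C     */
{
    /* Outer loops walk over tiles; in hardware each tile's inputs and
     * weights are fetched into on-chip buffers once and reused.          */
    for (size_t mo = 0; mo < M; mo += TM)
    for (size_t no = 0; no < N; no += TN)
    for (size_t ro = 0; ro < R; ro += TR)
    for (size_t co = 0; co < C; co += TC)
        /* Inner tile loops correspond to the dimensions a hardware flow
         * would unroll onto parallel multipliers.                        */
        for (size_t m = mo; m < mo + TM && m < M; m++)
        for (size_t n = no; n < no + TN && n < N; n++)
        for (size_t r = ro; r < ro + TR && r < R; r++)
        for (size_t c = co; c < co + TC && c < C; c++)
            for (size_t kh = 0; kh < K; kh++)
            for (size_t kw = 0; kw < K; kw++)
                out[(m * R + r) * C + c] +=
                    wgt[((m * N + n) * K + kh) * K + kw] *
                    in[(n * H + r + kh) * W + (c + kw)];
}
```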

2.
余成宇, 李志远, 毛文宇, 鲁华祥. 《智能系统学报》, 2020, 15(2): 323-333
Implementing convolutional neural network computation in hardware is difficult. Most previous CNN accelerator designs focused on computational performance and bandwidth bottlenecks while ignoring the significance of CNN sparsity for accelerator design, and the few recent designs that do exploit sparsity often struggle to balance computational flexibility, parallel efficiency, and resource overhead at the same time. This paper first compares the impact of different parallel unrolling schemes on sparsity exploitation and analyzes different ways of exploiting it, then proposes a parallel unrolling scheme that accelerates CNN computation by exploiting activation sparsity while achieving higher parallel efficiency and lower extra resource overhead than other designs in the field, and finally completes the design of such a CNN accelerator and implements it on an FPGA. The results show that, running VGG-16 on the ImageNet dataset, the sparse CNN accelerator built with this unrolling scheme improves convolution performance by 108.8% and overall performance by 164.6% compared with a dense-network design on the same device, a clear performance advantage.
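The activation-sparsity idea can be illustrated with a minimal software sketch (not the paper's architecture): zero activations are skipped entirely, so multiply-accumulate work scales with the number of nonzeros. The fully connected layout and loop order below are assumptions for illustration.

```c
#include <stddef.h>

/* Sketch of activation-sparsity exploitation on a fully connected layer:
 * only nonzero activations are broadcast to the multipliers, so MAC work
 * scales with the number of nonzeros instead of the layer size. The
 * row-major weight layout and loop order are illustrative assumptions.   */
void sparse_fc(const float *act, size_t n_in,
               const float *wgt,            /* n_out x n_in, row-major    */
               float *out, size_t n_out)
{
    for (size_t o = 0; o < n_out; o++)
        out[o] = 0.0f;

    for (size_t i = 0; i < n_in; i++) {
        float a = act[i];
        if (a == 0.0f)
            continue;                        /* skip ineffectual work     */
        /* In hardware this inner loop is unrolled across parallel MAC
         * units that all share the same nonzero activation.              */
        for (size_t o = 0; o < n_out; o++)
            out[o] += a * wgt[o * n_in + i];
    }
}
```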

3.
A low-power depthwise separable convolution acceleration core based on FPGA is designed. Exploiting what pointwise (PW) and depthwise (DW) convolution have in common, a fixed multiplier array implements both convolution structures by changing the feature and weight input data streams, maximizing DSP utilization. To handle possible sign-bit overflow in 8-bit asymmetric quantization, the dual-multiplier structure is repackaged with the sign bit processed separately. A 7-stage intra-layer pipeline sustains per-cycle data-processing parallelism. The acceleration structure was deployed on a Zynq UltraScale+ FPGA. Experiments show that it improves network inference speed while reducing dependence on on-chip resources and overall power: native MobileNetV2 achieves an average throughput of 130.6 GOPS on the proposed FPGA accelerator with an overall power of only 4.1 W, meeting real-time edge-computing requirements. Compared with other hardware platforms, energy efficiency is markedly improved, and compared with accelerators of the same type on FPGAs, it has advantages in performance density (GOPS/LUT), power efficiency (GOPS/W), and DSP efficiency (GOPS/DSP).
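For readers unfamiliar with the two convolution types the core unifies, here is a minimal software sketch of depthwise (DW) followed by pointwise (PW) convolution; it is not the paper's multiplier-array datapath, and the CHW layout, stride 1, and lack of padding are assumptions.

```c
#include <stddef.h>

/* Depthwise (DW) then pointwise (PW) convolution, written as plain C to
 * show that both stages reduce to multiply-accumulate over different
 * index patterns, which is why one fixed multiplier array can serve both
 * by rerouting the feature/weight streams. Layout assumptions: CHW,
 * stride 1, no padding, H >= R+K-1 and W >= S+K-1; all names illustrative. */
void depthwise_conv(const float *in, size_t C, size_t H, size_t W,
                    const float *dw_wgt, size_t K,       /* C x K x K */
                    float *mid, size_t R, size_t S)      /* C x R x S */
{
    for (size_t c = 0; c < C; c++)
        for (size_t r = 0; r < R; r++)
            for (size_t s = 0; s < S; s++) {
                float acc = 0.0f;
                for (size_t kh = 0; kh < K; kh++)
                    for (size_t kw = 0; kw < K; kw++)
                        acc += dw_wgt[(c * K + kh) * K + kw] *
                               in[(c * H + r + kh) * W + (s + kw)];
                mid[(c * R + r) * S + s] = acc;
            }
}

void pointwise_conv(const float *mid, size_t C, size_t R, size_t S,
                    const float *pw_wgt,                 /* M x C     */
                    float *out, size_t M)                /* M x R x S */
{
    for (size_t m = 0; m < M; m++)
        for (size_t r = 0; r < R; r++)
            for (size_t s = 0; s < S; s++) {
                float acc = 0.0f;
                for (size_t c = 0; c < C; c++)
                    acc += pw_wgt[m * C + c] * mid[(c * R + r) * S + s];
                out[(m * R + r) * S + s] = acc;
            }
}
```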

4.
吴健凤, 郑博文, 聂一, 柴志雷. 《计算机工程》, 2021, 47(12): 147-155, 162
In domains such as digital currency, blockchain, and cloud data encryption, traditional software-based encryption and decryption suffers from slow computation, host resource consumption, and high power, while field-programmable gate array (FPGA) crypto systems implemented in Verilog/VHDL suffer from long development cycles and difficult maintenance and upgrades. Targeting the 3DES algorithm, an OpenCL-based FPGA accelerator design is proposed. A pipelined parallel structure with 48 rounds of iteration is designed; in the data transfer module, data storage adjustment and data bit-width improvement strategies raise the kernel's effective bandwidth utilization; in the encryption module, instruction-stream optimization forms a pipelined parallel architecture; and kernel vectorization and compute-unit replication further improve kernel performance. Experimental results show that on an Intel Stratix 10 GX2800 the accelerator achieves a throughput of 111.801 Gb/s, a 372x performance and 644x energy-efficiency improvement over an Intel Core i7-9700 CPU, and a 20% performance and 9x energy-efficiency improvement over an Nvidia GeForce GTX 1080 Ti GPU.
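The 48-round pipelined structure can be sketched as a fixed-trip-count loop that an OpenCL/HLS compiler is able to fully unroll into a 48-stage pipeline. The round function below is a deliberately trivial stand-in rather than a real DES round, and the flat 48-entry round-key layout is assumed for illustration.

```c
#include <stdint.h>

/* Sketch of the 48-round pipelined structure (3DES = 3 x 16 DES rounds).
 * des_round() below is a deliberately trivial stand-in, NOT a real DES
 * round, and the flat 48-entry round-key array is an assumed layout.
 * The point: a fixed trip count lets an OpenCL/HLS compiler fully unroll
 * the loop into a 48-stage pipeline that accepts one block per cycle.    */
typedef struct { uint32_t left, right; } block_t;

static block_t des_round(block_t b, uint64_t round_key)   /* stand-in */
{
    uint32_t f = (b.right ^ (uint32_t)round_key) + (uint32_t)(round_key >> 32);
    block_t out = { b.right, b.left ^ f };
    return out;
}

block_t tdes_encrypt_block(block_t b, const uint64_t round_keys[48])
{
    /* In the OpenCL/HLS kernel this loop would carry an unroll pragma,
     * turning each iteration into a dedicated pipeline stage.            */
    for (int r = 0; r < 48; r++)
        b = des_round(b, round_keys[r]);
    return b;
}
```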

5.
The availability of huge structured and unstructured data, advanced high-density memory, and high-performance computing machines has provided a strong push for development in the artificial intelligence (AI) and machine learning (ML) domains. AI and machine learning have rekindled the hope of efficiently solving complex problems that could not be tackled in the recent past. The generation and availability of big data is a strong driving force for the development of AI/ML applications; however, several challenges need to be addressed, such as processing speed, memory requirements, high bandwidth, low-latency memory access, and highly conductive and flexible connections between processing units and memory blocks. Conventional computing platforms are unable to address these issues for machine learning and AI. Deep neural networks (DNNs) are widely employed for machine learning and AI applications such as speech recognition, computer vision, and robotics, efficiently and accurately. However, this accuracy is achieved at the cost of high computational complexity, sacrificing energy efficiency and throughput-like performance metrics along with incurring high latency. To address the problems of latency, energy efficiency, complexity, power consumption, and so forth, many state-of-the-art DNN accelerators have been designed and implemented in the form of application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). This work surveys these recently developed DNN accelerators. Various DNN architectures, their computing units, and the emerging technologies used to improve the performance of DNN accelerators are discussed. Finally, we explore the scope for further improvement in these accelerator designs, and various opportunities and challenges for future research.

6.
As a third-generation high-performance I/O interconnect, PCI Express offers many technical advantages, such as packet-based switching, point-to-point connections, LVDS high-speed serial links, and high bandwidth. However, PCI Express has mostly been applied in general-purpose high-performance computers, with few examples of its use in embedded system design. Based on YHFT-QDSP, an in-house embedded multi-core SoC, and taking both the system design requirements and the characteristics of PCI Express into account, this paper applies PCI Express to the design of the system's chip-to-chip interconnect module using a rapid design method based on IP tailoring, shortening the design cycle and achieving good results. Implemented with a 0.13 μm standard-cell library, the PCI Express chip-to-chip interconnect module occupies a total area of 0.65 mm², of which the protocol conversion module takes 0.12 mm², and the effective chip-to-chip data bandwidth reaches 1.63 Gb/s.

7.
In this paper, we propose a novel Convolutional Neural Network hardware accelerator called CoNNA, capable of accelerating pruned, quantized CNNs. In contrast to most existing solutions, CoNNA offers a complete solution to compressed CNN acceleration, being able to accelerate all layer types commonly found in contemporary CNNs. CoNNA is designed as a coarse-grained reconfigurable architecture that uses rapid, dynamic reconfiguration during CNN layer processing. The CoNNA architecture enables on-the-fly selection of the CNN network to be accelerated and also supports the acceleration of CNN networks with dynamic topology. Furthermore, by directly processing compressed feature and kernel maps and skipping all ineffectual computations during CNN layer processing, the CoNNA CNN accelerator is able to achieve higher CNN processing rates than some previously proposed solutions. The CoNNA architecture has been implemented using the Xilinx Zynq UltraScale+ FPGA family and compared with seven previously proposed CNN hardware accelerators. Experimental results indicate that the CoNNA architecture is up to 14.10, 6.05, 4.91, 2.67, 11.30, 3.08, and 3.58 times faster than the previously proposed MIT Eyeriss, NullHop, NVIDIA Deep Learning Accelerator (NVDLA), NEURAghe, CNN_A1, fpgaConvNet, and Deephi Aristotle CNN accelerators, respectively, while using an identical number of computing units and operating at the same clock frequency.

8.
The demand for complex 3-D scenes on embedded systems drives developers to use the full resources of current low-power processors and to add dedicated hardware as a graphics accelerator unit for real-time realistic scene rendering. Photon mapping, one of the most powerful techniques for rendering highly realistic 3-D images, involves large amounts of floating-point operations and is very time-consuming. To exploit the advantages of multiprocessor systems for 3-D scene generation, this paper proposes parallel photon-mapping rendering on a homogeneous multiprocessor SoC (MPSoC) platform with a mesh NoC that uses an adaptive wormhole routing method to communicate packets among cores. To make efficient use of the MPSoC platform for photon-mapping rendering, this paper explores methods for improving load balancing, using memory efficiently, and reducing communication cost so as to achieve a scalable application. The resulting MPSoC platform is verified and evaluated with cycle-accurate simulations for different mesh NoC sizes. As expected, the proposed methods obtain excellent load balancing and run up to 44.3 times faster on an 8-by-8 MPSoC platform than on a single-core MPSoC platform.

9.
Matrix multiplication underlies numerical analysis as well as graphics and image-processing algorithms, and general-purpose matrix multiplication accelerator design has long been a research focus in embedded system design. Because of its high computational complexity and low processing efficiency, matrix multiplication often becomes the bottleneck of embedded system performance. To make better use of matrix multiplication in the embedded domain, a hardware/software co-acceleration architecture based on an MPSoC (MultiProcessor System-on-Chip) is proposed. Within this architecture, a hardware-constraint-aware matrix blocking method is designed to realize a general-purpose matrix multiplication accelerator system, and, exploiting the MPSoC's multi-core architecture, corresponding task partitioning and load-balancing scheduling algorithms are proposed to improve parallel efficiency and the overall system speedup. Experimental results show that the proposed architecture and algorithms realize general-purpose matrix multiplication, and that the multi-core parallel scheduling algorithm implemented through hardware/software co-design achieves significantly higher computational efficiency than a traditional single-core design.
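A minimal sketch of the blocking idea follows; the block size is an illustrative assumption, not the paper's hardware-constraint-derived value, and each tile triple is shown as a self-contained unit of work that a scheduler could hand to one core or to the accelerator.

```c
#include <stddef.h>

/* Blocked matrix multiplication sketch: the block size BS is meant to be
 * derived from the accelerator's on-chip buffer size (hardware constraint);
 * the value here is only an illustrative assumption. C must be
 * zero-initialized by the caller.                                         */
#define BS 32

void matmul_blocked(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i0 = 0; i0 < n; i0 += BS)
        for (size_t j0 = 0; j0 < n; j0 += BS)
            for (size_t k0 = 0; k0 < n; k0 += BS)
                /* Each (i0, j0, k0) tile triple is a self-contained unit of
                 * work that a scheduler could assign to one core or to the
                 * hardware accelerator for load balancing.                 */
                for (size_t i = i0; i < i0 + BS && i < n; i++)
                    for (size_t j = j0; j < j0 + BS && j < n; j++) {
                        float acc = C[i * n + j];
                        for (size_t k = k0; k < k0 + BS && k < n; k++)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = acc;
                    }
}
```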

10.
In motion estimation for video coding, full-search architectures are the mainstream, yet the corresponding traditional full-search 1-D and 2-D systolic structures and tree structures often suffer from high I/O bandwidth or low computational efficiency. To address these problems, a new data flow and a corresponding two-dimensional systolic array structure are proposed that exploit the overlap between the search windows of adjacent current blocks, minimizing I/O bandwidth while maintaining high performance. The results show that the proposed structure can compute the 41 motion vectors of a macroblock within 256 cycles with only three data inputs.
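For context, a plain software model of full-search block matching is shown below; the block size and search range are illustrative assumptions, and the paper's 2-D systolic array evaluates many of these candidates in parallel while reusing the pixels shared by overlapping search windows.

```c
#include <limits.h>
#include <stdlib.h>

/* Software model of full-search block matching: every displacement in the
 * search window is scored with the sum of absolute differences (SAD) and
 * the best motion vector is kept. Block size and search range are
 * illustrative assumptions; the caller must keep (bx, by) far enough from
 * the frame border that all reference accesses stay in bounds.            */
#define BLK   16   /* macroblock size            */
#define RANGE  8   /* search range in +/- pixels */

void full_search(const unsigned char *cur, const unsigned char *ref,
                 int width, int bx, int by, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -RANGE; dy <= RANGE; dy++)
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            int sad = 0;
            for (int y = 0; y < BLK; y++)
                for (int x = 0; x < BLK; x++)
                    sad += abs((int)cur[(by + y) * width + (bx + x)] -
                               (int)ref[(by + dy + y) * width + (bx + dx + x)]);
            if (sad < best) { best = sad; *mvx = dx; *mvy = dy; }
        }
}
```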

11.
To transmit image data with high quality and to facilitate its retrieval, analysis, processing, and storage, a JPEG-standard still-image compression and coding system was designed. The original image is compression-coded, and the fast processing speed of a DSP chip is exploited to run the core algorithm and meet the system's real-time requirements. In repeated system tests, the compression system successfully converted a 31 KB BMP image into a 3 KB JPEG image, achieving 10:1 JPEG-standard still-image compression. Compression coding of the original image is therefore the most reasonable way to resolve the conflict between image data volume and limited storage capacity and transmission bandwidth.

12.
Design of a High-Speed Mass Storage Module Based on Flash Memory
王立峰, 胡善清, 刘峰, 龙腾. 《计算机工程》, 2011, 37(7): 255-257, 261
Given the high storage bandwidth and capacity demands of embedded real-time storage, a high-speed mass storage module is designed. The module consists of a high-density NAND Flash array, a large-scale FPGA, and a high-performance DSP; joint control of the data storage process by the FPGA and DSP enables ultra-high-speed storage of massive data. The module's storage management design and DSP software design are presented, and practical applications verify the module's usefulness.

13.
A Sliding-Window Data Reuse Technique with Blocked Storage
刘陶刚, 赵荣彩, 姚远, 瞿进. 《计算机应用》, 2010, 30(5): 1371-1375
Sliding-window operations are widely used in typical applications of reconfigurable systems, such as image processing, pattern recognition, and digital signal processing, but currently generated sliding-window circuits suffer from storage redundancy and operation stalls, which lower execution efficiency. Aiming to increase sliding-window data throughput, a sliding-window data reuse method based on blocked storage is proposed: window data are stored and fetched in parallel, reducing memory access time and accelerating sliding-window execution. Experiments on three typical sliding-window applications show that hardware circuits generated with this method improve program performance by 7.0 to 9.0 times.
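The reuse idea can be illustrated with a conventional line-buffer sketch (a software model under stated assumptions, not the generated circuit): each pixel is fetched from external memory once and reused by every window that overlaps it.

```c
#include <stddef.h>
#include <string.h>

/* Line-buffer sketch of sliding-window data reuse: each input pixel is
 * read from external memory once, held in K on-chip row buffers, and
 * reused by every K x K window that overlaps it. K = 3, the maximum row
 * width MAX_W, and the generic kernel are illustrative assumptions.      */
#define K     3
#define MAX_W 4096

void sliding_window(const unsigned char *in, int W, int H,   /* W <= MAX_W */
                    const float kern[K][K], float *out)
{
    unsigned char lines[K][MAX_W];
    memset(lines, 0, sizeof lines);

    for (int y = 0; y < H; y++) {
        /* Shift the row buffers up and load one new row:
         * exactly one external read per input pixel.                     */
        memmove(lines[0], lines[1], (size_t)(K - 1) * MAX_W);
        memcpy(lines[K - 1], &in[y * W], (size_t)W);

        if (y < K - 1)
            continue;                        /* window not yet full       */

        for (int x = 0; x + K <= W; x++) {
            float acc = 0.0f;
            for (int r = 0; r < K; r++)
                for (int c = 0; c < K; c++)
                    acc += kern[r][c] * lines[r][x + c];
            out[(y - K + 1) * (W - K + 1) + x] = acc;
        }
    }
}
```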

14.
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution for efficiently executing highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth and push for software techniques that optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted at accelerating the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework that implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to a massive reduction in time and bandwidth, even when the accelerator's main memory bandwidth is severely constrained.

15.
Systolic arrays have a regular structure and high throughput, are well suited to matrix multiplication, and are widely used to build high-performance convolution and matrix-multiplication acceleration structures. In deep submicron processes, however, raising chip performance by enlarging the array causes problems such as reduced frequency and sharply increased power. Therefore, combining 3D integrated circuit technology, a double-precision floating-point matrix multiplication acceleration structure, 3D-MMA, is proposed that maps a planar systolic array onto a 3D IC. First, a block mapping and scheduling algorithm tailored to this structure is designed to improve matrix multiplication efficiency; second, an acceleration system based on 3D-MMA is proposed, a performance model of 3D-MMA is built, and its design space is explored; finally, the implementation cost of the structure is evaluated and compared with existing state-of-the-art accelerators. Experimental results show that with 160 GB/s of memory bandwidth and a stack of four 16x16 systolic array layers, 3D-MMA reaches a peak performance of 3 TFLOPS at 99% efficiency, with an implementation cost lower than a 2D implementation. In the same process, 3D-MMA delivers 1.36 and 1.92 times the performance of a linear array accelerator and a K40 GPU respectively, with a far smaller area. The work explores the advantages of 3D ICs in high-performance matrix multiplication accelerator design and offers a useful reference for further improving the performance of high-performance computing platforms.

16.
The existing SCSI parallel bus has been widely used in various multimedia applications. However, due to unfair bus access, the SCSI bus may not be able to fully utilize the potential aggregate throughput of the disks. The number of disks that can be attached to the SCSI bus is limited, and link-level fault tolerance is not provided. Serial storage interfaces such as Serial Storage Architecture (SSA) provide high data bandwidth, fair access, long transmission distances between adjacent devices (disks or hosts), and link-level fault tolerance. The fairness algorithm of SSA ensures that a fraction of the data bandwidth is allocated to each device. In this paper we examine whether SSA is a better alternative than SCSI for supporting continuous media. The scalability of a multimedia server is very important, since the storage requirement may grow incrementally as more content is created and stored. SSA in a shared-storage cluster environment also supports concurrent accesses by different hosts as long as their access paths do not overlap. This feature is called spatial reuse. Therefore, the effective bandwidth over an SSA link can be higher than the raw data bandwidth, and the spatial-reuse feature is critical to the scalability of a multimedia server. This feature is also included in FC-AL3 with a new mode called Multiple Circuit Mode (MCM). Using MCM, all devices can transfer data simultaneously without collision. In this paper we have investigated the scalability of shared-storage clusters in an SSA environment.

17.
李献球. 《微处理机》, 2012, 33(3): 51-53, 57
To meet new requirements for navigation satellite signal monitoring, an integrated satellite signal acquisition system combining data acquisition, processing, storage, and playback is proposed. The system uses a PCIe architecture with a full-duplex system bandwidth of up to 900 MB/s; it acquires 480 MB/s of sustained data plus 480 MB/s of burst data; data processing uses a combined FPGA + DSP architecture for real-time processing; storage is fully solid-state, replacing traditional disks with Flash; and data playback at 960 MB/s is available through the PCIe interface. The system offers high performance and high integration and can be applied in high-end fields such as radar and reconnaissance.

18.
A shrinking energy budget for mobile devices and increasingly complex communication standards make architecture development for software-defined radio very challenging. Coarse-grained array accelerators are strong candidates for achieving both high performance and low power. The C-programmable hybrid CGA-SIMD accelerator presented here targets emerging broadband cellular and wireless LAN standards, achieving up to 100-Mbps throughput with an average power consumption of 220 mW.

19.
In the big data era, graphs are used across many domains to represent data with complex relationships, and graph computing applications are widely used to mine the latent value in graph data. The irregular execution behavior peculiar to graph computing applications raises challenges such as irregular workloads, intensive read-modify-write updates, irregular memory accesses, and irregular communication. Existing general-purpose architectures cannot handle these challenges effectively. To overcome the challenges of accelerating graph computing applications, a large number of graph computing hardware acceleration architectures have been proposed. They...

20.
To address the limited performance and insufficient security of traditional SATA controller storage systems, a high-speed secure storage SoC (system on chip) is proposed and designed that converts data between the PCIe (Peripheral Component Interconnect Express) and SATA (Serial Advanced Technology Attachment) protocols and provides secure local data storage based on the SM4 algorithm. A well-structured on-chip PCIe-SATA conversion data path is built; a system/application-level simulation and verification platform is constructed with PCIe VIP (verification intellectual property) and UVM (Universal Verification Methodology); SystemVerilog stimulus test cases and C firmware are designed; and simulation is automated with scripts. Simulation results show that all device links along the SoC data path are established correctly, data are transferred correctly across the PCIe-SATA conversion path with a measured bandwidth of 472 MB/s, and SM4-based local secure storage encrypts and decrypts without error at 1.33 Gb/s. These results indicate that the architecture of this PCIe-SATA bridge SoC is feasible and achieves secure local data storage, laying an important foundation for further research on high-speed data conversion and access as well as secure transmission and storage.
