Similar Documents
19 similar documents found (search time: 125 ms).
1.
Zhang Yu, Feng Dan. Computer Science, 2010, 37(5): 274-277
Owing to requirements on application diversity, real-time behavior, and processing efficiency, high-performance embedded computing hardware platforms must offer both substantial computing power and a degree of adaptability. To this end, a dynamically reconfigurable system-on-chip design based on Xilinx FPGAs is proposed. The system uses dedicated hardware to execute compute-intensive tasks and applies dynamic reconfiguration to support run-time configuration of the hardware processing modules. Three hardware acceleration schemes on the Xilinx programmable system-on-chip are studied: a CPU coprocessor, a PLB-attached accelerator, and an MPMC-attached accelerator. Experimental data show that the MPMC accelerator performs best. A dynamically reconfigurable MPMC accelerator was implemented on a Virtex-5 FPGA device; taking 128-bit AES encryption and decryption modules as examples, the characteristics of the reconfigurable system are examined in terms of hardware resource utilization and reconfiguration latency.
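As a rough illustration of how the CPU might drive such a dedicated encryption module, the following C sketch polls a memory-mapped 128-bit AES block. The base address, register offsets, and bit names are hypothetical; the paper's actual MPMC accelerator interface is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical register map for a memory-mapped AES accelerator; the
 * actual MPMC-attached design in the paper is not public, so the base
 * address and offsets below are illustrative only. */
#define AES_BASE        0x80000000u
#define AES_CTRL        (*(volatile uint32_t *)(AES_BASE + 0x00))
#define AES_STATUS      (*(volatile uint32_t *)(AES_BASE + 0x04))
#define AES_KEY(i)      (*(volatile uint32_t *)(AES_BASE + 0x10 + 4 * (i)))
#define AES_DIN(i)      (*(volatile uint32_t *)(AES_BASE + 0x20 + 4 * (i)))
#define AES_DOUT(i)     (*(volatile uint32_t *)(AES_BASE + 0x30 + 4 * (i)))

#define AES_CTRL_START  0x1u
#define AES_STATUS_DONE 0x1u

/* Encrypt one 128-bit block (four 32-bit words) through the accelerator. */
static void aes_encrypt_block(const uint32_t key[4], const uint32_t in[4],
                              uint32_t out[4])
{
    for (int i = 0; i < 4; i++) AES_KEY(i) = key[i];   /* load 128-bit key   */
    for (int i = 0; i < 4; i++) AES_DIN(i) = in[i];    /* load plaintext     */
    AES_CTRL = AES_CTRL_START;                          /* kick off the core  */
    while (!(AES_STATUS & AES_STATUS_DONE))             /* poll for completion */
        ;
    for (int i = 0; i < 4; i++) out[i] = AES_DOUT(i);   /* read ciphertext    */
}
```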

2.
To meet the high-speed communication needs of embedded systems, an embedded system based on a RapidIO interface is proposed. Built on a Xilinx FPGA and a PowerPC embedded CPU, the design establishes the hardware and software interfaces between a RapidIO IP core and the embedded system and verifies their functionality. The hardware block diagram and the design approach for the key components are given. Logic simulation and physical verification demonstrate the effectiveness of the scheme.

3.
To improve the flexibility of embedded system design and keep pace with the future direction of embedded systems, a method for implementing Ethernet data transfer on Xilinx FPGAs using system-on-a-programmable-chip (SOPC) technology is proposed. The hardware architecture of the system and the tools, flow, and key techniques of Xilinx SOPC design are introduced. On this basis, an SOPC platform is built with the Xilinx Embedded Development Kit, using intellectual property (IP) cores provided by Xilinx. The uIP protocol stack is applied in the software design, and the driver and application programs are developed. Communication tests with a host computer demonstrate the feasibility and effectiveness of the design.
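For orientation, this is the canonical uIP polling loop as one might run it on such a platform; `netdev_read()` and `netdev_send()` are placeholder names for the Ethernet IP core's driver, and ARP processing and periodic-timer pacing are omitted for brevity.

```c
#include "uip.h"        /* uIP stack headers from the uIP distribution */

/* netdev_read()/netdev_send() stand in for the MAC driver of the FPGA
 * Ethernet IP core; they are assumptions, not part of uIP itself. */
extern unsigned int netdev_read(void);   /* fills uip_buf, returns length     */
extern void netdev_send(void);           /* transmits uip_len bytes of uip_buf */

void network_main_loop(void)
{
    uip_init();
    for (;;) {
        uip_len = netdev_read();
        if (uip_len > 0) {
            uip_input();                 /* hand the packet to the stack */
            if (uip_len > 0)             /* the stack produced a reply   */
                netdev_send();
        } else {
            for (int i = 0; i < UIP_CONNS; i++) {
                uip_periodic(i);         /* drive retransmission timers  */
                if (uip_len > 0)
                    netdev_send();
            }
        }
    }
}
```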

4.
To meet the demands of rapid development in embedded systems, a method is proposed that uses system-on-a-programmable-chip (SOPC) technology on Xilinx FPGAs to store large volumes of data on a CompactFlash (CF) card and to transfer data with the lightweight IP (LwIP) protocol stack. The hardware architecture of the system and the Xilinx SOPC development tools, flow, and key techniques are introduced. On this basis, an SOPC platform is built with the Xilinx Embedded Development Kit, using intellectual property (IP) cores provided by Xilinx. By applying the CF card and the LwIP stack in the software design, the driver and application programs are developed; communication tests with a host computer confirm the feasibility and effectiveness of the design.

5.
Wang Wenhu, Wu Qilin, Huang Hui. Process Automation Instrumentation, 2010, 31(3): 41-43, 46
Against the background of networked laboratory teaching, a networked data acquisition system for teaching experiments is designed, based on the Nios II soft-core processor and an embedded real-time operating system. With the Quartus II and SOPC Builder development tools, the Nios II CPU hardware resources are reconfigured, a custom CPU is assembled, and peripheral hardware interfaces are built up as the application requires, completing the design of a high-performance FPGA-based embedded hardware system. The embedded real-time operating system μC/OS-II guarantees the real-time behavior and human-machine interaction of the application. Practice shows that the scheme offers extensible functionality and upgradable hardware, and has good application potential.
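A minimal sketch of how an acquisition task might be set up under μC/OS-II; the task priority, stack size, and sampling period are arbitrary, and the ADC calls are placeholders for the custom peripheral interfaces described above.

```c
#include "includes.h"            /* μC/OS-II master include */

#define ACQ_TASK_PRIO      5
#define ACQ_TASK_STK_SIZE  256

static OS_STK acq_task_stk[ACQ_TASK_STK_SIZE];

/* Hypothetical sampling task: the ADC calls are placeholders for whatever
 * peripheral interface the custom Nios II system exposes. */
static void acq_task(void *pdata)
{
    (void)pdata;
    for (;;) {
        /* read_adc_sample(); push_to_network_queue(); */
        OSTimeDlyHMSM(0, 0, 0, 10);   /* sample every 10 ms */
    }
}

void app_start(void)
{
    OSInit();                                          /* init the kernel */
    OSTaskCreate(acq_task, (void *)0,
                 &acq_task_stk[ACQ_TASK_STK_SIZE - 1], /* top of stack    */
                 ACQ_TASK_PRIO);
    OSStart();                                         /* never returns   */
}
```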

6.
To address the limited processing speed of convolutional neural network accelerators on embedded field-programmable gate array (FPGA) platforms caused by constrained resources, a high-performance convolutional neural network accelerator is proposed. First, a hardware-software co-design architecture is devised according to the characteristics of convolutional neural networks and embedded FPGA platforms. Then, under storage and compute resource constraints, two optimization strategies are proposed: two-dimensional direct-memory-access tiling, and balancing the use of digital signal processing units against lookup tables. Finally, for the face-detection application, the SSD network model is optimized and a hardware-software pipelined structure is adopted to raise the overall performance of the face-detection system. The accelerator is implemented on a Xilinx ZC706 board; experimental results show that it achieves an average performance of 167.5 GOPS and a face-detection rate of 81.2 frames/s, 1.58 times those of the embedded GPU platform TX2.
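As a sketch of the two-dimensional DMA tiling idea, the following C fragment walks a feature map tile by tile so that each tile fits in on-chip buffers; `dma_copy_2d()` is a hypothetical stand-in for the platform's 2D DMA primitive, not an API from the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 2-D tiling of an H x W feature map into Th x Tw tiles so
 * each tile fits in on-chip buffers; dma_copy_2d() is a placeholder for
 * whatever 2-D DMA primitive the platform provides. */
extern void dma_copy_2d(void *dst, const void *src,
                        size_t rows, size_t row_bytes, size_t src_stride);

void fetch_tiles(const int16_t *fmap, int16_t *onchip,
                 size_t H, size_t W, size_t Th, size_t Tw)
{
    for (size_t r = 0; r < H; r += Th) {
        for (size_t c = 0; c < W; c += Tw) {
            size_t rows = (r + Th <= H) ? Th : H - r;   /* clip edge tiles */
            size_t cols = (c + Tw <= W) ? Tw : W - c;
            dma_copy_2d(onchip,
                        fmap + r * W + c,
                        rows, cols * sizeof(int16_t),
                        W * sizeof(int16_t));
            /* ... launch the convolution engine on this tile ... */
        }
    }
}
```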

7.
A cross-platform framework for graphics application software based on embedded GPUs (OpenGL ES, OES) is proposed and implemented. It comprises four modules: the external event drivers, the graphics application software, the embedded system entry point, and the embedded system hardware. The external event drivers respond to changes in external data or events, controlling real-time updates of the displayed content and real-time switching between functional screens. The graphics application software module has three parts: (1) the interface layer, (2) the intermediate communication layer, and (3) the processing units. The interface layer realizes customer-specific requirements using an object-oriented C++ class design. The intermediate communication layer consists of structured classes arranged to carry out the tasks of the graphics application. The processing units implement the most basic primitives and are built on top of our utility libraries. The embedded system entry point encapsulates the core functions of the graphics software and handles data exchange with the processing units above it. The embedded system hardware module holds the platform-specific (CPU, GPU) data for the mainstream platforms and supports the graphics applications above it. The framework was exercised on a virtual instrument panel and met the requirements of real-time response, efficient processing, and high-quality graphics display, laying an important foundation for graphics display applications on embedded platforms. The work also raises and solves several optimization problems in embedded graphics display technology, providing useful support for embedded graphics development.

8.
To improve the execution efficiency of 2D graphics operations, a 2D graphics acceleration engine with a parallel computing structure is proposed that can process typical 2D graphics, text, and images simultaneously, significantly improving 2D graphics and image processing efficiency. An FPGA prototype system was built on a Xilinx Virtex-6 XC6VLX760 for functional verification and performance evaluation. The results show that, compared with the Marvell PXA300, the engine accelerates 2D graphics operations more effectively: 2D graphics operations invoked by the CPU through hardware calls on the acceleration engine run on average 23 times faster than software execution. Under the SMIC 65 nm CMOS process the accelerator runs at up to 325 MHz, meeting the design requirements.

9.
Based on a comprehensive comparison of various firewalls, the advantages and disadvantages of embedded Linux are discussed, a scheme for implementing a hardware firewall on embedded Linux is proposed, and the steps for realizing the scheme are given. To meet the real-time requirements of embedded systems and the performance requirements of firewalls, the principles of RTLinux (a hard real-time Linux API) and RTnet are introduced, and an overall firewall framework based on RTLinux and RTnet is presented. An embedded hardware firewall built on this framework outperforms a pure software firewall while costing less than a pure hardware firewall.

10.
Design and implementation of a fingerprint recognition system based on SOPC. Total citations: 2 (self-citations: 0, citations by others: 2)

11.
Most algorithms based on convolutional neural networks (CNNs) are compute- and memory-intensive, making them hard to apply in embedded domains with low-power requirements such as aerospace, mobile robotics, and smartphones. To address this, a highly parallel field-programmable gate array (FPGA) accelerator for CNNs is proposed. First, four categories of parallelism in CNN algorithms that can be exploited for FPGA acceleration are compared and studied. Then a multi-channel convolution rotating-register pipeline (MCRP) structure is proposed, which exploits the intra-kernel parallelism of CNN algorithms simply and effectively. Finally, combining input/output-channel parallelism with intra-kernel parallelism, a highly parallel CNN accelerator architecture based on the MCRP structure is proposed and deployed on XILINX's XCZU9EG chip, reaching a peak throughput of 2,304 GOPS while making full use of the on-chip digital signal processor (DSP) resources. With the SSD-300 algorithm as the test case, the accelerator delivers an actual throughput of 1,830.33 GOPS with a hardware utilization of 79.44%. Experimental results show that the MCRP structure effectively raises the throughput of CNN accelerators, and that accelerators based on it can essentially meet the compute demands of most applications in the embedded domain.
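To make the parallelism categories concrete, here is the reference convolution loop nest with comments marking the loop levels that an input/output-channel-parallel plus intra-kernel-parallel design such as MCRP would unroll in hardware; this is a software model for orientation only, not the accelerator itself.

```c
/* Reference loop nest for a KxK convolution layer; the marked loops are
 * the ones a channel-parallel + intra-kernel-parallel accelerator would
 * unroll in hardware.  'Same' padding is assumed for simplicity. */
void conv_layer(const float *in,  /* [IC][H][W]     */
                const float *wt,  /* [OC][IC][K][K] */
                float *out,       /* [OC][H][W]     */
                int IC, int OC, int H, int W, int K)
{
    for (int oc = 0; oc < OC; oc++)              /* unrolled: output channels */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                float acc = 0.0f;
                for (int ic = 0; ic < IC; ic++)  /* unrolled: input channels  */
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++) {  /* unrolled: intra-kernel */
                            int iy = y + ky - K / 2, ix = x + kx - K / 2;
                            if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                                acc += in[(ic * H + iy) * W + ix] *
                                       wt[((oc * IC + ic) * K + ky) * K + kx];
                        }
                out[(oc * H + y) * W + x] = acc;
            }
}
```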

12.
To speed up inference of the deep-learning face detection algorithm MTCNN (multi-task convolutional neural network) and meet the real-time detection requirements of many application scenarios, convolution and fully connected acceleration hardware specifically optimized for MTCNN is designed on the Xilinx FPGA ZCU102 development board. The acceleration hardware is not limited to the MTCNN network; other neural network inference algorithms can also...

13.
MobileNet is a deep neural network widely used in the embedded domain. To address its low hardware implementation efficiency while providing a degree of scalability across different hardware budgets, an FPGA-based MobileNet accelerator architecture is proposed. A three-stage pipelined acceleration array is designed around the network's stacked structure, achieving over 70% computational efficiency at multiplier budgets anywhere from 0 to 4,000. Finally, on a XIL...

14.
There are many design challenges in the hardware-software co-design approach to improving the performance of data-intensive streaming applications on a general-purpose microprocessor with a hardware accelerator. The main challenges are to prevent hardware area fragmentation so as to increase resource utilization, to reduce hardware reconfiguration cost, and to partition and schedule tasks between the microprocessor and the hardware accelerator efficiently for performance improvement and power savings. In this paper a modular, block-based hardware configuration architecture named memory-aware run-time reconfigurable embedded system (MARTRES) is proposed for efficient resource management and performance improvement of streaming applications. Subsequently we design a task placement algorithm named hierarchical best fit ascending (HBFA) to show that the MARTRES configuration architecture is highly efficient in resource utilization and flexible in task mapping and power savings. The time complexity of the HBFA algorithm is reduced to O(n), compared with the traditional best fit (BF) algorithm's O(n²), while the quality of HBFA's placement solutions is better than BF's. Finally we design an efficient task partitioning and scheduling algorithm named balanced partitioned and placement-aware partitioning and scheduling algorithm (BPASA). In BPASA we exploit the temporal parallelism in streaming applications to reduce the reconfiguration cost of the hardware while respecting the required throughput of the output data, and balance the exploitation of spatial and temporal parallelism by weighing reconfiguration cost against data transfer cost. The scheduler consults the HBFA placement algorithm to check whether contiguous area is available on the FPGA before scheduling a task to hardware or software.
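For contrast with HBFA, a minimal C version of the classical best-fit baseline it improves on is sketched below: each placement scans every free block, so placing n tasks costs O(n²) overall. HBFA's size-ordered hierarchy, which brings this down to O(n), is not reproduced here, and the flat free-block model is an assumption for illustration.

```c
#include <stddef.h>

/* Classical best-fit placement over a list of free FPGA area blocks: scan
 * all free blocks and pick the smallest one that fits the task.  One scan
 * per task placement is what makes placing n tasks O(n^2) in total. */
struct free_block { size_t size; int in_use; };

int best_fit_place(struct free_block *blocks, int n, size_t task_size)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (blocks[i].in_use || blocks[i].size < task_size)
            continue;
        if (best < 0 || blocks[i].size < blocks[best].size)
            best = i;                 /* tightest fit so far */
    }
    if (best >= 0)
        blocks[best].in_use = 1;      /* block splitting omitted */
    return best;                      /* chosen block index, or -1 */
}
```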

15.
This paper starts by proposing a complete recommender system implemented on reconfigurable hardware, with the purpose of testing on-chip, low-energy embedded collaborative filtering applications. While the computing time is lower than that obtained from typical multicore microprocessors, this proposal has the advantage of providing an approach to solving any prediction problem based on collaborative filtering in an off-line, highly portable, lightweight computing environment. The approach has been successfully tested with state-of-the-art datasets. Next, as a result of improving certain tasks related to the on-chip recommender system, we propose a custom, fine-grained parallel circuit for fast matrix multiplication with floating-point numbers. This circuit was designed to accelerate the predictions of the model obtained by the recommender system, and was tested with two small datasets for experimental purposes. The accelerator exploits two levels of parallelism. On the one hand, several predictions run in parallel through the simultaneous multiplication of different vectors of two matrices. On the other hand, each vector operation is executed in parallel by multiplying pairs of floating-point values and then adding the corresponding results, also in parallel. The circuit was compared with other approaches designed for the same purpose: circuits built with automated high-level synthesis tools, a general-purpose microprocessor, and high-performance graphics processing units. The prediction accelerator surpassed all of these in execution time. We also evaluated the scalability of the circuit to practical problems using the high-level synthesis approach, and confirmed that implementations based on reconfigurable hardware achieve acceptable speedups over multicore processors.
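Below is a software model of the accelerator's two parallelism levels, assuming the common matrix-factorization formulation of collaborative filtering in which a prediction is the dot product of two k-element latent vectors; in hardware both loop levels run concurrently, and the loops here only mirror that structure.

```c
#include <stddef.h>

/* Level 2: binary reduction tree summing the element-wise products.
 * In hardware each stage is a rank of parallel adders; n is assumed
 * to be a power of two. */
static float tree_reduce(float *v, size_t n)
{
    for (size_t step = n / 2; step > 0; step /= 2)   /* log2(n) adder stages */
        for (size_t i = 0; i < step; i++)            /* parallel in hardware */
            v[i] += v[i + step];
    return v[0];
}

/* Level 1: several predictions (vector pairs) processed concurrently.
 * users/items hold k latent factors per row; k <= 64 assumed here. */
void predict_batch(const float *users, const float *items, float *ratings,
                   size_t batch, size_t k)
{
    float prod[64];
    for (size_t b = 0; b < batch; b++) {       /* parallel predictions  */
        for (size_t i = 0; i < k; i++)         /* parallel multipliers  */
            prod[i] = users[b * k + i] * items[b * k + i];
        ratings[b] = tree_reduce(prod, k);
    }
}
```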

16.
Recurrent neural networks (RNNs) have been applied more and more widely in machine learning in recent years, and in sequence-learning tasks in particular they outperform other neural networks such as CNNs. However, RNNs and their fully connected variants such as LSTM and GRU have high computational and storage complexity, which makes their inference slow and hard to deploy in products. On the one hand, the traditional CPU computing platform is ill-suited to the large-scale matrix operations of RNNs; on the other hand, the shared and global memory of the GPU hardware acceleration platform makes GPU-based RNN accelerators comparatively power-hungry. Thanks to its parallel computation and low power consumption, the FPGA has increasingly been adopted as the hardware platform for RNN accelerators in recent years. This paper surveys recent FPGA-based RNN accelerators, summarizes the data optimization algorithms and hardware architecture design techniques they employ, and proposes directions for future research.

17.
Physical prototyping with Smart-Its. Total citations: 1 (self-citations: 0, citations by others: 1)
Exploring novel ubiquitous computing systems and applications inevitably requires prototyping physical components. Smart-Its are hardware and software components that augment physical objects with embedded processing and interaction to address this need. Using these small computing devices, our work targets embedded interactive systems that recede from the foreground and become secondary to the physical objects people interact with during everyday activities. Such systems create new design challenges related to prototyping with embedded technologies and require careful consideration of the physical design context.

18.
As training datasets grow and training models become ever more complex, the cost of training deep neural networks keeps rising, placing higher compute demands on the platform; parallelizing model training has become an urgent need for timely application. In recent years AI accelerators for distributed training (FPGAs, TPUs, AI chips, and so on) have emerged one after another, providing the hardware foundation for parallel training of deep neural networks. To make full use of all available hardware resources, researchers must carry out model-parallel training of neural networks on platforms that combine AI accelerators of different compute capabilities and hardware architectures. How to use these heterogeneous accelerator resources efficiently and balance the training load across them has therefore long been a question of keen interest. This paper proposes an automatic generation method for model-partitioning strategies in model-parallel training. The method generates a partitioning strategy automatically from the static network model, assigning network layers to different AI accelerators. The strategies it generates make efficient use of all the compute resources on a single platform and keep the training load balanced across devices. Compared with the manual partitioning strategies in current use, the method is far more timely, cutting strategy-generation time by a factor of more than 100 and eliminating the uncertainty introduced by human factors.
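As a toy sketch of automatic strategy generation, the following C fragment greedily assigns each layer to the accelerator whose accumulated load is lowest under a hypothetical cost model; the paper's generator works from the static network model and is considerably more sophisticated (communication costs and layer ordering are ignored here).

```c
#include <stddef.h>

/* Toy model partitioner: assign each layer, in order, to the accelerator
 * whose accumulated load (layer cost divided by device speed) is lowest.
 * The cost model is hypothetical and n_devs <= 16 is assumed. */
void partition_layers(const double *layer_cost, size_t n_layers,
                      const double *dev_speed, size_t n_devs,
                      int *assign /* out: device index per layer */)
{
    double load[16] = {0};
    for (size_t l = 0; l < n_layers; l++) {
        size_t best = 0;
        for (size_t d = 1; d < n_devs; d++)   /* find least-loaded device */
            if (load[d] < load[best])
                best = d;
        assign[l] = (int)best;
        load[best] += layer_cost[l] / dev_speed[best];
    }
}
```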

19.
Utilizing hardware resources efficiently is vital to building the future generation of high-performance computing systems. The sparse matrix-dense vector multiplication (SpMV) kernel, which is notorious for its poor efficiency on conventional processors, is a key component in many scientific computing applications, and increasing SpMV efficiency can contribute significantly to improving overall system efficiency. The major challenge in implementing SpMV efficiently is handling its input-dependent memory access patterns, and reconfigurable logic is a strong candidate for tackling this problem via memory system customization. In this work we consider three schemes (all off-chip, all on-chip, caching) for servicing the irregular-access component of SpMV and investigate their effects on accelerator efficiency. To combine the strengths of on-chip and off-chip random accesses, we propose a hardware-software caching scheme named NCVCS that combines software preprocessing with a nonblocking cache to enable highly efficient SpMV accelerators with modest on-chip memory requirements. Our results from comparing the three schemes, implemented as part of an FPGA SpMV accelerator, show that our scheme effectively combines the high efficiency of on-chip accesses with the ability of off-chip accesses to handle large matrices.
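The irregular-access component under discussion is easiest to see in the standard CSR SpMV kernel: the gathers `x[col_idx[k]]` below are the input-dependent reads that the three servicing schemes (all off-chip, all on-chip, caching) target.

```c
#include <stddef.h>

/* Standard CSR sparse-matrix dense-vector product, y = A*x.  row_ptr has
 * n_rows + 1 entries; col_idx/val hold the column index and value of each
 * nonzero.  The reads x[col_idx[k]] are the irregular accesses. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];    /* irregular gather from x */
        y[i] = sum;
    }
}
```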
