首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 156 毫秒
1.
针对便携设备上不断增强的视频处理要求和H.264编解码算法相对较高的计算复杂度之间的矛盾,提出了基于片上多核结构的H.264并行化方案,以达到实时编码的效果。该方案以FPGA为验证平台,通过硬件结构与软件算法协同优化的方式,在单总线双核结构的MPSoC上实现了基于片的H.264并行编码。实验结果表明,在嵌入式环境下利用多核技术实现H.264并行编码可以取得良好的加速效果。  相似文献   

2.
多核嵌入式系统总线冲突避免的节能调度综述   总被引:1,自引:0,他引:1  
在多核嵌入式系统中,避免总线冲突并在时间限制下调度通信任务和计算任务来降低能量消耗是非常重要的,有效节能调度可以避免总线冲突并在时间限制下实现有效节能。由于任务的粒度、优先级、通信任务和计算任务的协同调度对此类算法的节能有着重要影响,概述了这三个方面的研究现状,指出总线冲突避免的节能调度算法设计中有待解决的问题,并给出了多核嵌入式系统总线冲突避免的节能调度算法的发展方向。  相似文献   

3.
油藏数值模拟和很多其他科学计算问题一样需要求解大型稀疏线性代数方程组.在求解稀疏线性代数方程组的迭代法中,稀疏矩阵向量乘法(SpMV)是影响计算效率的核心函数之一.随着计算机硬件架构异构化,科学计算从单核、多核CPU计算架构逐渐发展到多核CPU+众核加速卡(GPU卡或MIC等)的计算架构.SpMV的实现效率与稀疏矩阵的存储格式及硬件架构关系密切.本文针对油藏模拟中常见的Jacobian矩阵的稀疏模式,利用GPU核心的合并访问和并发计算等特点,结合油藏模拟线性解法器的算法要求,设计了一种BHYB矩阵存储格式及其对应的线程组并行策略.数值实验测得基于该存储格式的SpMV相对串行BCSR格式的SpMV的加速比可达19倍,比cuSPARSE库中效率最高的HYB格式的SpMV快30%到80%.此外,本文所提出的BHYB存储格式对块状矩阵在GPU上的存储以及线程组并行策略对其它GPU并行程序中内核函数的设计和优化能起到一定的借鉴作用.  相似文献   

4.
用一种遗传算法的调度策略,以大维度矩阵求逆为实验对象,探索在多核中如何完成任务的均衡分配问题,以达到加速效果.算法利用系统资源的弹性,自动搜寻可以并行的子任务并将其合理地分配到相应计算节点中,提高了多核系统资源调度性能,实现了对用户提交的任务的优化调度,达到了均衡系统各处理器计算负载和提高多核系统的总体性能的目标.  相似文献   

5.
冉德成  吴东  钱磊 《计算机工程》2019,45(10):40-45
为满足深度学习推理中对不同规模矩阵乘法的计算需求,提出一种基于Zynq SoC平台的整数矩阵乘法加速器。采用基于总线广播的并行结构,充分利用片上数据的重用性并最小化中间累加结果的移动范围,以降低外部DRAM的访问需求。通过动态调整矩阵分块的大小,使加速器在计算形状不规则的矩阵乘时保持较高效率。实验结果表明,在DeepBench测试基准下,该加速器可对双核ARM Cortex-A9 CPU的矩阵乘运算实现8.4倍的加速效果。  相似文献   

6.
王晞阳  陈继林  李猛  刘首文 《计算机工程》2022,48(7):199-205+213
在电力系统仿真中,大型稀疏矩阵的求解会消耗大量存储和计算资源,未有效利用矩阵的稀疏性将导致存储空间浪费以及计算效率低下的问题。当前关于稀疏矩阵求解算法的研究主要针对众核加速硬件,聚焦于挖掘层次集合的并行度以提升算法的并行效率,而在众核处理器架构上频繁地进行缓存判断及细粒度访问可能导致潜在的性能问题。针对基于现场可编程门阵列(FPGA)的下三角稀疏矩阵求解问题,在吴志勇等设计的FPGA稀疏矩阵求解器硬件结构的基础上,提出一种静态调度求解算法。通过对稀疏矩阵进行预处理,设计数据分布和指令排布流程,将下三角稀疏矩阵的求解过程静态映射到多个FPGA片上的处理单元,以实现下三角稀疏矩阵在FPGA上的并行高速求解。将串行算法中所有的隐式并行关系排布到缓冲中,使得所有计算单元都能实现计算、访存和单元间通信的高效并行,从而最大限度地利用FPGA的硬件资源。典型算例上的测试结果表明,相较传统的CPU/GPU求解算法,该算法能够实现5~10倍的加速效果。  相似文献   

7.
多核图像处理并行设计范式的研究与应用   总被引:1,自引:0,他引:1       下载免费PDF全文
王成良  谢克家  刘昕 《计算机工程》2011,37(14):220-222
多核计算环境下采用图像处理并行算法可提高图像处理的速度,但已有的并行设计只针对边缘检测、图像投影等特定算法进行,没有形成通用的并行算法设计范式。为此,在研究图像处理算法可并行处理机制和多核架构特点的基础上,提出分析、建模、映射、调试和性能评价及测试发布等5个设计步骤的基于多核计算环境的图像处理算法并行设计范式,以图像傅里叶变换并行算法设计为例在单核、双核、四核、八核计算环境下验证了该并行范式的有效性。实验结果表明,该范式在图像处理并行设计方面可扩展图像处理的应用空间。  相似文献   

8.
基于NVIDIA Jetson Tx2平台,结合OpenCV计算机视觉库与计算统一设备架构(CUDA)程序设计,对汗孔特征提取与匹配算法实现了并行设计.实验结果表明:并行设计算法能够实现最多180倍的加速,推动指纹匹配算法在嵌入式系统领域的应用.  相似文献   

9.
HPL是高性能计算广泛采用的Linpack测试软件包,传统HPL算法中,求解矩阵将以块为单位循环分布到所有处理器,由于国产加速器(China Accelerator)的底层矩阵乘接口仅支持定制接口,传统HPL算法已不适合CPU+China Accelerator异构系统,因此,必须基于定制接口完成矩阵分布细致划分与封装dPEM,以提供一个通用的HPL测试配置环境;同时,为了充分发挥国产异构系统的效率,设计了异构协同矩阵乘调度算法OA4MM,以提高国产异构系统的效率。实验验证了dPEM的有效性和OA4MM算法的高效性,OA4MM较传统的异构HPL调度算法性能提升近10%。  相似文献   

10.
为提升轻量级卷积神经网络在硬件平台的资源利用效率和推理速度,基于软硬件协同优化的思想,提出一种面向FPGA平台的轻量级卷积神经网络加速器,并针对网络结构的特性设计专门的硬件架构。与多级并行策略结合,设计一种统一的卷积层计算单元。为降低模型存储成本、提高加速器的吞吐量,提出一种基于可微阈值的选择性移位量化方案,使计算单元能够以硬件友好的形式执行计算。实验结果表明,在Arria 10 FPGA平台上部署的MobileNetV2加速器能够达到311 fps的推理速度,相比CPU版本实现了约9.3倍的加速比、GPU版本约3倍的加速比。在吞吐量方面,加速器能够实现98.62 GOPS。  相似文献   

11.
During the past few years, embedded digital systems have been requested to provide a huge amount of processing power and functionality. A very likely foreseeable step to pursue this computational and flexibility trend is the generalization of on-chip multiprocessor platforms (MPSoC). In that context, choosing a programming model and providing optimized hardware support to it on these platforms is a challenging task. To deal in a portable way with MPSoCs having a different number of processors running possibly at different frequencies, work-stealing (WS) based parallelization is a current research trend.The contribution of this paper is to evaluate the impact of some simple MPSoCs’ architecture characteristics on the performance of WS in the MPSoC context. The previous evaluations of WS, either theoretical or experimental, were done on fixed multicores architectures. This work extends these studies by exploring the use of WS for the codesign of embedded applications on MPSoC platforms with different hardware capabilities, thanks to cycle-accurate measures.We firstly study the architectural choices suited to WS algorithms and measure the benefit of these architectural modifications. To assert whether WS is suited to the MPSoC context, we experimentally measure its intrinsic implementation overhead on the most efficient architectural designs. Finally, we validate the performances of the approach on two real applications: a regular multimedia application (temporal noise reduction) and an irregular computation intensive application (frames of the Mandelbrot set).Our results show that enhancing MPSoC platforms having up to 16 processors with widespread hardware support mechanisms can lead to important performance improvements at acceptable hardware cost for the considered applications.  相似文献   

12.
首先介绍了CUDA架构特点,在GPU上基于CUDA使用两种方法实现了矩阵乘法,并根据CUDA特有的软硬件架构对矩阵乘法进行了优化。然后计算GPU峰值比并进行了分析。实验结果表明,基于CUDA的矩阵乘法相对于CPU矩阵乘法获得了很高的加速比,最高加速比达到1079.64。GPU浮点运算能力得到有效利用,峰值比最高达到30.85%。  相似文献   

13.
Zhang  Yonghua  Jiang  Hongxu  Liu  Xiaojian  Cao  Haiheng  Du  Yu 《The Journal of supercomputing》2022,78(3):3205-3225

The convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity, which involve enormous communication bandwidth and storage resources requirement. The computation requirement can be addressed effectively to achieve high throughput by highly parallel compute paradigms of current CNNs accelerators. But the energy consumption still remains high as communication can be more expensive than computation, especially for low power embedded platform. To address this problem, this paper proposes a CNNs accelerator based on a novel storage and dataflow on multi-processor system on chip (MPSoC) platform. By minimizing data access and movement and maximizing data reuse, it can achieve the energy efficient CNNs inference acceleration. The optimization strategies mainly involve four aspects. Firstly, an external memory sharing architecture adopting two-dimensional array storage mode for CPU-FPGA collaborative processing is proposed to achieve high data throughput and low bandwidth requirement for off-chip data transmission. Secondly, the minimized data access and movement on chip are realized by designing a multi-level hierarchical storage architecture. Thirdly, a cyclic data shifting method is proposed to achieve maximized data reuse based on both spatial and temporal. In addition, a bit fusion method based on the 8-bit dynamic fixed-point quantization is adopted to achieve double throughput and computational efficiency of a single DSP. The accelerator proposed in this paper is implemented on Zynq UltraScale?+?MPSoC ZCU102 evaluation board. By running the benchmark network of VGG16 and Tiny-YOLO on the accelerator, the throughput and the energy efficiency are evaluated. Compared with the current typical accelerators, the proposed accelerator can increase system throughput by up to 41x, single DSP throughput by up to 7.63x, and system energy efficiency by up to 6.3x.

  相似文献   

14.
Multiprocessor System on Chip (MPSoC) platform plays a vital role in parallel processor architecture design. However, with the growing number of processors, interconnect on chip is becoming one of the major bottlenecks of MPSoC architecture. In this paper, we propose a star network based on peer to peer links on FPGA. The star network utilizes fast simplex links (FSL) as basic structure to connect the scheduler with heterogeneous processing elements, including processors and hardware IP cores. Blocking and nonblocking application interfaces are provided for high level programming. We built a prototype system on FPGA to evaluate the transfer time and hardware cost of the proposed star network architecture. Experiment results demonstrated that the average transfer time for each word could be reduced to 7 cycles, which achieves 14× speedup against state-of-the-art shared memory literatures. Moreover, the star network cost only 1.2?% Flip Flops and 2.45?% LUTs of a single FPGA.  相似文献   

15.
The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection.  相似文献   

16.
In this paper, we describe a technique to design UML-based software models for MPSoC architecture, which focuses on the development of the platform specific model of embedded software. To develop the platform specific model, we define a process for the design of UML-based software model and suggest an algorithm with precise actions to map the model to MPSoC architecture. In order to support our design process, we implemented our approach in an integrated tool. Using the tool, we applied our design technique to a target system. We believe that our technique provides several benefits such as improving parallelism of tasks and fast-and-valid mapping of software models to hardware architecture.  相似文献   

17.
A parallel architecture for an on-line implementation of the recursive least squares (RLS) identification algorithm on a field programmable gate array (FPGA) is presented. The main shortcoming of this algorithm for on-line applications is its computational complexity. The matrix computation to update error covariance consumes most of the time. To improve the processing speed of the RLS architecture, a multi-stage matrix multiplication (MMM) algorithm was developed. In addition, a trace technique was used to reduce the computational burden on the proposed architecture. High throughput was achieved by employing a pipelined design. The scope of the architecture was explored by estimating the parameters of a servo position control system. No vendor dependent modules were used in this design. The RLS algorithm was mapped to a Xilinx FPGA Virtex-5 device. The entire architecture operates at a maximum frequency of 339.156 MHz. Compared to earlier work, the hardware utilization was substantially reduced. An application-specific integrated circuit (ASIC) design was implemented in 180 nm technology with the Cadence RTL compiler.  相似文献   

18.
复合域乘法运算是对称密码算法中的基本运算和重要模块,因操作复杂且计算时间长,其实现性能在很大程度上制约着对称密码算法的运算速度。文章研究了对称密码算法中的复合域乘法运算特点及实现原理,设计了以GF(28)为基域,扩展到GF((28 )h(k=1,2,3,4)域上的复合域乘法可重构架构,通过配置能够灵活高效地实现GF(2 8)、GF((2H)2)、GF(2 8)3、CF((28)4)域上的有限域乘法操作。同时结合处理器的指令设计方法,设计了通用的复合域乘法操作及配置指令,能够极大的提高对称密码算法中复合域乘法运算的处理效率。最后文章对复合域乘法可重构架构进行了模拟与验证,在0.18μmCMOS工艺标准单元库下进行逻辑综合以及布局布线,并对综合结果进行了性能评估。结果表明,文章提出的复合域乘法可重构架构及相应的专用指令,在灵活性的前提下提供了较高的执行效率,具有较高的实用价值。  相似文献   

19.
苗旭鹏  王驭捷  沈佳  邵蓥侠  崔斌 《软件学报》2023,34(9):4407-4420
图神经网络由于其强大的表示能力和灵活性最近取得了广泛的关注. 随着图数据规模的增长和显存容量的限制, 基于传统的通用深度学习系统进行图神经网络训练已经难以满足要求, 无法充分发挥GPU设备的性能. 如何高效利用GPU硬件进行图神经网络的训练已经成为该领域重要的研究问题之一. 传统做法是基于稀疏矩阵乘法, 完成图神经网络中的计算过程, 当面对GPU显存容量限制时, 通过分布式矩阵乘法, 把计算任务分发到每个设备上, 这类方法的主要不足有: (1)稀疏矩阵乘法忽视了图数据本身的稀疏分布特性, 计算效率不高; (2)忽视了GPU本身的计算和访存特性, 无法充分利用GPU硬件. 为了提高训练效率, 现有一些研究通过图采样方法, 减少每轮迭代的计算带价和存储需求, 同时也可以支持灵活的分布式拓展, 但是由于采样随机性和方差, 它们往往会影响训练的模型精度. 为此, 提出了一套面向多GPU的高性能图神经网络训练框架, 为了保证模型精度, 基于全量图进行训练, 探索了不同的多GPU图神经网络切分方案, 研究了GPU上不同的图数据排布对图神经网络计算过程中GPU性能的影响, 并提出了稀疏块感知的GPU访存优化技术. 基于C++和CuDNN实现了该原型系统, 在4个不同的大规模GNN数据集上的实验表明: (1)通过图重排优化, 提高了GPU约40%的缓存命中率, 计算加速比可达2倍; (2)相比于现有系统DGL, 取得了5.8倍的整体加速比.  相似文献   

20.
This paper presents a hardware architecture for singular spectrum analysis of Hankel tensors, including computation of tucker decomposition, tensor reconstruction and final Hankelization. In the proposed design, we explore two level of optimization. First, in algorithm level, we optimize the calculation process by exploiting the Hankel property to reduce the computation complexity and on-chip BRAM resource usage. Secondly, in hardware level, parallelism is explored for acceleration. Resource sharing is applied to reduce look-up tables (LUTs) usage. To enable flexibility, the number of processing elements (PEs) can be changed through parameter setting. Our proposed design is implemented on Field-Programmable Gate Arrays (FPGAs) to process third order tensors. Experiment results show that our design achieve a speed-up from 172 to 1004 compared with CPU implementation via Intel MKL and 5 to 40 compared with GPU implementation.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号