首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到17条相似文献,搜索用时 156 毫秒
1.
随着智能计算和大数据应用的发展,人们对GPU等加速部件的需求不断增长.计算软件栈比如CUDA、OpenCL软件栈是能充分发挥GPU硬件性能的关键.考虑计算软件栈未来在国产基础软硬件平台(比如飞腾CPU和麒麟操作系统)上的可移植性和适配性,重点研究OpenCL开源计算软件栈.测试分析OpenCL应用在不同平台上的表现,评估应用在不同OpenCL软件栈上(比如Mesa、ROCm等)进行GPU计算的表现,评估软件栈中驱动、内核等对GPU计算的影响,并且整个测试涵盖了编译、数据传输和内核执行等OpenCL计算各个阶段的时间开销.经过测试评估发现,国产平台更迫切也更适合使用GPU进行加速计算,ROCm是比较理想的OpenCL开源软件栈,有较好的性能和稳定性,并且与闭源软件栈相比存在一定的优化空间.  相似文献   

2.
随着计算机科学技术的迅速发展,嵌入式领域实时图像处理应用越来越广泛,然而传统硬件因为自身架构导致并行化程度不高,针对在视频监控、机器视觉、视频压缩、医疗影像分析等领域需要对图像进行高性能计算的问题,提出一种以OpenCL软件模型和FPGA异构模式的高性能图像处理解决方案,实现了图像显示和OpenCL加速功能,以Sobel边缘检测算法为研究对象,进行了算法并行性分析,并在系统中运用OpenCL加速内核算法,与基本的ARM平台和OpenCL共享内存加速机制相比较,展开性能测试,对加速效果进行了研究。实验数据表明,使用该系统处理不同分辨率的图像,OpenCL加速子系统的处理较基于片上ARM硬核的软件处理,实现相同功能上有100倍左右的性能提升。  相似文献   

3.
基于OpenCL的并行方腔流加速性能分析*   总被引:1,自引:0,他引:1  
本文提出了一种使用OpenCL技术对方腔流问题进行加速计算的方法。在计算方腔流问题时,本文将其转换为N-S方程通过空间有限差分和龙格库塔时间差分求解,并使用局部缓存等技术进行GPU优化。实验在Nvidia和ATI平台对所给算法进行评测。结果显示,OpenCL相对其串行版本加速约30倍左右。  相似文献   

4.
韦博文  李涛  李广宇  汪致恒  何沐  师悦龄  刘路遥  张瑞 《计算机科学》2016,43(Z11):167-169, 196
针对海量遥感数据应用中日益显著的处理效率低下和计算瓶颈问题,基于通用计算机图形处理单元的编程开发使用OpenCL并行处理技术对遥感数据处理及其过程进行加速,旨在为遥感影像大数据处理提供一条更为高效的途径。在不同显卡平台上对影像畸变纠正实施并行处理,结果表明,OpenCL技术在提高影像畸变纠正的速度方面作用显著,可取得29.1倍的最高加速效果;与CUDA并行处理技术的交叉验证进一步凸显了OpenCL技术在异构平台上实施并行处理时所具有的通用性的优势。  相似文献   

5.
OpenCL作为一种面向多种平台、通用目的的编程标准,已经对许多应用程序进行了加速。由于平台硬件和软件环境的差异,通用的优化方法不一定在所有平台都有很好的加速。通过对均值平移算法在GPU和APU平台的优化,探讨了不同平台各种优化方法的贡献力,一方面研究各个平台的计算特性,另一方面体会不同优化方法的优劣,在优劣的相互转化中寻求最优的解决方案。实验表明,算法并行优化前、后在AVIV 5850,Tesla 02050和APU A6365。上分别达到了9.68, 5.74和1.27倍加速,并行相比串行程序达到79.73,93.88和2.22倍加速,前两个平台OpcnCL版本相比,CUVA版本的OpenCV程序达到1.27和1.24倍加速。  相似文献   

6.
【目的】目前,TensorFlow 这一主流机器学习框架与CUDA异构编程环境的组合在学术界与工业界得到大量使用,使用CUDA实现的TensorFlow算子是加速计算的关键。然而,TensorFlow对于OpenCL 这一开放通用的异构编程标准的不支持严重限制了TensorFlow的通用性,并导致OpenCL硬件设备的算力无法充分发挥。【方法】针对此问题,本文深入探索TensorFlow的底层实现,在对TensorFlow代码结构深入分析的基础上实现了OpenCL算子,并且在2.2.0版本的TensorFlow框架实现了OpenCL算子的集成。【结果】基于上述实现,TensorFlow能够借助OpenCL算子在支持OpenCL 1.2的硬件设备上运行。同时,本文提出的优化方法也大幅提升了OpenCL算子的计算效率。【结论】通过实验表明,本文提出的方法能够有效地解决TensorFlow无法应用在OpenCL硬件设备上的问题。  相似文献   

7.
随着智能计算和大数据应用的发展,人们对GPU等加速部件的需求不断增长。基于国产基础软硬件平台运行显控应用做加速计算的需求,研究了OpenCL计算平台的移植和实现途径,就国产软硬件平台进行GPU计算做出了初步探索。研究的计算平台包括Mesa、ROCm、Pocl和Beignet,最后给出了如何将ROCm在国产平台上移植适配的思路和解决方案。  相似文献   

8.
GROMACS是应用广泛的开源分子动力学模拟软件,当前主要通过CUDA使用NVIDIA GPU进行加速计算。ROCm是一个开源的高性能异构计算平台。基于ROCm平台的HIP编程语言,首次实现了GROMACS 2020系列在ROCm平台上的完整移植。在MI50 GPU上,以一个复杂离子液体模拟算例为目标,使用GPU性能分析工具rocprof对移植代码进行了性能分析。针对MI50硬件特性,先后对成键力核函数、静电力的PME核函数和短程非成键力核函数进行了优化,优化后运行目标算例的性能相比初始版本整体上获得了约2.8倍的加速比,在 MI50上的性能高于GROMACS原版OpenCL代码60.5%,相对纯CPU版本有约2.7倍的加速比。在另外2个具有代表性算例的单结点测试以及离子液体算例的多结点扩展性测试中,优化后的代码也达到了较好的性能提升,这表明所采用的优化操作具有一定的通用性。  相似文献   

9.
詹云  赵新灿  谭同德 《计算机工程与设计》2012,33(11):4191-4195,4293
针对异构处理器在传统通用计算中利用率低的问题,提出基于开放计算语言OpenCL(open computing language)的新的通用计算技术,它提供了统一的编程模型。介绍了OpenCL的特点、架构及实现原理等,并提出OpenCL性能优化策略。将OpenCL与计算统一设备架构CUDA(compute unified device architecture)及其它通用计算技术进行对比。对比结果表明,OpenCL能够充分发挥异构处理平台上各种处理器的性能潜力,充分合理地分配任务,为进行大规模并行计算提供了新的强有力的工具。  相似文献   

10.
针对畸变差改正算法的处理速度不高和CUDA实现算法加速的设备局限性问题,提出了一种OpenCL并行改进畸变差纠正算法实现加速的方法。该方法是对传统的畸变差纠正算法进行并行改进,通过调用计算机GPU的计算单元实现算法加速;采用CPU+GPU的异构模式实现算法加速,将传统算法中逐像素密集计算部分分配到GPU进行处理;与CUDA实现算法加速针对NVIDIA显卡设备不同,OpenCL并行改进的算法没有了设备的限制。实验结果表明,相对于传统算法来说,影像畸变差纠正处理速度显著提升,总体加速比最高达5.976,计算部分加速比最高达到63.432,同时在AMD显卡设备上也得到了较好的加速效果。  相似文献   

11.
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.  相似文献   

12.
提出了一种基于开放运算语言(OpenCL)的GPU加速三维时域有限差分(FDTD)电磁场仿真计算的方法.该方法利用图形处理单元(GPU)的并行处理特性并结合OpenCL接口标准实现了时域卷积完全匹配层(CPML)吸收边界条件的三维FDTD的高性能加速计算.首先设置FDTD仿真参数并动态申请内存空间,然后初始化OpenCL的计算参数,对三维电磁模型基于OpenCL进行FDTD加速仿真.本方法显著提升了FDTD电磁场仿真速度,与利用CPU计算相比速度提升可达5-8倍,且具有CPML吸收边界条件,可以模拟电磁波在自由空间的传播;基于OpenCL编译的语言程序可以运行在CPU或GPU硬件上,并可充分发挥多核CPU的并行计算能力,使得FDTD电磁场仿真具有更广泛的实际应用.  相似文献   

13.
Swan: A tool for porting CUDA programs to OpenCL   总被引:1,自引:0,他引:1  
The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: SwanCatalogue identifier: AEIH_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: GNU Public License version 2No. of lines in distributed program, including test data, etc.: 17 736No. of bytes in distributed program, including test data, etc.: 131 177Distribution format: tar.gzProgramming language: CComputer: PCOperating system: LinuxRAM: 256 MbytesClassification: 6.5External routines: NVIDIA CUDA, OpenCLNature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore ×86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.Solution method:Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.Restrictions: No support for CUDA C++ featuresRunning time: Nominal  相似文献   

14.
When a tsunami occurred on a sea area, prediction of its arrival time is critical for evacuating people from the coastal area. There are many problems related to tsunami to be solved for reducing negative effects of this serious disaster. Numerical modeling of tsunami wave propagation is a computationally intensive problem which needs to accelerate its calculations by parallel processing. The method of splitting tsunami (MOST) is one of the well-known numerical solvers for tsunami modeling. We have developed a tsunami propagation code based on MOST algorithm and implemented different parallel optimizations for GPU and FPGA. In the latest study, we have the best performance of OpenCL kernel which is implemented tsunami simulation on AMD Radeon 280X GPU. This paper targets on design and evaluation on FPGA using OpenCL. The performance on FPGA design generated automatically by Altera offline compiler follows the results of GPU by several kernel modifications.  相似文献   

15.
王丽娜  史晓华 《计算机应用》2014,34(11):3121-3125
针对人脸轮廓提取中Chan-Vese模型计算量大、分割速度缓慢等问题,采用开放计算语言(OpenCL)并行编程模型,提出了一种基于图形处理器(GPU)和多核CPU加速的并行算法。该算法首先将模型的框架进行重构,消除模型中的数据依赖关系;然后,利用开放计算语言对算法进行并行化以及相应的优化。实验结果表明,与单线程算法相比,在NVIDIA GTX660和AMD FX-8530下达到了较高的加速比。  相似文献   

16.
OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are requested to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid cache coherency is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high.  相似文献   

17.
A novel approach to developing an FPGA-based chirp signal generator using high-level synthesis implementation is proposed. OpenCL, which is a framework used for high-level synthesis (HLS) methodologies, is employed instead of the Verilog/VHDL language to program FPGA. OpenCL has been used for FPGA programming, particularly in high-performance computing applications. Utilizing OpenCL for FPGA development reduces development time because of the high-level abstraction of the code. However, compared to Verilog/VHDL, standard OpenCL does not enable direct access to the FPGA's I/O. In this study, the FPGA needs to access the I/O pins to communicate with the DAC and generate the chirp signal. Thus, direct access to the FPGA I/O pin from the OpenCL environment is required. Therefore, a new OpenCL component is developed to enable the FPGA to communicate with the DAC, thus allowing data streaming to generate the chirp signal. This OpenCL component enables us to stream the data from the FPGA to generate the chirp signal. Here, we demonstrate that by using OpenCL implementation, the FPGA can generate an I/Q chirp signal efficiently. Moreover, the same OpenCL kernel can be employed to generate different bandwidths of the chirp signal without having to reprogram the FPGA. To demonstrate the capability of the system, we generated the I/Q chirp signal from 1 MHz to 5 MHz, 5 MHz to 10 MHz, 10 MHz to 15 MHz and 15 MHz to 20 MHz for a period of 10 µs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号