期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

朱浩周博洋卢雪山杜溢墨《计算机工程与科学》2021,43(12):2105-2114

随着智能计算和大数据应用的发展,人们对GPU等加速部件的需求不断增长.计算软件栈比如CUDA、OpenCL软件栈是能充分发挥GPU硬件性能的关键.考虑计算软件栈未来在国产基础软硬件平台(比如飞腾CPU和麒麟操作系统)上的可移植性和适配性,重点研究OpenCL开源计算软件栈.测试分析OpenCL应用在不同平台上的表现,评估应用在不同OpenCL软件栈上(比如Mesa、ROCm等)进行GPU计算的表现,评估软件栈中驱动、内核等对GPU计算的影响,并且整个测试涵盖了编译、数据传输和内核执行等OpenCL计算各个阶段的时间开销.经过测试评估发现,国产平台更迫切也更适合使用GPU进行加速计算,ROCm是比较理想的OpenCL开源软件栈,有较好的性能和稳定性,并且与闭源软件栈相比存在一定的优化空间. 相似文献

2.

基于OpenCL与FPGA异构模式的Sobel算法研究

鲍云峰曾张帆唐文龙田茂《计算机测量与控制》2018,26(1)

随着计算机科学技术的迅速发展,嵌入式领域实时图像处理应用越来越广泛,然而传统硬件因为自身架构导致并行化程度不高,针对在视频监控、机器视觉、视频压缩、医疗影像分析等领域需要对图像进行高性能计算的问题,提出一种以OpenCL软件模型和FPGA异构模式的高性能图像处理解决方案,实现了图像显示和OpenCL加速功能,以Sobel边缘检测算法为研究对象,进行了算法并行性分析,并在系统中运用OpenCL加速内核算法,与基本的ARM平台和OpenCL共享内存加速机制相比较,展开性能测试,对加速效果进行了研究。实验数据表明,使用该系统处理不同分辨率的图像,OpenCL加速子系统的处理较基于片上ARM硬核的软件处理,实现相同功能上有100倍左右的性能提升。相似文献

3.

基于OpenCL的并行方腔流加速性能分析* 总被引：1，自引：0，他引：1

李森李新亮王龙陆忠华迟学斌《计算机应用研究》2011,28(4):1401-1403

本文提出了一种使用OpenCL技术对方腔流问题进行加速计算的方法。在计算方腔流问题时,本文将其转换为Ｎ-Ｓ方程通过空间有限差分和龙格库塔时间差分求解,并使用局部缓存等技术进行GPU优化。实验在Nvidia和ATI平台对所给算法进行评测。结果显示,OpenCL相对其串行版本加速约30倍左右。相似文献

4.

使用OpenCL技术的影像快速畸变纠正方法在异构平台上的应用分析

韦博文李涛李广宇汪致恒何沐师悦龄刘路遥张瑞《计算机科学》2016,43(Z11):167-169, 196

针对海量遥感数据应用中日益显著的处理效率低下和计算瓶颈问题,基于通用计算机图形处理单元的编程开发使用OpenCL并行处理技术对遥感数据处理及其过程进行加速,旨在为遥感影像大数据处理提供一条更为高效的途径。在不同显卡平台上对影像畸变纠正实施并行处理,结果表明,OpenCL技术在提高影像畸变纠正的速度方面作用显著,可取得29.1倍的最高加速效果;与CUDA并行处理技术的交叉验证进一步凸显了OpenCL技术在异构平台上实施并行处理时所具有的通用性的优势。相似文献

5.

基于OpenCL的均值平移算法在多个众核平台的性能优化研究

庞旭张云泉龙国平贾海鹏颜深根《计算机科学》2013,40(3):79-85

OpenCL作为一种面向多种平台、通用目的的编程标准,已经对许多应用程序进行了加速。由于平台硬件和软件环境的差异,通用的优化方法不一定在所有平台都有很好的加速。通过对均值平移算法在GPU和APU平台的优化,探讨了不同平台各种优化方法的贡献力,一方面研究各个平台的计算特性,另一方面体会不同优化方法的优劣,在优劣的相互转化中寻求最优的解决方案。实验表明,算法并行优化前、后在AVIV 5850,Tesla 02050和APU A6365。上分别达到了9.68, 5.74和1.27倍加速,并行相比串行程序达到79.73,93.88和2.22倍加速,前两个平台OpcnCL版本相比,CUVA版本的OpenCV程序达到1.27和1.24倍加速。相似文献

6.

TensorFlow框架中OpenCL算子的实现及集成

郭强程大果孙羽菲周建宇张玉志裴嘉傲甘润东陈锐《数据与计算发展前沿》2022,(2):3-16

【目的】目前,TensorFlow 这一主流机器学习框架与CUDA异构编程环境的组合在学术界与工业界得到大量使用,使用CUDA实现的TensorFlow算子是加速计算的关键。然而,TensorFlow对于OpenCL 这一开放通用的异构编程标准的不支持严重限制了TensorFlow的通用性,并导致OpenCL硬件设备的算力无法充分发挥。【方法】针对此问题,本文深入探索TensorFlow的底层实现,在对TensorFlow代码结构深入分析的基础上实现了OpenCL算子,并且在2.2.0版本的TensorFlow框架实现了OpenCL算子的集成。【结果】基于上述实现,TensorFlow能够借助OpenCL算子在支持OpenCL 1.2的硬件设备上运行。同时,本文提出的优化方法也大幅提升了OpenCL算子的计算效率。【结论】通过实验表明,本文提出的方法能够有效地解决TensorFlow无法应用在OpenCL硬件设备上的问题。相似文献

7.

基于国产软硬件的OpenCL计算平台研究

安婷玉郭宝宝《计算机工程与科学》2019,41(11):1919-1923

随着智能计算和大数据应用的发展,人们对GPU等加速部件的需求不断增长。基于国产基础软硬件平台运行显控应用做加速计算的需求,研究了OpenCL计算平台的移植和实现途径,就国产软硬件平台进行GPU计算做出了初步探索。研究的计算平台包括Mesa、ROCm、Pocl和Beignet,最后给出了如何将ROCm在国产平台上移植适配的思路和解决方案。相似文献

8.

GROMACS 2020在ROCm平台上的移植与优化

张驭洲曹武迪卜景德谭光明吉青《计算机工程与科学》2021,43(11):1901-1909

GROMACS是应用广泛的开源分子动力学模拟软件,当前主要通过CUDA使用NVIDIA GPU进行加速计算。ROCm是一个开源的高性能异构计算平台。基于ROCm平台的HIP编程语言,首次实现了GROMACS 2020系列在ROCm平台上的完整移植。在MI50 GPU上,以一个复杂离子液体模拟算例为目标,使用GPU性能分析工具rocprof对移植代码进行了性能分析。针对MI50硬件特性,先后对成键力核函数、静电力的PME核函数和短程非成键力核函数进行了优化,优化后运行目标算例的性能相比初始版本整体上获得了约2.8倍的加速比,在 MI50上的性能高于GROMACS原版OpenCL代码60.5%,相对纯CPU版本有约2.7倍的加速比。在另外2个具有代表性算例的单结点测试以及离子液体算例的多结点扩展性测试中,优化后的代码也达到了较好的性能提升,这表明所采用的优化操作具有一定的通用性。相似文献

9.

基于OpenCL的异构系统并行编程

詹云赵新灿谭同德《计算机工程与设计》2012,33(11):4191-4195,4293

针对异构处理器在传统通用计算中利用率低的问题,提出基于开放计算语言OpenCL(open computing language)的新的通用计算技术,它提供了统一的编程模型。介绍了OpenCL的特点、架构及实现原理等,并提出OpenCL性能优化策略。将OpenCL与计算统一设备架构CUDA(compute unified device architecture)及其它通用计算技术进行对比。对比结果表明,OpenCL能够充分发挥异构处理平台上各种处理器的性能潜力,充分合理地分配任务,为进行大规模并行计算提供了新的强有力的工具。相似文献

10.

畸变差改正算法OpenCL并行加速研究

于梦华王双亭李英成朱祥娥刘晓龙《遥感信息》2019,(3)

针对畸变差改正算法的处理速度不高和CUDA实现算法加速的设备局限性问题,提出了一种OpenCL并行改进畸变差纠正算法实现加速的方法。该方法是对传统的畸变差纠正算法进行并行改进,通过调用计算机GPU的计算单元实现算法加速;采用CPU+GPU的异构模式实现算法加速,将传统算法中逐像素密集计算部分分配到GPU进行处理;与CUDA实现算法加速针对NVIDIA显卡设备不同,OpenCL并行改进的算法没有了设备的限制。实验结果表明,相对于传统算法来说,影像畸变差纠正处理速度显著提升,总体加速比最高达5.976,计算部分加速比最高达到63.432,同时在AMD显卡设备上也得到了较好的加速效果。相似文献

11.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores. 相似文献

12.

基于OpenCL的GPU加速三维时域有限差分电磁场仿真算法研究

代健褚天舒杨照《数值计算与计算机应用》2014,(1):10-11

提出了一种基于开放运算语言（OpenCL）的GPU加速三维时域有限差分（FDTD）电磁场仿真计算的方法．该方法利用图形处理单元（GPU）的并行处理特性并结合OpenCL接口标准实现了时域卷积完全匹配层（CPML）吸收边界条件的三维FDTD的高性能加速计算．首先设置FDTD仿真参数并动态申请内存空间,然后初始化OpenCL的计算参数,对三维电磁模型基于OpenCL进行FDTD加速仿真．本方法显著提升了FDTD电磁场仿真速度,与利用CPU计算相比速度提升可达5-8倍,且具有CPML吸收边界条件,可以模拟电磁波在自由空间的传播;基于OpenCL编译的语言程序可以运行在CPU或GPU硬件上,并可充分发挥多核CPU的并行计算能力,使得FDTD电磁场仿真具有更广泛的实际应用．相似文献

13.

Swan: A tool for porting CUDA programs to OpenCL 总被引：1，自引：0，他引：1

M.J. Harvey G. De Fabritiis 《Computer Physics Communications》2011,(4):1093-1099

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: SwanCatalogue identifier: AEIH_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: GNU Public License version 2No. of lines in distributed program, including test data, etc.: 17 736No. of bytes in distributed program, including test data, etc.: 131 177Distribution format: tar.gzProgramming language: CComputer: PCOperating system: LinuxRAM: 256 MbytesClassification: 6.5External routines: NVIDIA CUDA, OpenCLNature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore ×86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.Solution method:Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.Restrictions: No support for CUDA C++ featuresRunning time: Nominal 相似文献

14.

Evaluations of OpenCL-written tsunami simulation on FPGA and comparison with GPU implementation

Fumiya Kono Naohito Nakasato Kensaku Hayashi Alexander Vazhenin Stanislav Sedukhin 《The Journal of supercomputing》2018,74(6):2747-2775

When a tsunami occurred on a sea area, prediction of its arrival time is critical for evacuating people from the coastal area. There are many problems related to tsunami to be solved for reducing negative effects of this serious disaster. Numerical modeling of tsunami wave propagation is a computationally intensive problem which needs to accelerate its calculations by parallel processing. The method of splitting tsunami (MOST) is one of the well-known numerical solvers for tsunami modeling. We have developed a tsunami propagation code based on MOST algorithm and implemented different parallel optimizations for GPU and FPGA. In the latest study, we have the best performance of OpenCL kernel which is implemented tsunami simulation on AMD Radeon 280X GPU. This paper targets on design and evaluation on FPGA using OpenCL. The performance on FPGA design generated automatically by Altera offline compiler follows the results of GPU by several kernel modifications. 相似文献

15.

基于Chan-Vese模型的面向多核CPU和GPU的人脸轮廓提取并行算法

王丽娜史晓华《计算机应用》2014,34(11):3121-3125

针对人脸轮廓提取中Chan-Vese模型计算量大、分割速度缓慢等问题,采用开放计算语言(OpenCL)并行编程模型,提出了一种基于图形处理器(GPU)和多核CPU加速的并行算法。该算法首先将模型的框架进行重构,消除模型中的数据依赖关系;然后,利用开放计算语言对算法进行并行化以及相应的优化。实验结果表明,与单线程算法相比,在NVIDIA GTX660和AMD FX-8530下达到了较高的加速比。相似文献

16.

Efficient fine-grained shared buffer management for multiple OpenCL devices

Chang-qing Xun Dong Chen Qiang Lan Chun-yuan Zhang 《浙江大学学报:C卷英文版》2013,14(11):859-872

OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are requested to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory （DSOM） is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid cache coherency is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high. 相似文献

17.

FPGA-based implementation of a chirp signal generator using an OpenCL design

《Microprocessors and Microsystems》2020

A novel approach to developing an FPGA-based chirp signal generator using high-level synthesis implementation is proposed. OpenCL, which is a framework used for high-level synthesis (HLS) methodologies, is employed instead of the Verilog/VHDL language to program FPGA. OpenCL has been used for FPGA programming, particularly in high-performance computing applications. Utilizing OpenCL for FPGA development reduces development time because of the high-level abstraction of the code. However, compared to Verilog/VHDL, standard OpenCL does not enable direct access to the FPGA's I/O. In this study, the FPGA needs to access the I/O pins to communicate with the DAC and generate the chirp signal. Thus, direct access to the FPGA I/O pin from the OpenCL environment is required. Therefore, a new OpenCL component is developed to enable the FPGA to communicate with the DAC, thus allowing data streaming to generate the chirp signal. This OpenCL component enables us to stream the data from the FPGA to generate the chirp signal. Here, we demonstrate that by using OpenCL implementation, the FPGA can generate an I/Q chirp signal efficiently. Moreover, the same OpenCL kernel can be employed to generate different bandwidths of the chirp signal without having to reprogram the FPGA. To demonstrate the capability of the system, we generated the I/Q chirp signal from 1 MHz to 5 MHz, 5 MHz to 10 MHz, 10 MHz to 15 MHz and 15 MHz to 20 MHz for a period of 10 µs. 相似文献