Similar Literature
Found 20 similar documents (search time: 234 ms)
1.
The possibility of porting algorithms to graphics processing units (GPUs) raises significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, deliberate use of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for the 2D Laplace equation. Experiments carried out using various graphics hardware and types of connectivity have confirmed that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on NVidia GTX 480), and that the non-blocking communication significantly reduces overhead, allowing for almost linear speed-up, even when communication is carried out over relatively slow networks.
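Below is a minimal CUDA sketch of the overlap idea this abstract describes, assuming a row-major 2D grid and pinned host halo buffers; the kernel, sizes, and stream split are illustrative assumptions, not the authors' code. Interior rows, which need no remote data, run on a compute stream while the boundary rows wait for the asynchronous halo exchange on a second stream.

```cpp
// Sketch: overlapping interior Jacobi updates with halo transfers (assumed names/sizes).
#include <cuda_runtime.h>

#define NX 4096
#define NY 4096

__global__ void jacobi_rows(const float* in, float* out, int firstRow, int lastRow) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;            // column index
    int j = firstRow + blockIdx.y * blockDim.y + threadIdx.y; // row index
    if (i > 0 && i < NX - 1 && j < lastRow)
        out[j * NX + i] = 0.25f * (in[j * NX + i - 1] + in[j * NX + i + 1] +
                                   in[(j - 1) * NX + i] + in[(j + 1) * NX + i]);
}

void step(float* d_in, float* d_out, float* h_haloTop, float* h_haloBot,
          cudaStream_t compute, cudaStream_t comm) {
    dim3 block(32, 8);
    // 1) Interior rows do not depend on remote halos: launch them first.
    dim3 gridInt(NX / 32, (NY - 4 + 7) / 8);
    jacobi_rows<<<gridInt, block, 0, compute>>>(d_in, d_out, 2, NY - 2);
    // 2) Meanwhile, exchange boundary rows through pinned host buffers
    //    (a stand-in for the paper's non-blocking network communication).
    cudaMemcpyAsync(h_haloTop, d_in + 1 * NX, NX * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);
    cudaMemcpyAsync(h_haloBot, d_in + (NY - 2) * NX, NX * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);
    cudaStreamSynchronize(comm);  // non-blocking sends/receives would go here
    cudaMemcpyAsync(d_in, h_haloTop, NX * sizeof(float),
                    cudaMemcpyHostToDevice, comm);
    cudaMemcpyAsync(d_in + (NY - 1) * NX, h_haloBot, NX * sizeof(float),
                    cudaMemcpyHostToDevice, comm);
    // 3) Boundary rows run only after the halos have arrived (same stream).
    dim3 gridBnd(NX / 32, 1);
    jacobi_rows<<<gridBnd, block, 0, comm>>>(d_in, d_out, 1, 2);
    jacobi_rows<<<gridBnd, block, 0, comm>>>(d_in, d_out, NY - 2, NY - 1);
    cudaDeviceSynchronize();
}
```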

2.
The Finite-Difference Time-Domain (FDTD) method is commonly used for electromagnetic field simulations. Recently, successful hardware accelerations using Graphics Processing Units (GPUs) have been reported for large-scale FDTD simulations. In this paper, we present a performance analysis of the three-dimensional (3D) FDTD on the GPU using the roofline model. We find that the theoretical predictions of maximum performance agree well with the experimental results. We also suggest suitable optimization methods for the best FDTD performance on the GPU. In particular, the optimized 3D FDTD program on the GPU (NVIDIA GeForce GTX 480) is shown to be 64 times faster than a naively implemented program on the CPU (Intel Core i7 2600).
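As a rough illustration of the roofline analysis mentioned above, the sketch below computes the attainable-performance bound min(peak FLOP/s, arithmetic intensity × bandwidth); the GTX 480 figures and the stencil's arithmetic intensity are assumed values, not numbers from the paper.

```cpp
// Sketch of a roofline bound; the hardware and intensity figures are assumptions.
#include <cstdio>
#include <algorithm>

int main() {
    const double peak_gflops = 1345.0;  // assumed GTX 480 single-precision peak
    const double bw_gbs      = 177.4;   // assumed GTX 480 peak memory bandwidth
    // A 3D FDTD update streams several field arrays per cell; ~0.4 FLOP/byte
    // is a typical (assumed) intensity for such low-reuse stencils.
    const double intensity   = 0.4;
    double attainable = std::min(peak_gflops, intensity * bw_gbs);
    std::printf("roofline bound: %.1f GFLOP/s (%s-bound)\n", attainable,
                attainable < peak_gflops ? "memory" : "compute");
    return 0;
}
```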

3.
The main limitations of the reanalysis method using CUDA (Compute Unified Device Architecture) for large-scale engineering optimization problems are the low efficiency on a single GPU and the GPU memory bottleneck. To break through these bottlenecks, an efficient parallel independent coefficient (IC) reanalysis method is developed for a multi-GPU platform. The IC reanalysis procedure is reconstructed to accommodate the use of multiple GPUs. The matrices and vectors are partitioned and prepared for each GPU to achieve good load balance and little communication between GPUs. This study also proposes an effective technique to overlap computation and communication through a non-blocking communication strategy: GPUs continue their succeeding tasks while communication is carried out simultaneously. Furthermore, the CSR format is used on each GPU to save memory. Finally, large-scale vehicle design problems are solved with the developed solver. According to the test results, the multi-GPU-based IC reanalysis method has the potential to handle real large-scale problems and shorten the design cycle.
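The abstract mentions storing each GPU's partition in CSR format; the following kernel is a generic scalar CSR sparse matrix–vector product of the kind such storage would feed — a sketch, not the authors' solver.

```cpp
// Sketch: scalar CSR SpMV, one thread per row (generic, not the paper's code).
__global__ void spmv_csr(int nRows, const int* rowPtr, const int* colIdx,
                         const double* vals, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nRows) {
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)  // this row's nonzeros
            sum += vals[k] * x[colIdx[k]];
        y[row] = sum;
    }
}
```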

4.
A naive execution of depth-first search (DFS) on large graphs in a GPU cluster leads to load imbalance among threads and uncoalesced memory accesses, resulting in poor performance. To significantly improve performance in both single-GPU and multi-GPU environments, the data are rearranged through a series of effective operations before processing. A new technique for constructing the mapping between threads and data is proposed, achieving perfect load balance via prefix sums and binary search. To reduce communication overhead, a pruning operation is applied to the edge sets that must be exchanged among the DFS branches. Experimental results show that the algorithm achieves the best possible parallelism on a single GPU and minimizes communication overhead in multi-GPU environments. On a GPU cluster, it can efficiently perform distributed DFS on graphs containing billions of vertices.
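A minimal sketch of the prefix-sum plus binary-search mapping described above (names and the visit logic are assumptions): each thread claims one edge and locates its owning vertex in the exclusive prefix sum of degrees, which balances work regardless of degree skew.

```cpp
// Sketch: per-edge thread mapping via binary search over a degree prefix sum.
__device__ int owner_vertex(const int* degPrefix, int nVerts, int edgeId) {
    int lo = 0, hi = nVerts - 1;
    while (lo < hi) {                      // last vertex with prefix <= edgeId
        int mid = (lo + hi + 1) >> 1;
        if (degPrefix[mid] <= edgeId) lo = mid; else hi = mid - 1;
    }
    return lo;
}

__global__ void gather_neighbors(const int* degPrefix, const int* adj,
                                 int nVerts, int nEdges,
                                 int* srcOut, int* dstOut) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < nEdges) {
        int u = owner_vertex(degPrefix, nVerts, e);  // balanced: 1 thread/edge
        srcOut[e] = u;        // a real traversal would test/mark dst here
        dstOut[e] = adj[e];
    }
}
```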

5.
Graphics Processing Units (GPUs) have risen impressively as general-purpose coprocessors in high performance computing applications since the launch of the Compute Unified Device Architecture (CUDA). However, they present an inherent performance bottleneck: communication between two separate address spaces (the main memory of the CPU and the memory of the GPU) is unavoidable. The CUDA Application Programming Interface (API) provides asynchronous transfers and streams, which permit a staged execution, as a way to overlap communication and computation. Nevertheless, there is no precise way to estimate the possible improvement due to overlapping, nor a rule to determine the optimal number of stages or streams into which computation should be divided. In this work, we present a methodology for modeling the performance of asynchronous data transfers of CUDA streams on different GPU architectures. We illustrate this methodology by deriving performance expressions for two consumer graphics architectures belonging to the more recent generations. These models permit programmers to estimate the optimal number of streams into which the computation on the GPU should be broken up, in order to obtain the highest performance improvements. Finally, we have checked the suitability of our performance models with three applications based on codes from the CUDA Software Development Kit (SDK), with successful results.
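The staged execution being modeled can be sketched as follows: the input is split into nStreams chunks, each chunk getting its own host-to-device copy, kernel, and device-to-host copy on a private stream, so the transfers of one chunk overlap the computation of another. The kernel and chunking below are placeholders; the host buffer must be pinned (cudaMallocHost) for the copies to be truly asynchronous.

```cpp
// Sketch: staged execution over nStreams CUDA streams (placeholder kernel).
#include <cuda_runtime.h>

__global__ void scale(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void staged(float* h /* pinned */, float* d, int n, int nStreams) {
    cudaStream_t* s = new cudaStream_t[nStreams];
    int chunk = n / nStreams;                    // assume n % nStreams == 0
    for (int k = 0; k < nStreams; ++k) cudaStreamCreate(&s[k]);
    for (int k = 0; k < nStreams; ++k) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    for (int k = 0; k < nStreams; ++k) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
    }
    delete[] s;
}
```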

6.
The one-step leapfrog alternating-direction-implicit finite-difference time-domain (ADI-FDTD) method, free from the Courant-Friedrichs-Lewy (CFL) stability condition and sub-step computations, is efficient when dealing with fine-grid problems. However, solving the numerous tridiagonal systems still imposes a great computational burden and makes the method hard to parallelize. In this paper, we propose an efficient graphics processing unit (GPU)-based parallel implementation of the one-step leapfrog ADI-FDTD for the far-field EM scattering simulation of objects, in which we present and analyze schemes for dividing the computational domain and allocating threads, and propose a data-layout transformation of the z components to achieve a better memory access pattern, a key factor in GPU execution efficiency. Simulation experiments verify the accuracy and efficiency of the GPU-based implementation: the results of the proposed one-step leapfrog ADI-FDTD method agree well with Yee's FDTD on the far-field scattering problem, and large performance gains were obtained when the method was accelerated on the GPU.

7.
The simulation of electromagnetic (EM) wave propagation in dielectric media is presented using a Compute Unified Device Architecture (CUDA) implementation of the finite-difference time-domain (FDTD) method on a graphics processing unit (GPU). The FDTD formulation in dielectric media is derived in detail, and the GPU-accelerated FDTD method based on the CUDA programming model is described in a flowchart. The accuracy and speedup of the presented CUDA-implemented FDTD method are validated by numerical simulation of EM waves propagating from free space into lossless and lossy dielectric media on the GPU, compared against the original FDTD method on the CPU. The comparison demonstrates that the CUDA-implemented FDTD method on the GPU achieves a substantial application speedup with reasonable accuracy. © 2016 Wiley Periodicals, Inc. Int J RF and Microwave CAE 26:512–518, 2016.
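For concreteness, a one-dimensional version of the lossy-dielectric E-field update that such a CUDA FDTD maps to might look like the sketch below; the coefficient arrays ca/cb (which fold in permittivity, conductivity, and the time step) and the sign convention are assumptions, not the paper's code.

```cpp
// Sketch: 1D FDTD Ex update in a lossy dielectric (assumed names/convention).
__global__ void update_ex(float* ex, const float* hy,
                          const float* ca, const float* cb, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n)   // interior points; boundaries handled separately
        ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i - 1]);
}
```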

8.
In previous parallel volume rendering algorithms based on object-space partitioning, local rendering and image compositing were two serial phases: during the local rendering phase on each node there is almost no data communication, but during the compositing phase the communication volume is very large, causing bus contention or even communication blocking, along with substantial synchronization overhead. This paper uses a pipeline architecture to execute local volume rendering and image compositing in parallel, which effectively overcomes these drawbacks. A new object-space-partitioned parallel volume rendering algorithm is implemented on a PC-based pipeline architecture.

9.
To isolate resources and system environments, a variety of virtualization tools have emerged in recent years, containers being one of them. Problems encountered when running on supercomputing resources are often caused by software configuration. One role of containers is to package dependencies into a lightweight, portable environment, which improves the deployment efficiency of supercomputing applications. To understand the performance characteristics of container virtualization on an InfiniBand-based CPU-GPU heterogeneous supercomputing platform, a comprehensive performance evaluation of Docker containers was conducted using standard benchmark tools. The method evaluates the performance overhead containers introduce when virtualizing the host, including file-system access performance, parallel communication performance, and GPU computing performance. The results show that containers deliver near-native performance: file-system I/O overhead and GPU computing overhead differ little from the native host, while the parallel communication overhead of containers grows as the network load increases. Based on the evaluation results, a method for exploiting container performance on supercomputing platforms is proposed, providing a basis for users to configure systems and design applications appropriately.

10.
H.264/AVC video encoders are widely used for their high coding efficiency. Since the computational demand, proportional to the frame resolution, is constantly increasing, accelerating H.264/AVC by parallel processing has been of great interest. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general-purpose applications by exploiting fine-grained data parallelism. Despite extensive research efforts to accelerate the H.264/AVC algorithm on GPUs, no attempt had succeeded in achieving a speed-up over x264, known as the fastest CPU implementation, mainly due to significant communication overhead between the host CPU and the GPU and intra-frame dependencies in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve overall throughput. Experimental results show that our proposed H.264 encoder outperforms the x264 encoder.

11.
Graphics Processing Units (GPUs) are widely used in high performance computing, due to their high computational power and high performance per Watt. However, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs, which affects not only performance but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU is used to accelerate the computation while the CPU is responsible for the communication. This approach always requires a dedicated CPU thread, which consumes additional CPU cycles and therefore increases the power consumption of the complete application. In recent work we have shown that the GPU is able to control the communication independently of the CPU. However, there are several problems with GPU-controlled communication. The main problem is intra-GPU synchronization: since GPU blocks are non-preemptive, the use of communication requests within a GPU can easily result in a deadlock. In this work we show how dynamic parallelism solves this problem. GPU-controlled communication in combination with dynamic parallelism allows keeping the control flow of multi-GPU applications on the GPU and bypassing the CPU completely. Using other in-kernel synchronization methods results in massive performance losses, due to the forced serialization of the GPU thread blocks. Although the performance of applications using GPU-controlled communication is still slightly worse than that of hybrid applications, we show that performance per Watt increases by up to 10% while still using commodity hardware.
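A sketch of the pattern under discussion (assumed names; compiled with -rdc=true on sm_35 or later): a one-thread controller kernel sequences compute and communication kernels through device-side launches, which are ordered within a device stream, so no thread block ever spin-waits on a flag.

```cpp
// Sketch: device-side control flow via dynamic parallelism (not the paper's library).
#include <cuda_runtime.h>

__global__ void compute(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void post_comm(float* data) {
    // Stand-in for posting a communication request (e.g., to a NIC queue).
    if (threadIdx.x == 0) data[0] = -data[0];
}

__global__ void controller(float* data, int n) {
    // One controller thread chains kernels; launches into the same device
    // stream execute in order, so no in-kernel block synchronization is needed.
    compute<<<(n + 255) / 256, 256>>>(data, n);
    post_comm<<<1, 1>>>(data);
    compute<<<(n + 255) / 256, 256>>>(data, n);
}
// Host side: controller<<<1, 1>>>(d_data, n);
```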

12.
Cloth simulations, widely used in computer animation and apparel design, can be computationally expensive for real-time applications. Some parallelization techniques have been proposed for visual simulation of cloth using CPU or GPU clusters; they often rely on spatial domain decomposition, which carries a large communication overhead. In this paper, we propose a novel time-domain parallelization technique that uses a two-level mesh representation to resolve the time-dependency issue, and we develop a practical algorithm to smooth the state transition from the corresponding coarse to fine meshes. A load estimation and a load balancing technique used in online partitioning are also proposed to maximize the performance acceleration. Our method achieves nearly linear performance scaling on many-core clusters and outperforms spatial-domain parallelization on a diverse set of benchmarks.

13.
Techniques for applying graphics processing units (GPUs) to general-purpose non-graphics computations, proposed in recent years by ATI (AMD FireStream, 2006) and NVIDIA (CUDA: Compute Unified Device Architecture, 2007), have given an impetus to developing algorithms and software packages for solving problems of diffractive optics on the GPU. Computations based on the widespread ray tracing method were among the first to be implemented on the GPU. The method attracted the attention of the CUDA technology architects, who proposed a GPU-based implementation of it at the NVISION08 conference (2008). The potential of this direction lies both in research into the general issues of mapping the ray tracing method onto the GPU architecture (involving the use of various grid domains and trees) and in the development of dedicated software packages (the RTE and Linzik projects). In this work, special attention is given to an overview of techniques for GPU-aided implementation of the FDTD (finite-difference time-domain) method, which offers an instrument for solving problems of micro- and nano-optics using rigorous electromagnetic theory. The review of related papers ranges from the initial research (based on the use of textures) to complete software solutions (such as FDTD Software and FastFDTD).

14.
GPGPU has drawn much attention for accelerating non-graphics applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of the Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load across the GPUs, and GPU-to-GPU data transfer becomes a severe overhead for the total performance. Comparison and analysis were made among the parallel results of 1D, 2D, and 3D domain partitionings. With a 384 × 384 × 384 mesh system and 96 GPUs, the performance of 3D partitioning is about 3-4 times higher than that of 1D partitioning. The performance curve deviates from the ideal line due to the long communication time between GPUs. To hide the communication time, we introduced an overlapping technique between computation and communication, in which the data transfer and the computation proceed simultaneously in two streams. Using 8-96 GPUs, performance increases by a factor of about 1.1-1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of the flow around a sphere at Re = 13,000 was carried out successfully using a 2000 × 1000 × 1000 mesh and 100 GPUs. For such a computation with 2 giga lattice nodes, 6.0 hours were needed to process 100,000 time steps; under this condition, the computation time (2.79 h) and the data communication time (3.06 h) are almost equal.
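The gap between 1D and 3D partitioning comes down to halo surface area. The short host-side sketch below (the process grid is an assumption) computes the per-GPU ghost-cell volume for both schemes on the paper's 384³ mesh with 96 GPUs.

```cpp
// Sketch: halo volume of 1D slabs vs. 3D blocks (assumed 4x4x6 process grid).
#include <cstdio>

int main() {
    const int N = 384;                 // global mesh per axis (paper's test)
    const int Px = 4, Py = 4, Pz = 6;  // 96 GPUs, one subdomain per GPU
    int nx = N / Px, ny = N / Py, nz = N / Pz;
    long halo3d = 2L * (nx * ny + ny * nz + nx * nz);  // six ghost faces
    long halo1d = 2L * N * N;                          // two full-size faces
    std::printf("cells/GPU: %d, halo 3D: %ld, halo 1D: %ld (%.1fx more)\n",
                nx * ny * nz, halo3d, halo1d, (double)halo1d / halo3d);
    return 0;
}
```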

15.
For multi-scalar multiplication (MSM), the most time-consuming computation in zk-SNARKs (zero-knowledge succinct non-interactive arguments of knowledge), a GPU-based parallel MSM computation scheme is proposed. First, MSM is decomposed into fine-grained tasks to increase the algorithm's inherent parallelism and fully exploit the massive parallel computing capability of the GPU; shared memory is used to reduce the sub-MSMs within the same window in parallel, cutting data-transfer overhead. Second, a window-partitioning method is proposed that searches for the optimal scalar window size based on the thread-level workload of the underlying compute modules, minimizing the computational cost of the MSM subtasks. Finally, the data structure used for scalar form conversion is optimized, and overlapping data transfers with hidden communication time resolve the latency problem of large-scale scalar form conversion. The parallel MSM method is implemented with CUDA on NVIDIA GPUs, and a complete heterogeneous zero-knowledge proof system is built on top of it. Experimental results show that the proposed method achieves a 1.38x speedup over the MSM module of cuZK, the best existing implementation. The overall system based on the improved MSM is 186x faster than the popular Bellman and 1.96x faster than Bellperson, the best heterogeneous version, validating the effectiveness of the method.
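A minimal sketch of the window decomposition underlying such an MSM scheme (Pippenger-style buckets; all names and the 32-bit limb layout are assumptions): each thread extracts its scalar's c-bit digit for the current window and deposits into the corresponding bucket, with an integer counter standing in for the elliptic-curve point accumulation.

```cpp
// Sketch: c-bit window digit extraction for bucket-method MSM (assumed layout).
__global__ void bucket_count(const unsigned* scalars, int nScalars, int nWords,
                             int c, int window, unsigned* bucketCount) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nScalars) return;
    int bitPos = window * c;                  // this window's lowest bit
    int word   = bitPos / 32, shift = bitPos % 32;
    const unsigned* s = scalars + t * nWords; // little-endian 32-bit limbs
    unsigned digit = s[word] >> shift;
    if (shift + c > 32 && word + 1 < nWords)  // digit straddles a limb boundary
        digit |= s[word + 1] << (32 - shift);
    digit &= (1u << c) - 1;
    if (digit) atomicAdd(&bucketCount[digit], 1u); // real MSM adds EC points here
}
```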

16.
A GPU-accelerated method for three-dimensional finite-difference time-domain (FDTD) electromagnetic simulation based on the Open Computing Language (OpenCL) is proposed. The method exploits the parallel processing capability of the graphics processing unit (GPU), combined with the OpenCL interface standard, to achieve high-performance 3D FDTD computation with convolutional perfectly matched layer (CPML) absorbing boundary conditions. The FDTD simulation parameters are first set and memory is allocated dynamically; the OpenCL computing parameters are then initialized, and the accelerated FDTD simulation of the 3D electromagnetic model is carried out via OpenCL. The method significantly improves FDTD simulation speed, achieving 5-8x speedups over CPU computation, and its CPML absorbing boundary conditions allow the propagation of electromagnetic waves in free space to be modeled. Programs written against OpenCL can run on either CPU or GPU hardware and can fully exploit the parallel computing capability of multi-core CPUs, giving FDTD electromagnetic simulation broader practical applicability.

17.
Zhang Yansong, Zhang Yu, Wang Shan. Journal of Software, 2018, 29(3): 883-895
Graph-analytics database systems represented by MapD support high-performance analytical processing through new many-core processors such as GPUs and Xeon Phi, yet the join operation remains an important performance bottleneck for complex data schemas. In recent years, heterogeneous processors have gradually become the mainstream platform for high-performance computing, and research on in-memory join performance has extended from multi-core CPUs to the emerging many-core processors; however, the many existing studies have not systematically revealed the intrinsic relationships among join algorithm performance, join dataset size, and hardware architecture, making it difficult to provide join platform selection and optimization strategies for databases on future heterogeneous platforms. Targeting in-memory join optimization on multi-core CPU, Xeon Phi, and GPU platforms, this paper optimizes the in-memory hash table design by replacing hash mapping with vector mapping, eliminating the influence of hashing cost on the in-memory join algorithm and thereby measuring more accurately the performance characteristics of in-memory joins under hardware-related factors such as the cache size of multi-core CPUs, the cache size of Xeon Phi, the simultaneous multithreading of Xeon Phi, and the SIMT (single-instruction multiple-thread) mechanism of GPUs. The experimental results show that caching and simultaneous multithreading are the key factors for in-memory join performance: the caching mechanism is advantageous for joins that fit within the cache, the GPU's concurrent multithreading achieves higher performance on joins over larger tables, and Xeon Phi performs best for joins that fit within its L2 cache. The results reveal the relationship between in-memory join performance and the hardware characteristics of heterogeneous processors, providing optimization strategies for the query optimizers of in-memory databases on future heterogeneous platforms.
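The vector-mapping idea can be sketched in a few lines: when dimension keys are dense surrogates 0..n-1, the hash table reduces to a plain payload vector indexed directly by the foreign key, so no hash computation or probe cost enters the measurement. The names below are illustrative, not the authors' code.

```cpp
// Sketch: vector-mapping join, a direct array lookup replacing a hash probe.
__global__ void vector_map_join(const int* factFK, int nFact,
                                const int* dimPayload, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nFact)
        out[i] = dimPayload[factFK[i]];   // direct index: no hashing, no probing
}
```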

18.
Zhang Hongjun, Wu Yanjun, Zhang Heng, Zhang Libo. Journal of Software, 2020, 31(10): 3038-3055
The hash table, a data-indexing structure that provides efficient data access by key value, is widely used in all kinds of computer applications, especially in performance-critical system software, databases, and high-performance computing. In networking, cloud computing, and IoT services, hash tables have become a core component of caching systems. However, with the sharp growth of large-scale data volumes, systems whose hash tables are designed around multi-core CPUs have begun to hit performance bottlenecks, and the performance and scalability of hash tables urgently need further improvement. With the growing popularity of general-purpose graphics processing units (GPUs) and the substantial increase in hardware computing power and concurrency, many parallel system-software tasks have been optimized for GPUs with considerable performance gains. Because of sparsity and randomness, directly applying existing parallel hash-table structures on GPUs inevitably incurs frequent memory accesses and bus data transfers, limiting the performance hash tables can deliver on the GPU. This paper analyzes the memory accesses, hit rates, and index overhead of hash-table indexes in caching systems, and proposes CCHT (cache cuckoo hash table), a hybrid-access cache-index framework adapted to the GPU. CCHT provides two caching policies suited to different hit-rate and index-overhead requirements, allows insert and query operations to execute concurrently, makes maximal use of the computing power and concurrency of GPU hardware, and reduces memory accesses and bus transfers. Implementation and experiments on GPU hardware show that CCHT outperforms other hash tables used for cache indexing while maintaining the cache hit rate.
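A sketch of the two-choice cuckoo lookup that such a table builds on (the hash constants, slot layout, and empty/not-found sentinels are assumptions): every query touches at most two slots, which keeps GPU memory traffic bounded and predictable.

```cpp
// Sketch: cuckoo lookup with two hash functions, at most two probes per query.
struct Slot { unsigned key; unsigned val; };   // key 0 = empty (assumed sentinel)

__device__ unsigned h1(unsigned k, unsigned cap) { return (k * 2654435761u) % cap; }
__device__ unsigned h2(unsigned k, unsigned cap) { return (k * 40503u + 2166136261u) % cap; }

__global__ void cuckoo_lookup(const Slot* table, unsigned cap,
                              const unsigned* queries, int n, unsigned* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned k = queries[i];
    Slot s = table[h1(k, cap)];                   // first candidate slot
    if (s.key != k) s = table[h2(k, cap)];        // second (and last) probe
    out[i] = (s.key == k) ? s.val : 0xFFFFFFFFu;  // not-found sentinel
}
```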

19.
Graphics processing unit (GPU) computing is a popular approach to simulating complex models and performing massive calculations, and GPUs have attracted a great deal of interest because they offer both high performance and energy efficiency. Efficient general-purpose computation on GPUs requires good parallelism, memory coalescing, regular memory access, small overhead on data exchange between the CPU and the GPU, and few explicit global synchronizations, which are usually gained by optimizing the algorithms. Beyond these measures, the proper use of some novel features provided on NVIDIA GPUs can offer further improvement. In this paper, we modify an existing optimized GPU application to illustrate the potential performance gains of these features and to demonstrate the performance trade-offs of different implementation choices. The paper focuses on the challenges of reducing interactions between the CPU and the GPU and reducing the use of explicit synchronization. We explain how to achieve these objectives using two features of the Kepler architecture: warp shuffle and dynamic parallelism. We find that a judicious use of these two techniques, eliminating repeated operations and synchronizations, results in significantly better performance. We describe various pitfalls encountered in optimizing our code to use these two features and how these were addressed. In particular, we identify a subtle data race condition when using dynamic parallelism under certain circumstances, and present our solution. Finally, we present a technique to trade off the allocation of various device resources to find the parameters that offer the best performance. Copyright © 2016 John Wiley & Sons, Ltd.
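A minimal sketch of the warp-shuffle technique (written with the modern __shfl_down_sync variant; the grid-stride kernel around it is illustrative): lanes exchange registers directly, so the warp-level reduction needs neither shared memory nor __syncthreads().

```cpp
// Sketch: warp-shuffle sum reduction, one atomicAdd per warp.
__inline__ __device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xFFFFFFFFu, v, offset);  // lane i += lane i+offset
    return v;   // lane 0 holds the warp's total
}

__global__ void block_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];                                  // grid-stride accumulation
    v = warp_sum(v);
    if ((threadIdx.x & 31) == 0) atomicAdd(out, v);  // one add per warp
}
```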

20.
Existing techniques for animating object fracture assume that object materials are homogeneous, while most real-world materials are heterogeneous. In this paper, we propose to use movable cellular automata (MCA) to simulate fracture phenomena on heterogeneous objects. The method is based on a discrete representation and inherits the advantages of both classical cellular automata and discrete element methods. In our approach, the object is represented as discrete spherical particles, named movable cellular automata. MCA is used to simulate the material and physical properties so as to determine when and where fracture occurs. To achieve real-time performance, we accelerate the complex computation of the automata's physical properties in the MCA simulation using CUDA on a GPU. The simulation results are sent directly to a vertex buffer object (VBO) for rendering, avoiding costly communication between the CPU and the GPU. The experimental results show the effectiveness of our method.
