Similar Articles
20 similar articles found.
1.
As the computation power in desktops advances, parallel programming has emerged as one of the essential skills needed by next-generation software engineers. However, programs written in popular parallel programming paradigms have a substantial amount of sequential code mixed with the parallel code, and several versions supporting different platforms are needed to find the optimum version of a program for the available resources and problem size. As our study of benchmark programs reveals, sequential code is often duplicated across these versions, which hurts the comprehensibility and re-usability of the software. In this paper, we discuss a framework named PPModel, designed and implemented to free programmers from this duplication. Using PPModel, a programmer can separate the parallel blocks in a program, map these blocks to various platforms, and re-execute the entire program. We provide a graphical modeling tool (PPModel) for Eclipse users and a Domain-Specific Language (tPPModel) for non-Eclipse users to facilitate the separation, the mapping, and the re-execution. We illustrate the approach with a case study from a benchmark program in which one parallel block is re-targeted to CUDA and another to OpenMP. The modified program achieved an almost 5× performance gain over its sequential counterpart and a 1.5× gain over the existing OpenMP version.
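To make the duplication problem concrete, here is a minimal sketch (ours, not PPModel output) of the same hot loop maintained twice, once as an OpenMP block and once as a CUDA kernel; PPModel's goal is to keep such blocks separated from the sequential host code. Compile with something like `nvcc -Xcompiler -fopenmp`.

```cuda
// Illustrative only: the same "parallel block" maintained twice, once per
// platform -- the duplication PPModel is designed to factor out.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_add_gpu(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

void scale_add_cpu(const float* x, float* y, float a, int n) {
    #pragma omp parallel for            // OpenMP variant of the same block
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the
    cudaMallocManaged(&y, n * sizeof(float));   // example short
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    scale_add_gpu<<<(n + 255) / 256, 256>>>(x, y, 3.0f, n);
    cudaDeviceSynchronize();
    scale_add_cpu(x, y, 3.0f, n);               // OpenMP version of the block

    printf("y[0] = %f\n", y[0]);                // 3*1+2 = 5, then 3*1+5 = 8
    cudaFree(x); cudaFree(y);
    return 0;
}
```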

2.
The high computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the use of GPUs rather complicated, and a GPU cannot directly execute the binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases: the first extracts parallel hot spots from the sequential binary code; the second generates a hybrid executable (containing both CPU and GPU instructions) for execution. This virtual execution environment works well for applications that run repeatedly. Across a number of benchmarks, the performance of the CUDA (Compute Unified Device Architecture) code generated by GXBIT reaches about 63% of hand-tuned GPU code, and its overall performance is much better than that of the native platforms.

3.
Graphics processing units (GPU) have taken an important role in the general-purpose computing market in recent years. At present, the common approach to programming GPU units is to write GPU-specific cod…

4.
Data-parallel accelerator devices such as Graphical Processing Units (GPUs) are providing dramatic performance improvements over even multi-core CPUs for lattice-oriented applications in computational physics. Models such as the Ising and Potts models continue to play a role in investigating phase transitions on small-world and scale-free graph structures. These models are particularly well suited to the performance gains possible on GPUs using relatively high-level device programming languages such as NVIDIA's Compute Unified Device Architecture (CUDA). We report on algorithms and CUDA data-parallel programming techniques for implementing Metropolis Monte Carlo updates for the Ising model using bit-packed storage and adjacency neighbour lists for various graph structures in addition to regular hypercubic lattices. We report parallel performance gains, as well as the memory and performance tradeoffs of various GPU/CPU and algorithmic combinations.
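For readers unfamiliar with the technique, a heavily simplified CUDA sketch of a checkerboard Metropolis update follows (unpacked int spins on a square lattice and a toy xorshift RNG; the paper itself uses bit-packed storage, adjacency lists, and production-quality random numbers):

```cuda
// Sketch of a checkerboard Metropolis update for the 2D Ising model.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float urand(unsigned s) {            // toy per-site RNG, not
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;    // suitable for publication runs
    return (s & 0xFFFFFF) / 16777216.0f;
}

__global__ void metropolis(int* spin, int L, float beta, int parity, unsigned seed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != parity) return;  // checkerboard
    int s = spin[y * L + x];
    int nb = spin[y * L + (x + 1) % L] + spin[y * L + (x + L - 1) % L]
           + spin[((y + 1) % L) * L + x] + spin[((y + L - 1) % L) * L + x];
    float dE = 2.0f * s * nb;                   // energy change if s flips
    if (dE <= 0.0f || urand(seed ^ (unsigned)(y * L + x)) < expf(-beta * dE))
        spin[y * L + x] = -s;
}

int main() {
    const int L = 256;
    int* spin;
    cudaMallocManaged(&spin, L * L * sizeof(int));
    for (int i = 0; i < L * L; ++i) spin[i] = 1;        // cold start
    dim3 b(16, 16), g(L / 16, L / 16);
    for (int sweep = 0; sweep < 100; ++sweep)           // black then white sites
        for (int p = 0; p < 2; ++p)
            metropolis<<<g, b>>>(spin, L, 0.44f, p, 1234u + sweep);
    cudaDeviceSynchronize();
    long m = 0; for (int i = 0; i < L * L; ++i) m += spin[i];
    printf("magnetisation per site: %f\n", (double)m / (L * L));
    cudaFree(spin);
    return 0;
}
```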

5.
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high-level programming language (Python) to achieve state-of-the-art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single-component (BGK, MRT, ELBM) and multicomponent (Shan–Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes.
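Sailfish generates its kernels at run time from Python templates; the hand-written CUDA sketch below (ours, and far simpler than the generated code) shows the core of what a D2Q9 BGK collision step computes: moments are accumulated, then each distribution is relaxed toward local equilibrium. Streaming, boundaries, and the other models the paper covers are omitted.

```cuda
// Heavily simplified D2Q9 BGK collision step.
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                            1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int   cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

__global__ void bgk_collide(float* f, int nx, int ny, float omega) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int idx = y * nx + x, stride = nx * ny;

    float rho = 0.f, ux = 0.f, uy = 0.f;          // macroscopic moments
    for (int i = 0; i < 9; ++i) {
        float fi = f[i * stride + idx];
        rho += fi; ux += cx[i] * fi; uy += cy[i] * fi;
    }
    ux /= rho; uy /= rho;

    float usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {                 // relax toward equilibrium
        float cu = cx[i] * ux + cy[i] * uy;
        float feq = w[i] * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*usq);
        f[i * stride + idx] += omega * (feq - f[i * stride + idx]);
    }
}

int main() {
    const int nx = 256, ny = 256;
    const float wh[9] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                         1.f/36, 1.f/36, 1.f/36, 1.f/36};
    float* f;
    cudaMallocManaged(&f, 9 * nx * ny * sizeof(float));
    for (int i = 0; i < 9; ++i)                   // rho = 1, u = 0 everywhere
        for (int c = 0; c < nx * ny; ++c) f[i * nx * ny + c] = wh[i];
    dim3 b(16, 16), g(nx / 16, ny / 16);
    bgk_collide<<<g, b>>>(f, nx, ny, 1.0f);
    cudaDeviceSynchronize();
    printf("f[0] at origin after one collision: %f\n", f[0]);
    cudaFree(f);
    return 0;
}
```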

6.
Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns with respect to usability, performance, and support for new CUDA features. In response, in this paper, we present a new rCUDA version that (1) improves usability by including a new component that allows an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result, for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low overhead, so that a single application running in one node can use all GPUs available across the cluster, thereby extending the single-node capability of CUDA.
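The interception idea behind such middleware can be sketched as follows (a conceptual illustration only, with a local stub in place of the network transport; this is not rCUDA's code, wire format, or API):

```cuda
// Conceptual sketch (not rCUDA source): middleware of this kind exposes the
// standard CUDA API, but each call is serialized and forwarded to a daemon on
// a GPU-equipped node. Here the "network" is a local stub so the sketch runs.
#include <cstdio>
#include <cuda_runtime.h>

struct Request { int op; size_t size; void* ptr; };   // toy wire format

// Stand-in for the client-to-server transport; real middleware would ship
// the request over TCP/InfiniBand to a remote daemon that owns the GPU.
static cudaError_t forward(const Request& r, void** out) {
    switch (r.op) {
        case 0: return cudaMalloc(out, r.size);       // executed "remotely"
        case 1: return cudaFree(r.ptr);
    }
    return cudaErrorUnknown;
}

// The application links against wrappers like these instead of libcudart.
cudaError_t remote_cudaMalloc(void** p, size_t size) {
    Request r{0, size, nullptr};
    return forward(r, p);
}
cudaError_t remote_cudaFree(void* p) {
    Request r{1, 0, p};
    return forward(r, nullptr);
}

int main() {
    void* dev = nullptr;
    if (remote_cudaMalloc(&dev, 1 << 20) == cudaSuccess)
        printf("allocated 1 MiB on the (conceptually remote) GPU\n");
    remote_cudaFree(dev);
    return 0;
}
```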

7.
Design and Implementation of a Parallel Particle Swarm Optimization Algorithm Based on CUDA
To address the excessive computation time of particle swarm optimization (PSO) when processing large amounts of data and solving large-scale complex problems, we study a fine-grained parallel PSO algorithm implemented on graphics processing units (GPUs). Based on an analysis of the traditional PSO algorithm, combined with the now widely used GPU parallel computing techniques, we design and implement a parallel PSO method. The method runs on the Compute Unified Device Architecture (CUDA) and uses a large number of GPU threads to process the search of individual particles in parallel, accelerating the convergence of the whole swarm. The program makes full use of CUDA's built-in mathematical libraries, ensuring stability and ease of implementation. Experiments on several benchmark optimization functions show that, with equivalent convergence, the CUDA-based parallel PSO achieves speedups of up to 90× over a serial CPU implementation.
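A minimal CUDA sketch of the one-thread-per-particle design described above (ours; the sphere function stands in for the benchmark objectives, and a toy xorshift hash replaces the library RNG):

```cuda
// One thread updates one particle's velocity, position, and personal best.
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

#define DIM 16

__device__ float urand(unsigned s) {            // toy RNG
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return (s & 0xFFFFFF) / 16777216.0f;
}

__device__ float sphere(const float* x) {       // f(x) = sum of x_d^2
    float f = 0.f;
    for (int d = 0; d < DIM; ++d) f += x[d] * x[d];
    return f;
}

__global__ void pso_step(float* x, float* v, float* pbest, float* pbest_val,
                         const float* gbest, int n, unsigned seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int d = 0; d < DIM; ++d) {
        float r1 = urand(seed ^ (unsigned)(i * DIM + d));
        float r2 = urand(seed ^ (unsigned)(i * DIM + d) ^ 0x9E3779B9u);
        v[i*DIM+d] = 0.7f * v[i*DIM+d]
                   + 1.4f * r1 * (pbest[i*DIM+d] - x[i*DIM+d])
                   + 1.4f * r2 * (gbest[d]       - x[i*DIM+d]);
        x[i*DIM+d] += v[i*DIM+d];
    }
    float f = sphere(&x[i*DIM]);                // evaluate, update personal best
    if (f < pbest_val[i]) {
        pbest_val[i] = f;
        for (int d = 0; d < DIM; ++d) pbest[i*DIM+d] = x[i*DIM+d];
    }
}

int main() {
    const int n = 1024, iters = 100;
    float *x, *v, *pb, *pbv, *gb;
    cudaMallocManaged(&x,   n * DIM * sizeof(float));
    cudaMallocManaged(&v,   n * DIM * sizeof(float));
    cudaMallocManaged(&pb,  n * DIM * sizeof(float));
    cudaMallocManaged(&pbv, n * sizeof(float));
    cudaMallocManaged(&gb,  DIM * sizeof(float));
    for (int i = 0; i < n * DIM; ++i) { x[i] = pb[i] = (i % 7) - 3.0f; v[i] = 0.f; }
    for (int i = 0; i < n; ++i) pbv[i] = FLT_MAX;
    for (int d = 0; d < DIM; ++d) gb[d] = x[d];

    int best = 0;
    for (int t = 0; t < iters; ++t) {
        pso_step<<<(n + 127) / 128, 128>>>(x, v, pb, pbv, gb, n, 7u * t + 1u);
        cudaDeviceSynchronize();
        for (int i = 0; i < n; ++i) if (pbv[i] < pbv[best]) best = i;  // host scan
        for (int d = 0; d < DIM; ++d) gb[d] = pb[best * DIM + d];      // new gbest
    }
    printf("best fitness after %d iterations: %g\n", iters, pbv[best]);
    return 0;
}
```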

8.
Research on a High-Performance Massive Data Processing Platform Based on Hadoop
High-performance computing over massive data holds enormous application value, but current cloud computing systems offer only massive-data processing capability, not sufficient high-performance computing capability. By integrating GPUs, with their powerful parallel computing capability, into cloud computing, we propose a heterogeneous high-performance cloud computing architecture based on CPU/GPU cooperation. Built on open-source Hadoop, the platform uses annotations to mark the portions of MapReduce functions that are to be parallelized. A custom GPU class loader converts the marked code into CUDA code, which is dynamically compiled and executed. The platform integrates GPU computing power into the MapReduce framework and can process massive data efficiently.

9.
10.
The use of Graphics Processing Units (GPUs) for high-performance computing has gained growing momentum in recent years. Unfortunately, GPU-programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high-performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU-GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high-level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high-level abstraction for stencil programming on heterogeneous CPU-GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks and NVIDIA CUDA. In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU-only and GPU-only parallel applications, respectively.
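The task-partitioning idea can be sketched without PSkel's skeleton API: split the iteration space, launch the GPU share asynchronously, and let OpenMP process the CPU share concurrently. A minimal illustration (ours), assuming a device and OS that permit concurrent access to managed memory:

```cuda
// Sketch of CPU/GPU task partitioning for a 1D 3-point stencil (not PSkel's
// interface): the first `split` elements go to the GPU, the rest to OpenMP.
// Requires concurrent managed-memory access (Pascal+ GPU on Linux);
// otherwise synchronize before the CPU part.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stencil_gpu(const float* in, float* out, int lo, int hi, int n) {
    int i = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= hi || i == 0 || i == n - 1) return;
    out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;
}

void stencil_cpu(const float* in, float* out, int lo, int hi, int n) {
    #pragma omp parallel for
    for (int i = lo; i < hi; ++i)
        if (i > 0 && i < n - 1) out[i] = (in[i-1] + in[i] + in[i+1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    const int split = (int)(0.7 * n);   // e.g. 70% of the work to the GPU
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)(i % 10);

    stencil_gpu<<<(split + 255) / 256, 256>>>(in, out, 0, split, n);
    stencil_cpu(in, out, split, n, n);  // CPU works while the GPU computes
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```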

11.
Ming Hsiang Huang, Wuu Yang. Software, 2020, 50(10): 1877–1904.
OpenACC is a directive-based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
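A plausible CUDA sketch of the dynamic iteration-sharing idea (ours, much simplified from PFACC's runtime): each thread block's leader claims batches of iterations from a global counter, so load balances dynamically across blocks instead of being fixed statically.

```cuda
// Blocks repeatedly grab batches of loop iterations from a global counter.
#include <cstdio>
#include <cuda_runtime.h>

#define BATCH 1024

__global__ void shared_iterations(float* y, int n, int* next) {
    __shared__ int base;                       // batch start for this block
    for (;;) {
        if (threadIdx.x == 0)                  // leader claims the next batch
            base = atomicAdd(next, BATCH);
        __syncthreads();
        int start = base;
        if (start >= n) return;                // all iterations consumed
        for (int i = start + threadIdx.x; i < min(start + BATCH, n); i += blockDim.x)
            y[i] = 2.0f * y[i];                // the "loop body"
        __syncthreads();                       // before leader reuses `base`
    }
}

int main() {
    const int n = 1 << 22;
    float* y; int* next;
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&next, sizeof(int));
    for (int i = 0; i < n; ++i) y[i] = 1.0f;
    *next = 0;
    shared_iterations<<<64, 256>>>(y, n, next);  // far fewer blocks than batches
    cudaDeviceSynchronize();
    printf("y[0] = %f (expect 2)\n", y[0]);
    cudaFree(y); cudaFree(next);
    return 0;
}
```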

12.
Open Computing Language (OpenCL) is a parallel processing language ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work, we report on the development of a generic parallel single-GPU code, based on the OpenCL model, for the numerical solution of a system of first-order ordinary differential equations (ODEs). We have applied the code to the time-dependent Schrödinger equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device; the speedup in the benchmark tended towards a value of about 40, with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we have also achieved speedups of around 75 through a slight optimization of the described algorithm.
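The paper's solver is written in OpenCL; the CUDA sketch below (ours) shows the same data-parallel idea on a toy problem, with one thread advancing one component of a decoupled system y' = -k*y by a classical RK4 step.

```cuda
// One thread per ODE component; classical fourth-order Runge-Kutta.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float f(float k, float y) { return -k * y; }

__global__ void rk4_step(float* y, const float* k, int n, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float k1 = f(k[i], y[i]);
    float k2 = f(k[i], y[i] + 0.5f * h * k1);
    float k3 = f(k[i], y[i] + 0.5f * h * k2);
    float k4 = f(k[i], y[i] + h * k3);
    y[i] += h * (k1 + 2.f * k2 + 2.f * k3 + k4) / 6.f;
}

int main() {
    const int n = 1 << 20;
    const float h = 0.01f;
    float *y, *k;
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&k, n * sizeof(float));
    for (int i = 0; i < n; ++i) { y[i] = 1.0f; k[i] = 1.0f; }
    for (int step = 0; step < 100; ++step)        // integrate to t = 1
        rk4_step<<<(n + 255) / 256, 256>>>(y, k, n, h);
    cudaDeviceSynchronize();
    printf("y[0] = %f (exact e^-1 = 0.367879)\n", y[0]);
    cudaFree(y); cudaFree(k);
    return 0;
}
```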

13.
The research domain of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia data. High-performance computing techniques are necessary to satisfy the ever-increasing computational demands of MMCA applications. The introduction of Graphics Processing Units (GPUs) in modern cluster systems presents application developers with a challenge. While GPUs are well known to be capable of providing significant performance improvements, the programming complexity vastly increases. To this end, we have extended a user-transparent parallel programming model for MMCA, named Parallel-Horus, to allow the execution of compute-intensive operations on the GPUs present in the cluster. The most important class of operations in the MMCA domain are convolutions, which are typically responsible for a large fraction of the execution time. Existing optimization approaches for CUDA kernels in general, as well as those specific to convolution operations, are too limited in both performance and flexibility. In this paper, we present a new optimization approach, called adaptive tiling, to implement a highly efficient, yet flexible, library-based convolution operation for modern GPUs. To the best of our knowledge, our implementation is the most optimized and best performing implementation of 2D convolution in the spatial domain available to date.
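A fixed-tile shared-memory 2D convolution, the structure that adaptive tiling generalizes, looks roughly as follows (our sketch; the paper's library instead adapts tile shapes to the filter size and device at run time):

```cuda
// Fixed 16x16 tile with halo, filter in constant memory.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16
#define R 2                         // filter radius -> 5x5 filter

__constant__ float filt[(2*R+1) * (2*R+1)];

__global__ void conv2d(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE + 2*R][TILE + 2*R];
    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // cooperative load of tile + halo, clamped at image borders
    for (int dy = threadIdx.y; dy < TILE + 2*R; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2*R; dx += TILE) {
            int sx = min(max((int)(blockIdx.x * TILE) + dx - R, 0), w - 1);
            int sy = min(max((int)(blockIdx.y * TILE) + dy - R, 0), h - 1);
            tile[dy][dx] = in[sy * w + sx];
        }
    __syncthreads();

    if (gx >= w || gy >= h) return;
    float acc = 0.f;
    for (int fy = 0; fy <= 2*R; ++fy)
        for (int fx = 0; fx <= 2*R; ++fx)
            acc += filt[fy * (2*R+1) + fx] * tile[threadIdx.y + fy][threadIdx.x + fx];
    out[gy * w + gx] = acc;
}

int main() {
    const int w = 512, h = 512;
    float *in, *out, hf[(2*R+1)*(2*R+1)];
    for (int i = 0; i < (2*R+1)*(2*R+1); ++i) hf[i] = 1.0f / 25.0f;  // box blur
    cudaMemcpyToSymbol(filt, hf, sizeof(hf));
    cudaMallocManaged(&in, w * h * sizeof(float));
    cudaMallocManaged(&out, w * h * sizeof(float));
    for (int i = 0; i < w * h; ++i) in[i] = (float)(i % 3);
    dim3 b(TILE, TILE), g((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    conv2d<<<g, b>>>(in, out, w, h);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```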

14.
As a general-purpose scalable parallel programming model for coding highly parallel applications, CUDA from NVIDIA provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. It has proven effective at programming multithreaded many-core GPUs that scale transparently to hundreds of cores; as a result, scientists across industry and academia are using CUDA to dramatically expedite production and research codes. GPU-based clusters are likely to play an essential role in future cloud computing centers, because some computation-intensive applications require GPUs as well as CPUs. In this paper, we adopted PCI pass-through technology to give virtual machines direct access to an NVIDIA graphics card, making CUDA high-performance computing available inside the virtual environment. In this way, a virtual machine has not only virtual CPUs but also a real GPU for computing, and its performance is expected to increase dramatically. We measured the performance difference between physical and virtual machines using CUDA, and investigated how the number of virtual CPUs assigned to a virtual machine influences CUDA performance. Finally, we compared the CUDA performance of two open-source virtualization hypervisor environments, with and without PCI pass-through. The experimental results indicate which environment is most efficient for CUDA in a virtualized setting.

15.
Unstructured mesh generation suffers from certain deficiencies in both time and memory. We propose a new method, named GPU-PDMG, a GPU-parallel unstructured mesh generation technique based on the CUDA architecture. The technique combines the high-speed parallel computing power of GPUs with the advantages of Delaunay triangulation; using the CUDA programming model on NVIDIA GPUs, it develops a locking-based parallel zone-partitioning scheme. We evaluate its computational performance by testing cases such as the NACA0012 airfoil and a multi-element airfoil, analyzing the speedup and efficiency of the method. The experimental results show that GPU-PDMG outperforms existing CPU algorithms in speed, improving efficiency while maintaining mesh quality.

16.
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters
Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores of the same computational node.
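A minimal sketch of the hybrid layering (ours, reduced to one GPU kernel per rank): MPI partitions the loop iterations across nodes, OpenMP parallelizes host-side work, and CUDA processes each rank's slice. Build with something like `nvcc -ccbin mpicxx -Xcompiler -fopenmp` and run under `mpirun`.

```cuda
// MPI across nodes, OpenMP on the host, CUDA for each rank's slice.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void daxpy_slice(float* y, const float* x, float a, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) y[i] += a * x[i];
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 22;
    int count = n / size;                 // this rank's share of iterations
    float *x, *y;
    cudaMallocManaged(&x, count * sizeof(float));
    cudaMallocManaged(&y, count * sizeof(float));
    #pragma omp parallel for              // OpenMP handles host-side work
    for (int i = 0; i < count; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    daxpy_slice<<<(count + 255) / 256, 256>>>(y, x, 3.0f, count);
    cudaDeviceSynchronize();

    // combine a per-rank partial result, as the full application would
    float local = y[0], global = 0.f;
    MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of y[0] over %d ranks: %f\n", size, global);

    cudaFree(x); cudaFree(y);
    MPI_Finalize();
    return 0;
}
```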

17.
OpenACC is a directive-based parallel programming language standard. By adding directives that conform to the standard to their code, programmers can have an OpenACC compiler port serial code, in parallelized form, to an accelerator or coprocessor and obtain the speedup that heterogeneous accelerators provide. Unlike heterogeneous parallel programming technologies such as CUDA and OpenCL, OpenACC aims to spare programmers from having to consider the low-level hardware architecture of the accelerator or coprocessor when porting applications, lowering the programming difficulty. It also offers good cross-platform portability: a single code base can run on different hardware platforms. OpenACC is therefore a parallel programming standard worth studying. Today's heterogeneous acceleration hardware is diversifying: Tianhe-2 ("天河二号"), ranked first on the November 2013 Top500 list, uses 48,000 coprocessors built on Intel's Knights Corner architecture, while NVIDIA's recently released Kepler-architecture GPUs have quickly gained a considerable user base thanks to years of GPU market presence. For porters who are not chasing the limits of performance, balancing application performance against ease of porting is an important issue, and OpenACC, which lets a single code base run on both hardware platforms, meets exactly this need. Once ease of porting is addressed, the performance of the same application on different hardware platforms becomes the question users most want answered. Through experiments and a performance model, we show the performance portability of OpenACC-ported applications on Intel Knights Corner and NVIDIA Kepler hardware.

18.
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.
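A toy version of the short-range force computation (ours, not LAMMPS code): one thread per atom accumulates a truncated Lennard-Jones force. Production codes replace the O(N²) inner loop with cell or neighbour lists and add the load-balancing and Geryon portability layers discussed above.

```cuda
// One thread per atom; truncated Lennard-Jones with eps = sigma = 1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lj_forces(const float4* pos, float4* force, int n, float rc2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    for (int j = 0; j < n; ++j) {                // brute force; real codes
        if (j == i) continue;                    // use neighbour lists here
        float dx = pi.x - pos[j].x, dy = pi.y - pos[j].y, dz = pi.z - pos[j].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 >= rc2) continue;                 // outside the cutoff
        float inv2 = 1.f / r2, inv6 = inv2 * inv2 * inv2;
        float fmag = 24.f * inv2 * inv6 * (2.f * inv6 - 1.f);
        fx += fmag * dx; fy += fmag * dy; fz += fmag * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}

int main() {
    const int n = 4096;
    float4 *pos, *force;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&force, n * sizeof(float4));
    for (int i = 0; i < n; ++i)                  // atoms on a loose cubic grid
        pos[i] = make_float4(1.2f * (i % 16), 1.2f * ((i / 16) % 16),
                             1.2f * (i / 256), 0.f);
    lj_forces<<<(n + 127) / 128, 128>>>(pos, force, n, 6.25f);  // rc = 2.5
    cudaDeviceSynchronize();
    printf("F[0] = (%g, %g, %g)\n", force[0].x, force[0].y, force[0].z);
    cudaFree(pos); cudaFree(force);
    return 0;
}
```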

19.
Swan: A tool for porting CUDA programs to OpenCL
The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: Swan
Catalogue identifier: AEIH_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Public License version 2
No. of lines in distributed program, including test data, etc.: 17 736
No. of bytes in distributed program, including test data, etc.: 131 177
Distribution format: tar.gz
Programming language: C
Computer: PC
Operating system: Linux
RAM: 256 Mbytes
Classification: 6.5
External routines: NVIDIA CUDA, OpenCL
Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore x86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.
Solution method: Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.
Restrictions: No support for CUDA C++ features
Running time: Nominal
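The heart of such a translation is a mechanical mapping between the CUDA and OpenCL vocabularies. The small CUDA kernel below (our illustration, not Swan output) notes the OpenCL construct each CUDA feature must become:

```cuda
// A trivial CUDA kernel annotated with the OpenCL constructs a CUDA-to-OpenCL
// translator must map it to.
#include <cstdio>
#include <cuda_runtime.h>

__global__                       // OpenCL: __kernel
void saxpy(float a, const float* x,
           float* y, int n) {    // OpenCL: __global float* pointers
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)
    if (i < n) y[i] = a * x[i] + y[i];
    // __syncthreads()           // OpenCL: barrier(CLK_LOCAL_MEM_FENCE)
    // __shared__ float buf[..]  // OpenCL: __local float buf[..]
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.f; y[i] = 2.f; }
    // CUDA's <<<grid, block>>> launch becomes clSetKernelArg +
    // clEnqueueNDRangeKernel on the OpenCL side.
    saxpy<<<(n + 255) / 256, 256>>>(3.f, x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```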

20.
Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL, a high-level programming approach for systems with multiple GPUs and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report an experimental evaluation of our approach in terms of programming effort and performance.
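The flavour of the skeleton approach can be sketched in a few lines (ours, in CUDA rather than SkelCL's OpenCL, covering only the map pattern without SkelCL's containers or redistribution): the pattern owns indexing and launch configuration, and user code supplies only the per-element function. Requires nvcc's extended device lambdas (`--extended-lambda`).

```cuda
// A minimal "map" skeleton: the pattern hides kernel boilerplate.
#include <cstdio>
#include <cuda_runtime.h>

template <typename F>
__global__ void map_kernel(const float* in, float* out, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(in[i]);         // skeleton applies the user function
}

template <typename F>
void map(const float* in, float* out, int n, F f) {
    map_kernel<<<(n + 255) / 256, 256>>>(in, out, n, f);
    cudaDeviceSynchronize();
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    // user code: just the element-wise computation, no kernel boilerplate
    map(in, out, n, [] __device__ (float x) { return x * x + 1.f; });

    printf("out[3] = %f (expect 10)\n", out[3]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```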
