Similar Literature
20 similar documents retrieved.
1.
高岚  赵雨晨  张伟功  王晶  钱德沛 《软件学报》2024,35(2):1028-1047
Parallel computing has become the mainstream trend. In parallel computing systems, synchronization is one of the key design issues and is essential for fully exploiting hardware performance. In recent years the GPU (graphic processing unit) has developed rapidly as the most widely used accelerator, and many applications place ever higher demands on GPU thread synchronization. However, existing GPU systems struggle to support efficiently the complex thread synchronization found in real applications. Although researchers have proposed many methods for supporting GPU thread synchronization and have made considerable progress, the GPU's unique architecture and parallel execution model mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming by purpose and granularity. On this basis, centered on the expression and execution of GPU thread synchronization, it first analyzes and summarizes the key problems and challenges: synchronization is hard to express efficiently, error-prone, and inefficient to execute. Then, organized by synchronization granularity, it reviews recent academic and industrial research on competitive and cooperative GPU thread synchronization from two angles, synchronization expression methods and performance optimization methods, and analyzes and summarizes the existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in this area.
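As a concrete reference point for the synchronization granularities surveyed above, the following CUDA fragment is a minimal illustrative sketch (not drawn from any of the surveyed systems): cooperative synchronization appears as the block-wide barrier `__syncthreads()` and the warp-level `__shfl_down_sync()` exchange, while competitive synchronization appears as the final `atomicAdd()` across blocks.

```cuda
// Illustrative only: block-level (cooperative), warp-level, and atomic
// (competitive) synchronization in one small sum-reduction kernel.
// Assumes blockDim.x == 256.
__global__ void block_and_warp_sync(const float *in, float *out, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                  // block-wide barrier

    for (int stride = blockDim.x / 2; stride >= 32; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                              // cooperative sync each step
    }

    if (tid < 32) {                                   // the first (full) warp
        float v = buf[tid];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);  // warp-level exchange
        if (tid == 0) atomicAdd(out, v);              // competitive sync across blocks
    }
}
```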

2.
Given that the concurrent L1-minimization (L1-min) problem is often required in real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (Ax) and a novel self-adaptive thread implementation of the matrix-vector multiplication (ATx) on the GPU. Vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on these kernels, the iterative shrinkage-thresholding algorithm is used to build two concurrent L1-min solvers, one organized around streams and one around thread blocks on a GPU, and their performance is optimized using newer GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results validate the effectiveness and performance of the proposed methods.
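The warp-level matrix-vector products and shuffle-based reductions mentioned above can be pictured with the generic sketch below. It is not the paper's self-adaptive kernel: it simply assigns one warp per CSR row for y = Ax, reduces partial products with `__shfl_down_sync`, and hints the read-only data cache with `__ldg`; the row pointer, column index and value arrays are the usual CSR inputs.

```cuda
// Hypothetical warp-per-row CSR SpMV: y = A * x.
// One 32-thread warp handles one row of A (CSR format).
__global__ void spmv_csr_warp(int n_rows, const int *row_ptr, const int *col_idx,
                              const float *val, const float *x, float *y) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp index
    int lane    = threadIdx.x & 31;                              // lane within the warp
    if (warp_id >= n_rows) return;

    float sum = 0.0f;
    // Lanes stride over the nonzeros of the row; __ldg reads x through the
    // read-only data cache on Kepler-class and newer GPUs.
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += val[j] * __ldg(&x[col_idx[j]]);

    // Warp-level reduction with the shuffle instruction.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}
```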

3.
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely used to solve NP-hard combinatorial optimization problems. These intelligent algorithms involve a common operation, reduction, in which the best candidate solution in the neighborhood is selected. As one of the main procedures, the reduction must be optimized on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of warp-level implementation, which better matches the characteristics of current GPU architectures. First, vectorized memory access is used to improve global memory performance. Second, at the thread-block level, an enhanced warp-based reduction in shared memory is presented to form partial results. Third, the number of thread blocks is configured from the maximum thread-block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and shows better performance than previous methods.
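A generic warp-centric sum reduction combining the three ingredients mentioned above, vectorized global loads, shuffle-based warp reduction, and per-warp partials in shared memory, is sketched below; it illustrates the technique rather than the paper's exact kernel, and it assumes the input length is given in `float4` elements.

```cuda
// Illustrative warp-centric reduction (sum) with float4 vectorized loads.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void reduce_sum(const float4 *in, int n4, float *out) {
    __shared__ float warp_partial[32];               // one slot per warp (<= 32 warps)
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;

    float sum = 0.0f;
    for (int i = tid; i < n4; i += gridDim.x * blockDim.x) {
        float4 v = in[i];                            // 128-bit vectorized load
        sum += v.x + v.y + v.z + v.w;
    }
    sum = warp_reduce_sum(sum);                      // reduce within each warp
    if (lane == 0) warp_partial[warp] = sum;
    __syncthreads();

    if (warp == 0) {                                 // first warp reduces the partials
        sum = (lane < (blockDim.x + 31) / 32) ? warp_partial[lane] : 0.0f;
        sum = warp_reduce_sum(sum);
        if (lane == 0) atomicAdd(out, sum);          // combine results across blocks
    }
}
```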

4.
Objective: Geometric correction (also called geocoding) is an important step in the spaceborne synthetic aperture radar (SAR) image processing chain; it is computationally demanding and relies on a geometric positioning model. For spaceborne SAR images, this paper adopts the rational polynomial coefficient (RPC) positioning model and proposes a massively parallel geometric correction method supported by the graphics processing unit (GPU). Method: The method exploits the abundant computing resources of the GPU and the fact that every pixel goes through identical processing steps during geometric correction. A large batch of pixels is transferred to the GPU at a time, one thread is assigned to each pixel, and each thread carries out the computationally expensive steps such as rational function evaluation, projection transformation, and interpolation resampling. Parallel performance is further improved by tuning the dimGrid and dimBlock launch parameters. Large SAR images are handled through tiled processing, and several different tile sizes are supported. Results: The experiments show a kernel speedup of 38 to 44. To analyze the characteristics of GPU parallel processing comprehensively and objectively, the overall speedup is also measured; the factors affecting overall acceleration are analyzed through several experiments, and an optimization that improves I/O performance through large-block reads and writes is proposed. Conclusion: The method is simple in form and highly general, applies to almost all spaceborne SAR images and image sizes, and delivers a clear acceleration benefit.
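The thread-per-pixel organization described above can be sketched as follows. This is only a schematic: `rpc_locate` is a hypothetical placeholder for the rational-function (RPC) evaluation and projection transform, bilinear resampling stands in for the interpolation step, and the host is assumed to choose dimGrid/dimBlock and to loop over image tiles.

```cuda
// Hypothetical thread-per-pixel geocoding kernel; rpc_locate() is a placeholder
// for the RPC ground-to-image computation, which is omitted here.
__device__ void rpc_locate(int out_row, int out_col, float *src_r, float *src_c) {
    // Placeholder: a real implementation evaluates the rational polynomial
    // coefficients and the projection transform for this output pixel.
    *src_r = (float)out_row;
    *src_c = (float)out_col;
}

__global__ void geocode_tile(const float *src, int src_h, int src_w,
                             float *dst, int dst_h, int dst_w) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // dimGrid/dimBlock chosen by host
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= dst_h || col >= dst_w) return;

    float sr, sc;
    rpc_locate(row, col, &sr, &sc);                    // geometric positioning model

    // Bilinear resampling of the source SAR image.
    int r0 = (int)floorf(sr), c0 = (int)floorf(sc);
    float out = 0.0f;
    if (r0 >= 0 && c0 >= 0 && r0 + 1 < src_h && c0 + 1 < src_w) {
        float fr = sr - r0, fc = sc - c0;
        out = (1 - fr) * (1 - fc) * src[r0 * src_w + c0]
            + (1 - fr) * fc       * src[r0 * src_w + c0 + 1]
            + fr       * (1 - fc) * src[(r0 + 1) * src_w + c0]
            + fr       * fc       * src[(r0 + 1) * src_w + c0 + 1];
    }
    dst[row * dst_w + col] = out;
}
```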

5.
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix-matrix multiplication as a first natural entry point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip, initially architected for intensive gaming applications, and exploit its architectural features to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leveraging the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices and over 38 GFlop/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
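On current CUDA toolkits, the minimally invasive entry point described above, offloading the dense matrix-matrix product, reduces to a single cuBLAS call such as the host-side sketch below (the original work targeted the GeForce 8800 era and its early CUBLAS interface; error checking is omitted here).

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Offload C = alpha*A*B + beta*C to the GPU through cuBLAS.
// A is m x k, B is k x n, C is m x n, all stored column-major on the device.
void gemm_on_gpu(cublasHandle_t handle,
                 const float *dA, const float *dB, float *dC,
                 int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major convention: leading dimensions are the row counts.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m, dB, k,
                &beta,  dC, m);
}
```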

6.
Implementations of relational operators on GPUs have achieved order-of-magnitude speedups over their multicore CPU counterparts. Here we focus on the efficient implementation of the string matching operators common in SQL queries. Because of different architectural features, the optimal algorithm for CPUs may be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in the cache in a naive implementation. On GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate memory efficiency in terms of access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth-Morris-Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
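One of the simplest parallelization methods evaluated in this kind of work assigns one candidate starting position per thread, as in the generic brute-force sketch below; it is not the KMP-based kernel the paper favors and it ignores the pivoted string layout, but it shows where thread divergence and irregular memory access enter.

```cuda
// Naive single-pattern matching: thread i tests whether the pattern occurs
// at text position i ("one starting position per thread"). The early exit in
// the inner loop is a source of divergence within a warp.
__global__ void match_naive(const char *text, int text_len,
                            const char *pattern, int pat_len,
                            int *match_flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > text_len - pat_len) return;

    int j = 0;
    while (j < pat_len && text[i + j] == pattern[j])
        ++j;
    match_flags[i] = (j == pat_len);
}
```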

7.
Graphics processing units (GPUs) pose an attractive choice for designing high-performance and energy-efficient software systems, because GPUs are capable of executing massively parallel applications. However, the performance of GPUs is limited by contention in the memory subsystems, often resulting in substantial delays and effectively reducing the parallelism. In this paper, we propose GRAB, an automated debugger to aid the development of efficient GPU kernels. GRAB systematically detects, classifies and discovers the root causes of memory-performance bottlenecks in GPUs. We have implemented GRAB and evaluated it with several open-source GPU kernels, including two real-life case studies. We show the usage of GRAB through the improvement of GPU kernels on real NVIDIA Tegra K1 hardware, a widely used GPU for mobile and handheld devices. The guidance obtained from GRAB leads to an overall improvement of up to 64%.

8.
While graphics processing units (GPUs) show high performance for problems with regular structures, they do not perform well on irregular tasks due to the mismatch between irregular problem structures and SIMD-like GPU architectures. In this paper, we introduce a new library, CUIRRE, for improving the performance of irregular applications on GPUs. CUIRRE reduces the load imbalance of GPU threads resulting from irregular loop structures. In addition, CUIRRE can characterize irregular applications in terms of their irregularity, thread granularity and GPU utilization. We employ this library to characterize and optimize both synthetic and real-world applications. The experimental results show that a 1.63× average and up to 2.76× performance improvement can be achieved with the centralized task pool approach in the library, at a 4.57% average overhead with static loading ratios. To avoid the cost of exhaustive searches of loading ratios, an adaptive loading ratio method is proposed to derive appropriate loading ratios for different inputs automatically at runtime. Our task pool approach outperforms other load balancing schemes such as task stealing and persistent threads. The CUIRRE library can easily be applied to many other irregular problems.
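A centralized task pool of the kind described is commonly built around a global counter that thread blocks advance with `atomicAdd` until the pool is drained. The sketch below illustrates that idea only; it is not the CUIRRE interface, and the host is assumed to reset `next_task` to zero (for example with `cudaMemcpyToSymbol`) before each launch.

```cuda
// Generic centralized task pool: each block repeatedly claims the next task
// with atomicAdd and its threads cooperate on the irregularly sized work.
__device__ int next_task;                         // reset to 0 by the host before launch

__global__ void irregular_kernel(const int *work_sizes, float *out, int n_tasks) {
    __shared__ int task;                          // task claimed by this block
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(&next_task, 1);      // claim one task for the block
        __syncthreads();
        if (task >= n_tasks) break;               // pool exhausted

        float local = 0.0f;
        for (int i = threadIdx.x; i < work_sizes[task]; i += blockDim.x)
            local += 1.0f;                        // placeholder for per-element work
        atomicAdd(&out[task], local);             // combine the block's contribution
        __syncthreads();                          // done with `task` before it is reused
    }
}
```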

9.
Many-core accelerators are being deployed more frequently to improve system processing capabilities. In such systems, application mapping must be enhanced to maximize utilization of the underlying architecture. Especially in systems with graphics processing units (GPUs), mapping the kernels that make up multi-kernel applications has a great impact on overall performance, since kernels may exhibit different characteristics on CPUs and GPUs. While some kernels run faster on GPUs, others perform better on CPUs. Thus, heterogeneous execution may yield better performance than executing the application only on a CPU or only on a GPU. In this paper, we investigate two approaches: a novel profiling-based adaptive kernel mapping algorithm that assigns each kernel of an application to the proper device, and a Mixed-Integer Programming (MIP) implementation to determine the optimal mapping. We utilize profiling information for kernels on different devices and generate a map that identifies which kernel should run where in order to improve the overall performance of an application. Initial experiments show that our approach can efficiently map kernels onto CPUs and GPUs and outperforms CPU-only and GPU-only approaches.

10.
Membrane systems are parallel distributed computing models that are used in a wide variety of areas. Simulating membrane systems on a sequential machine loses the advantage of parallelism inherent in membrane computing. In this paper, an innovative classification algorithm based on a weighted network is introduced, and two new algorithms are proposed for simulating membrane system models on a graphics processing unit (GPU). Communication and synchronization between threads and between thread blocks on a GPU are time-consuming operations. In previous studies, dependent objects were assigned to different threads, which increases the need for inter-thread communication and therefore decreases performance; dependent membranes were likewise assigned to different thread blocks, requiring inter-block communication and again decreasing performance. The speedup of the proposed algorithm on a GPU, which handles dependent objects sequentially within a thread, was 82× for an example with 512 objects per membrane, while the previous approach (Algorithm 1) achieved 8.2×. For a membrane system with high dependency among membranes, the speedup of the second proposed algorithm (Algorithm 3) was 12×, while the previous approach (Algorithm 1) and the first proposed algorithm (Algorithm 2), which assign each membrane to one thread block, achieved 1.8×.

11.
This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first-order well-balanced finite-volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified, and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver.

12.
Graphics processing units (GPUs) offer parallel computing power that would otherwise require a cluster of networked computers or a supercomputer. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using the Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load on the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.

13.
Recent advances in the graphics processing unit (GPU) make high-performance general-purpose computing available at low cost. GPU programming models such as CUDA (compute unified device architecture) and OpenCL (open computing language) provide programmers with a rich, C-like application programming interface (API), making it convenient to exploit the GPU's parallel computing power. Using graphics hardware for accelerated computation, this paper analyzes existing GPU N-body implementations through a new GPU processing model, the parallel time-space model, and on that basis proposes a new algorithm for fast N-body simulation on the GPU, implemented on an AMD Radeon HD 5850. The experimental results show a speedup of roughly 400× over a CPU implementation and of 2× to 5× over existing GPU implementations.
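The usual baseline that GPU N-body work builds on is the tiled all-pairs force computation: one thread per body, with bodies staged tile by tile into shared memory. The sketch below shows that baseline (launch with `blockDim.x == TILE`); it is not the paper's parallel time-space algorithm.

```cuda
// Classic all-pairs N-body acceleration kernel with shared-memory tiling.
// pos[i].xyz is position, pos[i].w is mass; soft2 is the softening term.
#define TILE 256

__global__ void nbody_accel(const float4 *pos, float3 *acc, int n, float soft2) {
    __shared__ float4 tile_pos[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 ai = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        tile_pos[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();                             // tile fully loaded

        for (int t = 0; t < TILE && base + t < n; ++t) {
            float4 pj = tile_pos[t];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + soft2;
            float inv_r = rsqrtf(r2);
            float s = pj.w * inv_r * inv_r * inv_r;  // m_j / r^3
            ai.x += dx * s; ai.y += dy * s; ai.z += dz * s;
        }
        __syncthreads();                             // tile no longer needed
    }
    if (i < n) acc[i] = ai;
}
```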

14.
Spatial interpolation is a computationally complex and time-consuming operation in geographic information system (GIS) spatial analysis and therefore cannot meet real-time requirements. With the substantial increase in the floating-point performance of graphics processing units (GPUs), general-purpose GPU computing has become a research focus for the complex computations in GIS and offers a good opportunity to make some traditionally inefficient algorithms run in real time. Exploiting the GPU's advantage in parallel computing, the inverse distance weighting (IDW) interpolation algorithm is mapped onto the compute unified device architecture (CUDA) parallel programming model. A two-level index is first built on the GPU to partition the computation into reasonable levels, and the parallel interpolation is then executed with a multi-thread blocking strategy. Experiments show that the interpolation error of this method relative to the CPU version stays on the order of 10^-6, and that with a larger interpolation radius and more interpolation points the algorithm achieves a speedup of more than 40×, which fully demonstrates its correctness and efficiency.
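A minimal brute-force IDW kernel, one thread per query point and no spatial index, illustrates the mapping to CUDA; the paper's two-level index and blocked multi-thread strategy are omitted, and the sample/query arrays and `power` exponent below are illustrative parameters.

```cuda
// Minimal inverse-distance-weighting (IDW) kernel: one thread per query point,
// brute force over all sample points (no spatial index).
__global__ void idw_interpolate(const float2 *sample_xy, const float *sample_v,
                                int n_samples, const float2 *query_xy,
                                float *query_v, int n_queries, float power) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= n_queries) return;

    float2 p = query_xy[q];
    float num = 0.0f, den = 0.0f;
    for (int s = 0; s < n_samples; ++s) {
        float dx = sample_xy[s].x - p.x, dy = sample_xy[s].y - p.y;
        float d2 = dx * dx + dy * dy;
        if (d2 < 1e-12f) { num = sample_v[s]; den = 1.0f; break; }  // exact hit
        float w = powf(d2, -0.5f * power);        // weight = 1 / d^power
        num += w * sample_v[s];
        den += w;
    }
    query_v[q] = num / den;
}
```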

15.
Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real-time requirement and has been widely implemented on custom hardware, FPGAs and GPUs. We demonstrate a GPU implementation that achieves competitive performance with low development cost. Our solution treats the irregularity inherent in the algorithm using a novel dynamic warp scheduling approach that eliminates thread divergence. This new scheme also employs a thread pool mechanism, which significantly alleviates the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.

16.
NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available for parallel programming models beyond the original OpenMP and MPI versions. This work joins those efforts by providing a new CUDA implementation of NPB. Our contribution covers several aspects beyond the implementation itself. First, we define design principles based on best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide easy-to-use parametrization support for configuring the number of threads per block in our version. Third, we conduct a broad study of the impact of the number of threads per block on the benchmarks. Fourth, we propose and evaluate five strategies for helping to find a better threads-per-block configuration. The results reveal relevant performance improvements obtained solely by changing the number of threads per block, ranging from 8% up to 717% across the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, required code refactoring, and parallelism implementations. The performance results show improvements of up to 267% over the best benchmark versions available. We also discuss the best and worst design choices with respect to the trade-off between code size and performance. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how the computations affect the GPU's behavior.
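The threads-per-block parametrization discussed above can be exposed as a simple launch helper, sketched below for a toy SAXPY kernel; the fallback to `cudaOccupancyMaxPotentialBlockSize` is just one plausible default heuristic and is not claimed to be one of the paper's five strategies.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Launch with an explicit threads-per-block parameter; if a non-positive value
// is passed, fall back to the CUDA occupancy heuristic. This mirrors the kind
// of configurable parametrization the study argues for, not the NPB-CUDA code.
void launch_saxpy(int n, float a, const float *x, float *y, int threads_per_block) {
    if (threads_per_block <= 0) {
        int min_grid = 0;
        cudaOccupancyMaxPotentialBlockSize(&min_grid, &threads_per_block, saxpy, 0, 0);
    }
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    saxpy<<<blocks, threads_per_block>>>(n, a, x, y);
}
```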

17.
Programmers usually implement iterative methods that solve partial differential equations by expressing them using a sequence of basic kernels from libraries optimized for the graphics processing unit (GPU). The global runtime of the resulting combination is often penalized by the smallest and most inefficient vector operations. To improve GPU exploitation, we identify and analyze the potential kernels to be fused according to the data dependence, data type and size, and GPU resources. This paper provides an extensive analysis of the impact of fusing vector operations [level 1 of the Basic Linear Algebra Subprograms (BLAS)] on the performance of the GPU. The experimental evaluation shows that this optimization provides noticeable improvement, especially for kernels with lower memory requirements and on more modern GPUs. It is worth noting that fused BLAS operations can be very useful in helping programmers efficiently code iterative methods that solve large linear systems of equations on the GPU. Iterative methods such as the biconjugate gradient method (BCG) are among the examples that can benefit from this optimization strategy. Indeed, kernel fusion of vector routines makes the most efficient GPU implementation of BCG run between 1.09× and 1.27× faster on three GPUs of different characteristics.
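The effect of fusing level-1 BLAS operations can be seen in a small example: instead of launching an AXPY kernel and then a dot-product kernel (two passes over the vectors), a single fused kernel performs the update and accumulates the dot product in one pass, halving the global-memory traffic for those vectors. The sketch below is generic and not the paper's fused BCG kernels; `dot` is assumed to be zeroed before launch.

```cuda
// Fused AXPY + dot product: y = y + alpha*x and dot = <y, y> in one pass.
__global__ void fused_axpy_dot(int n, float alpha, const float *x, float *y,
                               float *dot /* must be zeroed before launch */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float contrib = 0.0f;
    if (i < n) {
        float yi = y[i] + alpha * x[i];   // AXPY update
        y[i] = yi;
        contrib = yi * yi;                // local dot-product term
    }
    // Warp-level reduction of the partial dot products, then one atomic per warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        contrib += __shfl_down_sync(0xffffffffu, contrib, offset);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(dot, contrib);
}
```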

18.
Fine-grained task-parallel general matrix multiplication on GPUs
Dense linear algebra is essential to many real-world applications such as pattern recognition and bioinformatics, and general matrix multiplication (GEMM) is the foundation of dense linear algebra. In cuBLAS and MAGMA, GEMM is implemented as a set of kernel functions and achieves very high performance for large GEMM computations. However, the existing implementations deliver limited performance for batches of small GEMM computations, and they cannot automatically scale across multiple GPUs of different performance while maintaining load balance. This paper proposes task-parallel GEMM (TPGEMM), which implements batched GEMM and multi-GPU GEMM with fine-grained task parallelism. The computation of one or more GEMMs can be split into multiple tasks that are dynamically scheduled onto one or more GPUs. TPGEMM avoids the overhead of launching multiple kernel functions for batched GEMM and achieves significantly higher performance than cuBLAS and MAGMA for batched matrix multiplication. On top of low-overhead fine-grained task scheduling, TPGEMM supports automatic parallelization of a single GEMM across multiple GPUs and achieves nearly 100% scaling efficiency on a workstation with four GPUs of different performance.
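The coarse idea of spreading one GEMM over several GPUs can be sketched by partitioning the columns of B and C across devices and issuing one cuBLAS call per device, as below. This static, even split is only an illustration; TPGEMM instead schedules fine-grained tasks dynamically to balance GPUs of different speeds. One cuBLAS handle per device and pre-distributed device buffers (`dA`, `dB`, `dC`) are assumed.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Split C = A*B (column-major, (m x k)*(k x n)) across GPUs by giving each
// device a contiguous block of columns of B and C. dA[g] holds all of A on
// device g; dB[g]/dC[g] hold that device's column block. The cuBLAS calls are
// asynchronous, so work on different devices overlaps.
void multi_gpu_gemm(int n_gpus, cublasHandle_t *handles,    // one handle per device
                    float **dA, float **dB, float **dC,
                    int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    int cols_per_gpu = (n + n_gpus - 1) / n_gpus;
    for (int g = 0; g < n_gpus; ++g) {
        int col0 = g * cols_per_gpu;
        int cols = (col0 + cols_per_gpu <= n) ? cols_per_gpu : (n - col0);
        if (cols <= 0) break;
        cudaSetDevice(g);
        cublasSgemm(handles[g], CUBLAS_OP_N, CUBLAS_OP_N,
                    m, cols, k,
                    &alpha, dA[g], m, dB[g], k,
                    &beta,  dC[g], m);
    }
}
```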

19.
焦良葆  陈瑞 《计算机工程》2010,36(18):10-12
The efficiency of a parallel algorithm on the GPU depends on the average execution efficiency of its kernel functions on the streaming multiprocessors. Based on this observation, this paper analyzes how GPU kernels execute and the relationship among grids, thread blocks, and threads, and refines a ray tracing algorithm by splitting its kernel functions into finer-grained ones. Experimental results show that the kernel size settings and distribution direction affect coherence within thread blocks, and that kernel refinement increases the number of warps running concurrently within a thread block.

20.
GPUs are slowly becoming ubiquitous devices in High Performance Computing, as their ability to improve the performance per watt of compute-intensive algorithms compared with multicore CPUs has been recognized. The primary shortcoming of a GPU is usability: vendor-specific APIs are quite different from existing programming languages, and substantial knowledge of the device and the programming interface is required to optimize applications. Hence, a growing number of higher-level programming models now target GPUs to alleviate this problem. The ultimate goal for a high-level model is to expose an easy-to-use interface for the user to offload compute-intensive portions of code (kernels) to the GPU and to tune the code for the target accelerator, maximizing overall performance with reduced development effort. In this paper, we share our experiences with three notable high-level directive-based GPU programming models, PGI, CAPS and OpenACC (from CAPS and PGI), on an Nvidia M2090 GPU. We analyze their performance and programmability on Isotropic (ISO) and Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components of the Reverse Time Migration (RTM) application used in oil and gas exploration for seismic imaging of the subsurface. When ported to a single GPU using the mentioned directives, we observe an average 1.5–1.8x improvement in performance for both ISO and TTI kernels compared with optimized multi-threaded CPU implementations using OpenMP.
