Similar Literature
20 similar articles found
1.
高岚  赵雨晨  张伟功  王晶  钱德沛 《软件学报》2024,35(2):1028-1047
Parallel computing has become the mainstream trend. In a parallel computing system, synchronization is one of the key design points and is critical to fully exploiting the hardware. In recent years the GPU (graphics processing unit) has developed rapidly as the most widely used accelerator, and many applications place ever higher demands on GPU thread synchronization. Existing GPU systems, however, struggle to support efficiently the complex thread synchronization found in real applications. Researchers have proposed many approaches to GPU thread synchronization and made considerable progress, but the GPU's distinctive architecture and parallel execution model mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming by purpose and granularity. On that basis, centered on how GPU thread synchronization is expressed and executed, it first summarizes the key problems and challenges: synchronization is hard to express efficiently, error-prone, and inefficient to execute. It then reviews recent academic and industrial work on competitive and cooperative GPU thread synchronization at each granularity, from the two angles of expressing synchronization and optimizing its performance, and analyzes and summarizes the existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, as a reference for researchers in this area.
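A minimal sketch of two synchronization granularities the survey discusses, written with CUDA cooperative groups; the two-phase computation is a placeholder of my choosing, not from the paper. A block-wide barrier is expressed as __syncthreads() (or block.sync()), while a grid-wide barrier requires grid.sync(), a cooperative launch via cudaLaunchCooperativeKernel, and a device of compute capability 6.0 or newer.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Phase 1 writes tmp; the grid-wide barrier guarantees every block has
    // finished phase 1 before any thread reads its neighbor in phase 2.
    __global__ void twoPhase(const float* in, float* tmp, float* out, int n) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = in[i] * in[i];              // phase 1
        grid.sync();                                    // grid-scoped barrier
        if (i + 1 < n) out[i] = tmp[i] + tmp[i + 1];    // phase 2 reads neighbors safely
    }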

2.
As the performance of general-purpose GPUs (GPGPU, general-purpose computation on graphics processing units) keeps improving, heterogeneous systems built from CPUs and GPUs have become a research focus in high-performance computing. As parallel computing systems keep growing, however, system reliability keeps falling and has become a constraint on large-scale parallel computing that can no longer be ignored. Because commodity GPGPUs have weak fault-tolerance capabilities, the reliability problem of large CPU+GPU heterogeneous systems is especially acute, and practical fault-tolerance mechanisms are still lacking. To address this, the paper proposes RB-TMR (Rollback TMR), a GPU multi-replica fault-tolerance technique based on redundant threads. It analyzes and describes the design, implementation, and compilation framework of this heterogeneous fault-tolerance mechanism in light of the heterogeneous programming model and program characteristics, then implements the technique and evaluates its performance on ten test cases. The technique offers a new line of thought for fault-tolerance research on heterogeneous systems.
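A hedged sketch of the general redundant-thread idea behind multi-replica fault tolerance: the same computation runs in three replicas and a majority vote masks a single faulty result. This is not the paper's RB-TMR implementation (which adds rollback and a compiler framework); the replica layout, the placeholder computation, and the failure flag are illustrative assumptions.

    // Replica r = blockIdx.y computes the same result into its own slice of out3.
    __global__ void replicated(const float* in, float* out3, int n) {
        int r = blockIdx.y;                                // replica id: 0, 1, 2
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out3[r * n + i] = in[i] * 2.0f + 1.0f;  // placeholder computation
    }

    // Majority vote; with no majority, set a flag so the host can roll back.
    __global__ void vote(const float* out3, float* out, int* fail, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float a = out3[i], b = out3[n + i], c = out3[2 * n + i];
        if (a == b || a == c) out[i] = a;
        else if (b == c)      out[i] = b;
        else                  atomicExch(fail, 1);         // no majority: trigger rollback
    }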

3.
As the bridge between application software models and computer hardware, the programming model is self-evidently important in computing. But as graphics processing units (GPUs) capable of fine-grained parallel computation entered the mainstream market, the development of matching programming models lagged behind. NVIDIA's Compute Unified Device Architecture (CUDA), introduced with the GeForce 8 series, freed general-purpose GPU computing (GPGPU) from the graphics pipeline and high-level shading languages: developers can perform high-performance parallel computation in a single-instruction multiple-data (SIMD) style without mastering graphics programming. The paper studies the CUDA parallel computing model from the aspects of its features, composition, and parallel architecture, showing that GPU-based high-performance parallel computing is an important path for meeting today's large-scale computing demands.
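A complete minimal CUDA program in the SIMT/SIMD style the abstract describes: one thread per data element, indexed by the built-in block and thread coordinates. The vector-add computation is the customary introductory example, not taken from the paper.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one output element of c = a + b.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));       // unified memory for brevity
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);   // 256 threads per block
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);                    // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }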

4.
This paper studies the implementation of a dynamic pattern recognition algorithm on a GPU parallel computing platform. With the development of GPGPU (general-purpose computation on graphics processors) hardware, large-scale GPU-based parallel computing can effectively handle the massive computation the algorithm entails. The paper introduces the dynamic pattern recognition algorithm, analyzes its large computational load, decomposes its compute-intensive parts for parallel execution, removes the dependences present in the original algorithm's execution, and finally implements the algorithm on a specific GPU platform, Jacket. Experiments show that, compared with the original serial CPU program, the parallelized program running on the GPU achieves a clear speedup and therefore has real engineering value.

5.
With the development of the graphics processing unit (GPU), and in particular changes in its architecture, the graphics card is no longer a fixed-pipeline graphics processor but a programmable parallel stream processor with a structure similar to single-instruction multiple-data. General-purpose GPU computing (GPGPU) has become a research focus in recent years and has found applications in many fields. This paper applies GPGPU to the rectification of stereo image pairs, effectively solving the low computational efficiency of the image interpolation step. Experiments verify the effectiveness of the algorithm.
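The interpolation step the abstract accelerates maps naturally to one GPU thread per output pixel. Below is a hedged sketch of bilinear remapping; the warp-map interface (mapx/mapy giving each destination pixel its source coordinates) and the single-channel image layout are illustrative assumptions, not the paper's actual interface.

    // One thread per destination pixel: fetch four source neighbors and blend.
    __global__ void remapBilinear(const unsigned char* src, unsigned char* dst,
                                  const float* mapx, const float* mapy,
                                  int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        float fx = mapx[y * w + x], fy = mapy[y * w + x];  // source position
        int x0 = (int)fx, y0 = (int)fy;
        if (x0 < 0 || y0 < 0 || x0 >= w - 1 || y0 >= h - 1) { dst[y * w + x] = 0; return; }
        float ax = fx - x0, ay = fy - y0;                  // fractional offsets
        float p = (1 - ax) * (1 - ay) * src[y0 * w + x0]
                +       ax * (1 - ay) * src[y0 * w + x0 + 1]
                + (1 - ax) *       ay * src[(y0 + 1) * w + x0]
                +       ax *       ay * src[(y0 + 1) * w + x0 + 1];
        dst[y * w + x] = (unsigned char)(p + 0.5f);
    }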

6.
With the rapid development of graphics hardware in recent years, the graphics processing unit (GPU) has grown ever more powerful. Modern GPUs offer a degree of programmability, allowing user-defined functions to replace the original behavior of certain stages of the fixed graphics pipeline, which makes the GPU function more like a general-purpose processor (GPGPU). This paper studies the triangle-mesh surface intersection algorithm used frequently in geological modeling software. Through a concrete analysis of the mesh intersection problem and of the graphics hardware, it exploits the hardware's specialized design, high floating-point throughput, and high memory bandwidth to implement an efficient mesh intersection algorithm based on visibility queries.

7.
卫易东 《信息与电脑》2023,(18):55-57+61
In edge computing environments, when upper-layer applications invoke the GPU's Compute Unified Device Architecture (CUDA) for computation, CUDA thread divergence may occur, lengthening execution or leaving threads idle. This paper introduces the basic principles and concepts of low-level CUDA development and explains the execution flow of a CUDA computation. Based on an analysis of the GPU architecture, it presents the adjacent (interleaved) reduction algorithm and a divergence-avoiding variant of it, together with how to implement and apply them.
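A sketch of the two shared-memory reduction variants the abstract contrasts, under the assumption of a 256-thread block. In the adjacent (interleaved) form, the active threads at each step are scattered across a warp, so both sides of the branch must be executed; in the contiguous form, active threads stay packed, so whole warps retire early instead of diverging.

    // Adjacent/interleaved reduction: tid % (2*stride) scatters active lanes.
    __global__ void reduceAdjacent(const float* in, float* out) {
        __shared__ float s[256];                        // assumes blockDim.x == 256
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        for (int stride = 1; stride < blockDim.x; stride *= 2) {
            if (tid % (2 * stride) == 0) s[tid] += s[tid + stride];  // warp diverges
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0];
    }

    // Divergence-avoiding reduction: active threads are always 0..stride-1.
    __global__ void reduceContiguous(const float* in, float* out) {
        __shared__ float s[256];                        // assumes blockDim.x == 256
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) s[tid] += s[tid + stride];  // packed lanes, no divergence
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = s[0];
    }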

8.
In recent years, the rapidly expanding programmability of the graphics processing unit (GPU), combined with the speed and parallelism of the rendering pipeline, has quickly made general-purpose GPU computing (GPGPU) a research hotspot. To address the low efficiency of the back-propagation (BP) algorithm on large-scale neural networks, this paper proposes a GPU-accelerated BP algorithm: the network's forward computation and backward learning are recast as GPU texture rendering passes, so that the GPU's strong floating-point capability and highly parallel execution can be used to solve the BP algorithm. Experimental results show that, with no loss in the accuracy of the results, the method runs markedly more efficiently.
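The paper's texture-rendering formulation predates CUDA; the same forward-pass arithmetic can be sketched today as an ordinary kernel, one thread per output neuron. The row-major weight layout and sigmoid activation are assumptions for illustration, not the paper's implementation.

    // y[j] = sigmoid(b[j] + sum_i W[j][i] * x[i]) for one network layer.
    __global__ void forwardLayer(const float* W, const float* x, const float* b,
                                 float* y, int nIn, int nOut) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= nOut) return;
        float acc = b[j];
        for (int i = 0; i < nIn; ++i)
            acc += W[j * nIn + i] * x[i];               // weighted sum over inputs
        y[j] = 1.0f / (1.0f + expf(-acc));              // sigmoid activation
    }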

9.
This paper proposes a GPU-based image-space collision detection method. Using OpenGL depth textures and the framebuffer object (FBO) extension, depth values are rendered directly to texture; the texture data are then processed with general-purpose GPU (GPGPU) techniques to perform collision detection. Simulation experiments show that the method makes full use of the GPU's capabilities and correctly returns collision detection results.

10.
蔡勇  李胜 《计算机应用》2016,36(3):628-632
To address the high hardware cost and low development productivity of traditional parallel approaches to fast structural topology optimization, this paper proposes a full-pipeline parallel computing strategy for the bi-directional evolutionary structural optimization (BESO) method based on Matlab and the graphics processing unit (GPU). First, it examines the advantages, disadvantages, and applicability of the three ways of doing GPU parallel computing within the Matlab environment. It then parallelizes the vector and dense-matrix computations of the topology optimization algorithm directly with built-in functions, solves the sparse finite-element equation systems quickly via a MEX function calling the cuSOLVER library, and parallelizes the element sensitivity analysis and other optimization decisions with parallel thread execution (PTX) code. Numerical examples show that developing GPU parallel programs directly in Matlab is not only productive but also avoids precision differences between programming languages, so the GPU parallel program achieves a considerable speedup while leaving the computed results unchanged.

11.
The computing power of graphics processing units (GPU) has increased rapidly, and there has been extensive research on general-purpose computing on GPU (GPGPU) for cryptographic algorithms such as RSA, the Elliptic Curve Cryptosystem (ECC), NTRU, and the Advanced Encryption Standard. With the rise of GPGPU, commodity computers have become complex heterogeneous GPU+CPU systems. This new architecture poses new challenges and opportunities in high-performance computing. In this paper, we present high-speed parallel implementations of the rainbow method based on perfect tables, known as the most efficient time-memory trade-off, on the heterogeneous GPU+CPU system. We give a complete analysis of the effect of multiple checkpoints on reducing the cost of false alarms and take advantage of it for load balancing between GPU and CPU. For the GTX460, our implementation is about 1.86 and 3.25 times faster than the other GPU-accelerated implementations, RainbowCrack and Cryptohaze, respectively, and for the GTX580, 1.53 and 2.40 times faster.

12.
General Purpose Graphic Processing Unit (GPGPU) computing with CUDA has been used effectively in scientific applications, where huge accelerations have been achieved. However, while today's traditional GPGPU can reduce the execution time of parallel code many times over, it does so at the expense of significant power and energy consumption. In this paper, we propose a ubiquitous parallel computing approach for constructing a decision tree on the GPU. We exploit the parallelism of the well-known ID3 algorithm for decision tree learning at two levels: at the outer level of building the tree node by node, and at the inner level of sorting data records within a single node. Our approach thus not only accelerates decision tree construction via GPU computing, but does so while taking care of the GPU's power and energy consumption. Experimental results show that our approach outperforms a purely GPU-based implementation and a CPU-based sequential implementation by several times.

13.
We present a general-purpose computing on graphics processing units (GPGPU) based computational program and framework for the electronic dynamics of atomic systems under intense laser fields. We present results for the case of hydrogen; however, the code is trivially extensible to problems within the single-active-electron (SAE) approximation. Building on our previous work, we introduce the first available GPGPU-based implementation of the Taylor, Runge-Kutta, and Lanczos methods created specifically with strong-field ab-initio simulations in mind: CLTDSE. The code uses finite-difference methods and the OpenCL framework for GPU acceleration, with hydrogen as the classic test system. After introducing the standard theory and the specific quantities that are calculated, the code, including installation and usage, is discussed in depth. This is followed by some examples and a short benchmark between an 8-hardware-thread (i.e., logical core) Intel Xeon CPU and an AMD 6970 GPU, where the parallel algorithm runs 10 times faster on the GPU than on the CPU.

14.
CPU/GPU heterogeneous systems have great development potential; in-depth study of parallel optimization on CPU/GPU heterogeneous platforms can maximize the overall computing power of the system. Optimizing the division of tasks between CPU and GPU balances their loads, improving the utilization of computing resources and shortening the execution time of computational tasks; optimizing the division of GPU threads lets the GPU achieve higher speed. Together these raise overall system performance.

15.
An optimized method for computing the principal eigenvectors of large-scale sparse matrices
Computing the principal eigenvectors (PEC: principal eigenvectors computing) of a matrix is an important problem in scientific and engineering computing. With the rise of general-purpose computing on graphics processing units (GPGPU), using the GPU to optimize PEC for large-scale sparse matrices has attracted wide attention. This paper analyzes the performance bottlenecks of the PEC computation from two perspectives, application characteristics and GPU architecture, proposes a GPU-oriented sparse-matrix storage format, GPU-ELL, together with a thread-mapping optimization strategy for the GPU, and designs a corresponding optimized PEC algorithm. Experimental results on an ATI Radeon HD 5850 show a speedup of up to about 200x over a traditional CPU, and a 2x speedup over an existing GPU implementation.
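The details of GPU-ELL are the paper's own, but the underlying ELL idea can be sketched: each row holds a fixed number of (value, column) slots, padded with zeros, and the arrays are laid out column-major so that consecutive rows, handled by consecutive threads, access memory coalesced. The principal eigenvector is then obtained by power iteration, i.e., repeated SpMV followed by normalization.

    // One thread per matrix row; column-major slot layout gives coalesced loads.
    __global__ void spmvEll(const float* vals, const int* cols,
                            const float* x, float* y,
                            int nRows, int maxNnzPerRow) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nRows) return;
        float dot = 0.0f;
        for (int k = 0; k < maxNnzPerRow; ++k) {
            int idx = k * nRows + row;                  // slot k of this row
            float v = vals[idx];
            if (v != 0.0f) dot += v * x[cols[idx]];     // skip padding entries
        }
        y[row] = dot;
    }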

16.
General-purpose computing on graphics processing units (GPGPU) is becoming increasingly popular due to its high computational throughput on data-parallel applications. Modern GPU architectures have limited capability for error detection and fault tolerance, since they were originally designed for graphics processing. However, general-purpose applications require rigorous execution correctness, which makes reliability a growing concern in GPGPU architecture design. With CMOS processing technologies continuously scaling down to the nanoscale, the on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs, with hundreds of cores integrated into a single chip, are prone to a high SER. This paper takes a first step toward modeling and characterizing GPGPU reliability in light of soft errors. We develop GPGPU-SODA (GPGPU SOftware Dependability Analysis), a framework to estimate the soft-error vulnerability of the GPGPU microarchitecture. Using GPGPU-SODA, we observe that several microarchitectural structures in GPGPUs exhibit high soft-error susceptibility, and that structure vulnerability is sensitive to workload characteristics (e.g., branch divergence and memory access patterns). We further investigate the impact of several architectural optimizations on GPU soft-error robustness. For example, we find that increasing the number of threads supported by the GPU significantly affects GPGPU soft-error robustness, whereas changing the warp scheduling policy has little impact on structure vulnerability. These observations give designers useful guidance for building resilient GPGPUs: a comprehensive resiliency solution should consider the entire GPGPU design rather than focusing solely on a particular structure.

17.
Given that the reliability of a very large-scale system is inversely related to its number of computing elements, fault tolerance has become a major concern in high-performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault-tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault-tolerance mechanisms incur additional costs, and these can cause a significant performance drop if they are not used carefully. This paper presents a novel fault-tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. The work focuses on a checkpoint scheduling model that aims to minimize fault-tolerance costs. Additionally, a GPU performance analysis is conducted, and the effect of the checkpoint/restart mechanism on application performance is thoroughly studied and discussed.
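The abstract does not state the paper's scheduling model; as a point of reference only, the standard first-order baseline for choosing a checkpoint interval is Young's approximation, where delta is the time to write one checkpoint and M is the system's mean time between failures:

    T_opt ≈ sqrt(2 * delta * M)

Intervals much shorter than T_opt waste time writing checkpoints; much longer intervals waste time recomputing lost work after a failure.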

18.
This paper describes how to optimize a practical prestack Kirchhoff time migration program with the Compute Unified Device Architecture (CUDA) on a general-purpose GPU (GPGPU). Several useful GPGPU optimization methods are demonstrated, such as increasing the number of kernel threads on the GPU cores and using memory streams to overlap GPU kernel execution. The floating-point errors on CUDA and NVIDIA's GPUs are discussed in detail, and some effective methods for reducing them are introduced. The images generated by the practical prestack Kirchhoff time migration programs on CPU and GPU for the same real-world seismic data inputs are compared. The final GPGPU version on an NVIDIA GTX 260 is more than 17 times faster than the original CPU version on an Intel Pentium 4 3.0 GHz.
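A minimal sketch of the stream-overlap technique mentioned above: input chunks are issued in alternating CUDA streams so the copy of one chunk overlaps the kernel of another. The doubling kernel is a placeholder; truly asynchronous copies also require page-locked (pinned) host memory.

    #include <cuda_runtime.h>

    __global__ void process(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;                        // placeholder computation
    }

    // hPinned: pinned host buffer of n floats; dBuf: device buffer of n floats.
    void pipelined(float* hPinned, float* dBuf, int n, int nChunks) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
        int chunk = n / nChunks;                        // assumes nChunks divides n
        for (int k = 0; k < nChunks; ++k) {
            cudaStream_t st = s[k % 2];                 // alternate streams
            float* h = hPinned + k * chunk;
            float* d = dBuf + k * chunk;
            cudaMemcpyAsync(d, h, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
            process<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
            cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
        }
        cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    }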

19.
With the continued development of GPGPU computing technology, the architecture of HPC (high-performance computing) systems is quietly undergoing a transformation that gives high-performance computing a new direction. CUDA is a C-language programming platform provided by NVIDIA for developing parallel computing applications on GPGPUs. With it, the high computing power of supported graphics cards can be harnessed for large-scale high-performance computation, effectively raising the utilization of a computer system. This paper surveys the current state of GPU development and how to develop parallel computing software with CUDA programming.

20.
The general-purpose graphics processing unit (GPGPU) plays an important role in massive parallel computing today. A GPGPU core typically holds thousands of threads, with hardware threads organized into warps. With the single-instruction multiple-thread (SIMT) pipeline, a GPGPU can achieve high performance, but threads taking different branches within the same warp violate the SIMD style and cause branch divergence. To support this, a hardware stack is used to execute all branch paths sequentially, so branch divergence degrades performance. This article represents the PDOM (post-dominator) stack as a binary tree in which each leaf corresponds to a branch target, and proposes a new stack, PDOM-ASI, that can schedule all the tree's leaves. The new stack can hide more long-operation latencies with more schedulable warps, without the problem of warp over-subdivision. In addition, a multi-level warp scheduling policy is proposed that lets some warps run ahead, creating more opportunities to hide latencies. Simulation results show that our policies achieve a 10.5% performance improvement over the baseline policies with only 1.33% hardware area overhead.
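A minimal illustration of the divergence the article analyzes: even and odd lanes of one warp take different targets, so the SIMT hardware executes the two paths one after the other with complementary lane masks, reconverging at the immediate post-dominator. The arithmetic is a placeholder.

    __global__ void divergentKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)                    // lanes of a warp split here
            out[i] = in[i] * in[i];        // path A runs with odd lanes masked off
        else
            out[i] = in[i] + 1.0f;         // path B runs with even lanes masked off
        // the warp reconverges at the immediate post-dominator (here)
    }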
