1.

张延松刘专韩瑞琛张宇王珊《软件学报》2023,34(11):5205-5229

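GPU databases have attracted considerable attention in both academia and industry in recent years. Although a number of prototype and commercial systems (including open-source ones) have been developed as next-generation database systems, it remains unclear whether GPU-based OLAP engines really outperform CPU systems and, if they do, which workloads, data, and query processing models suit them best; these questions need deeper study. GPU-based OLAP engines follow two main technical routes: the GPU in-memory processing mode and the GPU acceleration mode. The former stores the entire dataset in GPU device memory to fully exploit the GPU's computing power and high-bandwidth memory; its drawbacks are that the limited capacity of GPU device memory constrains dataset size, and that storing data with sparse access patterns lowers the storage efficiency of device memory. The latter stores only part of the dataset in GPU device memory and supports large datasets by accelerating compute-intensive workloads on the GPU; the main challenge is choosing data-distribution and workload-distribution models for device memory that minimize PCIe transfer cost and maximize GPU computing efficiency. This work aims to integrate both routes into one OLAP acceleration engine. It studies OLAP Accelerator, a customized OLAP framework for hybrid CPU-GPU platforms; designs three OLAP computing models (CPU in-memory computing, GPU in-memory computing, and GPU acceleration); implements vectorized query processing on the GPU platform; optimizes device memory utilization and query performance; and explores the different technical routes and performance characteristics of GPU databases. Experimental results show that the GPU in-memory vectorized query processing model achieves the best results in both performance and memory utilization, with speedups of 3.1x and 4.2x over OmniSciDB and Hyper, respectively. The partition-based GPU acceleration mode accelerates only the join workload in order to balance the load between the CPU and the GPU, and can support larger datasets than the GPU in-memory mode.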

2.

GPU performance optimization for the OpenCL model  Total citations: 1, self-citations: 0, citations by others: 1

陈钢吴百锋《计算机辅助设计与图形学学报》2011,23(4):571-581

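The high cost-performance ratio of GPUs has drawn more and more general-purpose computing to them. To fully exploit the general-purpose computing capability of GPUs on heterogeneous processing platforms, a performance optimization method for the OpenCL model is proposed. The method builds a polyhedral representation of the source program and optimizes and allocates the GPU's global memory and fast memory separately; by detecting memory access patterns it discovers vectorizable memory access instances, uses data-space transformations to convert the memory access patterns, and then uses vector data types to...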

3.

Research on optimizing sparse matrix storage formats on GPUs

杨世伟蒋国平宋玉蓉涂潇《计算机工程》2019,45(9)

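Sparse matrix-vector multiplication (SpMV) is computationally inefficient in existing sparse matrix storage formats, and results computed with the blocked row-column (BRC) storage format lack reproducibility and determinism. To address this, an improved storage format, BRCP, is proposed. It adopts a different two-dimensional blocking strategy, adaptively tunes the blocking parameters according to the statistical distribution of nonzero elements in each row of the matrix to increase the parallelism of SpMV on GPU platforms, and uses a GPU kernel based on a fast segmented-sum algorithm to guarantee that results are deterministic and reproducible across different GPU platforms. Experimental results show that the BRCP format achieves high computational efficiency, reduces SpMV computation error in parallel environments compared with the BRC format, and improves the accuracy of PageRank ranking.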

4.

Implementation and optimization of HYB-format sparse matrix-vector multiplication on CPU+GPU heterogeneous systems

阳王东李肯立《计算机工程与科学》2016,38(2):202-209

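Sparse matrix-vector multiplication (SpMV) is an important problem in solving sparse linear systems, but the sparsity of the nonzero elements gives it low computational density and therefore low efficiency. To cope with the irregularity of sparse matrices, SpMV can be computed with a hybrid storage format, which improves the compression efficiency for sparse matrices and broadens the range of matrices to which it applies. HYB is a widely used hybrid compressed format with fairly stable performance. As GPU parallel computing has become widespread and CPUs increasingly multicore, building heterogeneous parallel systems from GPUs and multicore CPUs has gained broad acceptance. Based on the storage characteristics of the ELL and COO parts of the HYB format, the two parts of the data are partitioned between the CPU and the GPU for cooperative parallel computation, which both makes full use of CPU and GPU computing resources and plays to the computing characteristics of each, thereby improving the utilization of computing resources. Building on an analysis of the CPU+GPU heterogeneous computing model, data partitioning and sharing for the hybrid format are optimized to better exploit the heterogeneous environment and improve computing performance.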
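As a rough illustration of the ELL/COO split described in this abstract, the following is a minimal CPU-only sketch in Python, not the authors' implementation. The function names, the `ell_width` parameter, and the use of `-1` as a padding column are assumptions made for the example; the abstract does not say which part runs on which device.

```python
def to_hyb(dense, ell_width):
    """Split a dense row-major matrix into an ELL part and a COO part.

    The first `ell_width` nonzeros of each row go to ELL (padded with
    column -1); any remaining nonzeros spill into the COO list.
    """
    ell_cols, ell_vals, coo = [], [], []
    for i, row in enumerate(dense):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        head, tail = nonzeros[:ell_width], nonzeros[ell_width:]
        pad = ell_width - len(head)
        ell_cols.append([j for j, _ in head] + [-1] * pad)
        ell_vals.append([v for _, v in head] + [0] * pad)
        coo.extend((i, j, v) for j, v in tail)
    return ell_cols, ell_vals, coo


def spmv_ell(ell_cols, ell_vals, x):
    """y = A_ell * x; every row does the same amount of work."""
    y = [0] * len(ell_cols)
    for i, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        for j, v in zip(cols, vals):
            if j >= 0:  # skip padding entries
                y[i] += v * x[j]
    return y


def spmv_coo(coo, x, n_rows):
    """y = A_coo * x over an unordered list of (row, col, value)."""
    y = [0] * n_rows
    for i, j, v in coo:
        y[i] += v * x[j]
    return y


def spmv_hyb(dense, x, ell_width=2):
    """Compute the two partial products independently, then add them."""
    ell_cols, ell_vals, coo = to_hyb(dense, ell_width)
    y_ell = spmv_ell(ell_cols, ell_vals, x)   # regular part
    y_coo = spmv_coo(coo, x, len(dense))      # irregular remainder
    return [a + b for a, b in zip(y_ell, y_coo)]
```

Because the two partial products share no state until the final addition, they could in principle be computed on different devices at the same time, which is the property the abstract relies on.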

5.

GPU-based parallel program design and memory optimization for remote sensing image registration

周海芳赵进《计算机研究与发展》2012,(Z1):281-286

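Image registration is an important processing step in remote sensing image applications. As the scale of remote sensing image data and the computational complexity of registration algorithms grow, remote sensing image registration faces a processing-speed challenge. In recent years GPU computing power has increased enormously, and GPUs have developed rapidly in general-purpose computing. Combining the strengths of GPUs in general-purpose computing with the speed problem facing remote sensing image registration, this work studies GPU-accelerated algorithms for remote sensing image registration. The mutual-information-based wavelet-decomposition registration algorithm, which is computationally heavy and highly accurate, is selected for GPU parallel design, and a GPU parallel design model is proposed. Commonly used memory-level optimization strategies for GPU programs are also applied to the GPU registration program, and experiments are carried out on an NVIDIA Tesla M2050 GPU using the CUDA (compute unified device architecture) programming language. The experimental results show that the proposed parallel design model and memory-level optimization strategies are well suited to remote sensing image registration, with a maximum speedup of 19.9x. The study indicates that general-purpose GPU computing has broad application prospects in remote sensing image processing.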

6.

GPU-based parallel computation of image features

张杰柴志雷喻津《计算机科学》2015,42(10):297-300, 324

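Feature extraction and description underlie many computer vision applications. Because of the high-dimensional computation arising from pixel-level processing, local feature extraction and description are computationally complex and have poor real-time performance, which hinders the use of these algorithms in practical systems. This work studies the key computational modules common to local feature extraction and description: the image pyramid mechanism and image gradient computation. Parallel computation of these common modules is designed and implemented on the NVIDIA GPU/CUDA architecture, and is made more efficient by optimizing access to global memory, texture memory, and shared memory. Experimental results show that GPU-based image pyramid and image gradient computation is about 30 times faster than on the CPU; when the implemented image pyramid and gradient computation are applied to the HOG feature extraction and description algorithm, a speedup of about 40x over the CPU is obtained. The work is of practical significance for high-speed GPU-based extraction and description of local features.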

7.

Parallel solution of the shortest common superstring problem by combining genetic and ant colony algorithms

伍世刚钟诚《计算机应用》2014,34(7):1857-1861

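According to the capacity of each cache level, the population-individual and ant-individual data in CPU main memory are partitioned and stored across the L1, L2, and L3 caches to reduce the cost of moving data between storage levels during parallel computation. Between the CPU and the GPU, data are transferred asynchronously and only partially, and multiple GPU kernels execute asynchronously in multiple streams. The number of threads per GPU block is set to a multiple of 16 and the GPU shared-memory partition size to 32 times the bank size. Frequently accessed read-only parameters, such as the crossover probability and mutation probability, are stored in GPU constant memory, and the large read-only data structures (the input string matrix and the overlap-length matrix) are bound to GPU texture memory. On this basis, a parallel algorithm that is efficient in computation, storage, and communication is designed and implemented for cooperatively solving the shortest common superstring problem on multicore CPUs and GPUs. Experiments on shortest common superstring problems of various sizes show that the cooperative multicore CPU and GPU parallel algorithm is more than 70 times faster than the serial algorithm.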

8.

Research and implementation of hyperspectral remote sensing image processing based on the CPU/GPU heterogeneous model

汤媛媛周海芳方民权申小龙《计算机科学》2016,43(2):47-50, 77

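In recent years, the rapid rise of new GPU-based heterogeneous high-performance computing models has offered good opportunities to applications in many fields, and remote sensing experts in China and abroad have begun to adopt heterogeneous high-performance computing to address the heavy computation and difficult real-time processing caused by the high dimensionality of hyperspectral remote sensing images. This paper briefly introduces hyperspectral remote sensing and the CPU/GPU heterogeneous computing model, and summarizes the state of research and open problems in hyperspectral remote sensing data processing based on the CPU/GPU heterogeneous model in recent years, both in China and abroad. Targeting a small shared-memory desktop supercomputer, it parallelizes MNF dimensionality reduction of hyperspectral remote sensing images on the CPU/GPU heterogeneous model, and verifies the potential of the heterogeneous model in hyperspectral remote sensing processing by comparison with a serial program and a shared-memory OpenMP homogeneous implementation.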

9.

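A radiosity implementation method based on GPU hardware-accelerated computation  Total citations: 2, self-citations: 0, citations by others: 2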

胡伟秦开怀《计算机研究与发展》2005,42(6):945-950

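A new GPU (graphics processing unit)-based radiosity method is proposed. Exploiting the parallel computing power of the programmable GPU, the method carries out the entire process of form-factor computation and linear-system solution in the radiosity method on programmable graphics hardware. This removes the need for CPU involvement found in earlier GPU-based radiosity methods and bypasses the bottleneck of data exchange between main memory and GPU texture memory. For form-factor computation and rendering based on the hemicube method, it solves the problems of GPU hardware-accelerated traversal, classification, and accumulation. In addition, the method adopts a new scheme for storing matrices and vectors on the GPU and uses the GPU to implement Jacobi iteration for fast solution of the linear system. Experimental results show that the method computes and renders radiosity quickly and effectively.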
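The Jacobi iteration mentioned in this abstract can be sketched on the CPU as follows. This is a generic textbook version in Python, not the paper's GPU storage scheme; the function name, the zero initial guess, and the fixed iteration count are assumptions made for the example.

```python
def jacobi(A, b, iterations=50):
    """Approximate the solution of A x = b by Jacobi iteration.

    Every component of the new iterate depends only on the previous
    iterate, so all n updates in one sweep are independent of each other.
    Convergence is assumed here (e.g. A strictly diagonally dominant).
    """
    n = len(b)
    x = [0.0] * n  # assumed initial guess
    for _ in range(iterations):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x
```

The independence of the per-component updates within a sweep is what makes the method a natural fit for data-parallel hardware: each output element could be assigned to its own thread or fragment.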

10.

A bit-parallel approximate string matching algorithm based on CUDA

崔文科徐克付李娜娜胡玥《计算机工程》2012,38(22):267-270

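To meet the high-performance computing demands of massive data matching in fields such as text retrieval and computational biology, a bit-parallel approximate string matching algorithm based on the Compute Unified Device Architecture (CUDA) is proposed. Drawing on the highly parallel computing structure and memory bandwidth characteristics of the graphics processing unit (GPU), it accelerates the parallelized dynamic-programming matrix algorithm (BPM) by optimizing the data storage layout, and the acceleration is evaluated in comparative tests. Experimental results show that GPU acceleration gives the BPM algorithm a speedup of about 20x.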
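To make the idea of a bit-parallel dynamic-programming column concrete, here is a small CPU sketch in Python. It assumes that BPM refers to a Myers-style bit-vector algorithm, which the abstract does not state explicitly; the function name and return convention are hypothetical, and Python's unbounded integers stand in for machine words.

```python
def myers_search(pattern, text, k):
    """Return end positions j in `text` where some substring ending at j
    is within edit distance k of `pattern` (bit-vector column updates)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    mask = (1 << m) - 1        # keeps only the m bits of one DP column
    high = 1 << (m - 1)        # bit of the last pattern row
    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m  # vertical +1 / -1 deltas and D[m][j]
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph <<= 1   # row 0 stays at 0 when searching, so nothing is shifted in
        mh <<= 1
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv & mask
        if score <= k:
            hits.append(j)
    return hits
```

Each text character costs a fixed handful of word operations regardless of pattern length (up to the word size), which is why the layout of the bit vectors and the text in memory, rather than arithmetic, tends to dominate performance.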

11.

Memory bandwidth optimization of SpMV on GPGPUs

Chenggang Clarence YAN Hui YU Weizhi XU Yingping ZHANG Bochuan CHEN Zhu TIAN Yuxuan WANG Jian YIN 《Frontiers of Computer Science》2015,9(3):431

It is an important task to improve performance for sparse matrix vector multiplication (SpMV), and it is a difficult task because of its irregular memory access. General purpose GPU (GPGPU) provides high computing ability and substantial bandwidth that cannot be fully exploited by SpMV due to its irregularity. In this paper, we propose two novel methods to optimize the memory bandwidth for SpMV on GPGPU. First, a new storage format is proposed to exploit memory bandwidth of GPU architecture more efficiently. The new storage format can ensure that there are as many non-zeros as possible in the format which is suitable to exploit the memory bandwidth of the GPU. Second, we propose a cache blocking method to improve the performance of SpMV on GPU architecture. The sparse matrix is partitioned into sub-blocks that are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time to access the global memory for vector x is reduced heavily. Experiments are carried out on three GPU platforms, GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. Experimental results show that both new methods can efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU.
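The cache-blocking idea in this abstract can be sketched as follows, as a CPU-only Python illustration under stated assumptions rather than the paper's method. Here the matrix is cut into column blocks of a hypothetical width `block_width`, each stored in CSR with block-local column indices; the abstract does not specify the actual block shape.

```python
def to_blocked_csr(dense, block_width):
    """Cut the matrix into column blocks, each stored in CSR form.

    Returns a list of (start, end, row_ptr, col_idx, vals); col_idx is
    relative to `start`, so each block only touches x[start:end].
    """
    n_cols = len(dense[0])
    blocks = []
    for start in range(0, n_cols, block_width):
        end = min(start + block_width, n_cols)
        row_ptr, col_idx, vals = [0], [], []
        for row in dense:
            for j in range(start, end):
                if row[j] != 0:
                    col_idx.append(j - start)
                    vals.append(row[j])
            row_ptr.append(len(vals))
        blocks.append((start, end, row_ptr, col_idx, vals))
    return blocks


def spmv_blocked(blocks, x, n_rows):
    """y = A x, processing one column block (one slice of x) at a time."""
    y = [0] * n_rows
    for start, end, row_ptr, col_idx, vals in blocks:
        x_part = x[start:end]  # the only part of x this block reads
        for i in range(n_rows):
            for p in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += vals[p] * x_part[col_idx[p]]
    return y
```

Within one block, every access to `x` falls inside a short contiguous slice, so on real hardware that slice has a chance of staying in cache while all rows of the block are processed.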

12.

A survey of data compression techniques for the graphics processor pipeline

韩立敏田泽张骏郑新建任向隆《计算机应用研究》2018,35(3)

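Improving power efficiency is one of the key design goals of high-end GPUs. Applying data compression at multiple stages of the 3D graphics rendering pipeline can markedly reduce accesses to the GPU's off-chip memory, thereby improving rendering performance and reducing power consumption. To summarize and analyze how data compression is currently applied in the graphics processor pipeline, this survey starts from the structural characteristics of the GPU rendering pipeline and memory system and summarizes the key properties of dedicated compression algorithms for the various buffer objects and for texture data; analyzes the state of research, shortcomings, and challenges of graphics-pipeline data compression; and, based on application needs, identifies directions for further research on GPU pipeline data compression.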

13.

Partial migration technique for GPGPU tasks to Prevent GPU Memory Starvation in RPC-based GPU Virtualization

JiHun Kang JongBeom Lim HeonChang Yu 《Software》2020,50(6):948-972

Graphics processing unit (GPU) virtualization technology enables a single GPU to be shared among multiple virtual machines (VMs), thereby allowing multiple VMs to perform GPU operations simultaneously with a single GPU. Because GPUs exhibit lower resource scalability than central processing units (CPUs), memory, and storage, many VMs encounter resource shortages while running GPU operations concurrently, implying that the VM performing the GPU operation must wait to use the GPU. In this paper, we propose a partial migration technique for general-purpose graphics processing unit (GPGPU) tasks to prevent the GPU resource shortage in a remote procedure call-based GPU virtualization environment. The proposed method allows a GPGPU task to be migrated to another physical server's GPU based on the available resources of the target's GPU device, thereby reducing the wait time of the VM to use the GPU. With this approach, we prevent resource shortages and minimize performance degradation for GPGPU operations running on multiple VMs. Our proposed method can prevent GPU memory shortage, improve GPGPU task performance by up to 14%, and improve GPU computational performance by up to 82%. In addition, experiments show that the migration of GPGPU tasks minimizes the impact on other VMs.

14.

A GPU-based cuboid algorithm

周国亮冯海军何国明陈红李翠平王珊《计算机研究与发展》2009,46(Z2)

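In recent years, general-purpose computing on graphics processors has received wide attention and made progress in many fields. In-memory OLAP reduces disk I/O, but the computing power of single-core or multicore CPUs and cache misses become the new performance bottlenecks, so good efficiency cannot be guaranteed. Graphics processors, with their many cores and high bandwidth, are well matched to the computational characteristics of OLAP. A graphics processor is used to accelerate the computation of any cuboid and thereby improve the performance of the whole in-memory OLAP system. A block-based parallel algorithm for graphics processors is proposed and optimized, and variants of the algorithm for conditions such as sparse data and skewed data distributions are discussed. With extensions, the algorithm can go beyond memory limits by forming a three-level pipeline of disk, main memory, and device memory, which suits it to massive data; the algorithm can also serve as the basis for computing the entire cube. Experimental comparison shows that the graphics-processor-based algorithm clearly outperforms a quad-core CPU algorithm.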
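The block-based cuboid computation described in this abstract can be pictured with the following Python sketch. It is a sequential stand-in under stated assumptions, not the paper's GPU algorithm: the row layout (dimension values followed by one numeric measure), the SUM aggregate, and the names `cuboid_block` and `cuboid` are all hypothetical.

```python
from collections import Counter


def cuboid_block(rows, dims):
    """Aggregate one block of fact rows on the chosen dimensions.

    Each row is (dim_0, dim_1, ..., measure); `dims` lists the indices
    of the dimensions that define the cuboid's group-by key.
    """
    partial = Counter()
    for *keys, measure in rows:
        partial[tuple(keys[d] for d in dims)] += measure
    return partial


def cuboid(fact_rows, dims, block_size):
    """Split the fact table into blocks, aggregate each block on its own,
    then merge the partial results by summing per key."""
    total = Counter()
    for start in range(0, len(fact_rows), block_size):
        block = fact_rows[start:start + block_size]
        total.update(cuboid_block(block, dims))  # update() adds values per key
    return dict(total)
```

Because SUM is associative, the per-block partial results can be produced in any order, or concurrently, and merged afterwards; the same structure would also let blocks be streamed from disk through main memory when the table does not fit on the device.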

15.

Low occupancy high performance elemental products in assembly free FEM on GPU

Pikle Nileshchandra K. Sathe Shailesh R. Vyavahare Arvind Y. 《Engineering with Computers》2021,38(3):2189-2204

Assembly-free FEM bypasses the assembly step and solves the system of linear equations at the element level using a Conjugate Gradient (CG) type iterative solver. The smaller dense matrix-vector products (MvPs) are encapsulated within the CG solver and are computed either at the element level or at the degree of freedom (DoF) level. Both strategies exploit the computing power of the GPU effectively, but performance lags because of uncoalesced global memory access on the GPU. This paper proposes an improved MvP strategy for assembly-free FEM, which improves performance through coalesced global memory access, using the faster on-chip shared memory and the texture cache memory on the GPU. Since the GPU has limited shared memory (a few KB), the proposed technique suffers from a problem known as low occupancy. Despite the low occupancy issue, the proposed strategy outperforms both element-based and DoF-based MvP strategies on the GPU. Numerical experiments comparing it with the element-level and DoF-level strategies on the GPU show that the GPU instance of the proposed MvP outperforms them by factors of approximately 7 and 1.5, respectively.

16.

Local Painting and Deformation of Meshes on the GPU

H. Schäfer B. Keinert M. Nießner M. Stamminger 《Computer Graphics Forum》2015,34(1):26-35

We present a novel method to adaptively apply modifications to scene data stored in GPU memory. Such modifications may include interactive painting and sculpting operations in an authoring tool, or deformations resulting from collisions between scene objects detected by a physics engine. We only allocate GPU memory for the faces affected by these modifications to store fine‐scale colour or displacement values. This requires dynamic GPU memory management in order to assign and adaptively apply edits to individual faces at runtime. We present such a memory management technique based on a scan‐operation that is efficiently parallelizable. Since our approach runs entirely on the GPU, we avoid costly CPU–GPU memory transfer and eliminate typical bandwidth limitations. This minimizes runtime overhead to under a millisecond and makes our method ideally suited to many real‐time applications such as video games and interactive authoring tools. In addition, our algorithm significantly reduces storage requirements and allows for much higher resolution content compared to traditional global texturing approaches. Our technique can be applied to various mesh representations, including Catmull–Clark subdivision surfaces, as well as standard triangle and quad meshes. In this paper, we demonstrate several scenarios for these mesh types where our algorithm enables adaptive mesh refinement, local surface deformations and interactive on‐mesh painting and sculpting.
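One common way a scan operation yields a memory layout is an exclusive prefix sum over per-face size requests. The Python sketch below illustrates that general pattern only; it is not taken from the paper, and the function name and the convention that unaffected faces request size 0 are assumptions.

```python
from itertools import accumulate


def allocate(sizes):
    """Turn per-face size requests into offsets in one shared buffer.

    offsets[i] is the exclusive prefix sum of `sizes`, i.e. where face i's
    data would start; faces requesting 0 consume no space. Also returns
    the total buffer size needed.
    """
    inclusive = list(accumulate(sizes))
    offsets = [inc - s for inc, s in zip(inclusive, sizes)]
    total = inclusive[-1] if inclusive else 0
    return offsets, total
```

For example, `allocate([0, 4, 0, 16, 4])` gives offsets `[0, 0, 4, 4, 20]` and a total of 24: only the three edited faces take space, and their regions are packed back to back. A prefix sum is a standard parallel primitive, which is what lets this kind of allocation run on the device without a round trip to the CPU.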

17.

A CUDA-based parallel accelerated rendering algorithm  Total citations: 1, self-citations: 1, citations by others: 0

刘镇郝冬宁梅向东《中国图象图形学报》2013,18(11):1457-1461

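GPUs can process massive amounts of data quickly and effectively, and have therefore become a research focus in graphics and image data processing in recent years. Existing GPU rendering suffers from low resource utilization and excessive bandwidth consumption when handling scenes that contain many identical or similar models. To address this, a CUDA-based accelerated rendering method is proposed on top of the existing GPU rendering architecture. In this method, a model of the existing GPU rendering mode is built and used to identify its shortcomings, which leads to the concept of constant memory; the properties of constant memory and its effect on rendering are then analyzed, and a method based on constant-memory control is introduced to accelerate rendering, with the whole rendering process controlled by the rendering algorithm. Experimental results show that the method addresses the above problems well and ultimately accelerates rendering.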

18.

GPU-accelerated string matching for database applications

Evangelia A. Sitaridi Kenneth A. Ross 《The VLDB Journal The International Journal on Very Large Data Bases》2016,25(5):719-740

Implementations of relational operators on GPU processors have resulted in order of magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of string matching operators common in SQL queries. Due to different architectural features the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in the cache in a naive implementation. In GPUs the unit of execution is a group of threads and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, then different paths are serialized. We study the cache memory efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in the GPU memory. We evaluate the memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
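The regular access pattern attributed to Knuth–Morris–Pratt in this abstract is easiest to see in code. Below is a standard single-threaded KMP in Python, given as a reference sketch rather than the paper's GPU variant; the function names are chosen for the example.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, c in enumerate(text):  # the text index only ever moves forward
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]       # fall back in the pattern, not the text
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]       # keep going to find overlapping matches
    return matches
```

On a mismatch the algorithm backs up only in the small pattern-side table, never in the text, so each thread reads its input strictly sequentially and exactly once.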

19.

Performance optimization of the quantum circuit simulator QuEST on multi-GPU platforms

张亮常旭秦志楷沈立《计算机工程与科学》2021,43(1):17-23

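In current quantum computing research, quantum circuit simulators are important research tools and have always been highly valued by researchers. QuEST is an open-source general-purpose quantum circuit simulator that runs flexibly on a variety of test platforms, including a single CPU node, multiple CPU nodes, and a single GPU. The inherent parallelism of quantum circuit simulation makes it well suited to running on GPUs, where it can obtain large speedups. Its drawback, however, is that it consumes an enormous amount of memory, and a single GP...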

20.

Accelerating computation of Euclidean distance map using the GPU with efficient memory access

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(5):383-406

Recent graphics processing units (GPUs), which have many processing units, can be used for general purpose parallel computation. To utilise their powerful computing ability, GPUs are widely used for general purpose processing. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. The main contribution of this paper is to present a GPU implementation of computing the Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, the EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system, such as coalesced access of global memory and shared memory bank conflicts. Concretely, by using the shared memory to transpose the 2D arrays held as temporary data in the global memory, most accesses to and from the global memory can be performed as coalesced accesses. In practice, we have implemented our parallel algorithm on the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results have shown that, for an input binary image of size 9216 × 9216, our implementation can achieve a speedup factor of 54 over the sequential algorithm implementation.
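The EDM definition given in this abstract can be written down directly as a brute-force reference. The Python sketch below is only that definition made executable, not the paper's algorithm; the encoding of black pixels as `1` and the function name are assumptions.

```python
import math


def euclidean_distance_map(image):
    """For each pixel, the Euclidean distance to the nearest black pixel.

    `image` is a list of rows; a value of 1 is assumed to mean black.
    Brute force: every pixel is compared against every black pixel.
    """
    black = [
        (r, c)
        for r, row in enumerate(image)
        for c, v in enumerate(row)
        if v == 1
    ]
    if not black:
        raise ValueError("image contains no black pixel")
    return [
        [min(math.hypot(r - br, c - bc) for br, bc in black)
         for c in range(len(row))]
        for r, row in enumerate(image)
    ]
```

This version does work proportional to the number of pixels times the number of black pixels, so it is useful only as a correctness check on small inputs; a practical implementation would use a far more efficient row and column decomposition.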