Similar Documents
19 similar documents found (search time: 343 ms)
1.
唐斌  龙文 《液晶与显示》2016,31(7):714-720
This paper proposes a fast GPU+CPU implementation of the Canny operator. The operator is first split into serial and parallel parts: Gaussian filtering, gradient magnitude and direction computation, non-maximum suppression, and double thresholding are performed on the GPU, with the 2D Gaussian filter decomposed into two 1D passes (horizontal and vertical) to reduce computational complexity. CUDA is then used to run these stages with many parallel threads to speed up the computation, and shared memory is used to hide the latency of global memory accesses. Edge linking is finally performed on the CPU using a FIFO queue. Simulation results show a processing time of 122 ms for an 8-bit image at 1024×1024 resolution and a speedup of up to 5.39× over a CPU-only implementation, so the method exploits both the parallelism of the GPU and the serial processing strength of the CPU.
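As an illustration of the separable filtering step described in this abstract, the sketch below shows a horizontal 1D Gaussian pass in CUDA that stages an image tile in shared memory; the kernel name, tile width, filter radius, and border clamping are illustrative assumptions rather than code from the paper (the vertical pass is analogous).

```cuda
// Sketch: horizontal pass of a separable Gaussian filter (assumed names/sizes).
// Launch with blockDim.x == TILE_W; the vertical pass swaps rows and columns.
#define RADIUS 2          // assumed 5-tap Gaussian
#define TILE_W 256

__constant__ float d_kernel[2 * RADIUS + 1];   // Gaussian taps, copied from host

__global__ void gaussianRowPass(const unsigned char* in, float* out,
                                int width, int height)
{
    __shared__ float tile[TILE_W + 2 * RADIUS];

    int row = blockIdx.y;
    int col = blockIdx.x * TILE_W + threadIdx.x;

    // Stage the tile plus left/right halos into shared memory (clamped at borders).
    int srcCol = min(max(col - RADIUS, 0), width - 1);
    tile[threadIdx.x] = in[row * width + srcCol];
    if (threadIdx.x < 2 * RADIUS) {
        int haloCol = min(max(col - RADIUS + TILE_W, 0), width - 1);
        tile[threadIdx.x + TILE_W] = in[row * width + haloCol];
    }
    __syncthreads();

    if (col < width) {
        float sum = 0.0f;
        for (int k = 0; k <= 2 * RADIUS; ++k)
            sum += d_kernel[k] * tile[threadIdx.x + k];
        out[row * width + col] = sum;
    }
}
```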

2.
Star image registration is an important step in star image processing, so its speed directly affects the overall processing speed. In recent years, graphics processing units (GPUs) have developed rapidly in general-purpose computing. Combining the advantages of GPUs in general-purpose computing with the speed bottleneck faced by star image registration, a GPU-accelerated registration algorithm is studied. Based on an existing registration algorithm, a corresponding GPU parallel design is proposed according to the algorithm's characteristics and implemented with CUDA for simulation experiments. The results show that, compared with the traditional CPU-based algorithm, the GPU-based parallel design meets the same registration requirements while achieving a registration speedup of 29.043×.

3.
翟锐  吕科  代双凤  潘卫国 《电子学报》2016,44(12):2894-2899
With the development of remote sensing, terrain datasets have grown far beyond what can be processed in main memory, which has become an urgent problem. Improving system throughput through data compression is a common technique, but with the rapid development of GPUs, traditional compression algorithms cannot fully exploit GPU capabilities. This paper therefore proposes a GPU-based terrain data compression method that compresses both the height field and the position information. Unlike other algorithms that compress only heights or only positions, the main contribution is handling terrain position and height together, so that all information for the current vertex can be computed from the current segment. The algorithm approximates the terrain height field with Bézier curves and stores the residual for each vertex, combining lossy and lossless compression to achieve a high compression ratio. Comparison with traditional methods shows that the approach achieves good compression results.
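As a rough sketch of the height-field approximation idea mentioned here, the following CUDA kernel evaluates an assumed cubic Bézier approximation of each height sample and stores only the residual; the per-segment control-point layout, the fixed number of samples per segment, and all names are assumptions for illustration, not the paper's actual scheme.

```cuda
// Sketch of the residual computation behind a Bezier-based height compression:
// each thread evaluates an assumed cubic Bezier approximation of one height
// sample and stores only the difference to the true height.
__device__ float cubicBezier(float p0, float p1, float p2, float p3, float t)
{
    float u = 1.0f - t;
    return u * u * u * p0 + 3.0f * u * u * t * p1 +
           3.0f * u * t * t * p2 + t * t * t * p3;
}

__global__ void heightResiduals(const float* height, const float4* ctrl,
                                float* residual, int samplesPerSegment, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int seg = i / samplesPerSegment;                       // which Bezier segment
    float t = (i % samplesPerSegment) / (float)(samplesPerSegment - 1);
    float4 c = ctrl[seg];                                  // p0..p3 of this segment
    float approx = cubicBezier(c.x, c.y, c.z, c.w, t);
    residual[i] = height[i] - approx;                      // small values compress well
}
```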

4.
The GPU-based 3D texture mapping algorithm proposed in this paper implements the traditional texture-slice volume rendering algorithm on the GPU by writing vertex and fragment programs. The volume data is first mapped to a 3D texture and loaded into video memory; the computation of ray entry and exit points and the image compositing are then moved onto the GPU through the vertex and pixel shader programs; finally, different rendering effects are achieved with different color-blending formulas at the sample points. Compared with the traditional 3…

5.
To address the long computation time of point-cloud registration algorithms based on swarm intelligence optimization, a CUDA-based parallel particle swarm registration algorithm is proposed. With the shortest point-to-point distance as the fitness function, the natural parallelism of the particles in the particle swarm algorithm is exploited by distributing the computation of the transformation parameters across GPU threads. Because the GPU threads execute simultaneously without interfering with one another, the particle swarm computation is greatly accelerated, enabling fast and accurate point-cloud registration. Experimental results show that the algorithm both overcomes the ICP algorithm's sensitivity to the initial pose of the point clouds and effectively solves the long-runtime problem of swarm-intelligence-based registration.
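The sketch below illustrates the "one particle per thread" fitness evaluation described in this abstract; to keep it short, each particle encodes only a translation and the point correspondences are assumed known, whereas the paper's algorithm uses a full rigid transform and a nearest-neighbour point-to-point distance.

```cuda
// Sketch of the per-particle fitness evaluation: one GPU thread scores one
// particle.  Particle layout, translation-only transform, and known
// correspondences are simplifying assumptions for illustration.
struct Particle { float tx, ty, tz; };

__global__ void evalFitness(const Particle* particles, int numParticles,
                            const float3* src, const float3* dst, int numPoints,
                            float* fitness)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numParticles) return;

    Particle t = particles[p];
    float sum = 0.0f;
    for (int i = 0; i < numPoints; ++i) {
        float dx = src[i].x + t.tx - dst[i].x;     // transformed source vs. target
        float dy = src[i].y + t.ty - dst[i].y;
        float dz = src[i].z + t.tz - dst[i].z;
        sum += dx * dx + dy * dy + dz * dz;
    }
    fitness[p] = sum / numPoints;   // mean squared point-to-point distance
}
```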

6.
韩秉君  黄诗铭  杜滢 《电信科学》2015,31(10):82-88
A method is proposed for accelerating the DFT (discrete Fourier transform) processing in communication simulations on Kepler-architecture GPUs (graphics processing units) using CUDA (compute unified device architecture). The core idea is to use thread-level parallelism to accelerate the DFT computation within a single transmit-receive link, and to use dynamic parallelism and Hyper-Q to parallelize the link processing across different transmitter-receiver pairs, thereby accelerating the overall DFT processing in the simulation. Experimental results show that, compared with a single-core, single-thread CPU program and a previous-generation Fermi-architecture GPU program, the method speeds up DFT processing by 300× and 3×, respectively, demonstrating good acceleration.
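As a hedged sketch of running many independent link DFTs concurrently so that Hyper-Q can overlap them, the snippet below binds one cuFFT plan to one CUDA stream per link; the use of cuFFT, the in-place transform, and all sizes are assumptions for illustration and not the paper's own kernels.

```cuda
// Sketch: one cuFFT plan per stream so independent per-link DFTs can overlap.
#include <cuda_runtime.h>
#include <cufft.h>

void runLinkDfts(cufftComplex** d_signal, int numLinks, int dftLen)
{
    cudaStream_t* streams = new cudaStream_t[numLinks];
    cufftHandle*  plans   = new cufftHandle[numLinks];

    for (int l = 0; l < numLinks; ++l) {
        cudaStreamCreate(&streams[l]);
        cufftPlan1d(&plans[l], dftLen, CUFFT_C2C, 1);  // one DFT per link
        cufftSetStream(plans[l], streams[l]);          // bind plan to its stream
        // In-place forward transform; independent links overlap on the device.
        cufftExecC2C(plans[l], d_signal[l], d_signal[l], CUFFT_FORWARD);
    }
    cudaDeviceSynchronize();

    for (int l = 0; l < numLinks; ++l) {
        cufftDestroy(plans[l]);
        cudaStreamDestroy(streams[l]);
    }
    delete[] streams;
    delete[] plans;
}
```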

7.
To meet the 3D graphics rendering requirements of embedded applications, a programmable multithreaded vertex processor is designed. The vertex processor adopts a single-instruction multiple-data structure in which one instruction processes four single-precision floating-point numbers simultaneously, and it uses multithreading to support four concurrently executing threads, which effectively reduces the stall cycles caused by data hazards between writes and subsequent reads and improves processing efficiency. Compared with a single-threaded design, the four-thread vertex processor achieves a 2.1-2.8× performance improvement at a small hardware cost. The processor supports OpenGL ES 1.1 and Vertex Shader Model 1.1 and, in a 90 nm CMOS process library, reaches a frequency of 200 MHz and a throughput of 50 Mvertices/s.

8.
To address the insufficient computational efficiency of multi-mode synthetic aperture radar (SAR) imaging, a GPU-based parallel acceleration method for unified multi-mode SAR imaging is proposed. To make full use of the GPU's memory resources and improve the algorithm's efficiency, shared memory is used for large-scale data-parallel computation of stages such as matrix transpose and matrix multiplication. Experimental results show that the algorithm greatly improves the computational efficiency of multi-mode SAR imaging, reaching a maximum speedup of 55.62 and alleviating the low utilization of GPU memory.
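The matrix transpose mentioned above is a standard candidate for shared-memory tiling; the sketch below follows the usual CUDA pattern (32×32 tiles, padded to avoid bank conflicts) and is an assumption-based illustration rather than the paper's implementation.

```cuda
// Sketch of a shared-memory tiled transpose; launch with 32x32 thread blocks.
#define TILE_DIM 32

__global__ void transpose(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    // Write the transposed tile; indices are swapped at block granularity so
    // the global write is also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```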

9.
GPU-accelerated rendering of 3D solid models
袁友伟 《电子学报》2008,36(Z1):144-146
Exploiting the powerful floating-point computing and parallel processing capability of the GPU, a fast, realistic rendering method for 3D solid models implemented entirely on the GPU is proposed. The programmable pipeline of modern graphics accelerators is used to achieve fast mesh generation and simplification. Without changing the mesh topology, the mesh is adjusted so that the value of the energy function is reduced as far as possible, which greatly reduces the number of triangles in the piecewise-linear surfaces. Experimental results show that the method achieves real-time rendering of 3D solid models and has significant practical value.

10.
商凯  胡艳 《电子技术》2011,38(5):9-11
In recent years, the general-purpose computing capability of the graphics processing unit (GPU) has developed rapidly, and GPUs have evolved into multi-core processors with enormous parallel computing power; the introduction of the CUDA architecture has broken through the constraints of traditional GPU development methods and unlocked this general-purpose computing power. This paper uses the GPU to accelerate the AES algorithm: the GPU serves as a coprocessor to the CPU, and AES is implemented on the GPU to increase computational throughput. Finally, on the GPU and CPU…

11.
This paper proposes a low-power VLSI architecture for motion tracking that can be used in online video applications such as MPEG and VRML. The proposed architecture uses a hierarchical adaptive structured mesh (HASM) concept that generates a content-based video representation, and it achieves the significant reduction in power consumption that is inherent in the HASM concept. The architecture consists of two units: a motion estimation unit and a motion compensation unit. The motion estimation (ME) architecture generates a progressive mesh code that represents a mesh topology and its motion vectors. ME reduces power consumption because it (1) implements a successive splitting strategy to generate the mesh topology, which allows a pipelined implementation of the processing elements; (2) approximates the motion vectors of the mesh nodes using the three-step search algorithm; and (3) uses parallel units that reduce power consumption at a fixed throughput. The motion compensation (MC) architecture processes a reference frame, mesh nodes, and motion vectors to predict a video frame, using affine transformations to warp the texture of the different mesh patches. MC reduces power consumption because it (1) uses a multiplication-free algorithm for the affine transformation and (2) uses parallel threads, each of which implements a pipelined chain of scalable affine units to compute the affine transformation of each patch. The architecture has been prototyped using a top-down low-power design methodology, and its performance has been analyzed in terms of video reconstruction quality, power, and delay.
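For reference, the three-step search mentioned in the abstract can be summarized by the host-side sketch below: starting with a step of 4, the centre and its eight neighbours are scored with a SAD cost, the search re-centres on the best match, and the step is halved until it reaches 1 (a ±7 search range). The block size, the SAD cost, and all names are the usual assumptions, not the paper's hardware description; the caller must keep the block far enough from the frame border for the search window to stay valid.

```cuda
// Host-side sketch of the three-step search (TSS) for one block's motion vector.
#include <climits>
#include <cstdlib>

static int sad(const unsigned char* ref, const unsigned char* cur,
               int width, int bx, int by, int dx, int dy, int bs)
{
    int s = 0;
    for (int y = 0; y < bs; ++y)
        for (int x = 0; x < bs; ++x)
            s += abs(cur[(by + y) * width + bx + x] -
                     ref[(by + y + dy) * width + bx + x + dx]);
    return s;
}

void threeStepSearch(const unsigned char* ref, const unsigned char* cur,
                     int width, int bx, int by, int bs, int* mvx, int* mvy)
{
    int cx = 0, cy = 0;                       // current best displacement
    for (int step = 4; step >= 1; step /= 2) {
        int bestCost = INT_MAX, bestX = cx, bestY = cy;
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                int cost = sad(ref, cur, width, bx, by, cx + dx, cy + dy, bs);
                if (cost < bestCost) { bestCost = cost; bestX = cx + dx; bestY = cy + dy; }
            }
        cx = bestX; cy = bestY;               // re-centre and halve the step
    }
    *mvx = cx; *mvy = cy;
}
```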

12.
To analyze and predict the kernel performance of CUDA parallel programs and thereby guide parallel program design and performance optimization, a performance prediction framework is proposed. (1) Starting from the GPU programming model and the details of the device architecture, and taking the warp as the unit of study, the framework defines high-level performance-related features such as parallel-space idleness, per-SM warp load, and a parallel-effect factor by combining the basic hardware and software characteristics that are closely related to GPU program runtime. (2) Based on these features, and targeting GPU programs with balanced thread loads, the framework estimates the execution time of a kernel under different problem sizes and execution configurations. (3) An optimization strategy for the kernel's execution-configuration parameters is proposed from the performance-estimation principle. Validation experiments show that the framework achieves average prediction accuracies of 89% and 94% for existing programs in two typical scenarios, objectively captures the correlation between the high-level features and program performance, and can qualitatively analyze the performance level of parallel algorithms.
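The paper builds its own warp-level analytical model; purely as a point of reference for choosing execution-configuration parameters, the snippet below uses the stock CUDA occupancy API to pick a block size for an arbitrary placeholder kernel. The kernel `myKernel` and the problem size are assumptions and not part of the framework described above.

```cuda
// Sketch: query the CUDA occupancy API to choose an execution configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int n)     // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void chooseConfig(int n)
{
    int minGridSize = 0, blockSize = 0;
    // Block size that maximizes theoretical occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int activeBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM, myKernel,
                                                  blockSize, 0);

    int gridSize = (n + blockSize - 1) / blockSize;
    printf("blockSize=%d gridSize=%d activeBlocksPerSM=%d\n",
           blockSize, gridSize, activeBlocksPerSM);
}
```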

13.
Automatic image processing methods are a prerequisite to efficiently analyze the large amount of image data produced by computed tomography (CT) scanners during cardiac exams. This paper introduces a model-based approach for the fully automatic segmentation of the whole heart (four chambers, myocardium, and great vessels) from 3-D CT images. Model adaptation is done by progressively increasing the degrees of freedom of the allowed deformations, which improves convergence as well as segmentation accuracy. The heart is first localized in the image using a 3-D implementation of the generalized Hough transform. Pose misalignment is corrected by matching the model to the image using a global similarity transformation. The complex initialization of the multicompartment mesh is then addressed by assigning an affine transformation to each anatomical region of the model. Finally, a deformable adaptation is performed to accurately match the boundaries of the patient's anatomy. A mean surface-to-surface error of 0.82 mm was measured in a leave-one-out quantitative validation carried out on 28 images. Moreover, the piecewise affine transformation introduced for mesh initialization and adaptation shows better interphase and interpatient shape variability characterization than the commonly used principal component analysis.

14.
Single-GPU scaling is unable to keep pace with the soaring demand for high-throughput computing. As such, executing an application on multiple GPUs connected through an off-chip interconnect will become an attractive option to explore. However, much of the current code is written for a single-GPU system, and porting such code for execution on multiple GPUs is a difficult task. In particular, it requires programmer effort to determine how data is partitioned across multiple GPU cards and then to launch the appropriate thread blocks that mostly access the data local to each card; otherwise, cross-card data movement becomes an expensive operation. In this work we explore hardware support to efficiently parallelize a single-GPU code for execution on multiple GPUs. In particular, our approach focuses on minimizing the number of remote memory accesses across the off-chip network without burdening the programmer with data partitioning and workload assignment. We propose a data-location-aware thread block scheduler that schedules each thread block on the GPU holding most of its input data. The scheduler exploits the well-known observation that GPU workloads tend to launch a kernel multiple times iteratively to process large volumes of data, and that the memory accesses of a thread block across different iterations of a kernel launch exhibit correlated behavior. Our data-location-aware scheduler exploits this predictability to track the memory-access affinity of each thread block to a specific GPU card and stores this information to make scheduling decisions for future iterations. To further reduce the number of remote accesses, we propose a hybrid mechanism that migrates or copies pages between the memories of multiple GPUs based on their access behavior, so that most memory accesses go to the local GPU memory. On an architecture consisting of two GPUs, our proposed schemes improve performance by 1.55× compared to single-GPU execution across the widely used Rodinia [17], Parboil [18], and Graph [23] benchmarks.
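The paper proposes hardware scheduling support; a rough software analogue that exists today, shown below only to make the data-locality issue concrete, is to pin each partition of a managed allocation to the GPU that will touch it with `cudaMemAdvise` and `cudaMemPrefetchAsync`. The kernel, the sizes, and the even two-way split are illustrative assumptions.

```cuda
// Sketch: keep each GPU's accesses local by advising and prefetching its partition.
#include <cuda_runtime.h>

__global__ void process(float* chunk, size_t n)   // placeholder kernel
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1.0f;
}

void runOnTwoGpus(size_t n)
{
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));
    size_t half = n / 2;

    for (int dev = 0; dev < 2; ++dev) {
        float* part = data + dev * half;
        // Advise the driver that this half "lives" on GPU `dev`, then prefetch it.
        cudaMemAdvise(part, half * sizeof(float),
                      cudaMemAdviseSetPreferredLocation, dev);
        cudaMemPrefetchAsync(part, half * sizeof(float), dev, 0);

        cudaSetDevice(dev);
        process<<<(half + 255) / 256, 256>>>(part, half);  // local accesses only
    }
    for (int dev = 0; dev < 2; ++dev) { cudaSetDevice(dev); cudaDeviceSynchronize(); }
    cudaFree(data);
}
```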

15.
GPUs can provide powerful computing ability, especially for data-parallel applications such as video and image processing. However, the complexity of a GPU system makes the optimization of even a simple algorithm difficult, and different optimization methods on a GPU often lead to different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is an important kernel in video and image processing applications. We find that the implementations of GEMV in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of the vector y, so that the method can exploit the highly parallel GPU architecture. Second, register blocking is used to reduce the required off-chip memory bandwidth. Finally, the memory access order is carefully arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes, and the performance of the register blocking method with different block sizes is also evaluated in the experiments. The results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS and MAGMA, and also achieves higher performance for large square matrices.
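A minimal sketch of the "one warp per element of y" idea is given below: the 32 lanes of a warp stride across one row of A, the partial sums are reduced with warp shuffles, and lane 0 writes y[row]. Row-major storage and the warp-per-row mapping are assumptions, and the paper's register-blocking refinements are omitted.

```cuda
// Sketch: warp-per-row GEMV (y = A * x); launch with blockDim.x a multiple of 32.
__global__ void gemvWarpPerRow(const float* A, const float* x, float* y,
                               int rows, int cols)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x & 31;
    if (warpId >= rows) return;

    float sum = 0.0f;
    // Consecutive lanes read consecutive elements of the row -> coalesced access.
    for (int c = lane; c < cols; c += 32)
        sum += A[warpId * cols + c] * x[c];

    // Warp-level tree reduction of the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warpId] = sum;
}
```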

16.
Parallelism of GPU-based wavefront reconstruction for liquid-crystal adaptive optics
The parallelism of graphics processing unit (GPU) computation of wavefront reconstruction for liquid-crystal adaptive optics is studied. The Zernike modal wavefront reconstruction algorithm for liquid-crystal adaptive optics is introduced, and the general-purpose GPU architecture and the GPU implementation of wavefront reconstruction are described. On this basis, parallel computation using the four RGBA color channels of the GPU is proposed to further increase the computation speed, and experimental results are given. The results show that, when the wavefront reconstruction is computed on the GPU, parallel use of the RGBA color channels increases the computation speed by more than a factor of three.

17.
Gibeom Gu  Duksu Kim 《ETRI Journal》2020,42(4):608-618
We present a novel GPU-based ray-casting algorithm for volume rendering of unstructured grid data. Our volume rendering system uses a ray-casting method that guarantees accurate rendering results. We also employ the per-pixel intersection list concept of the Bunyk algorithm to guarantee an accurate result for non-convex meshes. For efficient memory access to the lists on the GPU, we represent the intersection lists for all faces as an array built with our novel construction algorithm. With the intersection lists, we perform ray-casting on a GPU, and a GPU thread handles each ray. To increase ray coherence within a thread block and improve memory access efficiency, we extend a prior image-tile-based work distribution method to fit modern GPU architectures. We also show that a prior approach using a per-thread local buffer to reduce redundant computation is not appropriate for modern GPU architectures. Instead, we take an on-demand calculation strategy that achieves better performance even though it allows duplicate computations. We applied our method to three unstructured grid datasets with different characteristics. With a GPU, our method achieved up to 36.5 times higher performance for the ray-casting process and 19.7 times higher performance for the whole volume rendering process compared with the Bunyk algorithm using a CPU core. Our approach also showed up to 8.2 times higher performance than a GPU-based cell projection method while generating more accurate rendering results. These results demonstrate the efficiency and accuracy of our method.

18.
张聪  邢同举  罗颖  张静  孙强 《电子设计工程》2011,19(19):141-143,146
Mathematical morphology operations are highly parallel and computationally intensive, yet they are widely used in many important fields with strict real-time requirements. To speed up mathematical morphology, a GPU-parallel implementation of morphological operations based on the CUDA architecture is proposed. The paper describes the GPU hardware architecture and the CUDA programming model in detail, gives the complete implementation of the parallel erosion operation on the GPU, and discusses the specific issues that must be considered during programming to make full use of GPU resources. Experimental results show that GPU-parallel morphological operations can be several orders of magnitude faster.
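A minimal sketch of the parallel erosion described here is shown below: one thread computes one output pixel as the minimum of the grey values under a square structuring element. The 3×3 element, border clamping, and 8-bit grey-scale input are illustrative assumptions.

```cuda
// Sketch: grey-scale erosion with a 3x3 structuring element, one thread per pixel.
__global__ void erode3x3(const unsigned char* in, unsigned char* out,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char m = 255;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), width - 1);    // clamp at the border
            int yy = min(max(y + dy, 0), height - 1);
            unsigned char v = in[yy * width + xx];
            if (v < m) m = v;
        }
    out[y * width + x] = m;
}
```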

19.
To enable software-defined radar systems to process large-scale data in real time, a hierarchical distributed parallel computing method is proposed, and a software-defined radar system with low latency and large-scale data processing capability is designed. The system adopts a three-layer parallel computing approach: ZeroMQ is used for task-level distributed computing, multithreading for thread-level multi-core parallel computing, and the Arrayfire platform for data-level computing on the graphics processing unit (Graphic Pr…
