1.

FPGA and GPU-based acceleration of ML workloads on Amazon cloud - A case study using gradient boosted decision tree library

《Integration, the VLSI Journal》2020

Cloud vendors such as Amazon (AWS) have started to offer FPGAs in addition to GPUs and CPUs in their on-demand computing services. In this work we explore design space trade-offs of implementing a state-of-the-art machine learning library for gradient-boosted decision trees (GBDT) on Amazon cloud and compare the scalability, performance, cost and accuracy with the best known CPU and GPU implementations from the literature. Our evaluation indicates that, depending on the dataset, an FPGA-based implementation of the bottleneck computation kernels yields a speed-up anywhere from 3X to 10X over a GPU and 5X to 33X over a CPU. We show that a smaller bin size results in better performance on an FPGA, but even with a bin size of 16 and a fixed-point implementation the degradation in accuracy on an FPGA is relatively small, around 1.3%–3.3% compared to a floating-point implementation with 256 bins on a CPU or GPU.
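The role of the bin count mentioned in the abstract can be illustrated with a minimal Python sketch of histogram-based split finding. It is not the authors' FPGA kernel: the function names (`quantize`, `build_histogram`, `best_split`), the equal-width binning, and the squared-loss gain with unit Hessians are all assumptions chosen for illustration.

```python
def quantize(values, n_bins):
    """Map each feature value to an equal-width bin index in [0, n_bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def build_histogram(bins, grads, n_bins):
    """Accumulate the gradient sum and the sample count of every bin."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bins, grads):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan bin boundaries; return (gain, last bin index of the left child)."""
    total_g, total_n = sum(grad_sum), sum(count)
    best = (float("-inf"), None)
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split must leave samples on both sides
        right_g = total_g - left_g
        # Assumed gain: squared-loss reduction with unit Hessians.
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best[0]:
            best = (gain, b)
    return best
```

With fewer bins the split scan touches fewer histogram entries, at the cost of coarser candidate thresholds; that is the trade-off the abstract quantifies for 16 versus 256 bins.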

2.

Distributed topology control in large‐scale hybrid RF/FSO networks: SIMT GPU‐based particle swarm optimization approach

Osama Awwad, Ala Al-Fuqaha, Ghassen Ben Brahim, Bilal Khan, Ammar Rayes 《International Journal of Communication Systems》2013,26(7):888-911

The tremendous power of graphics processing unit (GPU) computing relative to prior CPU-only architectures presents new opportunities for efficient solutions of previously intractable large-scale optimization problems. Although most previous work in this field focused on scientific applications in the areas of medicine and physics, here we present a Compute Unified Device Architecture (CUDA)-based GPU solution to the topology control problem in hybrid radio frequency and free space optics wireless mesh networks, which adapts the transmission power and the beam-width of individual nodes according to QoS requirements. Our approach is based on a stochastic global optimization technique inspired by the social behavior of flocking birds, so-called ‘particle swarm optimization’, and was implemented on the NVIDIA GeForce GTX 285 GPU. The implementation achieved a performance speedup factor of 392 over a CPU-only implementation. Several innovations in the memory/execution structure in our approach enabled us to surpass all prior known particle swarm optimization GPU implementations. Our results provide a promising indication of the viability of GPU-based approaches towards the solution of large-scale optimization problems such as those found in radio frequency and free space optics wireless mesh network design. Copyright © 2011 John Wiley & Sons, Ltd.

3.

Application of a GPU-based parallel FDTD method to scattering from two-dimensional rough surfaces

Jia Chungang, Guo Lixin, Liu Wei 《电波科学学报》2016,31(4):683-687

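A graphics processing unit (GPU) is used to accelerate the finite-difference time-domain (FDTD) method for computing the bistatic scattering coefficient of two-dimensional rough surfaces. The theoretical formulas and the computational model of FDTD are introduced. A uniaxial (anisotropic) perfectly matched layer (UPML) truncates the FDTD computational domain. The paper focuses on the parallel design scheme and the computation flow of GPU-based parallel FDTD for computing the bistatic scattering coefficient of rough surfaces. A speedup of 50.7× was obtained on an NVIDIA GeForce GTX 570 graphics card. The results show that accelerating the FDTD computation of rough-surface scattering greatly improves computational efficiency.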

4.

GPU Computing (total citations: 9; self-citations: 0; citations by others: 9)

Owens J.D., Houston M., Luebke D., Green S., Stone J.E., Phillips J.C. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2008,96(5):879-899

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.

5.

Efficient GPU and CPU-based LDPC decoders for long codewords

Stefan Grönroos, Kristian Nybom, Jerker Björkqvist 《Analog Integrated Circuits and Signal Processing》2012,73(2):583-595

The next generation DVB-T2, DVB-S2, and DVB-C2 standards for digital television broadcasting specify the use of low-density parity-check (LDPC) codes with codeword lengths of up to 64800 bits. The real-time decoding of these codes on general purpose computing hardware is useful for completely software defined receivers, as well as for testing and simulation purposes. Modern graphics processing units (GPUs) are capable of massively parallel computation, and can in some cases, given carefully designed algorithms, outperform general purpose CPUs (central processing units) by an order of magnitude or more. The main problem in decoding LDPC codes on GPU hardware is that LDPC decoding generates irregular memory accesses, which tend to carry heavy performance penalties (in terms of efficiency) on GPUs. Memory accesses can be efficiently parallelized by decoding several codewords in parallel, as well as by using appropriate data structures. In this article we present the algorithms and data structures used to make log-domain decoding of the long LDPC codes specified by the DVB-T2 standard (at the high data rates required for television broadcasting) possible on a modern GPU. Furthermore, we also describe a similar decoder implemented on a general purpose CPU, and show that high performance LDPC decoders are also possible on modern multi-core CPUs.

6.

Optimization of true three-dimensional motion video data (total citations: 1; self-citations: 0; citations by others: 1)

Jiang Yinchuan, Yuan Jie 《现代电子技术》2012,35(8):116-119,126

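A lattice-based true three-dimensional video display technique is proposed. The system uses LEDs as unit nodes that form a three-dimensional spatial array for displaying true three-dimensional moving images. Because the data volume is very large, the computation is optimized with the CUDA programming model to speed up processing, and the parts that can be computed in parallel are handed to the GPU. The video data are first transferred to host memory, where the CPU performs some preprocessing; they are then transferred to video memory, where the GPU processes the video motion and related steps; afterwards they are transferred back to host memory for some post-processing by the CPU; finally the processed data are output for display or storage. A comparison of the computation times of CPU-only processing and GPU-optimized processing shows that the optimized computation is tens to hundreds of times faster. The larger the data volume, the better the optimization effect, and GPUs with more cores achieve larger speedups. The experimental section presents results simulated with OpenGL.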

7.

Optimization and implementation of the SURF algorithm on general-purpose GPUs with OpenCL

Wang Yanmei, Shi Xiaohua, Yu Zhanlin 《电子测试》2013,(12):51-55,42

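Speeded Up Robust Features (SURF) is an image interest-point detection and matching method widely used in computer vision. The Open Computing Language (OpenCL) provides a framework for writing parallel programs on heterogeneous architectures, including GPUs, CPUs, and other types of processors. This paper describes how to optimize and implement the SURF algorithm on general-purpose GPUs with OpenCL. Several of the optimization methods, such as the configuration of kernel threads and the use of local memory, are compared and discussed in detail. The final OpenCL version of the algorithm on an NVIDIA GTX 260 achieves a speedup of at least 21× over the original CPU version on an Intel Dual-Core E5400 2.7 GHz processor.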

8.

Implementation of string matching algorithms: CPU vs. GPU vs. FPGA

Li Zhang, Du Huimin, Wang Yonggang 《电子科技》2014,27(12):5-8

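To examine the performance of string matching algorithms implemented on different platforms, the algorithm was tested and compared on a CPU, a GPU, and an FPGA. The GPU has many compute units, which gives it a large efficiency gain for compute-intensive applications, while the FPGA offers very high flexibility, programmability, and a large number of logic units, so it processes string matching quickly. An analysis of the three implementations on the Snort rule set shows that the FPGA is the fastest, 10 times faster than the GPU, and that serial processing on the CPU is the slowest. The FPGA consumes the most resources, followed by the GPU; the CPU consumes the fewest resources and is the simplest to implement.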

9.

Fast generation of digitally reconstructed radiographs using attenuation fields with application to 2D-3D image registration

Russakoff DB, Rohlfing T, Mori K, Rueckert D, Ho A, Adler JR, Maurer CR 《IEEE transactions on medical imaging》2005,24(11):1441-1454

Generation of digitally reconstructed radiographs (DRRs) is computationally expensive and is typically the rate-limiting step in the execution time of intensity-based two-dimensional to three-dimensional (2D-3D) registration algorithms. We address this computational issue by extending the technique of light field rendering from the computer graphics community. The extension of light fields, which we call attenuation fields (AFs), allows most of the DRR computation to be performed in a preprocessing step; after this precomputation step, DRRs can be generated substantially faster than with conventional ray casting. We derive expressions for the physical sizes of the two planes of an AF necessary to generate DRRs for a given X-ray camera geometry and all possible object motion within a specified range. Because an AF is a ray-based data structure, it is substantially more memory efficient than a huge table of precomputed DRRs because it eliminates the redundancy of replicated rays. Nonetheless, an AF can require substantial memory, which we address by compressing it using vector quantization. We compare DRRs generated using AFs (AF-DRRs) to those generated using ray casting (RC-DRRs) for a typical C-arm geometry and computed tomography images of several anatomic regions. They are quantitatively very similar: the median peak signal-to-noise ratio of AF-DRRs versus RC-DRRs is greater than 43 dB in all cases. We perform intensity-based 2D-3D registration using AF-DRRs and RC-DRRs and evaluate registration accuracy using gold-standard clinical spine image data from four patients. The registration accuracy and robustness of the two methods are virtually identical, whereas the execution speed using AF-DRRs is an order of magnitude faster.

10.

Non-rigid Registration for Large Sets of Microscopic Images on Graphics Processors

Antonio Ruiz, Manuel Ujaldon, Lee Cooper, Kun Huang 《Journal of Signal Processing Systems》2009,55(1-3):229-250

Microscopic imaging is an important tool for characterizing tissue morphology and pathology. 3D reconstruction and visualization of large sample tissue structure requires registration of large sets of high-resolution images. However, the scale of this problem presents a challenge for automatic registration methods. In this paper we present a novel method for efficient automatic registration using graphics processing units (GPUs) and parallel programming. Comparing a C++ CPU implementation with Compute Unified Device Architecture (CUDA) libraries and pthreads running on GPU, we achieve a speed-up factor of up to 4.11× with a single GPU and 6.68× with a GPU pair. We present execution times for a benchmark composed of two sets of large-scale images: mouse placenta (16K × 16K pixels) and breast cancer tumors (23K × 62K pixels). It takes more than 12 hours for the genetic case in C++ to register a typical sample composed of 500 consecutive slides, which was reduced to less than 2 hours using two GPUs, in addition to a very promising scalability for extending those gains easily on a large number of GPUs in a distributed system.

11.

High-speed nonlinear finite element analysis for surgical simulation using graphics processing units (total citations: 1; self-citations: 0; citations by others: 1)

Taylor ZA, Cheng M, Ourselin S 《IEEE transactions on medical imaging》2008,27(5):650-663

The use of biomechanical modelling, especially in conjunction with finite element analysis, has become common in many areas of medical image analysis and surgical simulation. Clinical employment of such techniques is hindered by conflicting requirements for high fidelity in the modelling approach, and fast solution speeds. We report the development of techniques for high-speed nonlinear finite element analysis for surgical simulation. We use a fully nonlinear total Lagrangian explicit finite element formulation which offers significant computational advantages for soft tissue simulation. However, the key contribution of the work is the presentation of a fast graphics processing unit (GPU) solution scheme for the finite element equations. To the best of our knowledge, this represents the first GPU implementation of a nonlinear finite element solver. We show that the present explicit finite element scheme is well suited to solution via highly parallel graphics hardware, and that even a midrange GPU allows significant solution speed gains (up to 16.8 times) compared with equivalent CPU implementations. For the models tested the scheme allows real-time solution of models with up to 16 000 tetrahedral elements. The use of GPUs for such purposes offers a cost-effective high-performance alternative to expensive multi-CPU machines, and may have important applications in medical image analysis and surgical simulation.

12.

Towards accelerating irregular EDA applications with GPUs

Hao Qian, Yangdong Deng, Bo Wang, Shuai Mu 《Integration, the VLSI Journal》2012,45(1):46-60

Recently, graphics processing units (GPUs) have been rising as a new vehicle for high-performance, general purpose computing. It is attractive to unleash the power of GPUs for Electronic Design Automation (EDA) computations to cut the design turn-around time of VLSI systems. EDA algorithms, however, generally depend on irregular data structures such as sparse matrices and graphs, which pose major challenges for efficient GPU implementations. In this paper, we propose high-performance GPU implementations for a set of important irregular EDA computing patterns including sparse matrix, graph algorithms and message-passing algorithms. In the sparse matrix domain, we solve a core problem, sparse-matrix vector product (SMVP). On a wide range of EDA problem instances, our SMVP implementation outperforms all prior work and achieves a speedup of up to 50× over the CPU baseline implementation. The GPU-based SMVP procedure is applied to successfully accelerate two core EDA computing engines, timing analysis and linear system solution. In the graph algorithm domain, we developed an SMVP-based formulation to efficiently solve the breadth-first search (BFS) problem on GPUs. We also developed efficient solutions for two message-passing algorithms, survey propagation (SP) based SAT solution and a register-transfer level (RTL) simulation. Our results prove that GPUs have a strong potential to accelerate EDA computing through designing GPU-friendly algorithms and/or re-organizing computing structures of sequential algorithms.
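The link between SMVP and BFS that the abstract mentions can be sketched in a few lines of sequential Python. This is a generic illustration, not the paper's GPU formulation: it assumes the compressed sparse row (CSR) layout, an undirected graph (symmetric adjacency matrix), and the hypothetical function names `csr_matvec` and `bfs_levels`.

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def bfs_levels(indptr, indices, source):
    """BFS as repeated SMVP: each product expands the frontier by one hop.

    Assumes a symmetric adjacency structure (undirected graph).
    Unreachable vertices keep level -1.
    """
    n = len(indptr) - 1
    ones = [1.0] * len(indices)  # every stored edge has weight 1
    level = [-1] * n
    level[source] = 0
    frontier = [0.0] * n
    frontier[source] = 1.0
    depth = 0
    while any(frontier):
        depth += 1
        reached = csr_matvec(indptr, indices, ones, frontier)
        frontier = [0.0] * n
        for v in range(n):
            # A vertex joins the next frontier if a frontier vertex
            # touches it and it has not been visited yet.
            if reached[v] > 0 and level[v] < 0:
                level[v] = depth
                frontier[v] = 1.0
    return level
```

In this sketch each row of the product is independent of the others, which is what makes the SMVP formulation a natural fit for parallel hardware; the irregular row lengths are the part that a GPU implementation has to handle carefully.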

13.

GSWO: A programming model for GPU-enabled parallelization of sliding window operations in image processing

《Signal Processing: Image Communication》2016

Sliding Window Operations (SWOs) are widely used in image processing applications. They often have to be performed repeatedly across the target image, which can demand significant computing resources when processing large images with large windows. In applications in which real-time performance is essential, running these filters on a CPU often fails to deliver results within an acceptable timeframe. The emergence of sophisticated graphic processing units (GPUs) presents an opportunity to address this challenge. However, GPU programming requires a steep learning curve and is error-prone for novices, so the availability of a tool that can produce a GPU implementation automatically from the original CPU source code can provide an attractive means by which the GPU power can be harnessed effectively. This paper presents a GPU-enabled programming model, called GSWO, which can assist GPU novices by converting their SWO-based image processing applications from the original C/C++ source code to CUDA code in a highly automated manner. This model includes a new set of simple SWO pragmas to generate GPU kernels and to support effective GPU memory management. We have implemented this programming model based on a CPU-to-GPU translator (C2GPU). Evaluations have been performed on a number of typical SWO image filters and applications. The experimental results show that the GSWO model is capable of efficiently accelerating these applications, with improved applicability and a speed-up of performance compared to several leading CPU-to-GPU source-to-source translators.

14.

Optimization and Parallelization of Monaural Source Separation Algorithms in the openBliSSART Toolkit

Felix Weninger, Björn Schuller 《Journal of Signal Processing Systems》2012,69(3):267-277

We describe the implementation of monaural audio source separation algorithms in our toolkit openBliSSART (Blind Source Separation for Audio Recognition Tasks). To our knowledge, it provides the first freely available C++ implementation of Non-Negative Matrix Factorization (NMF) supporting the Compute Unified Device Architecture (CUDA) for fast parallel processing on graphics processing units (GPUs). Besides integrating parallel processing, openBliSSART introduces several numerical optimizations of commonly used monaural source separation algorithms that reduce both computation time and memory usage. By illustrating a variety of use-cases from audio effects in music processing to speech enhancement and feature extraction, we demonstrate the wide applicability of our application framework for a multiplicity of research and end-user applications. We evaluate the toolkit by benchmark results of the NMF algorithms and discuss the influence of their parameterization on source separation quality and real-time factor. As a result, the GPU parallelization in openBliSSART introduces double-digit speedups with respect to conventional CPU computation, enabling real-time processing on a desktop PC even for high matrix dimensions.

15.

Research on a GPU-based fast two-dimensional Walsh transform (total citations: 1; self-citations: 1; citations by others: 1)

Tong Ying, Zhang Jian 《微电子学与计算机》2011,28(1):46-49,53

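A fast two-dimensional Walsh transform implementation based on the GPU (Graphics Processing Unit) CUDA (Compute Unified Device Architecture) platform is proposed. The method exploits the parallel structure and hardware characteristics of the GPU and increases the speed of the Walsh transform through the algorithm implementation, the choice of memory types, and the logical architecture configuration. Experimental results show that, as image resolution increases, the Walsh transform takes far less time on the GPU than on the CPU, and the GPU's speedup over the CPU becomes more pronounced.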
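For readers unfamiliar with the transform, the butterfly structure that makes the Walsh transform fast can be sketched in sequential Python. This is a generic textbook formulation, not the paper's CUDA kernel: it is unnormalized, uses natural (Hadamard) ordering rather than sequency ordering, and the names `fwht` and `fwht2d` are assumptions.

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform in natural (Hadamard) order."""
    a = list(vec)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Each stage combines pairs that are h apart with a sum/difference butterfly.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a


def fwht2d(image):
    """2D transform: 1D transform of every row, then of every column."""
    rows = [fwht(r) for r in image]
    cols = [fwht(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For example, `fwht([1, 0, 1, 0])` returns `[2, 2, 0, 0]`. All butterflies within one stage are independent of each other, and so are the per-row and per-column transforms in the 2D case, which is the kind of parallelism a GPU implementation can exploit.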

16.

GPU-based MTD performance optimization

Yang Qianhe, Yuan Ziqiao, Hu Yuesong 《火控雷达技术》2021,50(1):86-93

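To address the problems that traditional radar signal processors face during development (difficult debugging, computing power limited by the hardware, and poor program reusability), this paper proposes using a GPU as the radar computing core. When radar signal processing algorithms are implemented on a GPU, the optimization gain for moving target detection (MTD) is far lower than for pulse compression and constant false alarm rate detection. Analysis shows that matrix transposition and vector dot products take up a large share of the algorithm's time in MTD. Starting from the GPU's ...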

17.

Fast implementation of the Canny operator based on GPU+CPU

Tang Bin, Long Wen 《液晶与显示》2016,31(7):714-720

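This paper proposes a fast GPU+CPU implementation of the Canny operator. First, the operator is divided into a serial part and a parallel part: Gaussian filtering, gradient magnitude and direction computation, non-maximum suppression, and double thresholding are performed on the GPU, and the two-dimensional Gaussian filter is decomposed into two one-dimensional filters, one horizontal and one vertical, to reduce computational complexity. Next, CUDA is used for multithreaded parallel computation to speed up processing. Finally, shared memory is used to hide the latency of thread accesses to global memory. On the CPU, a FIFO queue is used for edge linking. Simulation tests show that an 8-bit image with a resolution of 1024×1024 is processed in 122 ms, a speedup of up to 5.39× over using the CPU alone, so the method makes full use of both the parallelism of the GPU and the serial processing capability of the CPU.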
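The decomposition of the two-dimensional Gaussian filter into a horizontal and a vertical one-dimensional pass can be sketched in sequential Python. This is an illustration of the general separability property only, not the paper's CUDA code; the function names, the clamped border handling, and the default `radius` and `sigma` values are assumptions.

```python
import math


def gaussian_kernel_1d(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def filter_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for x in range(width)]
            for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur_separable(image, radius=1, sigma=1.0):
    """Horizontal pass, then vertical pass (done as a row pass on the transpose)."""
    kernel = gaussian_kernel_1d(radius, sigma)
    horizontal = filter_rows(image, kernel)
    return transpose(filter_rows(transpose(horizontal), kernel))


def gaussian_blur_direct(image, radius=1, sigma=1.0):
    """Reference: full 2D convolution with the outer-product kernel."""
    kernel = gaussian_kernel_1d(radius, sigma)
    height, width = len(image), len(image[0])
    offsets = range(-radius, radius + 1)
    return [[sum(kernel[dy + radius] * kernel[dx + radius]
                 * image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                 for dy in offsets for dx in offsets)
             for x in range(width)]
            for y in range(height)]
```

For a kernel of radius r, the direct form needs (2r+1)² multiplications per pixel and the separable form needs 2(2r+1), while both produce the same result up to floating-point rounding.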

18.

Intensity-based 2-D-3-D registration of cerebral angiograms (total citations: 4; self-citations: 0; citations by others: 4)

Hipwell JH, Penney GP, McLaughlin RA, Rhode K, Summers P, Cox TC, Byrne JV, Noble JA, Hawkes DJ 《IEEE transactions on medical imaging》2003,22(11):1417-1426

We propose a new method for aligning three-dimensional (3-D) magnetic resonance angiography (MRA) with 2-D X-ray digital subtraction angiograms (DSA). Our method is developed from our algorithm to register computed tomography volumes to X-ray images based on intensity matching of digitally reconstructed radiographs (DRRs). To make the DSA and DRR more similar, we transform the MRA images to images of the vasculature and set to zero the contralateral side of the MRA to that imaged with DSA. We initialize the search for a match on a user defined circular region of interest. We have tested six similarity measures using both unsegmented MRA and three segmentation variants of the MRA. Registrations were carried out on images of a physical neuro-vascular phantom and images obtained during four neuro-vascular interventions. The most accurate and robust registrations were obtained using the pattern intensity, gradient difference, and gradient correlation similarity measures, when used in conjunction with the most sophisticated MRA segmentations. Using these measures, 95% of the phantom start positions and 82% of the clinical start positions were successfully registered. The lowest root mean square reprojection errors were 1.3 mm (standard deviation 0.6) for the phantom and 1.5 mm (standard deviation 0.9) for the clinical data sets. Finally, we present a novel method for the comparison of similarity measure performance using a technique borrowed from receiver operator characteristic analysis.

19.

Radar display system simulation based on pixel shaders (total citations: 1; self-citations: 1; citations by others: 0)

Liu Qiang, Yang Zegang, Wang Wei 《太赫兹科学与电子信息学报》2009,7(3):180-183

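In radar display system simulation, producing high-quality radar images requires computing every pixel of the display's resolution cells; point-by-point computation is inefficient and seriously degrades the real-time performance of the simulation. Using pixel shader technology, the radar image is divided into three independently drawn texture layers, and the hardware blends the layered textures. This reduces the complexity of image computation and makes full use of the parallel computing capability of the CPU and the GPU (graphics processing unit). Application results show that, at a radar display resolution of 1280×1024, image output remains stable above 50 fps, which meets the needs of high-resolution radar display system simulation well.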

20.

Highly parallel GEMV with register blocking method on GPU architecture

《Journal of Visual Communication and Image Representation》2014,25(7):1566-1573

GPUs can provide powerful computing ability, especially for data parallel applications such as video/image processing. However, the complexity of a GPU system makes the optimization of even a simple algorithm difficult. Different optimization methods on a GPU often lead to different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS or MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on GPU architecture. This new method has three advantages. First, instead of using only one thread, we use a warp to compute an element of vector y so that the method can exploit the highly parallel GPU architecture. Second, the register blocking method is used to reduce the requirement of off-chip memory bandwidth. Third, the memory access order is elaborately arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes. The performance of the register blocking method with different block sizes is also evaluated in the experiment. Experimental results show that the new method can achieve very high speedup for small square matrices and fat matrices compared to CUBLAS or MAGMA, and can also achieve higher performance for large square matrices.
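The idea of assigning one warp to each element of y, with neighbouring lanes reading neighbouring columns, can be mimicked in sequential Python. This is only a simulation of the access pattern under stated assumptions: a warp size of 32, a row-major matrix, and a hypothetical function name `gemv_warp_per_row`. It does not reproduce the paper's register blocking scheme, whose details are not given in the abstract.

```python
WARP_SIZE = 32  # assumed warp width


def gemv_warp_per_row(matrix, x):
    """Simulate y = A @ x with one 32-lane 'warp' per output element."""
    y = []
    for row in matrix:
        # Each lane keeps a private partial sum (a register on a real GPU).
        partial = [0.0] * WARP_SIZE
        for lane in range(WARP_SIZE):
            # Lane l reads columns l, l + 32, l + 64, ... so that in every
            # step the 32 lanes touch 32 consecutive elements of the row.
            for col in range(lane, len(row), WARP_SIZE):
                partial[lane] += row[col] * x[col]
        # Tree reduction of the 32 partial sums into lane 0.
        offset = WARP_SIZE // 2
        while offset > 0:
            for lane in range(offset):
                partial[lane] += partial[lane + offset]
            offset //= 2
        y.append(partial[0])
    return y
```

For a fat matrix (few rows, many columns), a one-thread-per-row mapping would leave most of the hardware idle, whereas this mapping spreads each long row across 32 lanes; that is the situation the abstract singles out.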