Similar Literature (20 results)
1.
A new Graphics Processing Unit (GPU) parallelization strategy is proposed to accelerate sparse finite element computation for three-dimensional electromagnetic analysis. The strategy is built on a new compression format called sliced ELL Four (sliced ELL-F) and is designed to speed up the many addition, dot-product, and Sparse Matrix Vector Product (SMVP) operations in the Conjugate Gradient Norm (CGN) solution of finite element equations. A new implementation of SMVP on GPUs is evaluated. Executed on a GPU, the proposed strategy can efficiently solve sparse finite element equations, especially when the equations are huge and sparse (most rows of the coefficient matrix contain fewer than 8 nonzeros). Numerical results show that the sliced ELL-F parallelization strategy achieves significant speedups over the Compressed Sparse Row (CSR) format.
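A minimal sketch of an SpMV kernel over a generic sliced-ELL layout, the family of formats that sliced ELL-F extends; the exact ELL-F structure is not given in the abstract, and the names slice_ptr, cols, and vals (with column-major storage inside each slice and -1 marking padding) are illustrative assumptions, not the paper's code.

__global__ void spmv_sliced_ell(int n_rows, int slice_size,
                                const int *slice_ptr,   // start offset of each slice in cols/vals
                                const int *cols,        // column indices, column-major within a slice (-1 = padding)
                                const float *vals,      // nonzero values, same layout as cols
                                const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    int slice = row / slice_size;
    int lane  = row % slice_size;                            // this row's position inside its slice
    int base  = slice_ptr[slice];
    int width = (slice_ptr[slice + 1] - base) / slice_size;  // padded width of this slice

    float sum = 0.0f;
    for (int j = 0; j < width; ++j) {
        int c = cols[base + j * slice_size + lane];          // column-major: adjacent rows -> adjacent loads
        if (c >= 0)
            sum += vals[base + j * slice_size + lane] * x[c];
    }
    y[row] = sum;
}

Storing each slice column-major means consecutive rows read consecutive addresses, and padding is bounded by the widest row of a slice rather than of the whole matrix — attractive precisely when nearly all rows hold fewer than 8 nonzeros.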

2.
In this paper, we propose an efficient parallel dynamic linear solver, called GPU-GMRES, for transient analysis of large linear dynamic systems such as large power grid networks. The new method is based on the preconditioned generalized minimum residual (GMRES) iterative method implemented on heterogeneous CPU-GPU platforms. The solver is robust and applies to power grids with different structures as well as to general analysis of large linear dynamic systems with asymmetric matrices. GPU-GMRES adopts a general and robust incomplete-LU-based preconditioner. We show that by properly selecting the amount of fill-in in the incomplete LU factors, a good trade-off between GPU efficiency and convergence rate can be achieved for the best overall performance; this tunable feature makes the algorithm adaptable to different problems. The solver partitions the major computing tasks in GMRES so as to minimize data traffic between the CPU and GPUs. Furthermore, we propose a new fast parallel sparse matrix-vector (SpMV) multiplication algorithm, called segSpMV, which achieves fully coalesced memory access, unlike existing approaches. To further improve scalability and efficiency, segSpMV is extended to multi-GPU platforms, yielding a more scalable and faster multi-GPU GMRES solver. Experimental results on the published IBM benchmark circuits and on mesh-structured power grid networks show that GPU-GMRES delivers orders-of-magnitude speedup over the direct LU solver UMFPACK. The resulting multi-GPU-GMRES also delivers 3-12× speedup over the CPU implementation of the same GMRES method on transient analysis.
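For contrast with the coalesced segSpMV described above (whose internals are not reproduced in the abstract), here is a minimal sketch of the naive one-thread-per-row CSR SpMV whose scattered loads segSpMV is designed to avoid.

__global__ void spmv_csr_scalar(int n_rows, const int *row_ptr, const int *cols,
                                const float *vals, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    float sum = 0.0f;
    // lanes of a warp walk rows of different lengths, so neighbouring
    // threads read far-apart addresses: the uncoalesced pattern
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += vals[j] * x[cols[j]];
    y[row] = sum;
}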

3.
This paper presents a novel sparse matrix technique for the numerical analysis of semiconductor devices, together with its algorithmic implementation. The storage scheme and computation procedure of the sparse matrix are described in detail and compared with existing sparse matrix techniques, showing that in device simulation the new technique greatly reduces storage and computation time and is very easy to implement. Several algorithms built on the technique are also given, together with computed examples that illustrate its time and space characteristics.

4.
A novel sparse matrix technique for the numerical analysis of semiconductor devices and its algorithms are presented. The storage scheme and calculation procedure of the sparse matrix are described in detail. In device simulation, the sparse matrix technique can decrease storage greatly with less CPU time, and its implementation is very easy. Some algorithms and calculation examples are given to show the time and space characteristics of the sparse matrix.

5.
General-purpose graphics processing units (GPGPUs) have gained much popularity in scientific computing for speeding up computationally intensive workloads. Resource allocation in current wireless standards, in terms of power and subcarrier assignment, is a challenging problem owing to its high computational complexity. The Hungarian algorithm (HA), which has been applied extensively to linear assignment problems (LAPs), has been shown to provide encouraging results for resource allocation in wireless communication systems. This paper presents a compute unified device architecture (CUDA) implementation of the HA on the GPU for this problem. The HA is implemented on a parallel architecture to solve the subcarrier assignment problem and maximize spectral efficiency. The proposed implementation uses the Kuhn-Munkres algorithm with effective modifications in order to fully exploit the capabilities of modern GPU devices. A cost matrix for maximum assignment is defined, leading to a low-complexity matrix compression along with a highly optimized CUDA reduction and a parallel alternating-path search. Together these optimizations yield an efficient implementation with superior performance compared with existing parallel implementations.
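A minimal sketch of just the first Kuhn-Munkres step — row reduction of the cost matrix — as a CUDA kernel; the full GPU Hungarian algorithm involves several further kernels (covering, alternating-path search) not shown here. Launch with one block per row and a power-of-two block size, e.g. km_row_reduce<<<n, 256, 256 * sizeof(float)>>>(d_cost, n).

#include <cfloat>   // FLT_MAX

__global__ void km_row_reduce(float *cost, int n)
{
    extern __shared__ float smin[];
    int row = blockIdx.x;                                // one block per row

    float m = FLT_MAX;                                   // strided scan of this row
    for (int c = threadIdx.x; c < n; c += blockDim.x)
        m = fminf(m, cost[row * n + c]);
    smin[threadIdx.x] = m;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {       // tree reduction to the row minimum
        if (threadIdx.x < s)
            smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
        __syncthreads();
    }

    for (int c = threadIdx.x; c < n; c += blockDim.x)    // subtract the minimum from the row
        cost[row * n + c] -= smin[0];
}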

6.
Cloud vendors such as Amazon (AWS) have started to offer FPGAs in addition to GPUs and CPUs in their on-demand computing services. In this work we explore design-space trade-offs of implementing a state-of-the-art machine learning library for gradient-boosted decision trees (GBDT) on the Amazon cloud, and compare scalability, performance, cost, and accuracy with the best known CPU and GPU implementations from the literature. Our evaluation indicates that, depending on the dataset, an FPGA-based implementation of the bottleneck computation kernels yields a speedup anywhere from 3X to 10X over a GPU and 5X to 33X over a CPU. We show that a smaller bin size gives better performance on an FPGA, yet even with a bin size of 16 and a fixed-point implementation, the accuracy degradation on the FPGA is relatively small, around 1.3%-3.3% compared with a floating-point implementation using 256 bins on a CPU or GPU.

7.
Expander graph arguments for message-passing algorithms
We show how expander-based arguments may be used to prove that message-passing algorithms can correct a linear number of erroneous messages. The implication of this result is that when the block length is sufficiently large, once a message-passing algorithm has corrected a sufficiently large fraction of the errors, it will eventually correct all errors. This result is then combined with known results on the ability of message-passing algorithms to reduce the number of errors to an arbitrarily small fraction at relatively high transmission rates. The results hold for various message-passing algorithms, including Gallager's hard-decision and soft-decision (with clipping) decoding algorithms. Our results assume low-density parity-check (LDPC) codes based on irregular bipartite graphs.

8.
Yu Xiaodong, Wang Hao, Feng Wu-chun, Gong Hao, Cao Guohua. 《Journal of Signal Processing Systems》, 2019, 91(3-4): 321-338

The algebraic reconstruction technique (ART) is an iterative algorithm for CT (computed tomography) image reconstruction that delivers better image quality at a lower radiation dose than the industry-standard filtered back projection (FBP). However, the high computational cost of ART requires researchers to turn to high-performance computing to accelerate the algorithm. Existing GPU approaches for ART suffer from inefficient design of compressed data structures and computational kernels. This paper therefore presents cuART, our CUDA-based CT image-reconstruction tool built on ART, which delivers a compression and parallelization solution for ART-based image reconstruction on GPUs. We analyze the under-performing, but popular, GPU libraries (e.g., cuSPARSE, BRC, and CSR5) on the ART algorithm and propose a symmetry-based CSR format (SCSR) that further compresses the CSR data structure and optimizes data access for both SpMV and SpMV_T via a column-index permutation. We also propose sorting-based global-level and sorting-free view-level blocking techniques that optimize the kernel computation by leveraging the different sparsity patterns of the system matrix. As a result, cuART reduces the memory footprint significantly and enables practical CT datasets to fit into a single GPU. Experimental results on an NVIDIA Tesla K80 GPU show that our approach achieves up to 6.8x, 7.2x, and 5.4x speedups over counterparts that use cuSPARSE, BRC, and CSR5, respectively.
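To see why SpMV_T is the hard case that SCSR's column-index permutation targets, here is a minimal sketch of a transpose SpMV over plain CSR — illustrative code, not cuART's: each thread scatters into the output vector, so contended atomic writes are unavoidable.

__global__ void spmv_t_csr_atomic(int n_rows, const int *row_ptr, const int *cols,
                                  const float *vals, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    // computing y = A^T x from A in CSR: each nonzero contributes to
    // y[col], and many rows hit the same column -> atomics required
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        atomicAdd(&y[cols[j]], vals[j] * x[row]);
}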


9.
Median filtering is a well-known method used in a wide range of application frameworks as well as a standalone filter, especially for salt-and-pepper denoising. It strongly reduces the power of the noise while minimizing edge blurring. Existing algorithms and implementations are quite efficient, but can still be improved with respect to processing speed, which led us to investigate the specificities of modern GPUs further. In this paper, we propose a GPU implementation of fixed-size-kernel median filters able to output up to 1.85 billion pixels per second on Tesla C2070 cards. Based on a Branchless Vectorized Median (BVM) class of algorithm and implemented through fine memory tuning and the use of GPU registers, our median drastically outperforms existing implementations, resulting, as far as we know, in the fastest median filter to date.
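A sketch in the spirit of branchless vectorized medians: the min/max "forgetful selection" network below is the classic textbook construction for a 3x3 median, assumed here as an illustration rather than taken from the paper. Every comparison compiles to predicated min/max instructions, so no branches are taken.

#define s2(a, b)                { float t_ = fminf(a, b); (b) = fmaxf(a, b); (a) = t_; }
#define mn3(a, b, c)            s2(a, b); s2(a, c);
#define mx3(a, b, c)            s2(b, c); s2(a, c);
#define mnmx3(a, b, c)          mx3(a, b, c); s2(a, b);   // leaves the median in b
#define mnmx4(a, b, c, d)       s2(a, b); s2(c, d); s2(a, c); s2(b, d);
#define mnmx5(a, b, c, d, e)    s2(a, b); s2(c, d); mn3(a, c, e); mx3(b, d, e);
#define mnmx6(a, b, c, d, e, f) s2(a, d); s2(b, e); s2(c, f); mn3(a, b, c); mx3(d, e, f);

__global__ void median3x3(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;   // borders left untouched

    // 3x3 neighbourhood loaded straight into registers
    float p0 = in[(y-1)*w + x-1], p1 = in[(y-1)*w + x], p2 = in[(y-1)*w + x+1];
    float p3 = in[ y   *w + x-1], p4 = in[ y   *w + x], p5 = in[ y   *w + x+1];
    float p6 = in[(y+1)*w + x-1], p7 = in[(y+1)*w + x], p8 = in[(y+1)*w + x+1];

    // forgetful selection: each stage discards the running min and max,
    // so after the last stage p4 holds the median of the nine values
    mnmx6(p0, p1, p2, p3, p4, p5);
    mnmx5(p1, p2, p3, p4, p6);
    mnmx4(p2, p3, p4, p7);
    mnmx3(p3, p4, p8);

    out[y*w + x] = p4;
}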

10.
GPU Computing
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.

11.
Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Future predictions illustrate that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256×256×133 CT volume in ~24 ms using an NVidia GeForce 8800 GTX and in ~2 ms using an NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.

12.
The tremendous power of graphics processing unit (GPU) computing relative to prior CPU-only architectures presents new opportunities for efficiently solving previously intractable large-scale optimization problems. Although most previous work in this field has focused on scientific applications in medicine and physics, here we present a Compute Unified Device Architecture (CUDA) GPU solution to the topology control problem in hybrid radio frequency and free space optics wireless mesh networks, adapting and adjusting the transmission power and beam-width of individual nodes according to QoS requirements. Our approach is based on a stochastic global optimization technique inspired by the social behavior of flocking birds, so-called 'particle swarm optimization', and was implemented on the NVIDIA GeForce GTX 285 GPU. The implementation achieved a speedup factor of 392 over a CPU-only implementation. Several innovations in the memory/execution structure of our approach enabled us to surpass all previously known particle swarm optimization GPU implementations. Our results provide a promising indication of the viability of GPU-based approaches to large-scale optimization problems such as those found in radio frequency and free space optics wireless mesh network design.
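For reference, the canonical particle swarm update that such GPU implementations parallelize, one particle per thread (the standard formulation; the paper's exact variant and parameter choices are not given in the abstract). Here $x_i$ is particle $i$'s position (a candidate power/beam-width assignment), $v_i$ its velocity, $p_i$ its personal best, $g$ the global best, $\omega$ the inertia weight, $c_1, c_2$ acceleration coefficients, and $r_1, r_2$ fresh uniform draws from $[0,1]$:

$$v_i^{t+1} = \omega\, v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1}$$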

13.
To address the low efficiency of multi-resolution image fusion algorithms running on the CPU alone, a new fusion rule is proposed for image fusion algorithms based on gradient pyramid decomposition, and the entire fusion algorithm is designed and efficiently implemented on the graphics processor, with the computation-intensive tasks scheduled onto the GPU. Experiments verify that the fused images preserve the salient features of the source images, and that, compared with the CPU-only fusion algorithm, the system achieves speedups for image fusion at various image sizes. This demonstrates an approach to improving the efficiency of multi-resolution image fusion algorithms through graphics processors.

14.
The next-generation DVB-T2, DVB-S2, and DVB-C2 standards for digital television broadcasting specify the use of low-density parity-check (LDPC) codes with codeword lengths of up to 64800 bits. Real-time decoding of these codes on general-purpose computing hardware is useful for completely software-defined receivers, as well as for testing and simulation purposes. Modern graphics processing units (GPUs) are capable of massively parallel computation and can in some cases, given carefully designed algorithms, outperform general-purpose CPUs (central processing units) by an order of magnitude or more. The main problem in decoding LDPC codes on GPU hardware is that LDPC decoding generates irregular memory accesses, which tend to carry heavy performance penalties on GPUs. Memory accesses can be efficiently parallelized by decoding several codewords in parallel, as well as by using appropriate data structures. In this article we present the algorithms and data structures used to make log-domain decoding of the long LDPC codes specified by the DVB-T2 standard, at the high data rates required for television broadcasting, possible on a modern GPU. Furthermore, we describe a similar decoder implemented on a general-purpose CPU, and show that high-performance LDPC decoders are also possible on modern multi-core CPUs.
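A minimal sketch of one building block of log-domain LDPC decoding: the min-sum check-node update, written as a plain CUDA device function. This is the generic textbook form; the paper's layered scheduling, packed message layout, and multi-codeword batching are not reproduced here.

#include <cfloat>   // FLT_MAX

// in[]:  deg incoming variable-to-check LLR messages for one check node
// out[]: deg outgoing check-to-variable messages (each excludes its own input)
__device__ void check_node_minsum(const float *in, float *out, int deg)
{
    // smallest and second-smallest magnitudes, plus the product of all signs
    float min1 = FLT_MAX, min2 = FLT_MAX;
    int   min_idx = -1, sign = 1;
    for (int i = 0; i < deg; ++i) {
        float m = fabsf(in[i]);
        if (m < min1) { min2 = min1; min1 = m; min_idx = i; }
        else if (m < min2) { min2 = m; }
        if (in[i] < 0.0f) sign = -sign;
    }

    // outgoing magnitude is the min over the *other* inputs; outgoing sign
    // is the total sign product divided by this input's own sign
    for (int i = 0; i < deg; ++i) {
        float mag = (i == min_idx) ? min2 : min1;
        out[i] = ((in[i] < 0.0f) ? -sign : sign) * mag;
    }
}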

15.
谈继魁, 方勇, 霍迎秋. 《电视技术》, 2015, 39(15): 42-45
Reconstruction algorithms play an important role in compressed sensing theory. The classical orthogonal matching pursuit (OMP) reconstruction algorithm orthogonalizes the selected atoms at each iteration to accelerate convergence, but this also increases the computational complexity of the algorithm. To address this problem, a parallel OMP algorithm based on graphics processing unit (GPU) computing is proposed, focusing on parallel designs for the high-complexity projection and matrix-inversion parts of the algorithm on the GPU platform. Experimental results show that the GPU-based parallel OMP algorithm achieves speedups of 30-44x over its serial counterpart, effectively improving the computational efficiency of the algorithm and broadening its range of applications.
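A minimal sketch of the projection step the abstract singles out: computing the correlations c = |A^T r| with one block per dictionary atom. The layout assumptions (A is m x n and row-major, power-of-two block size) are illustrative; the least-squares/matrix-inversion step is not shown.

__global__ void omp_correlate(const float *A, const float *r, float *c,
                              int m /* rows */, int n /* atoms */)
{
    extern __shared__ float part[];
    int atom = blockIdx.x;                     // one block per column of A

    // strided dot product of column 'atom' with the residual r
    float s = 0.0f;
    for (int i = threadIdx.x; i < m; i += blockDim.x)
        s += A[i * n + atom] * r[i];
    part[threadIdx.x] = s;
    __syncthreads();

    for (int k = blockDim.x / 2; k > 0; k >>= 1) {   // shared-memory reduction
        if (threadIdx.x < k) part[threadIdx.x] += part[threadIdx.x + k];
        __syncthreads();
    }
    if (threadIdx.x == 0) c[atom] = fabsf(part[0]);  // OMP then picks argmax |c|
}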

16.
GPUs can provide powerful computing ability, especially for data-parallel applications such as video/image processing. However, the complexity of GPU systems makes optimizing even a simple algorithm difficult, and different optimization methods on a GPU often lead to different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS or MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. This method has three advantages. First, instead of using only one thread, we use a warp to compute each element of the vector y, so that the method exploits the highly parallel GPU architecture. Second, register blocking is used to reduce the required off-chip memory bandwidth. Finally, the memory access order is carefully arranged for the threads in one warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes, and the performance of the register blocking method with different block sizes is also evaluated. Experimental results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS or MAGMA, and also achieves higher performance for large square matrices.
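A minimal sketch of the warp-per-output-element idea: each warp accumulates one entry of y = A x, with lanes striding across the row so that consecutive lanes read consecutive addresses (coalesced). The paper's register blocking of x across several rows is omitted; this is an illustration, not the authors' kernel.

__global__ void gemv_warp_per_row(const float *A, const float *x, float *y,
                                  int m, int n)    // A is m x n, row-major
{
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    int lane = threadIdx.x & 31;
    if (row >= m) return;

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)           // lanes read adjacent elements
        sum += A[row * n + j] * x[j];

    for (int off = 16; off > 0; off >>= 1)       // warp-level tree reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);

    if (lane == 0) y[row] = sum;
}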

17.
The use of biomechanical modelling, especially in conjunction with finite element analysis, has become common in many areas of medical image analysis and surgical simulation. Clinical employment of such techniques is hindered by conflicting requirements for high fidelity in the modelling approach, and fast solution speeds. We report the development of techniques for high-speed nonlinear finite element analysis for surgical simulation. We use a fully nonlinear total Lagrangian explicit finite element formulation which offers significant computational advantages for soft tissue simulation. However, the key contribution of the work is the presentation of a fast graphics processing unit (GPU) solution scheme for the finite element equations. To the best of our knowledge, this represents the first GPU implementation of a nonlinear finite element solver. We show that the present explicit finite element scheme is well suited to solution via highly parallel graphics hardware, and that even a midrange GPU allows significant solution speed gains (up to 16.8 times) compared with equivalent CPU implementations. For the models tested the scheme allows real-time solution of models with up to 16 000 tetrahedral elements. The use of GPUs for such purposes offers a cost-effective high-performance alternative to expensive multi-CPU machines, and may have important applications in medical image analysis and surgical simulation.

18.
Tracking systems are important in computer vision, with applications in video surveillance, human-computer interfaces (HCI), etc. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications such as image processing and computer vision tasks. In this work we present an effective particle filtering implementation for real-time template tracking, based on the use of a graphics card as a streaming architecture, in a translation-rotation-scale model.
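A minimal sketch of the per-particle likelihood step of such a template tracker, one block per particle. The state layout (tx, ty, rotation, scale packed in a float4) and the SSD-based weight are illustrative assumptions, not the paper's code; launch with a power-of-two block size and gridDim.x equal to the particle count.

__global__ void particle_weights(const float *img, int iw, int ih,
                                 const float *tmpl, int tw, int th,
                                 const float4 *states, float *weights, float lambda)
{
    extern __shared__ float acc[];
    float4 s = states[blockIdx.x];                // (tx, ty, rotation, scale)
    float c = cosf(s.z) * s.w, sn = sinf(s.z) * s.w;

    // sum of squared differences between the template and the warped patch
    float ssd = 0.0f;
    for (int k = threadIdx.x; k < tw * th; k += blockDim.x) {
        int u = k % tw - tw / 2, v = k / tw - th / 2;
        int px = (int)(s.x + c * u - sn * v);     // rotate/scale/translate
        int py = (int)(s.y + sn * u + c * v);
        if (px >= 0 && px < iw && py >= 0 && py < ih) {
            float d = img[py * iw + px] - tmpl[k];
            ssd += d * d;
        }
    }
    acc[threadIdx.x] = ssd;
    __syncthreads();

    for (int r = blockDim.x / 2; r > 0; r >>= 1) {    // block-level reduction
        if (threadIdx.x < r) acc[threadIdx.x] += acc[threadIdx.x + r];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        weights[blockIdx.x] = expf(-lambda * acc[0]); // unnormalized likelihood
}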

19.
This tutorial paper describes various efficient implementations, published as well as new unpublished ones, of the forward and backward modified discrete cosine transform (MDCT) in the MPEG layer III (MP3) audio coding standard developed in the period 1990-2010, including, for completeness, efficient implementations of the polyphase filter banks. The efficient MDCT implementations are discussed in the context of (fast) complete analysis/synthesis MDCT filter banks in the MP3 encoder and decoder. For each efficient forward/backward MDCT block-transform implementation, the following are presented: complete formulas or sparse matrix factorizations of the algorithm, the corresponding signal flow graph for the short audio block, and the total arithmetic complexity, together with useful comments on improving the arithmetic complexity and possible structural simplifications of the algorithm. Finally, all efficient forward/backward MDCT implementations are compared in terms of both arithmetic complexity and structural simplicity. It is important to note that almost all of the presented algorithms can also be used for 2n-length data blocks in other MPEG audio coding standards and proprietary audio compression algorithms.
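For reference, the standard forward MDCT of a $2N$-sample windowed block (in MP3, $N = 18$ for long blocks and $N = 6$ for short blocks) and its inverse, stated up to the normalization convention; these are the transforms the surveyed fast algorithms factorize:

$$X_k = \sum_{n=0}^{2N-1} x_n \cos\!\left[\frac{\pi}{N}\Big(n + \frac{1}{2} + \frac{N}{2}\Big)\Big(k + \frac{1}{2}\Big)\right], \qquad k = 0,\dots,N-1$$

$$\hat{x}_n = \frac{1}{N}\sum_{k=0}^{N-1} X_k \cos\!\left[\frac{\pi}{N}\Big(n + \frac{1}{2} + \frac{N}{2}\Big)\Big(k + \frac{1}{2}\Big)\right], \qquad n = 0,\dots,2N-1$$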

20.
A defining characteristic of the new generation of artificial intelligence is the use of high-performance distributed computing capabilities, such as GPU computing and cloud computing, to train machine learning algorithms, represented by deep learning, on big data in order to simulate, extend, and expand human intelligence. Because data comes from different sources and computation takes place at different physical locations, today's machine learning faces serious privacy-leakage problems, and privacy-preserving machine learning (PPM) has therefore become a research area of wide current interest. Using cryptographic tools to solve the privacy problems in machine learning is a key technique of privacy-preserving machine learning. This paper introduces the cryptographic tools commonly used in privacy-preserving machine learning, including general secure multi-party computation (SMPC), privacy-preserving set operations, and homomorphic encryption (HE), and surveys the research methods and current state of research on applying them to the privacy problems that arise at each stage of machine learning: data preparation, model training, model testing, and prediction.
