Similar Documents
20 similar documents found (search time: 593 ms)
1.
This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphics processing units (GPUs) (Nvidia Corporation, Santa Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with computational cost of O(N). The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense, precomputed matrix. This is the case, for example, in the black-box fast multipole method (bbFMM), a variant of the FMM that can handle a large class of kernels; this variant is used in our benchmarks. In the FMM, two operators represent most of the computational cost, and an optimal implementation typically tries to balance them. One is the nearby interaction calculation (the direct sum calculation, line 29 in Listing 1), and the other is the M2L operation. We focus on the M2L. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, these approaches reduce the movement of data in the GPU. Because memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree, and are tested and compared with an optimized, OpenMP-parallelized, multi-core CPU code. We consider high- and low-precision calculations by varying the number of Chebyshev nodes used in the bbFMM. The accuracy of the GPU codes is found to be satisfactory, and the achieved performance exceeds 200 Gflop/s on one NVIDIA Tesla C1060 GPU. This was compared against two quad-core Intel Xeon E5345 processors (Intel Corporation, Santa Clara, CA, USA) running at 2.33 GHz, for a combined peak performance of 149 Gflop/s in single precision. For the low FMM accuracy case, the observed performance of the CPU code was 37 Gflop/s, whereas for the high FMM accuracy case, the performance was about 8.5 Gflop/s, most likely because of a higher frequency of cache misses. We also present benchmarks on an NVIDIA C2050 GPU (a Fermi processor) in single and double precision. Copyright © 2011 John Wiley & Sons, Ltd.
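The core data-reuse idea — batching M2L operations so that CUDA threads share the dense operator — can be illustrated with a minimal sketch. Below, one thread block stages a K × K M2L matrix in shared memory once and applies it to many source cells, so each operator entry is fetched from global memory only once per block; the sizes, names, and flat array layout are illustrative assumptions rather than the paper's actual scheme.

```cuda
// Hypothetical sketch of the data-reuse idea behind the M2L schemes: a block
// stages the dense K x K operator in shared memory, then sweeps many source
// cells with it. K and NCELLS are assumed values, not the paper's.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int K = 32;        // multipole expansion length (assumed)
constexpr int NCELLS = 256;  // source cells sharing this M2L operator (assumed)

__global__ void m2l_batched(const float* __restrict__ M,   // K*K operator
                            const float* __restrict__ mul, // NCELLS*K multipoles
                            float* __restrict__ loc)       // NCELLS*K locals
{
    __shared__ float Ms[K][K];
    int row = threadIdx.x;                 // one thread per output row
    for (int c = 0; c < K; ++c)            // stage the operator once per block
        Ms[row][c] = M[row * K + c];
    __syncthreads();

    // Each block sweeps a strip of cells, reusing the staged operator.
    for (int cell = blockIdx.x; cell < NCELLS; cell += gridDim.x) {
        float acc = 0.f;
        for (int c = 0; c < K; ++c)
            acc += Ms[row][c] * mul[cell * K + c];
        loc[cell * K + row] += acc;
    }
}

int main() {
    float *M, *mul, *loc;
    cudaMallocManaged(&M, K * K * sizeof(float));
    cudaMallocManaged(&mul, NCELLS * K * sizeof(float));
    cudaMallocManaged(&loc, NCELLS * K * sizeof(float));
    for (int i = 0; i < K * K; ++i) M[i] = 1e-3f * i;
    for (int i = 0; i < NCELLS * K; ++i) { mul[i] = 1.f; loc[i] = 0.f; }
    m2l_batched<<<64, K>>>(M, mul, loc);
    cudaDeviceSynchronize();
    printf("loc[0] = %f\n", loc[0]);
    return 0;
}
```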

2.
This paper presents techniques for generating very large finite-element matrices on a multicore workstation equipped with several graphics processing units (GPUs). To overcome the limited memory of the GPUs, and at the same time to accelerate the generation process, we propose to generate the large sparse linear systems arising in finite-element analysis in an iterative manner on several GPUs, using the graphics accelerators concurrently with CPUs that collect and add the matrix fragments via a fast multithreaded procedure. The scheduling of the threads is organized in such a way that the CPU operations do not affect the performance of the process, and the GPUs are idle only while data are being transferred from GPU to CPU. This approach is verified on two workstations: the first consists of two 6-core Intel Xeon X5690 processors with two Fermi GPUs (each a GeForce GTX 590 with two graphics processors and 1.5 GB of fast RAM); the second is equipped with two Tesla C2075 boards carrying 6 GB of RAM each and two 12-core Opteron 6174 processors. For the latter setup, we demonstrate the fast generation of sparse finite-element matrices as large as 10 million unknowns with over 1 billion nonzero entries. Compared with single-threaded and multithreaded CPU implementations, the GPU-based version of the algorithm reduces the finite-element matrix-generation time in double precision by factors of 100 and 30, respectively. Copyright © 2012 John Wiley & Sons, Ltd.
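A minimal sketch of the producer/consumer overlap described above: the GPU generates batches of matrix fragments while the CPU consumes the previously transferred batch, with pinned host buffers and two CUDA streams providing double buffering. The fragment kernel and the scalar accumulator are stand-ins for element integration and the paper's multithreaded sparse-assembly procedure.

```cuda
// Sketch under assumptions: CHUNK/NBATCH sizes and the trivial "fragment"
// values are placeholders; real code would scatter fragments into a sparse
// structure instead of summing them.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int CHUNK = 1 << 16;   // fragments per batch (assumed)
constexpr int NBATCH = 8;

__global__ void gen_fragments(float* vals, int batch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < CHUNK) vals[i] = 1e-6f * (batch * CHUNK + i);  // stand-in for integration
}

int main() {
    float *d_buf[2], *h_buf[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], CHUNK * sizeof(float));
        cudaMallocHost(&h_buf[b], CHUNK * sizeof(float));  // pinned for async DMA
        cudaStreamCreate(&s[b]);
    }
    double global_sum = 0.0;                // stand-in for sparse accumulation
    for (int batch = 0; batch < NBATCH; ++batch) {
        int b = batch & 1;
        gen_fragments<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d_buf[b], batch);
        cudaMemcpyAsync(h_buf[b], d_buf[b], CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
        if (batch > 0) {                    // CPU adds previous batch while GPU works
            int p = (batch - 1) & 1;
            cudaStreamSynchronize(s[p]);
            for (int i = 0; i < CHUNK; ++i) global_sum += h_buf[p][i];
        }
    }
    int last = (NBATCH - 1) & 1;            // drain the final batch
    cudaStreamSynchronize(s[last]);
    for (int i = 0; i < CHUNK; ++i) global_sum += h_buf[last][i];
    printf("checksum = %f\n", global_sum);
    return 0;
}
```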

3.
Li J, Bloch P, Xu J, Sarunic MV, Shannon L 《Applied Optics》2011, 50(13): 1832-1838
Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not share memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer-grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that, owing to the GPU platform's memory model, the maximum line rate is dictated by the memory transfer time rather than the processing time. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
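The transfer-bound behavior reported above is easy to probe with CUDA events. The sketch below (not the authors' pipeline) times a host-to-device copy of one frame of spectra against a stand-in processing kernel; the frame dimensions are assumptions.

```cuda
// Timing sketch: compare PCIe transfer time to compute time for one frame.
// NSAMP/NLINES are assumed sizes; process() stands in for resampling/FFT/log.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NSAMP = 2048, NLINES = 1000;  // samples per A-scan, lines per frame (assumed)

__global__ void process(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];          // placeholder per-sample work
}

int main() {
    size_t bytes = size_t(NSAMP) * NLINES * sizeof(float);
    float *h, *d;
    cudaMallocHost(&h, bytes);              // pinned memory for fast DMA
    cudaMalloc(&d, bytes);
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    process<<<(NSAMP * NLINES + 255) / 256, 256>>>(d, NSAMP * NLINES);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);
    float xfer, comp;
    cudaEventElapsedTime(&xfer, t0, t1);    // milliseconds
    cudaEventElapsedTime(&comp, t1, t2);
    printf("transfer %.3f ms vs compute %.3f ms\n", xfer, comp);
    return 0;
}
```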

4.
A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from http://intetec.org, has been adapted to run on an Nvidia Tesla general-purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix are performed on the GPU. Out-of-core techniques are used to solve problems larger than the available GPU memory. The code achieved about a 10-fold speedup in matrix assembly over a single CPU core and about 56 Gflop/s in the LU factorization using only 512 MB of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance of the GPU code.

5.
The proposed spectral element method implementation is based on sparse matrix storage of local shape function derivatives calculated at Gauss–Lobatto–Legendre points. The algorithm relies on two basic operations: multiplication of a sparse matrix by a vector, and element-by-element vector multiplication. Compute-intensive operations are performed for a part of the equation of motion derived at the degree-of-freedom level of 3D isoparametric spectral elements. The assembly is performed on the force vector in such a way that atomic operations are minimized; this is achieved by a new mesh coloring technique. The proposed parallel GPU implementation of the spectral element method is applied for the first time to Lamb wave simulations. Computation on a many-core GPU was found to be up to 14 times faster than on a single CPU. Copyright © 2015 John Wiley & Sons, Ltd.
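A toy sketch of assembly by mesh coloring, the technique used above to minimize atomics: elements are grouped so that no two elements of the same color share a node, and each per-color kernel launch can then scatter into the global force vector with plain adds. The 1D two-node elements and even/odd coloring are toy assumptions.

```cuda
// Colored assembly sketch: same-color elements touch disjoint nodes, so the
// scatter needs no atomicAdd; colors are processed in separate launches.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NELEM = 8, NNODE = NELEM + 1;

__global__ void assemble_color(const int* elems, int nelem,
                               const int2* conn, float* force) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nelem) return;
    int2 n = conn[elems[e]];
    force[n.x] += 1.0f;   // stand-in element force; race-free within a color
    force[n.y] += 1.0f;
}

int main() {
    int2* conn; int *colorA, *colorB; float* force;
    cudaMallocManaged(&conn, NELEM * sizeof(int2));
    cudaMallocManaged(&colorA, NELEM * sizeof(int));
    cudaMallocManaged(&colorB, NELEM * sizeof(int));
    cudaMallocManaged(&force, NNODE * sizeof(float));
    int na = 0, nb = 0;
    for (int e = 0; e < NELEM; ++e) {
        conn[e] = make_int2(e, e + 1);             // 1D chain of 2-node elements
        (e % 2 ? colorB[nb++] : colorA[na++]) = e; // even/odd coloring: no shared nodes per color
    }
    for (int i = 0; i < NNODE; ++i) force[i] = 0.f;
    assemble_color<<<1, 64>>>(colorA, na, conn, force);  // color 0
    cudaDeviceSynchronize();                             // colors run one after another
    assemble_color<<<1, 64>>>(colorB, nb, conn, force);  // color 1
    cudaDeviceSynchronize();
    for (int i = 0; i < NNODE; ++i) printf("%g ", force[i]);
    printf("\n");
    return 0;
}
```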

6.
Q. Wu, F. Wang, Y. Xiong 《Engineering Optimization》2016, 48(10): 1679-1692
To reduce computational time, a fully parallel implementation of the particle swarm optimization (PSO) algorithm on a graphics processing unit (GPU) is presented. Instead of being executed sequentially on the central processing unit (CPU), PSO is executed in parallel on the GPU via the compute unified device architecture (CUDA) platform. The parallelization of fitness evaluation and of the velocity and position updates of all particles is described in detail. Comparative studies on the optimization of four benchmark functions and a trajectory optimization problem are conducted by running PSO on the GPU (GPU-PSO) and on the CPU (CPU-PSO). The impact of the design dimension, the number of particles, the thread-block size on the GPU, and their interactions on the computational time is investigated. The results show that the computational time of the developed GPU-PSO is much shorter than that of CPU-PSO, with comparable accuracy, demonstrating the remarkable speed-up capability of GPU-PSO.
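The parallelization pattern described — one thread per particle dimension for the update, one thread per particle for fitness — can be sketched as follows; the sphere fitness function, hash-based random numbers (in place of a proper generator such as cuRAND), and the omitted best-position bookkeeping are simplifications, not the paper's implementation.

```cuda
// PSO parallelization sketch: update() uses one thread per (particle, dim)
// pair; fitness() uses one thread per particle. NP, DIM, W, C1, C2 assumed.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NP = 256, DIM = 32;
constexpr float W = 0.72f, C1 = 1.49f, C2 = 1.49f;

__device__ float frand(unsigned s) {        // quick xorshift hash -> [0,1)
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return (s & 0xFFFFFF) / 16777216.0f;
}

__global__ void update(float* x, float* v, const float* pbest,
                       const float* gbest, unsigned seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // particle*DIM + dim
    if (i >= NP * DIM) return;
    int d = i % DIM;
    float r1 = frand(seed ^ (2 * i + 1)), r2 = frand(seed ^ (2 * i + 2));
    v[i] = W * v[i] + C1 * r1 * (pbest[i] - x[i]) + C2 * r2 * (gbest[d] - x[i]);
    x[i] += v[i];
}

__global__ void fitness(const float* x, float* f) { // sphere function stand-in
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= NP) return;
    float s = 0.f;
    for (int d = 0; d < DIM; ++d) s += x[p * DIM + d] * x[p * DIM + d];
    f[p] = s;
}

int main() {
    float *x, *v, *pbest, *gbest, *f;
    cudaMallocManaged(&x, NP * DIM * sizeof(float));
    cudaMallocManaged(&v, NP * DIM * sizeof(float));
    cudaMallocManaged(&pbest, NP * DIM * sizeof(float));
    cudaMallocManaged(&gbest, DIM * sizeof(float));
    cudaMallocManaged(&f, NP * sizeof(float));
    for (int i = 0; i < NP * DIM; ++i) { x[i] = 1.f; v[i] = 0.f; pbest[i] = 0.5f; }
    for (int d = 0; d < DIM; ++d) gbest[d] = 0.f;
    for (int it = 0; it < 10; ++it) {
        update<<<(NP * DIM + 255) / 256, 256>>>(x, v, pbest, gbest, 1234u + it);
        fitness<<<(NP + 255) / 256, 256>>>(x, f);
        cudaDeviceSynchronize();   // pbest/gbest refresh omitted for brevity
    }
    printf("f[0] = %f\n", f[0]);
    return 0;
}
```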

7.
Network alignment is an important bridge to understanding human protein–protein interactions (PPIs) and functions through model organisms. However, the underlying subgraph isomorphism problem complicates and increases the time required to align protein interaction networks (PINs). Parallel computing technology is an effective solution to the challenge of aligning large-scale networks that sequential computing cannot handle. In this study, the typical Hungarian-Greedy Algorithm (HGA) is used as an example for PIN alignment. The authors propose an HGA with 2-nearest neighbours (HGA-2N) and implement its graphics processing unit (GPU) acceleration. Numerical experiments demonstrate that HGA-2N can find alignments that are close to those found by HGA while dramatically reducing computing time. The GPU implementation of HGA-2N optimises the parallel pattern, computing mode, and storage mode, and it improves the CPU/GPU computing time ratio relative to HGA when large-scale networks are considered. By using HGA-2N on GPUs, conserved PPIs can be observed and potential PPIs can be predicted. Among the predictions based on 25 common Gene Ontology terms, 42.8% can be found in the Human Protein Reference Database. Furthermore, a new method of reconstructing phylogenetic trees is introduced, which shows the same relationships among five herpes viruses that are obtained using other methods.

8.
Recently, the application of graphics processing units (GPUs) to scientific computations has been attracting a great deal of attention, because GPUs are becoming faster and more programmable. In particular, NVIDIA's compute unified device architecture (CUDA) enables highly multithreaded parallel computing for non-graphics applications. This paper proposes a novel way to accelerate the boundary element method (BEM) for the three-dimensional Helmholtz equation using CUDA. Adopting techniques for data caching and double–single precision floating-point arithmetic, we implemented a GPU-accelerated BEM program for GeForce 8-series GPUs. The program performed 6–23 times faster than a normal BEM program optimized for an Intel quad-core CPU on a series of boundary value problems with 8000–128000 unknowns, and it sustained a performance of 167 Gflop/s on the largest problem (1 058 000 unknowns). The accuracy of our BEM program was almost the same as that of the regular BEM program using double precision floating-point arithmetic. In addition, our BEM was applicable to realistic problems. In conclusion, the present GPU-accelerated BEM solves large-scale boundary value problems for the Helmholtz equation rapidly and precisely. Copyright © 2009 John Wiley & Sons, Ltd.
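The double–single idea mentioned above carries each value as an unevaluated sum hi + lo of two floats, recovering much of double precision on hardware that is fast in single precision. The sketch below uses the textbook two-sum formulation (compile without fast-math so the error terms survive); it is a generic illustration, not the paper's routines.

```cuda
// Double-single ("float-float") addition sketch: Knuth's two-sum recovers the
// rounding error of each float add, and the pair (hi, lo) carries it along.
#include <cstdio>
#include <cuda_runtime.h>

struct ds { float hi, lo; };

__device__ ds ds_add(ds a, ds b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error of s
    e += a.lo + b.lo;
    float hi = s + e;
    return { hi, e - (hi - s) };              // renormalize
}

__global__ void accumulate(const float* x, int n, ds* out) {
    ds acc = { 0.f, 0.f };
    for (int i = 0; i < n; ++i) acc = ds_add(acc, { x[i], 0.f });
    *out = acc;
}

int main() {
    const int n = 1 << 20;
    float* x; ds* out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(ds));
    for (int i = 0; i < n; ++i) x[i] = 1e-4f;  // plain float summation drifts here
    accumulate<<<1, 1>>>(x, n, out);
    cudaDeviceSynchronize();
    printf("double-single sum = %.9g (hi) + %.3g (lo)\n", out->hi, out->lo);
    return 0;
}
```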

9.
The finite element method (FEM) is a well-developed method for solving real-world problems that can be modeled with differential equations. As the available computational power increases, complex and large-size problems can be solved using FEM, which typically involves multiple degrees of freedom (DOF) per node, high-order elements, and an iterative solver requiring several sparse matrix-vector multiplication operations. In this work, a new storage scheme is proposed for sparse matrices arising from FEM simulations with multiple DOF per node. A sparse matrix-vector multiplication kernel and its variants using the proposed scheme are also given for CUDA-enabled GPUs. The proposed scheme and kernels rely on the mesh connectivity data from the FEM discretization and the number of DOF per node. The kernel performance was evaluated on seven test matrices for double-precision floating-point operations. The performance analysis showed that the proposed GPU kernel outperforms the ELLPACK (ELL) and CUSPARSE Hybrid (HYB) format GPU kernels by an average of 42% and 32%, respectively, on a Tesla K20c card. Copyright © 2016 John Wiley & Sons, Ltd.
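A simplified sketch of a block sparse matrix-vector product exploiting a fixed number of DOF per node, in the spirit of (but not identical to) the proposed scheme: nonzeros are stored as dense DOF × DOF blocks indexed by node adjacency, and one thread computes one row of one node's output.

```cuda
// Block-sparse SpMV sketch: node adjacency (rowptr/colidx) addresses dense
// DOF x DOF blocks; layout and the toy 2-node mesh are assumptions.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

constexpr int DOF = 3;

__global__ void bsr_spmv(int nnodes, const int* rowptr, const int* colidx,
                         const float* blocks,   // nnz_blocks * DOF * DOF
                         const float* x, float* y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;  // global row = node*DOF + dof
    if (r >= nnodes * DOF) return;
    int node = r / DOF, d = r % DOF;
    float acc = 0.f;
    for (int b = rowptr[node]; b < rowptr[node + 1]; ++b) {
        const float* blk = blocks + size_t(b) * DOF * DOF;
        int cnode = colidx[b];
        for (int c = 0; c < DOF; ++c)
            acc += blk[d * DOF + c] * x[cnode * DOF + c];
    }
    y[r] = acc;
}

int main() {
    int h_rowptr[3] = {0, 2, 4}, h_colidx[4] = {0, 1, 0, 1};  // dense 2x2 block pattern
    int *rowptr, *colidx; float *blocks, *x, *y;
    cudaMallocManaged(&rowptr, sizeof(h_rowptr));
    cudaMallocManaged(&colidx, sizeof(h_colidx));
    cudaMallocManaged(&blocks, 4 * DOF * DOF * sizeof(float));
    cudaMallocManaged(&x, 2 * DOF * sizeof(float));
    cudaMallocManaged(&y, 2 * DOF * sizeof(float));
    memcpy(rowptr, h_rowptr, sizeof(h_rowptr));
    memcpy(colidx, h_colidx, sizeof(h_colidx));
    for (int i = 0; i < 4 * DOF * DOF; ++i) blocks[i] = 1.f;
    for (int i = 0; i < 2 * DOF; ++i) x[i] = 1.f;
    bsr_spmv<<<1, 64>>>(2, rowptr, colidx, blocks, x, y);
    cudaDeviceSynchronize();
    for (int i = 0; i < 2 * DOF; ++i) printf("%g ", y[i]);  // each row sums 2*DOF ones
    printf("\n");
    return 0;
}
```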

10.
Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, any idea or strategy aiming at reducing CPU time is an important research topic. A general and robust approach is the parallelization of source codes/programs. Recently, the technological development of graphics hardware created the possibility of using desktop video cards to solve numerically intensive problems. We present a powerful parallel computing framework for solving reaction-diffusion equations numerically using graphics processing units (GPUs) with CUDA. Four different reaction-diffusion problems were solved: (i) diffusion of a chemically inert compound, (ii) Turing pattern formation, (iii) phase separation in the wake of a moving diffusion front, and (iv) air pollution dispersion; additionally, both the Shared method and the Moving Tiles method were tested. Our results show that the parallel implementation achieves typical acceleration values on the order of 5-40 times compared to a single-threaded CPU implementation on a 2.8 GHz desktop computer.
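The shared-memory tiling at the heart of the Shared method can be sketched for a 2D diffusion step as follows: each block stages a (TILE+2)² halo tile of the field before updating its interior, cutting redundant global loads. The grid size, stability constant, and clamped (zero-flux) boundaries are toy assumptions.

```cuda
// Tiled explicit diffusion step: load tile + halo into shared memory once,
// then apply the 5-point stencil. N, TILE, and D are assumed values.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 256, TILE = 16;
constexpr float D = 0.1f;  // diffusion coefficient * dt / dx^2 (assumed stable)

__global__ void diffuse(const float* u, float* un) {
    __shared__ float t[TILE + 2][TILE + 2];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    auto at = [&](int i, int j) {          // clamped boundary lookup
        i = min(max(i, 0), N - 1); j = min(max(j, 0), N - 1);
        return u[j * N + i];
    };
    t[ly][lx] = at(x, y);
    if (threadIdx.x == 0)        t[ly][0]        = at(x - 1, y);
    if (threadIdx.x == TILE - 1) t[ly][TILE + 1] = at(x + 1, y);
    if (threadIdx.y == 0)        t[0][lx]        = at(x, y - 1);
    if (threadIdx.y == TILE - 1) t[TILE + 1][lx] = at(x, y + 1);
    __syncthreads();
    un[y * N + x] = t[ly][lx] + D * (t[ly][lx - 1] + t[ly][lx + 1] +
                                     t[ly - 1][lx] + t[ly + 1][lx] - 4.f * t[ly][lx]);
}

int main() {
    float *u, *un;
    cudaMallocManaged(&u, N * N * sizeof(float));
    cudaMallocManaged(&un, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) u[i] = 0.f;
    u[(N / 2) * N + N / 2] = 1.f;                  // point source
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    for (int step = 0; step < 100; ++step) {
        diffuse<<<grid, block>>>(u, un);
        cudaDeviceSynchronize();
        float* tmp = u; u = un; un = tmp;          // ping-pong buffers
    }
    printf("center = %g\n", u[(N / 2) * N + N / 2]);
    return 0;
}
```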

11.
Recently, graphics processing units (GPUs) have been increasingly leveraged in a variety of scientific computing applications. However, architectural differences between CPUs and GPUs necessitate the development of algorithms that take advantage of GPU hardware. As sparse matrix-vector (SPMV) multiplication operations are commonly used in finite element analysis, a new SPMV algorithm and several variations are developed for unstructured finite element meshes on GPUs. The effective bandwidth of current GPU algorithms and the newly proposed algorithms is measured and analyzed for 15 sparse matrices of varying sizes and sparsity structures. The effects of optimization and the differences between the new GPU algorithm and its variants are then studied. Lastly, both new and current SPMV GPU algorithms are used in the GPU CG solver in GPU finite element simulations of the heart, and the results are compared against a parallel PETSc finite element implementation. The effective bandwidth tests indicate that the new algorithms compare very favorably with current algorithms for a wide variety of sparse matrices and can yield very notable benefits. GPU finite element simulation results demonstrate the benefit of using GPUs for finite element analysis and also show that the proposed algorithms can yield speedup factors of up to 12-fold for real finite element applications. Copyright © 2015 John Wiley & Sons, Ltd.
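For context, a common GPU SPMV variant for unstructured meshes (not the authors' new algorithm) is the warp-per-row CSR kernel sketched below: the 32 lanes of a warp read consecutive nonzeros of one row, giving coalesced access, and combine partial sums with shuffle reductions.

```cuda
// Generic warp-per-row CSR SpMV; the 2x2 toy matrix is an assumption.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void spmv_warp_csr(int nrows, const int* rowptr, const int* col,
                              const float* val, const float* x, float* y) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp >= nrows) return;
    float sum = 0.f;
    for (int j = rowptr[warp] + lane; j < rowptr[warp + 1]; j += 32)
        sum += val[j] * x[col[j]];            // coalesced within the warp
    for (int off = 16; off > 0; off >>= 1)    // intra-warp reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[warp] = sum;
}

int main() {
    int h_rp[3] = {0, 1, 2}, h_ci[2] = {0, 1};
    float h_v[2] = {2.f, 3.f}, h_x[2] = {1.f, 1.f};
    int *rp, *ci; float *v, *x, *y;
    cudaMallocManaged(&rp, sizeof(h_rp)); cudaMallocManaged(&ci, sizeof(h_ci));
    cudaMallocManaged(&v, sizeof(h_v));   cudaMallocManaged(&x, sizeof(h_x));
    cudaMallocManaged(&y, 2 * sizeof(float));
    memcpy(rp, h_rp, sizeof(h_rp)); memcpy(ci, h_ci, sizeof(h_ci));
    memcpy(v, h_v, sizeof(h_v));    memcpy(x, h_x, sizeof(h_x));
    spmv_warp_csr<<<1, 64>>>(2, rp, ci, v, x, y);
    cudaDeviceSynchronize();
    printf("y = [%g, %g]\n", y[0], y[1]);     // expect [2, 3]
    return 0;
}
```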

12.
Cloud-derived wind refers to the wind field data product derived by inversion from satellite remote sensing cloud images. Satellite cloud-derived wind inversion is large scale, computationally intensive, and time consuming, and the most widely used tracer-cloud tracking method is the maximum cross-correlation coefficient (MCC) method. To overcome the efficiency bottleneck of the serial MCC algorithm, we propose in this paper a parallel cloud-derived wind inversion algorithm based on a GPU framework, exploiting the independence of the individual wind vector calculations. In this algorithm, each iteration is mapped to a thread of the GPU cores; each GPU thread block is allocated n*32 threads, and the thread blocks are arranged in the thread grid. The algorithm's parameters are passed from the CPU to GPU global memory, with the storage spaces created on the GPU device before the algorithm's functions are executed. Test results for multiple sets of inversion models on an NVIDIA GeForce GT GPU and a 4-core 8-thread Core i7-3770 CPU show that the algorithm significantly improves inversion efficiency, with an acceleration ratio of up to 112; the parallel experiment acceleration ratio is also impressive.
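The one-thread-per-wind-vector mapping can be sketched as below: each thread slides a template patch from one image over a search window in the next image and keeps the displacement with the highest correlation. The patch and search sizes are toy assumptions, and the correlation omits mean subtraction for brevity, so it is a simplification of the full MCC.

```cuda
// One thread per wind vector: exhaustive template match over a small search
// window. W, T, S and the synthetic shifted image are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int W = 64, T = 8, S = 4;  // image width, template size, search radius

__global__ void mcc(const float* a, const float* b, int npts,
                    const int2* pts, int2* disp) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per wind vector
    if (k >= npts) return;
    int2 p = pts[k];
    float best = -2.f; int2 bd = make_int2(0, 0);
    for (int dy = -S; dy <= S; ++dy)
        for (int dx = -S; dx <= S; ++dx) {
            float sab = 0.f, saa = 0.f, sbb = 0.f;
            for (int j = 0; j < T; ++j)
                for (int i = 0; i < T; ++i) {
                    float va = a[(p.y + j) * W + p.x + i];
                    float vb = b[(p.y + j + dy) * W + p.x + i + dx];
                    sab += va * vb; saa += va * va; sbb += vb * vb;
                }
            float r = sab * rsqrtf(saa * sbb + 1e-12f);  // means omitted for brevity
            if (r > best) { best = r; bd = make_int2(dx, dy); }
        }
    disp[k] = bd;
}

int main() {
    float *a, *b; int2 *pts, *disp;
    cudaMallocManaged(&a, W * W * sizeof(float));
    cudaMallocManaged(&b, W * W * sizeof(float));
    cudaMallocManaged(&pts, sizeof(int2));
    cudaMallocManaged(&disp, sizeof(int2));
    for (int i = 0; i < W * W; ++i)
        a[i] = ((i * 2654435761u) & 1023) / 1024.0f;   // hash-like test pattern
    for (int y = 0; y < W; ++y)                        // b = a shifted by (2, 1)
        for (int x = 0; x < W; ++x)
            b[y * W + x] = a[((y - 1 + W) % W) * W + (x - 2 + W) % W];
    pts[0] = make_int2(20, 20);
    mcc<<<1, 32>>>(a, b, 1, pts, disp);
    cudaDeviceSynchronize();
    printf("displacement = (%d, %d)\n", disp[0].x, disp[0].y);  // expect (2, 1)
    return 0;
}
```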

13.
With the development of parallel computing architectures, larger and more complex finite element analyses (FEA) are being performed with higher accuracy and shorter execution times. Graphics processing units (GPUs) are one of the major contributors to this computational breakthrough. This work presents a three-stage GPU-based FEA matrix generation strategy whose key idea is to decouple the computation of global matrix indices and values by means of a novel data structure referred to as the neighbor matrix. The first stage computes the neighbor matrix on the GPU from the unstructured mesh. Using this neighbor matrix, the indices and values of the global matrix are computed separately in the second and third stages. The neighbor matrix is computed for three different element types. Two versions performing numerical integration and assembly in the same or in separate kernels are implemented, and simulations are run for different mesh sizes with up to three million degrees of freedom on a single GPU. Comparison with a GPU-based parallel implementation from the literature reveals speedups ranging from 4× to 6× for the proposed workload division strategy. Furthermore, the same-kernel implementation is found to outperform the separate-kernel implementation by 70% to 150% for different element types.
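A simplified sketch of decoupling the sparsity pattern from the values, in the spirit of the three-stage strategy: one kernel counts each node's neighbors, a prefix scan turns the counts into CSR row offsets, and a second kernel writes column indices; a numerical-integration kernel would then fill values into the fixed pattern. The 1D chain mesh stands in for the paper's mesh-derived neighbor matrix.

```cuda
// Pattern-first matrix generation sketch; the chain mesh and NN are toys.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

constexpr int NN = 6;  // nodes in a 1D chain mesh (assumed)

__global__ void count_neighbors(int* cnt) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= NN) return;
    cnt[n] = 1 + (n > 0) + (n < NN - 1);  // self + left + right neighbor
}

__global__ void fill_cols(const int* rowptr, int* col) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= NN) return;
    int p = rowptr[n];
    if (n > 0) col[p++] = n - 1;
    col[p++] = n;
    if (n < NN - 1) col[p++] = n + 1;
}

int main() {
    thrust::device_vector<int> cnt(NN), rowptr(NN + 1, 0);
    count_neighbors<<<1, 32>>>(thrust::raw_pointer_cast(cnt.data()));
    cudaDeviceSynchronize();
    // Exclusive scan of counts gives each row's start offset.
    thrust::exclusive_scan(cnt.begin(), cnt.end(), rowptr.begin());
    rowptr[NN] = rowptr[NN - 1] + cnt[NN - 1];
    thrust::device_vector<int> col(rowptr[NN]);
    fill_cols<<<1, 32>>>(thrust::raw_pointer_cast(rowptr.data()),
                         thrust::raw_pointer_cast(col.data()));
    cudaDeviceSynchronize();
    for (int n = 0; n <= NN; ++n) printf("%d ", (int)rowptr[n]);  // 0 2 5 8 11 14 16
    printf("\n");
    return 0;
}
```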

14.
To achieve the wavefront phase-recovery stage of an adaptive-optics loop computed in real time for 32 × 32 or more subpupils in a Shack-Hartmann sensor, we present here, for what is to our knowledge the first time, preliminary results obtained using two innovative techniques: graphical processing units (GPUs) and field-programmable gate arrays (FPGAs). We describe the stream-computing paradigm of the GPU and adapt a zonal algorithm to take advantage of the parallel computational power of the GPU. We also present preliminary results obtained by running the same algorithm on FPGAs. GPUs have proved to be a promising technique, but FPGAs are already a feasible solution to adaptive-optics real-time requirements, even for a large number of subpupils.

15.
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are created and analyzed. Multiple strategies for the efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choices of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor using single-precision arithmetic achieves speedups of 30 or more in comparison to a well-optimized double-precision single-core implementation. We also find that the optimal assembly strategy depends on the order of the polynomials used in the finite element discretization. Copyright © 2010 John Wiley & Sons, Ltd.

16.
周伟江, 董博, 许伟杰 《Technical Acoustics》2018, 37(2): 187-191
To address the fact that the conventional particle filter keeps the number of particles fixed, an improved algorithm that adaptively adjusts the particle count is proposed. The algorithm introduces the Kullback-Leibler divergence (KLD) into the resampling step of the particle filter, so that the number of particles used during filtering can be adjusted effectively while a given filtering accuracy is maintained, making the particle count adaptive throughout the filtering process. The algorithm is applied to bearings-only underwater target tracking; simulation results show that the method effectively improves filtering performance at low computational cost and is suitable for practical applications.
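The sample-size rule that drives KLD-based adaptive particle filters is Fox's bound: given k occupied histogram bins, an error bound ε, and the (1−δ) quantile z of the standard normal, the required particle count is n = ((k−1)/2ε)·(1 − 2/(9(k−1)) + sqrt(2/(9(k−1)))·z)³. The sketch below implements this standard formula as a reference; the paper's exact variant may differ.

```cuda
// KLD-sampling bound (Wilson-Hilferty chi-square approximation); generic
// textbook form, not necessarily the paper's exact implementation.
#include <cstdio>
#include <cmath>

// Particles needed so that, with probability 1-delta, the KL distance between
// the sample-based and true posterior stays below eps.
__host__ __device__ inline int kld_sample_size(int k, double eps, double z) {
    if (k < 2) return 1;
    double a = 2.0 / (9.0 * (k - 1));
    double b = 1.0 - a + sqrt(a) * z;
    return (int)ceil((k - 1) / (2.0 * eps) * b * b * b);
}

int main() {
    // z = 2.326 is the 0.99 quantile of the standard normal (delta = 0.01).
    for (int k = 2; k <= 512; k *= 4)
        printf("k = %3d occupied bins -> n = %d particles (eps = 0.05)\n",
               k, kld_sample_size(k, 0.05, 2.326));
    return 0;
}
```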

17.
Quantitative sodium magnetic resonance imaging permits noninvasive measurement of the tissue sodium concentration (TSC) bioscale in the brain. Computing the TSC bioscale requires reconstructing and combining multiple datasets acquired with a non-Cartesian acquisition that highly oversamples the center of k-space. Even with an optimized implementation of the algorithm to compute TSC, the overall processing time exceeds the time required to collect data from the human subject. Such a mismatch presents a challenge for sustained sodium imaging: avoiding a growing data backlog while providing timely results. The most computationally intensive portions of the TSC calculation have been identified and accelerated using a consumer graphics processing unit (GPU) in addition to a conventional central processing unit (CPU). A recently developed data organization technique called Compact Binning was used along with several existing algorithmic techniques to maximize the scalability and performance of these computationally intensive operations. The resulting GPU+CPU TSC bioscale calculation is more than 15 times faster than a CPU-only implementation when processing 256 × 256 × 256 data and 2.4 times faster when processing 128 × 128 × 128 data. This eliminates the possibility of a data backlog for quantitative sodium imaging. The accelerated quantification technique is suitable for general three-dimensional non-Cartesian acquisitions and may enable more sophisticated imaging techniques that acquire even more data to be used for quantitative sodium imaging. © 2013 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 23, 29–35, 2013.

18.
It takes an enormous amount of time to calculate a computer-generated hologram (CGH). A fast CGH calculation method using precalculated object light has been proposed, in which the light waves of an arbitrary object are calculated via transform calculations of the precalculated object light. However, this method requires a huge amount of memory. This paper proposes a method that uses a cylindrical basic object light to reduce the memory requirement, and further accelerates it using a graphics processing unit (GPU). Experimental results show that the calculation on a GPU is about 65 times faster than on a CPU.

19.
While modern CFD tools are able to provide the user with reliable and accurate simulations, there is a strong need for interactive design and analysis tools. State-of-the-art CFD software employs massive resources in terms of CPU time, user interaction, and GPU time for rendering and analysis. In this work, we develop an innovative tool that provides a seamless bridge between artistic design and engineering analysis. The platform has three main ingredients: computer vision to avoid long user interaction at the pre-processing stage, machine learning to avoid costly CFD simulations, and augmented reality for agile and interactive post-processing of the results.

20.
This paper presents a single instruction multiple data tabu search (SIMD-TS) algorithm for the quadratic assignment problem (QAP) with graphics hardware acceleration. The QAP is a classical combinatorial optimisation problem that is difficult to solve optimally even for small problems with over 30 items. By using graphics hardware acceleration, the developed SIMD-TS algorithm executes 20 to 45 times faster than traditional CPU code. The computational improvement is made possible by utilising the parallel computing capability of a graphics processing unit (GPU). The speed and effectiveness of the algorithm are demonstrated on QAP library problems. The main contribution of this paper is a fast and effective SIMD-TS algorithm capable of producing results for large QAPs on a desktop personal computer equivalent to those achieved with a CPU cluster.
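The move-evaluation step that dominates a tabu search for the QAP parallelizes naturally: one thread scores one candidate swap (r, s) with the O(n) delta formula for symmetric flow/distance matrices with zero diagonals. The sketch below is a minimal illustration of that SIMD pattern, not the paper's algorithm; the instance is a toy, not a QAPLIB problem.

```cuda
// Parallel delta evaluation of all swaps for a QAP permutation; the host
// would then pick the best non-tabu move. N and the matrices are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 16;  // facilities/locations (assumed)

__global__ void eval_swaps(const int* p, const float* f, const float* d,
                           float* delta) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;  // move index over (r, s)
    if (m >= N * N) return;
    int r = m / N, s = m % N;
    if (r >= s) { delta[m] = 1e30f; return; }       // only evaluate r < s
    float acc = 0.f;
    for (int k = 0; k < N; ++k) {                   // O(n) symmetric delta formula
        if (k == r || k == s) continue;
        acc += (f[r * N + k] - f[s * N + k]) *
               (d[p[s] * N + p[k]] - d[p[r] * N + p[k]]);
    }
    delta[m] = 2.f * acc;
}

int main() {
    int* p; float *f, *d, *delta;
    cudaMallocManaged(&p, N * sizeof(int));
    cudaMallocManaged(&f, N * N * sizeof(float));
    cudaMallocManaged(&d, N * N * sizeof(float));
    cudaMallocManaged(&delta, N * N * sizeof(float));
    for (int i = 0; i < N; ++i) p[i] = i;           // identity permutation
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {               // symmetric, zero diagonal
            f[i * N + j] = (i == j) ? 0.f : float((i + j) % 5 + 1);
            d[i * N + j] = (i == j) ? 0.f : float(i > j ? i - j : j - i);
        }
    eval_swaps<<<(N * N + 255) / 256, 256>>>(p, f, d, delta);
    cudaDeviceSynchronize();
    float best = 1e30f; int bm = 0;
    for (int m = 0; m < N * N; ++m)
        if (delta[m] < best) { best = delta[m]; bm = m; }
    printf("best swap (%d,%d) with delta %g\n", bm / N, bm % N, best);
    return 0;
}
```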
