Similar Documents
 20 similar documents found (search time: 46 ms)
1.
This paper presents techniques for generating very large finite-element matrices on a multicore workstation equipped with several graphics processing units (GPUs). To overcome the GPUs' limited memory and at the same time accelerate the generation process, we propose to generate the large sparse linear systems arising in finite-element analysis iteratively on several GPUs, using the graphics accelerators concurrently with CPUs that collect and add the matrix fragments in a fast multithreaded procedure. The threads are scheduled so that the CPU operations do not affect the performance of the process, and the GPUs are idle only while data are being transferred from GPU to CPU. This approach is verified on two workstations: the first consists of two 6-core Intel Xeon X5690 processors and two Fermi-based GeForce GTX 590 cards, each carrying two graphics processors and 1.5 GB of fast RAM; the second is equipped with two Tesla C2075 boards with 6 GB of RAM each and two 12-core Opteron 6174s. For the latter setup, we demonstrate the fast generation of sparse finite-element matrices with as many as 10 million unknowns and over 1 billion nonzero entries. Compared with the single-threaded and multithreaded CPU implementations, the GPU-based version of the algorithm reduces the finite-element matrix-generation time in double precision by factors of 100 and 30, respectively. Copyright © 2012 John Wiley & Sons, Ltd.
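A minimal CUDA sketch of the overlap pattern described above, assuming a chunked pipeline: the GPU generates one buffer of element-matrix fragments while a CPU thread adds the previously copied buffer into the global matrix. All names (gen_fragments, CHUNK, the placeholder "integration") are illustrative, not taken from the paper.

```cuda
// Sketch only: double-buffered fragment generation (GPU) overlapped with
// fragment accumulation (CPU). The element "integration" is a placeholder.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int NPE   = 4;                    // nodes per element (e.g., tets)
constexpr int FRAG  = NPE * NPE;            // dense fragment size per element
constexpr int CHUNK = 1 << 15;              // elements generated per batch

__global__ void gen_fragments(const float* elemData, float* frags, int n) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n) return;
    for (int i = 0; i < FRAG; ++i)          // stand-in for element integration
        frags[e * FRAG + i] = elemData[e] * (i + 1);
}

int main() {
    const int nElems = 8 * CHUNK, nChunks = nElems / CHUNK;
    const size_t bytes = (size_t)CHUNK * FRAG * sizeof(float);
    float *dElem, *dFrags[2], *hFrags[2];
    cudaMalloc(&dElem, nElems * sizeof(float));
    cudaMemset(dElem, 0, nElems * sizeof(float));
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&dFrags[b], bytes);
        cudaMallocHost(&hFrags[b], bytes);  // pinned memory for async copies
    }
    std::vector<double> K(1 << 20, 0.0);    // stand-in for the global sparse matrix
    cudaStream_t s[2]; cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    auto accumulate = [&](const float* f) { // CPU-side addition; real code
        for (size_t i = 0; i < (size_t)CHUNK * FRAG; ++i)   // scatters by DOF
            K[i % K.size()] += f[i];
    };
    for (int c = 0; c < nChunks; ++c) {
        int b = c & 1;
        gen_fragments<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(
            dElem + (size_t)c * CHUNK, dFrags[b], CHUNK);
        cudaMemcpyAsync(hFrags[b], dFrags[b], bytes, cudaMemcpyDeviceToHost, s[b]);
        if (c > 0) {                        // overlap: add chunk c-1 on the CPU
            cudaStreamSynchronize(s[1 - b]);// while chunk c runs on the GPU
            accumulate(hFrags[1 - b]);
        }
    }
    cudaStreamSynchronize(s[(nChunks - 1) & 1]);
    accumulate(hFrags[(nChunks - 1) & 1]);
    printf("K[0] = %f\n", K[0]);
    return 0;
}
```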

2.
Particle-mesh interpolations are fundamental operations for particle-in-cell codes, as implemented in vortex methods, plasma dynamics and electrostatics simulations. In these simulations, the mesh is used to solve the field equations and the gradients of the fields are used in order to advance the particles. The time integration of particle trajectories is performed through an extensive resampling of the flow field at the particle locations. The computational performance of this resampling turns out to be limited by the memory bandwidth of the underlying computer architecture. We investigate how mesh-particle interpolation can be efficiently performed on graphics processing units (GPUs) and multicore central processing units (CPUs), and we present two implementation techniques. The single-precision results for the multicore CPU implementation show an acceleration of 45-70×, depending on system size, and an acceleration of 85-155× for the GPU implementation over an efficient single-threaded C++ implementation. In double precision, we observe a performance improvement of 30-40× for the multicore CPU implementation and 20-45× for the GPU implementation. With respect to the 16-threaded standard C++ implementation, the present CPU technique leads to a performance increase of roughly 2.8-3.7× in single precision and 1.7-2.4× in double precision, whereas the GPU technique leads to an improvement of 9× in single precision and 2.2-2.8× in double precision.
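For reference, the mesh-to-particle half of the operation parallelizes naturally with one thread per particle. A minimal 2-D sketch using bilinear interpolation (layout and names are assumptions, not the paper's code):

```cuda
// Sketch: one thread per particle gathers the field at the particle position
// by bilinear interpolation from a 2-D mesh (the bandwidth-bound M2P gather).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void mesh_to_particle(const float* field, int nx, int ny, float h,
                                 const float2* pos, float* out, int np) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= np) return;
    float x = pos[p].x / h, y = pos[p].y / h;     // position in cell units
    int i = min(max((int)x, 0), nx - 2);
    int j = min(max((int)y, 0), ny - 2);
    float fx = x - i, fy = y - j;                 // fractional offsets
    float v00 = field[j * nx + i],       v10 = field[j * nx + i + 1];
    float v01 = field[(j + 1) * nx + i], v11 = field[(j + 1) * nx + i + 1];
    out[p] = (1 - fx) * (1 - fy) * v00 + fx * (1 - fy) * v10
           + (1 - fx) * fy * v01 + fx * fy * v11;
}

int main() {
    const int nx = 256, ny = 256, np = 1 << 20;
    float *field, *out; float2 *pos;
    cudaMallocManaged(&field, nx * ny * sizeof(float));
    cudaMallocManaged(&pos, np * sizeof(float2));
    cudaMallocManaged(&out, np * sizeof(float));
    for (int k = 0; k < nx * ny; ++k) field[k] = (float)(k % nx); // f(x,y) = x/h
    for (int p = 0; p < np; ++p) pos[p] = make_float2(12.3f, 45.6f);
    mesh_to_particle<<<(np + 255) / 256, 256>>>(field, nx, ny, 1.0f, pos, out, np);
    cudaDeviceSynchronize();
    printf("interpolated value: %f (expect 12.3)\n", out[0]);
    return 0;
}
```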

3.
General purpose computing on graphics processing units (GPUs) has been previously shown to speed up computationally intensive data processing and image reconstruction algorithms for computed tomography (CT), magnetic resonance (MR), and ultrasound images. Although some algorithms in ultrasound have been converted to GPU processing, many investigative ultrasound research systems still use serial processing on a single CPU. One such ultrasound modality is acoustic radiation force impulse (ARFI) imaging, which investigates the mechanical properties of soft tissue. Traditionally, the raw data are processed offline to estimate the displacement of the tissue after the application of radiation force. It is highly advantageous to process the data in real-time to assess their quality and make modifications during a study. In this paper, we present algorithms for efficient GPU parallel processing of two widely used tools in ultrasound: cubic spline interpolation and Loupas' two-dimensional autocorrelator for displacement estimation. It is shown that a commercially available graphics card can be used for these computations, achieving speed increases up to 40× compared with single CPU processing. Thus, we conclude that the GPU-based data processing approach facilitates real-time (i.e., <1 second) display of ARFI data and is a promising approach for ultrasonic research systems.
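The displacement estimation lends itself to one-thread-per-depth-sample parallelism. Below is a simplified lag-one (Kasai-style) autocorrelator rather than the full 2-D Loupas estimator, which additionally estimates the fast-time center frequency; the data layout and names are assumptions:

```cuda
// Sketch: per-sample phase-shift estimation from baseband I/Q data using a
// lag-one autocorrelation (Kasai). The 2-D Loupas autocorrelator also
// estimates the fast-time frequency; that refinement is omitted here.
#include <cuda_runtime.h>
#include <cstdio>
#include <math.h>

// iq is laid out as [ensemble line][depth sample]: M lines, nd samples each.
__global__ void kasai(const float2* iq, float* disp, int nd, int M,
                      float c, float f0) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d >= nd) return;
    float re = 0.f, im = 0.f;
    for (int m = 0; m < M - 1; ++m) {            // lag-one autocorrelation
        float2 a = iq[m * nd + d], b = iq[(m + 1) * nd + d];
        re += a.x * b.x + a.y * b.y;             // Re{conj(a) * b}
        im += a.x * b.y - a.y * b.x;             // Im{conj(a) * b}
    }
    float phase = atan2f(im, re);
    disp[d] = c * phase / (4.f * 3.14159265f * f0);  // meters per pulse pair
}

int main() {
    const int nd = 2048, M = 8;
    float2* iq; float* disp;
    cudaMallocManaged(&iq, nd * M * sizeof(float2));
    cudaMallocManaged(&disp, nd * sizeof(float));
    for (int m = 0; m < M; ++m)                  // synthetic 0.1-rad shift/line
        for (int d = 0; d < nd; ++d)
            iq[m * nd + d] = make_float2(cosf(0.1f * m), sinf(0.1f * m));
    kasai<<<(nd + 255) / 256, 256>>>(iq, disp, nd, M, 1540.f, 5e6f);
    cudaDeviceSynchronize();
    printf("displacement[0] = %e m\n", disp[0]); // ~1540*0.1/(4*pi*5e6)
    return 0;
}
```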

4.
Diffusion tensor imaging (DTI) tractography represents neural fiber pathways by using local tensor information based on water diffusion anisotropy in brain white matter. However, DTI tractography is often unable to reconstruct crossing, kissing, and branching fiber trajectories due to intrinsic limitations of DTI. Increasingly complex tractography algorithms provide reliable and visually pleasing results, yet at an increasing computational cost compared with simple tractography algorithms. To shorten the computation time, we developed multi-GPU (graphics processing unit) parallelized versions of deterministic and probabilistic tractography algorithms and investigated their utility for near-real-time tractography. Using multiple GPUs (three NVIDIA Tesla C1060s), we dramatically reduced the computation time compared with sequential central processing unit (CPU) processing: deterministic tractography was accelerated 101-fold and probabilistic tractography 63-fold. The results show that parallel tractography algorithms are well suited to the GPU's fundamentally parallel architecture. © 2013 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 23, 256–264, 2013
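Deterministic streamline tracking maps naturally to one thread per seed. A hedged sketch, assuming a precomputed principal-eigenvector field and FA map, nearest-neighbour lookup, and fixed-step Euler integration (real trackers interpolate and apply angle thresholds):

```cuda
// Sketch: deterministic (streamline) tracking with one CUDA thread per seed.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void track(const float3* e1, const float* fa, int3 dim,
                      const float3* seeds, float3* paths, int maxSteps,
                      int nSeeds, float step, float faStop) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSeeds) return;
    float3 p = seeds[s], dirPrev = make_float3(0, 0, 1);
    for (int k = 0; k < maxSteps; ++k) {
        int x = (int)p.x, y = (int)p.y, z = (int)p.z;
        if (x < 0 || y < 0 || z < 0 || x >= dim.x || y >= dim.y || z >= dim.z)
            break;
        int v = (z * dim.y + y) * dim.x + x;
        if (fa[v] < faStop) break;               // stop in low-anisotropy tissue
        float3 d = e1[v];                        // principal diffusion direction
        float dot = d.x * dirPrev.x + d.y * dirPrev.y + d.z * dirPrev.z;
        if (dot < 0) { d.x = -d.x; d.y = -d.y; d.z = -d.z; }  // fix sign flips
        p.x += step * d.x; p.y += step * d.y; p.z += step * d.z;
        dirPrev = d;
        paths[s * maxSteps + k] = p;             // record the trajectory
    }
}

int main() {
    int3 dim = make_int3(64, 64, 64);
    const int nSeeds = 1024, maxSteps = 256, nVox = dim.x * dim.y * dim.z;
    float3 *e1, *seeds, *paths; float *fa;
    cudaMallocManaged(&e1, nVox * sizeof(float3));
    cudaMallocManaged(&fa, nVox * sizeof(float));
    cudaMallocManaged(&seeds, nSeeds * sizeof(float3));
    cudaMallocManaged(&paths, (size_t)nSeeds * maxSteps * sizeof(float3));
    for (int v = 0; v < nVox; ++v) { e1[v] = make_float3(0, 0, 1); fa[v] = 0.8f; }
    for (int s = 0; s < nSeeds; ++s) seeds[s] = make_float3(32, 32, 1);
    track<<<(nSeeds + 255) / 256, 256>>>(e1, fa, dim, seeds, paths,
                                         maxSteps, nSeeds, 0.5f, 0.2f);
    cudaDeviceSynchronize();
    printf("first step z = %f\n", paths[0].z);   // 1.5 for the synthetic field
    return 0;
}
```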

5.
Automated optical inspection systems installed in production lines help ensure high throughput by speeding up the inspection of defects that are difficult to detect with the naked eye. However, depending on the size and surface properties of the products, such as micro-cracks on touchscreen panel glass covers, detection speed and accuracy are limited by the imaging module and lighting technique, so current inspection methods are still delegated to a few qualified personnel whose limited capacity has been a major bottleneck for high-volume production. In this study, an automated optical technology for in-line surface defect inspection is developed, offering high spatial resolution and detection speed for any surface. The inspection system consists of an LED array that illuminates a wide inspection area on the test object and a 12288-pixel line CCD that captures scattered light from surface defects at a 12 kHz acquisition rate. The 3.5 μm per-pixel resolution of the line CCD provides a detection width of up to 43 mm, equivalent to 147 megapixels of image data acquired per second. To handle the large volume of data per acquisition cycle, the data are transmitted from a host CPU to multiple GPU devices, where CUDA-based image-processing kernels detect and label surface defects in parallel. The processed data are sent back to the CPU to display user-defined defect maps. 2-D inspection of back-coated flat mirrors, 43 × 70 mm² in size, using a single CCD module and multiple GPUs shows that surface flaws such as bubbles, cracks, and edge defects are detected accurately. The acquisition time to capture and load the data to the CPU is 1.7 s, while the time to transmit the same data to a GPU and detect surface defects is 248 ms, considerably faster than the minute-long computations of a purely CPU-based processing algorithm on the same test object. The minimum width of detected surface defects is about 10 μm, with true detection rates above 94%. Moreover, the inspection system is easily configurable by tasking multiple CCD imaging modules to different GPU devices, allowing inspection of larger test objects. This flexibility can improve both acquisition and detection speeds to boost in-line inspection of circuit chips, packaging, and touchscreen panels.

6.
Recently, the application of graphics processing units (GPUs) to scientific computations has been attracting a great deal of attention, because GPUs are getting faster and more programmable. In particular, NVIDIA's compute unified device architecture (CUDA) enables highly multithreaded parallel computing on GPUs for non-graphics applications. This paper proposes a novel way to accelerate the boundary element method (BEM) for the three-dimensional Helmholtz equation using CUDA. Adopting techniques for data caching and double-single precision floating-point arithmetic, we implemented a GPU-accelerated BEM program for GeForce 8-series GPUs. The program performed 6-23 times faster than a normal BEM program optimized for an Intel quad-core CPU on a series of boundary value problems with 8000-128000 unknowns, and it sustained a performance of 167 Gflop/s on the largest problem (1 058 000 unknowns). The accuracy of our BEM program was almost the same as that of the regular BEM program using double precision floating-point arithmetic, and our BEM was applicable to realistic problems. In conclusion, the present GPU-accelerated BEM solves large-scale boundary value problems for the Helmholtz equation rapidly and precisely. Copyright © 2009 John Wiley & Sons, Ltd.
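The double-single technique stores one high-precision value as an unevaluated sum of two floats, recovering roughly 44-48 significand bits on hardware whose fast path is single precision. A sketch in the style of the well-known DSFUN90/df64 operators (illustrative, not the paper's code):

```cuda
// Sketch of double-single ("df64") arithmetic: each value is an unevaluated
// sum hi + lo of two floats.
#include <cuda_runtime.h>
#include <cstdio>

__device__ float2 ds_from(double d) {            // split a double (test harness)
    float hi = (float)d;
    return make_float2(hi, (float)(d - (double)hi));
}
__device__ float2 ds_add(float2 a, float2 b) {   // Knuth two-sum based addition
    float s = a.x + b.x;
    float v = s - a.x;
    float e = (a.x - (s - v)) + (b.x - v);       // exact rounding error of s
    e += a.y + b.y;
    float hi = s + e;
    return make_float2(hi, e - (hi - s));
}
__device__ float2 ds_mul(float2 a, float2 b) {
    float p = a.x * b.x;
    float e = __fmaf_rn(a.x, b.x, -p);           // exact product error via FMA
    e += a.x * b.y + a.y * b.x;
    float hi = p + e;
    return make_float2(hi, e - (hi - p));
}

__global__ void sum_series(float2* out, int n) { // harmonic sum in df64
    float2 acc = make_float2(0.f, 0.f);
    for (int k = 1; k <= n; ++k)
        acc = ds_add(acc, ds_from(1.0 / k));
    *out = acc;
}

int main() {
    float2* r; cudaMallocManaged(&r, sizeof(float2));
    sum_series<<<1, 1>>>(r, 100000);
    cudaDeviceSynchronize();
    printf("df64 sum = %.10f (H_100000 ~ 12.09014613)\n",
           (double)r->x + (double)r->y);
    return 0;
}
```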

7.
Quantitative sodium magnetic resonance imaging permits noninvasive measurement of the tissue sodium concentration (TSC) bioscale in the brain. Computing the TSC bioscale requires reconstructing and combining multiple datasets acquired with a non-Cartesian acquisition that highly oversamples the center of k-space. Even with an optimized implementation of the algorithm to compute TSC, the overall processing time exceeds the time required to collect data from the human subject; such a mismatch challenges sustained sodium imaging with a growing data backlog and delayed results. The most computationally intensive portions of the TSC calculation have been identified and accelerated using a consumer graphics processing unit (GPU) in addition to a conventional central processing unit (CPU). A recently developed data organization technique called Compact Binning was used along with several existing algorithmic techniques to maximize the scalability and performance of these computationally intensive operations. The resulting GPU+CPU TSC bioscale calculation is more than 15 times faster than a CPU-only implementation when processing 256 × 256 × 256 data and 2.4 times faster when processing 128 × 128 × 128 data, eliminating the possibility of a data backlog for quantitative sodium imaging. The accelerated quantification technique is suitable for general three-dimensional non-Cartesian acquisitions and may enable more sophisticated imaging techniques that acquire even more data to be used for quantitative sodium imaging. © 2013 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 23, 29–35, 2013.

8.
Tsunami simulation combines fluid dynamics, numerical computation, and visualization techniques. Nonlinear shallow water equations are often used to model tsunami propagation, and tsunami inundation is modeled by adding the friction slope to the conservation of momentum. The two-step, second-order finite-difference MacCormack method can solve these equations; it is well suited to nonlinear equations, simple to implement, and amenable to parallel computation. Programmable graphics hardware allows general-purpose computing on graphics processing units (GPUs) to solve the MacCormack method in parallel and speed up the simulation. Tsunami simulation data can be loaded as texture data in graphics memory, and the computation can be written as shader programs in the OpenGL Shading Language so that the operations are computed by graphics processors in parallel. We developed two versions of the tsunami simulation and visualization programs implementing the MacCormack method: (i) CPU-based and (ii) CPU-GPU collaborative. The performance results showed that the graphics-hardware-accelerated simulation significantly improves the execution time of each computation step, and real-time simulation and visualization are made possible by the programmable shaders. Furthermore, we achieved high-performance parallel visualization on a tiled display wall driven by a cluster of computers.
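The MacCormack scheme alternates a forward-difference predictor with a backward-difference corrector, and each sweep is independent per grid point. A minimal CUDA sketch for the 1-D nonlinear shallow water equations (the paper uses GLSL shaders on 2-D grids; the friction term and proper boundary treatment are omitted here):

```cuda
// Sketch: one MacCormack predictor-corrector step for 1-D shallow water,
// U = (h, hu), with fixed-value boundaries for brevity. lam = dt/dx.
#include <cuda_runtime.h>
#include <cstdio>

#define G 9.81f

__device__ float2 flux(float2 U) {               // F(U) = (hu, hu^2 + g h^2/2)
    float u = U.y / U.x;
    return make_float2(U.y, U.y * u + 0.5f * G * U.x * U.x);
}
__global__ void predictor(const float2* U, float2* Up, int n, float lam) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == n - 1) { Up[i] = U[i]; return; }    // forward difference
    float2 f1 = flux(U[i + 1]), f0 = flux(U[i]);
    Up[i] = make_float2(U[i].x - lam * (f1.x - f0.x),
                        U[i].y - lam * (f1.y - f0.y));
}
__global__ void corrector(const float2* U, const float2* Up, float2* Un,
                          int n, float lam) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == 0) { Un[i] = U[i]; return; }        // backward difference
    float2 f1 = flux(Up[i]), f0 = flux(Up[i - 1]);
    Un[i] = make_float2(0.5f * (U[i].x + Up[i].x - lam * (f1.x - f0.x)),
                        0.5f * (U[i].y + Up[i].y - lam * (f1.y - f0.y)));
}

int main() {
    const int n = 1 << 14; const float lam = 0.001f;  // dt/dx, CFL-safe here
    float2 *U, *Up, *Un;
    cudaMallocManaged(&U, n * sizeof(float2));
    cudaMallocManaged(&Up, n * sizeof(float2));
    cudaMallocManaged(&Un, n * sizeof(float2));
    for (int i = 0; i < n; ++i)                  // dam-break initial condition
        U[i] = make_float2(i < n / 2 ? 2.f : 1.f, 0.f);
    for (int t = 0; t < 1000; ++t) {
        predictor<<<(n + 255) / 256, 256>>>(U, Up, n, lam);
        corrector<<<(n + 255) / 256, 256>>>(U, Up, Un, n, lam);
        float2* tmp = U; U = Un; Un = tmp;       // ping-pong buffers
    }
    cudaDeviceSynchronize();
    printf("h at midpoint: %f\n", U[n / 2].x);
    return 0;
}
```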

9.
Although they show potential to improve ultrasound image quality, plane-wave (PW) compounding and synthetic aperture (SA) imaging are computationally demanding and known to be challenging to implement in real time. In this work, we have developed a novel beamformer architecture with the real-time parallel processing capacity needed to enable fast realization of PW compounding and SA imaging. The beamformer hardware comprises an array of graphics processing units (GPUs) hosted within the same computer workstation. Their parallel computational resources are controlled by a pixel-based software processor that includes the operations of analytic signal conversion, delay-and-sum beamforming, and recursive compounding required to generate images from channel-domain data samples acquired under PW compounding and SA imaging principles. When using two GTX 480 GPUs for beamforming and one GTX 470 GPU for recursive compounding, the beamformer can compute compounded 512 × 255-pixel PW and SA images at throughputs of over 4700 fps and 3000 fps, respectively, for imaging depths of 5 cm and 15 cm (32 receive channels, 40 MHz sampling rate). Its processing capacity can be further increased if additional or more advanced GPUs are used.
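Pixel-based delay-and-sum parallelizes over output pixels: each thread computes its own transmit and receive delays and sums across channels. A sketch for a single 0-degree plane-wave transmission with nearest-neighbour sampling (illustrative; real beamformers interpolate and apodize):

```cuda
// Sketch: pixel-based delay-and-sum for one plane-wave transmission.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void das(const float* rf, int nSamp, int nCh, float pitch,
                    float fs, float c, float* img, int nx, int nz,
                    float dx, float dz) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int pz = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= nx || pz >= nz) return;
    float x = (px - nx / 2) * dx, z = pz * dz;
    float sum = 0.f;
    for (int ch = 0; ch < nCh; ++ch) {
        float xe = (ch - nCh / 2) * pitch;       // element position
        float rx = sqrtf((x - xe) * (x - xe) + z * z);
        float t = (z + rx) / c;                  // tx (plane wave) + rx delay
        int s = (int)(t * fs);
        if (s >= 0 && s < nSamp) sum += rf[ch * nSamp + s];
    }
    img[pz * nx + px] = sum;
}

int main() {
    const int nSamp = 4096, nCh = 32, nx = 255, nz = 512;
    float *rf, *img;
    cudaMallocManaged(&rf, (size_t)nCh * nSamp * sizeof(float));
    cudaMallocManaged(&img, (size_t)nx * nz * sizeof(float));
    for (int i = 0; i < nCh * nSamp; ++i) rf[i] = 0.f;
    for (int ch = 0; ch < nCh; ++ch) rf[ch * nSamp + 2000] = 1.f; // fake echo
    dim3 blk(16, 16), grd((nx + 15) / 16, (nz + 15) / 16);
    das<<<grd, blk>>>(rf, nSamp, nCh, 0.3e-3f, 40e6f, 1540.f,
                      img, nx, nz, 0.1e-3f, 0.05e-3f);
    cudaDeviceSynchronize();
    printf("img center: %f\n", img[(nz / 2) * nx + nx / 2]);
    return 0;
}
```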

10.
This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphics processing units (GPUs) (Nvidia Corporation, Santa Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance, with computational cost of O(N). The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense, precomputed matrix. This is the case, for example, in the black-box fast multipole method (bbFMM), a variant of the FMM that can handle a large class of kernels and which is used in our benchmarks. In the FMM, two operators represent most of the computational cost, and an optimal implementation typically tries to balance them: one is the nearby-interaction calculation (direct sum calculation, line 29 in Listing 1), and the other is the M2L operation. We focus on the M2L. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, these approaches reduce the movement of data in the GPU. Because memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree, and are tested and compared with an optimized, OpenMP-parallelized, multi-core CPU code. We consider high- and low-precision calculations by varying the number of Chebyshev nodes used in the bbFMM. The accuracy of the GPU codes is found to be satisfactory, and they achieved over 200 Gflop/s on one NVIDIA Tesla C1060 GPU. This was compared against two quad-core Intel Xeon E5345 processors (Intel Corporation, Santa Clara, CA, USA) running at 2.33 GHz, with a combined peak performance of 149 Gflop/s in single precision. For the low-accuracy FMM case, the observed performance of the CPU code was 37 Gflop/s, whereas for the high-accuracy case it was about 8.5 Gflop/s, most likely because of a higher frequency of cache misses. We also present benchmarks on an NVIDIA C2050 (Fermi) GPU in single and double precision. Copyright © 2011 John Wiley & Sons, Ltd.
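Because the M2L matrix is identical for every cell pair with the same relative offset, one thread block can cache it in shared memory once and stream many multipole vectors through it, cutting data movement as described. A hedged sketch, assuming n = 4 Chebyshev nodes (vector length K = n³ = 64) and pairs pre-grouped by offset; this is not one of the paper's four schemes:

```cuda
// Sketch: batched application of one precomputed dense M2L matrix (shared by
// all cell pairs with the same relative offset).
#include <cuda_runtime.h>
#include <cstdio>

constexpr int K = 64;                            // n^3 with n = 4 Chebyshev nodes

__global__ void m2l_batched(const float* M2L,    // K*K, one relative offset
                            const float* mult,   // nPairs * K source multipoles
                            float* local,        // nPairs * K target locals
                            int nPairs) {
    __shared__ float Ms[K][K];                   // 16 KB: cache the operator once
    for (int idx = threadIdx.x; idx < K * K; idx += blockDim.x)
        Ms[idx / K][idx % K] = M2L[idx];
    __syncthreads();
    // each block handles a strided range of pairs; each thread owns rows
    for (int p = blockIdx.x; p < nPairs; p += gridDim.x) {
        for (int r = threadIdx.x; r < K; r += blockDim.x) {
            float acc = 0.f;
            for (int c = 0; c < K; ++c)
                acc += Ms[r][c] * mult[p * K + c];
            atomicAdd(&local[p * K + r], acc);   // locals accumulate over offsets
        }
    }
}

int main() {
    const int nPairs = 4096;
    float *M2L, *mult, *local;
    cudaMallocManaged(&M2L, K * K * sizeof(float));
    cudaMallocManaged(&mult, (size_t)nPairs * K * sizeof(float));
    cudaMallocManaged(&local, (size_t)nPairs * K * sizeof(float));
    for (int i = 0; i < K * K; ++i) M2L[i] = (i / K == i % K) ? 1.f : 0.f;
    for (int i = 0; i < nPairs * K; ++i) { mult[i] = 2.f; local[i] = 0.f; }
    m2l_batched<<<128, 64>>>(M2L, mult, local, nPairs);
    cudaDeviceSynchronize();
    printf("local[0] = %f (expect 2 for identity M2L)\n", local[0]);
    return 0;
}
```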

11.
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are created and analyzed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choices of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor using single-precision arithmetic achieves speedups of 30 or more in comparison to a well-optimized double-precision single-core implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite element discretization. Copyright © 2010 John Wiley & Sons, Ltd.

12.
Network alignment is an important bridge to understanding human protein–protein interactions (PPIs) and functions through model organisms. However, the underlying subgraph isomorphism problem complicates the alignment of protein interaction networks (PINs) and increases the time it requires. Parallel computing technology is an effective solution to the challenge of aligning large-scale networks, which sequential computing struggles with. In this study, the typical Hungarian-Greedy Algorithm (HGA) is used as an example for PIN alignment. The authors propose an HGA with 2-nearest neighbours (HGA-2N) and implement its graphics processing unit (GPU) acceleration. Numerical experiments demonstrate that HGA-2N can find alignments close to those found by HGA while dramatically reducing computing time. The GPU implementation of HGA-2N optimises the parallel pattern, computing mode and storage mode, and improves the CPU-to-GPU computing-time ratio relative to HGA on large-scale networks. By using HGA-2N on GPUs, conserved PPIs can be observed and potential PPIs can be predicted; among the predictions based on 25 common Gene Ontology terms, 42.8% can be found in the Human Protein Reference Database. Furthermore, a new method of reconstructing phylogenetic trees is introduced, which shows the same relationships among five herpes viruses that are obtained using other methods.

13.
Parallelization of the finite-element method (FEM) has been contemplated by the scientific and high-performance computing community for over a decade. Most of the computations in the FEM are related to linear algebra, including matrix and vector operations. These operations follow the single-instruction multiple-data (SIMD) computation pattern, which is beneficial for shared-memory parallel architectures, and general-purpose graphics processing units (GPGPUs) have been effectively utilized to parallelize FEM computations since 2007. The solver step of the FEM is often carried out using conjugate gradient (CG)-type iterative methods because of their higher convergence rates and greater opportunities for parallelization. Although the SIMD computation patterns in the FEM are intrinsic to GPU computing, there are pitfalls, such as underutilization of threads, uncoalesced memory access, low arithmetic intensity, the limited fast on-chip memory of GPUs, and synchronization overheads. Nevertheless, FEM applications have been successfully deployed on GPUs over the last 10 years and achieve significant performance improvements. This paper presents a comprehensive review of the parallel optimization strategies applied in each step of the FEM, discusses the pitfalls and trade-offs linked to each step, and covers some extraordinary methods that exploit the tremendous computing power of a GPU. The review is not limited to a single field of engineering; rather, it is applicable to all fields of engineering and science in which FEM-based simulations are necessary.

14.
A collocation boundary element code for solving the three-dimensional Laplace equation, publicly available from http://intetec.org, has been adapted to run on an Nvidia Tesla general-purpose graphics processing unit (GPU). Global matrix assembly and LU factorization of the resulting dense matrix are performed on the GPU. Out-of-core techniques are used to solve problems larger than the available GPU memory. The code achieved about a 10× speedup in matrix assembly over a single CPU core and about 56 Gflop/s in the LU factorization using only 512 MB of GPU memory. Details of the GPU implementation and comparisons with the standard sequential algorithm are included to illustrate the performance of the GPU code.

15.
To achieve the wavefront phase-recovery stage of an adaptive-optics loop computed in real time for 32 × 32 or more subpupils in a Shack-Hartmann sensor, we present here, for what is to our knowledge the first time, preliminary results obtained by using two innovative techniques: graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). We describe the stream-computing paradigm of the GPU and adapt a zonal algorithm to take advantage of the GPU's parallel computational power. We also present preliminary results obtained with FPGAs on the same algorithm. GPUs have proved to be a promising technique, but FPGAs are already a feasible solution to adaptive-optics real-time requirements, even for a large number of subpupils.

16.
Calculating a computer-generated hologram (CGH) takes an enormous amount of time. A fast calculation method has been proposed in which the light waves of an arbitrary object are obtained by transform calculations of precalculated object light; however, this method requires a huge amount of memory. This paper proposes a method that uses a cylindrical basic object light to reduce the memory requirement, further accelerated by a graphics processing unit (GPU). Experimental results show that the calculation on a GPU is about 65 times faster than on a CPU.
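For scale, the baseline such methods accelerate is the brute-force kernel below: one thread per hologram pixel accumulating a spherical-wave contribution from every object point. The paper's method replaces this per-point work with transforms of a precalculated cylindrical object light; this sketch only illustrates the baseline cost (all parameter values illustrative):

```cuda
// Sketch: brute-force CGH, one thread per hologram pixel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void cgh(const float3* pts, int nPts, float* holo,
                    int nx, int ny, float pitch, float k) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;
    float x = (ix - nx / 2) * pitch, y = (iy - ny / 2) * pitch;
    float acc = 0.f;
    for (int j = 0; j < nPts; ++j) {             // spherical wave per point
        float dx = x - pts[j].x, dy = y - pts[j].y, z = pts[j].z;
        float r = sqrtf(dx * dx + dy * dy + z * z);
        acc += cosf(k * r);                      // real part of exp(ikr)
    }
    holo[iy * nx + ix] = acc;
}

int main() {
    const int nx = 1024, ny = 1024, nPts = 1024;
    const float pitch = 8e-6f, lambda = 633e-9f; // 8 um pixels, HeNe laser
    float3* pts; float* holo;
    cudaMallocManaged(&pts, nPts * sizeof(float3));
    cudaMallocManaged(&holo, (size_t)nx * ny * sizeof(float));
    for (int j = 0; j < nPts; ++j)               // synthetic point cloud
        pts[j] = make_float3((j % 32) * 1e-4f, (j / 32) * 1e-4f, 0.1f);
    dim3 blk(16, 16), grd((nx + 15) / 16, (ny + 15) / 16);
    cgh<<<grd, blk>>>(pts, nPts, holo, nx, ny, pitch,
                      2.f * 3.14159265f / lambda);
    cudaDeviceSynchronize();
    printf("holo[0] = %f\n", holo[0]);
    return 0;
}
```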

17.
Q. Wu, F. Wang & Y. Xiong, Engineering Optimization, 2016, 48(10): 1679–1692
To reduce computational time, a fully parallel implementation of the particle swarm optimization (PSO) algorithm on a graphics processing unit (GPU) is presented. Instead of being executed sequentially on the central processing unit (CPU), PSO is executed in parallel on the GPU via the compute unified device architecture (CUDA) platform. The fitness evaluation and the velocity and position updates of all particles are parallelized and described in detail. Comparative studies on the optimization of four benchmark functions and a trajectory optimization problem are conducted by running PSO on the GPU (GPU-PSO) and on the CPU (CPU-PSO). The impact of the design dimension, the number of particles, the thread-block size on the GPU, and their interactions on computational time is investigated. The results show that the computational time of the developed GPU-PSO is much shorter than that of CPU-PSO, with comparable accuracy, demonstrating the remarkable speed-up capability of GPU-PSO.
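In GPU-PSO, each particle's fitness evaluation and velocity/position update is independent, so one thread per particle suffices. A hedged sketch using the common constriction coefficients and a host-side global-best reduction (the paper's exact kernel organization may differ):

```cuda
// Sketch: the parallel core of GPU-PSO with a sphere-function fitness.
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>
#include <cfloat>

constexpr int DIM = 8, NP = 1024;
constexpr float W = 0.729f, C1 = 1.494f, C2 = 1.494f;  // common PSO constants

__global__ void init_rng(curandState* st, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NP) curand_init(seed, i, 0, &st[i]);
}
__device__ float sphere(const float* x) {        // fitness: sum of squares
    float f = 0.f;
    for (int d = 0; d < DIM; ++d) f += x[d] * x[d];
    return f;
}
__global__ void pso_step(float* x, float* v, float* pb, float* pbVal,
                         const float* gb, curandState* st) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= NP) return;
    curandState s = st[i];
    for (int d = 0; d < DIM; ++d) {
        float r1 = curand_uniform(&s), r2 = curand_uniform(&s);
        v[i * DIM + d] = W * v[i * DIM + d]
                       + C1 * r1 * (pb[i * DIM + d] - x[i * DIM + d])
                       + C2 * r2 * (gb[d] - x[i * DIM + d]);
        x[i * DIM + d] += v[i * DIM + d];
    }
    float f = sphere(&x[i * DIM]);
    if (f < pbVal[i]) {                          // update the personal best
        pbVal[i] = f;
        for (int d = 0; d < DIM; ++d) pb[i * DIM + d] = x[i * DIM + d];
    }
    st[i] = s;
}

int main() {
    float *x, *v, *pb, *pbVal, *gb; curandState* st;
    cudaMallocManaged(&x, NP * DIM * sizeof(float));
    cudaMallocManaged(&v, NP * DIM * sizeof(float));
    cudaMallocManaged(&pb, NP * DIM * sizeof(float));
    cudaMallocManaged(&pbVal, NP * sizeof(float));
    cudaMallocManaged(&gb, DIM * sizeof(float));
    cudaMalloc(&st, NP * sizeof(curandState));
    for (int i = 0; i < NP * DIM; ++i) { x[i] = pb[i] = 5.f; v[i] = 0.f; }
    for (int i = 0; i < NP; ++i) pbVal[i] = FLT_MAX;
    for (int d = 0; d < DIM; ++d) gb[d] = 5.f;
    init_rng<<<NP / 256, 256>>>(st, 1234ULL);
    int best = 0;
    for (int it = 0; it < 100; ++it) {
        pso_step<<<NP / 256, 256>>>(x, v, pb, pbVal, gb, st);
        cudaDeviceSynchronize();                 // host-side global-best reduce
        for (int i = 1; i < NP; ++i) if (pbVal[i] < pbVal[best]) best = i;
        for (int d = 0; d < DIM; ++d) gb[d] = pb[best * DIM + d];
    }
    printf("best fitness after 100 iterations: %e\n", pbVal[best]);
    return 0;
}
```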

18.
Because the spatial resolution of diffusion-weighted magnetic resonance imaging (DWI) is subject to scanning time and other constraints, it is relatively limited. In view of this, a new non-local DWI super-resolution method with joint information was proposed to improve the spatial resolution. Based on the non-local strategy, we use the joint information of adjacent scan directions to implement a new weighting scheme. Quantitative and qualitative comparisons on synthesized and real DWI datasets show that this method can significantly improve the resolution of DWI. However, the algorithm runs slowly because of the joint information. To make the algorithm applicable in practice, we compare it on the CPU and GPU respectively: the processing time on the GPU is much less than on the CPU, with a best-case speedup of more than 26 times over the traditional algorithm. This raises the possibility of applying such reconstruction algorithms in actual workplaces.

19.
Recently, graphics processing units (GPUs) have been increasingly leveraged in a variety of scientific computing applications. However, architectural differences between CPUs and GPUs necessitate the development of algorithms that take advantage of GPU hardware. As sparse matrix vector (SPMV) multiplication operations are commonly used in finite element analysis, a new SPMV algorithm and several variations are developed for unstructured finite element meshes on GPUs. The effective bandwidth of current GPU algorithms and the newly proposed algorithms are measured and analyzed for 15 sparse matrices of varying sizes and varying sparsity structures. The effects of optimization and differences between the new GPU algorithm and its variants are then subsequently studied. Lastly, both new and current SPMV GPU algorithms are utilized in the GPU CG solver in GPU finite element simulations of the heart. These results are then compared against parallel PETSc finite element implementation results. The effective bandwidth tests indicate that the new algorithms compare very favorably with current algorithms for a wide variety of sparse matrices and can yield very notable benefits. GPU finite element simulation results demonstrate the benefit of using GPUs for finite element analysis and also show that the proposed algorithms can yield speedup factors up to 12-fold for real finite element applications. Copyright © 2015 John Wiley & Sons, Ltd.
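A common starting point for such work is the warp-per-row ("vector") CSR kernel, which keeps memory access coalesced within a warp even when row lengths are irregular. A minimal sketch (not the paper's proposed algorithm):

```cuda
// Sketch: warp-per-row CSR sparse matrix-vector multiply.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void spmv_csr_vector(const int* rowPtr, const int* colIdx,
                                const float* val, const float* x,
                                float* y, int nRows) {
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (row >= nRows) return;
    float sum = 0.f;
    for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += 32)
        sum += val[j] * x[colIdx[j]];            // coalesced within the warp
    for (int off = 16; off > 0; off >>= 1)       // warp-level reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) y[row] = sum;
}

int main() {
    // tiny 3x3 example: [[4,1,0],[1,4,1],[0,1,4]] in CSR
    int hRow[] = {0, 2, 5, 7}, hCol[] = {0, 1, 0, 1, 2, 1, 2};
    float hVal[] = {4, 1, 1, 4, 1, 1, 4}, hX[] = {1, 1, 1};
    int *row, *col; float *val, *x, *y;
    cudaMallocManaged(&row, sizeof(hRow)); cudaMallocManaged(&col, sizeof(hCol));
    cudaMallocManaged(&val, sizeof(hVal)); cudaMallocManaged(&x, sizeof(hX));
    cudaMallocManaged(&y, 3 * sizeof(float));
    memcpy(row, hRow, sizeof(hRow)); memcpy(col, hCol, sizeof(hCol));
    memcpy(val, hVal, sizeof(hVal)); memcpy(x, hX, sizeof(hX));
    spmv_csr_vector<<<1, 96>>>(row, col, val, x, y, 3);  // 3 warps, 3 rows
    cudaDeviceSynchronize();
    printf("y = [%g %g %g]  (expect [5 6 5])\n", y[0], y[1], y[2]);
    return 0;
}
```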

20.
The proposed work aims to accelerate magnetic resonance imaging (MRI) brain tissue segmentation using knowledge-based partial supervision fuzzy c-means (KPSFCM) with a graphics processing unit (GPU). The proposed KPSFCM contains three steps: knowledge-based initialization, modification, and optimization. The knowledge-based initialization step extracts initial centers from the input MR images using Gaussian-based histogram smoothing. The modification step changes the membership function of PSFCM, guided by the labeled patterns of the cerebrospinal fluid portion. Finally, the optimization step is achieved through size-based optimization (SBO), adjacency-based optimization (ABO), and parallelism-based optimization (PBO). SBO and ABO are algorithm-level optimization techniques on the central processing unit (CPU), whereas PBO is a hardware-level optimization implemented on the GPU using the compute unified device architecture (CUDA). The performance of the KPSFCM is tested with online and clinical datasets. The proposed KPSFCM gives better segmentation accuracy than 14 state-of-the-art methods but is computationally expensive; including the optimization techniques SBO and ABO reduces the execution time 13-fold on the CPU, and adding PBO yields a further 19-fold speedup over the optimized CPU implementation.
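The PBO step parallelizes the per-voxel work of fuzzy c-means. A sketch of the membership update for fuzziness m = 2 with one thread per voxel, assuming given cluster centers (the knowledge-based initialization and the SBO/ABO steps are not shown):

```cuda
// Sketch: fuzzy c-means membership update, u_ck = 1 / sum_j (d_ck/d_jk)^(2/(m-1)).
// With m = 2 the exponent is 2, so ratios of squared distances suffice.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int C = 3;                             // e.g., GM, WM, CSF

__global__ void fcm_membership(const float* intensity, const float* centers,
                               float* u, int nVox) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nVox) return;
    float d[C];
    for (int c = 0; c < C; ++c) {
        float diff = intensity[k] - centers[c];
        d[c] = fmaxf(diff * diff, 1e-12f);       // squared distance, guarded
    }
    for (int c = 0; c < C; ++c) {
        float s = 0.f;
        for (int j = 0; j < C; ++j)
            s += d[c] / d[j];                    // (dist_c/dist_j)^2 with m = 2
        u[c * nVox + k] = 1.f / s;
    }
}

int main() {
    const int nVox = 1 << 20;
    float *img, *u, *centers;
    cudaMallocManaged(&img, nVox * sizeof(float));
    cudaMallocManaged(&u, (size_t)C * nVox * sizeof(float));
    cudaMallocManaged(&centers, C * sizeof(float));
    centers[0] = 0.2f; centers[1] = 0.5f; centers[2] = 0.8f;
    for (int k = 0; k < nVox; ++k) img[k] = 0.45f;
    fcm_membership<<<(nVox + 255) / 256, 256>>>(img, centers, u, nVox);
    cudaDeviceSynchronize();
    printf("u = [%f %f %f]\n", u[0], u[nVox], u[2 * nVox]);  // sums to 1
    return 0;
}
```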
