Similar literature
1.
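Tridiagonalization of symmetric matrices is a key computational step in solving dense eigenvalue problems. For GPU clusters, a two-level parallel approach combining MPI (message passing interface) and GPU parallelism is used to design and implement a dense symmetric matrix tridiagonalization algorithm based on MPI and CUDA (compute unified device architecture). At the MPI cluster level, the global data communication between the row and column communicators of the 2D communication domain is redesigned as fully parallel point-to-point communication, which improves the communication performance of the MPI parallel tridiagonalization algorithm. By modifying the original MPI parallel tridiagonalization algorithm, the irregular matrix-vector operations used at the GPU level are avoided, which roughly doubles the parallel performance of this part. In addition, fine-grained computations on the GPU are merged into coarser-grained ones; this strategy raises the computational intensity so as to exploit the GPU's computing power more fully and increase GPU utilization, thereby improving the performance of the algorithm. Furthermore, multiple CUDA streams allow independent CUDA operations in the algorithm to execute concurrently in different streams. Asynchronous data transfers between the CPU and GPU let data transfers and kernels in different streams run simultaneously, hiding the data transfer time and further improving performance. On the Chinese Academy of Sciences supercomputer system "元" (Yuan), the performance of the MPI+CUDA parallel blocked tridiagonalization algorithm was tested on Nvidia Tesla K20 GPGPUs for matrices of different sizes, achieving good speedup and performance as well as good scalability.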
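The benefit of hiding transfer time behind kernel execution can be estimated with a simple timing model. The sketch below is not taken from the paper: the function names, the chunked workload, and the assumption of exactly one copy engine and one compute engine are all illustrative.

```python
def serial_time(n_chunks, t_copy, t_kernel):
    """Total time when each copy and each kernel run strictly one after
    another (single stream, no overlap)."""
    return n_chunks * (t_copy + t_kernel)


def overlapped_time(n_chunks, t_copy, t_kernel):
    """Idealized two-stage pipeline (assumption: one copy engine and one
    compute engine): the copy of chunk i+1 runs while the kernel of
    chunk i executes."""
    if n_chunks == 0:
        return 0.0
    return t_copy + (n_chunks - 1) * max(t_copy, t_kernel) + t_kernel


# Hypothetical workload: 8 chunks, 2 ms per copy, 3 ms per kernel.
t_serial = serial_time(8, 2.0, 3.0)       # 8 * (2 + 3) = 40 ms
t_overlap = overlapped_time(8, 2.0, 3.0)  # 2 + 7 * 3 + 3 = 26 ms
```

In this model only the first copy remains exposed once the kernel time per chunk exceeds the copy time, which is the sense in which the transfer time is hidden.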

2.
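GPU-accelerated algorithms for extracting the Normalized Differential Vegetation Index (NDVI) usually adopt a GPU multi-thread parallel model, in which data transfers between weakly dependent computations and between the CPU and GPU take considerable time and limit further speedup. To address these problems, and based on the characteristics of NDVI extraction, this paper proposes an NDVI extraction algorithm built on a GPU multi-stream concurrent parallel model. Using CUDA streams and the Hyper-Q feature, the multi-stream model overlaps data transfers with weakly dependent computations, and weakly dependent computations with one another, further increasing the algorithm's parallelism and GPU resource utilization. The paper first optimizes the NDVI extraction algorithm with the GPU multi-thread parallel model and decomposes the optimized computation to identify the parts containing data transfers and weakly dependent computations. Next, these parts are restructured and optimized with the GPU multi-stream concurrent parallel model so that weakly dependent computations overlap with one another and with data transfers. Finally, the two GPU-based NDVI extraction algorithms are validated experimentally on remote sensing images taken by the Gaofen-1 satellite. The results show that, compared with the conventional NDVI extraction algorithm based on the GPU multi-thread parallel model, the proposed algorithm achieves an average speedup of about 1.5x for images larger than 12000*12000 pixels and a speedup of about 260x over the serial extraction algorithm, with better acceleration and parallelism.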
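For reference, NDVI is computed per pixel from the near-infrared and red bands as (NIR - Red) / (NIR + Red). The sketch below is a plain serial version, not the paper's GPU code; the function names and the choice to return 0.0 where both bands are zero are assumptions made for illustration.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red).

    Returns 0.0 when both bands are zero (an illustrative convention,
    chosen here only to avoid division by zero).
    """
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def ndvi_image(nir_band, red_band):
    """Apply ndvi() to two equally sized 2D bands given as lists of rows."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Tiny hypothetical 1x2 image: one vegetated pixel, one neutral pixel.
result = ndvi_image([[0.5, 0.2]], [[0.1, 0.2]])
```

Each output pixel depends only on the two input pixels at the same position, which is why the computation maps naturally onto one GPU thread per pixel.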

3.
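Application of graphics processing units in general-purpose computing   Total citations: 1 (self-citations: 1, citations by others: 0)
Based on the compute unified device architecture (CUDA) for graphics processing units (GPUs), the principles and methods of using GPUs for general-purpose computing are described. A matrix multiplication experiment was carried out on a GeForce 8800 GT. The results show that, as the matrix order increases, processing slows down on both the GPU and the CPU. When the amount of data increased 100-fold, the computation time on the GPU increased only 3.95-fold, whereas that on the CPU increased 216.66-fold.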
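The two growth factors reported above imply how much the GPU's advantage widens between the small and the large problem. The short calculation below uses only the figures quoted in the abstract; the function name is illustrative, and the absolute speedup at either size cannot be derived from these numbers alone.

```python
def speedup_growth(cpu_time_growth, gpu_time_growth):
    """Factor by which the CPU/GPU time ratio grows between two problem sizes."""
    return cpu_time_growth / gpu_time_growth


# Figures from the abstract: for 100x more data, CPU time grows 216.66x
# and GPU time grows 3.95x.
growth = speedup_growth(216.66, 3.95)  # about 54.85
```

In other words, whatever the GPU-over-CPU speedup was at the smaller size, it would be roughly 55 times larger at the larger size.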

4.
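Zhang Yu, Zhang Yansong, Chen Hong, Wang Shan. Journal of Software (软件学报), 2016, 27(5): 1246-1265
General-purpose GPUs have become an emerging high-performance computing platform thanks to their powerful parallel computing capability, and in recent years they have become a research focus in academia for high-performance database implementation techniques. However, current research on GPU databases follows the ROLAP (relational OLAP) multidimensional analysis model and concentrates on algorithm implementations and performance optimizations of relational operators on the GPU, centered on GPU parallel algorithms for hash joins. A GPU has thousands of parallel compute units but few logic control units; compared with a CPU, it offers stronger parallel computing capability but weaker logic control and complex memory management. In-memory database query processing algorithms that need complex data structures and complex memory management mechanisms are therefore not suited to being ported directly to the GPU. This paper proposes semi-MOLAP, a hybrid OLAP multidimensional analysis model oriented to the vector computing characteristics of GPUs. It combines the direct array access and computation of the MOLAP (multidimensional OLAP) model with the storage efficiency of the ROLAP model and implements a GPU semi-MOLAP multidimensional analysis model based entirely on array structures, which simplifies GPU data management, reduces the complexity of the GPU semi-MOLAP algorithms, and improves their code execution efficiency. In addition, based on the characteristics of GPU and CPU computing, the semi-MOLAP operators are split into cooperative computation across the CPU and GPU platforms, improving CPU and GPU utilization and overall OLAP query performance.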

5.
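Fast image compression based on CUDA   Total citations: 1 (self-citations: 0, citations by others: 1)
To further improve JPEG encoding efficiency, the JPEG compression algorithm is studied, and the analysis shows that its core steps can be parallelized. The implementation platform should therefore be a GPU, whose strength is parallel computing, rather than a CPU, which is mainly serial. CUDA (compute unified device architecture), newly released by NVIDIA, provides the software and hardware environment for such an implementation. CUDA is a development platform for general-purpose computing on GPUs and is well suited to large-scale data-parallel computation. The encoding is parallelized with CUDA on the GPU stream processor architecture, and memory reads and writes, among other aspects, are optimized for that architecture, which increases JPEG encoding speed. Experimental results demonstrate the advantages of CUDA in parallel processing, with JPEG encoding efficiency greatly improved.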

6.
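A low-order, equal-order stabilized finite volume element method for incompressible flow problems is obtained through a bijective projection between the finite element space and the finite volume element space. The method solves the Navier-Stokes (N-S) equations numerically with low-order equal-order elements P1-P1 (or Q1-Q1) and is stabilized by a local pressure projection technique. The theoretical analysis of the finite volume element method relies on the equivalence between the finite element and finite volume element methods. It is found that, for the incompressible N-S problem with $f \in H^1$, there is a superconvergence approximation result of order $O(|\log h|^{1/2}/h^2)$ between the stabilized finite volume element method and the stabilized finite element method. Three two-level grid schemes of the stabilized finite volume algorithm are compared: when the ratio of coarse to fine mesh sizes is chosen appropriately, the two-level algorithms attain the same convergence rate as the conventional algorithm while having a clear efficiency advantage. The Simple scheme is the fastest, and the Picard scheme is better suited to problems with small viscosity coefficients.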

7.
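Arc consistency algorithms are among the basic algorithms for pruning the search space of constraint satisfaction problems, and many excellent advanced algorithms use a high-performance arc consistency algorithm as their core. In recent years, GPUs have been tried as a computing tool to accelerate parallel computation for many problems. Based on the GPU and basic parallel algorithms, a constraint network representation model suited to GPU computation, N-E, is proposed, together with its construction algorithm BuildNE. Combining the fine-grained arc consistency algorithm AC4 with the N-E model, a parallel version of AC4, AC4+GPU, and an improved algorithm, AC4+GPU+, are proposed, extending arc consistency to execution on GPUs. Experimental results confirm the feasibility of the algorithms: compared with AC4, they achieve speedups of 10% to 50% on some smaller problems and of 1 to 2 orders of magnitude on some larger problems. This provides a core algorithmic scheme for solving other constraint satisfaction problems in parallel on GPUs in the future.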
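For readers unfamiliar with AC4, the following is a minimal serial sketch of the support-counting idea it is built on. It is not the paper's N-E model or its GPU variants; the data layout (a dict of domains plus a dict of directed arcs with predicates) and all names are assumptions made for this example.

```python
from collections import defaultdict, deque


def ac4(domains, constraints):
    """AC-4 style arc consistency by support counting.

    domains: dict mapping variable -> set of values (pruned in place).
    constraints: dict mapping a directed arc (x, y) -> predicate(a, b)
                 that is True when x=a is compatible with y=b.
                 Both directions of every constraint must be present.
    Returns False if some domain becomes empty, True otherwise.
    """
    counter = {}                      # (x, a, y) -> supports of x=a in D(y)
    supported_by = defaultdict(list)  # (y, b) -> [(x, a), ...] that y=b supports
    queue = deque()                   # removed (variable, value) pairs

    # Initialization: count supports and record which value supports which.
    for (x, y), compatible in constraints.items():
        for a in list(domains[x]):
            total = 0
            for b in domains[y]:
                if compatible(a, b):
                    total += 1
                    supported_by[(y, b)].append((x, a))
            counter[(x, a, y)] = total
            if total == 0:
                domains[x].discard(a)
                queue.append((x, a))

    # Propagation: each removed value decrements the counters it fed.
    while queue:
        y, b = queue.popleft()
        for x, a in supported_by[(y, b)]:
            if a in domains[x]:
                counter[(x, a, y)] -= 1
                if counter[(x, a, y)] == 0:
                    domains[x].discard(a)
                    queue.append((x, a))

    return all(domains.values())


# Hypothetical toy problem: x < y < z over the values {1, 2, 3}.
less = lambda a, b: a < b
greater = lambda a, b: a > b
doms = {"x": {1, 2, 3}, "y": {1, 2, 3}, "z": {1, 2, 3}}
arcs = {("x", "y"): less, ("y", "x"): greater,
        ("y", "z"): less, ("z", "y"): greater}
consistent = ac4(doms, arcs)
```

On this toy problem the domains shrink to x = {1}, y = {2}, z = {3}. The per-value counters are what makes the algorithm fine-grained: every (variable, value) pair is tracked independently of the others.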

8.
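To improve the parallel efficiency of large-scale parallel computing and make full use of the respective strengths of CPUs and GPUs, in particular the computing power of GPUs, it is proposed to connect a group of GPUs through the message passing interface (MPI), combining general-purpose GPU computing with the LBM (lattice Boltzmann method) algorithm from computational fluid dynamics. Following the principles of general-purpose GPU computing and the LBM algorithm, MPI serves as the mechanism for distributing computation and CUDA (compute unified device architecture) as the main execution engine. A CUDA-capable GPU cluster is built, and two-dimensional square cavity flow is simulated numerically on the cluster with the D2Q9 model of the LBM algorithm. Experimental results show that the simulation on the GPU group agrees with the CPU simulation, makes fuller use of the GPUs' computing power, and improves parallel efficiency.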
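As background on the D2Q9 model, the sketch below evaluates the standard second-order equilibrium distribution at a single lattice node and recovers density and velocity from it. The abstract does not state which collision model or equilibrium form the authors used, so treating it as the usual BGK-style equilibrium in lattice units is an assumption; the names are illustrative.

```python
# D2Q9 lattice: nine discrete velocities and their standard weights.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for one node (lattice units)."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def moments(f):
    """Recover density and velocity from the nine populations."""
    rho = sum(f)
    ux = sum(fi * cx for fi, (cx, _) in zip(f, VELOCITIES)) / rho
    uy = sum(fi * cy for fi, (_, cy) in zip(f, VELOCITIES)) / rho
    return rho, ux, uy


# The equilibrium reproduces the density and velocity it was built from.
rho, ux, uy = moments(equilibrium(1.0, 0.1, 0.0))
```

Every node performs the same small, local calculation, which is the property that makes the method attractive for GPU execution.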

9.
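With the development of intelligent computing and big data applications, demand for accelerators such as GPUs keeps growing. Compute software stacks such as CUDA and OpenCL are key to fully exploiting GPU hardware performance. Considering the future portability and adaptability of compute software stacks on domestic (Chinese) hardware and software platforms, such as the Phytium (飞腾) CPU and the Kylin (麒麟) operating system, this work focuses on open-source OpenCL compute software stacks. It tests and analyzes the behavior of OpenCL applications on different platforms, evaluates GPU computing on different OpenCL software stacks (such as Mesa and ROCm), and evaluates the influence of the driver, the kernel, and other stack components on GPU computing. The tests cover the time overhead of each stage of OpenCL computation, including compilation, data transfer, and kernel execution. The evaluation finds that domestic platforms have a more pressing need for GPU-accelerated computing and are better suited to it; ROCm is a relatively ideal open-source OpenCL software stack, with good performance and stability, and it still has some room for optimization compared with closed-source stacks.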

10.
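Single-particle cryo-electron microscopy is one of the important tools of structural biology research. RELION (regularized likelihood optimization), a Bayesian 3D image data processing program for cryo-electron microscopy, offers good performance and ease of use and has attracted wide attention. However, its enormous computational demand limits its application. Based on the characteristics of the RELION algorithm, GPU-based parallel optimization is studied. First, the principles of RELION and the algorithmic structure and performance bottlenecks of the RELION program are analyzed comprehensively. On this basis, the program is optimized for the fine-grained GPU architecture, and a GPU-based multi-level parallel model is proposed. To obtain good performance, RELION's data structures are reorganized. To avoid running out of GPU memory, an adaptive parallel framework is designed. Experimental results show that the GPU-based RELION implementation achieves good performance: compared with a single CPU, the speedup of the whole application exceeds 36x, and the speedup of the compute-intensive algorithms exceeds 75x. Tests on multiple GPUs show that GPU-based RELION scales well.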

11.
Simulations of the interaction between a vortex and a NACA0012 airfoil are performed with a stable, high-order accurate (in space and time), multi-block finite difference solver for the compressible Navier-Stokes equations. We begin by computing a benchmark test case to validate the code. Next, the flow with steady inflow conditions is computed on several different grids. The resolution of the boundary layer as well as the amount of artificial dissipation is studied to establish the necessary resolution requirements. We propose an accuracy test based on the weak imposition of the boundary conditions that does not require grid refinement. Finally, we compute the vortex-airfoil interaction and calculate the lift and drag coefficients. It is shown that the viscous terms add the effect of detailed small-scale structures to the lift and drag coefficients.

12.
Averbuch A., Epstein B., Ioffe L., Yavneh I. The Journal of Supercomputing, 2000, 17(2): 123-142
We present an efficient parallelization strategy for speeding up the computation of a high-accuracy 3-dimensional serial Navier-Stokes solver that treats turbulent transonic high-Reynolds flows. The code solves the full compressible Navier-Stokes equations and is applicable to realistic large-size aerodynamic configurations, and as such requires huge computational resources in terms of computer memory and execution time. The solver can resolve the flow properly on relatively coarse grids. Because the serial code contains a complex infrastructure typical of industrial code (which ensures its flexibility and applicability to complex configurations), the parallelization task is not straightforward. We obtain a scalable implementation on massively parallel machines by keeping efficiency at a fixed value while simultaneously increasing the number of processors and the size of the problem. The 3-D Navier-Stokes solver was implemented on three MIMD message-passing multiprocessors (a 64-processor IBM SP2, a 20-processor MOSIX, and a 64-processor Origin 2000). The same code, written with the PVM and MPI software packages, was executed on all of these distinct computational platforms. The examples in the paper demonstrate that we can achieve an efficiency of about 60% for as many as 64 processors on the Origin 2000 on a full-size 3-D aerodynamic problem solved on realistic computational grids.

13.
An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced and its portability to graphics processing units (GPUs) is explored. The computations are performed on the NVIDIA GTX480 GPU. The results are compared with those obtained on a single core of an Intel Core i7-920 (2.67 GHz) in terms of calculation runtime. The BCR linear solver achieves a maximum speedup of 5.84x with a block size of 32 over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via an alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves a maximum total speedup of 7.45x for the piecewise quadratic solution over the CPU platform in double precision.
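The CPU baseline named above, the Thomas algorithm, is easy to state for the scalar case. The sketch below handles scalar entries rather than the dense blocks used in the paper and is not the authors' code; the argument layout and names are assumptions for illustration, and no pivoting is performed.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a scalar tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] (lower[0] is unused),
    diag[i]  multiplies x[i],
    upper[i] multiplies x[i+1] (upper[-1] is unused).
    No pivoting: assumes a well-conditioned system, for example a
    diagonally dominant matrix.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward elimination: remove the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution from the last unknown upward.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Hypothetical 3x3 system with diagonal 2 and off-diagonals -1, built so
# that the exact solution is [1, 2, 3].
solution = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                        [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

Both loops carry a dependency from one row to the next, so the algorithm is inherently sequential. Cyclic reduction trades extra arithmetic for independent eliminations, which suits GPUs better.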

14.
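Current high-performance computer architectures are increasingly diverse, which poses a major challenge for developing parallel application software. The domain-specific language OPS is used to parallelize the high-order-accuracy computational fluid dynamics software HNSC for multiple platforms: the code is refactored with the OPS API, and executables in pure MPI, OpenMP, MPI+OpenMP, and MPI+CUDA versions are generated automatically by the OPS front end and back ends. On a [...] equipped with two I...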

15.
Jun Cao. Computers & Fluids, 2005, 34(8): 991-1024
In this paper, we discuss how to improve the adaptive finite element simulation of compressible Navier-Stokes flow via a posteriori error estimate analysis. We use the moving space-time finite element method to globally discretize the time-dependent Navier-Stokes equations on a series of adapted meshes. The generalized compressible Stokes problem, which is the Stokes problem in its most generalized form, is presented and discussed. On the basis of the a posteriori error estimator for the generalized compressible Stokes problem, a numerical framework of a posteriori error estimation is established for the compressible Navier-Stokes equations. Guided by the a posteriori error estimation, a combination of different mesh adaptive schemes involving simultaneous refinement/unrefinement and point-moving is applied to control the finite element mesh quality. Finally, a series of numerical experiments is performed involving compressible Stokes and Navier-Stokes flows around different aerodynamic shapes to demonstrate the validity of our mesh adaptive algorithms.

16.
An Euler/Navier-Stokes zonal scheme is developed to numerically simulate the two-dimensional flow over a blunt leading-edge plate. The computational domain is divided into inner and outer regions where the Navier-Stokes and Euler equations are used, respectively. On the downstream boundary, compatibility conditions derived from the boundary-layer equations are used. The grid is generated by conformal mapping, and the problem is solved with a compressible Navier-Stokes code that has been modified to treat Euler and Navier-Stokes regions. The accuracy of the solution is determined by the reattachment location. Benchmark solutions have been obtained using the Navier-Stokes equations throughout the optimum computational domain and size. The problem is recalculated with successive decreases of the computational domain from the downstream side, where the compatibility conditions are used, and with successive decreases of the Navier-Stokes computational region. The results of the zonal scheme are in excellent agreement with those of the benchmark solutions and the experimental data. The CPU time saving is about 15%.

17.
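Application of real-time 3D cloud modeling and rendering in industrial simulation   Total citations: 2 (self-citations: 0, citations by others: 2)
For cloud modeling, a method combining physical simulation with artistic controllability is proposed: the Navier-Stokes fluid dynamics equations describe the gathering, dispersal, and motion of a single cloud, stacked boxes describe the cloud's initial outline, and particles are finally filled into the boxes according to fluid dynamics to generate a 3D cloud model. To meet real-time requirements, the Navier-Stokes equations are solved on programmable graphics hardware so that its parallel processing capability speeds up the solution. For real-time cloud rendering, a simple illumination model based on the direction of sunlight and the weather conditions is proposed, which greatly increases cloud rendering speed. In addition, an improved ring-shaped Impostor technique is proposed to speed up rendering of large-scale cloud layers, and shader programming resolves the problems that arise when the Impostor technique is applied to alpha-blended scenes. Based on the theoretical models described, a 3D cloud simulation system was developed with a 3D graphics API and has been widely used in industrial simulation and in science, technology, and entertainment exhibition projects, with good results. Cloud models generated by this method are highly realistic and render quickly.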

18.
Numerical experiments are presented for the solution of the steady-state compressible Navier-Stokes equations. A single test problem is fixed, supersonic flow past a double ellipse, and various solution methods are studied. The problem is discretized using Osher's scheme, with first- and second-order accuracy. The fastest convergence to steady state is obtained using Newton's method. Simplifications of Newton's method based on domain decomposition are shown to perform well, whereas line relaxation methods meet with difficulties.

19.
Computer-Generated Marbling Textures: A GPU-Based Design System   Total citations: 3 (self-citations: 0, citations by others: 3)
A computer system for interactively creating marbling textures is built on the physical model of the traditional marbling process. The approach generates marbling designs as the result of color advection in the 2D flow fields obtained by numerically solving the Navier-Stokes equations on the GPU with a multigrid solver.

20.
Computers & Fluids, 2006, 35(8-9): 888-897
The goal of this article is to contribute to the discussion of the efficiency of lattice-Boltzmann (LB) methods as CFD solvers. After a short review of the basic model and extensions, we compare the accuracy and computational efficiency of two research simulation codes based on the LB and the finite-element method (FEM) for two-dimensional incompressible laminar flow problems with complex geometries. We also study the influence of the Mach number on the solution, since LB methods are weakly compressible by nature, by comparing compressible and incompressible results obtained from the LB code and the commercial code CFX. Our results indicate that, for the quantities studied (lift, drag, pressure drop), our LB prototype is competitive for incompressible transient problems but asymptotically slower for steady-state Stokes flow, because the asymptotic algorithmic complexity of the classical LB method is not optimal compared to the multigrid solvers incorporated in the FEM and CFX codes. For the weakly compressible case, the LB approach has a significant wall-clock time advantage over CFX. In addition, we demonstrate that the influence of the finite Mach number in LB simulations of incompressible flow is easily underestimated.
