Similar Documents
 20 similar documents found (search time: 15 ms)
1.
2.
The world's top-ranked supercomputers are all built on hybrid architectures combining large numbers of CPUs and GPUs. For certain problem classes, such as FFT-based image processing or N-body particle simulation, they deliver very high performance. For partial differential equations discretized by finite differences (or grid-based finite elements), however, achieving good performance on CPU/GPU clusters remains a challenge. This paper proposes and tests a hybrid algorithm for this class of cluster architecture. Scalability is obtained through a domain decomposition method, while GPU performance comes from a smoothed-aggregation algebraic multigrid method, avoiding the incomplete factorization algorithms that perform poorly on GPUs. The numerical experiments use 32 CPU/GPU pairs to solve finite-difference-discretized partial differential equations with up to thirty million unknowns.
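Below is a minimal CUDA sketch of the kind of GPU-friendly smoother a smoothed-aggregation AMG cycle typically uses in place of an incomplete factorization: one damped Jacobi sweep over a CSR matrix. The data layout and all names are illustrative assumptions, not taken from the paper.

```cuda
// Hypothetical sketch: one damped-Jacobi sweep over a CSR matrix, the kind of
// smoother a smoothed-aggregation AMG cycle uses on the GPU instead of an
// incomplete factorization. row_ptr/col_idx/val describe the CSR layout.
__global__ void jacobi_sweep(int n, const int* row_ptr, const int* col_idx,
                             const double* val, const double* diag,
                             const double* b, const double* x_old,
                             double* x_new, double omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sigma = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
        int j = col_idx[k];
        if (j != i) sigma += val[k] * x_old[j];   // off-diagonal contributions
    }
    // damped Jacobi update: x_new = x_old + omega * D^{-1} (b - A x_old)
    x_new[i] = x_old[i] + omega * ((b[i] - sigma) / diag[i] - x_old[i]);
}
```

Unlike ILU-type smoothers, every row is updated independently, which is why this style of smoother maps well onto the GPU.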

3.
This paper presents a reformulation of bidirectional path-tracing that adequately divides the algorithm into processes efficiently executed in parallel on both the CPU and the GPU. We thus benefit from high-level optimization techniques such as double buffering, batch processing, and asynchronous execution, as well as from the exploitation of most of the CPU, GPU, and memory bus capabilities. Our approach, while avoiding pure GPU implementation limitations (such as limited complexity of shaders, light or camera models, and processed scene data sets), is more than ten times faster than standard bidirectional path-tracing implementations, leading to performance suitable for production-oriented rendering engines.
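A hedged sketch of the double-buffering and asynchronous-execution pattern described above, using two pinned host buffers and two CUDA streams so that CPU batch preparation overlaps GPU work; trace_batch and prepare_batch are stand-ins, not the paper's actual kernels.

```cuda
#include <cuda_runtime.h>

__global__ void trace_batch(const float* rays, float* radiance, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) radiance[i] = rays[i] * 0.5f;          // stand-in for real tracing
}

void prepare_batch(float* rays, int n, int batch) {   // stand-in for CPU sampling
    for (int i = 0; i < n; ++i) rays[i] = (float)(batch * n + i);
}

void run_pipeline(int num_batches, int n) {
    float *h_rays[2], *d_rays[2], *d_out[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&h_rays[s], n * sizeof(float));   // pinned host buffers
        cudaMalloc(&d_rays[s], n * sizeof(float));
        cudaMalloc(&d_out[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int b = 0; b < num_batches; ++b) {
        int s = b & 1;                                   // ping-pong buffer index
        cudaStreamSynchronize(stream[s]);                // wait until buffer s is free
        prepare_batch(h_rays[s], n, b);                  // CPU work overlaps the other stream
        cudaMemcpyAsync(d_rays[s], h_rays[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        trace_batch<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_rays[s], d_out[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(h_rays[s]); cudaFree(d_rays[s]); cudaFree(d_out[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```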

4.
Heterogeneous CPU+GPU systems have attracted growing attention because they are cheaper, greener, and lower-carbon than traditional supercomputer architectures, and they have begun to appear in the HPC Top500. However, it is also an established fact that parallel efficiency in heterogeneous mode is low. Starting from the principles of scheduling parallel work across GPUs in heterogeneous mode, this paper studies parallel computing efficiency in such systems, using Linpack benchmark efficiency as an example, and presents the corresponding conclusions.

5.
A set-intersection algorithm for hybrid multi-core CPU/GPU platforms is proposed. On the CPU side, exploiting data spatial locality and in-order intersection, an inward intersection algorithm and an improved Baeza-Yates algorithm are presented, improving intersection speed by 0.79x and 1.25x, respectively. On the GPU side, the notion of an effective search interval is introduced: each GPU thread block computes its effective search interval in the other lists to narrow the search range, improving intersection speed by 40% on average. On the hybrid platform, a time-hiding technique overlaps data preprocessing and I/O with GPU computation; results show an average system speed-up of 85%.
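An illustrative CUDA sketch of GPU list intersection by per-thread binary search; the paper's effective-search-interval idea corresponds to narrowing the [lo, hi) range per block, which is omitted here for brevity. All names are hypothetical.

```cuda
// Each thread takes one element of the shorter sorted list A and
// binary-searches it in sorted list B; flags[i] marks elements in both lists.
__global__ void intersect(const int* A, int lenA, const int* B, int lenB,
                          int* flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= lenA) return;
    int key = A[i], lo = 0, hi = lenB;
    while (lo < hi) {                       // classic lower-bound binary search
        int mid = (lo + hi) / 2;
        if (B[mid] < key) lo = mid + 1; else hi = mid;
    }
    flags[i] = (lo < lenB && B[lo] == key); // 1 if key occurs in both lists
}
```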

6.
To address the performance limitations of a single JVM, the key techniques for implementing a distributed JVM are analyzed, and an integrated distributed virtual machine model based on Spaces is proposed. The model separates executable code from data and, through an asynchronous cooperation mechanism and dynamic class loading, transparently schedules multiple Java jobs onto different JVM resources for parallel execution, achieving a single system image.

7.
A Survey of CPU/GPU Cooperative Parallel Computing (total citations: 3; self-citations: 3; citations by others: 3)
Heterogeneous CPU/GPU hybrid parallel systems have become a new class of high-performance computing platform thanks to their strong computing power, high cost-effectiveness, and low energy consumption, but their complex architecture poses major challenges for parallel computing research. CPU/GPU cooperative parallel computing is an emerging, open research area. This survey divides existing work into three categories according to the scale of computing resources used, then reviews several hybrid computing projects in terms of their motivation, research content, and methods, and points out directions for further study, aiming to provide a reference for domain scientists engaged in cooperative parallel computing research.

8.
ASIFT (Affine-SIFT) is a feature extraction algorithm with affine and scale invariance. It performs well in image matching, but its high computational cost makes it hard to use in real-time processing. Based on an analysis of where ASIFT spends its time, the SIFT stage is first optimized for the GPU, using shared memory and coalesced memory access to improve data-access efficiency. The remaining stages of ASIFT are then ported to the GPU, yielding GASIFT, and a device memory pool is used throughout the computation to reduce device-memory allocations and deallocations. Finally, two modes of CPU/GPU cooperation are evaluated. Experiments show that the mode in which the CPU handles control logic while the GPU handles parallel computation suits GASIFT best; in this mode GASIFT achieves good speed-ups, especially on medium and large images. For a 2048*1536 image, GASIFT is up to 16 times faster than standard ASIFT and up to 7 times faster than an OpenMP-optimized ASIFT, greatly improving the feasibility of using ASIFT in real-time applications.
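A hedged sketch of the device memory pool idea mentioned above (reusing freed GPU buffers instead of calling cudaMalloc/cudaFree for every pyramid level); this is an assumed mechanism for illustration, not the paper's implementation.

```cuda
#include <cuda_runtime.h>
#include <map>

// Keeps freed device buffers keyed by size and hands them back out, so repeated
// allocations of similar sizes avoid the cost of cudaMalloc/cudaFree.
class DevicePool {
    std::multimap<size_t, void*> free_;     // size -> reusable device buffers
public:
    void* acquire(size_t bytes) {
        auto it = free_.lower_bound(bytes);
        if (it != free_.end()) {            // reuse an existing allocation
            void* p = it->second;
            free_.erase(it);
            return p;
        }
        void* p = nullptr;
        cudaMalloc(&p, bytes);              // fall back to a real allocation
        return p;
    }
    void release(void* p, size_t bytes) {   // return the buffer to the pool
        free_.emplace(bytes, p);
    }
    ~DevicePool() {
        for (auto& kv : free_) cudaFree(kv.second);
    }
};
```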

9.
High computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the usage of GPUs rather complicated, and a GPU cannot directly execute binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases. The first phase is responsible for extracting parallel hot spots from the sequential binary code. The second phase is responsible for generating the hybrid executable (both CPU and GPU instructions) for execution. This virtual execution environment works well for any applications that run repeatedly. The performance of generated CUDA (Compute Unified Device Architecture) code from GXBIT on a number of benchmarks is close to 63% of the hand-tuned GPU code. It also achieves much better overall performance than the native platforms.

10.
Network-based distributed parallel computing provides high-performance, low-cost computing resources. This paper describes DPVM, a network-based distributed parallel virtual machine composed of a virtual machine layer, a communication layer, and a base class layer, with a browser-based user interface. Machines can join the system to contribute idle computing resources, or log in to obtain computing resources. Java is used as the programming language, supporting platform-independent distributed parallel computing.

11.
GPUs are slowly becoming ubiquitous devices in High Performance Computing, as their capabilities to enhance the performance per watt of compute-intensive algorithms as compared to multicore CPUs have been identified. The primary shortcoming of a GPU is usability, since vendor-specific APIs are quite different from existing programming languages, and it requires substantial knowledge of the device and programming interface to optimize applications. Hence, lately a growing number of higher-level programming models are targeting GPUs to alleviate this problem. The ultimate goal for a high-level model is to expose an easy-to-use interface for the user to offload compute-intensive portions of code (kernels) to the GPU, and tune the code according to the target accelerator to maximize overall performance with a reduced development effort. In this paper, we share our experiences with three notable high-level directive-based GPU programming models (PGI, CAPS and OpenACC, from CAPS and PGI) on an Nvidia M2090 GPU. We analyze their performance and programmability against Isotropic (ISO)/Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components in the Reverse Time Migration (RTM) application used by oil and gas exploration for seismic imaging of the sub-surface. When ported to a single GPU using the mentioned directives, we observe an average 1.5-1.8x improvement in performance for both ISO and TTI kernels, when compared with optimized multi-threaded CPU implementations using OpenMP.
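For reference, a plain CUDA version of the kind of isotropic finite-difference stencil these directive models offload; directive-based compilers generate broadly similar device code from an annotated loop nest. Coefficients, layout, and names are illustrative, not the paper's actual kernels.

```cuda
// Second-order 7-point isotropic wave-equation stencil: p holds the current
// wavefield, q the previous timestep (overwritten with the next), vel the
// precomputed velocity term.
__global__ void iso_stencil(const float* p, float* q, const float* vel,
                            int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;
    long idx = (long)k * nx * ny + (long)j * nx + i;
    float lap = p[idx + 1] + p[idx - 1]               // 7-point Laplacian
              + p[idx + nx] + p[idx - nx]
              + p[idx + (long)nx * ny] + p[idx - (long)nx * ny]
              - 6.0f * p[idx];
    q[idx] = 2.0f * p[idx] - q[idx] + vel[idx] * lap; // leapfrog time update
}
```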

12.
肖玄基  张云泉  李玉成  袁良 《软件学报》2013,24(S2):118-126
MAGMA is the first open-source linear algebra package targeting next-generation architectures (multi-core CPUs plus GPUs). It adopts many optimizations for heterogeneous platforms, including hybrid synchronization, communication avoidance, and dynamic task scheduling. Its functionality, data layout, and interfaces resemble those of LAPACK, and it exploits the massive computing power of GPUs for numerical computation. This paper tests and analyzes MAGMA: it first analyzes the matrix factorization algorithms, then uses the test results to examine MAGMA's effective optimization and parallelization techniques, providing useful guidance for using and tuning MAGMA. Finally, an auto-tuning method for the blocked matrix algorithms is proposed; in tests it achieves a speed-up of 1.09 for the SGEQRF routine on square matrices and 1.8 for the CGEQRF routine on tall-and-skinny matrices.

13.
In this paper, we propose a program development toolkit called OMPICUDA for hybrid CPU/GPU clusters. With the support of this toolkit, users can make use of a familiar programming model, i.e., compound OpenMP and MPI, instead of mixed CUDA and MPI or SDSM, to develop their applications on a hybrid CPU/GPU cluster. In addition, they can adapt the types of resources used for executing different parallel regions in the same program by means of an extended device directive, according to the property of each parallel region. On the other hand, this programming toolkit supports a set of data-partition interfaces for users to achieve load balance at the application level, no matter what type of resources is used for the execution of their programs.

14.
Abstract: Three-dimensional visual design of urban rail transit alignments can effectively improve design quality. Traditional CPU-based 3-D alignment modeling suffers from slow modeling, long waits for design results, low rendering efficiency, and difficult scene optimization. A fast 3-D modeling method for urban rail transit alignments based on CPU discretization and GPU modeling is therefore proposed. First, the alignment is divided into linear models and point models; based on the alignment design results, the CPU decomposes the linear models into discretized boundary conditions and parses the point models into spatial parameters, producing independent, very small discrete data packets. The GPU's parallel computing power then builds the 3-D alignment models rapidly from these packets. Combining the CPU's scene-culling capability with the GPU's vertex-expansion capability, a scene optimization method for displaying long linear models is established. The results show that: (1) modeling time is only 0.55%-1.30% of the traditional method; (2) the browsing experience improves markedly over traditional CPU-based modeling and scene management, with a minimum frame rate above 70 fps; (3) memory and CPU usage are effectively reduced, relieving computational pressure on the design platform; (4) the method offers a practical reference for 3-D visual alignment design.

15.
CPU/GPU heterogeneous systems have great development potential; in-depth study of parallel optimization on CPU/GPU heterogeneous platforms can maximize the overall computing capability of the system. Optimizing the partitioning of tasks between the CPU and GPU balances their loads, improves the utilization of computing resources, and shortens task execution time; optimizing the partitioning of GPU threads allows the GPU to run faster, thereby improving overall system performance.
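A minimal sketch of the CPU/GPU task-partitioning idea, assuming a simple data-parallel SAXPY split by a tunable ratio: the GPU share is launched asynchronously while the CPU processes the remainder, then the two are joined. The gpu_ratio value (e.g. 0.8) is purely illustrative and would be tuned to balance the two sides.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

void saxpy_hybrid(int n, float a, const float* x, float* y, float gpu_ratio) {
    int n_gpu = (int)(n * gpu_ratio);           // share of elements sent to the GPU
    float *d_x, *d_y;
    cudaMalloc(&d_x, n_gpu * sizeof(float));
    cudaMalloc(&d_y, n_gpu * sizeof(float));
    cudaMemcpyAsync(d_x, x, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyAsync(d_y, y, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    saxpy_gpu<<<(n_gpu + 255) / 256, 256>>>(n_gpu, a, d_x, d_y);
    for (int i = n_gpu; i < n; ++i)             // CPU handles the remaining slice
        y[i] += a * x[i];                       // while the GPU kernel runs
    cudaMemcpy(y, d_y, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y);
}
```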

16.
《计算机工程》2018,(4):1-11
To meet the real-time processing needs of compute-intensive big data applications, the H-Storm heterogeneous computing platform is developed on top of Apache Storm. Using the Multi-Process Service feature, a mechanism for quantifying and invoking GPU (graphics processing unit) resources in a distributed manner is designed, and a task scheduling strategy for the H-Storm heterogeneous cluster is proposed, realizing a GPU performance- and load-aware scheduling algorithm together with an adaptive stream-dispatching mechanism for cooperative computing. Experimental results on a 512×512 matrix multiplication workload show that, compared with the native Storm platform, H-Storm improves throughput by a factor of 54.9 and reduces response latency by a factor of 77.
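For context, a standard tiled shared-memory matrix-multiply kernel of the kind such a 512×512 benchmark would offload to the GPU; the tiling scheme is common practice and not a detail taken from the H-Storm paper.

```cuda
#define TILE 16

// Tiled matrix multiply C = A * B for square n x n matrices, with n assumed to
// be a multiple of TILE (512 is). Each block computes one TILE x TILE tile of C.
__global__ void matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                        // tile loaded by the whole block
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // done reading before next load
    }
    C[row * n + col] = acc;
}
```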

17.
Real-time beamforming has long been a key difficulty in signal processing for sonar, radar, and related fields. This paper achieves real-time wideband beamforming through cooperative processing between a CUDA (Compute Unified Device Architecture) GPU (Graphics Processing Unit) and the CPU. The method is one to two orders of magnitude faster than MATLAB and CPU-only implementations, and compared with multi-DSP platforms of equivalent speed it offers a shorter development cycle, lower cost, less engineering effort, and higher reliability.
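A hedged sketch of a delay-and-sum beamformer kernel, assuming a precomputed integer-sample delay table per (beam, sensor) pair; real wideband beamforming adds per-frequency phase shifts, which are omitted here, and all names are hypothetical.

```cuda
// One thread per (beam, output sample): sum the delayed sensor signals.
// Launch with gridDim.y == num_beams and enough x-threads to cover num_samples.
__global__ void delay_and_sum(const float* sig,   // [num_sensors][num_samples]
                              const int* delays,  // [num_beams][num_sensors]
                              float* out,         // [num_beams][num_samples]
                              int num_sensors, int num_samples, int num_beams) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // output sample index
    int b = blockIdx.y;                              // beam index
    if (t >= num_samples || b >= num_beams) return;
    float acc = 0.0f;
    for (int s = 0; s < num_sensors; ++s) {
        int ts = t - delays[b * num_sensors + s];    // apply per-sensor delay
        if (ts >= 0 && ts < num_samples)
            acc += sig[s * num_samples + ts];
    }
    out[b * num_samples + t] = acc / num_sensors;
}
```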

18.

Modern computer systems can use different types of hardware acceleration to achieve massive performance improvements. Some accelerators, like FPGAs and dedicated GPUs (dGPUs), need optimized data structures for the best performance and often use dedicated memory. In contrast, APUs, which combine a CPU with an integrated GPU (iGPU), support shared memory and allow the iGPU to work together with the CPU on pointer-based data structures. We first develop a dGPU approach to accelerate queries in libcuckoo and robin-map, and then look at accelerating insert, update, and erase operations in the original libcuckoo using OneAPI on an APU. We evaluate the dGPU approach against the CPU variants and against our dGPU approach adapted for the CPU, and also in a hybrid context by using longer keys on the CPU and shorter keys on the dGPU. In comparison with the original libcuckoo algorithm, our dGPU approach achieves a speed-up of 2.1, and in comparison with robin-map a speed-up of 1.5. For hybrid workloads, our approach is efficient if long keys are processed on the CPU and short keys are processed on the dGPU: by processing a mixture of 20% long keys on the CPU and 80% short keys on the dGPU, our hybrid approach has a 40% higher throughput than the CPU-only approach. In addition, we develop a hybrid APU approach for insert, update, and erase operations in the original libcuckoo structure, focusing on shared memory with iGPU-accelerated look-ups of the positions for insert, update, and erase operations.
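A hedged sketch of what a batched GPU look-up over a two-table cuckoo hash can look like; the flat key/value arrays and the hash functions are simplifying assumptions, not libcuckoo's real layout or the paper's code.

```cuda
// Each query key has exactly two candidate slots, one per table; a look-up is
// therefore at most two probes, which maps cleanly onto one thread per query.
__device__ unsigned hash1(unsigned k) { return k * 2654435761u; }
__device__ unsigned hash2(unsigned k) { return (k ^ (k >> 16)) * 0x45d9f3bu; }

__global__ void cuckoo_lookup(const unsigned* keys1, const unsigned* vals1,
                              const unsigned* keys2, const unsigned* vals2,
                              unsigned capacity,
                              const unsigned* queries, unsigned* results,
                              unsigned num_queries, unsigned not_found) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_queries) return;
    unsigned q  = queries[i];
    unsigned s1 = hash1(q) % capacity;          // candidate slot in table 1
    unsigned s2 = hash2(q) % capacity;          // candidate slot in table 2
    if (keys1[s1] == q)      results[i] = vals1[s1];
    else if (keys2[s2] == q) results[i] = vals2[s2];
    else                     results[i] = not_found;
}
```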


19.
This paper presents a generic approach to highly efficient image registration in two and three dimensions. Both monomodal and multimodal registration problems are considered. We focus on the important class of affine-linear transformations in a derivative-based optimization framework. Our main contribution is an explicit formulation of the objective function gradient and Hessian approximation that allows for very efficient, parallel derivative calculation with virtually no memory requirements. The flexible parallelism of our concept allows for direct implementation on various hardware platforms. Derivative calculations are fully matrix free and operate directly on the input data, thereby reducing the auxiliary space requirements from \({\mathcal {O}}(n)\) to \({\mathcal {O}}(1)\). The proposed approach is implemented on multicore CPU and GPU. Our GPU code outperforms a conventional matrix-based CPU implementation by more than two orders of magnitude, thus enabling usage in real-time scenarios. The computational properties of our approach are extensively evaluated, thereby demonstrating the performance gain for a variety of real-life medical applications.
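A hedged sketch of the matrix-free gradient idea for affine 2-D SSD registration: each thread handles one pixel, samples the template at the transformed position (nearest neighbour here for brevity, whereas the paper uses proper interpolation), and accumulates its contribution to the six-component gradient with atomics, so no n-by-6 Jacobian is ever stored. All names and the sampling scheme are assumptions made for this illustration.

```cuda
// Gradient of 0.5 * sum_i (T(phi(x_i)) - R(x_i))^2 with respect to the six
// affine parameters p, accumulated directly from the input images.
__global__ void ssd_affine_grad(const float* T, const float* R,
                                int w, int h, const float* p, float* grad) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w - 1 || y >= h - 1) return;
    // affine map: (u, v) = (p0*x + p1*y + p2, p3*x + p4*y + p5)
    float u = p[0] * x + p[1] * y + p[2];
    float v = p[3] * x + p[4] * y + p[5];
    int ui = (int)u, vi = (int)v;
    if (ui < 0 || vi < 0 || ui >= w - 1 || vi >= h - 1) return;
    float t  = T[vi * w + ui];
    float tu = T[vi * w + ui + 1] - t;          // forward-difference dT/du
    float tv = T[(vi + 1) * w + ui] - t;        // forward-difference dT/dv
    float r  = t - R[y * w + x];                // pointwise residual
    atomicAdd(&grad[0], r * tu * x);  atomicAdd(&grad[1], r * tu * y);
    atomicAdd(&grad[2], r * tu);      atomicAdd(&grad[3], r * tv * x);
    atomicAdd(&grad[4], r * tv * y);  atomicAdd(&grad[5], r * tv);
}
```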

20.
A Java Virtual Machine Based on a Hybrid Concurrent Model (total citations: 3; self-citations: 0; citations by others: 3)
杨博  王鼎兴  郑纬民 《软件学报》2002,13(7):1250-1256
The shift from interpretation to just-in-time compilation has greatly improved the execution speed of Java programs, but existing Java virtual machines still leave room for improvement. This paper proposes a new compilation and execution model for Java virtual machines, the hybrid concurrent compilation and execution model (HCCEM), which overlaps bytecode compilation with execution under multi-threaded control to obtain speed-ups. It also presents the design of JAFFE, a Java virtual machine based on HCCEM, and discusses execution-mode switching, exception handling, and hierarchical threading issues in the implementation. Experimental results show that HCCEM can…
