首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到16条相似文献,搜索用时 125 毫秒
1.
徐新海  杨学军  林宇斐  林一松  唐滔 《软件学报》2011,22(10):2538-2552
近年来,为了缓解日益严重的功耗问题,异构并行体系结构已成为超级计算机发展的一个重要趋势.图形处理器(graphics processing unit,简称GPU)凭借其超高的计算性能和性能功耗比,作为一种高效的加速部件已被广泛应用于高性能计算领域.但是,GPU先天的可靠性缺陷势必加剧超级计算机的可靠性问题.目前,国际上关于CPU-GPU异构系统容错技术的研究工作主要将GPU从异构系统中独立出来,以每次调用为粒度对其进行容错处理.设计了一种面向CPU-GPU异构系统的Lazy容错方法,给出了基于编译指导命令的容错框架及其约束,并讨论了相关的编译实现和优化方法,最后通过实验验证了该方法的正确性.实验结果表明,与现有的容错方法相比,利用所设计的LazyFT容错方法对GPGPU(general purpose computation on graphics hardware)程序进行容错处理,可以明显降低容错代价.  相似文献   

2.
以类OpenMP的并行程序为研究对象,在满足性能约束的条件下,结合异构系统并行循环调度和处理器动态电压调节技术优化系统功耗.首先建立了异构系统功耗感知的并行循环调度问题基本模型;然后,通过分析方法给出异构系统并行循环调度的能耗下界,该下界可用于评估功耗优化方法的实际效率;进而将异构系统并行循环调度问题归纳为整数规划问题,在此基础上,提出了处理器内循环再调度方法进一步降低功耗.最后,以CPU-GPU异构系统为平台评测了10个典型kernel程序.实验结果表明,该方法可以有效降低系统功耗,提高系统效能.  相似文献   

3.
王桂彬  杜静  唐滔 《软件学报》2013,24(10):2460-2472
高功耗已成为制约高性能计算机发展的重要问题之一.近年来,大量研究关注于如何在满足系统功耗约束的条件下优化系统执行性能.然而,已有方法大都针对同构系统,未考虑异构处理器之间的功耗或速度差异,难以高效应用于基于加速器的异构系统.对当前异构并行系统执行模型进行了抽象,并提出了融合两级功耗控制机制的系统功耗管理框架,自顶向下依次为系统级功耗控制器和异构处理引擎功耗控制器.在异构处理引擎功耗控制中,针对类OpenMP 并行循环,首先分析了异构多处理器在满足功耗约束条件下达到性能最优的条件.基于该结果,给出了功耗受限的并行循环划分算法,该方法通过协调并行循环调度和动态电压频率调节技术以优化异构并行处理.在系统级功耗控制中,建立了异构处理引擎效能评估方法,以此作为功耗划分的依据,在兼顾并发应用公平性的同时,提高系统整体执行效能.最后,基于典型CPU-GPU 异构系统验证了方法的有效性.  相似文献   

4.
图形处理器GPU以其高性能、高能效优势成为当前异构高性能计算机系统主要采用的加速部件。虽然GPU具有较高的理论峰值能效,但其绝对功耗开销明显高于通用处理器。随着GPU在高性能计算领域的应用逐渐扩展,面向GPU的低功耗优化研究将成为该领域的重要研究方向之一。准确的功耗预测是功耗优化研究的重要前提,本文提出了基于硬件性能计数器的GPU功耗预测方法。该方法基于硬件性能计数器信息,结合GPU在部分运行频率下的功耗值,通过线性回归的方法预测处理器在其他运行频率下的功耗值。实验结果表明,该方法可以准确地预测GPU功耗。  相似文献   

5.
在多核中央处理器(CPU)—图形处理器(GPU)异构并行体系结构上,采用OpenMP和计算统一设备架构(CUDA)编程实现了基于AMBER力场的蛋白质分子动力学模拟程序。通过合理地将程序划分为CPU单线程、CPU多线程和GPU多线程执行部分,高效地利用了计算机的处理能力。性能测试结果表明,相对于优化后的CPU串行计算,多核CPU-GPU异构并行计算模型有强大的性能优势,特别是将占整个程序执行时间90%的作用力的计算移植到GPU上执行,获得了最高可达12倍的计算加速比。  相似文献   

6.
研究GPU/CPU异构系统任务调度的节能问题.与传统同构体系结构相比,异构系统任务调度呈现较大的随机性和不定性,GPU/CPU异构系统中时间间隙片段呈现了较大的随机性,导致传统调度方法很难建立规则的描述时间片段的模型,调度能耗较高.为解决上述问题,提出了一种改进功耗优化的GPU/CPU异构环境下的任务调度算法,将任务关系图按照依赖关系计算量拆分,并分配到计算节点.在计算节点内根据权重法的思想,统计所有计算节点的处理情况,进而将节点内的子任务调度到合适的处理器.实验结果表明,在不影响应用性能的前提下,降低了异构系统的能耗开销,优化效果明显.  相似文献   

7.
应用级checkpointing技术是同构系统上最为常用和成熟的容错技术,但在异构系统下的应用还处于起步阶段,还没有一套严谨合理的针对异构系统架构和故障模型特点的实现方案和配置方法。针对这一现况,本文基于CUDA异构系统的体系结构和编程模型,对CUDA程序在CPU和GPU上的执行模式进行分析,提出了一种面向异构系统应用级checkpointing技术的异步执行机制,并基于这一机制对异构系统的检查点优化设置问题进行讨论,设计了一套优化方案。最后在CUDA平台下通过三个实例验证了这一技术的可行性和实用性,并进行了性能评估。结果表明,这种面向CPU-GPU的异构系统的应用级checkpointing异步执行机制是行之有效的,相比CPU-GPU同步执行的checkpointing机制在设置上更为灵活,优化空间更大。而本文基于这一机制所提出的检查点优化设置方法也有效地减少了check-pointing的开销,从而获得了更高的容错性能。  相似文献   

8.
CPU/GPU异构系统具有很大的发展潜力,深入研究CPU/GPU异构平台的并行优化,可实现系统整体计算能力的最大化。通过对CPU/GPU任务划分的优化来平衡CPU和GPU的负载,可提高计算资源的利用率,缩短计算任务的执行时间;通过对GPU线程划分的优化,可使GPU获得更高的速度。从而提高系统整体性能。  相似文献   

9.
GPU集群已经成为高性能计算的重要方式,特别对于计算密集型应用,具有成本低、性能高、功耗小的优势.为了解决GPU集群系统运行中的任务负载均衡问题,文中提出了一种面向计算密集型应用的异构GPU集群调度方法,该方法可以自动发现计算节点,并动态估计计算节点的计算能力,并根据计算能力、任务的计算强度和优先级在异构GPU集群上合理分配计算资源.同时,该系统还具有容错能力,能够处理计算节点的意外退出,可恢复意外退出计算节点的计算任务,并动态适应系统的计算规模.通过实验表明,文中采用的策略达到了预期目的  相似文献   

10.
王桂彬  杨学军  唐滔  徐新海 《软件学报》2012,23(6):1382-1396
随着处理器功耗不断增大,功耗问题逐渐成为高性能计算机系统设计与实现的首要问题.当前,异构系统已成为高性能计算机的发展趋势之一.与传统同构体系结构相比,异构体系结构具有更高的理论峰值性能和能效,但是如何在满足应用性能的条件下充分发掘异构系统的能效优势,仍是一个挑战性问题.通过将应用程序抽象为由串行段和并行段组成的一般程序模型,建立了异构并行系统能耗优化模型通过分析方法依次给出并行段以及全程序(多程序段)能耗最优时处理器间满足的关系,分别给出了时间约束下能耗最优的处理器频率选择算法.最后,以CPU-GPU异构系统为平台,通过8个典型应用程序验证了方法的有效性.  相似文献   

11.
Computing systems should be designed to exploit parallelism in order to improve performance. In general, a GPU (Graphics Processing Unit) can provide more parallelism than a CPU (Central Processing Unit), resulting in the wide usage of heterogeneous computing systems that utilize both the CPU and the GPU together. In the heterogeneous computing systems, the efficiency of the scheduling scheme, which selects the device to execute the application between the CPU and the GPU, is one of the most critical factors in determining the performance. This paper proposes a dynamic scheduling scheme for the selection of the device between the CPU and the GPU to execute the application based on the estimated-execution-time information. The proposed scheduling scheme enables the selection between the CPU and the GPU to minimize the completion time, resulting in a better system performance, even though it requires the training period to collect the execution history. According to our simulations, the proposed estimated-execution-time scheduling can improve the utilization of the CPU and the GPU compared to existing scheduling schemes, resulting in reduced execution time and enhanced energy efficiency of heterogeneous computing systems.  相似文献   

12.
The parallel preconditioned conjugate gradient method (CGM) is used in many applications of scientific computing and often has a critical impact on their performance and energy consumption. This article investigates the energy-aware execution of the CGM on multi-core CPUs and GPUs used in an adaptive FEM. Based on experiments, an application-specific execution time and energy model is developed. The model considers the execution speed of the CPU and the GPU, their electrical power, voltage and frequency scaling, the energy consumption of the memory as well as the time and energy needed for transferring the data between main memory and GPU memory. The model makes it possible to predict how to distribute the data to the processing units for achieving the most energy efficient execution: the execution might deploy the CPU only, the GPU only or both simultaneously using a dynamic and adaptive collaboration scheme. The dynamic collaboration enables an execution minimising the execution time. By measuring execution times for every FEM iteration, the data distribution is adapted automatically to changing properties, e.g. the data sizes.  相似文献   

13.
In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone.  相似文献   

14.
Recently, heterogeneous computing that incorporates the main processor(s) with accelerator(s) for boosting the performance of applications becomes popular. While joining forces of the accelerators could help improve performance, it may also sometimes produce the negative results. In particular, this happens during the execution of the image processing applications. Halide, in particular, has such a problem. Our previous study found that dynamically dispatching image processing tasks to the CPU and the GPU could often lead to prolonged execution time. In this paper, we propose a profile-guided job dispatching mechanism to better harness the computing power of the different types of computing elements. The proposed mechanism assigns the computation tasks onto the proper computing elements, based on the measured performance during the early rounds of the task execution. We implemented the proposed mechanism in the Halide framework. We evaluate the efficiency of the dispatching method with two benchmarks, including bilateral grid filters and local Laplacian filters using the CPU-only, the GPU-only and the hybrid CPU-GPU configurations. Our results show that the profile-guided approach boosts the performance with 1K resolution which is 52% faster than the dynamic approach for local Laplacian filters. On the other hand, for bilateral grid filters, the difference is within 7%. For local Laplacian filters with 8K resolution, the boosted performance is 38% faster than the dynamic approach. In addition, for bilateral grid filters, the difference is within 7%. As a result, it delivers better results than dispatching mechanism in previous work. Since the high-level C++ objects are offered to the programmers and the implementation details of the proposed method are hidden from them, the programmers are allowed to focus on the application logics rather than coordinating the computation between the heterogeneous computing elements.  相似文献   

15.
为有效提高异构的CPU/GPU集群计算性能,提出一种支持异构集群的CPU与GPU协同计算的两级动态调度算法。根据各节点计算能力评测结果和任务请求动态分发数据,在节点内CPU和GPU之间动态调度任务,使用数据缓存和数据处理双队列机制,提高异构集群的传输和处理效率。该算法实现了集群各节点“能者多劳”,避免了单节点性能瓶颈造成的任务长尾现象。实验结果表明,该算法较传统MPI/GPU并行计算性能提高了11倍。  相似文献   

16.
王桂彬 《计算机学报》2012,35(5):979-989
作为众核体系结构的典型代表,GPU(Graphics Processing Units)芯片集成了大量并行处理核心,其功耗开销也在随之增大,逐渐成为计算机系统中功耗开销最大的组成部分之一,而软件低功耗优化技术是降低芯片功耗的有效方法.文中提出了一种模型指导的多维低功耗优化技术,通过结合动态电压/频率调节和动态核心关闭技术,在不影响性能的情况下降低GPU功耗.首先,针对GPU多线程执行模型的特点,建立了访存受限程序的功耗优化模型;然后,基于该模型,分别分析了动态电压/频率调节和动态核心关闭技术对程序执行时间和能量消耗的影响,进而将功耗优化问题归纳为一般整数规划问题;最后,通过对9个典型GPU程序的评测以及与已有方法的对比分析,验证了该文提出的低功耗优化技术可以在不影响性能的情况下有效降低芯片功耗.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号