Similar literature
 20 similar documents found; search took 343 ms
1.
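As the performance of general-purpose GPUs (general purpose computation on graphic processing units, GPGPU) keeps improving, heterogeneous systems built from CPUs and GPUs have become a research hotspot in high-performance computing. However, as parallel computing systems keep growing, system reliability keeps falling, and it has become a constraint on scaling parallel computing that cannot be ignored. Because commodity GPGPUs have weak fault tolerance, the reliability problem of large-scale heterogeneous parallel systems built from CPUs and GPUs is even sharper, and practical fault-tolerance techniques are still lacking. To address this problem, the paper proposes RB-TMR (Rollback TMR), a multi-replica GPU fault-tolerance technique based on redundant threads. It also analyzes and describes the design, implementation, and compilation framework of this fault-tolerance mechanism for heterogeneous systems, based on their programming model and program characteristics. Finally, the technique is implemented on 10 cases and its performance is evaluated. The technique offers a new line of thought for fault-tolerance research on heterogeneous systems and is of considerable significance.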
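The abstract does not say how RB-TMR votes or rolls back, so the following is only a generic sketch of the idea that triple modular redundancy with rollback usually refers to: run the same computation in several replicas, accept the majority result, and re-execute when no majority exists. The names `vote` and `run_with_rollback_tmr`, the retry limit, and the use of plain Python functions in place of GPU kernels are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


def run_with_rollback_tmr(
    kernel: Callable[[int], Hashable],
    data: int,
    replicas: int = 3,
    max_retries: int = 2,
) -> Hashable:
    """Run `kernel` in `replicas` copies and vote; roll back and retry on disagreement.

    Hypothetical helper: a real GPU version would launch redundant threads
    and restore device state before re-executing.
    """
    for _attempt in range(max_retries + 1):
        results = [kernel(data) for _ in range(replicas)]
        winner = vote(results)
        if winner is not None:
            return winner
        # No majority: fall through, which "rolls back" to the same input.
    raise RuntimeError("no majority result after retries")
```

With three replicas, a single faulty replica is masked by the vote, while the rollback path only triggers when the replicas disagree so badly that no majority exists.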

2.
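Large-scale heterogeneous many-core computer systems offer strong computing power and a high performance-to-power ratio, and they have become the direction in which supercomputers are developing. Their complex heterogeneous structure and huge scale, however, pose great challenges to system availability, so lightweight fault-tolerance techniques for such systems are an important research topic. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, the authors designed and implemented two language-supported fault-tolerance mechanisms in the Parallel C language: lightweight degradation with local fault awareness, and checkpointing driven by compiler directives and automatic analysis. Together they balance usability and efficiency. Lightweight degradation with local fault awareness is implemented together with a dynamic task-scheduling framework, supports many-core systems, and scales beyond one million-way parallelism. For checkpointing driven by compiler directives and automatic analysis, the programmer inserts simple compiler directives and the compiler analyzes the program and flags data that need not be saved, which effectively reduces the volume of data saved and restored. Tests on the Sunway TaihuLight supercomputer show that both measures perform well relative to traditional fault-tolerance methods: the overhead of lightweight degradation is below 1%, execution time under a single fault is reduced by more than 3.5% compared with traditional rollback-based fault tolerance, and in typical applications directive-driven checkpointing reduces the saved data volume to as little as 1/10. The approach is therefore highly practical.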

3.
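Jia Jia, Yang Xuejun, Ma Yaqing. Journal of Software (软件学报), 2013, 24(6): 1361-1375
Application-level checkpointing is a fault-tolerance technique that has attracted wide attention in large-scale scientific computing. The application programmer chooses appropriate places to save critical data, which lowers the fault-tolerance overhead. Choosing suitable checkpoint locations and reducing the amount of data saved in a global checkpoint are the key problems in optimizing application-level checkpointing. Application-level checkpointing on the heterogeneous systems with general-purpose GPUs introduced in recent years faces the same problems. Based on the architecture and program characteristics of heterogeneous systems, the paper statically analyzes checkpoint placement for application-level checkpointing on such systems and proposes two placement methods with different mechanisms: synchronous and asynchronous. For each method, it models the optimal checkpoint placement problem mathematically and solves it. Finally, experiments validate the two proposed methods and evaluate their performance.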

4.
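As high-performance computer systems grow larger, system reliability problems become more severe. Checkpointing is the most typical fault-tolerance method, but because parallel file system performance has improved relatively slowly and data write bandwidth is low, traditional checkpointing suffers from serious performance problems. Given that current systems have abundant compute and storage resources while parallel file system write bandwidth lags behind, the paper proposes an asynchronous checkpointing technique based on an in-memory cache. Traditional checkpointing is split into two steps: the checkpoint file is first cached in the local memory of the compute node, and an independent helper task then copies the data to the parallel file system. By exploiting the high bandwidth of local memory and running the helper task in parallel with the compute task, the new method greatly reduces the time overhead introduced by checkpointing. Simulations and tests with real programs confirm the effectiveness of the asynchronous checkpointing technique.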
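The two-step scheme in this abstract (snapshot to local memory, then drain to slow storage from a helper task) can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the class name `AsyncCheckpointer`, the `slow_store` callback that stands in for the parallel file system, and the use of a Python thread as the helper task are all assumptions.

```python
import copy
import queue
import threading
from typing import Any, Callable


class AsyncCheckpointer:
    """Cache checkpoints in local memory and flush them from a helper thread."""

    def __init__(self, slow_store: Callable[[int, Any], None]) -> None:
        self._slow_store = slow_store  # stands in for a parallel file system write
        self._pending: "queue.Queue" = queue.Queue()
        self._helper = threading.Thread(target=self._drain, daemon=True)
        self._helper.start()

    def checkpoint(self, step: int, state: Any) -> None:
        # Step 1: take a fast in-memory snapshot so computation can resume at once.
        self._pending.put((step, copy.deepcopy(state)))

    def _drain(self) -> None:
        # Step 2: the helper copies cached snapshots to slow storage in the background.
        while True:
            item = self._pending.get()
            try:
                if item is None:  # shutdown sentinel
                    return
                step, snapshot = item
                self._slow_store(step, snapshot)
            finally:
                self._pending.task_done()

    def close(self) -> None:
        """Flush everything that is still cached, then stop the helper."""
        self._pending.put(None)
        self._pending.join()
        self._helper.join()
```

The compute task only pays for the in-memory copy inside `checkpoint`; the slow write overlaps with later computation, which is the source of the savings the abstract describes.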

5.
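CPU/GPU heterogeneous systems have great development potential, and in-depth study of parallel optimization on CPU/GPU heterogeneous platforms can maximize the overall computing capability of the system. Optimizing the partitioning of tasks between the CPU and the GPU balances their loads, which raises the utilization of computing resources and shortens task execution time. Optimizing GPU thread partitioning lets the GPU run faster. Together, these optimizations improve overall system performance.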
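One simple way to picture load-balanced task partitioning is to split the work in proportion to each device's measured throughput, so that both finish at about the same time. The abstract does not describe its partitioning method, so the function `split_work` and its rate parameters below are purely illustrative assumptions.

```python
from typing import Tuple


def split_work(total_items: int, cpu_rate: float, gpu_rate: float) -> Tuple[int, int]:
    """Split `total_items` between CPU and GPU in proportion to their throughput.

    `cpu_rate` and `gpu_rate` are items processed per unit time (hypothetical,
    e.g. measured by a short profiling run). Returns (cpu_items, gpu_items).
    """
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    gpu_items = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    return total_items - gpu_items, gpu_items


cpu_items, gpu_items = split_work(1000, cpu_rate=1.0, gpu_rate=4.0)
# Both devices now need the same time: 200 / 1.0 == 800 / 4.0
```

A static split like this ignores transfer costs and runtime variation; a real system would likely re-measure the rates and adjust the split over time.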

6.
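Application-level checkpointing is the most widely used and mature fault-tolerance technique on homogeneous systems, but its use on heterogeneous systems is still at an early stage, and there is not yet a rigorous, well-founded implementation scheme and configuration method tailored to heterogeneous system architectures and fault models. To address this, the paper analyzes how CUDA programs execute on the CPU and GPU, based on the architecture and programming model of CUDA heterogeneous systems, and proposes an asynchronous execution mechanism for application-level checkpointing on heterogeneous systems. Building on this mechanism, it discusses the problem of optimal checkpoint placement on heterogeneous systems and designs an optimization scheme. Finally, three examples on the CUDA platform verify the feasibility and practicality of the technique and are used to evaluate its performance. The results show that the asynchronous execution mechanism for application-level checkpointing on CPU-GPU heterogeneous systems is effective. Compared with a checkpointing mechanism in which the CPU and GPU execute synchronously, it is more flexible to configure and leaves more room for optimization. The checkpoint placement optimization built on this mechanism also effectively reduces checkpointing overhead and thus achieves better fault-tolerance performance.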

7.
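In large-scale cluster environments, checkpointing and recovery is an indispensable fault-tolerance technique. The paper proposes a method, based on the reliability mechanism of the cluster communication system, for capturing the global state of the communication system without global synchronization, and uses it to implement a parallel checkpointing system that is transparent to applications. With support from the underlying communication system, the system reduces the implementation complexity and execution overhead of parallel checkpointing and is suitable for large-scale cluster applications.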

8.
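Xu Xinhai, Yang Xuejun, Lin Yufei, Lin Yisong, Tang Tao. Journal of Software (软件学报), 2011, 22(10): 2538-2552
In recent years, heterogeneous parallel architectures have become an important trend in supercomputer development as a way to ease the increasingly serious power consumption problem. Thanks to its very high computing performance and performance-to-power ratio, the graphics processing unit (GPU) is widely used as an efficient accelerator in high-performance computing. However, the inherent reliability weaknesses of GPUs are bound to aggravate the reliability problems of supercomputers. Existing international research on fault tolerance for CPU-GPU heterogeneous systems mainly treats the GPU separately from the rest of the heterogeneous system and applies fault tolerance at the granularity of each invocation. The paper designs a Lazy fault-tolerance method (LazyFT) for CPU-GPU heterogeneous systems, presents a fault-tolerance framework based on compiler directives together with its constraints, and discusses the related compiler implementation and optimization methods. Experiments verify the correctness of the method. The results show that, compared with existing fault-tolerance methods, applying LazyFT to GPGPU (general purpose computation on graphics hardware) programs markedly reduces the cost of fault tolerance.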

9.
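Xu Chuanpei, Wang Guang. Journal of Computer Applications (计算机应用), 2016, 36(7): 1801-1806
To address the poor real-time performance of the scale-invariant feature transform (SIFT) algorithm, the paper proposes a SIFT algorithm parallelized and optimized with the Open Computing Language (OpenCL). First, the original algorithm is restructured for parallelism by combining and splitting its steps and by reorganizing the in-memory data indexing of feature points, so that intermediate results can be exchanged entirely within GPU memory. Then, strategies such as reusing global memory objects, sharing local memory, and optimizing memory reads are applied in the parallel design of each step, which improves data read efficiency and reduces transfer latency. Finally, fine-grained parallel acceleration of SIFT is implemented on a graphics processing unit (GPU) using OpenCL and ported to a central processing unit (CPU). With registration quality close to that of the original SIFT algorithm, the parallelized algorithm speeds up feature extraction by 10.51 to 19.33 times on the GPU platform and by 2.34 to 4.74 times on the CPU platform. The experimental results show that SIFT accelerated with OpenCL effectively improves the real-time performance of image registration. It also overcomes a drawback of the Compute Unified Device Architecture (CUDA), which is hard to port and therefore cannot make full use of the multiple kinds of compute cores in a heterogeneous system.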

10.
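Targeting the characteristics of embedded Linux systems, the paper implements process-level fault tolerance on the ARM platform by setting checkpoints. When a checkpoint is taken, the system interacts with the kernel through the /proc file system to obtain, in real time, the process's PID, CPU state, and memory information, and saves them to a storage medium. After a process fails, this saved state is restored, which achieves process-level fault tolerance. Experiments show that the process-level fault-tolerance system tolerates faults well and greatly shortens process recovery time.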

11.
As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and have an adverse effect on solving large-scale applications. Hence, fault-tolerant scheduling is an imperative step for large-scale computing systems. The existing fault-tolerant scheduling algorithms belong to static scheduling, and they allocate multiple copies of each task to several processors no matter whether processor failures affect the execution of tasks. Such active replication strategies not only waste resources but also sacrifice the makespan. What is more, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which can overcome the above drawbacks. FTDR keeps listening for processor failures and reschedules the suspended tasks once failures occur. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. Randomly generated DAGs are tested in our experiments. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.

12.
With nowadays popularity of large-scale parallel computers, Multiprocessors System-on-Chip (MP-SoCs), multicomputers, cluster computers and peer-to-peer communication networks, fault-tolerant routing becomes an important issue in developing these systems. Fault-tolerant routing algorithms in such systems aim at providing continuous operations in the presence of one or more failures by allowing the graceful degradation of system. The Software-Based fault-tolerant routing scheme has been suggested as an efficient routing algorithm to preserve both communication performance and fault-tolerant demands in parallel computer systems. To study network performance, a number of different analytical models for fault-free routing algorithms have been proposed in the past literature. However, there has not been reported any similar analytical model of fault-tolerant routing in the presence of faulty components. This paper presents a new analytical modeling approach for determining the effects of failures in wormhole-switched 2-D tori using the fault-tolerant Software-Based scheme. More specifically, we describe a general model to derive mathematical expressions to investigate the performance behavior of routing algorithms confronting convex (|-shaped, □-shaped) or concave (U-shaped, +-shaped, T-shaped, H-shaped) faulty regions. The model is validated through comprehensive simulation experiments for different types of failures.
M. Ould-Khaoua

13.
Unreliable resources pose challenges in the design of deadlock avoidance algorithms, as resource failures have negative impacts on scheduled production activities and may bring the system to dead states or deadlocks. This paper focuses on the development of a suboptimal polynomial complexity deadlock avoidance algorithm that can operate in the presence of unreliable resources for assembly processes. We formulate a fault-tolerant deadlock avoidance controller synthesis problem for assembly processes based on controlled assembly Petri net (CAPN), a class of Petri nets (PNs) that can model such characteristics as multiple resources and subassembly parts requirement in assembly production processes. The proposed fault-tolerant deadlock avoidance algorithm consists of a nominal algorithm to avoid deadlocks for nominal system state and an exception handling algorithm to deal with resource failures. We analyze the fault-tolerant property of the nominal deadlock avoidance algorithm based on resource unavailability models. Resource unavailability is modeled as loss of tokens in nominal Petri net models to represent the unavailability of resources in the course of time-consuming recovery procedures. We define three types of token loss to model 1) resource failures in a single operation, 2) resource failures in multiple operations of a production process and 3) resource failures in multiple operations of multiple production processes. For each type of token loss, we establish sufficient conditions that guarantee the liveness of a CAPN after some tokens are removed. An algorithm is proposed to conduct feasibility analysis by searching for recovery control sequences and to keep as many types of production processes as possible in production so that the impacts on existing production activities can be reduced.

14.
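Research on a Fault-Tolerant Parallel Skyline Query Algorithm in Cloud Computing Environments   Total citations: 1 (self-citations: 0, citations by others: 1)
While cloud computing gives distributed parallel Skyline queries powerful storage and computing capacity, the frequent failures inherent in its large-scale data centers make reliable Skyline query processing very challenging. Existing work focuses on improving properties of Skyline algorithms such as response time, progressiveness, and load balancing, and cannot guarantee that a query continues to execute correctly when failures occur. The paper therefore proposes a fault-tolerant parallel Skyline query algorithm (fault-tolerant parallel Skyline, FTPS). Through failure monitoring and task migration, the algorithm detects failures promptly during query processing and migrates the computing tasks of a failed node to a replica node, which ensures that the query executes correctly. Theoretical analysis and experiments show that FTPS achieves good fault-handling performance without degrading normal Skyline query processing performance.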
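For readers unfamiliar with Skyline queries, the sketch below shows a partitioned Skyline computation in which a failed partition task is re-run on a replica. It illustrates the general idea only and is not the FTPS algorithm itself: the helper names, the convention that smaller values are better, and the use of `RuntimeError` to simulate a node failure are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every dimension and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points that no other point dominates (smaller is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def run_with_migration(
    partition: Sequence[Point],
    primary: Callable[[Sequence[Point]], List[Point]],
    replica: Callable[[Sequence[Point]], List[Point]],
) -> List[Point]:
    """Run the local Skyline on the primary node; migrate to the replica on failure."""
    try:
        return primary(partition)
    except RuntimeError:  # simulated node failure
        return replica(partition)


def parallel_skyline(partitions, primaries, replicas) -> List[Point]:
    """Merge the local Skylines of all partitions into the global Skyline."""
    local: List[Point] = []
    for part, prim, repl in zip(partitions, primaries, replicas):
        local.extend(run_with_migration(part, prim, repl))
    return skyline(local)
```

Because the global Skyline is always a subset of the union of the local Skylines, re-running only the failed partition is enough to recover a correct final result.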

15.
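For uncertain interconnected large-scale systems with time delays, the paper proposes a decentralized robust fault-tolerant control method. The goal is for the system to remain asymptotically stable when sensor or actuator failures occur and parameters are uncertain. Based on Lyapunov stability theory, the paper gives sufficient conditions for the system to be fault-tolerant under sensor failure, together with the controller design steps, and extends the results to actuator failure. Finally, a simulation example verifies the correctness of the method, and the simulation results are analyzed.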

16.
A strong failure recovery mechanism that handles diverse failures in a heterogeneous and dynamic Grid is essential to ensure the complete execution of long-running applications. Although there have been various efforts made to address this issue, existing solutions either focus on employing only one single fault-tolerant technique without considering the diversity of failures, or propose some frameworks which cannot deal with various kinds of failures adaptively in Grid. In this paper, an adaptive task-level, fault-tolerant approach to Grid is proposed. This approach aims at handling quite a complete set of failures arising in Grid environment by integrating basic fault-tolerant approaches. Moreover, this paper puts forward that resource consumption (which has not received enough attention) is also an important evaluation metric for any fault-tolerant approach. The corresponding evaluation models based on mean execution time and resource consumption are constructed to evaluate any fault-tolerant approach. Based on the models, we also demonstrate the effectiveness of our approach and illustrate the performance gains achieved via simulations. Experiments on a real Grid have also been conducted, and the results show that our approach can achieve better performance and consume fewer resources.

17.
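Cui Meng, Wang Xin, Deng Chao. Control and Decision (控制与决策), 2023, 38(5): 1303-1311
For a class of linear multi-agent systems, the paper studies optimal synchronization control under intermittent denial-of-service attacks on the network. First, for a time-varying, asymmetric communication network topology, it proposes a resilient optimal cooperative fault-tolerant control strategy that optimizes the agents' cooperative quadratic performance index, and it then proves that the global tracking error still converges asymptotically in the presence of actuator faults and network attacks. Further, for the case where the model parameters of the agent subsystems are unknown and actuator faults occur, it proposes a self-learning iterative algorithm that uses local system state and input information to solve the algebraic Riccati equation and compute the feedback controller gains of the subsystems, which achieves the resilient cooperative fault-tolerant control objective. Finally, a simulation example on a network of Chua circuits verifies the effectiveness of the proposed control method.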

18.
Integrating External and Internal Clock Synchronization   Total citations: 2 (self-citations: 1, citations by others: 1)
We address the problem of how to integrate fault-tolerant external and internal clock synchronization. In this paper we propose a new external/internal clock synchronization algorithm which provides both external and internal clock synchronization for as long as a majority of the reference time servers (servers with access to reference time) stay correct. When half or more of the reference time servers are faulty, the algorithm degrades to a fault-tolerant internal clock synchronization algorithm. We prove that at least 2F+1 reference time servers are necessary for achieving external clock synchronization when up to F reference time servers can suffer arbitrary failures; thus the proposed algorithm provides maximum fault-tolerance. In this paper we also derive lower bounds for the best maximum external deviation achievable in standard mode and the best drift rate achievable in degraded mode. Our algorithm is optimal with respect to these two bounds: (1) the maximum external deviation is optimal in standard mode, and (2) the drift rate of the clocks is optimal in standard and degraded mode.
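The 2F+1 bound has a simple intuition that can be shown with a median. With at least 2F+1 readings of which at most F are arbitrary, the median always lies between the smallest and largest correct reading. The sketch below illustrates only that intuition and is not the paper's synchronization algorithm; the function name `fault_tolerant_reference` and the sample readings are hypothetical.

```python
import statistics
from typing import Sequence


def fault_tolerant_reference(readings: Sequence[float], max_faulty: int) -> float:
    """Estimate reference time from server readings, tolerating `max_faulty` liars.

    Requires at least 2 * max_faulty + 1 readings; with fewer, faulty servers
    could form half or more of the readings and control the median.
    """
    if len(readings) < 2 * max_faulty + 1:
        raise ValueError("need at least 2F+1 reference time servers")
    return statistics.median(readings)


# Three correct servers near 100.0 and two arbitrarily wrong ones (F = 2).
estimate = fault_tolerant_reference([100.0, 100.2, 99.9, 1e9, -1e9], max_faulty=2)
```

With only 2F readings, the F faulty servers could pull the median outside the range of correct values, which matches the abstract's statement that the algorithm degrades once half or more of the reference time servers are faulty.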

19.
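The paper proposes a new fault-tolerant Clos network. By adding redundant switching units to each stage of the base Clos network, the design keeps working correctly when a small number of faults occur and thus provides more reliable service. For this fault-tolerant Clos network, a fault-tolerant Clos routing algorithm is given. The algorithm uses a minimum-distribution-first strategy to compute the Clos network connection specification matrix column by column and achieves fully nonblocking routing through rearrangement. Its worst-case time complexity is only O(N^(3/2)). The fault-tolerant Clos network and its algorithm design can be used to build more reliable Clos networks.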

20.
In this paper, we propose a new I/O overhead free Givens rotations based parallel algorithm for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation as identical processors with simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm-based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures.
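The abstract does not describe its ABFT scheme in detail. As background, the classic checksum form of algorithm-based fault tolerance for matrix multiplication can be sketched as below: a checksum row is appended to the left operand, and after the multiplication the checksum row of the product must equal the column sums of the other rows. This is a swapped-in textbook illustration, not the Givens-rotation scheme of the paper, and the helper names are hypothetical. Integer matrices are used so the comparison is exact.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a: Matrix) -> Matrix:
    """Append one row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def checksum_ok(c_full: Matrix) -> bool:
    """Check that the last row equals the column sums of all preceding rows."""
    *body, check = c_full
    return check == [sum(col) for col in zip(*body)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), b)  # product plus its checksum row
```

If a faulty processor corrupts one entry of the product, `checksum_ok(c_full)` returns False, so the error is detected without recomputing the whole product. With floating-point data the equality test would need a tolerance.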
