Similar literature (20 results)
1.
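High-availability analysis and design for parallel computers   Cited by: 1 (self-citations: 1, by others: 0)
As parallel computer systems grow in scale, the system failure rate increases linearly. How to ensure that a large-scale parallel system delivers uninterrupted service, that is, how to raise system availability to the high-availability target, has become an important aspect of parallel system design. The concept of system-level fault tolerance has already been proposed, but the measurement of system availability still requires in-depth study. This paper uses combinatorial models and Markov process models to model and analyze system reliability and availability, derives an availability measure based on Markov processes, and concludes that high-availability techniques improve system availability. On this basis, it also presents a high-availability architecture for a large-scale parallel computer system.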
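A minimal Python sketch of the kind of availability measure the abstract above refers to. The abstract does not state the derived formula, so this assumes the textbook two-state (up/down) Markov model with constant failure rate λ and repair rate μ, whose steady-state availability is μ/(λ+μ). The function names and sample rates are illustrative and do not come from the paper.

```python
def system_failure_rate(node_failure_rate: float, nodes: int) -> float:
    """Failure rate of a system that is down when any one node fails (series model)."""
    return node_failure_rate * nodes


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state Markov model: A = mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    node_rate = 1e-5  # assumed failures per hour per node
    for nodes in (100, 1000, 10000):
        lam = system_failure_rate(node_rate, nodes)
        slow = steady_state_availability(lam, repair_rate=0.5)   # about 2 h manual repair
        fast = steady_state_availability(lam, repair_rate=60.0)  # about 1 min automatic recovery
        print(f"{nodes:>6} nodes: A(slow repair)={slow:.6f}  A(fast recovery)={fast:.6f}")
```

Under these assumptions the failure rate grows linearly with the node count, so availability falls as the system scales unless the recovery rate rises with it, which is the role the abstract assigns to high-availability techniques.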

2.
In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor in computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set that achieves optimal performance. The probability of failure is typically higher in grid computing than in traditional parallel computing, and resource failures can be fatal to job execution. A fault tolerance service is therefore essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failure in distributed systems so that the fault tolerance service can deal with various types of resource failure, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and that job execution performance improves through job migration even if some failures occur.

3.
Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time. This makes the grid infrastructure inherently unreliable, resulting in delays and failures of executing jobs. Fault tolerance therefore becomes vital for achieving reliability, availability, and quality of service in the grid. The most common fault tolerance technique in High Performance Computing is rollback recovery, which relies on the availability of checkpoints and the stability of storage media. Checkpoints are therefore replicated on storage media, which increases job execution time if replication is not done properly. Furthermore, dedicating powerful resources solely to checkpoint storage wastes their computation power and may result in bottlenecks when the network load is high. To address the problem, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, checkpoints are replicated on all checkpoint servers in the grid in a distributed manner, which decreases checkpoint replication time and in turn improves overall job execution time. Additionally, if a resource fails during execution of a job, RCSS restarts the job from its last valid checkpoint, taken from any checkpoint server in the grid. To further increase grid performance, the CPU cycles of checkpoint servers are also utilized when the network load is high. The performance of RCSS is evaluated through simulations in GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5% with a varying number of checkpoint servers, reduces checkpoint wave completion time by 50% with a varying number of clusters, and reduces replication time within a cluster by 39.5%.

4.
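Li Jing, Luo Jinfei, Li Bingchao. 《计算机应用》 (Journal of Computer Applications), 2021, 41(4): 1113-1121
Proactive fault tolerance significantly improves storage system reliability by detecting soon-to-fail disks in advance and prompting the system to migrate and back up at-risk data ahead of time. Because existing research cannot accurately evaluate the reliability of replicated storage systems with proactive fault tolerance, this paper proposes several state transition models for replicated storage systems, implements them with a Monte Carlo simulation algorithm to simulate the operation of such systems, and finally counts the expected number of data loss events within a given operating period. Weibull distribution functions are used to model the time distributions of device failure and failure repair events, and the effects of the proactive fault tolerance mechanism, node failures, node failure repairs, disk failures, and disk failure repairs on storage system reliability are evaluated quantitatively. Experimental results show that when the prediction model's accuracy reaches 50%, system reliability can be improved by 1 to 3 times; compared with two-replica systems, three-replica systems are more sensitive to system parameters. The proposed models can help system administrators compare and weigh system reliability under different fault tolerance schemes and system parameters, and thus build highly reliable and highly available storage systems.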
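The simulation idea in the abstract above can be sketched in Python as follows. This is a heavily simplified illustration, not the paper's state transition models. It assumes independent disks, treats a predicted failure as causing no outage at all, and estimates the probability of at least one data loss in the mission time rather than the expected number of loss events. All function names and parameter values are hypothetical.

```python
import random


def down_intervals(rng, mission, fail, repair, predict_rate):
    """Return the (start, end) outage intervals of one disk within [0, mission].

    fail and repair are (scale, shape) pairs for random.weibullvariate.
    A predicted failure is assumed to be handled by migrating data in
    advance, so it causes no outage (a simplification).
    """
    t, outages = 0.0, []
    while True:
        t += rng.weibullvariate(*fail)         # time until the next failure
        if t >= mission:
            return outages
        if rng.random() < predict_rate:        # failure predicted: no outage
            continue
        end = t + rng.weibullvariate(*repair)  # time at which repair completes
        outages.append((t, min(end, mission)))
        t = end


def intersect(a, b):
    """Intersect two sorted lists of disjoint intervals."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def loss_probability(replicas, trials, mission, fail, repair,
                     predict_rate=0.0, seed=0):
    """Estimate P(all replicas are down at the same time at least once)."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        common = down_intervals(rng, mission, fail, repair, predict_rate)
        for _ in range(replicas - 1):
            if not common:
                break
            common = intersect(
                common, down_intervals(rng, mission, fail, repair, predict_rate))
        losses += bool(common)
    return losses / trials
```

A call such as `loss_probability(2, 10000, 8760.0, (5000.0, 1.2), (24.0, 1.0), predict_rate=0.5)` compares configurations by changing the replica count or `predict_rate`; the estimates are only as meaningful as the assumed Weibull parameters.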

5.
Research on dependable computing and its key technologies   Cited by: 2 (self-citations: 0, by others: 2)
Dependability is the latest and highest technical target used to evaluate the performance quality of a distributed computing system in an open network environment; it includes traditional reliability, availability, robustness, survivability, security, data integrity, software protection ability, etc. A dependable system should not only have fault tolerance ability but also withstand risk and recover from disaster; its realization is founded on the high availability of the information transmission network and on the survivability, fault tolerance, and security safeguards of the system. This paper presents a survey of survivability mechanisms such as long-distance backup, clustering, and system recovery, discusses techniques for fault tolerance design and information network system security safeguards, and analyzes the information redundant dispersal strategy and model for survivability and security safeguards.

6.
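To meet the reliability requirements of aerospace measurement and control systems, a redundant fault-tolerant design for a CompactPCI-bus measurement and control system is proposed. The lower-level computer runs the VxWorks embedded operating system to guarantee real-time behavior, and high-availability hot-swap technology is implemented in VxWorks to improve the system's redundancy and fault tolerance. A fault diagnosis method based on a probabilistic neural network (PNN) is proposed for online fault diagnosis of hot-redundant devices. Simulation and experimental results show that the system has good redundancy and fault tolerance and that the design method effectively improves system reliability.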

7.
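To improve the reliability and computing performance of cluster systems while reducing cost, this paper proposes combining a Single System Image (SSI) cluster with Distributed Replicated Block Device (DRBD) technology to build a high-availability cluster (the SSI-DRBD cluster). A cluster built this way offers high performance, high reliability, strong real-time behavior, easy management, and low cost, and can serve as a platform for periodic, high-intensity, multi-source information processing.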

8.
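Many P2P network storage systems currently adopt an m/n fault tolerance mechanism to improve availability and reliability, but in practice, correlated failures among servers leave this mechanism with a low fault tolerance rate. To address this problem, the paper describes a method for finding a set of server nodes with low failure correlation in a P2P system; the m/n mechanism can raise its fault tolerance rate by using the server nodes in this set, giving the system high availability and reliability. The method is analyzed experimentally and verified to be practical and effective.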
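To illustrate why failure correlation matters for the mechanism described above, here is a small Python sketch. It assumes that "m/n" means data is recoverable when at least m of n servers are up (the abstract does not define it), and it models correlation in the crudest possible way, as a shared event that takes all n servers down together. Both functions are illustrative and are not the paper's method.

```python
from math import comb


def m_of_n_availability(m: int, n: int, p: float) -> float:
    """P(at least m of n independent servers are up), each up with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def with_common_failure(m: int, n: int, p: float, q: float) -> float:
    """Same, but a shared event of probability q takes all n servers down together."""
    return (1 - q) * m_of_n_availability(m, n, p)
```

For example, `m_of_n_availability(2, 3, 0.9)` evaluates the independent case, while `with_common_failure(2, 3, 0.9, 0.1)` shows how a shared failure mode caps availability regardless of the redundancy, which is the motivation for choosing servers with low failure correlation.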

9.
The extreme complexity of grid systems makes it extremely difficult to achieve high service reliability, and this situation is aggravated by the fact that many grid services need to perform time-consuming tasks that may require several days or even months of computation. To improve grid service reliability, this paper studies a fault recovery technique in grid systems and conducts in-depth research on grid reliability modeling and analysis with fault recovery. Grid failures considered in this paper are classified into two categories: unrecoverable failures and recoverable failures. Software reliability is taken into account as well. To make fault recovery more practical, certain constraints on fault recovery are introduced and grid service reliability models under these practical constraints are developed. Numerical examples are presented, and based on the results obtained, the impact of fault recovery as well as that of practical constraints on grid service reliability is discussed.

10.
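Frequent faults have become one of the main obstacles to the robust development and large-scale application of grids, so research on the fault tolerance of grid systems is particularly important. Based on the characteristics of grid computing, the special fault tolerance requirements of the grid environment are identified. Taking users' quality-of-service requirements into account, a dynamic fault tolerance service architecture comprising grid fault detection and grid fault management is established, and the organization of the fault detection and fault management services and the specific functions of their component modules are described. Finally, a complete implementation process of the fault tolerance service is given.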

11.
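The resource environment of a mobile grid is highly dynamic: resources may join, leave, fail, or move at any time. A task replication strategy is adopted to tolerate resource unreliability. Resource reliability is characterized by the Weibull distribution, and a task replication model is built. The replication-based independent task scheduling problem is described formally, with the scheduling objective and constraints given, and is solved with a genetic algorithm. Simulation results show that the scheduling algorithm scales well and that scheduling performance is linearly related to resource reliability.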
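The two ingredients named in the abstract above, Weibull-distributed resource reliability and task replication, combine as in the following Python sketch. It assumes the standard two-parameter Weibull reliability function R(t) = exp(-(t/η)^β) and independent resource failures; the abstract gives neither the parameterization nor the replication model, so the function names and the independence assumption are illustrative only.

```python
import math


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Probability that a resource survives to time t: exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def replicated_success(t: float, resources) -> float:
    """P(at least one replica finishes) for a task of length t.

    resources is a list of (scale, shape) pairs, one per replica; replicas
    are assumed to fail independently.
    """
    all_fail = 1.0
    for scale, shape in resources:
        all_fail *= 1.0 - weibull_reliability(t, scale, shape)
    return 1.0 - all_fail
```

A scheduler along the lines of the abstract could use `replicated_success` as part of the fitness function that its genetic algorithm optimizes, though how the paper actually defines its objective is not stated here.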

12.
Availability is one of the most important requirements in production systems. Keeping a persistently high level of availability in Infrastructure-as-a-Service (IaaS) cloud computing is a challenge because of the complexity of service provisioning. By definition, availability can be maintained by coupling it with fault tolerance approaches. Many fault tolerance methods have been developed recently, but few of them adequately consider fault detection, which is critical for issuing the appropriate recovery actions in time. In this paper, based on a rigorous analysis of the nature of failures, we introduce a method for early identification of faults occurring in the IaaS system. By combining a fuzzy logic algorithm with a prediction technique, the proposed approach provides better performance in terms of accuracy and reaction rate, which in turn enhances system reliability.

13.
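Gu Jiawei. 《微机发展》 (Microcomputer Development), 2007, 17(8): 140-143
To build and deploy large-scale multi-agent systems, their fundamental problems must be identified and solved, one of which is possible partial system failure. Fault tolerance is therefore an unavoidable topic for large-scale multi-agent systems. This paper discusses this class of problems and proposes a fault tolerance method for multi-agent systems. The initial idea is to apply a replication strategy to agents, replicating agents in a critical state to avoid system failure. However, because an agent's criticality evolves during execution and the resources available to agents are bounded, the number of replicas of each agent must be adjusted dynamically and automatically to maximize their usefulness and reliability. The paper describes a method and associated mechanisms for evaluating an agent's criticality, for deciding which strategy to use (e.g. active or passive replication), and for parameterizing it (e.g. the number of replicas).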

14.
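Zhang Lin, Yang Jing. 《计算机应用》 (Journal of Computer Applications), 2004, 24(7): 16-17, 21
As a software fault tolerance mechanism, checkpointing can be combined with the grid, an emerging wide-area distributed system, to better meet the fault tolerance requirements of grid systems. The paper analyzes in detail the key points of the checkpoint rollback-recovery protocol, examines the GridCPR API in data grids, and proposes some improvements that benefit fault detection and fault tolerance services in grid systems.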

15.
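To improve the ability to cope with complex faults in power grids, a power grid fault diagnosis method based on dynamically modeled fuzzy Petri nets with information optimization is proposed. First, following the idea of hierarchical modeling and building on a conventional fault diagnosis model, the concepts of dynamic places, dynamic arcs, and dynamic transitions are introduced to fit the logical relationships between the various protections and circuit breakers, so that diagnosis models for compound faults are built dynamically. Second, fault information sources are optimized and preprocessed according to their characteristics in order to determine the nature of the fault and dynamically build the corresponding diagnosis model. Third, the model is trained with an intelligent optimization algorithm. Finally, the model's fault tolerance when fault information is missing and its generality when the system architecture changes are analyzed. Simulation results on an example system show that the algorithm markedly improves the hierarchical structure of the diagnosis process and the transparency, understandability, and maintainability of the diagnosis model, and that the diagnosis results remain highly credible when fault information is missing.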

16.
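Using software fault tolerance techniques to improve the reliability of Web service composition   Cited by: 1 (self-citations: 0, by others: 1)
One advantage of Web services is that basic services can be composed into more complex ones. Software fault tolerance techniques can be used to improve the reliability of such compositions. For composite services described as BPEL processes, this paper proposes a method that strengthens composite service reliability with software fault tolerance patterns and uses a stochastic reward net model to measure the reliability of the composite service.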

17.
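A reliable grid job scheduling mechanism   Cited by: 1 (self-citations: 1, by others: 0)
Tao Yongcai, Shi Lei. 《计算机应用》 (Journal of Computer Applications), 2010, 30(8): 2066-2069
To address the dynamic nature of the grid environment, a reliable grid job scheduling mechanism (DGJS) is proposed. According to job completion deadlines, DGJS classifies jobs into high-QoS, low-QoS, and no-QoS levels, with a different scheduling priority for each level. Based on resource availability prediction, DGJS uses a reliability-cost-based scheduling strategy that places jobs on highly reliable resource nodes wherever possible. In addition, DGJS applies different fault tolerance strategies to jobs of different QoS levels, saving grid resources while still tolerating faults. Experiments show that in a dynamic grid environment, DGJS raises the job success rate and reduces job completion time compared with traditional grid job scheduling algorithms.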

18.
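Fault tolerance techniques in power supplies   Cited by: 2 (self-citations: 1, by others: 1)
To meet users' demand for highly reliable, uninterrupted power, the author proposes treating the power supply of a large electronic system as a real-time control system and applying fault tolerance techniques in power supply design.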

19.
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

20.
In signal processing and communication systems, digital filters are widely employed. In some circumstances, the reliability of those systems is crucial, necessitating the use of fault-tolerant filter implementations. Many strategies have been presented over the years to achieve fault tolerance by exploiting the structure and properties of the filters. As technology advances, more complicated systems with several filters become possible. Some of the filters in those systems frequently operate in parallel, for example by applying the same filter to different input signals. Recently, a simple strategy for achieving fault tolerance that takes advantage of the availability of parallel filters was presented. The primary idea is to use structured authentication scan chains to study the internal states of finite impulse response (FIR) components in order to detect and recover the exact state of faulty modules from the state of non-faulty modules. Finally, a simple Double Modular Redundancy (DMR) based fault tolerance solution was developed that takes advantage of the availability of parallel filters for image denoising. This brief extends that approach to show how parallel filters can be protected using error correction codes (ECCs) in which each filter is comparable to a bit in a standard ECC. The suggested technique, "Advanced error recovery for parallel systems," can find and eliminate hidden defects in FIR modules and can also restore the system from multiple failures affecting two FIR modules. The implementation in Xilinx ISE 14.7 showed significant error reduction capability in the fault calculations and a reduction in area, which reduces the cost of implementation. Faults were introduced in all outputs of the functional filters, and the fault in every output was found to be corrected.
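The statement that "each filter is comparable to a bit in a standard ECC" can be illustrated in software with the following Python sketch. It relies on FIR filters being linear: a redundant filter fed the sum of several inputs must produce the sum of the corresponding outputs. The choice of a Hamming(7,4) layout (4 data filters, 3 check filters), the integer arithmetic, and all names are assumptions made for illustration; the abstract describes a hardware implementation and does not specify these details.

```python
def fir(h, x):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n >= k)
            for n in range(len(x))]


# Parity groups of a Hamming(7,4) code: each redundant filter processes the
# sum of the inputs of the data filters listed in its group.
PARITY = [(0, 1, 2), (0, 1, 3), (0, 2, 3)]
# Each data filter belongs to a unique set of groups, so the pattern of
# failed checks (the syndrome) identifies which data filter is faulty.
SYNDROME_TO_FILTER = {tuple(i in g for g in PARITY): i for i in range(4)}


def protected_outputs(h, inputs, fault=None):
    """Filter 4 integer input streams and correct a single faulty data filter.

    fault is an optional (filter_index, sample_index, error) triple that
    corrupts one data-filter output to emulate a hardware fault.
    """
    length = len(inputs[0])
    outs = [fir(h, x) for x in inputs]
    checks = [fir(h, [sum(inputs[i][n] for i in g) for n in range(length)])
              for g in PARITY]
    if fault is not None:
        i, n, error = fault
        outs[i][n] += error
    for n in range(length):
        syndrome = tuple(checks[p][n] != sum(outs[i][n] for i in g)
                         for p, g in enumerate(PARITY))
        bad = SYNDROME_TO_FILTER.get(syndrome)
        if bad is not None:
            # Rebuild the faulty sample from a check filter and the other
            # data filters in one group that contains the faulty filter.
            p, g = next((p, g) for p, g in enumerate(PARITY) if bad in g)
            outs[bad][n] = checks[p][n] - sum(outs[i][n] for i in g if i != bad)
    return outs
```

In this layout three redundant filters protect four data filters against a single faulty filter per sample, compared with four extra filters for duplication, which can only detect a mismatch. Handling faults in two modules at once, as the abstract claims for its technique, would need a stronger code than this sketch uses.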
