共查询到20条相似文献,搜索用时 31 毫秒
1.
组件应用服务器框架是一种特定形式的分布式对象系统平台,要求成为高可靠性的系统.这里指的可靠性主要是指错误容忍和错误恢复两个特性.本文的主要目标是建立基于分布式对象的组件应用服务器的软件容错服务框架.我们采用一种名叫对象容错服务(OFS)的办法解决对象容错,我们解决的问题包括:对象失效、节点错误、网络隔离和不可预知的通信延迟等.本文介绍了OFS的服务规范,并给出了一个OFS实现的系统结构. 相似文献
2.
主动容错机制通过预先发现即将故障的硬盘来提醒系统提前迁移备份危险数据,从而显著提高存储系统的可靠性。针对现有研究无法准确评价主动容错副本存储系统可靠性的问题,提出几种副本存储系统的状态转换模型,然后利用蒙特卡洛仿真算法实现了该模型,从而模拟主动容错副本存储系统的运行,最后统计系统在某个运行时期内发生数据丢失事件的期望次数。采用韦布分布函数模拟设备故障和故障修复事件的时间分布,并定量评价了主动容错机制、节点故障、节点故障修复、硬盘故障以及硬盘故障修复事件对存储系统可靠性的影响。实验结果表明,当预测模型的准确率达到50%时,系统的可靠性可以提高1~3倍;与二副本系统相比,三副本系统对系统参数更敏感。所提模型可以帮助系统管理者比较权衡不同的容错方式以及系统参数下的系统可靠性水平,从而搭建高可靠和高可用的存储系统。 相似文献
3.
High-performance supercomputers generally comprise millions of CPUs in which interconnection networks play an important role to achieve high performance. New design paradigms of dynamic on-chip interconnection network involve a) topology b) synthesis, modeling and evaluation c) quality of service, fault tolerance and reliability d) routing procedures. To construct a dynamic highly fault tolerant interconnection networks requires more disjoint paths from each source-destination node pair at each stage and dynamic rerouting capability to use the various available paths effectively. Fast routing and rerouting strategy is needed to provide reliable performance on switch/link failures. This paper proposes two new architecture designs of fault tolerant interconnection networks named as reliable interconnection networks (RIN-1 and RIN-2). The proposed layouts are multipath multi-stage interconnection networks providing four disjoint paths for all the source-destination node pairs with dynamic rerouting capability. The designs can withstand switch failures in all the stages (including input and output stages) and provide more reliability. Reliability analysis of various MIN architectures is evaluated. On comparing the results with some existing MINs it is evident that the proposed designs provides higher reliability values and fault tolerance. 相似文献
4.
片上网络互连拓扑综述 总被引:1,自引:0,他引:1
随着器件、工艺和应用技术的不断发展,片上多处理器已经成为主流技术,而且片上多处理器的规模越来越大、片内集成的处理器核数目越来越多,用于片内处理器核及其它部件之间互连的片上网络逐渐成为影响片上多处理器性能的瓶颈之一。片上网络的拓扑结构定义网络内部结点的物理布局和互连方法,决定和影响片上网络的成本、延迟、吞吐率、面积、容错能力和功耗等,同时影响网络路由策略和网络芯片的布局布线方法,是片上网络研究中的关键之一。对比了不同片上网络的拓扑结构,分析了各种结构的性能,并对未来片上网络拓扑研究提出建议。 相似文献
5.
Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651
Given that the reliability of a very large-scaled system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing including the most recent deployments with graphic processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs and these may cause a significant performance drop if it is not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between the GPGPU application performance and the reliability of a large GPU system. This work focuses on the checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on the application performance is thoroughly studied and discussed. 相似文献
6.
Basic concepts and taxonomy of dependable and secure computing 总被引:33,自引:0,他引:33
Avizienis A. Laprie J.-C. Randell B. Landwehr C. 《Dependable and Secure Computing, IEEE Transactions on》2004,1(1):11-33
This paper gives the main definitions relating to dependability, a generic concept including a special case of such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures. 相似文献
7.
Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking.
Its interest lies in the ability to obtain automatically systems satisfying by construction formal properties specified a
priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault
intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we specify formally
a fault hypothesis; we state some fault tolerance requirements; and we use DCS to obtain automatically a program, having the
same behavior as the initial fault intolerant one in the absence of faults, and satisfying the fault tolerance requirements
under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design
fault tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level, the satisfaction
of quantitative constraints, and so on. We show with numerous examples taken from case studies that our method can address
different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication
links, actuators, or sensors). Besides, we show that our method also offers an optimality criterion very useful to synthesize
fault tolerant systems compliant to the constraints of embedded systems, like power consumption. 相似文献
8.
深亚微米工艺使SoC芯片集成越来越复杂的功能,测试开发的难度也不断提高。由各种电路结构以及设计风格组成的异构系统使测试复杂度大大提高,增加了测试时间以及测试成本。描述了一款通讯基带SoC芯片的DFT实现,这款混合信号基带芯片包含模拟和数字子系统,IP核以及片上嵌入式存储器,为了满足测试需求,通过片上测试控制单元,控制SoC各种测试模式,支持传统的扫描测试以及专门针对深亚微米工艺的,操作在不同时钟频率和时钟域的基于扫描的延迟测试模式,可配置的片上存储器的BIST操作以及其它一些特定测试模式。 相似文献
9.
10.
Availability is one of the most important requirements in production system. Keeping a persistent level of high availability in the Infrastructure-as-a-Service (IaaS) cloud computing is a challenge due to the complexity of service providing. By definition, the availability can be maintained by coupling with the fault tolerance approaches. Recently, many fault tolerance methods have been developed, but few of them adequately consider the fault detection aspect, which is critical to issue the appropriate recovery actions just in time. In this paper, based on a rigorous analysis on the nature of failures, we would like to introduce a method to early identify the faults occurring in the IaaS system. By engaging fuzzy logic algorithm and prediction technique, the proposed approach can provide better performance in terms of accuracy and reaction rate, which subsequently enhances the system reliability. 相似文献
11.
针对航天测控系统的可靠性需求,提出了一种紧凑型PCI总线测控系统的冗余容错设计方案。系统下位机采用了基于VxWorks嵌入式操作系统来保证实时性,并在VxWorks系统中实现了高可用热插拔技术用于提高系统的冗余容错性能。提出了利用基于概率神经网络(PNN)的故障诊断方法对热冗余设备进行在线故障诊断。仿真与实验验证的结果表明,该系统具有良好的冗余容错性能,该设计方法可以有效提升系统的可靠性。 相似文献
12.
13.
14.
15.
In robot applications where the consequences of system failure are unbearable, fault tolerance is mandatory. Fault tolerant robots continue to function correctly despite component failures. Fault tolerant robots can be designed using the Helenic architecture. This architecture uses non-homogeneous functional modular redundancy and a democratic dynamic weighted voting algorithm for redundancy management to achieve fault tolerance. The benefits offered are increased reliability, maintainability, common mode failure resistance, and significant cost reductions. To demonstrate the fault tolerance capabilities of this system architecture, a 5 wheel omnidirectional mobile robot with sensors, computing elements and actuators was designed and simulated. Simulation results verify the robot's ability to continue ‘correct’ operation despite internal subsystem failures. 相似文献
16.
为了提高嵌入式系统在恶劣环境下的可靠性,除了在硬件上采用诸如双机冷备份之类的容错方案外,在实时操作系统级提供软件容错处理功能既可以减小硬件资源开销,又可以在不影响系统工作效率的前提下明显提高系统的容错纠错能力.本文针对RTEMS实时操作系统缺乏软件容错支持功能的不足,在操作系统级设计了一套两级软件容错的方案,提高了嵌入式系统的可靠性. 相似文献
17.
在硬件实时操作系统中,系统CPU的使用率是系统性能的一项重要指标,如果任务占据了系统的全部CPU,其它任务将无法继续运行,给系统带来灾难性后果。
通过分析实时操作系统中软件运行的特点,系统设计需要采取一定容错策略,以提高系统可靠性和容错能力。在μC/ OS-Ⅱ实时操作系统下对飞行控制软件中的任务进行实时监测。首先给出在μC/ OS Ⅱ实时操作系统下CPU使用率的计算方法,合理提出CPU的监测周期。其次,给出对CPU使用率异常的故障检测算法,对故障进行故障处置,提高系统的容错能力。最后,通过在MPC5674飞行控制计算机中编写嵌入式飞行控制软件来验证四种对CPU使用率异常的处置方法。仿真结果表明,实时操作系统中CPU的软件容错方法可以有效提高系统可靠性和容错能力。 相似文献
18.
Multistage interconnection networks (MINs) are widely used for reliable data communication in a tightly coupled large-scale multiprocessor system. High reliability of MINs can be achieved using fault tolerance techniques. The fault tolerance is generally achieved by disjoint paths available through multiple connectivity options. The gamma interconnection network (GIN) is a class of fault tolerant MINs providing alternate paths for source–destination node pairs. Various 2-disjoint and 3-disjoint GIN architectures have been presented in the literature. In this paper, two new designs of 4-disjoint paths multistage interconnection networks, called 4-disjoint gamma interconnection networks (4DGIN-1 and 4DGIN-2) are proposed. The proposed 4DGINs provide four disjoint paths for each source–destination pair and can tolerate three switches/link failures in intermediate interconnection layers. Proposed designs are highly reliable GIN with higher fault-tolerant capability than other gamma networks at low cost. Terminal pair reliabilities of proposed designs and various other 2-disjoint and 3-disjoint GINs are evaluated, analyzed and compared. Reliability values of proposed designs are found higher. 相似文献
19.
The growing size of a multiprocessor system increases its vulnerability to component failures. In order to maintain the system's high reliability, it is crucial to identify and replace the faulty processors through testing, a process known as fault diagnosis. The minimum size of a largest connected component in such a networked system is typically used as a measure for fault tolerance of the system. For this measure, the conditional diagnosability of the system in terms of an alternating group network is important, which is studied in the present paper under a comparison model, with some precise and useful bounds of tolerance derived. 相似文献
20.
计算机控制系统的容错技术 总被引:1,自引:0,他引:1
计算机控制系统的可靠性设计是实现柔性智能控制所面临的一个重要课题,而容错技术是系统可靠性设计的关键技术。本文在分析计算机系统可靠性设计的基础上,综述了容错技术的发展、研究的主要内容及实现的主要方法,对常用的几种容错结构进行了比较和评价。指出了对计算机系统进行容错设计必须解决的主要问题。 相似文献