1.

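赵季中 齐勇 侯迪 《小型微型计算机系统》2003,24(1):26-29

A component application server framework is a specific form of distributed object system platform and is required to be highly reliable. Reliability here refers mainly to two properties: fault tolerance and fault recovery. The main goal of this paper is to build a software fault-tolerance service framework for component application servers based on distributed objects. Object fault tolerance is handled with an approach called the Object Fault-tolerance Service (OFS); the problems addressed include object failure, node faults, network partitioning, and unpredictable communication delays. The paper presents the OFS service specification and gives a system architecture for an OFS implementation.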

2.

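Reliability analysis model for proactive fault-tolerant replicated storage systems

李静 罗金飞 李炳超 《计算机应用》2021,41(4):1113-1121

Proactive fault tolerance significantly improves storage system reliability by detecting soon-to-fail disks in advance and prompting the system to migrate or back up at-risk data ahead of time. Because existing research cannot accurately evaluate the reliability of replicated storage systems with proactive fault tolerance, several state-transition models of replicated storage systems are proposed and implemented with a Monte Carlo simulation algorithm that simulates the operation of such a system; the expected number of data-loss events within a given operating period is then computed. The Weibull distribution is used to model the time distributions of device failure and failure-repair events, and the effects of the proactive fault-tolerance mechanism, node failures, node failure repairs, disk failures, and disk failure repairs on storage system reliability are evaluated quantitatively. Experimental results show that when the prediction model's accuracy reaches 50%, system reliability can improve by a factor of 1 to 3, and that three-replica systems are more sensitive to system parameters than two-replica systems. The proposed model can help system administrators weigh the reliability achieved under different fault-tolerance schemes and system parameters, and thereby build highly reliable, highly available storage systems.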
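
As a rough illustration of the Monte Carlo idea in this abstract, the sketch below estimates the probability that all replicas of a data item are lost within a mission time when disk lifetimes follow a Weibull distribution. It is a toy model, not the paper's state-transition model: it assumes no repair, independent disks, and that a correctly predicted failure never causes loss. All function names, parameters, and values are hypothetical.

```python
import math
import random


def weibull_cdf(t, scale, shape):
    """Probability that a Weibull-distributed lifetime ends by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def simulate_loss_probability(replicas, mission_hours, scale, shape,
                              prediction_accuracy, trials, seed=0):
    """Estimate the probability that every replica is lost within the mission.

    Toy assumptions (not the paper's model): no repair, independent disks,
    and a correctly predicted failure never causes loss because the data
    is migrated in time.
    """
    rng = random.Random(seed)
    losses = 0
    for _ in range(trials):
        lost_replicas = 0
        for _ in range(replicas):
            # weibullvariate(alpha, beta): alpha is the scale, beta the shape.
            lifetime = rng.weibullvariate(scale, shape)
            failed = lifetime <= mission_hours
            predicted = rng.random() < prediction_accuracy
            if failed and not predicted:
                lost_replicas += 1
        if lost_replicas == replicas:
            losses += 1
    return losses / trials


def exact_loss_probability(replicas, mission_hours, scale, shape,
                           prediction_accuracy):
    """Closed form for the same toy model, useful as a sanity check."""
    per_disk = weibull_cdf(mission_hours, scale, shape) * (1.0 - prediction_accuracy)
    return per_disk ** replicas
```

Comparing the simulated estimate with `exact_loss_probability` is a quick way to check the simulator before adding repair events or node failures, for which no simple closed form exists.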

3.

Reliable multistage interconnection network design

S. Rajkumar Neeraj Kumar Goyal 《Peer-to-Peer Networking and Applications》2016,9(6):979-990

High-performance supercomputers generally comprise millions of CPUs, and interconnection networks play an important role in achieving high performance. New design paradigms for dynamic on-chip interconnection networks involve a) topology, b) synthesis, modeling and evaluation, c) quality of service, fault tolerance and reliability, and d) routing procedures. Constructing a dynamic, highly fault-tolerant interconnection network requires more disjoint paths for each source-destination node pair at each stage, together with dynamic rerouting capability to use the available paths effectively. A fast routing and rerouting strategy is needed to provide reliable performance under switch/link failures. This paper proposes two new architecture designs for fault-tolerant interconnection networks, named reliable interconnection networks (RIN-1 and RIN-2). The proposed layouts are multipath multistage interconnection networks (MINs) providing four disjoint paths for all source-destination node pairs with dynamic rerouting capability. The designs can withstand switch failures in all stages (including the input and output stages) and provide higher reliability. The reliability of various MIN architectures is evaluated. Comparison with some existing MINs shows that the proposed designs provide higher reliability values and fault tolerance.

4.

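A survey of network-on-chip interconnection topologies (total citations: 1; self-citations: 0; citations by others: 1)

王炜 乔林 汤志忠 《计算机科学》2011,38(10):1-5

With the continuing development of device, process, and application technologies, chip multiprocessors have become mainstream. They keep growing in scale and integrate ever more processor cores, and the network-on-chip that interconnects the on-chip cores and other components is becoming one of the bottlenecks limiting chip multiprocessor performance. The topology of a network-on-chip defines the physical layout and interconnection of the network's nodes; it determines or affects the network's cost, latency, throughput, area, fault tolerance, and power consumption, and also influences the routing strategy and the chip's placement and routing, which makes it one of the key issues in network-on-chip research. This paper compares different network-on-chip topologies, analyzes the performance of each, and offers suggestions for future research on network-on-chip topology.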

5.

Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651

Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.
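
To make the trade-off between checkpoint cost and failure rate concrete, the sketch below uses Young's classical first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). This is a well-known baseline, not the scheduling model proposed in the paper. The helper `system_mtbf` additionally assumes independent nodes with exponentially distributed failures; all names and numbers are illustrative.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal time between checkpoints, in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def system_mtbf(node_mtbf_s, node_count):
    """MTBF of a system that fails when any one node fails.

    Assumes independent nodes with exponentially distributed failures.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_mtbf_s / node_count


# Illustrative use: more nodes shorten the system MTBF, which in turn
# shortens the recommended checkpoint interval.
small = young_checkpoint_interval(60.0, system_mtbf(3.6e6, 100))
large = young_checkpoint_interval(60.0, system_mtbf(3.6e6, 10000))
```

Under these assumptions the interval shrinks with the square root of the node count, which is one way to see why checkpoint overhead becomes a first-order concern at scale.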

6.

Basic concepts and taxonomy of dependable and secure computing (total citations: 33; self-citations: 0; citations by others: 33)

Avizienis A. Laprie J.-C. Randell B. Landwehr C. 《Dependable and Secure Computing, IEEE Transactions on》2004,1(1):11-33

This paper gives the main definitions relating to dependability, a generic concept including as special cases such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.

7.

Automating the addition of fault tolerance with discrete controller synthesis

Alain Girault Éric Rutten 《Formal Methods in System Design》2009,35(2):190-225

Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking. Its interest lies in the ability to automatically obtain systems that satisfy, by construction, formal properties specified a priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault-intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we formally specify a fault hypothesis; we state some fault tolerance requirements; and we use DCS to automatically obtain a program that has the same behavior as the initial fault-intolerant one in the absence of faults and satisfies the fault tolerance requirements under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design fault-tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level, the satisfaction of quantitative constraints, and so on. We show with numerous examples taken from case studies that our method can address different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication links, actuators, or sensors). In addition, we show that our method offers an optimality criterion that is very useful for synthesizing fault-tolerant systems compliant with the constraints of embedded systems, such as power consumption.

8.

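Design for testability of a deep-submicron SoC chip

胡剑 沈绪榜 王涛 《计算机工程与应用》2008,44(23):88-92

Deep-submicron processes let SoC chips integrate increasingly complex functions, and test development becomes correspondingly harder. Heterogeneous systems composed of diverse circuit structures and design styles greatly increase test complexity, lengthening test time and raising test cost. This paper describes the DFT implementation of a communication baseband SoC. The mixed-signal baseband chip contains analog and digital subsystems, IP cores, and on-chip embedded memories. To meet the test requirements, an on-chip test control unit controls the SoC's test modes. These modes support conventional scan testing; scan-based delay testing, aimed specifically at deep-submicron processes, that operates across different clock frequencies and clock domains; configurable BIST for the on-chip memories; and several other special-purpose test modes.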

9.

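Design of an embedded monitoring and fault diagnosis system for construction machinery (total citations: 1; self-citations: 0; citations by others: 1)

刘燕 周国荣 徐丽莎 《工业控制计算机》2008,21(8)

This paper introduces a monitoring and fault diagnosis system based on an ARM-Linux embedded system and describes its hardware and software design. GPRS wireless communication, fault tree analysis, and expert system theory are combined to study the monitoring and fault diagnosis of construction machinery. The system integrates real-time data acquisition, signal processing, online condition monitoring, network communication, and fault diagnosis, and uses an expert control and diagnosis system to make construction machinery monitoring and diagnosis reliable, compact, and intelligent. The system has high potential for industrial application.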

10.

Early fault detection in IaaS cloud computing based on fuzzy logic and prediction technique

Dinh-Mao Bui Thien Huynh-The Sungyoung Lee 《The Journal of supercomputing》2018,74(11):5730-5745

Availability is one of the most important requirements in production systems. Keeping a persistently high level of availability in Infrastructure-as-a-Service (IaaS) cloud computing is a challenge because of the complexity of service provision. By definition, availability can be maintained by coupling it with fault tolerance approaches. Recently, many fault tolerance methods have been developed, but few of them adequately consider the fault detection aspect, which is critical for issuing the appropriate recovery actions just in time. In this paper, based on a rigorous analysis of the nature of failures, we introduce a method to identify faults occurring in the IaaS system early. By employing a fuzzy logic algorithm and a prediction technique, the proposed approach provides better performance in terms of accuracy and reaction rate, which subsequently enhances system reliability.

11.

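Redundant fault-tolerant system based on PNN fault diagnosis under VxWorks

李丹 胡晓光 《计算机工程与应用》2016,52(15):13-18

To meet the reliability requirements of aerospace measurement and control systems, a redundant fault-tolerant design for a CompactPCI-bus measurement and control system is proposed. The system's lower-level computer runs the VxWorks embedded operating system to guarantee real-time behavior, and high-availability hot-swap technology is implemented in VxWorks to improve redundancy and fault tolerance. A fault diagnosis method based on a probabilistic neural network (PNN) is proposed for online fault diagnosis of the hot-redundant devices. Simulation and experimental results show that the system has good redundant fault-tolerance performance and that the design method effectively improves system reliability.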

12.

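Research and design of a middleware-based fault-tolerant system

姚兰 桂勋 巨军让 《计算机工程》2007,33(6):83-85,1

As hardware fault-tolerance technology matures, software fault tolerance has become a focus of efforts to improve system reliability. Developing fault-tolerant applications directly is very difficult. Because middleware provides a good development environment for application systems, this paper studies and designs a middleware-based fault-tolerant system model and proposes a new method for constructing the node fault-tolerance structure, forming a fairly complete solution to the key fault-tolerance problems of redundancy, failure detection, and recovery. The system's reliability is derived using a Markov process, verifying the soundness and reliability of the design.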
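
As a worked illustration of deriving reliability from a Markov process, consider the simplest redundant case rather than the paper's own model: two identical nodes in hot standby, each failing at a constant rate $\lambda$, with no repair and perfect failure detection (all of these are assumptions made here for illustration). Let $P_2(t)$ and $P_1(t)$ be the probabilities that two nodes and one node are working at time $t$.

```latex
\begin{aligned}
\frac{dP_2}{dt} &= -2\lambda P_2, & P_2(0) &= 1, \\
\frac{dP_1}{dt} &= 2\lambda P_2 - \lambda P_1, & P_1(0) &= 0, \\[4pt]
P_2(t) &= e^{-2\lambda t}, \qquad
P_1(t) = 2\left(e^{-\lambda t} - e^{-2\lambda t}\right), \\[4pt]
R(t) &= P_2(t) + P_1(t) = 2e^{-\lambda t} - e^{-2\lambda t}.
\end{aligned}
```

For comparison, a single node has $R(t) = e^{-\lambda t}$, so the redundant pair is more reliable for every $t > 0$ under these assumptions.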

13.

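Communication scheme for a multi-core chipset based on network-on-chip

侯宁 卢亚鹏 张多利 《计算机时代》2014,(10):17-18

Having multiple chips work together is an inexpensive, low-risk solution for high-density computing applications. Because data communication in a network-on-chip (NoC) is concurrent and decoupled, multiple NoC multi-core chips can easily be integrated at board level to work together as an NoC multi-core chipset, quickly providing greater processing capability. In a high-performance image processing project whose hardware system consists mainly of four fully interconnected NoC multi-core chips, the transmission of packet data between different multi-core chips is studied, and a hardware-implemented communication scheme for the multi-core chipset is proposed. The scheme has been applied in that project.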

14.

Building a reliable and high-performance content-based publish/subscribe system

Yaxiong Zhao Jie Wu 《Journal of Parallel and Distributed Computing》2013

15.

Helenic fault tolerance for robots

George Toye Larry J. Leifer 《Computers & Electrical Engineering》1994,20(6):479-497

In robot applications where the consequences of system failure are unbearable, fault tolerance is mandatory. Fault tolerant robots continue to function correctly despite component failures. Fault tolerant robots can be designed using the Helenic architecture. This architecture uses non-homogeneous functional modular redundancy and a democratic dynamic weighted voting algorithm for redundancy management to achieve fault tolerance. The benefits offered are increased reliability, maintainability, common mode failure resistance, and significant cost reductions. To demonstrate the fault tolerance capabilities of this system architecture, a five-wheel omnidirectional mobile robot with sensors, computing elements and actuators was designed and simulated. Simulation results verify the robot's ability to continue ‘correct’ operation despite internal subsystem failures.
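
The abstract does not spell out the voting algorithm, so the following is only a generic sketch of dynamic weighted voting among redundant modules, not the Helenic algorithm itself. It assumes that module outputs are discrete and directly comparable, and the penalty, reward, and cap values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(outputs, weights):
    """Return the output value backed by the largest total weight.

    outputs maps module name -> reported value; weights maps module
    name -> current trust weight. Ties go to the value seen first.
    """
    totals = defaultdict(float)
    for module, value in outputs.items():
        totals[value] += weights[module]
    return max(totals, key=totals.get)


def update_weights(outputs, weights, winner, penalty=0.5, reward=1.1, cap=1.0):
    """Lower the weight of modules that disagreed with the winner and
    raise (up to a cap) the weight of modules that agreed."""
    updated = {}
    for module, value in outputs.items():
        factor = reward if value == winner else penalty
        updated[module] = min(cap, weights[module] * factor)
    return updated


# One voting round with three redundant modules, one of them faulty.
readings = {"a": 1, "b": 1, "c": 2}
trust = {"a": 1.0, "b": 1.0, "c": 1.0}
decision = weighted_vote(readings, trust)
trust = update_weights(readings, trust, decision)
```

Because weights persist between rounds, a module that keeps disagreeing loses influence over time, while a module that recovers can regain it gradually.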

16.

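Software fault-tolerance design in the RTEMS embedded system (total citations: 1; self-citations: 0; citations by others: 1)

张靓 刘光明 《计算机工程与科学》2007,29(5):147-151

To improve the reliability of embedded systems in harsh environments, hardware fault-tolerance schemes such as dual-machine cold standby can be complemented by software fault-tolerance support at the real-time operating system level, which reduces hardware resource overhead and markedly improves the system's ability to tolerate and correct faults without degrading its efficiency. Because the RTEMS real-time operating system lacks software fault-tolerance support, this paper designs a two-level software fault-tolerance scheme at the operating system level, improving the reliability of the embedded system.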

17.

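Software fault tolerance for CPU usage monitoring in real-time operating systems

王余伟 曹东 施书成 《计算机工程与科学》2018,40(8):1337-1343

In a hardware real-time operating system, CPU usage is an important indicator of system performance: if one task takes up all of the CPU, other tasks cannot continue to run, with catastrophic consequences for the system. An analysis of how software runs in a real-time operating system shows that the system design needs a fault-tolerance strategy to improve reliability and fault tolerance. Tasks in flight control software are monitored in real time under the μC/OS-II real-time operating system. First, a method for computing CPU usage under μC/OS-II is given and a reasonable CPU monitoring period is proposed. Second, a fault detection algorithm for abnormal CPU usage is given, and detected faults are handled to improve the system's fault tolerance. Finally, four methods of handling abnormal CPU usage are verified by writing embedded flight control software for an MPC5674 flight control computer. Simulation results show that the software fault-tolerance method for the CPU in a real-time operating system can effectively improve system reliability and fault tolerance.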
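
The abstract does not give the paper's formulas, so the sketch below only illustrates the general idle-counter idea commonly used for this purpose (μC/OS-II's statistics task works along similar lines): usage is inferred from how far an idle-task counter advanced during one monitoring period compared with its maximum on an unloaded system. The function names, the 90% threshold, and the three-period rule are hypothetical, and the code is written in Python for readability rather than as embedded C.

```python
def cpu_usage_percent(idle_count, idle_count_max):
    """CPU usage (0-100) from an idle-task counter for one monitoring period.

    idle_count_max is the count reached in one period when only the idle
    task runs; idle_count is the count reached in the period just ended.
    """
    if idle_count_max <= 0:
        raise ValueError("idle_count_max must be positive")
    idle_count = min(max(idle_count, 0), idle_count_max)
    return 100 - (100 * idle_count) // idle_count_max


def detect_overload(usage_samples, threshold=90, consecutive=3):
    """Return the index at which usage has exceeded the threshold for
    `consecutive` periods in a row, or None if that never happens."""
    run = 0
    for index, usage in enumerate(usage_samples):
        run = run + 1 if usage > threshold else 0
        if run >= consecutive:
            return index
    return None
```

Requiring several consecutive over-threshold periods is one simple way to avoid reacting to a single short burst; the right period length and threshold depend on the task set.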

18.

Design of 4-disjoint gamma interconnection network layouts and reliability analysis of gamma interconnection networks

S. Rajkumar Neeraj Kumar Goyal 《The Journal of supercomputing》2014,69(1):468-491

Multistage interconnection networks (MINs) are widely used for reliable data communication in tightly coupled large-scale multiprocessor systems. High reliability of MINs can be achieved using fault tolerance techniques. Fault tolerance is generally achieved through disjoint paths made available by multiple connectivity options. The gamma interconnection network (GIN) is a class of fault-tolerant MINs providing alternate paths for source–destination node pairs. Various 2-disjoint and 3-disjoint GIN architectures have been presented in the literature. In this paper, two new designs of 4-disjoint-path multistage interconnection networks, called 4-disjoint gamma interconnection networks (4DGIN-1 and 4DGIN-2), are proposed. The proposed 4DGINs provide four disjoint paths for each source–destination pair and can tolerate three switch/link failures in the intermediate interconnection layers. The proposed designs are highly reliable GINs with higher fault-tolerant capability than other gamma networks at low cost. Terminal pair reliabilities of the proposed designs and of various other 2-disjoint and 3-disjoint GINs are evaluated, analyzed and compared. The reliability values of the proposed designs are found to be higher.
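
To show why additional disjoint paths raise terminal pair reliability, the sketch below uses a deliberately simplified series-parallel model, not the paper's analysis: each path is a series chain of independent components with identical reliability, and the paths share no components. Source and destination switches, which real networks may share between paths, are ignored. All names and values are illustrative.

```python
def path_reliability(component_reliability, components_per_path):
    """Reliability of one path: every component in series must work."""
    return component_reliability ** components_per_path


def disjoint_paths_reliability(component_reliability, components_per_path,
                               disjoint_paths):
    """Probability that at least one of several disjoint paths works."""
    single = path_reliability(component_reliability, components_per_path)
    return 1.0 - (1.0 - single) ** disjoint_paths


# Compare 2, 3 and 4 disjoint paths with three components per path.
results = {k: disjoint_paths_reliability(0.9, 3, k) for k in (2, 3, 4)}
```

With `k` disjoint paths the connection survives any `k - 1` component failures in this model, which matches the abstract's statement that four disjoint paths tolerate three failures in the intermediate layers.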

19.

Conditional diagnosability of alternating group networks

Shuming Zhou Wenjun Xiao 《Information Processing Letters》2010,110(10):403-297

The growing size of a multiprocessor system increases its vulnerability to component failures. In order to maintain the system's high reliability, it is crucial to identify and replace the faulty processors through testing, a process known as fault diagnosis. The minimum size of a largest connected component in such a networked system is typically used as a measure for fault tolerance of the system. For this measure, the conditional diagnosability of the system in terms of an alternating group network is important, which is studied in the present paper under a comparison model, with some precise and useful bounds of tolerance derived.

20.

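Fault-tolerance techniques for computer control systems (total citations: 1; self-citations: 0; citations by others: 1)

王敏 黄心汉 《微处理机》1995,(1):18-22

Reliability design of computer control systems is an important issue in realizing flexible intelligent control, and fault tolerance is the key technology in system reliability design. Building on an analysis of reliability design for computer systems, this paper surveys the development of fault-tolerance techniques, their main research topics, and the main methods of implementing them, and compares and evaluates several commonly used fault-tolerant architectures. It also identifies the main problems that must be solved in the fault-tolerant design of computer systems.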