Similar literature
20 similar documents found.
1.
Fault tolerance is well suited not only to fail-safe systems but also to reducing the effect of a failure and continuing the remaining task. The proposed fault-tolerant architecture covers the software design of error detection and diagnosis as well as error recovery. Two configurations are employed and evaluated in terms of reliability and effectiveness: multiple tasks managed by one computer with a redundant backup, and multiple computers that provide redundancy for one another. The executing program is supervised by a watchdog, which signals a software failure when the execution time of any subprogram exceeds its default value. The computers periodically send heartbeat signals to each other, and the content of a received signal indicates whether the system has failed. The entire heartbeat detection function is supervised by a time daemon to ensure that fault recovery is feasible. Once a computer fails, the other computer immediately takes over its role and completes the remaining task.
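A minimal Python sketch of the watchdog and heartbeat supervision described in this abstract. The class and function names, the timeout values, and the explicit `now` argument (passed in instead of reading a real clock, so the logic is easy to follow) are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the time of the last heartbeat received from each peer computer."""
    timeout: float  # seconds of silence after which a peer is presumed failed (assumed)
    last_seen: dict = field(default_factory=dict)

    def beat(self, peer: str, now: float) -> None:
        # Record that `peer` sent a heartbeat at time `now`.
        self.last_seen[peer] = now

    def failed_peers(self, now: float) -> list:
        # A peer counts as failed when its last heartbeat is older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


def watchdog_expired(started_at: float, now: float, budget: float) -> bool:
    """Return True when a subprogram has run longer than its default time budget."""
    return now - started_at > budget


def take_over(monitor: HeartbeatMonitor, now: float, tasks_by_peer: dict, me: str) -> dict:
    """Move the remaining tasks of every failed peer onto the surviving computer `me`."""
    for peer in monitor.failed_peers(now):
        if peer != me:
            tasks_by_peer.setdefault(me, []).extend(tasks_by_peer.pop(peer, []))
    return tasks_by_peer
```

In a real deployment the heartbeats would arrive over a network and `now` would come from a monotonic clock; the sketch only shows the decision logic.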

2.
An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

3.
Several different approaches for implementing conversations in message-based distributed computer systems (DCSs) are discussed. Two different exit control strategies (synchronous and asynchronous) and three different approaches to execution of the conversation acceptance test (centralized, decentralized, and semicentralized) are examined and compared in terms of system performance and implementation cost. An efficient approach to run-time management of recovery information based on an extension of the recovery cache scheme is also discussed. The two major types of conversation structures, name-linked recovery block and abstract data type conversations, are examined to analyze which execution approaches are the most efficient for each conversation structure. As a case study, an unmanned vehicle system is used to illustrate how the approaches can be used in a realistic real-time application.

4.
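Wang Zhun, Chen Junliang. Chinese Journal of Computers (《计算机学报》), 1998, 21(8): 730-737
Message logging is a method for state recovery in multi-process and distributed systems. Traditional message logging methods apply only to deterministic processes. To address this limitation, this paper proposes a new message logging approach that takes full account of the positive role nondeterminism can play in fault tolerance, and argues that nondeterministic behavior can be tolerated to a certain degree as long as the consistency semantics of the application processes are satisfied. This offers a new perspective on the problem that state reconstruction cannot be fully exact when nondeterminism is present in a single process or in a distributed concurrent system. As a result, message logging also becomes applicable to some processes that do not satisfy the determinism condition.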

5.
This paper introduces a novel execution paradigm called the Write-Only Architecture (WOA) that reduces communication latency overheads by up to a factor of five over previous methods. The WOA writes data through distributed control flow logic rather than using a read–write paradigm or a centralised message hub which allows tasks to be partitioned at a fine-grained level without suffering from excessive communication overheads on distributed systems. In this paper we provide formal assignment results for software benchmarks partitioned using the WOA and previous execution paradigms for distributed heterogeneous architectures along with bounds and complexity information to demonstrate the robust performance improvements possible with the WOA.

6.
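With the rapid development of the Internet of Things (IoT), the number of devices is growing exponentially, and the accompanying IoT security problems are receiving more and more attention. IoT device integrity authentication usually relies on software attestation to verify device integrity, so that tampering with system integrity caused by the execution of malicious software on a device can be detected in time. However, existing IoT software attestation suffers from problems such as poor performance when large numbers of devices are attested synchronously and the difficulty of extending general-purpose IoT communication protocols. To address these problems, this paper presents a lightweight asynchronous integrity monitoring scheme that extends the general-purpose MQTT protocol with software attestation security authentication messages and pushes device integrity information asynchronously, improving the efficiency of verifying device integrity attestations while keeping the IoT system highly secure. Our scheme implements three security functions: device integrity measurement implemented as a kernel module, a lightweight MQTT-based authentication extension for device identity and integrity, and asynchronous integrity monitoring based on the extended MQTT protocol. The scheme resists common attacks on software attestation and on the MQTT protocol, and features lightweight asynchronous software attestation and a general-purpose MQTT security extension. Finally, experimental results on an MQTT-based IoT authentication prototype system show that integrity measurement of IoT nodes, MQTT connection authentication, and PUBLISH packet message authentication all perform well enough to meet the application requirements of integrity monitoring for massive numbers of IoT devices.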

7.
This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in only very few applications. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, result in a 31% average improvement on our benchmarks.

8.
Real-time distributed systems include communicating tasks that interact via message-passing. In such systems the timely delivery of messages is essential for meeting task timing constraints. Consequently, in addition to task execution times, message delivery times must also be constrained. In order to minimize the number of failures to meet timing constraints, message communication protocols, in addition to task scheduling algorithms, play a crucial role. A legitimate question to ask is whether making such protocols adaptive to run-time system and environment status can significantly improve system performance. Consequently, a run-time monitoring approach to adaptive real-time distributed systems is proposed; the work focuses on an investigation of adaptive message communication protocols and corresponding run-time support mechanisms. Simulation is used to obtain performance results. It is concluded that although improvement is obtained, it may not be significant enough to offset the increased overhead and requirement for task information.

9.
An intelligent service-based network architecture for wearable robots (cited 2 times: 0 self-citations, 2 citations by others)
We are developing a novel robot concept called the wearable robot. Wearable robots are mobile information devices capable of supporting remote communication and intelligent interaction between networked entities. In this paper, we explore the possible functions of such a robotic network and will present a distributed network architecture based on service components. In order to support the interaction and communication between the components in the wearable robot system, we have developed an intelligent network architecture. This service-based architecture involves three major mechanisms. The first mechanism involves the use of a task coordinator service such that the execution of the services can be managed using a priority queue. The second mechanism enables the system to automatically push the required service proxy to the client intelligently based on certain system-related conditions. In the third mechanism, we allow the system to automatically deliver services based on contextual information. Using a fuzzy-logic-based decision making system, the matching service can determine whether the service should be automatically delivered utilizing the information provided by the service, client, lookup service, and context sensors. An application scenario has been implemented to demonstrate the feasibility of this distributed service-based robot architecture. The architecture is implemented as extensions to the Jini network model.
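The first mechanism in this abstract, a task coordinator that orders service execution with a priority queue, can be sketched in Python as follows. This is only an illustration: the class name, the method names, and the convention that a lower number means a higher priority are assumptions, and the sketch does not use or imitate the Jini API.

```python
import heapq
import itertools


class TaskCoordinator:
    """Hypothetical coordinator that releases services in priority order."""

    def __init__(self):
        self._heap = []
        # Monotonic counter so that equal priorities are served first come, first served.
        self._order = itertools.count()

    def submit(self, priority: int, service_name: str) -> None:
        # Assumption: a lower number means a more urgent service.
        heapq.heappush(self._heap, (priority, next(self._order), service_name))

    def next_service(self):
        # Return the most urgent pending service, or None when nothing is queued.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The tie-breaking counter keeps the ordering stable, so two services submitted with the same priority run in submission order.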

10.
To find the cause of a functional or non-functional defect (bug) in software running on a multi-processor System-on-Chip (MPSoC), developers need insight into the chip. Tracing systems provide this insight non-intrusively, at the cost of high off-chip bandwidth requirements. This I/O bottleneck limits the observability, a problem becoming more severe as more functionality is integrated on-chip. In this paper, we present DiaSys, an MPSoC diagnosis system with the potential to replace today’s tracing systems. Its main idea is to partially execute the analysis of observation data on the chip; in consequence, more information and less data is sent to the attached host PC. With DiaSys, the data analysis is performed by the diagnosis application. Its inputs are events, which are generated by observation hardware at interesting points in the program execution (like a function call). Its outputs are events with higher information density. The event transformation is modeled as a dataflow application. For execution, it is mapped in part to dedicated and distributed on-chip components, and in part to the host PC; the off-chip boundary is transparent to the developer of the diagnosis application. We implement DiaSys as an extension to an existing SoC with four tiles and a mesh network running on an FPGA platform. Two usage examples confirm that DiaSys is flexible enough to replace a tracing system, while significantly lowering the off-chip bandwidth requirements. In our examples, the debugging of a race-condition bug, and the creation of a lock contention profile, we see a reduction of trace bandwidth of more than three orders of magnitude, compared to a full trace created by a common tracing system.

11.
This paper presents a recovery technique for distributed communicating process systems. It handles both hardware faults and software faults uniformly. Differing from other recovery techniques, it brings an extremely small amount of execution overhead to nonfailing processes and can be implemented easily. It can be applied to a programming procedure to mask software design errors or embedded into an operating system to enhance the reliability of the whole system. The theoretical work is carried out first, then the implementation problems are considered, and the evaluation techniques are discussed last.

12.
Accessing and updating information in a self-organizing data structure in a distributed environment requires execution of various distributed algorithms. Design of such algorithms is often facilitated by the use of a distributed termination detection algorithm superimposed on top of another distributed algorithm. The problem of distributed termination detection is considered, and message counting is introduced as an effective technique in designing such algorithms. A class of efficient algorithms, based on the idea of message counting, for this problem is presented. After termination has occurred, it is detected within a small number of message communications. These algorithms do not require the FIFO (first in, first out) property for the communication lines. Assumptions regarding the connectivity of the processes are simple. The algorithms are incrementally developed, i.e. a succession of algorithms leading to the final algorithms is presented.
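To illustrate the general idea of message counting, the Python sketch below checks for termination by comparing two successive passes over per-process send and receive counters. It is a simplified, swapped-in illustration and not the class of algorithms this abstract presents; the `Counters` record and the `looks_terminated` function are hypothetical names. The check declares termination only when nothing changed between the two passes, every process is idle, and the total number of messages sent equals the total received, so no message can still be in transit. It does not rely on FIFO delivery.

```python
from typing import List, NamedTuple


class Counters(NamedTuple):
    """Per-process state collected by one pass of a (hypothetical) detector."""
    sent: int       # messages this process has sent so far
    received: int   # messages this process has received so far
    idle: bool      # True when the process has no local work left


def looks_terminated(first_pass: List[Counters], second_pass: List[Counters]) -> bool:
    """Conservative termination test based on two consecutive counter passes."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must cover the same processes")
    # No process sent, received, or changed state between the two passes.
    unchanged = first_pass == second_pass
    all_idle = all(c.idle for c in second_pass)
    # Equal totals mean every message that was sent has also been received.
    balanced = sum(c.sent for c in second_pass) == sum(c.received for c in second_pass)
    return unchanged and all_idle and balanced
```

Checking idleness alone would be unsafe: two idle processes with one message still in transit would fail the balance test, which is exactly the case the counters are there to catch.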

13.
The mobile agents create a new paradigm for data exchange and resource sharing in rapidly growing and continually changing computer networks. In a distributed system, failures can occur in any software or hardware component. A mobile agent can get lost when its hosting server crashes during execution, or it can get dropped in a congested network. Therefore, survivability and fault tolerance are vital issues for deploying mobile-agent systems. This fault tolerance approach deploys three kinds of cooperating agents to detect server and agent failures and recover services in mobile-agent systems. An actual agent is a common mobile agent that performs specific computations for its owner. Witness agents monitor the actual agent and detect whether it's lost. A probe recovers the failed actual agent and the witness agents. A peer-to-peer message-passing mechanism stands between each actual agent and its witness agents to perform failure detection and recovery through time-bounded information exchange; a log records the actual agent's actions. When failures occur, the system performs rollback recovery to abort uncommitted actions. Moreover, our method uses checkpointed data to recover the lost actual agent.

14.
This paper describes the architecture of DISC, a system for parallel software development. The system is designed for programming computer systems having several autonomous units, not memory-sharing, and linked by means of a communication network.

The system consists of three parts: the concurrent programming language DISC (DIStributed C), which is an extension of the C language based on the concurrent mechanisms envisaged by the CSP computational model; the programming environment, designed to promote software engineering techniques in the development of distributed programs; and the language run-time support, which provides for the distributed execution of programs.


15.
A sequenced process of Fault Detection followed by the erroneous node's Isolation and system Reconfiguration (node exclusion or recovery), that is, the FDIR process, characterizes the sustained operations of a fault-tolerant system. For distributed systems utilizing message passing, a number of diagnostic (and associated FDIR) approaches, including our prior algorithms, exist in literature and practice. Invariably, the focus is on proving the completeness and correctness (all and only the faulty nodes are isolated) for the chosen fault model, without explicitly segregating permanent from transient faulty nodes. To capture diagnostic issues related to the persistence of errors (transient, intermittent, and permanent), we advocate the integration of count-and-threshold mechanisms into the FDIR framework. Targeting pragmatic system issues, we develop an adaptive online FDIR framework that handles a continuum of fault models and diagnostic protocols and comprehensively characterizes the role of various probabilistic parameters that, due to the count-and-threshold approach, influence the correctness and completeness of diagnosis and system reliability such as the fault detection frequency. The FDIR framework has been implemented on two prototypes for automotive and aerospace applications. The tuning of the protocol parameters at design time allows a significant improvement with respect to prior design choices.
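A count-and-threshold mechanism of the kind this abstract advocates can be sketched in Python as below. The sketch assumes a simple scoring rule: a node's score rises by one on each erroneous round and is multiplied by a decay factor on each clean round, and the node is isolated once the score reaches a threshold. The class name, the `threshold` and `decay` values, and the scoring rule itself are illustrative assumptions rather than the parameters of the paper's FDIR framework.

```python
class CountAndThreshold:
    """Hypothetical per-node error scorer that separates transient from persistent faults."""

    def __init__(self, threshold: float = 3.0, decay: float = 0.5):
        self.threshold = threshold  # score at which a node is isolated (assumed value)
        self.decay = decay          # factor applied to the score after a clean round (assumed)
        self.scores = {}

    def observe(self, node: str, erroneous: bool) -> bool:
        """Record one diagnostic round for `node`; return True if it should be isolated."""
        score = self.scores.get(node, 0.0)
        # Errors accumulate; clean rounds let old errors fade away.
        score = score + 1.0 if erroneous else score * self.decay
        self.scores[node] = score
        return score >= self.threshold
```

With these assumed values a single transient error never triggers isolation, while three consecutive errors do. Tuning the two parameters trades detection speed against the risk of excluding a healthy node.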

16.
1 Introduction. A software distributed shared memory (DSM) system, or shared virtual memory (SVM) system, provides an abstraction of a single shared space on top of the physically distributed memories present on a network of workstations. It has been extensively studied in the past decade since it combines the programmability of shared memory systems and the scalability of distributed systems [1]. However, the performance gap between software DSM systems and message passing platforms remains, which greatly hinders the adoption of software DSM systems. In genera…

17.
Information Sciences, 1986, 38(2): 165-180
One complication in using distributed computer systems is the increased complexity of developing distributed software systems. These software systems are composed of asynchronously executing components which communicate via message passing. Current software design techniques are not adequate for use in the design of distributed software systems. New design methods which explicitly address the problem of system partitioning are needed. An overall distributed software design approach is presented. The key to the design approach is the presentation of a distributed processing component (DPC) partitioning algorithm for clustering functional modules in order to derive a set of distributed processing components. The design approach is oriented towards producing a software system which is hierarchical, which exploits potential concurrency that exists between functional modules, and which avoids nonprofitable message traffic.

18.
We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.
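The relationship between replication degree and failure probability can be illustrated with a deliberately simplified model, sketched in Python below. The model is an assumption made for illustration and is not the expression derived in the paper: each of `n` logical processes runs as `r` replicas, every replica fails independently with probability `p` during the run, and the application fails if all replicas of any one process fail, giving a failure probability of 1 - (1 - p^r)^n. The function names and the `max_r` search bound are hypothetical.

```python
def app_failure_probability(p: float, n: int, r: int) -> float:
    """Failure probability under the assumed independent-replica model."""
    # p**r: all r replicas of one process fail; the application survives only
    # if that does not happen for any of the n processes.
    return 1.0 - (1.0 - p ** r) ** n


def min_replication_degree(p: float, n: int, threshold: float, max_r: int = 64) -> int:
    """Smallest replication degree that keeps the failure probability at or below threshold."""
    for r in range(1, max_r + 1):
        if app_failure_probability(p, n, r) <= threshold:
            return r
    raise ValueError("no replication degree up to max_r meets the threshold")
```

For example, `min_replication_degree(0.1, 10, 0.05)` searches upward from one replica per process until the assumed model predicts a failure probability of at most 5%. The paper's own model additionally ties the failure probability to the execution length, which this sketch ignores.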

19.
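Software reliability in distributed systems is an important concern for both the publishers and the users of application software. For large-scale distributed applications, including e-government, e-commerce, multimedia services, and end-to-end automation solutions, a wide variety of models have been developed to evaluate or predict reliability, yet the reliability problems of these systems persist. Instead, ensuring the reliability of a distributed system requires examining the reliability of every individual component or factor related to an enterprise distributed application before predicting or evaluating the reliability of the whole system, and implementing transparent error detection and error recovery mechanisms that give users seamless interaction. From the perspective of examining the reliability of individual components, this article therefore presents the issues and challenges of the reliability of application software running on distributed systems.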

20.
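Using wireless monitoring technology in tunnel construction reduces network cabling and makes the system more flexible. However, the complex and changeable on-site construction environment and the instability of sensor node hardware and software can cause transient faults in the monitoring system and lead to safety accidents. To deal with transient faults, and with the aim of safeguarding system functionality, a hierarchical model of transient faults is established and a multi-level fault recovery control strategy is proposed, which handles transient faults at four layers: the on-site data monitoring layer, the data transmission layer, the safety protection layer, and the emergency response layer. Simulation experiments show that the recovery control strategy improves the monitoring system's accuracy in detecting disasters and the effective execution rate of emergency response actions, which helps protect the safety of tunnel construction workers.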
