Found 20 similar documents (search time: 0 ms)
1.
Ishfaq Ahmad 《The Journal of supercomputing》1995,9(1-2):135-162
Building large-scale parallel computer systems for time-critical applications is a challenging task since the designers of such systems need to consider a number of related factors such as proper support for fault tolerance, efficient task allocation and reallocation strategies, and scalability. In this paper we propose a massively parallel fault-tolerant architecture using hundreds or thousands of processors for critical applications with timing constraints. The proposed architecture is based on an interconnection network called the bisectional network. A bisectional network is isomorphic to a hypercube in that a binary hypercube network can be easily extended into a bisectional network by adding additional links. These additional links give the network rich topological properties such as node symmetry, small diameter, small internode distance, and partitionability. The important property of partitionability is exploited to propose a redundant task allocation and a task redistribution strategy under real-time constraints. The system is partitioned into symmetric regions (spheres) such that each sphere has a central control point. The central points, called fault control points (FCPs), are distributed throughout the entire system in an optimal fashion, provide two-level task redundancy, and efficiently redistribute the loads of failed nodes. FCPs are assigned to the processing nodes such that each node is assigned two types of FCPs for storing two redundant copies of every task present at the node. Similarly, the number of nodes assigned to each FCP is the same. For a failure-repair system environment the performance of the proposed system has been evaluated and compared with a hypercube-based system. Simulation results indicate that the proposed system can yield improved performance in the presence of a high number of node failures.
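The two-level FCP assignment described above can be illustrated with a small sketch. This is an assumption-laden toy model, not the paper's actual allocation algorithm: the FCP names, the balancing rule, and the takeover policy are all illustrative.

```python
# Hypothetical sketch of two-level redundant task allocation: every node is
# assigned two distinct fault control points (FCPs), each holding one
# redundant copy of the node's tasks; on failure, an FCP takes over the load.
# All names and the balancing scheme are illustrative assumptions.

def assign_fcps(nodes, fcps):
    """Assign each node two distinct FCPs, balancing nodes per FCP."""
    assignment = {}
    k = len(fcps)
    for i, node in enumerate(nodes):
        primary = fcps[i % k]
        secondary = fcps[(i + k // 2) % k]   # a second, distinct FCP
        assignment[node] = (primary, secondary)
    return assignment

def redistribute(assignment, tasks, failed_node):
    """On a node failure, the primary FCP takes over the redundant copies."""
    primary, secondary = assignment[failed_node]
    recovered = tasks.pop(failed_node, [])
    tasks.setdefault(primary, []).extend(recovered)
    return primary, secondary

nodes = [f"n{i}" for i in range(8)]
fcps = ["fcp0", "fcp1", "fcp2", "fcp3"]
assignment = assign_fcps(nodes, fcps)
tasks = {n: [f"task-{n}"] for n in nodes}
takeover, backup = redistribute(assignment, tasks, "n3")
```

Because each node's two FCPs are distinct, a single FCP failure still leaves one redundant copy of every task reachable.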
2.
In the hot-standby replication system, the system cannot process its tasks anymore when all replicated nodes have failed. Thus, the remaining living nodes should be well protected against failure when some of the replicated nodes have failed. Design faults and system-specific weaknesses may cause chain reactions of common faults on identical replicated nodes in replication systems. These can be alleviated by replicating diverse hardware and software. Going one step further, failures on the remaining nodes can be suppressed by predicting and preventing the same fault when it has occurred on a replicated node. In this paper, we propose a fault avoidance scheme which increases system dependability by avoiding common faults on remaining nodes when some nodes fail, and analyze the system dependability.
3.
Research on High-Dependability Assurance Techniques for Safety-Critical Systems. Cited 5 times (0 self-citations, 5 by others)
1. Introduction. Safety-critical systems (SCS) are systems whose functional failure can cause major loss of life and property or severe damage to the environment. Such systems are widely used in safety-critical domains such as aerospace, national defense, transportation, nuclear power, and healthcare. Ultra-dependability refers to the ability of a system, given its availability at the start of a mission, to be usable and to complete its specified functions within the specified time and environment, that is, the ability to "succeed whenever activated". With the rapid development of modern society and the presence of destabilizing factors, safety-critical systems have grown increasingly large and complex, bringing reduced reliability and safety, increased investment, longer development cycles, and greater risk. The operating environments of safety-critical systems have also become more complex and harsh, extending continually from land and sea to air and space. Such harsh environments pose severe challenges to achieving high reliability, high safety, and other integrated system qualities. In addition, the sustained failure-free mission required of the system
4.
5.
This paper examines the use of speculations, a form of distributed transactions, for improving the reliability and fault tolerance of distributed systems. A speculation is defined as a computation that is based on an assumption that is not validated before the computation is started. If the assumption is later found to be false, the computation is aborted and the state of the program is rolled back; if the assumption is found to be true, the results of the computation are committed. The primary difference between a speculation and a transaction is that a speculation is not isolated: for example, a speculative computation may send and receive messages, and it may modify shared objects. As a result, processes that share those objects may be absorbed into a speculation. We present a syntax and an operational semantics in two forms. The first is a speculative model, which takes full advantage of the speculative features. The second is a nonspeculative, nondeterministic model, where aborts are treated as failures. We prove the equivalence of the two models, demonstrating that speculative execution is equivalent to failure-free computation.
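The commit/abort behavior of a speculation can be sketched in a few lines. This is a minimal single-process illustration of the idea, not the paper's calculus: the function names and the checkpoint-based rollback are assumptions for illustration.

```python
# Minimal sketch of a speculation: a computation proceeds on an unvalidated
# assumption; if the assumption later proves false, program state is rolled
# back to a checkpoint. Names are illustrative, not from the paper.

import copy

def speculate(state, assumption_check, computation):
    """Run `computation` speculatively; commit if the assumption holds,
    otherwise abort and restore the saved state."""
    checkpoint = copy.deepcopy(state)   # snapshot for rollback
    computation(state)                  # proceed before validation
    if assumption_check():
        return state                    # commit the speculative results
    return checkpoint                   # abort: roll back to the snapshot

state = {"balance": 100}
# Speculatively apply a debit, assuming it will later be authorized.
result = speculate(state, lambda: False,
                   lambda s: s.update(balance=s["balance"] - 30))
# The assumption failed, so the debit is rolled back.
```

A distributed version must also roll back any process that observed the speculative state, which is the absorption behavior the abstract describes.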
6.
Xubin Li Christian Xin Stephen L. 《Journal of Parallel and Distributed Computing》2009,69(12):961-973
High availability data storage systems are critical for many applications as research and business become more data driven. Since metadata management is essential to system availability, multiple metadata services are used to improve the availability of distributed storage systems. Past research has focused on the active/standby model, where each active service has at least one redundant idle backup. However, interruption of service and even some loss of service state may occur during a fail-over depending on the replication technique used. In addition, the replication overhead for multiple metadata services can be very high. The research in this paper targets the symmetric active/active replication model, which uses multiple redundant service nodes running in virtual synchrony. In this model, service node failures do not cause a fail-over to a backup and there is no disruption of service or loss of service state. A fast delivery protocol is further discussed to reduce the latency of the total order broadcast needed. The prototype implementation shows that metadata service high availability can be achieved with an acceptable performance trade-off using the symmetric active/active metadata service solution.
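The virtual-synchrony requirement above rests on totally ordered delivery: every replica applies the same updates in the same order. The sketch below shows a sequencer-based total order with hold-back delivery; it is an illustrative assumption, not the paper's fast delivery protocol.

```python
# Illustrative sketch of totally ordered delivery for symmetric active/active
# replicas: a sequencer assigns a global sequence number to each update, and
# every replica applies updates strictly in sequence order, so all replicas
# converge to identical state. Names are assumptions, not the paper's API.

import itertools

class Sequencer:
    def __init__(self):
        self._counter = itertools.count()
    def order(self, update):
        return (next(self._counter), update)

class Replica:
    def __init__(self):
        self.metadata = {}
        self._next = 0
        self._pending = {}     # hold-back queue for out-of-order arrivals
    def deliver(self, seq, update):
        self._pending[seq] = update
        while self._next in self._pending:   # apply in total order
            key, value = self._pending.pop(self._next)
            self.metadata[key] = value
            self._next += 1

seq = Sequencer()
replicas = [Replica(), Replica()]
msgs = [seq.order(u) for u in [("a", 1), ("b", 2), ("a", 3)]]
for r in replicas:
    for s, u in reversed(msgs):   # even with out-of-order arrival...
        r.deliver(s, u)
# ...every replica ends in the same state.
```

The hold-back queue is what the fast delivery protocol mentioned in the abstract aims to drain sooner, since waiting for the total order is the main latency cost.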
7.
In this paper we present an algebra of actors extended with mechanisms to model crash failures and their detection. We show how this extended algebra of actors can be successfully used to specify distributed software architectures. The main components of a software architecture can be specified following an object-oriented style and then they can be composed using asynchronous message passing or more complex interaction patterns. This formal specification can be used to show that several requirements of a software system are satisfied at the architectural level despite failures. We illustrate this process by means of a case study: the specification of a software architecture for intelligent agents which supports a fault tolerant anonymous interaction protocol.
8.
Context: Certification of safety-critical software systems requires submission of safety assurance documents, e.g., in the form of safety cases. A safety case is a justification argument used to show that a system is safe for a particular application in a particular environment. Different argumentation strategies (informal and formal) are applied to determine the evidence for a safety case. For critical software systems, application of formal methods is often highly recommended for their safety assurance.

Objective: The objective of this paper is to propose a methodology that combines two activities: formalisation of system safety requirements of critical software systems for their further verification, as well as derivation of structured safety cases from the associated formal specifications.

Method: We propose a classification of system safety requirements in order to facilitate the mapping of informally defined requirements into a formal model. Moreover, we propose a set of argument patterns that aim at enabling the construction of (a part of) a safety case from a formal model in Event-B.

Results: The results reveal that the proposed classification-based mapping of safety requirements into formal models facilitates requirements traceability. Moreover, the provided detailed guidelines on construction of safety cases aim to simplify the task of argument pattern instantiation for different classes of system safety requirements. The proposed methodology is illustrated by numerous case studies.

Conclusion: Firstly, the proposed methodology allows us to map the given system safety requirements into elements of the formal model to be constructed, which is then used for verification of these requirements. Secondly, it guides the construction of a safety case, aiming to demonstrate that the safety requirements are indeed met. Consequently, the argumentation used in such a constructed safety case allows us to support it with formal proofs and model checking results used as the safety evidence.
9.
Min-Kyu Joo 《Computers & Electrical Engineering》2005,31(6):380-390
Secure transmission over wired/wireless networks requires encryption of data and control information. For high-speed data transmission, it would be desirable to implement the encryption algorithms in hardware. Faults in the hardware, however, may cause interruption of service. This paper presents a simple technique for achieving fault tolerance in pipelined implementation of symmetric block ciphers. It detects errors, locates the corresponding faults, and readily reconfigures during normal operation to isolate the identified faulty modules. Bypass links with some extra pipeline stages are used to achieve fault tolerance. The hardware overhead can be controlled by properly choosing the number of extra stages. Moreover, fault tolerance is achieved with negligible time overhead. 
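The detect-then-bypass idea can be sketched in software. The stage functions below are toy placeholders (simple XOR rounds), not a real cipher, and the duplicate-and-compare plus spare-stage wiring is an illustrative assumption about how such a pipeline behaves.

```python
# Hedged sketch of pipelined fault tolerance: each stage's result is computed
# twice and compared to detect an error; a stage flagged faulty is bypassed
# through a spare stage configured with the same round parameters. The stage
# functions are toy XOR rounds, not an actual block cipher.

def make_stage(k):
    return lambda x: (x ^ k) & 0xFF   # toy round function over one byte

stages = [make_stage(k) for k in (0x1B, 0x2D, 0x3E)]
spare = make_stage(0x2D)              # spare configured like stage 1

def run_pipeline(block, faulty=None):
    for i, stage in enumerate(stages):
        out1, out2 = stage(block), stage(block)   # duplicated computation
        if out1 != out2 or i == faulty:
            # In hardware, duplicated modules can disagree; here we only
            # demo the bypass path for a stage flagged faulty (i == faulty).
            block = spare(block)
        else:
            block = out1
    return block

# Bypassing a faulty stage 1 through the identically configured spare
# produces the same ciphertext as the fault-free pipeline.
```

The abstract's point about controllable hardware overhead corresponds here to choosing how many spare stages to provision relative to the pipeline depth.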
10.
This paper describes a model for the assessment and certification of safety-critical programmable electronic systems in the transportation industries. The proposed model is founded on the significant commonalities between emerging international safety-related standards in the automotive, railway and aerospace industries. It contains a system development and a safety assessment process which rationalise and unify the common requirements among the standards in these areas. In addition, it defines an evolutionary process for the development of the system's safety case. The safety case process shows how the evidence produced in the progression of safety assessment can be structured in order to form an overall argument about the safety of the system. We conclude that it is possible to use this model as the basis of a generic approach to the certification of systems across the transportation sector.
11.
Modular redundancy and temporal redundancy are traditional techniques to increase system reliability. In addition to being used as temporal redundancy, with technology advancements, slack time in a system can also be used by energy management schemes to save energy. In this paper, we consider the combination of modular and temporal redundancy to achieve energy-efficient reliable real-time service provided by multiple servers. We first propose an efficient adaptive parallel recovery scheme that appropriately processes service requests in parallel to increase the number of faults that can be tolerated and thus system reliability. Then we explore schemes to determine the optimal redundant configurations of the parallel servers to minimize system energy consumption for a given reliability goal or to maximize system reliability for a given energy budget. Our analysis results show that small requests, optimistic approaches, and parallel recovery favor lower levels of modular redundancy, while large requests, pessimistic approaches, and restricted serial recovery favor higher levels of modular redundancy.
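The reliability side of the trade-off above can be made concrete with a standard majority-voting model. The failure probability, the reliability goal, and the restriction to odd configuration sizes are all illustrative assumptions, not the paper's model.

```python
# Illustrative calculation: for a per-module failure probability, compute the
# reliability of an n-modular configuration under majority voting (tolerating
# up to f = (n-1)//2 faults), then pick the smallest n meeting a reliability
# goal. Model and numbers are assumptions for illustration only.

from math import comb

def nmr_reliability(n, p_fail):
    """Probability that a majority of n modules succeed."""
    f = (n - 1) // 2
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(f + 1))

def smallest_config(p_fail, goal, max_n=9):
    """Smallest odd modular-redundancy level meeting the reliability goal."""
    for n in range(1, max_n + 1, 2):      # odd sizes give clean majorities
        if nmr_reliability(n, p_fail) >= goal:
            return n
    return None

# With a 5% per-module failure probability, triple modular redundancy
# already exceeds a 0.99 reliability goal.
n = smallest_config(0.05, 0.99)
```

In the paper's setting, each extra level of redundancy also costs energy, so the optimization runs this kind of reliability calculation against an energy budget rather than picking the smallest n alone.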
12.
13.
Alfredo Capozucca Alexander Romanovsky 《Journal of Systems and Software》2009,82(2):207-228
This paper presents ways of implementing dependable distributed applications designed using the Coordinated Atomic Action (CAA) paradigm. CAAs provide a coherent set of concepts adapted to fault tolerant distributed system design that includes structured transactions, distribution, cooperation, competition, and forward and backward error recovery mechanisms triggered by exceptions. DRIP (Dependable Remote Interacting Processes) is an efficient Java implementation framework which provides support for implementing Dependable Multiparty Interactions (DMI). As DMIs have a softer exception handling semantics compared with the CAA semantics, a CAA design can be implemented using the DRIP framework. A new framework called CAA-DRIP allows programmers to exclusively implement the semantics of CAAs using the same terminology and concepts at the design and implementation levels. The new framework not only simplifies the implementation phase, but also reduces the final system size as it requires fewer instances for creating a CAA at runtime. The paper analyses both implementation frameworks in great detail, drawing a systematic comparison of the two. The CAAs' behaviour is described in terms of Statecharts to better understand the differences between the two frameworks. Based on the results of the comparison, we use one of the frameworks to implement a case study belonging to the e-health domain.
14.
15.
We have developed an architecture for distributed simulations with reuse in mind from the start. During the past six years, two teams on five projects at US Army Tank-Automotive and Armaments Command (TACOM) have used it. We found that in order to reuse the architecture, it was not only necessary to design it for reuse but also to develop and provide associated development processes and standards for reuse. Acceptance of an architecture for reuse is determined not only by its functionality for the task to be performed, but also by its resultant impact on the software development process. Software reuse is similar to both software tool assessment/technology transfer and requirements analysis in that in all three cases one needs to assess not only product functionality but also its resultant impact on the way that the users of the product work. The DoD is now mandating the use of the High Level Architecture Standard (HLA) for distributed military simulations. Because of our experience in architecture reuse, we realize that HLA adoption will impact not only our product, but also the way we work. In addition to examining HLA functionality, we are assessing HLA's impact on our existing development processes and standards.
16.
Summary. Byzantine Agreement is important both in the theory and practice of distributed computing. However, protocols to reach Byzantine Agreement are usually expensive both in the time required as well as in the number of messages exchanged. In this paper, we present a self-adjusting approach to the problem. The Mostly Byzantine Agreement is proposed as a more restrictive agreement problem that requires that in the consecutive attempts to reach agreement, the number of disagreements (i.e., failures to reach Byzantine Agreement) is finite. For t faulty processes, we give an algorithm that has at most t disagreements for 4t or more processes. Another algorithm is given for n ≥ 3t+1 processes with the number of disagreements below t²/2. Both algorithms use O(n³) message bits for binary value agreement.
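A small worked check of the quoted bounds, treating them as stated (the reading of the garbled original as n ≥ 4t and n ≥ 3t+1 is an assumption):

```python
# Worked example of the Mostly Byzantine Agreement bounds as quoted above:
# for t faulty processes, the first algorithm needs at least 4t processes and
# allows at most t disagreements; the second needs n >= 3t + 1 and keeps
# disagreements below t**2 / 2.

def mostly_byzantine_bounds(t):
    return {
        "alg1_min_processes": 4 * t,
        "alg1_max_disagreements": t,
        "alg2_min_processes": 3 * t + 1,
        "alg2_disagreement_bound": t * t / 2,
    }

bounds = mostly_byzantine_bounds(3)
# With t = 3 faults: algorithm 1 needs 12 processes and allows at most 3
# disagreements; algorithm 2 needs 10 processes with fewer than 4.5.
```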
Yi Zhao is currently working on his Ph.D. degree in Computer Science at University of Houston. His research interests include fault tolerance, distributed computing, parallel computation and neural networks. He obtained his M.S. from University of Houston in 1988 and B.S. from Beijing University of Aeronautics and Astronautics in 1984, both in computer science.
Farokh B. Bastani received the B. Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, India, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley. He joined the University of Houston in 1980, where he is currently an Associate Professor of Computer Science. His research interests include software design and validation techniques, distributed systems, and fault-tolerant systems. He is a member of the ACM and the IEEE and is on the editorial board of the IEEE Transactions on Software Engineering.
17.
Grid computing is a largely adopted paradigm to federate geographically distributed data centers. Due to their size and complexity, grid systems are often affected by failures that may hinder the correct and timely execution of jobs, thus causing a non-negligible waste of computing resources. Despite the relevance of the problem, state-of-the-art management solutions for grid systems usually neglect the identification and handling of failures at runtime. Among the primary goals to be considered, we claim the need for novel approaches capable of achieving the objectives of scalable integration with efficient monitoring solutions and of fitting large and geographically distributed systems, where dynamic and configurable tradeoffs between overhead and targeted granularity are necessary. This paper proposes GAMESH, a Grid Architecture for scalable Monitoring and Enhanced dependable job ScHeduling. GAMESH is conceived as a completely distributed and highly efficient management infrastructure, concentrating on two crucial aspects for large-scale and multi-domain grid environments: (i) the scalable dissemination of monitoring data and (ii) the troubleshooting of job execution failures. GAMESH has been implemented and tested in a real deployment encompassing geographically distributed data centers across Europe. Experimental results show that GAMESH (i) enables the collection of measurements of both computing resources and conditions of task scheduling at geographically sparse sites, while imposing a limited overhead on the entire infrastructure, and (ii) provides a failure-aware scheduler able to improve the overall system performance, even in the presence of failures, by coordinating local job schedulers at multiple domains.
18.
The early error detection and the understanding of the nature and conditions of an error occurrence can be useful to make an effective and efficient recovery in distributed systems. Various distributed system extensions were introduced for the implementation of fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and its detection capabilities are shown. Our extension is based on the execution of a message validity test prior to the transmission of messages and the piggybacking of contextual information to facilitate the detection and diagnosis of transient faults in the distributed system.
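The validity-test-then-piggyback pattern can be sketched as follows. The field names (sender, sequence number, checksum) are illustrative assumptions about what "contextual information" might contain, not the paper's actual message format.

```python
# Hedged sketch of the extension described above: every outgoing message must
# pass a validity test before transmission, and contextual information (here:
# sender id, sequence number, CRC checksum) is piggybacked so the receiver
# can detect transient faults. Field names are illustrative assumptions.

import zlib

def send(payload, sender, seq, is_valid):
    if not is_valid(payload):                    # validity test before send
        raise ValueError("message failed validity test")
    context = {"sender": sender, "seq": seq,
               "crc": zlib.crc32(payload)}       # piggybacked context
    return {"payload": payload, "context": context}

def receive(message, expected_seq):
    ctx = message["context"]
    if zlib.crc32(message["payload"]) != ctx["crc"]:
        return "corrupted"                       # transient fault detected
    if ctx["seq"] != expected_seq:
        return "out-of-sequence"                 # lost or duplicated message
    return "ok"

msg = send(b"status", "node-7", 42, lambda p: len(p) > 0)
status = receive(msg, expected_seq=42)
```

The diagnosis step the abstract mentions would then use the piggybacked sender and sequence fields to localize which process and which transmission the fault occurred in.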
19.
20.
This paper presents a set of distributed algorithms that support an Intrusion Detection System (IDS) model for Mobile Ad hoc NETworks (MANETs). The development of mobile networks has created the need for new IDS models to deal with new security issues in these communication environments. Conventional models have difficulty dealing with malicious components in MANETs. In this paper, we describe the proposed IDS model, focusing on distributed algorithms and their computational costs. The proposal employs fault tolerance techniques and cryptographic mechanisms to detect and deal with malicious or faulty nodes. The model is analyzed along with related work. Unlike the studies in the references, the proposed IDS model admits intrusions and malice in its own algorithms. In this paper, we also present test results obtained with an implementation of the proposed model.