期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

基于路径跟踪的控制流检测 总被引：1，自引：0，他引：1

李剑明谭庆平徐建军蒋诚《计算机工程》2009,35(20):68-70

硬件瞬时故障可以通过修改指令操作码和操作数的方式引发控制流错误,破坏程序的正常执行。针对硬件瞬时故障引起的程序控制流错误,提出一种指令级控制流检测方法,对程序执行路径进行跟踪。故障注入实验结果表明,该方法的平均错误检测率、增加的内存消耗和性能损耗分别为97.8%, 83.2%和52.9%。相似文献

2.

Achieving cost-efficient fail-operational behavior based on inherent redundancy at the system level

《Microprocessors and Microsystems》2021

To fulfill their safety requirements, modern embedded systems are increasingly often expected to deliver a guaranteed minimum level of functionality at all times. In practice, such fail-operational systems are often based on fault tolerance mechanisms that are inadequate for use in cost-driven environments such as the automotive domain. In this work, we consider safety-critical embedded systems with a certain degree of spare resources at the system level and propose a cost-efficient fault tolerance approach that protects a pair of execution units from severe hardware faults. The concept requires no replication of an execution unit. Instead, it employs a state-preserving proxy unit that communicates with low-level devices such as sensors or actuators and handles faults of one execution unit by dynamically migrating the safety-critical portion of its functionality to the redundant counterpart. Based on the application of this concept to an example scenario from the automotive domain, we analyze the resource overhead of the proxy unit and evaluate both the achieved fault handling time and the generated computational overhead experimentally. 相似文献

3.

Fault tolerance in highly parallel hardware systems

Grosspietsch K.E. 《Micro, IEEE》1994,14(1):60-68

As the demand for highly parallel systems grows, the vast amount of concurrently operating hardware involved can make it difficult to guarantee proper system behavior. Problems arise both from permanent and transient hardware faults and from errors caused by improper programming. A number of fault tolerance solutions have emerged. Following a survey of fault tolerance in arrays, a discussion of solutions for more specialized architectures is presented 相似文献

4.

A Recovery Technique for Distributed Communicating Process Systems

下载免费PDF全文

Zhou Di 《计算机科学技术学报》1986,1(2):34-43

This paper presents a recovery technique for distributed communicating processsystems.It handles both hardware faults and software faults uniformly.Differing fromother recovery techniques,it brings an extremely small amount of execution overheadto nonfailing processes and can be implemented easily.It can be applied to aprogramming procedure to mask the software design errors or imbeded into anoperating system to enhance the reliability of the whole system.The theoretical workis carried out first,then the implementation problems are considered and theevaluation techniques are discussed last. 相似文献

5.

Architectural Support for Fault Tolerance in a Teradevice Dataflow System

Sebastian Weis Arne Garbade Bernhard Fechner Avi Mendelson Roberto Giorgi Theo Ungerer 《International journal of parallel programming》2016,44(2):208-232

The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, requests new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect free execution and reduced synchronization overhead. However, the terascale transistor integration of such future chips make them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future system. In this paper, we present a fault tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults during runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results showed that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery suffered from moderate overhead, even in the case of high fault rates. 相似文献

6.

Software dependability in the Tandem GUARDIAN system

Inhwan Lee Iyer R.K. 《IEEE transactions on pattern analysis and machine intelligence》1995,21(5):455-467

Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate 相似文献

7.

一种支持容错的任务并行程序设计模型

王一拙陈旭计卫星苏岩王小军石峰《软件学报》2016,27(7):1789-1804

任务并行程序设计模型已成为并行程序设计的主流,其通过发掘任务并行性来提高并行计算机的系统性能.提出一种支持容错的任务并行程序设计模型,将容错技术融入到任务并行程序设计模型中,在保证性能的同时提高系统可靠性.该模型以任务为调度、执行、错误检测与恢复的基本单位,在应用级实现容错支持.采用一种Buffer-Commit计算模型支持瞬时错误的检测与恢复;采用应用级无盘检查点实现节点故障类型永久错误的恢复;采用一种支持容错的工作窃取任务调度策略获得动态负载均衡.实验结果表明,该模型以较低的性能开销提供了对硬件错误的容错支持. 相似文献

8.

DAFT: Decoupled Acyclic Fault Tolerance

Yun Zhang Jae W. Lee Nick P. Johnson David I. August 《International journal of parallel programming》2012,40(1):118-140

Higher transistor counts, lower voltage levels, and reduced noise margin increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they either increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improves the performance of software redundant multithreading by 2.17× with no degradation of fault coverage. 相似文献

9.

Error detection and diagnosis for fault tolerance in distributed systems

Kassem Saleh Khaled Al-Saqabi 《Information and Software Technology》1998,39(14-15)

The early error detection and the understanding of the nature and conditions of an error occurrence can be useful to make an effective and efficient recovery in distributed systems. Various distributed system extensions were introduced for the implementation of fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems and its detection capabilities are shown. Our extension is based on the execution of message validity test prior to the transmission of messages and the piggybacking of contextual information to facilitate the detection and diagnosis of transient faults in the distributed system. 相似文献

10.

无线隧道施工监控系统瞬时故障恢复控制

左泽华黄雄峰秦元庆周纯杰《计算机应用》2012,32(5):1443-1445

隧道施工中采用无线监控技术具有减少网络布线、增加系统灵活性的优点,但是现场施工环境的复杂多变性及传感器节点软硬件不稳定性会导致监控系统瞬时故障而引发安全事故。针对瞬时故障,从保障系统功能出发,建立瞬时故障层次模型,提出了一种多层次故障恢复控制策略,对现场数据监测层、数据传输层、安全防护层、应急响应层四层瞬时故障进行处理。模拟实验表明,该恢复控制策略能提高监控系统对灾害的检测准确率和应急响应动作的执行有效率,有利于保障隧道施工人员安全。相似文献

11.

A dynamic code coverage approach to maximize fault localization efficiency

《Journal of Systems and Software》2014

Spectrum-based fault localization is amongst the most effective techniques for automatic fault localization. However, abstractions of program execution traces, one of the required inputs for this technique, require instrumentation of the software under test at a statement level of granularity in order to compute a list of potential faulty statements. This introduces a considerable overhead in the fault localization process, which can even become prohibitive in, e.g., resource constrained environments. To counter this problem, we propose a new approach, coined dynamic code coverage (DCC), aimed at reducing this instrumentation overhead. This technique, by means of using coarser instrumentation, starts by analyzing coverage traces for large components of the system under test. It then progressively increases the instrumentation detail for faulty components, until the statement level of detail is reached. To assess the validity of our proposed approach, an empirical evaluation was performed, injecting faults in six real-world software projects. The empirical evaluation demonstrates that the dynamic code coverage approach reduces the execution overhead that exists in spectrum-based fault localization, and even presents a more concise potential fault ranking to the user. We have observed execution time reductions of 27% on average and diagnostic report size reductions of 77% on average. 相似文献

12.

实时操作系统CPU使用率监测的软件容错研究

王余伟曹东施书成《计算机工程与科学》2018,40(8):1337-1343

在硬件实时操作系统中,系统CPU的使用率是系统性能的一项重要指标,如果任务占据了系统的全部CPU,其它任务将无法继续运行,给系统带来灾难性后果。通过分析实时操作系统中软件运行的特点,系统设计需要采取一定容错策略,以提高系统可靠性和容错能力。在μC/ OS-Ⅱ实时操作系统下对飞行控制软件中的任务进行实时监测。首先给出在μC/ OS Ⅱ实时操作系统下CPU使用率的计算方法,合理提出CPU的监测周期。其次,给出对CPU使用率异常的故障检测算法,对故障进行故障处置,提高系统的容错能力。最后,通过在MPC5674飞行控制计算机中编写嵌入式飞行控制软件来验证四种对CPU使用率异常的处置方法。仿真结果表明,实时操作系统中CPU的软件容错方法可以有效提高系统可靠性和容错能力。相似文献

13.

Fault Tolerance Achieved in VLSI

Emmerson R. Mcgowan M.J. 《Micro, IEEE》1984,4(6):34-43

This quad-modular redundant system offers a cost-effective alternativefor supporting fault tolerance by incorporating hardware/software independence and five redundancy mechanisms to correct both transient and permanent errors. 相似文献

14.

A distributed recovery block approach to fault-tolerant executionof application tasks in hypercubes

Kim K.H. Kavianpour A. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(1):104-111

An approach to fault-tolerant execution of real-time application tasks in hypercubes is proposed. The approach is based on the distributed recovery block (DRB) scheme and does not require special hardware mechanisms in support of fault tolerance. Each task is assigned to a pair of processors forming a DRB computing station for execution in a dual-redundant and self-checking mode. Assignment of all tasks in an application in such a form is called the full DRB mapping. The DRB scheme was developed as an approach to uniform treatment of hardware and software faults with the effect of fast forward recovery. However, if the system developer is concerned with hardware fault possibilities only, then forming DRB stations becomes a mechanical process not burdening the application software designer in any way. A procedure for converting an efficient nonredundant task-to-processor mapping into an efficient full DRB mapping is presented 相似文献

15.

Principles to Support Modular Software Construction

下载免费PDF全文

Jack B. Dennis 《计算机科学技术学报》2017,32(1):3-10

The construction of large software systems is always achieved through assembly of independently written components — program modules. For these software components to work together, they must share a common set of data types and principles for representing structured data such as arrays of values and files. This common set of tools for creating and operating on data objects is provided by the infrastructure of the computer system: the hardware, operating system and runtime code. Because the nature and properties of these tools are crucial for correct operation of software components and their inter-operation, it is essential to have a precise specification that may be used for verifying correctness of application software on one hand, and to verify correctness of system behavior on the other. We call such a specification a program execution model (PXM). It is evident that the properties of the PXM implemented by a computer system can have serious impact on the ability of application programmers to practice modular software construction. This paper discusses the concept of program execution models and presents a set of principles that a PXM must satisfy to provide a sound basis for modular software construction. Because parallel program execution on computer systems with many processing units is an essential part of contemporary computing environments, the expression of parallelism and modular software construction using components involving parallel operations is included in this treatment. The conclusion is that it is possible to build computer systems that implement a PXM within which any parallel program may be used, unmodified, as a component for building more substantial parallel programs. 相似文献

16.

Real-Time Execution Monitoring

Plattner Bernhard 《IEEE transactions on pattern analysis and machine intelligence》1984,(6):756-764

Today's programming methodology emphasizes the study of static aspects of programs. In practice, however, monitoring a program in execution, i.e., monitoring a process, is routinely done by any programmer whose task it is to produce a reliable piece of software. There are two reasons why one might want to examine the dynamic aspects of a program: first, to evaluate the performance of a program, and hence to assess its overall behavior; and second, to demonstrate the presence of programming errors, isolate erroneous program code, and correct it. This latter task is commonly called ``debugging a program' and requires a detailed insight into the innards of a program being executed. Today, many computer systems are being used to measure and control real-world processes. The pace of execution of these systems and their control programs is therefore bound to timing constraints imposed by the real-world process. As a step towards solving the problems associated with execution monitoring of real-time programs, we develop a set of appropriate concepts and define the basic requirements for a real-time monitoring facility. As a test case for the theoretical treatment of the topic, we design hardware and software for an experimental real-time monitoring system and describe its implementation. 相似文献

17.

Fault tolerance in supervisory control systems: a knowledge-based approach

Dimitris Th. Askounis Vassilis Assimakopoulos John Psarras 《Journal of Intelligent Manufacturing》1994,5(5):323-331

Fault tolerance in computerized systems involved in production has become an ever more important requirement. Existing fault tolerance approaches, wherever used, deal mainly with hardware faults. Nevertheless, the vast majority of contemporary system failures are software related. This paper introduces a knowledge-based approach to handling software related faults occurring in supervisory control systems. These systems are event driven and use data, stored in complex databases, to react to events coming from different kinds of devices by identifying, scheduling, initiating and monitoring operations. Failure of part of the supervisory control system's software to behave rationally when unexpected events occur is called an application fault. The approach introduced in this paper is based on a supervisory control system reference model which reveals the set of all possible application faults together with the major functions of the recovery processes associated with each fault, and leads to a high-level knowledge-based system architecture capable of handling every fault-related condition. This system is called PROFIT (Intelligent PROduction systems Fault Tolerance) and consists of three main components: the fault diagnosis module, the instant fault correction module and the learning module, co-ordinated by a PROFIT meta-level module. The prototype version of PROFIT is analysed and the development as well as the run-time environment that prove the applicability and effectiveness of the system are presented. 相似文献

18.

Optimization of VLIW compatibility systems employing dynamic rescheduling 总被引：1，自引：0，他引：1

Thomas M. Conte Sumedh W. Sathaye 《International journal of parallel programming》1997,25(2):83-112

Lack of object code compatibility in VLIW architectures is a severe limit to their adoption as a general-purpose computing paradigm. Previous approaches include hardware and software techniques, both of which have drawbacks. Hardware techniques add to the complexity of the architecture, whereas software techniques require multiple executables. This paper presents a technique called Dynamic Rescheduling that applies software techniques dynamically, using intervention by the OS: at each first-time page fault, the page of code is rescheduled for the new generation, if required. Results are presented to demonstrate the viability of the technique using the Illinois IMPACT compiler and the TINKER architectural framework. For the machine models and the workloads used in this study, performance of the rescheduled code compares well with the native scheduled code for a machine. The behavior of a subset of programs in the workload is such that they face a large number of first-time page faults. Due to this, their rescheduling overhead is higher relative to their total execution time. Such programs are calledhigh-overhead programs. Caching of translated pages across multiple invocations of the program to reduce the rescheduling overhead, using apersistent rescheduled-page cache (PRC) ⁽¹⁾ is discussed. It was found that for the workload used in this evaluation, a PRC of size between 512 to 1024 pages, and which uses anoverhead-based page replacement policy would be effective in reducing the overhead. This is a revised and expanded version of the paper presented by the authors at the28th Annual International Symposium on Microarchitecture (MICRO-28), November 1995, Ann Arbor, Michigan. 相似文献

19.

A fault-tolerant pipelined architecture for symmetric block ciphers

Min-Kyu Joo Author Vitae Author Vitae 《Computers & Electrical Engineering》2005,31(6):380-390

Secure transmission over wired/wireless networks requires encryption of data and control information. For high-speed data transmission, it would be desirable to implement the encryption algorithms in hardware. Faults in the hardware, however, may cause interruption of service. This paper presents a simple technique for achieving fault tolerance in pipelined implementation of symmetric block ciphers. It detects errors, locates the corresponding faults, and readily reconfigures during normal operation to isolate the identified faulty modules. Bypass links with some extra pipeline stages are used to achieve fault tolerance. The hardware overhead can be controlled by properly choosing the number of extra stages. Moreover, fault tolerance is achieved with negligible time overhead. 相似文献

20.

Fault-Management in P2P-MPI

Stéphane Genaud Emmanuel Jeannot Choopan Rattanapoka 《International journal of parallel programming》2009,37(5):433-461

We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions. 相似文献