
1.

FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Wei Hu Guang-Ming Liu Yan-Huang Jiang 《浙江大学学报:C卷英文版》2018,19(10):1273-1290

As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.

2.

Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(7):757-768

Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm-based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to the system through system-level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The model considers failures in the processors performing the checking operations
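
To make the idea of "system-level encoding" concrete, here is a minimal sketch of the classic checksum-matrix form of ABFT for matrix multiplication. It is not the randomized construction the paper proposes; it only illustrates how encoded checks can detect and locate a single faulty result. All function names are hypothetical, and integer entries are assumed so that checksums can be compared exactly.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Check a full-checksum product matrix.

    Returns None if every row and column checksum is consistent,
    or (row, col) of the single inconsistent data element.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data-element fault")


def checked_product(a, b):
    """Multiply the encoded operands; the result carries its own checksums."""
    return matmul(with_column_checksum(a), with_row_checksum(b))
```

In this sketch the checks are computed by the same code that produces the data. The extended model in the abstract additionally considers failures in the processors that perform the checking, which this example does not attempt to cover.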

3.

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance (Cited by: 1; self-citations: 1; citations by others: 0)

Haines Joshua Lakamraju Vijay Koren Israel Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

4.

A Fault Tolerance Configuration Management Method for Middleware Services (Cited by: 1; self-citations: 0; citations by others: 1)

李军国 黄罡 邹键 梅宏 《计算机学报》2007,30(10):1696-1704

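This paper proposes a fault tolerance management method based on runtime software architecture, which lets developers and administrators customize suitable fault detection and recovery mechanisms for different middleware service failures. First, the runtime software architecture automatically constructs a component dependency view and an error propagation view, providing a global view for understanding and analyzing the reliability of the whole system. Next, fault tolerance mechanisms are configured by manipulating the runtime software architecture. Finally, AOP techniques are used to weave the fault tolerance mechanisms into the middleware so that it has the specified fault tolerance capability. The process is carried out semi-automatically with the help of a visual tool and has been validated on J2EE middleware.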

5.

A comprehensive study on fault tolerance in stream processing systems

Xiaotong WANG Chunxi ZHANG Junhua FANG Rong ZHANG Weining QIAN Aoying ZHOU 《Frontiers of Computer Science》2022,16(2):162603

Stream processing has emerged as a useful technology for applications which require continuous and low latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers to cope with large-scale data, failures of processing nodes or communication networks are especially common and must be handled seriously for the sake of service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large number of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which will become an obstacle for the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of the fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate the experimental results on two representative open-sourced SPSs and discuss the possible disadvantages in current designs. Finally, we specify future research directions in this domain.

6.

Deliberative, search-based mitigation strategies for model-based software health management

Nagabhushan Mahadevan Abhishek Dubey Daniel Balasubramanian Gabor Karsai 《Innovations in Systems and Software Engineering》2013,9(4):293-318


7.

Research on Multi-agent-Based Runtime Fault Monitoring for Real-Time Systems

刘彦斌 朱小冬 《微计算机信息》2006,22(28):224-226

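Software-intensive equipment often contains many embedded real-time systems responsible for monitoring and control, and these are often safety-critical or mission-critical systems. To handle software fault detection, diagnosis, and repair effectively in such systems, this paper proposes a Multi-agent-based runtime fault monitoring framework for real-time systems. The framework uses cooperation among multiple agents to build a runtime fault monitoring system that verifies, while the system is running, whether it satisfies property specifications described in temporal logic, and it applies specific algorithms to locate and repair faults.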

8.

A comparative analysis of hardware and software fault tolerance: Impact on software reliability engineering

Hany H. Ammar Bojan Cukic Ali Mili Cris Fuhrman 《Annals of Software Engineering》2000,10(1-4):103-150

Today's digital systems are growing increasingly complex, and are being used in increasingly critical functions. The first premise makes them more prone to contain faults, and the second premise makes their failure less tolerable. This widening gap highlights the need for fault tolerant techniques, which make provisions for reliable operation of digital systems despite the presence and occasional manifestation of faults. In this paper we present a brief comparative survey of fault tolerance as it arises in hardware systems and software systems. We discuss logical models as well as statistical models of fault tolerance, and use these models to analyze design tradeoffs of fault tolerant systems. This revised version was published online in June 2006 with corrections to the Cover Date.

9.

Analyzing the techniques that improve fault tolerance of aggregation trees in sensor networks

Laukik Alin Sanjay 《Journal of Parallel and Distributed Computing》2009,69(12):950-960

Sensor networks are finding significant applications in large scale distributed systems. One of the basic operations in sensor networks is in-network aggregation. Among the various approaches to in-network aggregation, such as gossip and tree, including the hash-based techniques, the tree-based approaches have better performance and energy-saving characteristics. However, sensor networks are highly prone to failures. Numerous techniques suggested in the literature to counteract the effect of failures have not been carefully analyzed. In this paper, we focus on the performance of these tree-based aggregation techniques in the presence of failures. First, we identify a fault model that captures the important failure traits of the system. Then, we analyze the correctness of simple tree aggregation with our fault model. We then use the same fault model to analyze the techniques that utilize redundant trees to improve the variance. The impact of techniques for maintaining the correctness under faults, such as rebuilding or locally fixing the tree, is then studied under the same fault model. We also do the cost-benefit analysis of using the hash-based schemes which are based on FM sketches. We conclude that these fault tolerance techniques for tree aggregation do not necessarily result in substantial improvement in fault tolerance.

10.

A resource management and fault tolerance services in grid computing

《Journal of Parallel and Distributed Computing》2005,65(11):1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

11.

Reliability growth modeling and optimal release policy under fuzzy environment of an N-version programming system incorporating the effect of fault removal efficiency

P. K. Kapur Anshu Gupta P.C. Jha 《国际自动化与计算杂志》2007,4(4):369-379

Failure of a safety critical system can lead to big losses. Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems. Fault-tolerant software is used to increase the overall reliability of software systems. Fault tolerance is achieved using fault-tolerant schemes such as fault recovery (recovery block scheme), fault masking (N-version programming (NVP)) or a combination of both (hybrid scheme). Such software incorporates the ability of the system to survive even on a failure. Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems. Most of them consider the stable system reliability. Few attempts have been made in reliability modeling to study the reliability growth for an NVP system. Recently, a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency. In this model, a proportion of the number of failures is assumed to be a measure of fault generation, while an appropriate measure of fault generation should be the proportion of faults removed. In this paper, we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation. Using this model, a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system. The proposed model is useful for practical applications and can provide the measures of debugging effectiveness and additional workload or skilled professionals required. It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost. In this paper, we also formulate the optimal software release time problem for a 3VP system under fuzzy environment and discuss a fuzzy optimization technique for solving the problem with a numerical illustration.
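
The fault masking that N-version programming provides can be illustrated with a small majority voter. This is only a sketch of the general NVP idea, not of the reliability growth model in the paper. The names `run_nvp` and the toy versions in the usage note are hypothetical, and the sketch assumes that version outputs are hashable and that a version which raises an exception simply casts no vote.

```python
from collections import Counter


def run_nvp(versions, x):
    """Run every independently developed version on the same input
    and return the output agreed on by a strict majority of versions.

    A version that raises an exception contributes no vote. If no
    output reaches a strict majority of len(versions), the fault
    cannot be masked and a RuntimeError is raised.
    """
    votes = Counter()
    for version in versions:
        try:
            votes[version(x)] += 1
        except Exception:
            # A crashed version is treated as a failed vote.
            pass
    if votes:
        value, count = votes.most_common(1)[0]
        if 2 * count > len(versions):
            return value
    raise RuntimeError("no majority among versions; fault not masked")
```

For a 3VP system, as in the abstract, two agreeing versions are enough: with two correct squaring routines and one faulty one, `run_nvp([lambda v: v * v, lambda v: v ** 2, lambda v: v * v + 1], 4)` returns the majority value.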

12.

Reliability Analysis Model for Replicated Storage Systems with Proactive Fault Tolerance

李静 罗金飞 李炳超 《计算机应用》2021,41(4):1113-1121

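Proactive fault tolerance significantly improves the reliability of a storage system by detecting soon-to-fail disks in advance and prompting the system to migrate and back up the at-risk data ahead of time. To address the problem that existing research cannot accurately evaluate the reliability of replicated storage systems with proactive fault tolerance, several state transition models for replicated storage systems are proposed. The models are then implemented with a Monte Carlo simulation algorithm to simulate the operation of such a system, and finally the expected number of data loss events within a given operating period is computed. The Weibull distribution function is used to model the time distributions of device failure and failure repair events, and the effects of the proactive fault tolerance mechanism, node failures, node failure repairs, disk failures, and disk failure repairs on storage system reliability are evaluated quantitatively. Experimental results show that when the accuracy of the prediction model reaches 50%, system reliability can be improved by a factor of 1 to 3, and that a three-replica system is more sensitive to system parameters than a two-replica system. The proposed models can help system administrators compare and weigh reliability levels under different fault tolerance schemes and system parameters, and thereby build highly reliable and highly available storage systems.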
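
The Monte Carlo approach described above can be sketched in a few lines. The following is a deliberately simplified illustration, not the state transition models from the paper: it simulates a single replica group at disk level only (no node failures), assumes a correctly predicted failure causes no downtime because data is migrated in advance, and assumes the group is rebuilt with fresh disks after a data loss. All function and parameter names are hypothetical.

```python
import random


def simulate_losses(replicas, mission, life_scale, life_shape,
                    repair_scale, repair_shape, accuracy, rng):
    """Count data loss events for one replica group over one mission.

    Disk lifetimes and repair times are Weibull distributed. A failure
    is predicted with probability `accuracy`; a predicted failure is
    assumed to cost no downtime. Data is lost when all replicas are
    down at the same time.
    """
    up = [True] * replicas
    next_event = [rng.weibullvariate(life_scale, life_shape)
                  for _ in range(replicas)]
    losses = 0
    while True:
        i = min(range(replicas), key=next_event.__getitem__)
        now = next_event[i]
        if now > mission:
            return losses
        if up[i]:
            if rng.random() < accuracy:
                # Predicted failure: data migrated ahead of time.
                next_event[i] = now + rng.weibullvariate(life_scale, life_shape)
            else:
                # Unpredicted failure: disk is down until repaired.
                up[i] = False
                next_event[i] = now + rng.weibullvariate(repair_scale, repair_shape)
                if not any(up):
                    losses += 1
                    # Assumption: the group restarts with fresh disks.
                    up = [True] * replicas
                    next_event = [now + rng.weibullvariate(life_scale, life_shape)
                                  for _ in range(replicas)]
        else:
            # Repair finished: disk returns to service.
            up[i] = True
            next_event[i] = now + rng.weibullvariate(life_scale, life_shape)


def expected_losses(trials, seed=0, **params):
    """Average the loss count over independent simulated missions."""
    rng = random.Random(seed)
    return sum(simulate_losses(rng=rng, **params) for _ in range(trials)) / trials
```

A call such as `expected_losses(1000, replicas=3, mission=100.0, life_scale=10.0, life_shape=1.2, repair_scale=1.0, repair_shape=2.0, accuracy=0.5)` can then be repeated with different `accuracy` or `replicas` values to compare configurations. The parameter values here are placeholders, not figures from the paper.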

13.

Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems

Ghosh S. Melhem R. Mosse D. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(3):272-284

Real time systems are being increasingly used in several applications which are time critical in nature. Fault tolerance is an important requirement of such systems, due to the catastrophic consequences of not tolerating faults. We study a scheme that provides fault tolerance through scheduling in real time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques that we call deallocation and overloading to achieve high acceptance ratio (percentage of arriving tasks scheduled by the system). The paper compares the performance of our scheme with that of other fault tolerant scheduling schemes, and determines how much each of deallocation and overloading affects the acceptance ratio of tasks. The paper also provides a technique that can help real time system designers determine the number of processors required to provide fault tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks

14.

Reliability Evaluation Model for Cloud Storage Systems with Proactive Fault Tolerance

李静 刘冬实 《计算机应用》2018,38(9):2631-2636

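In addition to traditional redundancy mechanisms, proactive fault tolerance techniques are also used to improve the reliability of storage systems. However, there has been little research on the reliability of cloud storage systems with proactive fault tolerance, and existing work is limited to the assumption that disk failures follow an exponential distribution. Two reliability state transition models are proposed for RAID-5 and RAID-6 (redundant array of disks) cloud storage systems with proactive fault tolerance, and a Monte Carlo simulation algorithm is designed on top of these models to evaluate the expected number of data loss events within a given operating period. The algorithm uses the Weibull distribution function to model disk failure rates that change over time (decreasing, constant, or increasing), and accurately evaluates the effects of the proactive fault tolerance mechanism, whole-disk failures, failure repair, latent block failures, and disk scrubbing on system reliability. The proposed method can help system designers assess how different fault tolerance mechanisms and system parameters affect the reliability of cloud storage systems, and helps in building highly reliable storage systems.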

15.

Towards Software Architecture and Mechanisms for Improving Runtime Variability of Internetware System

Jiwei Liu Xinjun Mao 《International Journal of Software and Informatics》2014,8(1):67-94

As an emerging software paradigm, Internetware is proposed to handle the openness and dynamism of software systems in the context of the Internet, which implies that such software systems typically have runtime variability that can be improved dynamically to handle various or even unexpected changes of requirements and open environment. Although much progress has been made in Internetware software technologies to support the adaptation, evolution, context-awareness, etc. of Internetware, how to construct Internetware systems with the ability to improve their runtime variability is still a great challenge in the software engineering literature. In this paper, we propose a software architecture and mechanisms for Internetware systems to support the improvement of their runtime variability by combining software variability and autonomic computing techniques. The Internetware system is organized as three levels that consist of variable autonomic elements and Internetware entities, and the architecture of these software entities is defined and discussed respectively. Moreover, we put forward a series of runtime mechanisms based on these levels, including module selection, intermediator and horizontal management, to realize operations upon the variation points and variants in software architectures and thus achieve the improvement of runtime variability. We develop a sample Personal Data Resource Network to depict the requirements and scenario of improving runtime variability, and further study the case based on our proposed approach to show its effectiveness and applicability.

16.

WSCMon: runtime monitoring of web service orchestration based on refinement checking

Mohsen Khaxar Saeed Jalili 《Service Oriented Computing and Applications》2012,6(1):33-49


17.

Failure detection algorithm for Fail-Lagging model applied to HPC

Ye Yingjun Zhang Yongdong Ye Weicai 《The Journal of supercomputing》2022,78(12):14009-14033

It is essential to use fault tolerance techniques on exascale high-performance computing systems, but this faces many challenges such as higher probability of failure, more complex types of faults, and greater difficulty in failure detection. In this paper, we designed the Fail-Lagging model to describe HPC process-level failure. The failure model does not distinguish whether the failed process is crashed or slow, but is compatible with the possible behavior of the process due to various failures, such as crash, slow, recovery. The failure detection in Fail-Lagging model is implemented by local detection and global decision among processes, which depend on a robust and efficient communication topology. Robust means that failed processes do not easily corrupt the connectivity of the topology, and efficient means that the time complexity of the topology used for collective communication is as low as possible. For this purpose, we designed a torus-tree topology for failure detection, which is scalable even at the scale of an extremely large number of processes. The Fail-Lagging model supports common fault tolerance methods such as rollback, replication, redundancy, algorithm-based fault tolerance, etc. and is especially able to better enable the efficient forward recovery mode. We demonstrate with large-scale experiments that the torus-tree failure detection algorithm is robust and efficient, and we apply fault tolerance based on the Fail-Lagging model to iterative computation, enabling applications to react to faults in a timely manner.


18.

Analyzing and diagnosing interconnect faults in bus-structured systems

《Design & Test of Computers, IEEE》2002,19(1):54-64

Testing multimodule systems presents several challenges, particularly when systems use submicron technology. The authors propose several methods and strategies for diagnosing interconnect faults in bus-structured systems under different fault models, including those applicable to submicron technology. Besides defining new features, such as the logical extent of faults, they also propose a reduction strategy that permits 100% fault detection and identification (including fault location)

19.

An adaptive QoS-aware fault tolerance strategy for web services

Zibin Zheng Michael R. Lyu 《Empirical Software Engineering》2010,15(4):323-345

Service-Oriented Architecture (SOA) is widely adopted for building mission-critical systems, ranging from on-line stores to complex airline management systems. How to build reliable SOA systems becomes a big challenge due to the compositional nature of Web services. This paper proposes an adaptive QoS-aware fault tolerance strategy for Web services. Based on a user-collaborated QoS-aware middleware, SOA systems can dynamically adjust their optimal fault tolerance configurations to achieve optimal service reliability as well as good overall performance. Both the subjective user requirements and the objective system performance of the Web services are considered in our adaptive fault tolerance strategy. Experiments are conducted to illustrate the advantages of the proposed adaptive fault tolerance strategy. Performance and effectiveness comparisons of the proposed adaptive fault tolerance strategy and various traditional fault tolerance strategies are also provided.

20.

Fair multi-agent task allocation for large datasets analysis

Quentin Baert Anne-Cécile Caron Maxime Morge Jean-Christophe Routier 《Knowledge and Information Systems》2018,54(3):591-615

MapReduce is a design pattern for processing large datasets distributed on a cluster. Its performance depends on the data structure and the runtime environment. Indeed, data skew can yield an unfair task allocation, but even when the initial allocation produced by the partition function is well balanced, an unfair allocation can occur during the reduce phase due to the heterogeneous performance of nodes. For these reasons, we propose an adaptive multi-agent system. In our approach, the reducer agents interact during the job, and the task reallocation is based on negotiation in order to decrease the workload of the most loaded reducer and thus the runtime. In this paper, we propose and evaluate two negotiation strategies. Finally, we evaluate our multi-agent system on real-world datasets in a heterogeneous runtime environment.
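
The goal of decreasing the workload of the most loaded reducer can be illustrated with a simple reallocation loop. This is a centralized sketch written for illustration only; it is not either of the two negotiation strategies evaluated in the paper, where reducer agents negotiate among themselves. The function name `rebalance` and the acceptance rule (a bidder accepts a task only if its resulting load stays strictly below the donor's current load) are assumptions, and task costs are assumed to be positive numbers.

```python
def rebalance(allocation):
    """Return a new allocation in which the most loaded reducer has
    delegated tasks while doing so strictly helps.

    allocation maps a reducer name to a list of positive task costs.
    The input is not modified.
    """
    alloc = {r: list(tasks) for r, tasks in allocation.items()}
    while True:
        load = {r: sum(tasks) for r, tasks in alloc.items()}
        donor = max(load, key=load.get)          # most loaded reducer
        others = [r for r in alloc if r != donor]
        if not others:
            return alloc
        bidder = min(others, key=load.get)       # least loaded reducer
        # The bidder accepts a task only if it ends up strictly less
        # loaded than the donor currently is.
        acceptable = [t for t in alloc[donor]
                      if t > 0 and load[bidder] + t < load[donor]]
        if not acceptable:
            return alloc
        task = max(acceptable)
        alloc[donor].remove(task)
        alloc[bidder].append(task)
```

Each accepted move strictly lowers the sum of squared loads, so the loop terminates. Handling heterogeneous node performance, as the abstract describes, would require per-reducer cost estimates, which this sketch omits.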