20 similar documents retrieved; search time: 125 ms
1.
2.
The foundation of fault tolerance in distributed/parallel systems is single-process checkpointing and rollback recovery, and saving and restoring the state of a process's active files is an important aspect of this technique. This paper proposes a delayed-write strategy that implements checkpointing of user files and effectively solves the inconsistency between user file contents and the global process state when a failure occurs. The strategy is transparent to users, and through optimizations such as tuning the memory buffer size and hiding latency, it outperforms other methods in space overhead, normal execution time, and recovery time.
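A minimal sketch of the delayed-write idea (the class name and API are hypothetical, not the paper's): user writes are buffered in memory and flushed only when a checkpoint commits, so the on-disk file content always matches the last checkpointed process state, and a rollback simply discards the buffer.

```python
class DelayedWriteFile:
    """Buffer user-file writes until a checkpoint commits (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.pending = []        # in-memory buffer of uncommitted writes

    def write(self, data):
        # Writes never reach the disk immediately; they are delayed.
        self.pending.append(data)

    def commit_checkpoint(self):
        # At checkpoint time, flush all buffered writes so the file
        # matches the checkpointed process state.
        with open(self.path, "ab") as f:
            for data in self.pending:
                f.write(data)
        self.pending.clear()

    def rollback(self):
        # On failure, discard writes made since the last checkpoint;
        # the on-disk file is already consistent.
        self.pending.clear()
```

On rollback nothing has to be undone on disk, which is the point of delaying the writes.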
3.
In large-scale parallel computing systems, a parallel checkpoint triggers a large number of nodes to save their computation state simultaneously, incurring enormous file storage overhead and placing heavy access pressure on the communication and storage systems. Data compression can shrink checkpoint files, reducing the storage overhead and the pressure on communication and storage, but it introduces additional compression cost. Targeting heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that applies a series of optimizations to reduce the computational latency introduced by compression, including a pipelined double write-buffer queue, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes an implementation of this technique on the Tianhe-1 system and presents a comprehensive evaluation of the resulting checkpointing system. Experimental data show that the method is feasible, efficient, and practical on large-scale heterogeneous parallel computing systems.
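The double write-buffer pipeline can be sketched as follows (CPU threads and zlib stand in for the paper's GPU compression stages, and the function name is hypothetical): one stage compresses chunks while a background writer drains previously compressed buffers, overlapping compression with file I/O.

```python
import queue
import threading
import zlib

def write_checkpoint(chunks, path):
    """Pipelined compressed checkpoint write (illustrative sketch)."""
    q = queue.Queue(maxsize=2)          # bounded double write buffer

    def writer():
        # Consumer stage: drain compressed buffers to the checkpoint file
        # while the producer keeps compressing the next chunk.
        with open(path, "wb") as f:
            while True:
                buf = q.get()
                if buf is None:         # sentinel: pipeline drained
                    break
                f.write(buf)

    t = threading.Thread(target=writer)
    t.start()
    co = zlib.compressobj()
    for chunk in chunks:                # producer stage: compress in order
        q.put(co.compress(chunk))
    q.put(co.flush())                   # emit the final compressed tail
    q.put(None)
    t.join()
```

The bounded queue is what makes this a double buffer: the producer blocks only when both slots are full, so compression and file writes proceed concurrently.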
4.
A faulty router in a network-on-chip inevitably degrades the performance of the whole network, and excessive fault-tolerance overhead also burdens the network heavily. This paper therefore proposes a low-overhead fault-tolerant router architecture with faulty-channel isolation. The router reduces its overall cost by removing unnecessary crossbar switches and tuning the number of virtual channels (VCs) per port, while adding one redundant channel to provide fault tolerance. When a channel in the router fails, the channel-isolation detection method lets the router keep transmitting data while diagnosing the fault type, and a retransmission buffer with a reclaim pointer further reduces the overhead of the whole fault-tolerant structure. Experimental results show that, in the fault-free case, the proposed router reduces average latency by about 45% and raises maximum throughput by about 28% compared with a conventional router, at an area overhead of only 18.24%. In the presence of faults, the design also shows clear advantages and achieves good fault tolerance.
5.
To guarantee that every tuple in a data stream is processed reliably, traditional methods keep each tuple in memory until the stream processing system has handled it successfully, which incurs a large memory overhead. This paper proposes a tuple-tracking method that guarantees reliable processing while saving memory. The method comprises a memory-allocation strategy, a tuple-tracking-unit selection strategy, and a checksum-update strategy; together they let a tracking unit retain only the XOR checksum of tuple identifiers rather than the tuples themselves, reducing memory overhead, while an improved consistent-hashing scheme balances load across the tracking units. Experiments on memory overhead and load balancing show that the method tracks tuples and ensures their reliable processing effectively.
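The XOR trick can be sketched as follows (a minimal, hypothetical tracking unit, similar in spirit to stream-processing ackers): the tracker stores one integer checksum per root tuple instead of the tuples themselves; when every emitted identifier has been XORed in twice (once on emit, once on ack), the checksum returns to zero and the tuple tree is known to be fully processed.

```python
class TupleTracker:
    """XOR-checksum tuple tracking unit (illustrative sketch)."""

    def __init__(self):
        self.checksums = {}  # root tuple id -> XOR of outstanding ack ids

    def track(self, root_id, ack_id):
        # Called when a tuple (or a tuple derived from it) is emitted.
        self.checksums[root_id] = self.checksums.get(root_id, 0) ^ ack_id

    def ack(self, root_id, ack_id):
        # Called when a tuple finishes processing; the same id XORed a
        # second time cancels out of the checksum.
        self.checksums[root_id] ^= ack_id
        if self.checksums[root_id] == 0:
            del self.checksums[root_id]
            return True   # entire tuple tree reliably processed
        return False
```

Memory use is one integer per in-flight root tuple, regardless of how many derived tuples are outstanding.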
6.
To overcome the weakening of target intensity in traditional singular-value-decomposition (SVD)-based target detection, an improved SVD method is applied to the detection of dim, small infrared targets. Based on the properties of SVD, a nonlinear correction is applied to the middle-order singular values that contribute most to the target, the remaining singular values are set to zero, and the image is reconstructed to obtain a background-suppressed target image. Results show that the method not only preserves and enhances target energy, improving the signal-to-clutter ratio and contrast of the target, but also achieves good background suppression.
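A minimal sketch of the reconstruction step, assuming a power-law boost as the nonlinear correction (the exponent `gamma`, the index range, and the function name are hypothetical stand-ins for the paper's actual correction):

```python
import numpy as np

def suppress_background(img, k_lo, k_hi, gamma=1.2):
    """Keep only the middle-order singular values and reconstruct (sketch)."""
    U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
    s_mod = np.zeros_like(s)
    # Nonlinearly boost the middle-order singular values that carry
    # most of the target energy; zero out the rest (background).
    s_mod[k_lo:k_hi] = s[k_lo:k_hi] ** gamma
    return U @ np.diag(s_mod) @ Vt
```

The leading singular values, which dominate the smooth background, are zeroed, so the reconstructed image retains mainly the target components.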
7.
8.
9.
10.
A Checkpointing Strategy for High-Performance Cluster Computing Systems
To improve the fault tolerance of high-performance cluster computing systems, checkpointing has become a widely adopted technique. Current checkpointing mostly uses a coordinated protocol, whose synchronization operations incur enormous time overhead and block normal computation as the cluster scales. To address this problem, an uncoordinated checkpointing protocol is used to eliminate the synchronization, message logging is employed to keep the system state consistent, and background threads make the checkpointing transparent. Finally, typical system experiments verify the effectiveness of the method and compare its time overhead with that of the coordinated protocol.
11.
Design and Implementation of a Fault-Tolerant Parallel Algorithm for Matrix LU Decomposition
This paper defines fault-tolerant parallel algorithms and proposes a new class of them based on parallel recomputation. A fault-tolerant parallel algorithm based on parallel recomputation is designed for matrix LU decomposition, which lies at the core of many compute-intensive tasks; its performance is evaluated and compared with a checkpointing approach. The results show that the fault-tolerant parallel LU decomposition algorithm outperforms the checkpointing method.
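A single-node sketch of the recomputation idea (illustrative only; the paper's algorithm is parallel and the function names are hypothetical): verify the factorization by its residual and redo it when a transient fault has corrupted the result, instead of rolling back to a checkpoint.

```python
import numpy as np

def lu_nopivot(A):
    """Plain Doolittle LU without pivoting (assumes nonzero leading minors)."""
    n = len(A)
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def fault_tolerant_lu(A, factor=lu_nopivot, tol=1e-8, retries=2):
    """Recomputation-based fault tolerance for LU (illustrative sketch)."""
    A = np.asarray(A, dtype=float)
    for _ in range(retries + 1):
        L, U = factor(A)
        # Residual check: a transient fault during factorization leaves
        # L @ U != A, which triggers a recomputation rather than a rollback.
        if np.max(np.abs(L @ U - A)) < tol:
            return L, U
    raise RuntimeError("factorization still faulty after retries")
```

Because verification is a cheap matrix multiply, the fault-free path pays no checkpoint I/O cost, which is the trade-off the abstract compares against checkpointing.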
12.
Real-time computer systems are often used in harsh environments, such as aerospace and industry, and are subject to many transient faults while in operation. Checkpointing reduces the recovery time from a transient fault by saving intermediate states of a task in a reliable storage facility and then, on detection of a fault, restoring from a previously stored state. The interval between checkpoints affects the execution time of the task. Whereas inserting more checkpoints and reducing the interval between them reduces the reprocessing time after faults, checkpoints have associated execution costs, and inserting extra checkpoints increases the overall task execution time. Thus, a trade-off between the reprocessing time and the checkpointing overhead leads to an optimal checkpoint placement strategy that optimizes certain performance measures. Real-time control systems are characterized by timely and correct execution of iterative tasks within deadlines. Reliability is the probability that a system functions according to its specification over a period of time. This paper reports on the reliability of a checkpointed real-time control system where errors are detected at checkpointing time. The reliability is used as a performance measure to find the optimal checkpointing strategy. For a single-task control system, the reliability equation over a mission time is derived using a Markov model. Detecting errors at checkpointing time makes the reliability jitter with the number of checkpoints, which forces the use of other search algorithms to find the optimal number of checkpoints. By exploiting the properties of the reliability jitter, a simple algorithm is provided to find the optimal checkpoints effectively. Finally, the reliability model is extended to multiple tasks by a task allocation algorithm.
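The search for an optimal number of checkpoints can be sketched under a deliberately simplified reliability model (hypothetical, not the paper's Markov derivation): Poisson faults at rate `lam`, each segment of the task allowed one retry, so reliability first rises and then falls with the checkpoint count, and a direct search over `n` finds the maximum.

```python
import math

def reliability(n, T, C, lam):
    """Reliability of a task of length T split into n checkpointed segments.

    Simplified model: each segment of length T/n plus checkpoint cost C
    survives fault-free with probability exp(-lam * (T/n + C)); a fault in
    a segment allows exactly one re-execution, a second fault is fatal.
    """
    p = math.exp(-lam * (T / n + C))
    return (p + (1.0 - p) * p) ** n     # every segment succeeds in <= 2 tries

def optimal_checkpoints(T, C, lam, n_max=100):
    """Direct search over the number of checkpoints maximizing reliability."""
    return max(range(1, n_max + 1), key=lambda n: reliability(n, T, C, lam))
```

With few checkpoints the long segments rarely survive even a retry; with many checkpoints the overhead `n * C` dominates, so the maximum lies in between.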
13.
Hassan Motallebi, ETRI Journal, 2020, 42(3): 388-398
We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault-tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (the number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and the characteristics of the tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real-world applications, demonstrate that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduce the resiliency overhead.
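For intuition, a first-order sketch of an interval-from-replicas rule, using Young's classic approximation with the effective MTBF scaled linearly by the replica count (a simplifying assumption for illustration, not the paper's formulation; the function name is hypothetical):

```python
import math

def checkpoint_interval(ckpt_cost, mtbf, replicas):
    """Young-style checkpoint interval as a function of the replica count.

    Assumption (illustrative): a rollback is needed only when every replica
    of the task fails, so the effective MTBF is taken as mtbf * replicas;
    the interval then follows Young's rule sqrt(2 * C * MTBF_eff).
    """
    return math.sqrt(2.0 * ckpt_cost * mtbf * replicas)
```

The qualitative behavior matches the abstract's premise: more replicas make rollbacks rarer, so each task can checkpoint less frequently.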
14.
A number of checkpointing and message logging algorithms have been proposed to support fault tolerance of mobile computing systems. However, little attention has been paid to the optimistic message logging scheme. Optimistic logging has a lower failure-free operation cost compared to other logging schemes. It also has a lower failure recovery cost compared to the checkpointing schemes. This paper presents an efficient scheme to implement optimistic logging for the mobile computing environment. In the proposed scheme, the task of logging is assigned to the mobile support station so that volatile logging can be utilized. In addition, to reduce the message overhead, the mobile support station takes care of dependency tracking and the potential dependency between mobile hosts is inferred from the dependency between mobile support stations. The performance of the proposed scheme is evaluated by an extensive simulation study. The results show that the proposed scheme requires a small failure-free overhead and the cost of unnecessary rollback caused by the imprecise dependency is adjustable by properly selecting the logging frequency.
15.
In this paper, we present an approach to improving the performance of timed cosimulation. Our approach applies the optimistic simulation concept to timed cosimulation to reduce synchronization overhead. It consists of (1) a predictive method for the synchronization between optimistic and synchronous simulators and (2) a method for reducing the state-saving overhead inherent in optimistic simulation. To reduce the synchronization overhead, the predictive synchronization method predicts the time point when the next event is transferred between hardware and software. Then the optimistic simulator runs optimistically until the predicted time point. Because of prediction and optimistic simulation, it is possible for the optimistic simulator to roll back and re-execute. To support rollbacks during optimistic simulation, states of the simulator are stored at checkpoints. In optimistic simulation, state saving can cause significant overhead in run time and memory usage. To reduce this overhead, we perform state saving on a task basis, which saves only the state of the currently running task rather than the whole state of the simulator at each checkpoint. In particular, the single-checkpoint property for hardware tasks minimizes the number of state savings in hardware simulation. We demonstrate the efficiency of the presented approach through cosimulation of two embedded system design examples.
16.
Employing fault tolerance often introduces a time overhead, which may cause a deadline violation in real-time systems (RTS). Therefore, for RTS it is important to optimize the fault-tolerance techniques such that the probability of meeting the deadlines, i.e. the Level of Confidence (LoC), is maximized. Previous studies have focused on evaluating the LoC for equidistant checkpointing; however, none have addressed evaluating the LoC for non-equidistant checkpointing. In this work, we provide an expression to evaluate the LoC for non-equidistant checkpointing. Further, we detail an exhaustive search approach to find the distribution of a given number of checkpoints that results in the maximal LoC. Since the exhaustive search is very time-consuming, we propose the Clustered Checkpointing method, a heuristic that distributes checkpoints in a number of clusters with the goal of maximizing the LoC. The results show that the LoC can be improved when non-equidistant checkpointing is used. Further, they indicate that the proposed Clustered Checkpointing method can find the distribution that yields the maximal LoC in much shorter time than the exhaustive search, while considering only a few clusters.
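The exhaustive search can be sketched as follows, with a deliberately simplified LoC model (Poisson faults at rate `lam`, checkpoint cost `C`, one retry per segment; the paper's expression is more detailed, and the function names are illustrative):

```python
from itertools import combinations
import math

def loc(placements, T, C, lam):
    """Level of Confidence for checkpoints at the given (sorted) time points.

    Simplified model: each segment between consecutive checkpoints survives
    with probability exp(-lam * (length + C)) and may be retried once.
    """
    points = [0.0] + sorted(placements) + [T]
    r = 1.0
    for a, b in zip(points, points[1:]):
        p = math.exp(-lam * ((b - a) + C))
        r *= p + (1.0 - p) * p          # success in at most two attempts
    return r

def best_placement(T, k, C, lam, slots=20):
    """Exhaustive search over all k-checkpoint placements on a time grid."""
    grid = [T * i / slots for i in range(1, slots)]
    # Every (possibly non-equidistant) choice of k grid points is evaluated;
    # this is the exponential blow-up that motivates the clustered heuristic.
    return max(combinations(grid, k), key=lambda p: loc(list(p), T, C, lam))
```

The search space grows combinatorially with the grid resolution and `k`, which is exactly why the abstract proposes a clustering heuristic instead.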
17.
Kantawala K., Tao D.L., IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1997, 5(3): 338-343
In this brief, we propose two new concurrent error-detection (CED) schemes for a class of sorting networks, e.g., odd-even transposition, bitonic, and perfect shuffle sorting networks. A probabilistic method is developed to analyze the fault coverage, and the hardware overhead is evaluated. We first propose a CED scheme by which all errors caused by single faults in a concurrently checked sorting network can be detected. This scheme is the first available to use significantly less hardware overhead than duplication without compromising throughput. From this scheme, we develop another fault detection scheme that sharply reduces the hardware overhead (requiring only an additional 10%-30% hardware) while still achieving virtually 100% fault coverage.
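A software analogue of concurrent checking for an odd-even transposition network (illustrative invariant checks, not the paper's hardware checker): the output must be non-decreasing and must preserve the input's sum and XOR checksums, so a single faulty comparator that corrupts or duplicates a value is caught.

```python
from functools import reduce
from operator import xor

def odd_even_transposition_sort(a, compare_and_swap=None):
    """n-phase odd-even transposition network; the comparator is replaceable
    so that a faulty compare-and-swap cell can be injected for testing."""
    cas = compare_and_swap or (lambda x, y: (x, y) if x <= y else (y, x))
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            a[i], a[i + 1] = cas(a[i], a[i + 1])
    return a

def ced_sort(a, **kw):
    """Sort with concurrent error detection via cheap invariants (sketch)."""
    out = odd_even_transposition_sort(a, **kw)
    ok = all(x <= y for x, y in zip(out, out[1:]))      # output sorted
    ok = ok and sum(out) == sum(a)                      # multiset preserved
    ok = ok and reduce(xor, out, 0) == reduce(xor, a, 0)
    return out, ok
```

Like the paper's scheme, the check costs far less than duplicating the whole network, since it only accumulates checksums over the inputs and outputs.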
18.
Karri R., Kaijie Wu, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2002, 10(6): 864-875
Concurrent error detection (CED) based on time redundancy entails performing the normal computation and a re-computation at different times and then comparing their results. Time redundancy implemented this way can detect only transient faults. We present two algorithm-level time-redundancy-based CED schemes that exploit register transfer level (RTL) implementation diversity to detect both transient and permanent faults. At the RTL, implementation diversity can be achieved either by changing the operation-to-operator allocation or by shifting the operands before re-computation. By exploiting allocation diversity and data diversity, a stuck-at fault affects the two results in two different ways. The proposed schemes yield good fault detection probability with very low area overhead. We used the Synopsys Behavioral Compiler (BC) to validate the schemes.
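The operand-shifting flavor of data diversity can be sketched as RESO (recomputation with shifted operands), a classic scheme in the spirit of the abstract's RTL shifting rather than its exact design; the function and its parameters are illustrative.

```python
def reso_add(a, b, adder=None, width=16):
    """Add with RESO-style concurrent error detection (illustrative sketch).

    The sum is computed once normally and once with both operands shifted
    left one bit. A fault tied to a particular bit position of the adder
    perturbs the two runs differently, so comparing the shifted-back second
    result against the first exposes it.
    """
    add = adder or (lambda x, y: x + y)             # replaceable for testing
    r1 = add(a, b) & ((1 << width) - 1)
    r2 = add(a << 1, b << 1) & ((1 << (width + 1)) - 1)
    if (r2 >> 1) != r1:
        raise RuntimeError("fault detected by RESO comparison")
    return r1
```

Because the erroneous bit lands in different result positions in the two runs, even a permanent stuck-at fault is detected, which plain recomputation on identical operands cannot do.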
19.
Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computing systems over Mobile IP is proposed. The protocol reduces the number of checkpoints per checkpointing process to nearly the minimum, so that fewer checkpoints need to be transmitted through the costly wireless link. Our protocol also performs very well in minimizing the number and size of messages transmitted in the wireless network. In addition, the protocol is nonblocking because inconsistencies can be avoided by the information piggybacked on every message. Therefore, the protocol brings very little overhead to a mobile host with limited resources. Additionally, by taking advantage of reliable timers in mobile support stations, the time-based checkpointing protocol can adapt to wide area networks.
20.
The advent of advanced microelectronic technologies and scaling down to nanometer dimensions has made current digital systems more susceptible to faults and increases the demand for reliable and high-performance computing. Current solutions have so far used the parity prediction scheme to increase reliability and detect faults in adder modules, but they add perceptible area overhead to the circuit. In this paper, we present two new efficient methods for fault detection and localization, in addition to full error correction, targeting stuck-at and multi-cycle transient (MCT) faults in radix-2 signed-digit adders through a combination of time and hardware redundancy. In this study, we use a self-checking full adder that can identify a fault based on its internal functionality to detect any fault in the adder modules. The detection of a fault is followed by input inversion, recomputation, and appropriate output inversion to correct the error and localize the fault. The error-correction method employs fault masking by utilizing the self-dual concept, which is based on the fact that, in the presence of a fault, the designed technique yields a fault-free complement of the expected output when fed the complement of its input operands. In addition, any fault on the input lines of the adder modules can be identified by a low-cost parity-checking error-detection approach, and a faulty module can be localized by comparing the faulty output from the first computation with the fault-free output from the recomputation. Based on the experimental results, the area occupied by our designs is approximately 50% of that used by previous designs employing the parity prediction scheme. In addition to the area reduction, our design approaches achieve higher reliability with less power consumption and lower delay.
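The input-inversion/recomputation idea can be sketched in software (integer negation stands in for the digit-wise complement of a radix-2 signed-digit adder, and the function is hypothetical): a fault-free adder fed the complemented operands returns exactly the complement of the expected output, so a mismatch detects the fault and the complemented recomputation supplies a corrected value.

```python
def selfdual_check_add(a, b, adder=None):
    """Add with self-dual fault detection and masking (illustrative sketch).

    Returns (result, fault_free). On a mismatch the complemented
    recomputation is taken as the corrected result, under the assumption
    that the fault perturbed only the first (uncomplemented) computation.
    """
    add = adder or (lambda x, y: x + y)   # replaceable to inject a fault
    r1 = add(a, b)
    r2 = -add(-a, -b)                     # complement inputs, then the output
    if r1 == r2:
        return r1, True                   # both runs agree: fault-free
    return r2, False                      # masked: use the complemented run
```

A fault that is sensitive to the operand values (e.g., one that fires only for positive digits) hits the two runs asymmetrically, which is what lets the complemented pass mask it.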