Similar literature
 20 similar documents found (search took 125 ms)
1.
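A reliable and efficient implementation method for rollback recovery   Total citations: 3, self-citations: 0, other citations: 3
To address the nondeterminism of thread suspension points in existing user-level process checkpointing implementations, this paper proposes a solution based on thread self-suspension. In addition, to reduce the overhead of distributed rollback recovery, it proposes a multithreaded implementation framework for rollback recovery. Based on the proposed strategy, a rollback recovery testbed named WINDAR was developed. Experimental results show that the multithreaded implementation strategy can significantly improve the performance of the pessimistic message logging protocol.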

2.
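WOB: a new file checkpointing strategy   Total citations: 6, self-citations: 1, other citations: 5
Single-process checkpointing and rollback recovery are the foundation of fault tolerance in distributed and parallel systems, and saving and restoring the state of a process's active files is an important aspect of this technique. The delayed-write strategy proposed in this paper implements checkpointing of user files and effectively resolves the inconsistency between user file contents and the global process state when a failure occurs. It is transparent to users, and by optimizing the memory buffer size, hiding latency, and similar means, it outperforms other methods in space overhead, failure-free running time, and recovery time.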

3.
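Liu Yongpeng, Wang Feng, Lu Kai, Liu Yongyan. Acta Electronica Sinica (电子学报), 2012, 40(2): 223-229
In large-scale parallel computing systems, parallel checkpointing triggers a large number of nodes to save their computation state simultaneously, causing a huge file storage overhead and heavy access pressure on the communication and storage systems. Data compression can shrink checkpoint files and thereby reduce both the storage overhead and the access pressure, but it also introduces extra computation overhead. For heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that uses a series of optimizations to reduce the computation latency introduced by compression, including a pipelined double write-buffer queue, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes the implementation of the technique on the Tianhe-1 system and presents a comprehensive evaluation of the resulting checkpointing system. Experimental data show that the method is feasible, efficient, and practical in large-scale heterogeneous parallel computing systems.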
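The pipelining and write-merging ideas in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it swaps the GPU-accelerated compressor for CPU `zlib`, and the function name `pipelined_checkpoint`, the queue depth, and the `merge_bytes` threshold are all hypothetical choices made for illustration.

```python
import queue
import threading
import zlib


def pipelined_checkpoint(chunks, sink, merge_bytes=4096, depth=4):
    """Compress checkpoint chunks in one stage while another stage writes them.

    Assumption: CPU zlib stands in for the GPU compressor described above.
    """
    compress_q = queue.Queue(maxsize=depth)  # raw chunks waiting for compression
    write_q = queue.Queue(maxsize=depth)     # compressed blocks waiting to be written

    def compressor():
        comp = zlib.compressobj()
        while True:
            chunk = compress_q.get()
            if chunk is None:                # end-of-stream marker
                write_q.put(comp.flush())
                write_q.put(None)
                return
            write_q.put(comp.compress(chunk))

    def writer():
        pending = bytearray()                # merge small blocks into larger writes
        while True:
            block = write_q.get()
            if block is None:
                break
            pending += block
            if len(pending) >= merge_bytes:
                sink.write(bytes(pending))
                pending.clear()
        if pending:
            sink.write(bytes(pending))

    stages = [threading.Thread(target=compressor), threading.Thread(target=writer)]
    for stage in stages:
        stage.start()
    for chunk in chunks:                     # the application keeps producing state
        compress_q.put(chunk)
    compress_q.put(None)
    for stage in stages:
        stage.join()
```

Because the two stages run concurrently, compression of one chunk overlaps with writing of the previous one, and the bounded queues keep memory use fixed. `sink` can be any object with a `write(bytes)` method, such as a file opened in binary mode.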

4.
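A router failure in a network-on-chip inevitably degrades the performance of the whole network, and excessive fault-tolerance overhead also places a heavy burden on the network. This paper therefore proposes a low-overhead fault-tolerant router architecture with faulty-channel isolation. The router reduces its overall overhead by removing unnecessary crossbar switches and optimizing the number of virtual channels (VCs) at each port, and it adds one redundant channel to make the router fault tolerant. When a channel in the router fails, the channel isolation detection method lets the router transmit data while it detects the fault type, and a retransmission buffer with a reclaim pointer further reduces the overhead of the fault-tolerant structure. Experimental results show that, in the fault-free case, the proposed router reduces average latency by about 45% and raises maximum throughput by about 28% compared with a conventional router, while increasing area overhead by only 18.24%. In the presence of faults, the proposed scheme also shows clear advantages and achieves good fault tolerance.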

5.
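To guarantee that every tuple in a data stream is processed reliably, traditional methods keep each tuple in memory until the stream processing system has processed it successfully, which incurs a large memory overhead. This paper proposes a tuple tracking method that guarantees reliable tuple processing while saving memory. The method comprises a memory allocation strategy, a tuple-tracking-unit selection strategy, and a checksum update strategy. With these three strategies, a tuple tracking unit keeps only the XOR checksum of the tuple identifiers rather than the tuples themselves, which reduces memory overhead, and load balancing across tuple tracking units is achieved through an improved consistent hashing transformation. Experiments on memory overhead and load balancing show that the method can effectively track tuples and process them reliably.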
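The XOR-checksum idea can be shown in a few lines. This is a minimal sketch, not the paper's design: the class name `TupleTracker` and its methods are hypothetical, and the memory allocation, unit selection, and consistent hashing strategies are omitted.

```python
class TupleTracker:
    """Track outstanding tuples with one integer instead of storing the tuples."""

    def __init__(self):
        # XOR of the ids of all tuples emitted but not yet acknowledged.
        self.checksum = 0

    def emit(self, tuple_id):
        self.checksum ^= tuple_id

    def ack(self, tuple_id):
        # XOR-ing the same id a second time cancels it out.
        self.checksum ^= tuple_id

    def fully_processed(self):
        return self.checksum == 0


def demo():
    ids = [0x1A2B, 0x3C4D, 0x5E6F, 0x7081, 0x92A3]  # illustrative tuple identifiers
    tracker = TupleTracker()
    for tuple_id in ids:
        tracker.emit(tuple_id)
    for tuple_id in ids[:-1]:
        tracker.ack(tuple_id)
    before_last_ack = tracker.fully_processed()     # one tuple is still outstanding
    tracker.ack(ids[-1])
    return before_last_ack, tracker.fully_processed()
```

The tracker uses constant memory regardless of how many tuples are in flight. A zero checksum with tuples still outstanding is possible in principle if the outstanding ids happen to XOR to zero, which is why such schemes typically rely on large random identifiers.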

6.
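Infrared dim small target detection based on improved singular value decomposition   Total citations: 1, self-citations: 0, other citations: 1
Feng Yang. Laser Technology (激光技术), 2016, 40(3): 335-338
To overcome the weakened target intensity that affects traditional target detection methods based on singular value decomposition, an improved singular value decomposition method is applied to infrared dim small target detection. Based on the properties of singular value decomposition, the middle-order singular values, which contribute most to the target, are given a nonlinear correction, the remaining singular values are set to zero, and the image is reconstructed to obtain the target image after background suppression. The results show that the method not only preserves and enhances target energy and improves the signal-to-clutter ratio and contrast of the target signal, but also achieves good background suppression.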

7.
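For node position estimation and routing control in ad hoc networks, an integrated protocol, OLSR-P (Optimized Link State Routing and Positioning), is proposed on the basis of the OLSR (Optimized Link State Routing) protocol to perform routing and positioning simultaneously. The protocol improves OLSR so that routing overhead and positioning are combined effectively, and it modifies the packet structure so that positioning is achieved using the routing overhead of the original protocol. Simulation results show that OLSR-P not only performs routing and positioning simultaneously but also keeps overhead under control.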

8.
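When a single-event upset occurs in a dataflow-driven computation channel, it can produce erroneous results or even prevent the channel from working properly. To address this problem, this paper proposes a dual modular redundancy fault-tolerance scheme based on address rollback. The scheme compares two computation channels to detect whether the result is erroneous, uses address rollback to restore the address information and drive recomputation of the original data, and applies a specifically optimized rollback for dataflow computations with data dependencies. The scheme is implemented on a dataflow-driven arithmetic unit. Experimental results show that it achieves a reliability above 99%, with an area overhead only 76% of that of triple modular redundancy and a time overhead up to 50% lower than that of dual modular redundancy, balancing area and time costs.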
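The compare-then-roll-back loop can be sketched in software as follows. This is only an illustration of the control flow under stated assumptions: the "address" is modeled as a list index, the helper names `make_channel` and `run_dmr` are hypothetical, and the paper's optimized rollback for dependent dataflow computations is not modeled.

```python
def make_channel(upset_on_call=None):
    """Return a computation channel; optionally flip one result bit on a given call."""
    calls = 0

    def channel(op, value):
        nonlocal calls
        calls += 1
        result = op(value)
        if calls == upset_on_call:
            result ^= 1  # model a transient single-bit upset
        return result

    return channel


def run_dmr(data, op, channel_a, channel_b, max_retries=3):
    """Run op over data on two channels and recompute any item whose results differ."""
    results = []
    address = 0
    retries = 0
    while address < len(data):
        a = channel_a(op, data[address])
        b = channel_b(op, data[address])
        if a == b:
            results.append(a)
            address += 1
            retries = 0
        else:
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"persistent mismatch at address {address}")
            # Roll back: `address` is left unchanged, so the same input is recomputed.
    return results
```

For example, `run_dmr([1, 2, 3, 4], lambda x: x * x + 1, make_channel(), make_channel(upset_on_call=2))` hits one mismatch on the second item, recomputes it, and still returns the fault-free results. Two channels can detect a disagreement but cannot tell which side is wrong, which is why recomputation takes the place of the voter used in triple modular redundancy.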

9.
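A rollback recovery strategy based on block message logging   Total citations: 5, self-citations: 0, other citations: 5
Yang Jinmin, Zhang Dafang. Acta Electronica Sinica (电子学报), 2004, 32(5): 857-859
This paper presents a rollback recovery protocol based on block message logging, builds a performance model for it, and evaluates the protocol's average overhead. Block message logging is a configurable, general method of which pessimistic message logging and coordinated checkpointing are two special cases. The performance analysis shows that the protocol's configuration parameters can be optimized and that the block message logging strategy can improve protocol performance.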

10.
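A checkpointing strategy for high-performance cluster computing systems   Total citations: 1, self-citations: 1, other citations: 0
Checkpointing is widely used to improve the fault tolerance of high-performance cluster computing systems. Most current checkpointing uses a coordinated protocol, whose synchronization operations incur a large time overhead as the cluster scales and block normal computation. To address this problem, an uncoordinated checkpointing protocol is used to eliminate the synchronization operations, message logging is used to guarantee the consistency of the system state, and background threads are used to make checkpointing transparent. Finally, experiments on a typical system verify the effectiveness of the method and compare its time overhead with that of the coordinated protocol.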

11.
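Design and implementation of a fault-tolerant parallel algorithm for matrix LU decomposition   Total citations: 1, self-citations: 0, other citations: 1
The paper defines fault-tolerant parallel algorithms and proposes a new fault-tolerant parallel algorithm based on parallel recomputation. For matrix LU decomposition, which appears in many compute-intensive tasks, a corresponding fault-tolerant parallel algorithm based on parallel recomputation is designed, and its performance is evaluated and compared with the checkpointing method. The results show that the fault-tolerant parallel algorithm for matrix LU decomposition has a performance advantage over checkpointing.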

12.
Real-time computer systems are often used in harsh environments, such as aerospace and industry. Such systems are subject to many transient faults while in operation. Checkpointing reduces the recovery time from a transient fault by saving intermediate states of a task in a reliable storage facility and then, on detection of a fault, restoring from a previously stored state. The interval between checkpoints affects the execution time of the task. Whereas inserting more checkpoints and reducing the interval between them reduces the reprocessing time after faults, checkpoints have associated execution costs, and inserting extra checkpoints increases the overall task execution time. Thus, a trade-off between the reprocessing time and the checkpointing overhead leads to an optimal checkpoint placement strategy that optimizes certain performance measures. Real-time control systems are characterized by the timely and correct execution of iterative tasks within deadlines. Reliability is the probability that a system functions according to its specification over a period of time. This paper reports on the reliability of a checkpointed real-time control system in which errors are detected at checkpointing time. The reliability is used as a performance measure to find the optimal checkpointing strategy. For a single-task control system, the reliability equation over a mission time is derived using a Markov model. Detecting errors at checkpointing time makes the reliability jitter with the number of checkpoints, which calls for other search algorithms to find the optimal number of checkpoints. By considering the properties of this reliability jitter, a simple algorithm is provided to find the optimal number of checkpoints efficiently. Finally, the reliability model is extended to multiple tasks by means of a task allocation algorithm.
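The trade-off between reprocessing time and checkpointing overhead can be made concrete with a simplified model. The sketch below does not reproduce the paper's Markov reliability model: it substitutes expected completion time as the performance measure and assumes Poisson faults that are detected immediately, equidistant checkpoints, and no restart cost. Under those assumptions a segment of length T takes (e^(rate*T) - 1) / rate on average. The function names are hypothetical.

```python
import math


def expected_time(work, n_segments, checkpoint_cost, fault_rate):
    """Expected completion time with the work split into n equal checkpointed segments.

    Assumptions: Poisson faults, immediate detection, restart from the start
    of the current segment, zero restart cost.
    """
    segment = work / n_segments + checkpoint_cost
    return n_segments * (math.exp(fault_rate * segment) - 1.0) / fault_rate


def best_segment_count(work, checkpoint_cost, fault_rate, max_segments=1000):
    """Exhaustively search for the segment count with the lowest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_time(work, n, checkpoint_cost, fault_rate),
    )
```

With too few segments the reprocessing term dominates, and with too many the checkpoint cost dominates, so the minimum lies in between. The abstract notes that when errors are only detected at checkpoints the reliability curve jitters, so a smooth model like this one would not be a safe basis for the search in that setting.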

13.
We herein propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault-tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and characteristics of tasks and computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real-world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.

14.
A number of checkpointing and message logging algorithms have been proposed to support fault tolerance of mobile computing systems. However, little attention has been paid to the optimistic message logging scheme. Optimistic logging has a lower failure-free operation cost compared to other logging schemes. It also has a lower failure recovery cost compared to the checkpointing schemes. This paper presents an efficient scheme to implement optimistic logging for the mobile computing environment. In the proposed scheme, the task of logging is assigned to the mobile support station so that volatile logging can be utilized. In addition, to reduce the message overhead, the mobile support station takes care of dependency tracking and the potential dependency between mobile hosts is inferred from the dependency between mobile support stations. The performance of the proposed scheme is evaluated by an extensive simulation study. The results show that the proposed scheme requires a small failure-free overhead and the cost of unnecessary rollback caused by the imprecise dependency is adjustable by properly selecting the logging frequency.

15.
In this paper, we present an approach to improving the performance of timed cosimulation. Our approach applies the optimistic simulation concept to timed cosimulation to reduce synchronization overhead. It consists of (1) a predictive method for the synchronization between optimistic and synchronous simulators and (2) a method for reducing the state saving overhead inherent in optimistic simulation. To reduce the synchronization overhead, the predictive synchronization method predicts the time point when the next event is transferred between hardware and software. Then the optimistic simulator runs optimistically until the predicted time point. Because of prediction and optimistic simulation, the optimistic simulator may need to roll back and re-execute. To support rollbacks during optimistic simulation, states of the simulator are stored at checkpoints. In optimistic simulation, state saving can cause significant overhead in run-time and memory usage. To reduce the state saving overhead, we perform state saving on a task basis, which enables saving only the state of the currently running task without saving the whole state of the simulator at each checkpoint. In particular, the single-checkpoint property of hardware tasks minimizes the number of state savings in hardware simulation. We demonstrate the efficiency of the presented approach through cosimulation of two embedded system design examples.

16.
Employing fault tolerance often introduces a time overhead, which may cause a deadline violation in real-time systems (RTS). Therefore, for RTS it is important to optimize the fault tolerance techniques such that the probability of meeting the deadlines, i.e. the Level of Confidence (LoC), is maximized. Previous studies have focused on evaluating the LoC for equidistant checkpointing. However, no studies have addressed the problem of evaluating the LoC for non-equidistant checkpointing. In this work, we provide an expression to evaluate the LoC for non-equidistant checkpointing. Further, we detail an exhaustive search approach to find the distribution of a given number of checkpoints that results in the maximal LoC. Since the exhaustive search approach is very time-consuming, we propose the Clustered Checkpointing method, a heuristic that distributes checkpoints in a number of clusters with the goal of maximizing the LoC. The results show that the LoC can be improved when non-equidistant checkpointing is used. Further, the results indicate that the proposed Clustered Checkpointing method is capable of finding the distribution that results in the maximal LoC in a much shorter time than the exhaustive search approach, while considering only a few clusters.

17.
In this brief, we propose two new concurrent error-detection (CED) schemes for a class of sorting networks, e.g., odd-even transposition, bitonic, and perfect shuffle sorting networks. A probabilistic method is developed to analyze the fault coverage, and the hardware overhead is evaluated. We first propose a CED scheme by which all errors caused by single faults in a concurrent checking sorting network can be detected. This scheme is the first one available to use significantly less hardware overhead than duplication without compromising throughput. From this scheme, we develop another fault detection scheme which sharply reduces the hardware overhead (using an additional 10% to 30% hardware) but still achieves virtually 100% fault coverage.

18.
Concurrent error detection (CED) based on time redundancy entails performing the normal computation and the re-computation at different times and then comparing their results. Implemented this way, time redundancy can only detect transient faults. We present two algorithm-level time-redundancy-based CED schemes that exploit register transfer level (RTL) implementation diversity to detect transient and permanent faults. At the RTL, implementation diversity can be achieved either by changing the operation-to-operator allocation or by shifting the operands before re-computation. By exploiting allocation diversity and data diversity, a stuck-at fault will affect the two results in two different ways. The proposed schemes yield good fault detection probability with very low area overhead. We used the Synopsys Behavioral Compiler (BC) to validate the schemes.
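The data-diversity variant (shifting the operands before re-computation) can be illustrated with a toy software model. This is a sketch under stated assumptions, not the paper's RTL schemes: the adder is a Python function with an optional stuck-at-0 output bit, and the names `make_adder` and `checked_add` are hypothetical.

```python
def make_adder(stuck_at_zero_bit=None):
    """Return an adder; optionally force one output bit to 0 to model a stuck-at fault."""

    def add(a, b):
        total = a + b
        if stuck_at_zero_bit is not None:
            total &= ~(1 << stuck_at_zero_bit)
        return total

    return add


def checked_add(add, a, b):
    """Compute a + b twice, the second time on left-shifted operands, and compare."""
    first = add(a, b)
    # The shifted computation puts each result bit on a different output line,
    # so a permanent fault on one line corrupts the two results differently.
    second = add(a << 1, b << 1) >> 1
    error_detected = first != second
    return first, error_detected
```

With a fault-free adder, `checked_add(make_adder(), 3, 2)` reports no error. With output bit 2 stuck at 0, the direct sum of 3 and 2 is corrupted while the shifted recomputation is not, so the mismatch exposes the permanent fault that plain recomputation on the same operands would repeat identically. Some operand values can still mask the fault in both computations, so this sketch does not guarantee full coverage.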

19.
Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computing systems over Mobile IP is proposed. The protocol reduces the number of checkpoints per checkpointing process to nearly the minimum, so that fewer checkpoints need to be transmitted through the costly wireless link. Our protocol also performs very well in minimizing the number and size of messages transmitted in the wireless network. In addition, the protocol is nonblocking because inconsistencies can be avoided by the piggybacked information in every message. Therefore, the protocol brings very little overhead to a mobile host with limited resources. Additionally, by taking advantage of reliable timers in mobile support stations, the time-based checkpointing protocol can adapt to wide area networks.

20.
The advent of advanced microelectronic technologies and scaling down to nanometer dimensions has made current digital systems more susceptible to faults and increased the demand for reliable and high-performance computing. Current solutions have so far used the parity prediction scheme to increase reliability and detect faults in adder modules, but they add perceptible area overhead to the circuit. In this paper, we present two new efficient methods for fault detection and localization, in addition to full error correction, targeting stuck-at and multi-cycle transient (MCT) faults in radix-2 signed-digit adders through a combination of time and hardware redundancy. In this study, we use a self-checking full adder that can identify a fault based on internal functionality to detect any fault in the adder modules. The detection of a fault is followed by input inversion, recomputation, and appropriate output inversion to correct the error and localize the fault. The error-correction method employs fault masking by utilizing the self-dual concept, which is based on the fact that in the presence of a fault, the designed technique results in a fault-free complement of the expected output when fed by the complement of its input operands. In addition, the existence of any fault in the input lines of the adder modules can be identified by a low-cost parity checking error-detection approach, and a faulty module can be localized by comparing the faulty output from the first computation with the fault-free output from the recomputation. Based on the experimental results, the area occupied by our designs is approximately 50% of the area used by previous designs that employ the parity prediction scheme. In addition to the area reduction, our design approaches result in higher reliability with lower power consumption and low time delay.
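The self-dual property behind the inversion-and-recomputation step can be checked on an ordinary binary full adder. The sketch below is an illustration only: the paper targets radix-2 signed-digit adders, whereas this uses a plain full adder, and the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, cin):
    """One-bit full adder returning (sum, carry_out)."""
    total = a ^ b ^ cin
    carry = (a & b) | (a & cin) | (b & cin)
    return total, carry


def is_self_dual(adder):
    """Check that complementing every input complements every output."""
    for a, b, cin in product((0, 1), repeat=3):
        s, c = adder(a, b, cin)
        s_inv, c_inv = adder(1 - a, 1 - b, 1 - cin)
        if (s_inv, c_inv) != (1 - s, 1 - c):
            return False
    return True


def recompute_with_inversion(adder, a, b, cin):
    """Feed complemented inputs, then complement the outputs."""
    s_inv, c_inv = adder(1 - a, 1 - b, 1 - cin)
    return 1 - s_inv, 1 - c_inv


def sum_stuck_at_zero(a, b, cin):
    """Faulty adder model (assumption for illustration): sum output stuck at 0."""
    _, carry = full_adder(a, b, cin)
    return 0, carry
```

For inputs (1, 0, 0) the faulty adder returns the wrong sum, but the inverted recomputation drives the stuck line toward the value it is already stuck at, so `recompute_with_inversion(sum_stuck_at_zero, 1, 0, 0)` recovers the correct result. Comparing the two runs is what lets the fault be detected and localized.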
