Similar literature
 Found 19 similar articles; search took 125 ms.
1.
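刘勇鹏, 王锋, 卢凯, 刘勇燕. 《电子学报》 (Acta Electronica Sinica), 2012, 40(2): 223-229
In large-scale parallel computing systems, parallel checkpointing causes a large number of nodes to save their computation state simultaneously, which incurs a huge file storage overhead and places heavy access pressure on the communication and storage systems. Data compression can shrink checkpoint files, reducing both the storage overhead and the access pressure, but it also introduces extra computation overhead for compression. For heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that uses a series of optimizations to reduce the computation latency introduced by compression, including pipelined double write-buffer queues, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes the implementation of the technique on the Tianhe-1 (天河一号) system and presents a comprehensive evaluation of the resulting checkpointing system. Experimental data show that the method is feasible, efficient, and practical for large-scale heterogeneous parallel computing systems.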
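The pipelining and write-merging ideas in this abstract can be illustrated with a minimal sketch. It assumes CPU-side `zlib` compression in a worker thread in place of the paper's GPU-accelerated compressor, and the names `BLOCK_SIZE`, `FLUSH_THRESHOLD`, and `write_checkpoint` are hypothetical, not taken from the paper.

```python
import io
import queue
import threading
import zlib
from typing import BinaryIO

BLOCK_SIZE = 64 * 1024        # hypothetical size of one state block
FLUSH_THRESHOLD = 256 * 1024  # hypothetical: merge compressed chunks up to this size


def write_checkpoint(state: bytes, out: BinaryIO) -> int:
    """Compress `state` block by block in a worker thread while the caller
    merges the compressed chunks into larger writes.
    Returns the number of write() calls issued."""
    compressed_q: queue.Queue = queue.Queue(maxsize=8)  # bounded buffer between stages

    def compress_stage() -> None:
        comp = zlib.compressobj()
        for off in range(0, len(state), BLOCK_SIZE):
            chunk = comp.compress(state[off:off + BLOCK_SIZE])
            if chunk:  # zlib may buffer internally and return nothing yet
                compressed_q.put(chunk)
        compressed_q.put(comp.flush())
        compressed_q.put(None)  # end-of-stream marker

    worker = threading.Thread(target=compress_stage)
    worker.start()

    writes = 0
    pending = bytearray()
    while True:
        chunk = compressed_q.get()
        if chunk is None:
            break
        pending += chunk
        if len(pending) >= FLUSH_THRESHOLD:  # merge small writes into one large write
            out.write(bytes(pending))
            writes += 1
            pending.clear()
    if pending:
        out.write(bytes(pending))
        writes += 1
    worker.join()
    return writes


if __name__ == "__main__":
    buf = io.BytesIO()
    n_writes = write_checkpoint(b"abc" * 500_000, buf)
    print(n_writes, len(buf.getvalue()))
```

The bounded queue lets compression of later blocks overlap with writing of earlier ones, which is the general point of the pipeline; the real system's double write-buffer queues and GPU scheduling are not modeled here.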

2.
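Improving the performance of checkpointing schemes that use task duplication   Total citations: 4 (self-citations: 0, citations by others: 4)
Checkpointing is a common technique for reducing program execution time in the presence of faults. Combining checkpointing with task duplication not only achieves effective fault recovery but also provides thorough fault detection. The overhead of such a system comes mainly from two sources: the cost of comparing and saving each checkpoint, and rollbacks caused by faults. This paper uses incremental checkpointing to improve the method proposed by Ziv and Bruck. The improved method not only effectively reduces the cost of comparing and saving checkpoints but also avoids rollbacks caused by latent faults. Analysis shows that the improved method performs better than that of Ziv and Bruck.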
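As a rough illustration of why incremental checkpointing lowers the save cost, the sketch below stores only the blocks whose content changed since the previous checkpoint. It is not the scheme from the paper: the block size, the use of SHA-256 digests to detect changes, and all function names are assumptions made for the example.

```python
import hashlib

BLOCK = 4096  # hypothetical block size in bytes


def block_digests(state: bytes) -> list[bytes]:
    """One digest per fixed-size block of the state."""
    return [hashlib.sha256(state[i:i + BLOCK]).digest()
            for i in range(0, len(state), BLOCK)]


def incremental_checkpoint(state: bytes, prev_digests: list[bytes]):
    """Return (delta, digests), where delta maps block index -> block bytes
    for the blocks that are new or changed since the previous checkpoint."""
    digests = block_digests(state)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            delta[idx] = state[idx * BLOCK:(idx + 1) * BLOCK]
    return delta, digests


def restore(base: bytes, delta: dict[int, bytes], length: int) -> bytes:
    """Rebuild a state of `length` bytes from the previous state plus a delta."""
    buf = bytearray(base[:length].ljust(length, b"\0"))
    for idx, data in delta.items():
        buf[idx * BLOCK:idx * BLOCK + len(data)] = data
    return bytes(buf)
```

In a duplicated-task setting, the same digests could also be compared across replicas instead of comparing full states, although whether the paper does this is not stated in the abstract.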

3.
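To address the dynamic and unstable state of nodes during task scheduling in unmanned aerial vehicle (UAV) swarms, this paper proposes a fault-tolerant task scheduling method for multiple computing nodes that avoids task interruption as far as possible. The method first constructs a task allocation strategy over multiple computing nodes whose optimization objective is to minimize the average task completion time. Then, based on the probability distributions of the task completion time and of the residence time of edge computing nodes, it quantifies the risk of executing a task on a computing node as an extra overhead time. Finally, it replaces the original completion time with the sum of the task completion time and the extra overhead time to design a risk-aware task allocation strategy. In a simulation environment, the proposed method was compared experimentally with three baseline scheduling methods. The results show that it effectively reduces the average task response time, the average number of task executions, and the task deadline miss rate. This demonstrates that the proposed method reduces the extra overhead caused by task rescheduling and re-execution, enables the scheduling of distributed cooperative computing tasks, and offers new technical support for UAV swarm networks in complex scenarios.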

4.
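WOB: a new file checkpointing strategy   Total citations: 6 (self-citations: 1, citations by others: 5)
The basis of fault tolerance in distributed/parallel systems is single-process checkpointing and rollback recovery, and saving and restoring the state of a process's active files is an important aspect of this technique. The delayed-write strategy proposed in this paper implements checkpointing of user files and effectively solves the inconsistency between user file contents and the global process state when a fault occurs. It is transparent to the user, and through optimized sizing of the memory buffer, latency hiding, and similar means, the strategy outperforms other methods in space overhead, normal running time, and recovery time.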

5.
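Study and improvement of the preemption behavior of the EDF scheduling algorithm   Total citations: 9 (self-citations: 0, citations by others: 9)
By analyzing the preemption behavior of real-time tasks in embedded systems that use the preemptive EDF algorithm, a preemption model for a periodic task set is established. The model describes mathematically how preemption relations, schedulability, and scheduling overhead relate to task attributes such as period, execution time, deadline, and start time. Based on this preemption model, an improved preemptive EDF scheduling algorithm is proposed: the start times of the real-time tasks, computed offline by a genetic-algorithm-based optimization method, are used as a scheduling parameter of the target system. This reduces the number of preemptions and changes the preemption relations, thereby improving the system's schedulability and real-time performance. Finally, experiments verify the effectiveness of the improved preemptive EDF scheduling algorithm.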
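To make the link between task start times and preemption counts concrete, here is a small unit-time simulation of preemptive EDF. It is only a sketch under stated assumptions: relative deadlines equal periods, time advances in integer ticks, and a brute-force search over one task's start time stands in for the paper's genetic-algorithm optimizer. `Task`, `simulate_edf`, and `tune_offset` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass
class Task:
    name: str
    period: int
    wcet: int        # execution time per job, in ticks
    offset: int = 0  # start time of the first job (the tunable parameter)


def simulate_edf(tasks, horizon):
    """Simulate preemptive EDF tick by tick over [0, horizon).
    Assumes relative deadline == period.
    Returns (preemptions, deadline_misses)."""
    ready = []      # each job is [absolute_deadline, task_index, remaining_ticks]
    running = None  # job that ran in the previous tick and is still unfinished
    preemptions = misses = 0
    for t in range(horizon):
        for i, task in enumerate(tasks):
            if t >= task.offset and (t - task.offset) % task.period == 0:
                ready.append([t + task.period, i, task.wcet])
        misses += sum(1 for job in ready if job[0] == t)  # unfinished at its deadline
        if not ready:
            running = None
            continue
        # Earliest deadline first; on a tie keep the running job to avoid a switch.
        job = min(ready, key=lambda j: (j[0], j is not running, j[1]))
        if running is not None and running is not job:
            preemptions += 1
        job[2] -= 1
        if job[2] == 0:
            ready.remove(job)
            running = None
        else:
            running = job
    return preemptions, misses


def tune_offset(tasks, index, horizon):
    """Brute-force stand-in for the offline optimizer: try every start time of
    one task and keep the one with the fewest misses, then fewest preemptions."""
    def cost(offset):
        trial = [replace(t, offset=offset) if i == index else t
                 for i, t in enumerate(tasks)]
        preemptions, misses = simulate_edf(trial, horizon)
        return (misses, preemptions)
    return min(range(tasks[index].period), key=cost)
```

For example, with `Task("short", 4, 1)` and `Task("long", 12, 3, offset=2)`, the long task is interrupted once per period over a 24-tick horizon, whereas releasing it at time 0 lets each of its jobs finish in the gap between two short jobs.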

6.
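A rollback recovery strategy based on block message logging   Total citations: 5 (self-citations: 0, citations by others: 5)
杨金民, 张大方. 《电子学报》 (Acta Electronica Sinica), 2004, 32(5): 857-859
This paper presents a rollback recovery protocol based on block message logging, establishes its performance model, and evaluates the protocol's average overhead. Block message logging is a configurable, generalized method, of which pessimistic message logging and coordinated checkpointing are two special cases. Performance analysis shows that the protocol's configuration parameter can be optimized and that the block message logging strategy can optimize protocol performance.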

7.
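To meet the QoS requirements of information collection in local regions of large-scale wireless sensor networks, and to minimize network energy consumption and control overhead while guaranteeing network connectivity, this paper proposes OQBED, an on-demand QoS protocol based on balanced energy consumption. The protocol uses approximately static clustering and an on-demand information collection strategy to reduce control overhead, balances energy consumption by effectively selecting the number of working nodes and rotating cluster heads, and selects the energy-optimal route that satisfies the QoS conditions among cluster heads for data fusion and transmission. Simulation results show that OQBED effectively reduces control overhead, significantly extends network lifetime, and greatly improves the success rate of information transmission.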

8.
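This paper proposes an architecture for an SMS-based mobile payment system that uses a security protocol based on a dual public-key cryptosystem. Analysis shows that the security protocol is simple and feasible to implement and executes efficiently.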

9.
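Existing handover authentication protocols for terminal nodes in fog computing environments still have shortcomings in storage, computation, and security. To address them, this paper proposes an efficient handover authentication protocol for terminal nodes. The protocol combines a two-factor combined public key (TF-CPK) with an authentication ticket to achieve mutual authentication and session key agreement between fog nodes and terminal nodes. Security and performance analysis shows that the protocol supports untraceability, resists many known attacks and security threats, and has low system overhead.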

10.
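To improve the overall performance of proxy systems, and based on the temporal locality and similarity of accesses by internal network users, this paper builds on existing distributed cache systems to propose a new distributed proxy cache system: the two-layer cache cluster. The system is divided into an intranet cluster cache layer and a proxy cluster cache layer. Its two-layer proxy cache structure makes full use of existing internal network resources and spreads the load on the proxies. It reduces the communication overhead between proxies and also improves the utilization of cache resources...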

11.
Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computing systems over Mobile IP is proposed. The protocol reduces the number of checkpoints per checkpointing process to nearly the minimum, so that fewer checkpoints need to be transmitted through the costly wireless link. Our protocol also performs very well at minimizing the number and size of messages transmitted in the wireless network. In addition, the protocol is nonblocking because inconsistencies can be avoided by the piggybacked information in every message. Therefore, the protocol brings very little overhead to a mobile host with limited resources. Additionally, by taking advantage of reliable timers in mobile support stations, the time-based checkpointing protocol can adapt to wide area networks.

12.
We propose a heuristic redundancy selection algorithm that combines resubmission, replication, and checkpointing redundancies to reduce the resiliency overhead in fault‐tolerant workflow scheduling. The appropriate combination of these redundancies for workflow tasks is obtained in two consecutive phases. First, to compute the replication vector (number of task replicas), we apportion the set of provisioned resources among concurrently executing tasks according to their needs. Subsequently, we obtain the optimal checkpointing interval for each task as a function of the number of replicas and the characteristics of the tasks and the computational environment. We formulate the problem of obtaining the optimal checkpointing interval for replicated tasks in situations where checkpoint files can be exchanged among computational resources. The results of our simulation experiments, on both randomly generated workflow graphs and real‐world applications, demonstrated that both the proposed replication vector computation algorithm and the proposed checkpointing scheme reduced the resiliency overhead.

13.
The mobility database, which stores users’ location records, is essential for connecting calls to mobile users on personal communication networks. If the mobility database fails, calls to mobile users may not be set up in time. This paper studies failure restoration of the mobility database. We study per-user location record checkpointing schemes that checkpoint a user’s record into non-volatile storage from time to time on a per-user basis. When the mobility database fails, the user location records can be restored from the backup storage. Numerical analysis is used to choose the optimum checkpointing interval so that the overall cost is minimized. The cost function that we consider includes the cost of checkpointing a user’s location record and the cost of paging a user due to an invalid location record. Our results indicate that when user registration intervals are exponentially distributed, the user record should never be checkpointed if checkpointing costs more than paging. Otherwise, if paging costs more, the user record should always be checkpointed when a user registers.

14.
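Design and implementation of a fault-tolerant parallel algorithm for matrix LU decomposition   Total citations: 1 (self-citations: 0, citations by others: 1)
A definition of fault-tolerant parallel algorithms is given, and a new fault-tolerant parallel algorithm based on parallel recomputing is proposed. For matrix LU decomposition, which appears in many compute-intensive tasks, a corresponding fault-tolerant parallel algorithm based on parallel recomputing is designed. Its performance is evaluated and compared with the checkpointing method. The results show that the fault-tolerant parallel algorithm for matrix LU decomposition has a performance advantage over checkpointing.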

15.
Real-time computer systems are often used in harsh environments, such as aerospace, and in industry. Such systems are subject to many transient faults while in operation. Checkpointing enables a reduction in the recovery time from a transient fault by saving intermediate states of a task in a reliable storage facility, and then, on detection of a fault, restoring from a previously stored state. The interval between checkpoints affects the execution time of the task. Whereas inserting more checkpoints and reducing the interval between them reduces the reprocessing time after faults, checkpoints have associated execution costs, and inserting extra checkpoints increases the overall task execution time. Thus, a trade-off between the reprocessing time and the checkpointing overhead leads to an optimal checkpoint placement strategy that optimizes certain performance measures. Real-time control systems are characterized by a timely, and correct, execution of iterative tasks within deadlines. The reliability is the probability that a system functions according to its specification over a period of time. This paper reports on the reliability of a checkpointed real-time control system, where any errors are detected at the checkpointing time. The reliability is used as a performance measure to find the optimal checkpointing strategy. For a single-task control system, the reliability equation over a mission time is derived using the Markov model. Detecting errors at the checkpointing time makes reliability jitter with the number of checkpoints. This makes it necessary to apply other search algorithms to find the optimal number of checkpoints. By considering the properties of the reliability jittering, a simple algorithm is provided to find the optimal checkpoints effectively. Finally, the reliability model is extended to include multiple tasks by a task allocation algorithm.
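The trade-off between reprocessing time and checkpointing overhead can be sketched with a common first-order model. This is not the Markov reliability model of the paper above. It assumes that failures are rare relative to the checkpoint interval and that a failure loses half an interval of work on average, so the fraction of time wasted is roughly C/T + T/(2M) for checkpoint cost C, interval T, and mean time between failures M. The function names are hypothetical.

```python
import math


def overhead_fraction(interval: float, ckpt_cost: float, mtbf: float) -> float:
    """First-order estimate of wasted time per unit of useful work:
    checkpoint cost amortized over the interval, plus expected rework."""
    return ckpt_cost / interval + interval / (2.0 * mtbf)


def best_interval(ckpt_cost: float, mtbf: float, candidates) -> float:
    """Pick the candidate interval with the lowest estimated overhead."""
    return min(candidates, key=lambda t: overhead_fraction(t, ckpt_cost, mtbf))


if __name__ == "__main__":
    # Under this model the minimum sits at sqrt(2 * C * M).
    print(best_interval(2.0, 3600.0, range(10, 1000, 10)), math.sqrt(2 * 2.0 * 3600.0))
```

Because this overhead curve is smooth with a single minimum, a closed form or a simple scan is enough. The abstract notes that its reliability measure instead jitters with the number of checkpoints, which is why it resorts to a dedicated search algorithm.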

16.
The selection of an optimal checkpointing strategy has most often been considered in the transaction processing environment where systems are allowed unlimited repairs. In this environment an optimal strategy maximizes the time spent in the normal operating state and consequently the rate of transaction processing. This paper seeks a checkpoint strategy which maximizes the probability of critical-task completion on a system with limited repairs. These systems can undergo failure and repair only until a repair time exceeds a specified threshold, at which time the system is deemed to have failed completely. For such systems, a model is derived which yields the probability of completing the critical task when each checkpoint operation has fixed cost. The optimal number of checkpoints can increase as system reliability improves. The model is extended to include a constraint which enforces timely completion of the critical task.

17.
Command operation procedures (COPs) are protocols operating in the transfer layer of the telecommand and telemetry architectures designed by the International Consultative Committee for Space Data Systems. The authors present an evaluation of virtual frame transmission times, throughput efficiency, buffer occupancy, and waiting times for the COP-2. COP-2 represents an ARQ-type (automatic repeat request) protocol with checkpointing that is comparable to the checkpoint mode protocol. A discrete two-class priority service model with random server interruptions is considered. The analysis is based on the concepts of level crossing analysis as defined in J.W. Cohen (1977) and J.G. Shanthikumar (1981). The results of this work will be used by the European Space Agency as design guidelines for the European Data Network.

18.
This paper studies failure restoration of the mobility database for the Universal Mobile Telecommunications System (UMTS). We consider a per-user checkpointing approach for the home location register (HLR) database. In this approach, individual HLR records are saved into a backup database from time to time. When a failure occurs, the backup record is restored to the mobility database. We first describe a commonly used basic checkpoint algorithm. Then, we propose a new checkpoint algorithm. An analytic model is developed to compare these two algorithms in terms of the checkpoint cost and the probability that an HLR backup record is obsolete. This analytic model is validated against simulation experiments. Numerical examples indicate that our new algorithm may significantly outperform the basic algorithm in terms of both performance measures.

19.