期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Checkpoint Management with Double Modular Redundancy Based on the Probability of Task Completion

Seong Woo Kwak Kwan-Ho You Jung-Min Yang 《计算机科学技术学报》2012,27(2):273-280

This paper proposes a checkpoint rollback strategy for real-time systems with double modular redundancy.Without built-in fault-detection and spare processors,our scheme is able to recover from both transient and permanent faults.Two comparisons are conducted at each checkpoint.First,the states stored in two consecutive checkpoints of one processor are compared for checking integrity of the processor.The states of two processors are also compared for detecting faults and the system rolls back to the previous checkpoint whenever required by logic of the proposed scheme.A Markov model is induced by the fault recovery scheme and analyzed to provide the probability of task completion within its deadline.The optimal number of checkpoints is selected so as to maximize the probability of task completion. 相似文献

2.

Probabilistic optimisation of checkpoint intervals for real-time multi-tasks

Seong Woo Kwak 《International journal of systems science》2013,44(4):595-603

This article considers the checkpoint placement problem for real-time systems. In our environment, multiple real-time tasks with arbitrary periods are scheduled in the system by the rate monotonic algorithm, and checkpoints are inserted at a constant interval in each task while the width of the interval is different with respect to the task. We derive an explicit formula of the probability that all the tasks are successfully completed with a given set of checkpoint intervals. Then we determine the optimal checkpoint intervals that maximise the probability of task completion. The probability computation includes the schedulability analysis with respect to the numbers of re-executed checkpoint intervals. Our method does not necessitate any algebraic condition on the periods of the scheduled tasks. 相似文献

3.

A reliable checkpoint storage strategy for grid

Sana Malik Babar Nazir Kalim Qureshi Imran Ali Khan 《Computing》2013,95(7):611-632

Computational grids are composed of heterogeneous autonomously managed resources. In such environment, any resource can join or leave the grid at any time. It makes the grid infrastructure unreliable in nature resulting in delay and failure of executing jobs. Thus, fault tolerance becomes a vital aspect of grid for realizing reliability, availability and quality-of-service. The most common technique, for achieving fault tolerance, used in High Performance Computing is rollback recovery. It relies on the availability of checkpoints and stability of storage media. Thus the checkpoints are replicated on storage media. It increases the job execution time, if replication is not done in proper manner. Furthermore, dedicating powerful resources solely as checkpoint storage results in loss of computation power of these resources. It may results in bottlenecks, when the load on the network is high. To address the problem, in this paper checkpoint replication based fault tolerance strategy named as Reliable Checkpoint Storage Strategy (RCSS) is proposed. In RCSS, the checkpoints are replicated on all checkpoint servers in the grid in distributed manner. It decreases the checkpoint replication time and in turn improves the overall job execution time. Additionally, if a resource fails during execution of a job, the RCSS restarts the job from its last valid checkpoint taken from any checkpoint server in the grid. Furthermore to increase the grid performance, CPU cycles of checkpoint servers are also utilized during high load on network. To evaluate the performance of RCSS simulations are carried out using GridSim. The simulation results show that RCSS outperforms in intra-cluster Checkpoint wave completion time by 12.5 % with varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with varying number of clusters. Additionally RCSS reduces replication time within cluster by 39.5 %. 相似文献

4.

Scheduling divisible loads with time and cost constraints

Drozdowski M. Shakhlevich N. V. 《Journal of Scheduling》2021,24(5):507-521

In distributed computing, divisible load theory provides an important system model for allocation of data-intensive computations to processing units working in parallel. The main task is to define how a computation job should be split into parts, to which processors those parts should be allocated and in which sequence. The model is characterized by multiple parameters describing processor availability in time, transfer times of job parts to processors, their computation times and processor usage costs. The main criteria are usually the schedule length and cost minimization. In this paper, we provide the generalized formulation of the problem, combining key features of divisible load models studied in the literature, and prove its NP-hardness even for unrestricted processor availability windows. We formulate a linear program for the version of the problem with a fixed number of processors. For the case with an arbitrary number of processors, we close the gaps in the study of special cases, developing efficient algorithms for single criterion and bicriteria versions of the problem, when transfer times are negligible.

相似文献

5.

Overload management in real-time control applications using (m,k)-firm guarantee 总被引：1，自引：0，他引：1

Ramanathan P. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(6):549-559

Tasks in a real-time control application are usually periodic and they have deadline constraints by which each instance of a task is expected to complete its computation, even in the adverse circumstances caused by component failures. Techniques to recover from processor failures often involve a reconfiguration in which all tasks are assigned to fault-free processors. This reconfiguration may result in processor overload where it is no longer possible to meet the deadlines of all tasks. In this paper, we discuss an overload management technique which discards selected task instances in such a way that the performance of the control loops in the system remain satisfactory even after a failure. The technique is based on the rationale that real-time control applications can tolerate occasional misses of the control law updates, especially if the control law is modified to account for these missed updates. The paper devises a scheduling policy which deterministically guarantees when and where the misses will occur. The paper also proposes a methodology for modifying the control law to minimize the deterioration in the control system behavior as a result of these missed control law updates 相似文献

6.

On coordinated checkpointing in distributed systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(12):1213-1225

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: first is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems 相似文献

7.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class. 相似文献

8.

Fault-tolerant rate-monotonic first-fit scheduling inhard-real-time systems

Bertossi A.A. Mancini L.V. Rossini F. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):934-945

Hard-real-time systems require predictable performance despite the occurrence of failures. In this paper, fault tolerance is implemented by using a novel duplication technique where each task scheduled on a processor has either an active backup copy or a passive backup copy scheduled on a different processor. An active copy is always executed, while a passive copy is executed only in the case of a failure. First, the paper considers the ability of the widely-used rate-monotonic scheduling algorithm to meet the deadlines of periodic tasks in the presence of a processor failure. In particular, the completion time test is extended so as to check the schedulability on a single processor of a task set including backup copies. Then, the paper extends the well-known rate-monotonic first-fit assignment algorithm, where all the task copies, included the backup copies, are considered by rate-monotonic priority order and assigned to the first processor in which they fit. The proposed algorithm determines which tasks must use the active duplication and which can use the passive duplication. Passive duplication is preferred whenever possible, so as to overbook each processor with many passive copies whose primary copies are assigned to different processors. Moreover, the space allocated to active copies is reclaimed as soon as a failure is detected. Passive copy overbooking and active copy deallocation allow many passive copies to be scheduled sharing the same time intervals on the same processor, thus reducing the total number of processors needed. Simulation studies reveal a remarkable saving of processors with respect to those needed by the usual active duplication approach in which the schedule of the non-fault-tolerant case is duplicated on two sets of processors 相似文献

9.

Reliability and performance optimization of pipelined real-time systems

Anne Benoit Fanny Dufossé Alain Girault Yves Robert 《Journal of Parallel and Distributed Computing》2013

We consider pipelined real-time systems that consist of a chain of tasks executing on a distributed platform. The processing of the tasks is pipelined: each processor executes only one interval of consecutive tasks. We are interested in minimizing both the input–output latency and the period of application mapping. For dependability reasons, we are also interested in maximizing the reliability of the system. We therefore assign several processors to each interval of tasks, so as to increase the reliability of the system. Both processors and communication links are unreliable and subject to transient failures. We assume that the arrival of the failures follows a constant parameter Poisson law, and that the failures are statistically independent events. We study several variants of this multiprocessor mapping problem, with several hypotheses on the target platform (homogeneous/heterogeneous speeds and/or failure rates). We provide NP-hardness complexity results, and optimal mapping algorithms for polynomial problem instances. Efficient heuristics are presented to solve the general case, and experimental results are provided. 相似文献

10.

A metaheuristic for the fixed job scheduling problem under spread time constraints

André Rossi Alok Singh Marc Sevaux 《Computers & Operations Research》2010,37(6):1045-1054

In a scheduling problem, a job is said to be fixed when its due date corresponds exactly to its release date plus its processing time. This paper addresses the fixed job scheduling problem where processors are subject to spread time constraints, i.e., the amount of time spent between the starting time of the first job on a processor and the completion time of the last job on the same processor should not exceed a given duration. Existing exact approaches are tested on medium size instances. As large instances are clearly intractable with these approaches, a greedy heuristic and a grouping genetic algorithm are proposed. Computational results show the effectiveness of the proposed heuristics. 相似文献

11.

Static and decentralized-adaptive load balancing in a starconfigured distributed computing system

Dattatreya G.R. Venkatesh R. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》1996,26(1):91-104

The problem of minimizing the average job response time (sojourn time) in a star configured distributed computing system (DCS) with N peripheral processors and one central processor is studied. Each peripheral processor (satellite) receives a stream of Poisson arrivals. A job arriving at a satellite S_i will be processed with a probability (1-P_i) by itself and routed to the central site for remote processing with probability P_i. The service time distribution at each processor is arbitrary with means 1/μ_i and second moments E[Y_i²], i=0,…,N. We first develop an algorithm to compute the optimal routing probabilities when the arrival and the service parameters are known a priori. We then develop an adaptive estimator to be used by each satellite to compute its optimal routing probability, when the parameters (everywhere) are unknown a priori, a copy of the algorithm runs on each peripheral processor and uses the sojourn time measurements that the peripheral processor can gather on its own from those jobs that are sent to the central processor and returned to the originating site after service 相似文献

12.

Checkpointing and Rollback-Recovery for Distributed Systems

《IEEE transactions on pattern analysis and machine intelligence》1987,(1):23-31

We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, they tolerate failures that occur during their executions. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions. 相似文献

13.

云环境下周期和非周期混合实时任务双容错调度算法

曹洁曾国荪《计算机应用》2015,35(3):648-653

云环境中的处理机故障已成为云计算不可忽视的问题,容错成为设计和发展云计算系统的关键需求。针对一些容错调度算法在任务调度过程中调度效率低下以及任务类型单一的问题,提出一种处理机和任务主副版本分组的容错调度方法;并给出了副版本可重叠执行的判定方法,以及任务最坏响应时间的计算公式。通过实验和分析表明,和以前算法相比,将处理机分成两组分别执行任务主版本和任务副版本,减少了任务调度所需进行可调度测试的时间,增加了副版本重叠执行的机会,减少了所需的处理机个数,对提高系统处理机的利用率和容错调度的效率具有重要的意义。相似文献

14.

On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices

Qin Zheng Bharadwaj Veeravalli 《Journal of Parallel and Distributed Computing》2009

Fault-tolerant scheduling is an imperative step for large-scale computational Grid systems, as often geographically distributed nodes co-operate to execute a task. By and large, primary-backup approach is a common methodology used for fault tolerance wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule DAGs in Grids with communication delays so that service failures can be avoided in the presence of processors faults. The challenge is, that as tasks in a DAG have dependence on each other, a task must be scheduled to make sure that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and backups of its successors are necessary. Then we determine when a backup can start and its eligible processors so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm which targets minimizing replication cost while not affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms. 相似文献

15.

异构计算中的负载共享 总被引：18，自引：0，他引：18

曾国荪陆鑫达《软件学报》2000,11(4):551-556

在基于消息传递的异构并行计算系统中 ,各处理器或计算机具有自制和独立地调度、执行作业的能力 .当一个可划分的作业初始位于一个处理器上时 ,为了提高计算性能 ,该处理器可以请求其他异构处理器负载共享 ,参与协同计算 ,减少作业的完成时间 .该文提出了异构计算负载共享的一种方案 .首先 ,调用负载共享协议 ,收集当前各处理器参与负载共享的许可数据 ,包括共享时间段、计算能力等 .然后 ,构造一个作业量与作业完成时间之间的关系函数 .该函数是选择一组合适的处理器群、优化作业划分、作业完成时间最小的理论基础 .最相似文献

16.

The decision model of task allocation for constrained stochastic distributed systems

Chichang Jou 《Computers & Industrial Engineering》2010,58(3):344-351

In distributed systems, an application program is divided into several software modules, which need to be allocated to processors connected by communication links. The distributed system reliability (DSR) could be defined as the probability of successfully completing the distributed program. Previous studies about optimal task allocation with respect to DSR focused on the effects of the inter-connectivity of processors, the failure rates of the processors, and the failure rates of the communication links. We are the first to study the effects of module software reliabilities and module execution frequencies on the optimal task allocation. By viewing each module as a state in the Markov process, we build a task allocation decision model to maximize DSR for distributed systems with 100% reliable network. In this model, the DSR is derived from the module software reliabilities, the processor hardware reliabilities, the transition probabilities between modules, and the task allocation matrix. Resource constraints of memory space limitation and computation load limitation on each processor are considered. The constraint of total system cost, including the execution cost, the communication cost, and the failure cost, is also considered. We solve the problem by Constraint Programming using the ILOG SOLVER library. We then apply the proposed model to a case extended from previous studies. Finally, a sensitivity analysis is performed to verify the effects of module software reliabilities and processor hardware reliabilities on the DSR and on the task allocation decision. 相似文献

17.

Fault-tolerant parallel programming in Argus

Henri E. Bal 《Concurrency and Computation》1992,4(1):37-55

Fault tolerance is an issue ignored in most parallel languages. The overhead of making parallel, high-performance programs resilient to processor crashes is often too high, given the low probability of such events. If parallel systems become more large-scaled, however, processor failures will become likely, so they should be dealt with. Two approaches to this problem are feasible. First, the system can make programs fault-tolerant transparently. It can log messages, make checkpoints, and so on. Second, the programmer can write explicit code for handling failures in an application-specific way. The latter approach is potentially more efficient, but also requires more work from the programmer. In this paper, we intend to get some initial insight into how hard and efficient explicit fault-tolerant parallel programming is. We do so by implementing four parallel applications in Argus, a language supporting parallelism as well as fault tolerance. Our experiences indicate that the extra effort needed for fault tolerance varies much between different applications. Also, trade-offs can frequently be made between programming effort and efficiency. One lesson we learned is that fault tolerance should not be added as an afterthought, but is best taken into account from the start. As another result, the ability to integrate transparent and explicit mechanisms for fault tolerance would sometimes be highly useful. 相似文献

18.

Increasing processor utilization in hard-real-time systems with checkpoints

Alan A. Bertossi Massimo Bonometto Luigi V. Mancini 《Real-Time Systems》1995,9(1):5-29

Often hard real-time systems require results that are produced on time despite the occurrence of processor failures. This paper considers a distributed system where tasks are periodic and each task occurs in multiple copies which are periodically synchronized in order to handle failures. The problem of preemptively scheduling a set of such tasks is discussed where every occurrence of a task has to be completely executed before the next occurrence of the same task. First, a static scheduling algorithm is proposed which uses periodic checkpoints to tolerate processor failures. Then, the performance of the algorithm is substancially improved employing a mixed strategy which constructs a schedule where high frequency tasks are duplicated, and low frequency tasks are periodically checkpointed. The performance of the solution proposed is evaluated in terms of the minimum achievable processor utilization due to the useful computation of the tasks. Moreover, analytical and simulation studies are used to reveal interesting trade-offs associated with the scheduling algorithm. In particular, if high frequency tasks are less than 70 percent of the total number of tasks then the mixed strategy yields a higher processor utilization than the task duplication scheme. 相似文献

19.

The solution algorithms for the multiprocessor scheduling with workspan criterion

Radosław Rudek Agnieszka Rudek Andrzej Kozik 《Expert systems with applications》2013,40(8):2799-2806

In this paper, we consider multiprocessor scheduling problems, where each job (task) must be executed simultaneously by the specified number of processors, but the indices of the processors allotted to each job do not have to be contiguous (i.e., jobs can be fragmentable). Unlike other research in this domain, we analyse the problem under the workspan criterion, which is defined as the product of the maximum job completion time (makespan) and the number of used processors. Moreover, the job processing times can be described by non-increasing or non-decreasing functions dependent on the start times of jobs that model improvement (learning) or degradation (deteriorating), respectively. To solve the problems, we construct some polynomial time algorithms and analyse numerically their efficiency. 相似文献

20.

Scheduling parallelizable tasks to minimize make-span and weighted response time

Monte J.D. Pattipati K.R. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2002,32(3):335-345

This paper presents a generalization to classical scheduling theory by removing the restriction that only one processor can work on a given task at a particular time. Instead, it is assumed that each task can be allocated any number of identical processors from one to the maximum number available, with each task's completion time being a function of the number of processors allocated. Tasks may be started any time, but once started, a task must not have its processor allocation altered or be preempted. Two objective functions are considered: minimizing the overall completion time for the tasks (make-span) and minimizing a weighted sum of the task completion times (weighted response). Both are considered subject to a constraint on the total number of processors available. Suboptimal algorithms are developed for both of these NP-hard problems using Lagrangian relaxation, and their performances are analyzed through extensive simulations. Duality gaps for all problems tested ranged from under 1% to 92%, depending more on the problem size than the specific problem. 相似文献