期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Surviving sensor node failures by MMU-less incremental checkpointing

《Journal of Systems and Software》2014

For some critical safety applications, sensor nodes embed valuable information, and they should be able to operate unattended and unfailing for several months or years. One promising solution is to adopt a checkpointing that periodically saves the state of a sensor node, thereby maintaining node reliability and network availability. Thus, this study first shows the design and implementation of a full checkpointing for WSNs. However, checkpointing is expensive. Therefore, incremental checkpointing was previously proposed to eliminate the checkpoint overhead by relying on the page protection hardware to identify dirty pages. Because sensor nodes are resource-constrained and do not equip with the page protection hardware, previous incremental checkpointings cannot be directly applied. To address this issue, this paper proposes three incremental checkpointings for WSNs. These three methods differ in the granularity of the checkpoint memory data unit and module execution overhead. In addition, we designed an incremental checkpoint file format that simultaneously supports proposed three different incremental checkpointings and accommodates them with sensor network characteristics. We implemented the full and three incremental checkpointings on SOS in the mica2 sensor motes. A performance evaluation of the three incremental checkpointings is presented. We also discuss and evaluate a method for selecting the appropriate incremental checkpointing. To the best of our knowledge, this study is the first to design and implement incremental checkpointing in MMU-less WSNs. 相似文献

2.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm. 相似文献

3.

A new approach for high performance computing systems with various checkpointing schemes

Gyung-Leen Park Hee Yong Youn 《The Journal of supercomputing》2005,33(1-2):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault tolerant systems employing checkpointing approach. In the roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes thesample comparison approach along with the checkpointing, which further improves the performance by reducing the overhead imposed by the checkpointing. We also develop general analytical models for estimating the availability, which are applicable for any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger. This research was supported in part by the MIC (Ministry of Information and Communication), Korea, under the ITRC support program supervised by the UTA and CUCN 21st Century Frontier R&D Program. 相似文献

4.

Visions for application development on hybrid computing systems

Roger D. Chamberlain Joseph M. Lancaster Ron K. Cytron 《Parallel Computing》2008,34(4-5):201-216

Hybrid computing systems (incorporating FPGAs, GPUs, etc.) have received considerable attention recently as an approach to significant performance gains in many problem domains. Deploying applications on these systems, however, has proven to be difficult and very labor intensive. In this paper we review the current state of practice for application development on hybrid systems. We also present our vision of the application development languages and tools that we believe would greatly benefit the process of designing, implementing, and deploying applications on hybrid systems. 相似文献

5.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class. 相似文献

6.

Differential evaluation: A cache-based technique for incremental update of graphical displays of structures

Michael R. Dunlavey 《Software》1993,23(8):871-893

Many applications of graphical man-machine interfaces require incremental update of an image, with controllable time granularity, having multiple views of the information, generated by modular, maintainable software having minimal resource overhead. A technique called differential evaluation is presented that addresses these needs. Instead of driving the display from a data structure, it uses an ad-hoc display procedure in conjunction with a FIFO cache to generate and maintain the display. Maintaining displays of variable structure is accomplished through the use of conditional mechanisms in the display procedure. The technique has been extended to meet many different needs, such as the use of graphic contexts, display of overlapping objects, double-buffering, and a variety of user input schemes. The efficiency of the technique, in time and storage, permits its use with modest equipment. It has been used for several years in a number of industrial user interfaces. 相似文献

7.

An efficient checkpointing method for multicomputers with wormhole routing

Kai Li Jeffrey F. Naughton James S. Plank 《International journal of parallel programming》1991,20(3):159-180

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processorp to continue running the application being checkpointed except during the time thatp is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements. 相似文献

8.

Energy minimization for reliability-guaranteed real-time applications using DVFS and checkpointing techniques

《Journal of Systems Architecture》2015,61(2):71-81

This paper addresses the energy minimization issue when executing real-time applications that have stringent reliability and deadline requirements. To guarantee the satisfaction of the application’s reliability and deadline requirements, checkpointing, Dynamic Voltage Frequency Scaling (DVFS) and backward fault recovery techniques are used. We formally prove that if using backward fault recovery, executing an application with a uniform frequency or neighboring frequencies if the desired frequency is not available, not only consumes the minimal energy but also results in the highest system reliability. Based on this theoretical conclusion, we develop a strategy that utilizes DVFS and checkpointing techniques to execute real-time applications so that not only the applications reliability and deadline requirements are guaranteed, but also the energy consumption for executing the applications is minimized. The developed strategy needs at most one execution frequency change during the execution of an application, hence, the execution overhead caused by frequency switching is small, which makes the strategy particularly useful for processors with a large frequency switching overhead. We empirically compare the developed real-time application execution strategy with recently published work. The experimental results show that, without sacrificing reliability and deadline satisfaction guarantees, the proposed approach can save up to 12% more energy when compared with other approaches. 相似文献

9.

On the assessment of input streams for incremental network computing

Cho-Chin Lin Hao-Yun YinAuthor Vitae 《Journal of Parallel and Distributed Computing》2008

Network latency is an adverse factor for computations performed across the network. Overlapping computation with communication is an important technique for hiding latency. It has been shown that network latency cannot be effectively hidden without considering the order of sending data [C.-C. Lin, Strategies for achieving high performance incremental computing on a network environment, in: Proc.18th Int’l Conf. on Advanced Information Networking and Applications 1, 2004, pp. 113–118]. However, finding a data-sending order for the input to a task which minimizes the remote execution time for any network traffic pattern is NP-hard [C.-C. Lin, D.-W. Wang, T.-S. Hsu, Bounds on the client-server incremental computing, IEICE Trans. Fundamentals E89-A (5) (2006) 1198–1206]. Thus, heuristic algorithms are often employed to search an optimal input stream. The performance of algorithms relies on an effective mechanism for guiding the search toward a promising direction. In this paper, the computation-progress graph is proposed for transforming an input stream of a task to its corresponding pattern of progressive computations. Then, the assessing function is defined for assigning scores to the found input streams based on the computation-progress graph. Based on the scores, the promising search directions can be determined. Finally, the effectiveness of our assessing function is also demonstrated by the search of the optimal orders for computing the product of two polynomials, matrix multiplication and FFT. 相似文献

10.

Self-stabilizing algorithm for checkpointing in a distributed system

Partha Sarathi Mandal Krishnendu Mukhopadhyaya 《Journal of Parallel and Distributed Computing》2007

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms are proposed in a ring topology. The proposed data fault detection and correction algorithms can handle data faults; at most one per process, but in any number of processes. The proposed checkpointing algorithm can deal with concurrent multiple initiations of checkpointing and data faults. A process can recover from a fault, using the proposed recovery algorithm in spite of multiple data faults present in the system. All the proposed algorithms converge in O(n)

O (n)

steps, where n

n

is the number of processes. The algorithm can be extended to work for general topologies too. 相似文献

11.

The effects of checkpointing on program execution time

Andrzej Duda 《Information Processing Letters》1983,16(5):221-229

相似文献

12.

Bounds on the multi-clients incremental computing for homogeneous decreasing computation sequences

Cho-Chin Lin Hao-Yun Yin 《Information Processing Letters》2007,102(1):35-40

This paper studies the performance of network-based incremental computing under various message sequences. We show the bounds on the time needed to compute the tasks requested by multiple clients. Our simulation result shows that the expected performance of random message sequences is close to the optimal performance. 相似文献

13.

Large-scale incremental processing with MapReduce

《Future Generation Computer Systems》2014

An important property of today’s big data processing is that the same computation is often repeated on datasets evolving over time, such as web and social network data. While repeating full computation of the entire datasets is feasible with distributed computing frameworks such as Hadoop, it is obviously inefficient and wastes resources. In this paper, we present HadUP (Hadoop with Update Processing), a modified Hadoop architecture tailored to large-scale incremental processing with conventional MapReduce algorithms. Several approaches have been proposed to achieve a similar goal using task-level memoization. However, task-level memoization detects the change of datasets at a coarse-grained level, which often makes such approaches ineffective. Instead, HadUP detects and computes the change of datasets at a fine-grained level using a deduplication-based snapshot differential algorithm (D-SD) and update propagation. As a result, it provides high performance, especially in an environment where task-level memoization has no benefit. HadUP requires only a small amount of extra programming cost because it can reuse the code for the map and reduce functions of Hadoop. Therefore, the development of HadUP applications is quite easy. 相似文献

14.

Adaptive Remus: adaptive checkpointing for Xen-based virtual machine replication

Marcelo Pereira da Silva Rafael Rodrigues Obelheiro 《International Journal of Parallel, Emergent and Distributed Systems》2017,32(4):348-367

With the ever increasing dependence on computers and networks, many systems are required to be continuously available in order to fulfil their mission. Virtualization technology enables high availability to be offered in a convenient, cost-effective manner: with the encapsulation provided by virtual machines (VMs), entire systems can be replicated transparently in software, obviating the need for expensive fault-tolerant hardware. Remus is a VM replication mechanism for the Xen hypervisor that provides high availability despite crash failures. Replication is performed by checkpointing the VM at fixed intervals. However, there is an antagonism between processing and communication regarding the optimal checkpoint interval: while longer intervals benefit processor-intensive applications, shorter intervals favour network-intensive applications. Thus, any chosen interval may not always be suitable for the hosted applications, limiting Remus usage in many scenarios. This work introduces Adaptive Remus, a proposal for adaptive checkpointing in Remus that dynamically adjusts the replication frequency according to the characteristics of running applications. Experimental results indicate that our proposal improves performance for applications that require both processing and communication, without harming applications that use only one type of resource. 相似文献

15.

FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems

Yi Luo D. Manivannan 《Journal of Parallel and Distributed Computing》2009

Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols. 相似文献

16.

Hierarchical Krylov and nested Krylov methods for extreme-scale computing

Lois Curfman McInnes Barry Smith Hong Zhang Richard Tran Mills 《Parallel Computing》2014

The solution of large, sparse linear systems is often a dominant phase of computation for simulations based on partial differential equations, which are ubiquitous in scientific and engineering applications. While preconditioned Krylov methods are widely used and offer many advantages for solving sparse linear systems that do not have highly convergent, geometric multigrid solvers or specialized fast solvers, Krylov methods encounter well-known scaling difficulties for over 10,000 processor cores because each iteration requires at least one vector inner product, which in turn requires a global synchronization that scales poorly because of internode latency. To help overcome these difficulties, we have developed hierarchical Krylov methods and nested Krylov methods in the PETSc library that reduce the number of global inner products required across the entire system (where they are expensive), though freely allow vector inner products across smaller subsets of the entire system (where they are inexpensive) or use inner iterations that do not invoke vector inner products at all. 相似文献

17.

Periodicity in optimal hierarchical checkpointing schemes for adjoint computations

Guillaume Aupy Julien Herrmann 《Optimization methods & software》2017,32(3):594-624

We reexamine the work of Aupy et al. on optimal algorithms for hierarchical adjoint computations, where two levels of memories are available. The previous optimal algorithm had a quadratic execution time. Here, with structural arguments, namely periodicity, on the optimal solution, we provide an optimal algorithm in constant time and space, with appropriate pre-processing. We also provide an asymptotically optimal algorithm for the online problem, when the adjoint chain size is not known before-hand. Again, these algorithms rely on the proof that the optimal solution for hierarchical adjoint computations is weakly periodic. We conjecture a closed-form formula for the period. Finally, we assess the convergence speed of the approximation ratio for the online problem through simulations. 相似文献

18.

Parallel processing of incremental ray tracing on a shared-memory multiprocessor

Susumu Horiguchi Masayuki Katahira Takeo Nakada 《The Visual computer》1993,9(7):371-380

This paper presents a novel parallel-processing method for image synthesis using incremental ray tracing on a shared-memory multiprocessor workstation. The most efficient technique for image synthesis is ray tracing, proposed by Whitted in 1980. Ray-tracing algorithms are simple and can generate realistic images. However, they are time-consuming, since calculations of the intersections between objects and ray increase exponentially as the complexity of scenes increases. Fast image synthesis for animation is one of the most important topics in computer graphics. As the area of computer applications has broadened, the complexity of images to be synthesized has increased. Parallel processing of computer graphics is one way of achieving fast image synthesis. This paper describes a parallel processing technique for incremental ray tracing, which recalculates only the rays changed by moving objects in successive scenes of continuous image synthesis. The performance of parallel ray tracing was evaluated on the multiprocessor workstation TOP-1. Strategies for allocating pixels to processes under a multiprocess operating system on this workstation are discussed. 相似文献

19.

Unified model for assessing checkpointing protocols at extreme‐scale

George Bosilca Aurlien Bouteiller Elisabeth Brunet Franck Cappello Jack Dongarra Amina Guermouche Thomas Herault Yves Robert Frdric Vivien Dounia Zaidouni 《Concurrency and Computation》2014,26(17):2772-2791

In this paper, we present a unified model for several well‐known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

20.

Deploying fault tolerance and taks migration with NetSolve

James S. Plank Henri Casanova Micah Beck Jack J. Dongarra 《Future Generation Computer Systems》1999,15(5-6):745-755

Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve’s structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve. 相似文献