1.

Wei Xiaohui (魏晓辉), Ju Jiubin (鞠九滨) 《Journal of Computer Science and Technology》 2000, 15(2): 169-175

A consistent checkpointing algorithm with short freezing time (SFT) is presented in this paper. It supports fault tolerance in distributed systems. The algorithm offers shorter freezing time, lower overhead, and simple recovery. To shorten checkpoint time, a special control message (Munblock) is used to ensure that a process can respond to a checkpoint event quickly at any given time. Moreover, a main-memory algorithm is used to improve the concurrency of checkpointing. With SFT, the freezing time caused by checkpointing is less than 0.03 s. Furthermore, the number of control messages in SFT is only O(n).

2.

Checkpointing Distributed Shared Memory

Luis M. Silva, João Gabriel Silva 《The Journal of Supercomputing》 1997, 11(2): 137-158

3.

A Fault-Tolerance Method for CPU-GPU Heterogeneous Systems

Xu Xinhai (徐新海), Yang Xuejun (杨学军), Lin Yufei (林宇斐), Lin Yisong (林一松), Tang Tao (唐滔) 《Journal of Software》 2011, 22(10): 2538-2552

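In recent years, heterogeneous parallel architectures have become an important trend in supercomputer development as a way to ease the increasingly serious power consumption problem. Thanks to its very high computing performance and performance-per-watt, the graphics processing unit (GPU) has been widely adopted in high-performance computing as an efficient accelerator. However, the inherent reliability weaknesses of GPUs will inevitably aggravate the reliability problem of supercomputers. Existing research on fault tolerance for CPU-GPU heterogeneous systems mainly isolates the GPU from the heterogeneous system and applies fault tolerance at the granularity of each call. This paper designs a lazy fault-tolerance method (LazyFT) for CPU-GPU heterogeneous systems, presents a fault-tolerance framework based on compiler directives together with its constraints, and discusses the related compiler implementation and optimization methods. Finally, experiments verify the correctness of the method. The results show that, compared with existing fault-tolerance methods, applying LazyFT to GPGPU (general-purpose computation on graphics hardware) programs significantly reduces fault-tolerance overhead.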

4.

Application-Level Checkpointing for Large-Scale MPI Programs

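Wang Panfeng (王攀峰), Du Yunfei (杜云飞), Zhou Haifang (周海芳), Yang Xuejun (杨学军) 《Journal of Computer Research and Development》 2009, 46(Z2)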

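Application-level checkpointing is a fault-tolerance technique that has attracted much attention in large-scale scientific computing. However, it requires users to decide which critical data must be saved, which increases their burden. This paper introduces ALEC, a source-to-source precompiler based on live-variable analysis of MPI parallel programs, which can assist application-level checkpointing. Five Fortran/MPI applications compiled with ALEC were evaluated on a 512-processor cluster. The results show that ALEC effectively reduces checkpoint size as well as the overhead of saving and restoring application-level checkpoints.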

5.

The STAR fault manager for distributed operating environments: design, implementation and performance

Pierre Sens, Bertil Folliot 《Software: Practice and Experience》 1998, 28(10): 1079-1099

This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, Star implements non-blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment. © 1998 John Wiley & Sons, Ltd.

6.

Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing

Jinho Ahn, Sung-Gi Min, Chong-Sun Hwang, Heonchang Yu 《The Journal of Supercomputing》 2002, 22(2): 175-196

This paper presents three garbage collection schemes for causal message logging with independent checkpointing. The first scheme allows each process to autonomously remove useless log information from its volatile storage by piggybacking only some additional information, without requiring any extra messages or forced checkpoints. Additionally, it supports faster output commit than traditional schemes. The second scheme enables each process to remove part of the log information in its storage when more free space is required. It reduces the number of processes participating in garbage collection by using the size of each process's log information. The third scheme is a hybrid that combines the advantages of the first two. Simulation results show that the third scheme significantly reduces garbage collection overhead compared with traditional schemes, regardless of the specific communication patterns of distributed applications.

7.

Probabilistic optimisation of checkpoint intervals for real-time multi-tasks

Seong Woo Kwak 《International Journal of Systems Science》 2013, 44(4): 595-603

This article considers the checkpoint placement problem for real-time systems. In our environment, multiple real-time tasks with arbitrary periods are scheduled by the rate monotonic algorithm, and checkpoints are inserted at a constant interval in each task, while the interval length differs from task to task. We derive an explicit formula for the probability that all the tasks complete successfully with a given set of checkpoint intervals. We then determine the optimal checkpoint intervals that maximise the probability of task completion. The probability computation includes a schedulability analysis with respect to the number of re-executed checkpoint intervals. Our method does not require any algebraic condition on the periods of the scheduled tasks.

8.

Accelerating incremental checkpointing for extreme-scale computing

《Future Generation Computer Systems》2014

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and increased efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
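
The hash-based change detection mentioned in this abstract can be illustrated with a minimal sketch. It is not libhashckpt's implementation: page protection and GPU hashing are not modeled, SHA-256 on the CPU stands in for the hash, and the names `BLOCK_SIZE`, `block_hashes`, `incremental_checkpoint`, and `apply_delta` are hypothetical.

```python
from __future__ import annotations

import hashlib

BLOCK_SIZE = 4096  # assumed granularity; the abstract does not state one


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data: bytes, prev_hashes: list[bytes],
                           block_size: int = BLOCK_SIZE):
    """Return (changed_blocks, new_hashes) for the current state.

    changed_blocks maps a block index to the block's bytes for every block
    whose digest differs from the previous checkpoint, or that is new.
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            changed[idx] = data[idx * block_size:(idx + 1) * block_size]
    return changed, new_hashes


def apply_delta(base: bytes, changed: dict[int, bytes], total_len: int,
                block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild a state of length `total_len` from a base image plus changed blocks."""
    buf = bytearray(base[:total_len].ljust(total_len, b"\0"))
    for idx, block in changed.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

Only the blocks returned in `changed_blocks` need to be written to stable storage; a restart replays them over the last full image with `apply_delta`.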

9.

A Hybrid Checkpointing Strategy for Mobile Ad Hoc Networks

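Liao Guoqiong (廖国琼), Xiong Anjin (熊安晋), Di Guoqiang (狄国强), Wan Changxuan (万常选), Xia Jiali (夏家莉) 《Journal of Computer Research and Development》 2014, 51(6): 1176-1184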

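Considering that mobile ad hoc networks have no fixed central node, use multi-hop routing, and have limited resources, this paper proposes, on top of a clustered mobile ad hoc network structure, a hybrid checkpointing strategy that combines synchronous and asynchronous checkpointing: checkpoints of terminals in the same cluster must be kept synchronized, while checkpoints of terminals in different clusters remain independent. First, the hybrid checkpoint model and its correctness criteria are discussed. Then, based on intra-cluster and inter-cluster checkpoint dependency graphs, rules for removing different types of checkpoints are discussed. Finally, the corresponding checkpointing and rollback-recovery algorithms are given, and the correctness of rollback recovery is proved. The proposed strategy avoids the resource waste caused by cascading rollbacks of processes in the same cluster, avoids excessive cross-cluster message passing between terminals in different clusters, and reduces wireless communication delay. Experimental results show that, compared with purely synchronous or purely asynchronous checkpointing strategies, the proposed strategy is a good compromise that takes the various resource constraints of mobile ad hoc networks into account, and that it offers short recovery time, low dependence on cluster heads, and good flexibility.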

10.

Paolo Cappellari, Mark Roantree, Soon Ae Chun 《Software: Practice and Experience》 2018, 48(9): 1607-1641

Stream processing systems are designed to analyze data arriving in real time using continuous queries and to respond when a specific event or sequence of events is detected. An important aspect of these systems is Streaming Analytics, which facilitates statistical calculations on continuous data within the stream. These systems must be designed to handle high volumes of data, be scalable, and accommodate a multitude of long-lived, concurrently running analytics. The challenges involved in the development of stream processing include on-the-fly transformation of data streams to match the query needs of users, the ability to model stream transformations to detect overlaps and possibilities for optimizations, and the specification of a methodology to deliver optimizations. In particular, this work focuses on exposing data stream application internals in order to detect reusable parts and then consolidate applications to optimize computational resource usage. The Streaming Data Analytics Model presented in this paper adopts a declarative approach that enables processing and manipulation of data streams in a simple manner while facilitating powerful optimizations necessary for managing high volumes of streaming data in real time. An evaluation is provided to demonstrate, in both theoretical and quantitative terms, the high performance offered by our approach.

11.

Storage Optimization of Sliding-Window Continuous Query Results

Tang Xianghong (唐向红), Li Guohui (李国徽) 《Computer Science》 2010, 37(6): 191-195

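In research on sliding-window queries over data streams, continuous queries that account for the expiration of query results have become a new research focus. The cost of maintaining query results directly affects the efficiency of continuous queries. Based on an analysis of continuous query results under different update patterns, this paper proposes a ladder queue with branch lists to maintain sliding-window continuous query results. It uses the branch-list structure to collect data with the same deadline and adopts the ladder queue's "spawning" mechanism, so it adapts to data with various distributions and achieves O(1) amortized time complexity. Experiments show that the structure significantly improves the efficiency of sliding-window continuous queries and clearly outperforms comparable structures.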
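
The idea of grouping results that share a deadline can be sketched in a simplified form. This is not the ladder queue from the paper: a binary heap of distinct deadlines replaces the ladder structure, so operations cost O(log n) rather than amortized O(1), and the class name `ExpiringResults` and its methods are hypothetical.

```python
import heapq
from collections import defaultdict


class ExpiringResults:
    """Query results grouped by expiration deadline (simplified sketch)."""

    def __init__(self):
        self._buckets = defaultdict(list)  # deadline -> results sharing that deadline
        self._deadlines = []               # min-heap holding each distinct deadline once

    def add(self, result, deadline):
        """Store a result; results with equal deadlines share one bucket."""
        if deadline not in self._buckets:
            heapq.heappush(self._deadlines, deadline)
        self._buckets[deadline].append(result)

    def expire(self, now):
        """Remove and return every result whose deadline is <= now."""
        expired = []
        while self._deadlines and self._deadlines[0] <= now:
            deadline = heapq.heappop(self._deadlines)
            expired.extend(self._buckets.pop(deadline))
        return expired

    def __len__(self):
        return sum(len(bucket) for bucket in self._buckets.values())
```

Because a whole bucket is discarded at once, the cost of expiration depends on the number of distinct deadlines rather than on the number of stored results.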

12.

Robert Soulé, Martin Hirzel, Buğra Gedik, Robert Grimm 《Software: Practice and Experience》 2016, 46(7): 891-929

This paper presents both a calculus for stream processing, named Brooklet, and its realization as an intermediate language, named River. Because River is based on Brooklet, it has a formal semantics that enables reasoning about the correctness of source translations and optimizations. River builds on Brooklet by addressing the real-world details that the calculus elides. We evaluated our system by implementing front-ends for three streaming languages, three important optimizations, and a back-end for the System S distributed streaming runtime. Overall, we significantly lower the barrier to entry for new stream-processing languages and thus grow the ecosystem of this crucial style of programming. Copyright © 2015 John Wiley & Sons, Ltd.

13.

Checkpointing Algorithms in Distributed Systems (total citations: 12; self-citations: 0; citations by others: 12)

Wei Xiaohui (魏晓辉), Ju Jiubin (鞠九滨) 《Chinese Journal of Computers》 1998, 21(4): 367-375

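Checkpointing saves and restores the running state of a program. It has important applications in process migration, fault tolerance, rollback debugging, and other areas. This paper gives a detailed classification and review of checkpointing algorithms in distributed systems. Checkpointing algorithms can be divided into single-process and distributed-program checkpointing algorithms, and the latter can be further divided into asynchronous checkpointing algorithms and consistent checkpointing algorithms. The paper also systematically introduces typical methods for improving the performance of checkpointing algorithms. These improved algorithms mainly use two strategies to reduce overhead and latency: one is to reduce the amount of information that must be stored in the checkpoint file, as in incremental algorithms; the other is to increase the parallelism between the checkpoint operation and the execution of the target program, as in main-memory algorithms. Finally, the paper discusses the limitations of current checkpointing algorithms and directions for further work.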

14.

Optimized Stream Organization Methods for the Imagine Stream Processor

Yang Xuejun (杨学军), Zeng Lifang (曾丽芳), Deng Yu (邓宇), Tang Yuhua (唐玉华) 《Chinese Journal of Computers》 2008, 31(7)

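The characteristics of stream applications, together with the shortcomings of conventional processors in handling them, have made the design of stream processors that support data parallelism a hot topic in computer architecture research. Targeting the architectural features of the Imagine stream processor, this paper proposes two optimized stream organization methods: stream partitioning and stream compression. Simulation results show that stream partitioning and stream compression allow stream applications to make full use of Imagine's parallel structure, pipeline structure, and multi-level bandwidth memory hierarchy, thereby reducing the execution time of stream programs.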

15.

Fault-Tolerant PVM with Asynchronous Checkpointing (total citations: 1; self-citations: 0; citations by others: 1)

Yu Yang (余洋), Lu Xinda (陆鑫达) 《Computer Engineering and Applications》 1999, 35(11): 34-37

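Computing environments typified by workstation clusters are currently one of the research focuses of distributed systems and parallel computing, and the message-passing mechanism provided by PVM supports efficient heterogeneous network computing. However, standard PVM lacks support for system fault tolerance, which can be remedied by rollback recovery based on checkpoints. This paper analyzes the design ideas and implementation techniques for achieving global fault tolerance for PVM at the user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled by the PVM daemon and a global scheduler process, with all operations transparent to the application. The system can further be used to implement transparent process migration and load balancing for PVM.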

16.

Research Progress on Key Technologies Supporting Real-Time Stream Computing Applications

Xu Zhizhen (徐志榛), Xu Chen (徐辰), Ding Guangyao (丁光耀), Chen Zihao (陈梓浩), Zhou Aoying (周傲英) 《Journal of Software》 2024, 35(1): 430-454

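When mining and managing knowledge, information systems need to process data in various forms, and streaming data is one of them. Streaming data is large in scale and generated at high speed, and the knowledge it contains is highly time-sensitive, so developing stream computing technology that supports real-time processing applications is very important for knowledge management in information systems. Stream computing systems date back to the 1990s and have developed considerably since then. However, today's diverse knowledge management requirements and new generations of hardware architectures bring new challenges and opportunities to stream computing systems and have given rise to a series of technical studies in the field. This paper first introduces the basic requirements and development history of stream computing systems, then analyzes the relevant techniques at four levels: programming interface, execution plan, resource scheduling, and fault tolerance. Finally, it looks ahead to possible research directions and development trends of stream computing technology.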

17.

An on-line algorithm for cluster detection of mobile nodes through complex event processing

《Information Systems》2017

Clusters of mobile elements, such as vehicles and humans, are a common mobility pattern of interest for many applications. The on-line detection of them from large position streams of mobile entities is a challenging task because it requires algorithms that are capable of continuously and efficiently processing the high volume of position updates in a timely manner. Currently, the majority of approaches for cluster detection operate in batch mode, where position updates are recorded during time periods of certain length and then batch processed by an external routine, thus delaying the result of the cluster detection until the end of the time period. However, if the monitoring application requires results at a higher frequency than the one delivered by batch algorithms, then results might not reflect the current clustering state of the entities. To overcome this limitation, in this paper we propose DG2CEP, an algorithm that combines the well-known density-based clustering algorithm DBSCAN with the data stream processing paradigm Complex Event Processing (CEP) to achieve continuous, on-line detection of clusters. Our experiments with synthetic and real world datasets indicate that DG2CEP is able to detect the formation and dispersion of clusters with small latency and higher similarity to DBSCAN's output than batch-based approaches.

18.

MADSPM: A Mobile-Agent-Based Distributed Multi-Stream Data Processing Model

Chen Peng (陈鹏), Lü Weifeng (吕卫锋) 《Computer Engineering and Applications》 2005, 41(10): 11-14, 18

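This paper compares and analyzes the main stream data models and discusses the Synopsis model, a stream data processing model based on synopsis structures. Mobile agents are introduced on top of the Synopsis model, and MADSPM, a mobile-agent-based distributed multi-stream data processing model, is proposed. Finally, some issues that require attention in mining association rules over stream data based on the MADSPM model are described and analyzed.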

19.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently, without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint and hence are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes the minimum number of forced checkpoints). Researchers have therefore designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm that takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.
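
The notion of a forced checkpoint can be illustrated with a minimal sketch of a classic index-based rule (commonly known as the BCS rule). It is not the fully informed model-based protocol presented in this paper: each process piggybacks its checkpoint index on outgoing messages, and a receiver that is behind takes a forced checkpoint before delivering the message. The class `Process` and its methods are hypothetical, and saving actual process state is omitted.

```python
class Process:
    """One process following an index-based communication-induced rule."""

    def __init__(self, name):
        self.name = name
        self.index = 0                        # index of the latest checkpoint
        self.checkpoints = [(0, "initial")]   # (index, kind) history

    def basic_checkpoint(self):
        """Checkpoint taken independently, on the process's own schedule."""
        self.index += 1
        self.checkpoints.append((self.index, "basic"))

    def send(self, payload):
        """Piggyback the current checkpoint index on every outgoing message."""
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if message["index"] > self.index:
            self.index = message["index"]
            self.checkpoints.append((self.index, "forced"))
        return message["payload"]
```

Under this rule, checkpoints with the same index on different processes line up into a consistent global checkpoint; a process that lacks that index uses its first checkpoint with a higher one. More informed protocols piggyback additional information in order to skip forced checkpoints that this simple rule would take.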

20.

A Survey of Distributed Stream Processing Technology (total citations: 7; self-citations: 0; citations by others: 7)

Cui Xingcan (崔星灿), Yu Xiaohui (禹晓辉), Liu Yang (刘洋), Lü Zhaoyang (吕朝阳) 《Journal of Computer Research and Development》 2015, 52(2): 318-332

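With the rapid development of computer and network technology and the growing variety of data acquisition methods, more and more fields need to process massive, high-speed data in real time. Because such needs often exceed the capabilities of traditional data processing technology, the distributed stream processing paradigm has emerged. This paper first reviews the background and technical evolution of distributed stream processing, then compares it with other related big data processing technologies in order to delimit the scope of distributed stream data processing. It then analyzes in depth the main issues that distributed stream processing must consider, including the data model, system model, storage management, semantic guarantees, load control, and fault tolerance, and points out the strengths and weaknesses of existing solutions. Next, it introduces several representative distributed stream processing systems, such as S4, Storm, and Spark Streaming, and compares them systematically. Finally, it presents several typical applications of distributed stream processing in areas such as social media processing and discusses further research directions in the field.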