Similar literature
1.
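Checkpoint/recovery strategies are widely used for fault tolerance in distributed computing environments. This paper studies how communicating processes can coordinate so that their checkpoints remain globally consistent when channels are unreliable. By analyzing the relationship between in-transit messages and channel reliability, and how existing checkpoint protocols handle in-transit messages, it proposes a coordinated checkpointing method for environments with unreliable channels. The method has a message complexity of O(N), introduces no additional computational burden, and reaches a globally consistent state with a single synchronization. Compared with earlier coordinated checkpointing protocols, it greatly reduces time overhead and improves the efficiency of taking globally consistent checkpoints over unreliable channels.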

2.
Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.

3.
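Distributed computing offers an effective way to make full use of existing network resources. This paper describes the design and implementation of a distributed parallel computing system with open interfaces, aimed at solving hard problems in biological computing. The system is open, heterogeneous, fault-tolerant, and easy to use. Its fault tolerance mechanism, checkpoint strategy, and task scheduling algorithm are discussed. Experiments on the Motif Finding problem show that the distributed parallel computing mechanism greatly shortens solution time and offers an effective approach to hard problems in computing.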

4.
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

5.
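A Checkpoint System Suited to Large-Scale Cluster Parallel Computing   Total citations: 4, self-citations: 1, citations by others: 4
Distributed checkpoint systems are an important means of fault tolerance in large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two main bottlenecks limiting the scalability of parallel checkpoint systems. Taking into account the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images to reduce the overhead of coordinated checkpointing and improve the scalability of the checkpoint system. Preliminary test results show that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.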

6.
Checkpoint/restart is the ability to save the state of a running application so that it can later resume its execution from the time of the checkpoint. The technique has many potential applications, including establishing a fault-tolerant environment, improving system resource utilization, and true process migration. As hardware speed and cluster size increase, the average time between failures has decreased. Therefore, fault tolerance and the ability to checkpoint a process have become essential. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, one of the most popular operating systems, does not provide a general-purpose implementation. Some implementations are limited to a specific type of parallel programming library, confined to certain well-behaved types of applications, or reliant on specific kernel features that may be missing in many cases. Most implementations require recompiling the whole kernel to apply the required patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system and can checkpoint and restart an application fully transparently, independently of the application's behavior. The entire system is implemented in multiple loadable kernel modules, which makes it easy to use and eliminates the burden of complex system administration.

7.
李煜  李汉菊 《计算机工程》2003,29(8):47-48,61
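CompactPCI is a high-performance industrial bus based on the PCI electrical specification. As a technology that improves system availability, it is widely used in telecommunications, real-time machine control, and military systems, and is especially suitable for embedded processing applications such as system controllers and custom I/O cards. This paper proposes a parallel firewall model based on CompactPCI and, on that basis, discusses techniques for implementing dynamic load balancing and fault tolerance.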

8.
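Failures of distributed simulation systems caused by node crashes or insufficient simulation resources reduce system reliability. To ensure effective fault tolerance while lowering its overhead, this paper proposes a virtualization-based fault tolerance method for simulation systems that dynamically applies different fault tolerance strategies to different fault types according to where the fault occurs. It analyzes how to optimize the checkpoint strategy and gives the optimal checkpoint interval. Exploiting the advantages of virtualization, it addresses node selection, the number of replicas, and replica placement for the replication strategy. It also introduces a strategy based on virtual machine migration as a complement to the checkpoint and replication strategies in order to reduce overhead. A comparison of simulation experiment data for the dynamic and conventional fault tolerance strategies shows that the dynamic strategy maintains fault tolerance performance while keeping overhead low.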

9.
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.

10.
杨娜  刘靖 《软件学报》2019,30(4):1191-1202
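Providing efficient and continuously available fault tolerance services is essential to the reliable operation of cloud application systems. Adopting the fault-tolerance-as-a-service model, this paper proposes an optimized method for dynamically provisioning cloud fault tolerance services. It describes the fault tolerance requirements of cloud applications in terms of the reliability and response time of their components, builds on common fault tolerance techniques such as replication, checkpointing, and NVP (N-version programming), and takes full account of the overhead of dynamically switching fault tolerance services. For the two scenarios in which the underlying cloud resources supporting the services are or are not sufficient, it gives optimization methods for finding a feasible fault-tolerance-as-a-service provisioning plan. Experimental results show that the method reduces both the fault tolerance service fees paid by cloud application systems and the overhead of the underlying cloud resources, and improves the provider's ability to deliver efficient and reliable fault tolerance as a service to multiple cloud applications.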

11.
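Large-scale heterogeneous many-core computer systems offer strong computing power and a high performance-to-power ratio, and have become the direction of supercomputer development. However, their complex heterogeneous structure and huge scale pose great challenges to system availability, so lightweight fault tolerance techniques for such systems are of great importance. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, two language-supported fault tolerance mechanisms were designed and implemented in the Parallel C language: lightweight degradation with local fault awareness, and checkpointing based on compiler directives and automatic analysis. Together they balance usability and efficiency. Lightweight degradation with local fault awareness is implemented together with a dynamic task scheduling framework, supports many-core systems, and scales to a parallelism of more than one million. With checkpointing based on compiler directives and automatic analysis, the programmer inserts simple compiler directives and the compiler analyzes the program to identify data that need not be saved, which effectively reduces the volume of data saved and restored. Tests on the Sunway TaihuLight supercomputer show that both measures perform well relative to traditional fault tolerance methods. The overhead of lightweight degradation is below 1%, and execution time for a single failure is reduced by more than 3.5% relative to traditional rollback. In typical applications, checkpointing based on compiler directives and automatic analysis reduces the saved data volume to as little as 1/10, making it highly practical.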

12.
Various methods and techniques have been proposed in the past for improving the performance of queries on structured and unstructured data. This paper proposes a parallel B-Tree index in the MapReduce framework that improves the efficiency of random reads over existing approaches. The benefit of using the MapReduce framework is that it hides the complexity of implementing parallelism and fault tolerance from users and presents these in a user-friendly way. The proposed index reduces the number of data accesses for range queries and thus improves efficiency. The B-Tree index on MapReduce is implemented in a chained-MapReduce process that reduces intermediate data access time between successive map and reduce functions, which further improves efficiency. Finally, five performance metrics are used to validate the proposed index for range search queries in MapReduce, such as the effect of varying cluster size and range query coverage on execution time, the number of map tasks, and the size of input/output (I/O) data. The effect of varying the Hadoop Distributed File System (HDFS) block size, and an analysis of heap memory size and of the intermediate data generated during map and reduce functions, also show the superiority of the proposed index. Experimental results show that the parallel B-Tree index in a chained-MapReduce environment performs better than the default non-indexed Hadoop dataset and the B-Tree-like Global Index (Zhao et al., 2012) in MapReduce.

13.
王一拙  陈旭  计卫星  苏岩  王小军  石峰 《软件学报》2016,27(7):1789-1804
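The task-parallel programming model has become mainstream in parallel programming. It improves the performance of parallel computers by exploiting task parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault tolerance techniques into the task-parallel programming model, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault tolerance at the application level. A Buffer-Commit computation model supports detection of and recovery from transient errors. Application-level diskless checkpointing supports recovery from permanent errors of the node-failure type. A fault-tolerant work-stealing task scheduling policy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at low performance overhead.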

14.
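A Moderately Greedy Integrated Cache and Prefetching Algorithm for Parallel File Systems   Total citations: 3, self-citations: 0, citations by others: 3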
卢凯  金士尧  卢锡城 《计算机学报》1999,22(11):1172-1177
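Caching and prefetching are two effective ways to reduce access latency in traditional file systems. Under the I/O access patterns of parallel scientific computing applications, simple caching and prefetching can no longer deliver a high Cache hit rate. Based on an analysis of this I/O pattern, this paper proposes a moderately greedy integrated Cache and prefetching algorithm (PGI). The algorithm makes full use of the characteristics of the parallel file system environment and adopts a moderately greedy dynamic sliding-mode technique, which effectively eliminates thrashing during prefetching and reduces system processing overhead. It also adopts an integrated Cache and prefetching …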

15.
杨丽鹏  车永刚 《计算机应用》2013,33(9):2423-2427
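Large-scale computational fluid dynamics (CFD) computations place high demands on data I/O capability. The Hierarchical Data Format (HDF5) manages large-scale scientific data effectively and supports parallel I/O well. For a structured-grid parallel CFD program, an HDF5 storage layout for its data files was designed, parallel I/O on those files was implemented with the HDF5 parallel I/O programming interface, and performance was tested and analyzed on a parallel computer system. The results show that with 4 to 32 processes, file write performance using HDF5 parallel I/O is 6.9 to 16.1 times higher than when each process independently writes an ordinary file. File read performance using HDF5 parallel I/O is lower, at 20% to 70% of that of independent reads, but because reading takes far less time than writing, the effect on overall performance is small.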

16.
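Research on Applications of Phase Change Memory (PCM), a New Non-Volatile Memory   Total citations: 1, self-citations: 0, citations by others: 1
Parallel I/O techniques effectively optimize I/O performance but give little control over access latency. Phase change memory (PCM), a type of storage class memory (SCM), is non-volatile, randomly readable and writable, low-latency, high-throughput, compact, and low-power, and thus offers the most direct and effective route to I/O performance optimization. This paper examines the characteristics and open problems of PCM, summarizes current progress in research on its applications, and, for the parallel I/O problem in high-performance computing, proposes a PCM-based hierarchical parallel hybrid storage model that can effectively improve the metadata service efficiency of parallel file systems and parallel I/O throughput.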

17.
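Research on a Parallel I/O Model in a Linux-Based SMP Cluster Environment   Total citations: 1, self-citations: 0, citations by others: 1
This paper proposes a framework for a data-path-based, wave-advancing parallel I/O model. In a Linux-based SMP cluster system, each data path is modeled according to this framework and the wave-advancing parallel I/O model is analyzed in detail, providing a conceptual solution to the problem of building a parallel I/O model that characterizes parallel I/O performance.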

18.
Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time. This makes the grid infrastructure inherently unreliable, resulting in delays and failures of executing jobs. Thus, fault tolerance becomes a vital aspect of the grid for achieving reliability, availability, and quality of service. The most common technique for achieving fault tolerance in high performance computing is rollback recovery. It relies on the availability of checkpoints and the stability of storage media, so checkpoints are replicated on storage media. This increases job execution time if replication is not done properly. Furthermore, dedicating powerful resources solely to checkpoint storage wastes the computing power of these resources, and may result in bottlenecks when the load on the network is high. To address the problem, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, checkpoints are replicated on all checkpoint servers in the grid in a distributed manner. This decreases the checkpoint replication time and in turn improves the overall job execution time. Additionally, if a resource fails during execution of a job, RCSS restarts the job from its last valid checkpoint taken from any checkpoint server in the grid. Furthermore, to increase grid performance, the CPU cycles of checkpoint servers are also utilized during high load on the network. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with a varying number of clusters. Additionally, RCSS reduces replication time within a cluster by 39.5 %.

19.
MegaBlast is one of the most important programs in the NCBI BLAST (Basic Local Alignment Search Tool) toolkit. However, MegaBlast is computation and I/O intensive. It consumes a great deal of memory, proportional to the product of the sizes of the query sequence set and the subject (database) sequence set. This paper proposes a new strategy for optimizing MegaBlast. The new strategy exchanges the query and subject sequence sets, and builds a hash table based on the new subject sequences. It overlaps I/O with computation, shortens the overall time, and reduces memory cost, since memory use is now proportional only to the size of the subject sequence set. The optimized algorithm is suitable for parallelization on cluster systems. The parallel algorithm uses the query segmentation method. As our experiments show, the parallel program, which is implemented with MPI, scales well.

20.
This paper presents a new distributed disk-array architecture for achieving high I/O performance in scalable cluster computing. In a serverless cluster of computers, all distributed local disks can be integrated as a distributed-software redundant array of independent disks (ds-RAID) with a single I/O space. We report the new RAID-x design and its benchmark performance results. The advantage of RAID-x comes mainly from its orthogonal striping and mirroring (OSM) architecture. The bandwidth is enhanced with distributed striping across local and remote disks, while the reliability comes from orthogonal mirroring on local disks in the background. Our RAID-x design is experimentally compared with the RAID-5, RAID-10, and chained-declustering RAID through benchmarking on a research Linux cluster at USC. Andrew and Bonnie benchmark results are reported on all four disk-array architectures. Cooperative disk drivers and Linux extensions are developed to enable not only the single I/O space, but also the shared virtual memory and global file hierarchy. We reveal the effects of traffic rate and stripe unit size on I/O performance. Through scalability and overhead analysis, we find the strength of RAID-x in three areas: 1) improved aggregate I/O bandwidth especially for parallel writes, 2) orthogonal mirroring with low software overhead, and 3) enhanced scalability in cluster I/O processing. Architectural strengths and weaknesses of all four ds-RAID architectures are evaluated comparatively. The optimal choice among them depends on the parallel read/write performance desired, the level of fault tolerance required, and the cost-effectiveness in specific I/O processing applications.
