首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 429 毫秒
1.
基于Lustre文件系统的MPI检查点系统实现技术与性能测试   总被引:1,自引:0,他引:1  
基于协同式检查点的回卷恢复是在大规模并行计算机系统中得到采用的一项重要容错技术,其性能开销主要为协同协议和检查点映像存储所决定.描述了一个在MPICH2中实现的应用透明的并行检查点系统,相比已有的技术,该系统有以下特点:1)协同协议操作利用了并行应用的近邻通信特性,通过虚连接方法减少协议的处理开销;2)采用Lustre文件系统简化检查点映像文件管理的复杂性;3)通过并行I/O操作提高性能,优化检查点映像的存储过程.实际应用的测试表明,该检查点系统具有较小的运行时间开销和良好的可扩展性.  相似文献   

2.
Application Level Fault Tolerance in Heterogeneous Networks of Workstations   总被引:2,自引:0,他引:2  
We have explored methods for checkpointing and restarting processes within the distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is found to be low, while providing substantial decreases in expected runtime on realistic systems.  相似文献   

3.
Diskless checkpointing   总被引:4,自引:0,他引:4  
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures  相似文献   

4.
在大规模机群环境下,检查点和恢复机制是一种必不可少的容错技术。该文提出一种基于机群通信系统的可靠性机制,在不作全局同步的情况下获取通信系统全局状态的方法,并利用该方法实现了一个对应用程序透明的并行检查点系统。该系统通过底层通信系统的支持降低了并行检查点的实现复杂度和执行开销,适用于大规模机群应用。  相似文献   

5.
Describes a nonblocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency in the execution of state saving and other simulation specific operations (e.g, event list update, event execution) with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting nonblocking checkpointing on a myrinet based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. By the results of an empirical study on classical parameterized synthetic benchmarks, we show that, except for the case of minimal state granularity applications, nonblocking checkpointing allows improvement of the speed of the parallel execution, as compared to commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study for the case of a personal communication system (PCS) simulation is additionally reported to point out the benefits from nonblocking checkpointing for a real world application.  相似文献   

6.
王一拙  陈旭  计卫星  苏岩  王小军  石峰 《软件学报》2016,27(7):1789-1804
任务并行程序设计模型已成为并行程序设计的主流,其通过发掘任务并行性来提高并行计算机的系统性能.提出一种支持容错的任务并行程序设计模型,将容错技术融入到任务并行程序设计模型中,在保证性能的同时提高系统可靠性.该模型以任务为调度、执行、错误检测与恢复的基本单位,在应用级实现容错支持.采用一种Buffer-Commit计算模型支持瞬时错误的检测与恢复;采用应用级无盘检查点实现节点故障类型永久错误的恢复;采用一种支持容错的工作窃取任务调度策略获得动态负载均衡.实验结果表明,该模型以较低的性能开销提供了对硬件错误的容错支持.  相似文献   

7.
A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment dynamic scheduling in distributed environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.  相似文献   

8.
运行数据是大数据系统中增长最快、最为复杂也是最有价值的数据资源之一。基于运行数据,软件开发者可以分析关于软件质量和开发模型的重要信息。Spark作为一个分布式系统,在运行过程中会产生大量的运行数据,包括日志数据、监控数据以及任务图数据。开发者可以基于运行数据对系统进行参数调优。然而该系统所涉及的参数种类繁多、影响多样且难以评估,若对系统了解不足,进行参数调优存在较大的困难。提出运行数据历史库的概念,历史库中存储的是以往运行任务的特征信息以及运行配置信息。同时提出了基于历史库搜索的参数优化模型,并实验验证了本文提出的参数优化模型对用户任务性能提升具有较好的效果。  相似文献   

9.
As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.  相似文献   

10.
This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. The analytic results for synchronous checkpointing without load redistribution are compared to measurements of a synthetic parallel algorithm with user-level checkpointing. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are used to determine when load redistribution is advantageous.  相似文献   

11.
In this article, we study the effects of network topology and load balancing on the performance of a new parallel algorithm for solving triangular systems of linear equations on distributed-memory message-passing multiprocessors. The proposed algorithm employs novel runtime data mapping and workload redistribution methods on a communication network which is configured as a toroidal mesh. A fully parameterized theoretical model is used to predict communication behaviors of the proposed algorithm relevant to load balancing, and the analytical performance results correctly determine the optimal dimensions of the toroidal mesh, which vary with the problem size, the number of available processors, and the hardware parameters of the machine. Further enhancement to the proposed algorithm is then achieved through redistributing the arithmetic workload at runtime. Our FORTRAN implementation of the proposed algorithm as well as its enhanced version has been tested on an Intel iPSC/2 hypercube, and the same code is also suitable for executing the algorithm on the iPSC/860 hypercube and the Intel Paragon mesh multiprocessor. The actual timing results support our theoretical findings, and they both confirm the very significant impact a network topology chosen at runtime can have on the computational load distribution, the communication behaviors and the overall performance of parallel algorithms.  相似文献   

12.
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.  相似文献   

13.
The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to implement fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large scale systems. In this paper three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods to reduce checkpointing cost.  相似文献   

14.
In this paper, we present parallel multilevel algorithms for the hypergraph partitioning problem. In particular, we describe for parallel coarsening, parallel greedy k-way refinement and parallel multi-phase refinement. Using an asymptotic theoretical performance model, we derive the isoefficiency function for our algorithms and hence show that they are technically scalable when the maximum vertex and hyperedge degrees are small. We conduct experiments on hypergraphs from six different application domains to investigate the empirical scalability of our algorithms both in terms of runtime and partition quality. Our findings confirm that the quality of partition produced by our algorithms is stable as the number of processors is increased while being competitive with those produced by a state-of-the-art serial multilevel partitioning tool. We also validate our theoretical performance model through an isoefficiency study. Finally, we evaluate the impact of introducing parallel multi-phase refinement into our parallel multilevel algorithm in terms of the trade off between improved partition quality and higher runtime cost.  相似文献   

15.
Modern network processor systems require the ability to adapt their processing capabilities at runtime to changes in network traffic. Traditionally, network processor applications have been optimized for a single static workload scenario, but recently several approaches for runtime adaptation have been proposed. Comparing these approaches and developing novel runtime support algorithms is difficult due to the multicore system-on-a-chip nature of network processors. In this paper, we present a model for network processors that can aid in evaluating different runtime support systems. The model considers workload characteristics of applications and network traffic using a queuing network abstraction. The accuracy of this analytical approach to modeling runtime systems is validated through simulation. We illustrate the effectiveness of our model by comparing the performance of two existing workload adaptation algorithms.  相似文献   

16.
刘勇燕  刘勇鹏  冯华  迟万庆 《计算机科学》2011,38(5):287-289,305
检查点机制是高性能并行计算系统中重要的容错手段,随着系统规模的增大,并行检查点的可扩展性受文件访问的制约。针对大规模并行计算系统的多级文件系统结构,提出了cache式并行检查点技术。它将全局同步并行检查点转化为局部文件操作,并利用多处理器结构进行乱序流水线式写回调度,将检查点的写回时机合理分布,从而有效地隐藏了检查点的写回开销,保证了并行检查点文件访问的高性能和高可扩展性。  相似文献   

17.
CCL (checkpointing and communication library) is a software layer in support of optimistic parallel discrete event simulation (PDES) on myrinet-based COTS clusters. Beyond classical low latency message delivery functionalities, this library implements CPU offloaded, non-blocking (asynchronous) checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engine on board of myrinet network cards. These functionalities are unique since optimistic simulation systems conventionally rely on checkpointing implemented as a synchronous, CPU-based data copy. Releases of CCL up to v2.4 only support monoprogrammed non-blocking checkpoints. This forces re-synchronization between CPU and DMA activities, which is a potential source of overhead, each time a new checkpoint request must be issued at the simulation application level while the last issued one is still being carried out by the DMA engine. In this paper we present a redesigned release of CCL (v3.0) that, exploiting hardware capabilities of more advanced myrinet clusters, supports multiprogrammed non-blocking checkpoints. The multiprogrammed approach allows higher degree of concurrency between checkpointing and other simulation specific operations carried out by the CPU, with benefits on performance. We also report the results of the experimental evaluation of those benefits for the case of a Personal Communication System (PCS) simulation application, selected as a real world test-bed.  相似文献   

18.
Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.
Alan D. GeorgeEmail:
  相似文献   

19.
Firmware是IA64体系结构的一个重要组成部分。在系统启动时,首先运行Firmware,它对系统的各部件进行初始化、配置和测试,同时还对操作系统的引导进行管理。在大规模并行计算机系统中,节点问的Firmware结构是设计系统时要解决的重要问题。本文提出一种基于IA64的大规模并行系统的Firmware的结构,并对其作了简要分析。  相似文献   

20.
The increased complexity and scale of high performance computing and future extreme-scale systems have made resilience a key issue, since it is expected that future systems will have various faults during critical operations. It is also expected that current solutions for resiliency, mainly counting on checkpointing in hardware and applications, will become infeasible because of unacceptable recovery time for checkpointing and restarting. In this paper, we present innovative concepts for anomaly detection and identification, analyzing the duration of pattern transition sequences of an execution window. We use a three-dimensional array of features to capture spatial and temporal variability to be used by an anomaly analysis system to immediately generate an alert and identify the source of faults when an abnormal behavior pattern is captured, indicating some kind of software or hardware failure. The main contributions of this paper include the innovative analysis methodology and feature selection to detect and identify anomalous behavior. Evaluating the effectiveness of this approach to detect faults injected asynchronously shows a detection rate of above 99.9% with no occurrences of false alarms for a wide range of scenarios, and accuracy rate of 100% with short root cause analysis time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号