Similar Documents
20 similar documents found.
1.
This paper presents P2P-MPI, a middleware aimed at computational Grids. From the programmer's point of view, P2P-MPI provides a message-passing programming model which enables the development of MPI applications for Grids. Its originality lies in its adaptation to unstable environments. First, the peer-to-peer design of P2P-MPI allows for dynamic discovery of collaborating resources. Second, it lets the user adjust the robustness of an execution through an internal process replication mechanism. Finally, we measure the performance of the integrated message-passing library on several benchmarks and on different hardware platforms.

2.
As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high-performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring the successful execution of HPC programs has therefore become a concern even for unprivileged users: if a program failure is not detected and handled in time, it wastes resources and delays the program's progress. Unfortunately, unprivileged users cannot check program state themselves, because execution is controlled by the job management system and their privileges are limited. Automated tools that support user-level failure detection and automatic recovery of parallel programs on HPC systems are currently missing. This paper proposes a method that lets an unprivileged user detect job execution failures and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user's jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on Tianhe-2, with negligible communication and performance overhead. Its good scalability makes it applicable to large-scale HPC systems.
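The abstract leaves the checker's mechanics open. As a rough sketch of the idea of a user-level checker encapsulated as an independent job, the following Python fragment polls a job's state and resubmits it on failure; it assumes a SLURM-like scheduler (`sacct`, `sbatch`), which is an assumption of this sketch rather than a description of Tianhe-2's job system or of ARL's actual dual-checker implementation.

```python
import subprocess
import time

def job_state(job_id: str) -> str:
    # Query the scheduler for the job's state (assumes a SLURM-like `sacct`).
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True
    ).stdout
    return out.strip().split()[0] if out.strip() else "UNKNOWN"

def resubmit(script: str) -> str:
    # Resubmit the failed job script; returns the new job id
    # parsed from "Submitted batch job <id>".
    out = subprocess.run(["sbatch", script], capture_output=True, text=True,
                         check=True).stdout
    return out.strip().split()[-1]

def monitor(job_id: str, script: str, max_retries: int = 3, poll: int = 60):
    # The checker itself runs as an ordinary (unprivileged) job: it polls
    # the target job's state and, on failure, resubmits it up to max_retries.
    retries = 0
    while retries <= max_retries:
        state = job_state(job_id)
        if state in ("FAILED", "NODE_FAIL", "TIMEOUT"):
            retries += 1
            job_id = resubmit(script)
        elif state == "COMPLETED":
            return job_id
        time.sleep(poll)
    raise RuntimeError("job kept failing after retries")
```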

3.
Task management and error recovery form the core of the CFD grid application platform: they manage the CFD programs running on the platform and provide an error-recovery mechanism. To manage CFD jobs better, the platform builds an abstract model of CFD programs and uses it to optimize the design of the task-management and error-recovery module. The module therefore adopts a two-layer architecture: the upper layer manages services, while the lower layer manages the parallel programs. The module also provides a daemon process, CFD-Daemon, to implement error recovery, as well as an application development kit for CFD programmers.

4.
In this paper, we present techniques for providing on-demand structural redundancy for Coarse-Grained Reconfigurable Arrays (CGRAs) and a calculus for determining the reliability gains of applying these replication techniques, from the perspective of safety-critical parallel loop applications. To protect massively parallel loop computations against errors such as soft errors, well-known replication schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) are applied to each single Processing Element (PE) on demand, based on application requirements for reliability and on Soft Error Rates (SERs). Moreover, different voting options and signal replication schemes are investigated. It is shown that hardware voting can be accomplished at negligible hardware cost, i.e., less than two percent area overhead per PE, for a class of reconfigurable processor arrays called Tightly Coupled Processor Arrays (TCPAs). As a major contribution of this paper, we elaborate a formal analysis of the reliability achievable by each combination of replication and voting scheme for parallel loop executions on CGRAs, as a function of a given SER and of the application's timing characteristics (schedule). Using this analysis, error detection latencies can be computed, and sound runtime decisions about which replication scheme to choose to guarantee a maximal probability of failure on demand can be derived. Finally, fault-simulation results are provided and compared with the formal reliability analysis.
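For intuition about the trade-off between DMR and TMR, a back-of-the-envelope comparison can be made with the textbook redundancy formulas. The sketch below uses a simple Poisson soft-error model per PE; it is illustrative only and not the paper's formal calculus, which also accounts for voting options and schedule timing.

```python
import math

def pe_reliability(ser: float, t: float) -> float:
    # Probability a single PE runs error-free for time t, assuming soft
    # errors arrive as a Poisson process with rate `ser` (errors/second).
    return math.exp(-ser * t)

def dmr_agreement(r: float) -> float:
    # DMR with a comparator detects (but cannot mask) single errors:
    # probability both replicas produce a correct, agreeing result.
    return r * r

def tmr_reliability(r: float) -> float:
    # TMR with a majority voter masks one faulty replica:
    # R_TMR = R^3 + 3 R^2 (1 - R) = 3 R^2 - 2 R^3.
    return 3 * r**2 - 2 * r**3

r = pe_reliability(ser=1e-6, t=3600.0)   # a one-hour loop execution
print(f"single PE: {r:.6f}  DMR: {dmr_agreement(r):.6f}  "
      f"TMR: {tmr_reliability(r):.6f}")
```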

5.
Stochastic bounds are obtained on the execution times of parallel programs when the number of processors is unlimited. A parallel program is modeled as a set of interdependent tasks with synchronization constraints, described by an acyclic directed graph called a task graph. Task execution times are assumed to be independent, identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution time of the parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs whose task execution times are `new better than used in expectation' (NBUE) random variables: the NBUE variables are replaced with exponential random variables of the same mean to derive this bound.
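To make the measure concrete, the quantity being bounded can be estimated by simulation: with unlimited processors, the overall execution time is the longest path through the task graph when each task's duration is sampled i.i.d. The sketch below uses exponential task times (matching the NBUE upper-bound construction) on a hypothetical four-task graph; the paper derives analytic bounds rather than simulating.

```python
import random

# Task graph as an adjacency list of an acyclic digraph (successors).
edges = {0: [1, 2], 1: [3], 2: [3], 3: []}

def makespan(sample):
    # Finish time of task v = its own duration plus the latest finish
    # among its predecessors; the makespan is the max over all tasks.
    finish = {}
    def f(v):
        if v not in finish:
            preds = [u for u in edges if v in edges[u]]
            finish[v] = sample[v] + max((f(u) for u in preds), default=0.0)
        return finish[v]
    return max(f(v) for v in edges)

# i.i.d. exponential task times with mean 1.
runs = [makespan({v: random.expovariate(1.0) for v in edges})
        for _ in range(100_000)]
print("estimated overall execution time:", sum(runs) / len(runs))
```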

6.
As the scale of supercomputers grows rapidly, reliability comes to dominate system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot fix this problem effectively. To address it, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the 'work-most' (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Analogous to program locality, we observe a failure locality phenomenon in supercomputers for the first time. In the new proactive mechanism, process replication with process prefetching is built on this failure locality, largely avoiding the losses caused by failures regardless of whether they were predicted. Simulations with real failure traces demonstrate that FTRP outperforms existing fault tolerance mechanisms, with up to 10% improvement in application efficiency at common failure prediction accuracies, and that it remains effective for petascale systems and beyond.
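The WM model's runtime decision can be pictured as an expected-cost comparison between candidate actions. The following sketch is a generic, illustrative cost model under assumed cost parameters, not FTRP's actual formulation:

```python
def expected_cost(p_fail: float, action: str, work_interval: float,
                  c_ckpt: float, c_repl: float, c_restart: float) -> float:
    # Expected cost of the next execution interval under each action
    # (illustrative, not FTRP's actual 'work-most' model).
    if action == "none":
        # On failure, the interval's work is lost and must be redone.
        return work_interval + p_fail * (work_interval + c_restart)
    if action == "checkpoint":
        return work_interval + c_ckpt + p_fail * c_restart
    if action == "replicate":
        # A prefetched replica absorbs the failure: pay the replication
        # overhead, lose (almost) nothing whether or not it strikes.
        return work_interval + c_repl
    raise ValueError(action)

def choose_action(p_fail, **costs):
    return min(("none", "checkpoint", "replicate"),
               key=lambda a: expected_cost(p_fail, a, **costs))

print(choose_action(0.02, work_interval=600, c_ckpt=30, c_repl=90,
                    c_restart=400))   # -> none
print(choose_action(0.40, work_interval=600, c_ckpt=30, c_repl=90,
                    c_restart=400))   # -> replicate
```

With a low predicted failure probability, the cheapest action is to do nothing; as the predicted probability rises, paying the replication overhead up front becomes the better bet.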

7.
With the scaling up of high-performance computing systems in recent years, their reliability has been declining continuously, and system resilience has come to be regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; we then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. Challenges posed by system scale and heterogeneous architectures are also discussed.

8.
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
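A toy version of the replicate-and-compare idea can be written in a few lines. The sketch below majority-votes on whole-process output; real PLR compares the redundant processes at a much finer granularity and handles input replication, so this illustrates the principle only.

```python
import subprocess
from collections import Counter

def run_redundant(cmd, n=3):
    # Launch n redundant copies of a (single-threaded, deterministic)
    # program and majority-vote on their outputs; a transient fault in
    # one replica is outvoted, while a tie is an unrecoverable error.
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE) for _ in range(n)]
    outputs = [p.communicate()[0] for p in procs]
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= n // 2:
        raise RuntimeError("no majority: unrecoverable transient fault")
    if count < n:
        print(f"transient fault masked: {n - count} replica(s) diverged")
    return winner

result = run_redundant(["echo", "hello"], n=3)
```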

9.
10.
Test set size, in terms of the number of test cases, is an important consideration when testing software systems: using too few test cases may result in poor fault detection, while using too many may be expensive and redundant. We define the failure rate of a program as the fraction of test cases in an available test pool that result in execution failure on that program. This paper investigates the relationship between failure rates and the number of test cases required to detect the faults. Our experiments, based on 11 sets of C programs, suggest that an accurate estimation of the failure rates of the potential fault(s) in a program provides a reliable estimate of an adequate test set size with respect to fault detection, and should therefore be one of the factors kept in mind during test set construction. Furthermore, the proposed model is fairly robust to incorrect estimation of failure rates and still provides good predictive quality. Experiments are also performed to observe, through failure rates, the relationship between multiple faults present in the same program. When predicting effectiveness against a program with multiple faults, the results indicate that not knowing the number of faults in the program is not a significant concern, as the predictive quality is typically not affected adversely.
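The core relationship admits a simple closed form under an independence assumption: if each test drawn from the pool fails with probability θ (the failure rate), then n tests detect the fault with probability 1 − (1 − θ)^n. The sketch below solves this for the smallest adequate n; the paper's model is richer, so treat this as the underlying intuition.

```python
import math

def required_test_set_size(theta: float, confidence: float = 0.95) -> int:
    # If each test case fails independently with probability theta (the
    # program's failure rate over the test pool), the chance that n random
    # tests all miss the fault is (1 - theta)^n.  Solving
    #     1 - (1 - theta)^n >= confidence
    # gives the smallest adequate test set size.
    return math.ceil(math.log(1 - confidence) / math.log(1 - theta))

for theta in (0.1, 0.01, 0.001):
    print(f"failure rate {theta}: {required_test_set_size(theta)} tests "
          f"for 95% detection confidence")
```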

11.
GOP is a graph-oriented programming model which aims at providing high-level abstractions for configuring and programming cooperative parallel processes. With GOP, the programmer can configure the logical structure of a parallel/distributed program by constructing a logical graph to represent the communication and synchronization between the local programs in a distributed processing environment. This paper describes a visual programming environment, called VisualGOP, for the design, coding, and execution of GOP programs. VisualGOP applies visual techniques to provide the programmer with automated and intelligent assistance throughout the program design and construction process. It provides a graphical interface with support for interactive graph drawing and editing, visual programming functions, and automation facilities for program mapping and execution. VisualGOP is a generic programming environment, independent of programming languages and platforms; GOP programs constructed under VisualGOP can run on heterogeneous parallel/distributed systems.

12.
Recent fault localization techniques statistically analyze the coverage information of a set of test runs to measure the correlations between program entities and program failures. However, coverage information cannot identify those program entities whose execution actually affects the output, which weakens these correlations. This paper proposes a slice-based statistical fault localization approach to address this problem. Our approach uses the program slices of a set of test runs to capture the influence of a program entity's execution on the output, and applies statistical analysis to measure the suspiciousness of each program entity being faulty. In addition, the paper presents a new slicing approach, called approximate dynamic backward slicing, which balances the size and accuracy of a slice, and applies these slices in our statistical approach. We use two standard benchmarks and three real-life UNIX utility programs as subjects, and compare our approach against a range of existing fault localization techniques. The experimental results show that our approach significantly improves the effectiveness of fault localization.

13.
Predicting where faults can hide from testing
Voas, J., Morell, L., Miller, K. IEEE Software, 1991, 8(2):41-48
Sensitivity analysis, which estimates the probability that a program location can hide a failure-causing fault, is addressed. The concept of sensitivity is discussed, and a fault/failure model that accounts for fault location is presented. Sensitivity analysis requires that every location be analyzed for three properties: the probability of execution occurring, the probability of infection occurring, and the probability of propagation occurring. One type of analysis is required to handle each part of the fault/failure model. Each of these analyses is examined, and the interpretation of the resulting three sets of probability estimates for each location is discussed. The relationship of the approach to testability is considered
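The three probabilities of the fault/failure model compose naturally: a location can hide a fault if it is rarely executed, rarely infects the data state, or rarely propagates an infection to the output. The sketch below estimates the three factors by Monte Carlo; the `program` instrumentation hooks are hypothetical, for illustration, and the estimation details differ from the paper's sensitivity analysis.

```python
import random

def estimate_sensitivity(program, inputs, perturb, n=1000):
    # Monte Carlo estimates of the execution, infection, and propagation
    # probabilities of one program location.  `program(x, tamper_at)` is a
    # hypothetical instrumentation hook: it returns (output, reached, state),
    # reporting whether the location was reached and applying `tamper_at`
    # to the data state there when given.
    executed = infected = propagated = 0
    for _ in range(n):
        x = random.choice(inputs)
        out, reached, state = program(x, tamper_at=None)
        if not reached:
            continue
        executed += 1
        bad_out, _, bad_state = program(x, tamper_at=perturb)
        if bad_state != state:
            infected += 1
            if bad_out != out:
                propagated += 1
    p_exec = executed / n
    p_inf = infected / max(executed, 1)
    p_prop = propagated / max(infected, 1)
    # A small product flags a location where faults can hide from testing.
    return p_exec * p_inf * p_prop
```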

14.
The execution patterns and fault distribution characteristics of a program affect its failure process and thus its reliability estimates. The failure process of a software system is influenced by many factors, and traditional software reliability engineering has found it difficult to isolate the effect of each individual factor. A simulation approach is used to investigate the effects of fault distribution, execution pattern, and program structure on software reliability estimates. A reliability simulation environment (RSIM) is extended by introducing variable fault distribution patterns in its code generation phase; flow control points allow varying the execution frequency of different parts of a program. The simulation results show that fault distribution patterns and execution patterns have dramatic effects on the fault exposure rate: if the fault distribution is non-uniform, non-uniform code execution exposes faults more efficiently and effectively than uniform execution. Results also show that the structure of a program affects the fault exposure rate and the testing time required.
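The headline finding can be reproduced in a toy analytical model: place a fault in region i with probability p_i, execute region j on each test with probability q_j, and detect on a hit with fixed probability. The model below is illustrative only, far simpler than RSIM:

```python
def expected_fraction_exposed(fault_dist, exec_dist, detect=0.5, n_tests=20):
    # A fault sits in region i with probability fault_dist[i]; each test
    # runs region j with probability exec_dist[j] and detects a fault
    # there with probability `detect`.  After n_tests independent tests,
    # a fault in region i is exposed with probability
    #     1 - (1 - detect * q_i) ** n_tests.
    return sum(p * (1 - (1 - detect * q) ** n_tests)
               for p, q in zip(fault_dist, exec_dist))

skewed  = [0.91] + [0.01] * 9     # faults cluster in one hot region
uniform = [0.1] * 10
print("uniform execution :", expected_fraction_exposed(skewed, uniform))
print("matching execution:", expected_fraction_exposed(skewed, skewed))
```

Matching the execution profile to a skewed fault distribution exposes a much larger fraction of the faults within the same testing budget than uniform execution does.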

15.
Kasi Anantha, Fred Long. Software, 1990, 20(6):537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.

16.
17.
A Fault Localization Method Based on a Conditional Probability Model (基于条件概率模型的缺陷定位方法)
舒挺, 黄明献, 丁佐华, 王磊, 夏劲松. 软件学报 (Journal of Software), 2018, 29(6):1756-1769
Fault localization is an important phase of software debugging, and localizing faults from program spectrum information is currently one of the more effective approaches. Spectrum-based fault localization rests on the premise that a latent correlation exists between the program spectrum and the execution results. Through an empirical analysis of this correlation, and drawing on the notion of conditional probability from statistics, we build a P model that quantifies the strength of the relationship between the two, and on this basis propose a conditional-probability-based fault localization method. Using the seven programs of the Siemens suite, the Space program, and three Unix utility programs as benchmark subjects, we compare the method experimentally with 15 classic fault localization methods. The empirical results show that, overall, the method localizes faults more effectively.
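For readers unfamiliar with this family of techniques, the basic conditional-probability idea can be shown in a few lines: score each statement by P(fail | statement executed), estimated from the coverage spectrum. This is a simple stand-in, not the paper's P model, which quantifies the spectrum/result correlation more finely:

```python
def suspiciousness(coverage, results):
    # coverage[t][s] is truthy if test t executes statement s; results[t]
    # is "pass" or "fail".  Each statement is scored by the conditional
    # probability of failure given that it was executed.
    n_stmts = len(coverage[0])
    scores = []
    for s in range(n_stmts):
        cf = sum(1 for t, r in enumerate(results)
                 if coverage[t][s] and r == "fail")
        cs = sum(1 for t, r in enumerate(results)
                 if coverage[t][s] and r == "pass")
        scores.append(cf / (cf + cs) if cf + cs else 0.0)
    ranking = sorted(range(n_stmts), key=lambda s: -scores[s])
    return ranking, scores

# Four tests over three statements; statement 2 correlates with failures.
cov = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 1]]
res = ["pass", "fail", "pass", "fail"]
ranking, scores = suspiciousness(cov, res)
print(ranking, [f"{x:.2f}" for x in scores])   # [2, 0, 1] ...
```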

18.
We propose a new technique combining dynamic and static analysis of programs to find linear invariants. We use a statistical tool, called simple component analysis, to analyze partial execution traces of a given program. We get a new coordinate system in the vector space of program variables, which is used to specialize numerical abstract domains. As an application, we instantiate our technique to interval analysis of simple imperative programs and show some experimental evaluations.
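The mechanism can be sketched with plain SVD: components of the centered trace matrix with (near-)zero variance are linear combinations of program variables that stay constant along the observed traces, i.e., candidate linear invariants for the static analysis to confirm. Below is a minimal sketch on a hand-made trace, not the paper's full simple-component pipeline:

```python
import numpy as np

# Rows of T are snapshots of the program variables (x, y, z) recorded along
# partial execution traces; here the traces of a loop maintaining the
# invariant x + 2*y - z == 5 (an illustrative, hand-made trace).
T = np.array([[x, y, x + 2 * y - 5] for x, y in
              [(0, 1), (1, 3), (2, 0), (4, 2), (7, 5), (3, 3)]], float)

mean = T.mean(axis=0)
# Directions of zero variance in the centered trace matrix are linear
# combinations of variables that stay constant: candidate invariants.
_, singular_values, Vt = np.linalg.svd(T - mean)
for sv, axis in zip(singular_values, Vt):
    if sv < 1e-9:
        const = axis @ mean
        terms = " + ".join(f"{c:.2f}*{v}" for c, v in zip(axis, "xyz"))
        print(f"candidate invariant: {terms} == {const:.2f}")
```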

19.
We consider the time-dependent demands for data movement that a parallel program makes on the architecture that executes it. The result is an architecture-independent metric that represents the temporal behavior of data-movement requirements. Programs are described as series of computations and data movements; while message passing is not ruled out, we focus on explicit parallel programs using a fixed number of processes in a distributed shared-memory environment. Operations are assumed to be explicitly allocated to processors when the metric is applied, which might correspond to intermediate code in a parallelizing compiler. The metric is called the interprocess read (IR) temporal metric. A key to developing an architecture-independent temporal metric is modeling program execution time in an architecture-independent way. This is possible because well-synchronized parallel programs make coordinated progress above a certain level of granularity. Our execution-time characterization takes barrier synchronization and critical sections into account. We illustrate the metric using instruction counts, first on simple code fragments and then on multiprocessor program traces (Splash benchmarks). Results of running the benchmarks on simulated network architectures show that the IR metric for the time scale of network response predicts performance better than whole-program measures.

20.
In parallel and distributed applications, object-oriented languages such as Java and Ruby, and large-scale semistructured data written in XML, are very likely to be employed. However, because of their inherent dynamic memory management, such applications must sometimes suspend the execution of all tasks running on the processors, which adversely affects their execution on parallel and distributed platforms. In this paper, we propose a new task scheduling method called CP/MM (Critical Path/Memory Management) which can efficiently schedule tasks for applications requiring memory management. The underlying concept is to account for the cost of memory management when the task scheduling system allocates ready (executable) coarse-grain tasks, or macro-tasks, to processors. We have developed three task scheduling modules, including CP/MM, for a task scheduling system implemented on a Java RMI (Remote Method Invocation) communication infrastructure. Our experimental results show that CP/MM successfully prevents high-priority macro-tasks from being affected by the garbage collection arising from memory management, so that it can efficiently schedule distributed programs whose critical paths are relatively long.
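The priority computation at the heart of such a scheduler can be sketched as a bottom-level (critical-path) calculation with a memory-management surcharge. This is one plausible reading of the CP/MM idea under assumed cost inputs, not the paper's exact algorithm:

```python
def cp_priority(tasks, succ, exec_cost, mm_cost):
    # Bottom level (length of the longest path to an exit task) of each
    # macro-task, with a memory-management term added, so that macro-tasks
    # on long critical paths are scheduled first and kept clear of
    # GC-induced stalls.
    prio = {}
    def bottom_level(t):
        if t not in prio:
            prio[t] = exec_cost[t] + mm_cost.get(t, 0.0) + \
                      max((bottom_level(s) for s in succ[t]), default=0.0)
        return prio[t]
    for t in tasks:
        bottom_level(t)
    return sorted(tasks, key=lambda t: -prio[t])

tasks = ["a", "b", "c", "d"]
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
exec_cost = {"a": 2, "b": 4, "c": 1, "d": 3}
mm_cost = {"c": 5}    # task c is expected to trigger a collection
print(cp_priority(tasks, succ, exec_cost, mm_cost))   # ['a', 'c', 'b', 'd']
```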
