首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A major requirement imposed on the operation of a real-time computing (RTC) system is that the deadlines for the operation of application programs must be met. The violation of an operational deadline leads to a failure of an RTC system. In this context, the problem arises of ensuring the required accuracy of estimating the execution time of application programs. An approach is developed for the design of iterative scheduling algorithms in which the execution times of application programs are estimated using simulation models with a different degree of detail, which ensures the required accuracy of estimating the execution time of programs. The approach can be used to design iterative algorithms of the following classes: genetic, evolutionary, simulated annealing, random-search, and locally optimal algorithms.  相似文献   

2.
As the mean-time-between-failures (MTBF) continues to decline with the increasing number of components on large-scale high performance computing (HPC) systems, program failures might occur during the execution period with high probability. Ensuring successful execution of the HPC programs has become an issue that the unprivileged users should be concerned. From the user perspective, if the program failure cannot be detected and handled in time, it would waste resources and delay the progress of program execution. Unfortunately, the unprivileged users are unable to perform program state checking due to execution control by the job management system as well as the limited privilege. Currently, automated tools for supporting user-level failure detection and autorecovery of parallel programs in HPC systems are missing. This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach.We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experiment results show that ARL can detect the execution failures effectively on Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable for large-scale HPC systems.  相似文献   

3.
通过作业日志分析和考核实验方式,对超级计算机并行作业运行稳定性进行了分析。日志分析结果表明,并行作业运行的稳定性会随作业执行时间的增长、作业使用CPU数的增多而下降;当并行作业的计算量达到105CPU小时量级,超过20%的作业会因系统故障而中止。考核实验结果表明,使用数千CPU的并行作业很容易受到多种因素的干扰而中止,很难持续运行超过24小时。最后给出了有关超级计算机稳定性改进、系统管理使用和并行程序研制的几点建议。  相似文献   

4.
Dynamical concurrent execution makes it possible to adapt programs for their execution on computing environments with parallel architecture. In the paper, a formal model of dynamical concurrent execution of programs written in functional style is presented. The model is proven to possess a feature that guarantees correctness of concurrent execution.  相似文献   

5.
Presents a modeling approach based on stochastic Petri nets to estimate the reliability and availability of programs in a distributed computing system environment. In this environment, successful execution of programs is conditioned on the successful access of related files distributed throughout the system. The use of stochastic Petri nets is demonstrated by extending a basic reliability model to account for repair actions when faults occur. To this end, two possible models are discussed: the global repair model, which assumes a centralized repair team that restores the system to its original status when a failure state is reached, and the local repair model, which assumes that repairs are localized to the node where they occur. The former model is useful in evaluating the availability of programs (or the availability of the hardware support) subject to hardware faults that are repaired globally; therefore, the programs of interest can be interrupted. On the other hand, the latter model can be used to evaluate program reliability in the presence of hardware faults subject to repair, without interrupting the normal operation of the system  相似文献   

6.
The dataflow program graph execution model, or dataflow for short, is an alternative to the stored-program (von Neumann) execution model. Because it relies on a graph representation of programs, the strengths of the dataflow model are very much the complements of those of the stored-program one. In the last thirty or so years since it was proposed, the dataflow model of computation has been used and developed in very many areas of computing research: from programming languages to processor design, and from signal processing to reconfigurable computing. This paper is a review of the current state-of-the-art in the applications of the dataflow model of computation. It focuses on three areas: multithreaded computing, signal processing and reconfigurable computing.  相似文献   

7.
The mobile agent‐based computational steering (MACS) for distributed applications is presented in this article. In the MACS, a mobile agent platform, Mobile‐C, is embedded in a program through the Mobile‐C library to support C/C++ mobile agent code. Runtime replaceable algorithms of a program are represented as agent services in C/C++ source code and can be replaced with new ones through mobile agents. In the MACS, a mobile agent created and deployed by a user from the steering host migrates to computing hosts successively to replace algorithms of running programs that constitute a distributed application without the need of stopping the execution and recompiling the programs. The methodology of dynamic algorithm alteration in the MACS is described in detail with an example of matrix operation. The Mobile‐C library enables the integration of Mobile‐C into any C/C++ programs to carry out computational steering through mobile agents. The source code level execution of mobile agent code facilitates handling issues such as portability and secure execution of mobile agent code. In the MACS, the network load between the steering and computing hosts can be reduced, and the successive operations of a mobile agent on multiple computing hosts are not affected whether the steering host stays online or not. The employment of the middle‐level language C/C++ enables the MACS to accommodate the diversity of scientific and engineering fields to allow for runtime interaction and steering of distributed applications to match the dynamic requirements imposed by the user or the execution environment. An experiment is used to validate the feasibility of the MACS in real‐world mobile robot applications. The experiment replaces a mobile robot's behavioral algorithm with a mobile agent at runtime. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

8.
A set of programs running under a multiprogramming batch operating system on the CDC 6600 which provide remote users with a time sharing service is described. The basis for the system is the ability of a user program to create job control statements during execution, thereby tricking the operating system into treating it as an ordinary batch job. The text editor and the interactive debugging facilities are described. The performance of the system, known as the People's Time Sharing System (PTSS), and user reaction to it are also described.  相似文献   

9.
The paper presents a platform for distributed computing, developed using the latest software technologies and computing paradigms to enable big data mining. The platform, called ClowdFlows, is implemented as a cloud-based web application with a graphical user interface which supports the construction and execution of data mining workflows, including web services used as workflow components. As a web application, the ClowdFlows platform poses no software requirements and can be used from any modern browser, including mobile devices. The constructed workflows can be declared either as private or public, which enables sharing the developed solutions, data and results on the web and in scientific publications. The server-side software of ClowdFlows can be multiplied and distributed to any number of computing nodes. From a developer’s perspective the platform is easy to extend and supports distributed development with packages. The paper focuses on big data processing in the batch and real-time processing mode. Big data analytics is provided through several algorithms, including novel ensemble techniques, implemented using the map-reduce paradigm and a special stream mining module for continuous parallel workflow execution. The batch mode and real-time processing mode are demonstrated with practical use cases. Performance analysis shows the benefit of using all available data for learning in distributed mode compared to using only subsets of data in non-distributed mode. The ability of ClowdFlows to handle big data sets and its nearly perfect linear speedup is demonstrated.  相似文献   

10.
In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager automatically selects the set of optimal resources among candidate resources that achieves optimal performance using a genetic algorithm. Typically, the probability of a failure is higher in the grid computing than in a traditional parallel computing and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. And grid services are often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distribute systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.  相似文献   

11.
A new approach to estimating the fault-tolerance of the parallel control computing systems relies on the mathematical model-based determination of the probability of successful completion in a given schedule time of an arbitrary set of interdependent jobs (tasks) with random times of job execution and asynchronous job redundancy. The estimates were determined both for the standard execution of a set of tasks and for the case of single malfunction (fault or failure) of any computing system processor detected at execution of any job from the set. The basic distinction of this approach lies in that here the numerical values of the reliability parameters (probabilities or intensities of faults or failures) of the computing resources are neither given nor used.  相似文献   

12.
频繁模式挖掘是最重要的数据挖掘任务之一,传统的频繁模式挖掘算法是以"批处理"方式执行的,即一次性对所有数据进行挖掘,无法满足不断增长的大数据挖掘的需要。MapReduce是一种流行的并行计算模式,在并行数据挖掘领域已得到了广泛的应用。将传统频繁模式增量挖掘算法CanTree向MapReduce计算模型进行了迁移,实现了并行的频繁模式增量挖掘。实验结果表明,提出的算法实现了较好的负载均衡,执行效率有明显提升。  相似文献   

13.
线程级推测(TLS)技术可挖掘程序并行执行潜能,提高多核资源利用率,但目前TACLeBench的内核基准仍未在TLS并行化中得到有效分析。针对该问题设计了循环级推测执行的剖析方案和剖析工具。选取7个代表性的TACLeBench内核基准程序,首先对程序进行初始化分析,选取程序热点片段插入循环标识;其次对这些片段进行交叉编译,记录程序推测线程与内存地址相关数据,剖析其循环级最大潜在并行性;最后综合探讨程序运行时的特征(线程粒度、可并行化覆盖率、依赖特征)以及源码对加速比的影响。实验结果表明:1)该类程序适合采用TLS加速,与串行执行结果相比,循环结构的推测执行下的大部分程序的加速比在2以上,其中最高加速比达到20.79;2)利用TLS加速TACLeBench内核程序时,多数应用可有效利用4核到16核的计算资源。  相似文献   

14.
目前,随着大规模并行计算的高速发展,并行程序性能分析与建模的地位日益重要,而并行程序性能数据的采集是进行性能分析的基础。硬件计数器的使用使人们能够更加便利地在程序执行过程中采集性能数据。文章讨论了OpenMP并行程序的性能数据采集技术,并介绍一种利用PAPI进行数据采集的实现方法。  相似文献   

15.
Science is increasingly becoming more and more data-driven. The ability of a geographically distributed community of scientists to access and analyze large amounts of data has emerged as a significant requirement for furthering science. In data intensive computing environment with uncountable numeric nodes, resource is inevitably unreliable, which has a great effect on task execution and scheduling. Novel algorithms are needed to schedule the jobs on the trusty nodes to execute, assure the high speed of communication, reduce the jobs execution time, lower the ratio of failure execution, and improve the security of execution environment of important data. In this paper, a kind of trust mechanism-based task scheduling model was presented. Referring to the trust relationship models of social persons, trust relationship is built among computing nodes, and the trustworthiness of nodes is evaluated by utilizing the Bayesian cognitive method. Integrating the trustworthiness of nodes into a Dynamic Level Scheduling (DLS) algorithm, the Trust-Dynamic Level Scheduling (Trust-DLS) algorithm is proposed. Moreover, a benchmark is structured to span a range of data intensive computing characteristics for evaluation the proposed method. Theoretical analysis and simulations prove that the Trust-DLS algorithm can efficiently meet the requirement of data intensive workloads in trust, sacrificing fewer time costs, and assuring the execution of tasks in a security way in large-scale data intensive computing environment.  相似文献   

16.
ROK SOSI   DAVID ABRAMSON 《Software》1997,27(2):185-206
A significant amount of software development is evolutionary, involving the modification of already existing programs. To a large extent, the modified programs produce the same results as the original program. This similarity between the original program and the development program is utilized by relative debugging. Relative debugging is a new concept that enables the user to compare the execution of two programs by specifying the expected correspondences between their states. A relative debugger concurrently executes the programs, verifies the correspondences, and reports any differences found. We describe our novel debugger, called Guard, and its relative debugging capabilities. Guard is implemented by using our library of debugging routines, called Dynascope, which provides debugging primitives in heterogeneous networked environments. To demonstrate the capacity of Guard for debugging in heterogeneous environments, we describe an experiment in which the execution of two programs is compared across Internet. The programs are written in different programming languages and executing on different computing platforms. © 1997 by John Wiley & Sons, Ltd.  相似文献   

17.
We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and a particular attention must be paid to fault management. The fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.  相似文献   

18.
Execution profiles are important in analyzing the performance of computer programs on a given computer system. However, accurate and complete profiles are difficult to arrive at for programs that follow the client-server model of computing, as in the popular X Window System. In X Window applications, considerable computation is invoked at the display server and this computation is an important part of the overall execution profile. The profiler presented in this paper generates meaningful profiles for X Window applications by estimating the time spent in servicing the messages in the display server. The central idea is to analyze a protocol-level trace of the interaction between the application and the display server and thereby construct an execution profile from the trace and a set of metrics about the target display server. Experience using the profiler for examining bottlenecks is presented.  相似文献   

19.
Memory‐related program failures in multithreaded programs can be caused by a variety of bugs. Concurrency bugs can occur due to unexpected or incorrect thread interleavings during execution. Other kinds of memory bugs, such as buffer overflows and uninitialized reads, may also occur in multithreaded as well as single‐threaded programs. Most prior techniques for isolating these bugs are specialized, addressing only one type of concurrency bug or certain types of other memory bugs. The memory corruption caused by these bugs can also undergo significant propagation during program execution. When a program failure finally occurs due to memory corruption, the true root cause of the failure may be effectively concealed as significant portions of memory may have become corrupted. We propose a general framework that can isolate the root cause of any failure in a multithreaded program that involves memory corruption and reveals at least a subset of this memory corruption. This includes three important types of concurrency bugs—data races, atomicity violations, and order violations—as well as other kinds of memory bugs. To account for propagation of memory corruption, our approach uses a dynamic technique called ‘execution suppression’ that iteratively reveals memory corruption in a failing execution to isolate the true root cause of the failure. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

20.
本文简要介绍了智能操作系统KZ3,其主要功能包括大规模并行处理系统中资源管理、友善人机接口、支持知识处理和并行处理能力等.通过一实例,讨论了KZ3的实现.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号