Similar Literature
20 similar documents retrieved.
1.
Air Quality Forecasting (AQF) is a new discipline that attempts to reliably predict atmospheric pollution. An AQF application has complex workflows, and in order to produce timely and reliable forecast results, each execution requires access to diverse and distributed computational and storage resources. Deploying AQF on Grids is one option to satisfy such needs, but it requires the related Grid middleware to support automated workflow scheduling and execution on Grid resources. In this paper, we analyze the challenges in deploying an AQF application in a campus Grid environment and present our current efforts to develop a general solution for Grid-enabling scientific workflow applications in the GRACCE project. In GRACCE, an application’s workflow is described using GAMDL, a powerful dataflow language for describing application logic. The GRACCE metascheduling architecture provides the functionalities required for co-allocating Grid resources for workflow tasks, scheduling the workflows and monitoring their execution. By providing an integrated framework for modeling and metascheduling scientific workflow applications on Grid resources, we make it easy to build a customized environment with end-to-end support for application Grid deployment, from the management of an application and its dataset to the automatic execution and analysis of its results. The work has been performed as part of the University of Houston’s Sun Microsystems Center of Excellence in Geosciences [38].

2.
Complex real-time system design needs to address dependability requirements such as safety, reliability, and security. We introduce a modelling and simulation based approach that allows for the analysis and prediction of dependability constraints. Dependability can be improved by making use of fault tolerance techniques. The de facto example from the real-time systems literature, a pump control system in a mining environment, is used to demonstrate our model-based approach. In particular, the system is modelled using the Discrete EVent system Specification (DEVS) formalism and then extended to incorporate fault tolerance mechanisms; the modularity of the DEVS formalism facilitates this extension. The simulation demonstrates that the employed fault tolerance techniques are effective, that is, the system performs satisfactorily despite the presence of faults. This approach also makes it possible to make an informed choice between different fault tolerance techniques. Performance metrics are used to measure the reliability and safety of the system and to evaluate the dependability achieved by the design. In our model-based development process, modelling, simulation and eventual deployment of the system are seamlessly integrated.
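DEVS is a generic formalism rather than a specific library; purely as an illustration of the kind of atomic model such an approach extends with fault tolerance, the sketch below implements a toy sensor that periodically emits a reading and can drift into a "faulty" phase. The class name, timings, and fault-injection mechanism are illustrative assumptions, not the paper's model.

```python
import random

class WaterLevelSensor:
    """A toy DEVS-style atomic model: every `period` time units it emits a
    water-level reading; with some probability it enters a 'faulty' phase and
    emits None, which a fault tolerance mechanism could detect (assumption)."""

    def __init__(self, period=5.0, fault_prob=0.2):
        self.period = period
        self.fault_prob = fault_prob
        self.phase = "ok"

    def time_advance(self):
        # Time until the next internal event (the next reading).
        return self.period

    def output(self):
        # Output function, called just before the internal transition.
        if self.phase == "faulty":
            return None                       # a missing reading models a sensor fault
        return {"level": random.uniform(0.0, 10.0)}

    def internal_transition(self):
        # After emitting, the sensor may (toy fault injection) become faulty.
        if random.random() < self.fault_prob:
            self.phase = "faulty"

def simulate(model, end_time=30.0):
    """A minimal event loop: advance simulated time by the model's time
    advance and print its outputs."""
    t = 0.0
    while t + model.time_advance() <= end_time:
        t += model.time_advance()
        print(f"t={t:4.1f}  output={model.output()}")
        model.internal_transition()

if __name__ == "__main__":
    simulate(WaterLevelSensor())
```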

3.
Because the Grid environment is highly dynamic, dependable workflow scheduling is critical. Various scheduling algorithms have been proposed, but they seldom consider resource reliability. Current Grid systems mainly exploit fault tolerance mechanisms to guarantee dependable workflow execution, which, however, wastes system resources. This paper proposes a dependable Grid workflow scheduling system (DGWS). It introduces a Markov chain-based resource availability prediction model and, based on this model, a reliability-cost-driven workflow scheduling algorithm. The performance evaluation, including simulations on both parametric randomly generated DAGs and two real scientific workflow applications, demonstrates that compared with existing workflow scheduling algorithms, DGWS improves the success ratio of tasks and reduces the makespan of workflows, and thus improves the dependability of workflow execution in dynamic Grid environments.
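The abstract does not spell out the prediction model; as a rough illustration only, the sketch below shows how a two-state (up/down) Markov chain can be used to estimate how likely a resource is to remain available for a task's expected runtime, and how that estimate could feed a reliability cost. The transition probabilities and the cost formula are hypothetical, not taken from DGWS.

```python
import numpy as np

# Two-state availability Markov chain: state 0 = up, state 1 = down.
# The per-step transition probabilities are illustrative assumptions; in a real
# system they would be estimated from observed resource availability traces.
P = np.array([[0.95, 0.05],   # up   -> up, down
              [0.30, 0.70]])  # down -> up, down

def prob_up_after(steps, start_state=0):
    """Probability that the resource is up after `steps` time steps."""
    dist = np.zeros(2)
    dist[start_state] = 1.0
    return (dist @ np.linalg.matrix_power(P, steps))[0]

def prob_stays_up(steps):
    """Probability that a currently-up resource never goes down during `steps` steps."""
    return P[0, 0] ** steps

def reliability_cost(task_runtime_steps):
    """Hypothetical reliability cost: higher when the resource is less likely
    to remain available for the whole task runtime."""
    return 1.0 - prob_stays_up(task_runtime_steps)

if __name__ == "__main__":
    for steps in (1, 5, 20):
        print(f"steps={steps:2d}  P(up at end)={prob_up_after(steps):.3f}  "
              f"P(stays up)={prob_stays_up(steps):.3f}  "
              f"cost={reliability_cost(steps):.3f}")
```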

4.
In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to “black-box” (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
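The compilation strategies themselves are the subject of the paper; the sketch below only illustrates the underlying idea, in a Hadoop Streaming-like style, of exploiting data parallelism by applying a "black-box" function to independent XML fragments in a map phase and regrouping them in a reduce phase. The fragment format, the stand-in black-box function, and the use of plain in-memory Python instead of a running Hadoop cluster are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def black_box(sample_xml: str) -> str:
    """Stand-in for a scientific 'black-box' function (hypothetical step):
    here it just marks each <sample> element as processed."""
    elem = ET.fromstring(sample_xml)
    elem.set("processed", "true")
    return ET.tostring(elem, encoding="unicode")

def map_phase(fragments):
    """Map: apply the black-box step to each XML fragment independently and
    key the result by its collection id, as a Hadoop mapper would."""
    for frag in fragments:
        coll = ET.fromstring(frag).get("collection", "default")
        yield coll, black_box(frag)

def reduce_phase(mapped):
    """Reduce: regroup the processed fragments into one document per collection."""
    groups = defaultdict(list)
    for coll, frag in mapped:
        groups[coll].append(frag)
    return {coll: "<collection>" + "".join(frags) + "</collection>"
            for coll, frags in groups.items()}

if __name__ == "__main__":
    data = ['<sample collection="A" id="1"/>',
            '<sample collection="A" id="2"/>',
            '<sample collection="B" id="3"/>']
    for coll, doc in reduce_phase(map_phase(data)).items():
        print(coll, doc)
```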

5.
Mapping Abstract Complex Workflows onto Grid Environments
In this paper we address the problem of automatically generating job workflows for the Grid. These workflows describe the execution of a complex application built from individual application components. In our work we have developed two workflow generators: the first, the Concrete Workflow Generator (CWG), maps an abstract workflow defined in terms of application-level components onto the set of available Grid resources; the second, the Abstract and Concrete Workflow Generator (ACWG), takes a wider perspective and not only performs the abstract-to-concrete mapping but also enables the construction of the abstract workflow itself from the available components. This system operates in the application domain and chooses application components based on application metadata attributes. We describe our current ACWG, which is based on AI planning technologies, and outline how these technologies can play a crucial role in developing complex application workflows in Grid environments. Although our work is preliminary, CWG has already been used to map high energy physics applications onto the Grid. In one particular experiment, a set of production runs lasted 7 days and resulted in the generation of 167,500 events by 678 jobs. Additionally, ACWG was used to map gravitational physics workflows with hundreds of nodes onto the available resources, resulting in 975 tasks, 1365 data transfers and 975 output files.

6.

In cloud computing, resources are dynamically provisioned and delivered to users on demand in a transparent, automatic manner. Task execution failure is no longer accidental but a common characteristic of cloud computing environments. In recent times, a number of intelligent scheduling techniques have been used to address task scheduling in clouds without much attention to fault tolerance. In this article, we propose a dynamic clustering league championship algorithm (DCLCA) scheduling technique for fault-tolerance-aware cloud task execution; it takes the currently available resources into account and reduces the untimely failure of autonomous tasks. Experimental results show that the proposed technique produces a remarkable reduction in task failures, measured in terms of failure rate. They also show that DCLCA outperformed MTCT, MAXMIN, ant colony optimization and the genetic-algorithm-based NSGA-II, producing lower makespans with improvements of 57.8%, 53.6%, 24.3% and 13.4% in the first scenario and 60.0%, 38.9%, 31.5% and 31.2% in the second scenario, respectively. Considering the experimental results, DCLCA provides better fault-tolerance-aware scheduling that will help to improve the overall performance of the cloud environment.


7.
The EGEE Grid offers the necessary infrastructure and resources for reducing the running time of particle-tracking Monte Carlo applications like GATE. However, effort is required to achieve reliable and efficient execution and to provide execution frameworks to end users. This paper presents results obtained by porting the GATE software to the EGEE Grid, our ultimate goal being to provide reliable, user-friendly and fast execution of GATE to radiation therapy researchers. To address these requirements, we propose a new parallelization scheme based on dynamic partitioning and its implementation in two different frameworks using pilot jobs and workflows. Results show that pilot jobs bring a strong improvement with respect to regular gLite submission, that the proposed dynamic partitioning algorithm further reduces execution time by a factor of two, and that the genericity and user-friendliness offered by the workflow implementation do not introduce significant overhead.
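As a rough sketch of the dynamic partitioning idea: instead of splitting the simulation into fixed-size jobs up front, pilots repeatedly pull small chunks of the remaining events until the total is reached, so faster workers naturally simulate more. The chunk size, worker speeds, and the threading model below are illustrative assumptions, not the paper's implementation.

```python
import threading
import time
import random

TOTAL_EVENTS = 10_000
CHUNK = 500                      # events handed out per request (assumption)

remaining = TOTAL_EVENTS
lock = threading.Lock()
done = {}                        # pilot name -> events simulated

def pilot(name, speed):
    """A pilot job repeatedly asks the master for a chunk of events and
    'simulates' them; faster pilots end up processing more chunks."""
    global remaining
    simulated = 0
    while True:
        with lock:
            if remaining == 0:
                break
            chunk = min(CHUNK, remaining)
            remaining -= chunk
        time.sleep(chunk / speed)          # stand-in for Monte Carlo work
        simulated += chunk
    done[name] = simulated

if __name__ == "__main__":
    threads = [threading.Thread(target=pilot,
                                args=(f"pilot-{i}", random.uniform(2e4, 8e4)))
               for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(done, "total =", sum(done.values()))
```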

8.
The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQuA uses, when an application's dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures simultaneously. In particular, we provide an active replication communication scheme that maintains data consistency among replicas, detects crash failures, collates the messages generated by replicated objects, and delivers the result of each vote. We also present an adaptive majority voting algorithm that enables correct ongoing voting while both the number of replicas and the majority size change dynamically. Together, these two algorithms form the basis of the mechanism for tolerating and recovering from value faults and crash failures in AQuA.
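A minimal sketch of majority voting over replica replies, where the majority threshold is recomputed from the number of replicas currently considered alive; the data structures and the threshold rule here are illustrative assumptions rather than AQuA's actual protocol.

```python
from collections import Counter

def vote(replies):
    """Majority vote over the replies received from live replicas.

    `replies` maps replica id -> value (crashed replicas are simply absent).
    Returns (value, suspected_faulty_ids) if some value reaches a majority of
    the *current* replica count, else (None, []) meaning no decision yet."""
    n_alive = len(replies)
    majority = n_alive // 2 + 1           # recomputed as replicas come and go
    counts = Counter(replies.values())
    value, votes = counts.most_common(1)[0]
    if votes >= majority:
        suspects = [rid for rid, v in replies.items() if v != value]
        return value, suspects
    return None, []

if __name__ == "__main__":
    # Three replicas alive, one returns a value fault.
    print(vote({"r1": 42, "r2": 42, "r3": 41}))   # -> (42, ['r3'])
    # After r3 is excluded, the two remaining replicas still form a majority.
    print(vote({"r1": 42, "r2": 42}))             # -> (42, [])
```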

9.
DAGMap: efficient and dependable scheduling of DAG workflow job in Grid
DAGs have been extensively used in Grid workflow modeling. Since Grid resources tend to be heterogeneous and dynamic, efficient and dependable workflow job scheduling becomes essential. Achieving minimum job completion time and high resource utilization efficiency while providing fault tolerance poses great challenges. Based on list scheduling and group scheduling, in this paper we propose a novel scheduling heuristic called DAGMap. DAGMap consists of two phases, Static Mapping and Dependable Execution. Four salient features of DAGMap are: (1) task grouping is based on dependency relationships and task upward priority; (2) critical tasks are scheduled first; (3) Min-Min and Max-Min selective scheduling is used for independent tasks; and (4) a checkpoint server with cooperative checkpointing is designed for dependable execution. The experimental results show that DAGMap achieves better performance than previous algorithms in terms of speedup, efficiency, and dependability.
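For the independent-task phase, the Min-Min heuristic repeatedly picks the task whose earliest possible completion time is smallest and assigns it to the machine achieving that time (Max-Min instead picks the task whose minimum completion time is largest). The sketch below shows plain Min-Min on an execution-time matrix; the matrix values are illustrative, and DAGMap's grouping and critical-task handling are not shown.

```python
def min_min(etc):
    """Min-Min scheduling.

    `etc[t][m]` is the estimated execution time of task t on machine m.
    Returns a list of (task, machine, finish_time) assignments."""
    n_tasks, n_machines = len(etc), len(etc[0])
    ready = [0.0] * n_machines            # time at which each machine becomes free
    unscheduled = set(range(n_tasks))
    schedule = []
    while unscheduled:
        # Pick the (task, machine) pair with the smallest completion time.
        task, machine, finish = min(
            ((t, m, ready[m] + etc[t][m])
             for t in unscheduled for m in range(n_machines)),
            key=lambda x: x[2])
        ready[machine] = finish
        unscheduled.remove(task)
        schedule.append((task, machine, finish))
    return schedule

if __name__ == "__main__":
    etc = [[3, 5], [4, 2], [6, 6], [1, 9]]    # illustrative execution times
    for task, machine, finish in min_min(etc):
        print(f"task {task} -> machine {machine}, finishes at {finish}")
```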

10.
Scientists often need to execute experiments that demand high-performance computing environments and parallel techniques. This is the scenario found in many bioinformatics experiments modeled as scientific workflows, such as phylogenetic and phylogenomic analyses. To execute these experiments, scientists have adopted virtual machines (VMs) instantiated in clouds. Estimating the number of VMs to instantiate is a crucial task, since under- or overestimation negatively impacts execution performance and financial cost. Previously, the number of VMs necessary to execute bioinformatics workflows was estimated by a GRASP heuristic coupled to a cloud-based parallel Scientific Workflow Management System. Although this work was a step forward, the approach only provided static dimensioning: if the characteristics of the environment change (processing capacity, network speed), this static dimensioning may no longer be suitable. It is therefore desirable to adjust the dimensioning at runtime. To achieve this, we developed a novel framework for monitoring and dynamically dimensioning resources during the execution of parallel scientific workflows in clouds, called the Dynamic Dimensioning of Cloud Computing Framework (DDC-F). We have evaluated DDC-F in real executions of bioinformatics workflows. Experiments showed that DDC-F is able to efficiently calculate the number of VMs necessary to execute comparative genomics (CG) bioinformatics workflows, while also reducing financial costs compared with related work.
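A back-of-the-envelope sketch of the dynamic dimensioning idea: given the remaining work and the throughput currently observed per VM, recompute how many VMs are needed to finish within a target time and scale up or down accordingly. The formula and the numbers are illustrative assumptions, not DDC-F's algorithm.

```python
import math

def required_vms(remaining_tasks, observed_tasks_per_vm_hour, deadline_hours):
    """How many VMs are needed to finish `remaining_tasks` within the deadline,
    given the throughput currently observed per VM."""
    if remaining_tasks == 0:
        return 0
    needed_throughput = remaining_tasks / deadline_hours
    return math.ceil(needed_throughput / observed_tasks_per_vm_hour)

def redimension(current_vms, remaining_tasks, observed_tasks_per_vm_hour,
                deadline_hours):
    """Return how many VMs to add (positive) or release (negative)."""
    return required_vms(remaining_tasks, observed_tasks_per_vm_hour,
                        deadline_hours) - current_vms

if __name__ == "__main__":
    # 1,200 alignments left, each VM currently processes ~50 per hour, and the
    # workflow should finish within 4 hours (all values are illustrative).
    print(required_vms(1200, 50, 4))        # -> 6 VMs needed
    print(redimension(10, 1200, 50, 4))     # -> -4: four VMs can be released
```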

11.
While existing work concentrates on developing QoS models of business workflows and Web services, few tools have been developed to support the monitoring and performance analysis of scientific workflows in Grids. This paper describes novel Grid services for dynamic instrumentation of Grid-based applications and for performance monitoring and analysis of Grid scientific workflows. We describe a Grid dynamic instrumentation service that provides a widely accessible interface for other services and users to conduct dynamic instrumentation of Grid applications at runtime. We introduce a Grid performance analysis service for Grid scientific workflows. The analysis service utilizes various types of data, including workflow graphs, resource monitoring data, execution status of activities, and performance measurements obtained from the dynamic instrumentation of invoked applications, and provides a rich set of functionalities and features to support online monitoring and performance analysis of scientific workflows. Workflows and their relevant information, including performance metrics, are stored and used for comparing the performance of constructs of different workflows and for supporting multi-workflow analysis. The work described in this paper is supported in part by the Austrian Science Fund as part of the Aurora Project under contract SFBF1104 and by the European Union through the IST-2002-511385 project K-WfGrid.

12.
To fulfill their safety requirements, modern embedded systems are increasingly expected to deliver a guaranteed minimum level of functionality at all times. In practice, such fail-operational systems are often based on fault tolerance mechanisms that are inadequate for use in cost-driven environments such as the automotive domain. In this work, we consider safety-critical embedded systems with a certain degree of spare resources at the system level and propose a cost-efficient fault tolerance approach that protects a pair of execution units from severe hardware faults. The concept requires no replication of an execution unit. Instead, it employs a state-preserving proxy unit that communicates with low-level devices such as sensors or actuators and handles faults of one execution unit by dynamically migrating the safety-critical portion of its functionality to the redundant counterpart. Based on the application of this concept to an example scenario from the automotive domain, we analyze the resource overhead of the proxy unit and experimentally evaluate both the achieved fault handling time and the generated computational overhead.

13.
Over the past few years, research and development in bioinformatics (e.g. genomic sequence alignment) has grown rapidly, fueling a continuing demand for vast computing power to support better performance. This trend usually requires solutions involving parallel computing techniques, because cluster computing technology reduces execution times and increases genomic sequence alignment efficiency. One example, mpiBLAST, is a parallel version of NCBI BLAST that combines NCBI BLAST with the message passing interface (MPI) standard. However, as most laboratories cannot build powerful cluster computing environments, Grid computing frameworks have been designed to meet this need. Grid computing environments coordinate the resources of distributed virtual organizations and satisfy the various computational demands of bioinformatics applications. In this paper, we report on the design and implementation of a BioGrid framework, called G-BLAST, that performs genomic sequence alignments using Grid computing environments and accessible mpiBLAST applications. G-BLAST is also suitable for cluster computing environments with a server node and several client nodes. G-BLAST is able to select the most appropriate work nodes, dynamically fragment genomic databases, and self-adjust according to performance data. To enhance G-BLAST's capability and usability, we also employ a WSRF Grid Service Portal and a Grid Service GUI desktop application for general users to submit jobs and for host administrators to maintain work nodes. Copyright © 2008 John Wiley & Sons, Ltd.
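As a rough sketch of the dynamic database fragmentation idea: split a sequence database into per-node fragments whose sizes are proportional to each work node's measured performance, so faster nodes receive larger fragments. The node scores and the sequence representation are illustrative assumptions, not G-BLAST's actual logic.

```python
def fragment_database(sequences, node_scores):
    """Split `sequences` into per-node fragments proportional to each node's
    performance score (e.g. a recent throughput benchmark)."""
    total = sum(node_scores.values())
    fragments, start = {}, 0
    nodes = list(node_scores)
    for i, node in enumerate(nodes):
        if i == len(nodes) - 1:
            end = len(sequences)                   # the last node takes the rest
        else:
            end = start + round(len(sequences) * node_scores[node] / total)
        fragments[node] = sequences[start:end]
        start = end
    return fragments

if __name__ == "__main__":
    seqs = [f"seq{i}" for i in range(10)]
    # Scores are illustrative; node-a is roughly twice as fast as node-c.
    scores = {"node-a": 4.0, "node-b": 3.0, "node-c": 2.0}
    for node, frag in fragment_database(seqs, scores).items():
        print(node, len(frag), frag)
```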

14.
The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, calls for new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect-free execution with reduced synchronization overhead. However, the terascale transistor integration of such future chips makes them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault tolerance mechanisms have to be an essential part of such future systems. In this paper, we present a fault-tolerant architecture for a coarse-grained dataflow system that leverages the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults at runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results show that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery incurs only moderate overhead, even in the case of high fault rates.
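A minimal sketch of the redundant-execution idea such a system can leverage: because dataflow threads are side-effect free, the same thread can be run twice on otherwise idle workers and the results compared; a mismatch (suggesting a transient fault) is handled by simply re-executing the thread. The fault injection, thread pool, and retry policy below are illustrative assumptions, not the paper's architecture.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def dataflow_thread(inputs):
    """A side-effect-free 'dataflow thread'; a transient fault is injected with
    small probability to demonstrate detection and recovery (assumption)."""
    result = sum(inputs)
    if random.random() < 0.1:
        result += 1                        # simulated transient bit flip
    return result

def run_redundant(inputs, pool, max_retries=3):
    """Execute the thread twice on spare workers; on a result mismatch,
    re-execute (thread-level recovery) up to `max_retries` times."""
    for _ in range(max_retries):
        a = pool.submit(dataflow_thread, inputs)
        b = pool.submit(dataflow_thread, inputs)
        if a.result() == b.result():
            return a.result()
    raise RuntimeError("persistent mismatch: possible permanent fault")

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_redundant(list(range(100)), pool))
```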

15.
Mobile agent-based distributed job workflow execution requires execution coordination techniques to ensure that an agent executing a subjob can locate its predecessors' execution results. This paper describes the classification, implementation, and evaluation of execution coordination techniques in a mobile agent-based distributed job workflow execution system. First, a classification of the existing execution coordination techniques for mobile agent systems is developed. Second, to put the discussion into perspective, our framework for mobile agent-based distributed job workflow execution over the Grid, the Mobile Code Collaboration Framework (MCCF), is described, and we discuss how the existing coordination techniques can be applied in the MCCF. Finally, a performance study evaluates three coordination techniques using real and simulated job workflows; the results are presented and discussed in the paper.

16.
Many Grid workflow middleware services require knowledge about the performance behavior of Grid applications and services in order to effectively select, compose, and execute workflows in dynamic and complex Grid systems. To provide performance information for building such knowledge, Grid workflow performance tools have to select, measure, and analyze various performance metrics of workflows. However, there is no comprehensive study of the performance metrics that can be used to evaluate the performance of a workflow executed in the Grid. Moreover, given the complexity of both Grid systems and workflows, the semantics of essential performance-related concepts and relationships, and of the associated performance data in Grid workflows, should be well described. In this paper, we analyze the performance metrics that performance monitoring and analysis tools should provide when evaluating the performance of Grid workflows. Performance metrics are associated with multiple levels of abstraction. We introduce an ontology for describing performance data of Grid workflows and illustrate how the ontology can be utilized for monitoring and analyzing the performance of Grid workflows.

17.
Cache structures in a multicore system are highly vulnerable to soft errors, but enabling fault tolerance on all cache structures in a system is inefficient in terms of performance and power consumption. In this study, we propose an enhanced protection mechanism for code segments that are critical in terms of reliability, by utilizing asymmetrically reliable cores under performance and power constraints. Our proposed system contains at least one high-reliability core, which has an ECC-protected L1 cache, and several low-reliability cores, which have no protection mechanisms. Reliability-critical code regions are assumed to be high-priority functions, which our framework extracts statically by examining execution-time percentages and the program's call graph. Software threads that invoke one of the high-priority functions are bound to the high-reliability cores dynamically during execution, while threads that execute the remaining functions are bound to the low-reliability cores. In the experimental analysis, our proposed framework is compared with traditional fully protected and unprotected configurations with respect to performance, power and reliability metrics for various benchmark applications. By protecting only the reliability-critical regions of the applications, our framework offers notable power and cost savings while keeping performance and reliability close to the fully protected configuration for the set of functions reported in the experimental results.
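A rough sketch of the dynamic binding step: when a thread is about to invoke a function marked high priority for reliability, it pins itself to the high-reliability core's CPU set, otherwise to the low-reliability cores. The core numbering, the priority list, and the function names are illustrative assumptions, and `os.sched_setaffinity` is a Linux-specific API, so this is not the paper's mechanism, only an analogy to it.

```python
import os
import threading

HIGH_RELIABILITY_CPUS = {0}          # core with an ECC-protected L1 (assumption)
LOW_RELIABILITY_CPUS = {1, 2, 3}     # unprotected cores (assumption)

HIGH_PRIORITY_FUNCS = {"update_control_law"}   # extracted statically (assumption)

def run_on_appropriate_core(func, *args):
    """Pin the calling thread to the high-reliability core if `func` is
    reliability-critical, otherwise to the low-reliability cores, then call it."""
    cpus = (HIGH_RELIABILITY_CPUS if func.__name__ in HIGH_PRIORITY_FUNCS
            else LOW_RELIABILITY_CPUS)
    if hasattr(os, "sched_setaffinity"):          # Linux-specific API
        available = os.sched_getaffinity(0)
        os.sched_setaffinity(0, (cpus & available) or available)
    return func(*args)

def update_control_law(x):           # hypothetical safety-critical function
    return 2 * x + 1

def log_statistics(x):               # hypothetical non-critical function
    return f"value={x}"

if __name__ == "__main__":
    for f in (update_control_law, log_statistics):
        t = threading.Thread(target=run_on_appropriate_core, args=(f, 7))
        t.start()
        t.join()
```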

18.
Scientific workflow systems often operate in unreliable environments and have accordingly incorporated different fault tolerance techniques. One of them is checkpointing combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed at various levels: task- (or activity-) level and workflow-level. At the workflow level, the usually adopted approach is to establish a checkpointing frequency in the system, which determines when a global workflow checkpoint, a snapshot of the whole workflow enactment state during normal (failure-free) execution, has to be taken. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows, in which every workflow node in the hierarchy performs its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we use the Reference net formalism to express the scheme. Reference nets are a particular type of Petri net that can more effectively provide the abstractions to support and express hierarchical workflows and their dynamic adaptability.
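The paper expresses its scheme in Reference nets; purely as an illustration of the uncoordinated, per-node checkpointing idea, the sketch below has every workflow node write its own checkpoint file right after it finishes, and the recovery step simply reuses existing checkpoints so that only the nodes without one are re-executed. The file layout and node functions are assumptions made for the example.

```python
import json
import os

CHECKPOINT_DIR = "checkpoints"       # illustrative layout

def checkpoint(node_name, result):
    """Each node checkpoints itself autonomously right after its enactment."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(os.path.join(CHECKPOINT_DIR, f"{node_name}.json"), "w") as f:
        json.dump(result, f)

def restore(node_name):
    """Return the node's checkpointed result, or None if it must be re-run."""
    path = os.path.join(CHECKPOINT_DIR, f"{node_name}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None

def run_workflow(nodes):
    """Run nodes in order; after a failure and restart, completed nodes are
    recovered from their own checkpoints instead of being re-executed."""
    results = {}
    for name, func in nodes:
        cached = restore(name)
        if cached is not None:
            results[name] = cached
            continue
        results[name] = func(results)
        checkpoint(name, results[name])
    return results

if __name__ == "__main__":
    nodes = [("prepare", lambda r: {"samples": 4}),
             ("analyze", lambda r: {"mean": r["prepare"]["samples"] / 2})]
    print(run_workflow(nodes))
```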

19.
Workflow Technology Based on Grid Services
应宏 《计算机工程与设计》2005,26(10):2671-2673
Grid workflow applies workflow technology to the Grid environment. Within the OGSA framework, this paper studies the concept and characteristics of Grid workflows, the OGSA-based workflow hierarchy, and the basic process of Grid workflow execution, and then identifies several key problems to be addressed in Grid workflow research, including Grid workflow description languages, planning and scheduling, execution and management, monitoring and error handling, dynamic and optimized processing, and usage and development environments.

20.
The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper presents VolpexMPI, which is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing networks and node execution speeds. Process replication is employed to provide robustness. The central challenge in the design of VolpexMPI is to efficiently and automatically manage a dynamically varying number of process replicas in different states of execution progress. The key fault tolerance technique employed is fully distributed sender-based logging. The paper presents the design and an implementation of VolpexMPI. Preliminary results validate that the overhead of providing robustness is modest for applications with a favorable ratio of communication to computation and a low degree of communication.
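A minimal sketch of sender-based logging: every sender keeps a local log of the messages it sent, so when a replica of the receiver lags behind or replaces a failed process, the missing messages can be replayed from the senders rather than from any central server. The class names and the replay protocol below are illustrative assumptions, not VolpexMPI's implementation.

```python
class LoggingSender:
    """A process that logs every message it sends, keyed by destination name,
    so it can later replay them on request."""

    def __init__(self, name):
        self.name = name
        self.log = {}                         # destination name -> list of messages

    def send(self, dest, payload):
        self.log.setdefault(dest.name, []).append(payload)
        dest.deliver(payload)

    def replay(self, dest, from_index):
        """Re-deliver logged messages, starting at `from_index`, to a
        (re)started or lagging replica of the destination."""
        for payload in self.log.get(dest.name, [])[from_index:]:
            dest.deliver(payload)

class Receiver:
    def __init__(self, name):
        self.name = name
        self.received = []

    def deliver(self, payload):
        self.received.append(payload)

if __name__ == "__main__":
    sender = LoggingSender("rank0")
    fast = Receiver("rank1")
    for i in range(3):
        sender.send(fast, {"step": i})
    # A replacement replica of rank1 starts from scratch and catches up by
    # asking the sender to replay its log.
    replacement = Receiver("rank1")
    sender.replay(replacement, from_index=0)
    print(fast.received == replacement.received)   # True
```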
