
1.

Large improvements in application throughput of long‐running multi‐component applications using batch grids

Sivagama Sundari M, Sathish S. Vadhiyar, Ravi S. Nanjundiah 《Concurrency and Computation》2012,24(15):1775-1791

Computational grids with multiple batch systems (batch grids) can be powerful infrastructures for executing long‐running multi‐component parallel applications. In this paper, we evaluate the potential improvements in throughput of long‐running multi‐component applications when the different components of the applications are executed on multiple batch systems of batch grids. We compare the multiple batch executions with executions of the components on a single batch system without increasing the number of processors used for executions. We perform our analysis with a prominent long‐running multi‐component application for climate modeling, the Community Climate System Model (CCSM). We have built a robust simulator that models the characteristics of both the multi‐component application and the batch systems. By conducting a large number of simulations with different workload characteristics and queuing policies of the systems, processor allocations to components of the application, distributions of the components to the batch systems and inter‐cluster bandwidths, we show that multiple batch executions lead to a 55% average increase in throughput over single batch executions for long‐running CCSM. We also conducted real experiments with a practical middleware infrastructure and showed that multi‐site executions lead to effective utilization of batch systems for executions of CCSM and give higher simulation throughput than single‐site executions. Copyright © 2011 John Wiley & Sons, Ltd.

2.

Strategies for Rescheduling Tightly-Coupled Parallel Applications in Multi-Cluster Grids (total citations: 1; self-citations: 0; citations by others: 1)

H. A. Sanjay, Sathish S. Vadhiyar 《Journal of Grid Computing》2011,9(3):379-403

As computational Grids are increasingly used for executing long-running multi-phase parallel applications, it is important to develop efficient rescheduling frameworks that adapt application execution in response to resource and application dynamics. In this paper, three strategies or algorithms have been developed for deciding when and where to reschedule parallel applications that execute on multi-cluster Grids. The algorithms derive rescheduling plans that consist of potential points in application execution for rescheduling and schedules of resources for application execution between two consecutive rescheduling points. Using a large number of simulations, it is shown that the rescheduling plans developed by the algorithms can lead to a large decrease in application execution times when compared to executions without rescheduling on dynamic Grid resources. The rescheduling plans generated by the algorithms are also shown to be competitive when compared to the near-optimal plans generated by brute-force methods. Of the algorithms, the genetic algorithm yielded the most efficient rescheduling plans, with 9–12% smaller average execution times than the other algorithms.

3.

Grids with multiple batch systems for performance enhancement of multi-component and parameter sweep parallel applications

M. Sathish S. Ravi S. 《Future Generation Computer Systems》2010,26(2):217-227

In this work, we evaluate the benefits of using Grids with multiple batch systems to improve the performance of multi-component and parameter sweep parallel applications by reducing queue waiting times. Using different job traces of different loads, job distributions and queue waiting times corresponding to three different queuing policies (FCFS, conservative and EASY backfilling), we conducted a large number of experiments using simulators of two important classes of applications. The first simulator models the Community Climate System Model (CCSM), a prominent multi-component application, and the second simulator models parameter sweep applications. We compare the performance of the applications when executed on multiple batch systems and on a single batch system for different system and application configurations. We show that there are a large number of configurations for which application execution using multiple batch systems can give improved performance over execution on a single system.

4.

New Grid Scheduling and Rescheduling Methods in the GrADS Project (total citations: 1; self-citations: 0; citations by others: 1)

F. Berman, H. Casanova, A. Chien, K. Cooper, H. Dail, A. Dasgupta, W. Deng, J. Dongarra, L. Johnsson, K. Kennedy, C. Koelbel, B. Liu, X. Liu, A. Mandal, G. Marin, M. Mazina, J. Mellor-Crummey, C. Mendes, A. Olugbile, M. Patel, D. Reed, Z. Shi, O. Sievert, H. Xia, A. YarKhan 《International journal of parallel programming》2005,33(2-3):209-229

The goal of the Grid Application Development Software (GrADS) Project is to provide programming tools and an execution environment to ease program development for the Grid. This paper presents recent extensions to the GrADS software framework: a new approach to scheduling workflow computations, applied to a 3-D image reconstruction application; a simple stop/migrate/restart approach to rescheduling Grid applications, applied to a QR factorization benchmark; and a process-swapping approach to rescheduling, applied to an N-body simulation. Experiments validating these methods were carried out on both the GrADS MacroGrid (a small but functional Grid) and the MicroGrid (a controlled emulation of the Grid).

5.

P-GRADE: A Grid Programming Environment

P. Kacsuk, G. Dózsa, J. Kovács, R. Lovas, N. Podhorszki, Z. Balaton, G. Gombás 《Journal of Grid Computing》2003,1(2):171-197

P-GRADE provides a high-level graphical environment to develop parallel applications transparently both for parallel systems and the Grid. P-GRADE supports the interactive execution of parallel programs as well as the creation of a Condor, Condor-G or Globus job to execute parallel programs in the Grid. In P-GRADE, the user can generate either PVM or MPI code according to the underlying Grid where the parallel application should be executed. PVM applications generated by P-GRADE can migrate between different Grid sites and, as a result, P-GRADE guarantees reliable, fault-tolerant parallel program execution in the Grid. The GRM/PROVE performance monitoring and visualisation toolset has been extended towards the Grid and connected to a general Grid monitor (Mercury) developed in the EU GridLab project. Using the Mercury/GRM/PROVE Grid application monitoring infrastructure, any parallel application launched by P-GRADE can be remotely monitored and analysed at run time even if the application migrates among Grid sites. P-GRADE supports workflow definition and co-ordinated multi-job execution for the Grid. Such workflow management can provide parallel execution at both inter-job and intra-job level. An automatic checkpoint mechanism for parallel programs supports the migration of parallel jobs inside the workflow, providing a fault-tolerant workflow execution mechanism. The paper describes all of these features of P-GRADE and their implementation concepts.

6.

EMPEROR: An OGSA Grid Meta-Scheduler Based on Dynamic Resource Predictions

Lazar Adzigogov, John Soldatos, Lazaros Polymenakos 《Journal of Grid Computing》2005,3(1-2):19-37

Scheduling constitutes an integral feature of Grid computing infrastructures, being also a key to realizing several of the Grid promises. In particular, scheduling can maximize the resources available to end users and accelerate the execution of jobs, while also supporting scalable and autonomic management of the resources comprising a Grid. Grid scheduling functionality hinges on middleware components called meta-schedulers, which undertake to automatically distribute jobs across the dispersed heterogeneous resources of a Grid. In this paper we present the design and implementation of a Grid meta-scheduler, which we call EMPEROR. EMPEROR provides a framework for implementing scheduling algorithms based on performance criteria. In implementing a particular instantiation of this framework, we have devised models for predicting host load and memory resources, and accordingly for estimating the running time of a task. These models hinge on time series analysis techniques and take into account results of the cluster computing literature. Apart from incorporating these models, EMPEROR provides fully fledged Grid scheduling functionality, which complies with OGSA standards as the latter are reflected in the Globus toolkit. Specifically, EMPEROR interfaces to Globus middleware services (i.e., GSI, MDS, GRAM) towards discovering resources, implementing the scheduling algorithm and ultimately submitting jobs to local scheduling systems. By and large, EMPEROR is one of the few standards-based meta-schedulers making use of dynamic scheduling information.

7.

Implementation of Fault-Tolerant GridRPC Applications

Yusuke Tanimura, Tsutomu Ikegami, Hidemoto Nakada, Yoshio Tanaka, Satoshi Sekiguchi 《Journal of Grid Computing》2006,4(2):145-157

A task parallel application is implemented with Ninf-G, a GridRPC system. A series of experiments was conducted on the Grid testbed in Asia Pacific over three months. Through tens of long executions, typical fault patterns were collected, and instability of the network throughput was determined to be a major cause of the faults. Several important points are stressed to avoid a decline in task throughput due to fault-recovery operations: timeout minimization for fault detection, background recovery, redundant task assignments, and so on. This study also offers guidance for the design of an automated fault-tolerant mechanism in an upper layer of the GridRPC framework.

8.

Coordinated rescheduling of Bag‐of‐Tasks for executions on multiple resource providers

Marco A.S. Netto, Rajkumar Buyya 《Concurrency and Computation》2012,24(12):1362-1376

Metaschedulers can distribute parts of a Bag‐of‐Tasks (BoT) application among various resource providers in order to speed up its execution. The expected completion time of the user application is then calculated based on the run‐time estimates of all applications running and waiting for resources. However, because of inaccurate run‐time estimates, initial schedules are not those that provide users with the earliest completion time. These inaccurate estimates increase the time distance between the first and last tasks of a BoT application, which increases average user response time, especially in multi‐provider environments. This paper proposes a coordinated rescheduling algorithm to handle inaccurate run‐time estimates when executing BoT applications in multi‐provider environments. The coordinated rescheduling defines which tasks can have their start time updated based on the expected completion time of the entire BoT application. We have also evaluated the impact of system‐generated run‐time estimates to schedule BoT applications on multiple providers. We performed experiments using simulations and a real distributed platform, Grid'5000. From our experiments, we obtained reductions of up to 5% and 10% for response time and slowdown metrics, respectively, by using coordinated rescheduling over a traditional rescheduling solution. Moreover, coordinated rescheduling requires little modification of existing scheduling systems. System‐generated predictions, on the other hand, are more complex to deploy and may not reduce response times as much as coordinated rescheduling. Copyright © 2011 John Wiley & Sons, Ltd.

9.

HPC on the Grid: The Theophys Experience

Roberto Alfieri, Silvia Arezzini, Alberto Ciampa, Roberto De Pietri, Enrico Mazzoni 《Journal of Grid Computing》2013,11(2):265-280

The Grid Virtual Organization (VO) “Theophys”, associated with the INFN (Istituto Nazionale di Fisica Nucleare), is a theoretical physics community with various computational demands, ranging across serial, SMP, MPI and hybrid jobs. Over the past 20 years, this has led to the use of the Grid infrastructure for serial jobs, while multi-threaded, MPI and hybrid jobs have been executed on several small-to-medium-size clusters installed at different sites, with access through standard local submission methods. This work analyzes the support for parallel jobs in the scientific Grid middlewares, then describes how the community unified the management of most of its computational needs (serial and parallel) using the Grid, through the development of a specific project which integrates serial and parallel resources in a common Grid-based framework. A centralized national cluster is deployed inside this framework, providing “Wholenodes” reservations, CPU affinity, and other new features supporting our High Performance Computing (HPC) applications in the Grid environment. Examples of the cluster performance for relevant parallel applications in theoretical physics are reported, focusing on the different kinds of parallel jobs that can be served by the new features introduced in the Grid.

10.

Dynamic I/O-Aware Scheduling for Batch-Mode Applications on Chip Multiprocessor Systems of Cluster Platforms

吕方, 崔慧敏, 王蕾, 刘磊, 武成岗, 冯晓兵, 游本中 《Journal of Computer Science and Technology》2014,29(1):21-37

Efficiency of batch processing is becoming increasingly important for many modern commercial service centers, e.g., clusters and cloud computing datacenters. However, periodic resource contention has become the major performance obstacle for concurrently running applications on mainstream CMP servers. I/O contention is one such obstacle, and it may seriously impede both the co-running performance of batch jobs and the system throughput. In this paper, a dynamic I/O-aware scheduling algorithm is proposed to lower the impact of I/O contention and to enhance the co-running performance in batch processing. We set up our environment on an 8-socket, 64-core server in the Dawning Linux Cluster. Fifteen workloads ranging from 8 jobs to 256 jobs are evaluated. Our experimental results show significant improvements in the throughput of the workloads, ranging from 7% to 431%. Meanwhile, noticeable improvements in the slowdown of workloads and the average runtime of each job can be achieved. These results show that a well-tuned dynamic I/O-aware scheduler is beneficial for batch-mode services. It can also enhance resource utilization via throughput improvement on modern service platforms.

11.

Dynamic Partitioning of GATE Monte-Carlo Simulations on EGEE

Sorina Camarasu-Pop, Tristan Glatard, Jakub T. Mościcki, Hugues Benoit-Cattin, David Sarrut 《Journal of Grid Computing》2010,8(2):241-259

The EGEE Grid offers the necessary infrastructure and resources for reducing the running time of particle tracking Monte-Carlo applications like GATE. However, efforts are required to achieve reliable and efficient execution and to provide execution frameworks to end-users. This paper presents results obtained from porting the GATE software to the EGEE Grid, our ultimate goal being to provide reliable, user-friendly and fast execution of GATE to radiation therapy researchers. To address these requirements, we propose a new parallelization scheme based on dynamic partitioning and its implementation in two different frameworks using pilot jobs and workflows. Results show that pilot jobs bring a strong improvement over regular gLite submission, that the proposed dynamic partitioning algorithm further reduces execution time by a factor of two, and that the genericity and user-friendliness offered by the workflow implementation do not introduce significant overhead.

12.

A simulator for adaptive parallel applications

Basile Schaeli, Sebastian Gerlach, Roger D. Hersch 《Journal of Computer and System Sciences》2008,74(6):983-999

Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. Detailed simulations can help identify allocation strategies and problem decomposition parameters that increase the efficiency of parallel applications. We describe a simulation framework supporting dynamic node allocation which, given a simple cluster model, predicts the running time of parallel applications taking CPU and network sharing into account. Simulations can be carried out without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements are reduced. In partial direct execution simulations, the application's parallel behavior is retrieved via direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. Simulations may then vary cluster model parameters, operation durations and problem decomposition parameters to analyze their impact on the application performance and identify the limiting factors. We implemented the proposed techniques by adding direct execution simulation capabilities to the Dynamic Parallel Schedules parallelization framework. We introduce the concept of dynamic efficiency to express the resource utilization efficiency as a function of time. We verify the accuracy of our simulator by comparing the effective running time and the dynamic efficiency of parallel program executions with the running time and the dynamic efficiency predicted by the simulator under different parallelization and dynamic node allocation strategies.

13.

Lark: An effective approach for software-defined networking in high throughput computing clusters

《Future Generation Computer Systems》2017


14.

Service scheduling and rescheduling in an applications integration framework

Lei Yu, Frédéric Magoulès 《Advances in Engineering Software》2009,40(9):941-946

Grid technologies are evolving towards a service-oriented architecture (SOA), and the traditional client/server architecture of heterogeneous computing (HC) can be transformed into a grid service-oriented architecture. In this architecture, when more than one service fulfills the user request, a service which can make scheduling decisions is essential. A scheduling service has been proposed in a framework which achieves the dynamic deployment and scheduling of scientific and engineering applications. The framework treats all components (resource service and scheduler service) as WSRF-compliant services which support the integration of applications with underlying native platform facilities and facilitate the construction of the hierarchical scheduling system. In order to enhance the system performance, we replace the MWL scheduling algorithm with an MCT algorithm and integrate a rescheduling mechanism in the framework. The experiments show that the MCT algorithm can achieve a smaller makespan and that the rescheduling mechanism ensures task execution even if an application is removed from the Resource Service.
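
The abstract above does not spell out its MCT algorithm. As a rough illustration only, the following sketch assumes MCT refers to the common Minimum Completion Time heuristic from the heterogeneous computing literature: each task, taken in arrival order, goes to the resource on which it would finish earliest. The function name `mct_schedule`, the `etc` matrix of expected execution times, and the sample numbers are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Tuple


def mct_schedule(
    etc: List[List[float]],
    ready: Optional[List[float]] = None,
) -> Tuple[List[int], float]:
    """Assign tasks to resources with a Minimum Completion Time heuristic.

    etc[i][j] is the (assumed known) expected execution time of task i
    on resource j. ready[j] is the time at which resource j becomes free.
    Returns the chosen resource per task and the resulting makespan.
    """
    if not etc:
        return [], 0.0
    n_resources = len(etc[0])
    # Copy so the caller's list is not mutated; default to idle resources.
    ready = list(ready) if ready is not None else [0.0] * n_resources

    assignment: List[int] = []
    for task_times in etc:
        # Completion time on resource j = time j becomes free + run time there.
        best = min(range(n_resources), key=lambda j: ready[j] + task_times[j])
        ready[best] += task_times[best]
        assignment.append(best)

    return assignment, max(ready)


if __name__ == "__main__":
    # Three tasks, two resources (illustrative numbers only).
    etc = [[4.0, 6.0], [3.0, 5.0], [2.0, 1.0]]
    assignment, makespan = mct_schedule(etc)
    print(assignment, makespan)  # [0, 1, 0] 6.0
```

Ties are broken in favour of the lowest-numbered resource, which is simply how `min` behaves here; a real scheduler service would likely use a more deliberate tie-breaking rule.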

15.

Adaptive service scheduling for workflow applications in Service-Oriented Grid (total citations: 1; self-citations: 1; citations by others: 0)

Sung Ho Chin, Taeweon Suh, Heon Chang Yu 《The Journal of supercomputing》2010,52(3):253-283

When a workflow application is executed in a Service-Oriented Grid (SOG), performance issues such as service scheduling should be considered in order to achieve high and stable performance in execution. However, most prior works on workflow management neither study the performance issues nor provide methodologies for evaluating the performance of Grid Services. Therefore, they cannot be applied to the service scheduling problem in SOG. In this paper, we propose and model evaluation metrics for Grid Service performance. The metrics are extracted based on common properties of Grid Services and are used to quantify and evaluate the performance of an individual Grid Service. With these metrics, we develop a service scheduling scheme with a list scheduling heuristic, to choose proper and optimal Grid Services for tasks in workflow applications. It ensures high performance in the execution of the workflow applications. In addition, we propose a low-overhead rescheduling method, referred to as Adaptive List Scheduling for Service (ALSS), to adapt to the dynamic nature of a grid environment. ALSS provides stable performance for workflow applications, even in abnormal circumstances. Finally, we design an experimental environment with actual traces and perform simulations to quantify the benefits of our approach. Throughout the experiments, we demonstrate that ALSS outperforms conventional scheduling methods. Our scheme produces a scheduling performance that is superior to AHEFT by 50.2%, SLACK by 50.8%, HEFT by 68.3%, MaxMin by 72.0%, MinMin by 71.0%, and Myopic by 69.8%.

16.

Qespera: an adaptive framework for prediction of queue waiting times in supercomputer systems

Prakash Murali, Sathish Vadhiyar 《Concurrency and Computation》2016,28(9):2685-2710

Production parallel systems are space‐shared, and resource allocation on such systems is usually performed using a batch queue scheduler. Jobs submitted to the batch queue experience a variable delay before the requested resources are granted. Predicting this delay can assist users in planning experiment time‐frames and choosing sites with shorter turnaround times, and can also help meta‐schedulers make scheduling decisions. In this paper, we present an integrated adaptive framework, Qespera, for prediction of queue waiting times on parallel systems. We propose a novel algorithm based on spatial clustering for predictions using the history of job submissions and executions. The framework uses an adaptive set of strategies for choosing either distributions or summaries of features to represent the system state and to compare with history jobs, varying the weights associated with the features for each job prediction, and selecting a particular algorithm dynamically for performing the prediction depending on the characteristics of the target and history jobs. Our experiments with real workload traces from different production systems demonstrate up to 22% reduction in average absolute error and up to 56% reduction in percentage prediction error over existing techniques. We also report prediction errors of less than 1 h for a majority of the jobs. Copyright © 2015 John Wiley & Sons, Ltd.

17.

Rescheduling Manufacturing Systems: A Framework of Strategies, Policies, and Methods (total citations: 14; self-citations: 0; citations by others: 14)

Guilherme E. Vieira, Jeffrey W. Herrmann, Edward Lin 《Journal of Scheduling》2003,6(1):39-62

Many manufacturing facilities generate and update production schedules, which are plans that state when certain controllable activities (e.g., processing of jobs by resources) should take place. Production schedules help managers and supervisors coordinate activities to increase productivity and reduce operating costs. Because a manufacturing system is dynamic and unexpected events occur, rescheduling is necessary to update a production schedule when the state of the manufacturing system makes it infeasible. Rescheduling updates an existing production schedule in response to disruptions or other changes. Though many studies discuss rescheduling, there are no standard definitions or classification of the strategies, policies, and methods presented in the rescheduling literature. This paper presents definitions appropriate for most applications of rescheduling manufacturing systems and describes a framework for understanding rescheduling strategies, policies, and methods. This framework is based on a wide variety of experimental and practical approaches that have been described in the rescheduling literature. The paper also discusses studies that show how rescheduling affects the performance of a manufacturing system, and it concludes with a discussion of how understanding rescheduling can bring closer some aspects of scheduling theory and practice.

18.

An improved load-balancing mechanism based on deadline failure recovery on GridSim

Deepak Kumar Patel, Devashree Tripathy, Chitaranjan Tripathy 《Engineering with Computers》2016,32(2):173-188

Grid computing has emerged as a new field, distinguished from conventional distributed computing. It focuses on large-scale resource sharing, innovative applications and, in some cases, high performance orientation. The Grid serves as a comprehensive and complete system for organizations, by which the maximum utilization of resources is achieved. Load balancing is a process which involves resource management and an effective distribution of load among the resources. Therefore, it is considered to be very important in Grid systems. For a Grid, a dynamic, distributed load balancing scheme provides deadline control for tasks. Because of deadline failures, developing, deploying, and executing long-running applications over the grid remains a challenge. So, deadline failure recovery is an essential factor for Grid computing. In this paper, we propose a dynamic distributed load-balancing technique called “Enhanced GridSim with Load balancing based on Deadline Failure Recovery” (EGDFR) for computational Grids with heterogeneous resources. The proposed algorithm EGDFR is an improved version of the existing EGDC, in which we perform load balancing by providing a scheduling system that includes a mechanism for recovery from deadline failure of the Gridlets. Extensive simulation experiments are conducted to quantify the performance of the proposed load-balancing strategy on the GridSim platform. Experiments have shown that the proposed system can considerably improve Grid performance in terms of total execution time, percentage gain in execution time, average response time, resubmitted time and throughput. The proposed load-balancing technique gives 7% better performance than EGDC with a constant number of resources, whereas with a constant number of Gridlets it gives 11% better performance than EGDC.

19.

A dynamic rescheduling algorithm for resource management in large scale dependable distributed systems

Alexandra Olteanu, Florin Pop, Ciprian Dobre, Valentin Cristea 《Computers & Mathematics with Applications》2012,63(9):1409-1423

Scheduling is a key component for performance guarantees in the case of distributed applications running in large scale heterogeneous environments. Another function of the scheduler in such a system is the implementation of resilience mechanisms to cope with possible faults. In this case resilience is best approached using dedicated rescheduling mechanisms. The performance of rescheduling is very important in the context of large scale distributed systems and dynamic behavior. The paper proposes a generic rescheduling algorithm. The algorithm can use a wide variety of scheduling heuristics that can be selected by users in advance, depending on the system’s structure. The rescheduling component is designed as a middleware service that aims to increase the dependability of large scale distributed systems. The system was evaluated in a real-world implementation for a Grid system. The proposed approach supports fault tolerance and offers an improved mechanism for resource management. The evaluation of the proposed rescheduling algorithm was performed using modeling and simulation. We present experimental results confirming the performance and capabilities of the proposed rescheduling algorithm.

20.

BMC via on-the-fly determinization (total citations: 1; self-citations: 0; citations by others: 1)

Toni Jussila, Keijo Heljanko, Ilkka Niemelä 《International Journal on Software Tools for Technology Transfer (STTT)》2005,7(2):89-101

This paper develops novel bounded model checking (BMC) techniques for asynchronous parallel systems. The aim is to increase the efficiency of BMC by exploiting the inherent concurrency in such systems. This added efficiency is gained by covering more reachable states within a given bound using two techniques. Firstly, a nonstandard execution model, step executions, in which multiple actions can take place simultaneously, is applied. Secondly, the number of executions the system can have is reduced by modeling the execution of the system components as if they were determinized. This determinization technique also enables the removal of the internal transitions of the components. Step executions can be further restricted to a subclass called process executions without losing any reachable states. The paper presents a translation scheme for BMC of reachability properties. The translation is from an asynchronous system where the components are modeled as labeled transition systems (LTSs) to a propositional formula. The models of the formula correspond to the step executions of the original system where each component is replaced with its determinized counterpart. The formula for step executions can be easily extended in such a way that its models correspond to the process executions of the system. The translation scheme has been implemented and some experimental comparisons performed. The results show that the bound needed to detect a violation of a reachability property is, for step and process executions, in most cases lower than in interleaving executions, and that the running time of the model checker using process executions is smaller than that of the one using steps. Moreover, the performance compares favorably to a state-of-the-art interleaving BMC implementation in the NuSMV system.