Similar Documents
20 similar documents found.
1.
This paper presents methodologies capable of quantifying multiprogramming (MP) overhead on a computer system. Two methods which quantify the lower bound on MP overhead, along with a method to determine MP overhead present in real workloads, are introduced. The techniques are illustrated by determining the percentage of parallel processing time consumed by MP overhead on Alliant multiprocessors. The real workload MP overhead measurements, as well as measurements of other overhead components such as kernel lock spinning, are then used in a comprehensive case study of performance degradation due to overheads. It is found that MP overhead accounts for well over half of the total system overhead. Kernel lock spinning is determined to be a major component of both MP and total system overhead. Correlation analysis is used to uncover underlying relationships between overheads and workload characteristics. It is found that for the workloads studied, MP overhead in the parallel environment is not statistically dependent on the number of parallel jobs being multiprogrammed. However, because of increased kernel contention, serial jobs, even those executing on peripheral processors, are responsible for variation in MP overhead.
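As a rough illustration of the measurements and correlation analysis this abstract describes, the sketch below computes MP overhead as a percentage of parallel processing time and correlates it with the number of parallel jobs. The per-interval numbers are invented, not Alliant data, and statistics.correlation needs Python 3.10+.

```python
# Hypothetical per-interval measurements; real traces would come from
# hardware/OS counters on the multiprocessor under study.
from statistics import correlation

# Each sample: (parallel CPU s, MP overhead s, kernel-lock spin s, #parallel jobs)
samples = [
    (100.0, 18.0,  9.0, 2),
    (100.0, 21.0, 11.5, 4),
    (100.0, 17.5,  8.7, 6),
    (100.0, 20.2, 10.9, 3),
]

mp_pct = [100.0 * mp / par for par, mp, _, _ in samples]
spin_share = [spin / mp for _, mp, spin, _ in samples]  # lock spinning within MP overhead
jobs = [float(n) for _, _, _, n in samples]

print("MP overhead as % of parallel time:", mp_pct)
print("lock spinning as fraction of MP overhead:", spin_share)
# The paper finds no statistical dependence on the number of parallel jobs;
# with real traces, this correlation is what one would examine.
print("corr(MP overhead %, #parallel jobs):", round(correlation(mp_pct, jobs), 3))
```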

2.
In this paper, a heuristic dynamic scheduling scheme for parallel real-time jobs executing on a heterogeneous cluster is presented. In our system model, parallel real-time jobs, which are modeled by directed acyclic graphs, arrive at a heterogeneous cluster following a Poisson process. A job is said to be feasible if all its tasks meet their respective deadlines. The scheduling algorithm proposed in this paper takes reliability measures into account, thereby enhancing the reliability of heterogeneous clusters without any additional hardware cost. To make scheduling results more realistic and precise, we incorporate scheduling and dispatching times into the proposed scheduling approach. An admission control mechanism is in place so that parallel real-time jobs whose deadlines cannot be guaranteed are rejected by the system. For experimental performance study, we have considered a real world application as well as synthetic workloads. Simulation results show that compared with existing scheduling algorithms in the literature, our scheduling algorithm reduces reliability cost by up to 71.4% (with an average of 63.7%) while improving schedulability over a spectrum of workload and system parameters. Furthermore, results suggest that shortening scheduling times leads to a higher guarantee ratio. Hence, if parallel scheduling algorithms are applied to shorten scheduling times, the performance of heterogeneous clusters will be further enhanced.
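A minimal sketch of the admission-control step this abstract mentions, charging scheduling and dispatching times against each task's deadline. The TaskPlan structure and the numbers are invented, and the reliability-cost term of the published algorithm is omitted.

```python
from dataclasses import dataclass

@dataclass
class TaskPlan:
    est_start: float   # earliest start time on the chosen node
    exec_time: float   # estimated execution time on that node
    deadline: float    # absolute deadline

def admit(job_tasks, sched_time, dispatch_time):
    """Accept a parallel real-time job only if every task still meets its
    deadline once scheduling and dispatching delays are accounted for."""
    for t in job_tasks:
        finish = t.est_start + sched_time + dispatch_time + t.exec_time
        if finish > t.deadline:
            return False   # infeasible job: reject at admission
    return True

print(admit([TaskPlan(0.0, 5.0, 9.0)], sched_time=0.5, dispatch_time=0.2))  # True
print(admit([TaskPlan(0.0, 5.0, 5.4)], sched_time=0.5, dispatch_time=0.2))  # False
```

The test also makes the abstract's last observation concrete: shrinking sched_time directly lowers each task's finish estimate, so more jobs pass admission.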

3.
In this paper, we examine three general classes of space-sharing scheduling policies under a workload representative of large-scale scientific computing. These policies differ in the way processors are partitioned among the jobs as well as in the way jobs are prioritized for execution on the partitions. We consider new static, adaptive and dynamic policies that differ from previously proposed policies by exploiting user-supplied information about the resource requirements of submitted jobs. We examine the performance characteristics of these policies from both the system and user perspectives. Our results demonstrate that existing static schemes do not perform well under varying workloads, and that the system scheduling policy for such workloads must distinguish between jobs with large differences in execution times. We show that obtaining good performance under adaptive policies requires some a priori knowledge of the job mix in these systems. We further show that a judiciously parameterized dynamic space-sharing policy can outperform adaptive policies from both the system and user perspectives.
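One plausible reading of an adaptive space-sharing rule that exploits user-supplied runtime estimates, as a hedged sketch only; the policies in the paper are defined differently in detail, and all names here are invented.

```python
def adaptive_partition(free_procs, waiting_jobs, max_request):
    """Split free processors across waiting jobs, capped by each job's
    request; jobs with short user-supplied runtime estimates go first,
    reflecting the paper's point that the scheduler should distinguish
    jobs with large differences in execution time."""
    assignments = {}
    by_runtime = sorted(waiting_jobs, key=lambda j: j[2])  # (id, request, est_runtime)
    for i, (jid, req, _est) in enumerate(by_runtime):
        share = max(1, free_procs // (len(by_runtime) - i))
        give = min(req, share, free_procs, max_request)
        if give == 0:
            break
        assignments[jid] = give
        free_procs -= give
    return assignments

print(adaptive_partition(64, [("a", 32, 100.0), ("b", 16, 10.0), ("c", 48, 50.0)], 32))
```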

4.
This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are executed one by one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint that the total number of processors assigned within a pack does not exceed the maximum number of available processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization problem, and propose several heuristics that exhibit very good performance on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload completion time (40% improvement on average over traditional scheduling) and to faster response times (50% improvement). Hence, co-scheduling increases system throughput and saves energy, leading to significant benefits from both the user and system perspectives.
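The pack structure is easy to make concrete. Below is a first-fit-decreasing sketch of pack formation under the stated constraint (at most P processors per pack); the published heuristics also optimize each application's processor count, which this sketch keeps fixed.

```python
def build_packs(apps, P):
    """apps: list of (name, procs_needed, exec_time). Greedily place each
    app, largest processor demand first, into the first pack with room.
    A pack runs as long as its slowest app, and packs run one by one,
    so the schedule length is the sum of per-pack maxima, the objective
    the paper minimizes."""
    packs = []  # each pack: [used_procs, list_of_apps]
    for app in sorted(apps, key=lambda a: a[1], reverse=True):
        for pack in packs:
            if pack[0] + app[1] <= P:
                pack[0] += app[1]
                pack[1].append(app)
                break
        else:
            packs.append([app[1], [app]])
    return packs, sum(max(t for _, _, t in pack[1]) for pack in packs)

apps = [("A", 6, 10.0), ("B", 4, 8.0), ("C", 4, 3.0), ("D", 2, 9.0)]
packs, total_time = build_packs(apps, P=8)
print(total_time)   # 18.0: pack {A,D} costs 10, pack {B,C} costs 8
```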

5.
Soft real-time environments consist of jobs that must receive service within a particular time interval. If service for a specific job is not completed by the end of its time interval, it is said to be lost; in addition, the computation time expended on the job is wasted, and any further computation for the job is discontinued. The goal of a system designer is to provide an environment that minimizes the number of jobs that are lost. If a parallel environment is available, the system designer has two options: allow each processor to execute a job individually, or let multiple processors cooperate in executing a job. This article shows, for two classes of static allocation policies, that simple comparative analytical models may be used to indicate which option minimizes the number of lost jobs, as a function of workload intensity. The first class of policies, called equal partitions, statically decomposes the system into equal-size sets of processors and executes one job per partition. These policies are frequently employed in other contexts. The second class of policies, called two partitions, statically partitions the processors into two sets, not necessarily of the same size. Surprisingly, it is observed mathematically that even for statistically identical jobs, this class of policies is superior to equal partitions under certain loadings. The analysis is validated experimentally with a workload executed on a 16-node iPSC/2 hypercube.
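The partitioning trade-off can be probed with a toy Monte Carlo, sketched below under strong simplifying assumptions: ideal linear speedup within a partition, and a job that would miss its window is dropped on arrival rather than partway through (the paper's analytical models are more careful). All rates and sizes are invented.

```python
import random

def lost_fraction(partition_sizes, arrival_rate, work_mean, window,
                  n_jobs=20000, seed=1):
    """Jobs arrive Poisson with exponential work and are lost unless
    finished within `window` of arrival; each partition serves one job
    at a time, and a partition of p processors runs a job in work/p."""
    random.seed(seed)
    free_at = [0.0] * len(partition_sizes)
    t, lost = 0.0, 0
    for _ in range(n_jobs):
        t += random.expovariate(arrival_rate)
        i = min(range(len(free_at)), key=lambda k: free_at[k])
        start = max(t, free_at[i])
        service = random.expovariate(1.0 / work_mean) / partition_sizes[i]
        if start + service > t + window:
            lost += 1                    # deadline missed: job is lost
        else:
            free_at[i] = start + service
    return lost / n_jobs

# 16 processors: four equal partitions vs. one unequal two-partition split
print(lost_fraction([4, 4, 4, 4], arrival_rate=0.8, work_mean=8.0, window=6.0))
print(lost_fraction([12, 4],      arrival_rate=0.8, work_mean=8.0, window=6.0))
```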

6.
Load balancing algorithms are designed essentially to distribute the load equally across processors and maximize their utilization while minimizing the total task execution time. In order to achieve these goals, the load-balancing mechanism should be “fair” in distributing the load across the different processors: the difference between the heaviest-loaded and the lightest-loaded processors should be minimized. Therefore, the load information on each processor must be kept up to date so that the load-balancing mechanism can be more effective. In this work, we present an application-independent dynamic algorithm for scheduling tasks and load-balancing in message passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real-time applications (DAG-DLBR) that is designed to work dynamically and to cope with possible changes in the load that might occur during runtime. This algorithm addresses the challenge of devising a load balancing scheme that judiciously handles the hybrid execution of an existing real-time application (represented by a Directed Acyclic Graph (DAG)) together with newly arriving jobs. The main objective of this algorithm is to reduce the response times of newly arriving jobs while maintaining the time constraints of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with two common dynamic load balancing algorithms is presented. This comparison is performed by evaluating, experimentally, the execution time of the different load balancing algorithms on a homogeneous real parallel machine. In addition, the load imbalance, execution time, and communication overhead time are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared with both distributed and hierarchical algorithms, by at least 12% and 19%, respectively. This improvement holds for all workloads, even highly dependent ones. The DAG-DLBR algorithm also achieves lower computation time than both the distributed and the hierarchical algorithms for 4, 8, 12, and 16 processors.
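The fairness measure the abstract describes (the gap between the heaviest- and lightest-loaded processors) is straightforward to state in code; the sketch below is in the spirit of, not a reproduction of, the published DAG-DLBR algorithm.

```python
def imbalance(loads):
    """Difference between heaviest- and lightest-loaded processors;
    0 means perfectly balanced."""
    return max(loads) - min(loads)

def place_new_task(loads, task_cost):
    """Dynamic step: a newly arriving task goes to the currently
    lightest-loaded processor, using the up-to-date per-processor
    load information the abstract calls for."""
    target = loads.index(min(loads))
    loads[target] += task_cost
    return target

loads = [12.0, 7.0, 9.5, 8.0]
print("imbalance before:", imbalance(loads))   # 5.0
place_new_task(loads, 3.0)                     # goes to processor 1
print("imbalance after :", imbalance(loads))   # 4.0
```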

7.
Gang scheduling is a common task scheduling policy for parallel and distributed systems which combines elements of space-sharing and time-sharing. In this paper we present a migration strategy which reduces the fragmentation in the schedule caused by gang scheduled jobs. We consider the existence of high priority jobs in the workload. These jobs need to be started immediately and they may interrupt a parallel job’s execution. A distributed system consisting of two homogeneous clusters is simulated to evaluate the performance for various workloads. We study the impact on performance of the variability in service time of the parallel tasks. Our simulation results indicate that the proposed strategy can result in a significant performance gain and that the performance improvement depends on the variability of gang tasks’ service time.

8.
Scheduling jobs dynamically on processors is likely to achieve better performance in multiprocessor and distributed real-time systems. Exhaustive methods for determining whether all jobs complete by their deadlines, in systems that use modern priority-driven scheduling strategies, are often infeasible or unreliable since the execution time of each job may vary. We previously published research results on finding worst-case bounds and efficient algorithms for validating systems in which independent jobs have arbitrary release times and deadlines, and are scheduled on processors dynamically in a priority-driven manner. An efficient method has been proposed to determine how late the completion times of jobs can be in dynamic systems where the jobs are preemptable and nonmigratable. This paper further presents the performance characteristics of the proposed methods, and shows their soundness through extensive simulation results. The worst-case completion times of jobs obtained with the proposed methods are compared with the values obtained by simulation under different workload characteristics. The simulation results show that the proposed algorithm performs well for diverse workloads. Given that previous work showed tighter bounds than the one given in the paper to be unlikely, the simulation results indicate that the proposed methods effectively constitute the theoretical basis needed for a comprehensive validation strategy capable of dealing with dynamic distributed real-time systems.

9.
In this paper, we address non-preemptive online scheduling of parallel jobs on a Grid. Our Grid consists of a large number of identical processors that are divided into several machines. We consider a Grid scheduling model with two stages. At the first stage, jobs are allocated to a suitable machine, while at the second stage, local scheduling is independently applied to each machine. We discuss strategies based on various combinations of allocation strategies and local scheduling algorithms. Finally, we propose and analyze a scheme named adaptive admissible allocation. This includes a competitive analysis for different parameters and constraints. We show that the algorithm is beneficial under certain conditions and allows for an efficient implementation in real systems. Furthermore, a dynamic and adaptive approach is presented which can cope with different workloads and Grid properties.
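A hedged sketch of the first stage (machine allocation) under an admissible-set rule; the adaptivity parameter of the published scheme is omitted, and all names are invented.

```python
def allocate(job_procs, machines):
    """machines: name -> (total_procs, queued_work). Only machines large
    enough to run the parallel job are admissible; among them, pick the
    one with the least queued work per processor. Local scheduling on
    the chosen machine (stage two) is a separate concern."""
    admissible = {m: v for m, v in machines.items() if v[0] >= job_procs}
    if not admissible:
        return None   # no single machine can host this parallel job
    return min(admissible, key=lambda m: admissible[m][1] / admissible[m][0])

machines = {"m1": (64, 300.0), "m2": (128, 500.0), "m3": (32, 20.0)}
print(allocate(48, machines))   # "m2": m3 is too small, m2 is less loaded per CPU
```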

10.
Many scientific and high-performance computing applications consist of multiple processes running on different processors that communicate frequently. Because of their synchronization needs, these applications can suffer severe performance penalties if their processes are not all coscheduled to run together. Two common approaches to coscheduling jobs are batch scheduling, wherein nodes are dedicated for the duration of the run, and gang scheduling, wherein time slicing is coordinated across processors. Both work well when jobs are load-balanced and make use of the entire parallel machine. However, these conditions are rarely met and most realistic workloads consequently suffer from both internal and external fragmentation, in which resources and processors are left idle because jobs cannot be packed with perfect efficiency. This situation leads to reduced utilization and suboptimal performance. Flexible coscheduling (FCS) addresses this problem by monitoring each job's computation granularity and communication pattern and scheduling jobs based on their synchronization and load-balancing requirements. In particular, jobs that do not require stringent synchronization are identified, and are not coscheduled; instead, these processes are used to reduce fragmentation. FCS has been fully implemented on top of the STORM resource manager on a 256-processor Alpha cluster and compared to batch, gang, and implicit coscheduling algorithms. This paper describes in detail the implementation of FCS and its performance evaluation with a variety of workloads, including large-scale benchmarks, scientific applications, and dynamic workloads. The experimental results show that FCS saturates at higher loads than other algorithms (up to 54 percent higher in some cases), and displays lower response times and slowdown than the other algorithms in nearly all scenarios.
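The classification step at the heart of FCS can be caricatured in a few lines; the feature thresholds below are invented, not the ones STORM's implementation uses.

```python
def classify(avg_granularity_ms, frac_time_blocked_on_comm):
    """Jobs that compute in fine grains and block often on communication
    need their processes coscheduled; coarse-grained or rarely-blocking
    jobs do not, and can instead be used to fill fragmentation."""
    if avg_granularity_ms < 5.0 and frac_time_blocked_on_comm > 0.2:
        return "coschedule"
    return "fill-fragmentation"

print(classify(1.2, 0.45))    # fine-grained, communication-bound -> coschedule
print(classify(80.0, 0.05))   # coarse-grained -> schedule opportunistically
```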

11.
12.
Performance Evaluation, 2006, 63(9-10): 939-955
Increasing diversity in telecommunication workloads leads to greater complexity in communication protocols. This occurs as channel bandwidth rapidly increases. These factors result in larger computational loads for network processors that are increasingly turning to high performance microprocessor designs. This paper presents an analytical method for estimating the performance of instruction level parallel (ILP) processors executing network protocol processing applications. Instruction dependency information extracted while executing an application is used to calculate upper and lower bounds for throughput, measured in instructions per cycle (IPC). Results using UDP/TCP/IP applications show that the simulated IPC values fall between the analytically derived upper and lower bounds, validating the model. The analytical method is much less expensive than cycle-accurate simulation, but reveals similar throughput performance predictions. This allows the architectural design space for network superscalar processors to be explored more rapidly and comprehensively, to reveal the maximum IPC that is possible for a given application workload and the available hardware resources.
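The flavor of the bound computation can be shown with a back-of-envelope version: given N instructions whose longest dependency chain is L, no machine of issue width w finishes in fewer than max(L, N/w) cycles. The published model is considerably finer-grained; this is only the limiting argument.

```python
def ipc_bounds(n_instructions, critical_path_len, issue_width):
    """Upper bound on IPC from the dependency critical path and the
    issue width; one instruction per cycle as a trivial lower bound."""
    best_cycles = max(critical_path_len, n_instructions / issue_width)
    return 1.0, n_instructions / best_cycles

# e.g. a protocol-processing trace fragment: 10k instructions, chain of 2.5k
print(ipc_bounds(10_000, 2_500, issue_width=8))   # (1.0, 4.0)
```

A simulated IPC for the same trace should land between the two values, which is how the abstract says the model was validated.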

13.
In enterprise grid computing environments, users have access to multiple resources that may be distributed geographically. Resource allocation and scheduling are therefore fundamental issues in achieving high performance on enterprise grids. Most current job scheduling systems for enterprise grid computing provide batch queuing support and focus solely on the allocation of processors to jobs. However, since I/O is also a critical resource for many jobs, the allocation of processor and I/O resources must be coordinated to allow the system to operate most effectively. To this end, we present a hierarchical scheduling policy that pays special attention to the I/O and service demands of parallel jobs in homogeneous and heterogeneous systems with background workload. The performance of the proposed scheduling policy is studied under various system and workload parameters through simulation. We also compare the performance of the proposed policy with a static space–time sharing policy. The results show that the proposed policy performs substantially better than the static space–time sharing policy.

14.
Load Sharing in Heterogeneous Computing
曾国荪, 陆鑫达. 《软件学报》(Journal of Software), 2000, 11(4): 551-556
In a message-passing heterogeneous parallel computing system, each processor or computer schedules and executes jobs autonomously and independently. When a divisible job initially resides on one processor, that processor can, to improve performance, ask other heterogeneous processors to share the load and take part in cooperative computation, reducing the job's completion time. This paper proposes a load-sharing scheme for heterogeneous computing. First, a load-sharing protocol is invoked to collect each processor's current permission data for load sharing, including its shared time window and computing capability. Then, a function relating the amount of work to the job completion time is constructed; this function is the theoretical basis for selecting a suitable group of processors, optimizing the job partitioning, and minimizing the job completion time. …
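The work/completion-time relation is simple under an idealized divisible-load model: if the job is split so that all chosen processors finish together, completion time is the total work divided by the group's aggregate speed. The sketch below searches processor groups on that basis; the shared-time-window constraints collected by the paper's protocol are deliberately ignored here.

```python
from itertools import combinations

def completion_time(work, speeds):
    # Split work in proportion to speed so all finish simultaneously:
    # T = work / sum(speeds)
    return work / sum(speeds)

def best_group(work, offers):
    """offers: processor name -> speed, as gathered by the load-sharing
    protocol. Exhaustive search is kept so window/cost constraints could
    be added; under the ideal model, using every offer is optimal."""
    best = (float("inf"), ())
    for r in range(1, len(offers) + 1):
        for group in combinations(offers, r):
            t = completion_time(work, [offers[n] for n in group])
            best = min(best, (t, group))
    return best

offers = {"p1": 2.0, "p2": 1.0, "p3": 0.5}
print(best_group(70.0, offers))   # (20.0, ('p1', 'p2', 'p3'))
```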

15.
An Equal Partitioning Strategy for Mesh Networks
In large-scale parallel computer systems, processor resources may be contended for by multiple user jobs, so the operating system must adopt a processor allocation policy that decides how many, and which, processors are assigned to a job. For large-scale message-passing parallel computers, this paper proposes two processor allocation strategies, rectangular and non-rectangular; both satisfy fairness in the number of processors allocated to each user as well as proximity of the allocated processors.
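The proximity requirement of a rectangular strategy reduces to finding a free contiguous submesh. A first-fit sketch follows (fairness would be enforced by the surrounding policy, not shown); coordinates and sizes are invented.

```python
def find_submesh(busy, W, H, w, h):
    """Return the cells of the first free w x h rectangle in a W x H
    mesh, scanning left-to-right, bottom-to-top; `busy` is the set of
    (x, y) coordinates of already-allocated processors."""
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            cells = {(x + i, y + j) for i in range(w) for j in range(h)}
            if not cells & busy:
                return cells
    return None   # no contiguous rectangle available

busy = {(0, 0), (1, 0)}                        # two processors taken
print(sorted(find_submesh(busy, 4, 4, 2, 2)))  # [(2, 0), (2, 1), (3, 0), (3, 1)]
```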

16.
The IBM System/390 Parallel Sysplex allows a parallel server to grow from a single system to a configuration of 32 systems, and appear as a single image to the end user and applications. It can provide capacity growth for today's largest commercial workloads by enabling a workload to be spread transparently across a collection of systems with shared access to data. By way of its parallel architecture and operating system support, the Parallel Sysplex offers near-linear scalability and continuous availability for customers' mission-critical applications. Parallel Sysplex optimizes responsiveness and reliability by distributing workloads across all of the processors in the Sysplex. If one or more processors fail, the workload is redistributed across the remaining processors. Because all of the processors have access to all of the data, it can provide a computing environment with near-continuous availability. Empirical performance data are used to demonstrate that Parallel Sysplex provides better scalability than a tightly-coupled multiprocessing system.

17.
I/O scheduling algorithms have a crucial impact on the performance of disk arrays (RAID). Although many classical I/O scheduling algorithms achieve good performance under particular workloads, hardly any single algorithm performs well across all workloads. This paper proposes an intelligent RAID control model that combines a C4.5 decision tree with the AdaBoost algorithm to classify workloads automatically, and dynamically adjusts the I/O scheduling policy according to workload changes and performance feedback, achieving autonomous, application-oriented scheduling. Simulation results show that the adaptive scheduling algorithm adapts well, outperforms existing I/O scheduling algorithms under a variety of workloads, and is especially suited to I/O performance optimization in multithreaded mixed-workload environments.
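A sketch of the classifier side using scikit-learn. Note the hedges: scikit-learn's trees are CART rather than C4.5, so a depth-limited CART tree stands in for the paper's base learner, and the features, training rows, and policy names are invented.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# features per interval: [mean request size (KB), read fraction, sequential fraction]
X = [[4, 0.9, 0.10], [256, 0.2, 0.90], [8, 0.5, 0.20],
     [512, 0.1, 0.95], [16, 0.8, 0.15], [128, 0.3, 0.85]]
y = ["random-small", "sequential-large", "random-small",
     "sequential-large", "random-small", "sequential-large"]

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=30)
clf.fit(X, y)

# The controller maps the predicted workload class to an I/O scheduling policy
# and switches when the classification (or performance feedback) changes:
policy = {"random-small": "deadline-style", "sequential-large": "anticipatory-style"}
label = clf.predict([[300, 0.25, 0.80]])[0]
print(label, "->", policy[label])
```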

18.
Cloud computing allows the execution and deployment of different types of applications, such as interactive databases or web-based services, which require distinct types of resources. These applications lease cloud resources for considerably long periods and usually occupy various resources to maintain a high quality of service (QoS). General big-data batch processing workloads, on the other hand, are less QoS-sensitive and require massively parallel cloud resources for short periods. Despite the elasticity of cloud computing, the fine-scale characteristics of cloud-based applications may cause temporarily low resource utilization in cloud computing systems, while process-intensive, highly utilized workloads suffer from performance issues. The ability to schedule heterogeneous workloads at high utilization is therefore a challenging issue for cloud owners. In this paper, addressing the impact of this heterogeneity on low cloud utilization, a joint resource allocation scheme for cloud applications and processing jobs is presented to improve utilization. The main idea is to schedule processing jobs and cloud applications together in a preemptive way. Utilization-efficient resource allocation, however, requires exact modeling of the workloads, so a novel methodology to model the processing jobs and other cloud applications is proposed first. Such jobs are modeled as a collection of parallel and sequential tasks in a Markovian process, which enables us to analyze and calculate the resources required to serve the tasks efficiently. The next step uses the proposed model to develop a preemptive scheduling algorithm for the processing jobs, improving resource utilization and its associated costs in the cloud computing system. Accordingly, a preemption-based resource allocation architecture is proposed to utilize idle reserved resources for the processing jobs effectively and efficiently in cloud paradigms. Performance metrics such as the service time of the processing jobs are then investigated. The accuracy of the proposed analytical model and scheduling analysis is verified through simulations and experiments, which also shed light on the achievable QoS level for the preemptively allocated processing jobs.
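The preemption rule can be sketched compactly: QoS-sensitive applications claim free capacity first and may evict batch processing jobs, while batch jobs only ever use what is idle. Everything below (structure, names, capacities) is invented for illustration.

```python
def place(job, free_cores, running_batch):
    """free_cores: node -> idle cores; running_batch: node -> list of
    preemptible batch jobs. Returns (node, evicted_jobs) or (None, [])."""
    need = job["cores"]
    for node, free in free_cores.items():
        if free >= need:
            free_cores[node] -= need
            return node, []                      # placed without preemption
    if job["qos"]:                               # only QoS apps may preempt
        for node, victims in running_batch.items():
            reclaim = sum(v["cores"] for v in victims)
            if free_cores[node] + reclaim >= need:
                free_cores[node] += reclaim - need
                running_batch[node] = []         # batch jobs resubmitted later
                return node, victims
    return None, []                              # batch job waits for idle cores

free = {"n1": 2, "n2": 1}
batch = {"n1": [{"id": "b1", "cores": 6}], "n2": []}
print(place({"cores": 6, "qos": True}, free, batch))   # ('n1', [{'id': 'b1', ...}])
```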

19.
Effective load distribution and resource management are of great importance in designing complex distributed systems such as grids. This presupposes the capability of partitioning arriving jobs into independent tasks that can be executed simultaneously, assigning the tasks to processors, and scheduling the task execution on each processor. A simulation model consisting of two homogeneous clusters is considered to evaluate performance under various workloads. The Deferred policy is applied to collect global system information about processor queues. This paper proposes a special scheduling method referred to as the task clustering method. We examine the efficiency of two task routing policies – one static and one adaptive – and six task scheduling policies, which rearrange processor queues according to a criterion. Our simulation results indicate that the adaptive task routing policy in conjunction with the SGFS-ST scheduling algorithm, which uses the task clustering method most efficiently, leads to a significant performance improvement.
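The adaptive routing policy is the easiest piece to illustrate: with the Deferred policy keeping global queue information fresh, each arriving task cluster can simply go to the less-loaded cluster. A toy sketch follows (a static policy would alternate round-robin instead); queue contents are invented.

```python
def route(cluster_queues):
    """cluster_queues: cluster name -> per-processor queued work.
    Send the next task cluster to the cluster with the least total
    queued work, per the global information the router holds."""
    return min(cluster_queues, key=lambda c: sum(cluster_queues[c]))

queues = {"clusterA": [3, 5, 2, 4], "clusterB": [1, 2, 2, 1]}
print(route(queues))   # clusterB
```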

20.
Models for two processor sharing policies, called task scheduling processor sharing and job scheduling processor sharing, are developed and analyzed. The first policy schedules each task independently and allows parallel execution of an individual program, whereas the second policy schedules each job as a unit, thereby not allowing parallel execution of an individual program. It is found that task scheduling performs better than job scheduling for most system parameter values. The performance of task scheduling processor sharing is compared to a first-come-first-serve policy. First-come-first-serve performs better than processor sharing over a wide range of system parameters; processor sharing performs best when the task service time variability is high. The performance of processor sharing and first-come-first-serve is also studied with two classes of jobs, for the case in which a specific number of processors is statically assigned to each class.
