首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
随着大规模的MapReduce集群广泛地用于大数据处理,特别是当有多个任务需要使用同一个Hadoop集群时,一个关键问题是如何最大限度地减少集群的工作时间,提高MapReduce作业的服务效率。可将多个MapReduce作业当做一个调度任务建模,观察发现多个任务的总完工时间和任务的执行顺序有密切关系。 研究目标是设计作业调度系统分析模型,最小化一批MapReduce作业的总完工时间。提出一个更好的调度策略和实现方法, 使整个调度系统符合经典Johnson算法的条件, 从而可使用经典Johnson算法在线性时间内获取总完工时间的最优解。同时,针对需要使用两个或多个资源池进行平衡的问题, 提出了一种线性时间解决方案, 优于已知的近似模拟方案。该理论模型可应用于提高系统响应速度、节能和负载均衡等方面, 对应的应用实例提供了证实。  相似文献   

2.
As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high-throughput of data than on low-latency of job execution. However, today more and more big data applications developed with MapReduce require quick response time. As a result, improving the performance of MapReduce jobs, especially for short jobs, is of great significance in practice and has attracted more and more attentions from both academia and industry. A lot of efforts have been made to improve the performance of Hadoop from job scheduling or job parameter optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First of all, by analyzing the job and task execution mechanism in MapReduce framework we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism for accelerating performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time cost of MapReduce jobs, especially for short jobs. Experimental results show that compared to the standard Hadoop, SHadoop can achieve stable performance improvement by around 25% on average for comprehensive benchmarks without losing scalability and speedup. Our optimization work has passed a production-level test in Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort that explores on optimizing the execution mechanism inside map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve the job execution performance.  相似文献   

3.
基于MapReduce的程序被越来越多地应用于大型数据分析的应用中.Apache Hadoop是最常用的开源MapReduce模型之一.程序运行时间的缩短对于MapReduce程序以及所有数据处理应用而言至关重要,而能够准确估算MapReduce程序的执行时间是优化程序的重要环节.本文定义了一个在Hadoop2.x版本...  相似文献   

4.
执行时间是作业调度的重要参考因素之一。通过分析Hadoop MapReduce环境作业的执行特征,提出了以map任务和reduce任务执行时间为输入,估算作业执行时间的方法。该方法在一定假设条件下,借助作业预执行来获取map任务和reduce任务的执行时间。实验结果表明,该方法估算作业执行时间的误差率小于7%。  相似文献   

5.
MapReduce is a popular parallel data-processing system, and task scheduling is one of the kernel techniques in MapReduce. In many applications, users have requirements that their MapReduce jobs should be completed before specific deadlines. Hence, in this paper, a novel scheduling algorithm based on the most effective sequence (SAMES) is proposed for deadline-constraint jobs in MapReduce. First, according to the characteristics of MapReduce, we propose a novel sequence-based execution strategy for MapReduce jobs and a new concept, the effective sequence (ES). Then, we design some efficient approaches for finding ESes and choose the most effective sequence (MES) for job execution. We also propose methods for MES-updates and exception handling. Finally, we verify the effectiveness of SAMES through experiments. The experimental results show that SAMES is an efficient scheduling algorithm for deadline-constraint jobs in MapReduce.  相似文献   

6.
廖彬  张陶  于炯  孙华 《计算机科学》2015,42(11):178-183
在数据量规模剧增的背景下,大数据处理过程中产生的高能耗问题亟待解决,而能耗模型是研究提高能耗效率方法的基础。利用传统的能耗模型计算MapReduce作业执行能耗面临诸多挑战,在对大数据计算模型MapReduce的集群结构、作业的任务分解及任务与资源映射模型分析建模的基础上,提出基于作业历史运行信息的MapReduce能耗预测模型。通过对不同作业历史运行信息的分析,得到DataNode运行不同任务时的计算能力及能耗特性,继而实现在MapReduce作业执行前对作业能耗的预测。实验结果验证了能耗预测模型的可行性,并通过对能耗预测准确率调节因子的修正,能够达到提高能耗模型的预测准确度的目的。  相似文献   

7.
MapReduce编程模型被广泛应用于大数据处理平台,而一个有效的任务调度算法对模型的运行效率至关重要。将MapReduce工作流的Map和Reduce阶段分别拆解为若干个有先后序限定关系的作业,每个作业再拆解为多个任务。之后基于计算集群的可用资源和任务异构性,构建面向作业和任务的2级有向无环图(DAG)模型,同时提出基于2级优先级排序的异构调度算法2-MRHS。算法的第1阶段进行优先级排序,即对作业和任务分别进行优先权值计算,再汇总得到任务的调度队列;第2阶段进行任务分配,即基于最快完成时间将每个任务所包含的数据块子任务分配给最适合的计算结点。采用大批量随机生成的DAG模型进行实验,结果表明与其他相关算法相比,本文算法有更短的调度长度(makespan)且更加稳定。  相似文献   

8.
MapReduce作业在洗牌阶段花费大量时间,因此有效的洗牌数据传输调度可以提高MapReduce的性能。数据中心网络中,常有一些周期性的数据流传输。在考虑已知这些周期性数据流传输的情况下,为MapReduce的洗牌数据传输调度问题建立了优化模型,并设计了一个有效的数据传输调度算法。在网络空闲时间段大小相同的情况下,证明了所提算法是近似比为3/2的近似算法。仿真实验结果表明,该算法能够有效地利用网络资源,减少洗牌数据流的调度长度。  相似文献   

9.
李振举  李学军  杨晟  刘涛 《计算机应用》2015,35(12):3374-3377
针对已有的MapReduce模型阶段划分粒度不合理导致模型精度和复杂度存在的问题,提出了阶段划分粒度为5的多阶段MapReduce模型(MR-Model)。首先综述了MapReduce模型的研究现状;然后将MapReduce划分为Read、Map、Shuffle、Reduce、Write共5个阶段,并对每个阶段的具体运行时间进行研究;最后通过实验对模型的预测性能进行验证。实验结果表明,提出的MR-Model可用来描述MapReduce实际任务的执行过程,与另外两种不同划分粒度的模型P-Model和H-Model相比,MR-Model模型的运行时间预测精度可以提高10%~30%,在Reduce阶段的运行时间预测精度可以提高2~3倍,综合性能较好。  相似文献   

10.
提出一种面向异构云计算环境的截止时间约束的MapReduce作业调度方法。使用加权偶图建模MapReduce作业调度问题,将Map任务及Reduce任务与资源槽分为2个节点集合,连接2个节点集合的边的权重为任务在资源槽上的执行时间。进而,使用整数线性规划求解最小加权偶图匹配,从而得到任务到资源槽的调度方案。本文考虑了云计算环境下异构节点任务处理时间的差异性,在线动态评估和调整任务的截止时间,从而提升了MapReduce作业处理的性能。实验结果表明,所提出的方法缩短了作业数据访问的时间,最小化了截止时间冲突的作业数量。  相似文献   

11.
A batch processing machine can simultaneously process several jobs forming a batch. This paper considers the problem of scheduling jobs with non-identical capacity requirements, on a single-batch processing machine of a given capacity, to minimize the makespan. The processing time of a batch is equal to the largest processing time of any job in the batch. We present some dominance properties for a general enumeration scheme and for the makespan criterion, and provide a branch and bound method. For large-scale problems, we use this enumeration scheme as a heuristic method.Scope and purposeUsually in classical scheduling problems, a machine can perform only one job at a time. Although, one can find machines that can process several jobs simultaneously as a batch. All jobs of a same batch have common starting and ending times. Batch processing machines are encountered in many different environments, such as burn-in operations in semiconductor industries or heat treatment operations in metalworking industries. In the first case, the capacity of the machine is defined by the number of jobs it can hold. In the second case, each job has a certain capacity requirement and the total size of a batch cannot exceed the capacity of the machine. Hence, the number of jobs contained in each batch may be different. In this paper, we consider this second case (which is more difficult) and we provide an exact method for the makespan criterion (minimizing the last ending time).  相似文献   

12.
针对采用MapReduce模型的大数据分析作业的调度问题进行深入研究,并分析现有任务调度算法的缺陷,现有算法没有考虑资源分配对于作业截止时间的影响,也未考虑不同类型作业截止时间的敏感性问题。因作业的完成时间随着分配资源的不同而改变,故称之为弹性作业,截止时间敏感性是指不同类型作业对截止时间要求的严格程度不同。针对以上问题,提出一种截止时间感知的弹性作业调度算法(DA)。该算法将作业依据截止时间敏感程度进行分类,在基于作业整体执行时间预测的基础上,通过调控不同的资源分配策略来改变作业完成时间,同时结合用户对于截止时间的需求及作业预执行的收益来提前规划作业的资源分配及调度次序使得整体收益最大化。将算法在仿真拥有210个物理节点的集群中进行实验,实验表明该算法满足了截止时间的限制并使得作业整体收益值平均提高了2.37倍。  相似文献   

13.
This paper proposes a practical job grouping approach, which aims to enhance the time related performance metrics of container transfers in the Patrick AutoStrad container terminal, located in Brisbane, Australia. It first formulates a mathematical model of the automated container transfers in a relatively complex environment. Apart from the consideration on collision avoidance of a fleet of large vehicles in a confined area, it also deals with many other difficult practical challenges such as the presence of multiple levels of container stacking and sequencing, variable container orientations, and vehicular dynamics that require finite acceleration and deceleration times. The proposed job grouping approach aims to improve the makespan of the schedule for yard jobs, while reducing straddle carrier waiting time by grouping jobs using a guiding function. The performance of the current sequential job allocation method and the proposed job grouping approach are evaluated and compared statistically using a pooled t-test for 30 randomly generated yard configurations. The experimental results show that the job grouping approach can effectively improve the schedule makespan and reduce the total straddle carrier waiting time.  相似文献   

14.
Job scheduling plays a critical role in resource utilisation in a grid computing environment. The heterogeneity of grid resources adds some challenges to the work of job scheduling especially when jobs have dependencies which can be represented as Direct Acyclic Graphs (DAGs). Heuristics have been developed for job scheduling optimisation. This paper presents six heuristic enhancements—MMSTFT for minimising both makespan and task finish time, levelU for upward DAG levelling, TMWD for matching tasks with data, Slack for prioritising task scheduling based on slack time, LSlack for levelling the Slack heuristic, and NLPETS for non-levelling of performance effective task scheduling (PETS). The performance of LSlack is amongst the best heuristics evaluated (with BL and LMT). Additionally, heuristic enhancements MMSTS and TMWD can significantly improve the makespan of generated schedules. To facilitate performance evaluation, a DAG simulator is implemented which provides a set of tools for DAG job configuration, execution and monitoring. The components of the DAG simulator are also presented in this paper.  相似文献   

15.
MapReduce and its open source implementation, Hadoop, have gained widespread adoption for parallel processing of big data jobs. Since the number of such big data jobs is also rapidly rising, reducing their energy consumption is increasingly more important to reduce environmental impact as well as operational costs. Prior work by Mashayekhy et al. (IEEE Trans. Parallel Distributed Syst. 26, 2720–2733, 2016), has tackled the problem of energy-aware scheduling of a single MapReduce job but we provide a far more efficient heuristic in this paper. We first model the problem as an Integer Linear Program to find the optimal solution using ILP solvers. Then we present a task-based greedy scheduling algorithm, TGSAVE, to select a slot for each task to minimize the total energy consumption of the MapReduce job for big data applications in heterogeneous environments without significant performance loss while satisfying the service level agreement (SLA). We perform several experiments on a Hadoop cluster to measure characteristics of tasks for nine different applications to evaluate our proposed algorithm. The results show that the total energy consumption of MapReduce jobs obtained by TGSAVE is up to 35% less than that achieved by EMRSA proposed in Mashayekhy et al. (IEEE Trans. Parallel Distributed Syst. 26, 2720–2733, 2016), its closest rival, for same workloads. Besides, TGSAVE is capable of finding a solution in same order of time for up to 74% tighter deadlines than the tightest deadline that EMRSA can find a feasible one. On average, TGSAVE solution is approximately 1.4% far from the optimal solution, and it can meet deadlines as tight as 12%, on average, above the energy-oblivious minimum makespan in the benchmarks we examined.  相似文献   

16.
The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable will have different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This paper uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model’s states. We analyze the accuracy of our predictors, which outperform existing approaches. We also propose and study several classes of schedulers that utilize the predictions, and a method for combining scheduling factors. We characterize the inherent tradeoff between job makespan and the number of evictions due to failure, and demonstrate how our schedulers can navigate this tradeoff under various scenarios. Lastly, we propose job replication techniques, which our schedulers utilize to replicate those jobs that are most likely to fail. Our replication strategies outperform others, as measured by improved makespan and fewer redundant operations. In particular, we define a new metric for replication efficiency, and demonstrate that our multi-state availability predictor can provide information that allows our schedulers to be more efficient than others that blindly replicate all jobs or some static percentage of jobs.  相似文献   

17.

Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow running tasks executing within a job substantially affect job performance completion. Such stragglers are a direct threat towards attaining fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of different mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions of possible future work for straggler research.

  相似文献   

18.
This paper addresses the problem of minimizing the scheduling length (make-span) of a batch of jobs with different arrival times. A job is described by a direct acyclic graph (DAG) of parallel tasks. The paper proposes a dynamic scheduling method that adapts the schedule when new jobs are submitted and that may change the processors assigned to a job during its execution. The scheduling method is divided into a scheduling strategy and a scheduling algorithm. We also propose an adaptation of the Heterogeneous Earliest-Finish-Time (HEFT) algorithm, called here P-HEFT, to handle parallel tasks in heterogeneous clusters with good efficiency without compromising the makespan. The results of a comparison of this algorithm with another DAG scheduler using a simulation of several machine configurations and job types shows that P-HEFT gives a shorter makespan for a single DAG but scores worse for multiple DAGs. Finally, the results of the dynamic scheduling of a batch of jobs using the proposed scheduler method showed significant improvements for more heavily loaded machines when compared to the alternative resource reservation approach.  相似文献   

19.
We consider the NP-hard problem of scheduling parallel jobs with release dates on identical parallel machines to minimize the makespan. A parallel job requires simultaneously a prespecified, job-dependent number of machines when being processed. We prove that the makespan of any nonpreemptive list-schedule is within a factor of 2 of the optimal preemptive makespan. This gives the best-known approximation algorithms for both the preemptive and the nonpreemptive variant of the problem. We also show that no list-scheduling algorithm can achieve a better performance guarantee than 2 for the nonpreemptive problem, no matter which priority list is chosen. List-scheduling also works in the online setting where jobs arrive over time and the length of a job becomes known only when it completes; it therefore yields a deterministic online algorithm with competitive ratio 2 as well. In addition, we consider a different online model in which jobs arrive one by one and need to be scheduled before the next job becomes known. We show that no list-scheduling algorithm has a constant competitive ratio. Still, we present the first online algorithm for scheduling parallel jobs with a constant competitive ratio in this context. We also prove a new information-theoretic lower bound of 2.25 for the competitive ratio of any deterministic online algorithm for this model. Moreover, we show that 6/5 is a lower bound for the competitive ratio of any deterministic online algorithm of the preemptive version of the model jobs arriving over time.  相似文献   

20.
Scheduling jobs under decreasing linear deterioration   总被引:1,自引:0,他引:1  
This paper considers the scheduling problems under decreasing linear deterioration. Deterioration of a job means that its processing time is a function of its execution start time. Optimal algorithms are presented respectively for single machine scheduling of minimizing the makespan, maximum lateness, maximum cost and number of late jobs. For two-machine flow shop scheduling problem to minimize the makespan, it is proved that the optimal schedule can be obtained by Johnson's rule. If the processing times of operations are equal for each job, flow shop scheduling problems can be transformed into single machine scheduling problems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号