首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without input data, which causes significant data access delay. Data locality is becoming one of the most critical factors to affect performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, in-memory prefetching input data is an effective way to improve data locality. However, it is still posing serious challenges to cluster designers on what and when to prefetch. To effectively use prefetching, we have built HPSO (High Performance Scheduling Optimizer), a prefetching service based task scheduler to improve data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on current pending tasks and then preload the needed data to memory without any delaying on launching new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. The experiment results have shown that the method can reduce the map tasks causing remote data delay, and improves the performance of Hadoop clusters.  相似文献   

2.
随着大规模的MapReduce集群广泛地用于大数据处理,特别是当有多个任务需要使用同一个Hadoop集群时,一个关键问题是如何最大限度地减少集群的工作时间,提高MapReduce作业的服务效率。可将多个MapReduce作业当做一个调度任务建模,观察发现多个任务的总完工时间和任务的执行顺序有密切关系。 研究目标是设计作业调度系统分析模型,最小化一批MapReduce作业的总完工时间。提出一个更好的调度策略和实现方法, 使整个调度系统符合经典Johnson算法的条件, 从而可使用经典Johnson算法在线性时间内获取总完工时间的最优解。同时,针对需要使用两个或多个资源池进行平衡的问题, 提出了一种线性时间解决方案, 优于已知的近似模拟方案。该理论模型可应用于提高系统响应速度、节能和负载均衡等方面, 对应的应用实例提供了证实。  相似文献   

3.
As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high-throughput of data than on low-latency of job execution. However, today more and more big data applications developed with MapReduce require quick response time. As a result, improving the performance of MapReduce jobs, especially for short jobs, is of great significance in practice and has attracted more and more attentions from both academia and industry. A lot of efforts have been made to improve the performance of Hadoop from job scheduling or job parameter optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First of all, by analyzing the job and task execution mechanism in MapReduce framework we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism for accelerating performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time cost of MapReduce jobs, especially for short jobs. Experimental results show that compared to the standard Hadoop, SHadoop can achieve stable performance improvement by around 25% on average for comprehensive benchmarks without losing scalability and speedup. Our optimization work has passed a production-level test in Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort that explores on optimizing the execution mechanism inside map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve the job execution performance.  相似文献   

4.
Data‐intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data‐intensive applications for massive data sets and an execution framework for large‐scale data processing on clusters of commodity servers. While fault tolerance, easy programming structure, and high scalability are considered strong points of MapReduce; however its configuration parameters must be fine‐tuned to the specific deployment, which makes it more complex in configuration and performance. This paper explains tuning of the Hadoop configuration parameters, which directly affect MapReduce's job workflow performance under various conditions to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) Extending a data redistribution technique in order to find the high‐performance nodes, (2) Utilizing the number of map/reduce slots in order to make it more efficient in terms of execution time, and (3) Developing a new hybrid routing schedule shuffle phase in order to define the scheduler task while memory management level is reduced.  相似文献   

5.
YARN is a resource management system widely used in Hadoop. It supports MapReduce, Spark, Storm and other computing frameworks, and has become the core component of big data ecology. However, in Hadoop YARN’s existing resource scheduler, a resource guarantee mechanism based on resource reservation, will produce resource fragmentations, leading to a waste of resources. In order to improve the resource utilization and throughput of the cluster, this paper proposes a resource allocation mechanism based on reservation and backfill. In this mechanism, based on the priority of the job, it decides whether to make a reservation to the resource and introduce a backfill strategy to backfill the resource without affecting the execution of the reservation job. Experiments show that the resource scheduling mechanism based on reserved backfill can effectively improve the resource utilization and throughput of Hadoop YARN cluster.  相似文献   

6.
在异构Hadoop集群场景中, 为了缓和由于纠删码和副本存储模式混合使用, 以及服务器节点本身实时算力差异造成的MapReduce作业处理效率低下的问题, 本文实现了一种根据数据存储情况和节点实时负载来在多并发场景下动态调节MapReduce作业任务分配情况的调度策略. 该策略通过修改当前Hadoop框架中的数据存储选址策略并对节点任务并发量进行动态控制, 在多作业并发时实现更加均衡的作业间资源分配. 实验结果表明, 相较于Hadoop默认的两种作业调度策略, 本文提出的调度模式能够将作业完成时间缩短约17%, 并有效避免部分作业面临的饥饿现象.  相似文献   

7.
MapReduce Job的调度机制一直是学术研究的热点。在分析MapReduce数据流调度模型的基础上,提出一种面向MapReduce数据流的公平调度方法FlowS。该方法采用数据流池来分配资源以保证MapReduce数据流的隔离性,并且采用数据流池动态构建算法来确保资源的公平分配。实验表明,该调度方法可以有效提高Hadoop集群对MapReduce数据流的处理效率。  相似文献   

8.
The core business of many companies depends on the timely analysis of large quantities of new data. MapReduce clusters that routinely process petabytes of data represent a new entity in the evolving landscape of clouds and data centers. During the lifetime of a data center, old hardware needs to be eventually replaced by new hardware. The hardware selection process needs to be driven by performance objectives of the existing production workloads. In this work, we present a general framework, called Ariel, that automates system administrators’ efforts for evaluating different hardware choices and predicting completion times of MapReduce applications for their migration to a Hadoop cluster based on the new hardware. The proposed framework consists of two key components: (i) a set of microbenchmarks to profile the MapReduce processing pipeline on a given platform, and (ii) a regression-based model that establishes a performance relationship between the source and target platforms. Benchmarking and model derivation can be done using a small test cluster based on new hardware. However, the designed model can be used for predicting the jobs’ completion time on a large Hadoop cluster and be applied for its sizing to achieve desirable service level objectives (SLOs). We validate the effectiveness of the proposed approach using a set of twelve realistic MapReduce applications and three different hardware platforms. The evaluation study justifies our design choices and shows that the derived model accurately predicts performance of the test applications. The predicted completion times of eleven applications (out of twelve) are within 10% of the measured completion times on the target platforms.  相似文献   

9.
MapReduce programming paradigm has been widely applied to solve large‐scale data‐intensive problems. Intensive studies of MapReduce scheduling have been carried out to improve MapReduce system performance. Delay scheduling is a common way to achieve high data locality and system performance. However, inappropriate delays can lead to low system throughput and potentially break the original job priority constraints. This paper proposes a deadline‐enabled delay (DLD) scheduling algorithm that optimizes job delay decisions according to real‐time resource availability and resource competition, while still meets job deadline constraints. Experimental results illustrate that the resource availability estimation method of DLD is accurate (92%). Compared with other approaches, DLD reduces job turnaround time by 22% in average while keeping a high locality rate (88%).Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

10.
MapReduce是云计算中重要的批数据处理框架,多任务共享MapReduce机群并满足任务实时性要求是调度算法急需解决的问题。提出两阶段实时调度算法,将调度划分为任务间调度和任务内调度。对于任务间调度,使用抽样法和经验值法确定子任务执行时间,利用该参数建立资源分配模型,动态确定任务优先级进行调度;对于子任务使用延迟调度策略进行调度,保证计算的本地性。实验结果显示,两阶段实时调度算法相比公平调度算法和FIFO算法,在保证吞吐量的同时能够满足任务实时性要求。  相似文献   

11.
同构Hadoop集群环境下改进的延迟调度算法   总被引:1,自引:1,他引:0  
在Hadoop框架下计算资源和数据资源可以在不同物理位置的特点产生本地化问题。延迟调度算法的产生旨在解决本地化问题, 此算法根据任务待处理数据的物理位置作为作业的计算节点, 调度任务至目标节点。但是可能出现同一作业中若干任务集中运行在某一计算节点, 导致作业达不到理想的并行效果。针对原有的延迟调度算法, 提出延迟一容量调度算法, 允许部分任务选择非本地化节点作为原延迟调度算法中任务的目标计算节点, 以提高作业的响应时间与增加作业的并行程度。最后通过实验对比分析, 改进后的算法在执行效率和并行效果明显优于原延迟调度算法。  相似文献   

12.
Apache Hadoop becomes ubiquitous for cloud computing which provides resources as services for multi-tenant applications. YARN (a.k.a. MapReduce 2.0) is one of the key features in the second-generation Hadoop, which provides resource management and scheduling for large-scale MapReduce environments. Two enormous challenges in the YARN scheduler are the abilities to automatically tailor and control resource allocations to different jobs for achieving their Service Level Agreements (SLAs), and minimize energy consumption of the overall cloud computing system. In this work, we propose an SLA-aware energy-efficient scheduling scheme which allocates appropriate amount of resources to MapReduce applications with YARN architecture. In our task scheduling policy, We consider the data locality information to save the MapReduce network traffic. Furthermore, the slack time between the actual execution time of completed tasks and expected completion time of the application is utilized to improve the energy-efficiency of the system. An online userspace governor-based dynamic voltage and frequency scaling (DVFS) scheme is designed in the YARN per-application ApplicationMaster to dynamically change the CPU frequency for upcoming tasks given the slack time from previous completed tasks. Experimental evaluation shows that our proposed scheme outperforms the existing MapReduce scheduling policies in terms of both resource ultization and energy-efficiency.  相似文献   

13.
MapReduce:新型的分布式并行计算编程模型   总被引:3,自引:0,他引:3  
MapReduce是Google提出的分布式并行计算编程模型,用于大规模数据的并行处理。Ma-pReduce模型受函数式编程语言的启发,将大规模数据处理作业拆分成若干个可独立运行的Map任务,分配到不同的机器上去执行,生成某种格式的中间文件,再由若干个Reduce任务合并这些中间文件获得最后的输出文件。用户在使用MapReduce模型进行大规模数据处理时,可以将主要精力放在如何编写Map和Reduce函数上,其它并行计算中的复杂问题诸如分布式文件系统、工作调度、容错、机器间通信等都交给MapReduce系统处理,在很大程度上降低了整个编程难度。MapReduce日益成为云计算平台的主流编程模型。Apache Hadoop项目提供开源的MapReduce系统还有待进一步完善。  相似文献   

14.
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter‐cluster communication models present different performance trade‐offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance. In this paper, we describe our experience with developing a pragmatic scheme and also a generic graph‐matching‐based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters as well as future resource and communication requirements known only at schedule time. The proposed graph‐matching‐based framework for cluster scheduling resolves the phase‐ordering and fixed‐ordering problem associated with earlier schemes for scheduling clustered VLIW architectures. The experimental evaluation in the context of a state‐of‐art commercial clustered architecture (using real‐world benchmark programs) reveals a significant performance improvement over the earlier proposals, which were mostly evaluated using compiled simulation of hypothetical clustered architectures. Our results clearly highlight the importance of considering the peculiarities of commercial clustered architectures and the hard‐nosed performance measurement. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

15.
随着基于Hadoop平台的大数据技术的不断发展和实践的深入,Hadoop YARN资源调度策略在异构集群中的不适用性越发明显。一方面,节点资源无法动态分配,导致优势节点的计算资源浪费、系统性能没有充分发挥;另一方面,现有的静态资源分配策略未考虑作业在不同执行阶段的差异,易产生大量资源碎片。基于以上问题,提出了一种负载自适应调度策略。监控集群执行节点和提交作业的性能信息,利用实时监控数据建模、量化节点的综合计算能力,结合节点和作业的性能信息在调度器上启动基于相似度评估的动态资源调度方案。优化后的系统能够有效识别集群节点的执行能力差异,并根据作业任务的实时需求进行细粒度的动态资源调度,在完善YARN现有调度语义的同时,可作为子级资源调度方案架构在上层调度器下。在Hadoop 2.0上实现并测试该策略,实验结果表明,作业的自适应资源调度策略显著提高了资源利用率,集群并发度提高了2到3倍,时间性能提升了近10%。  相似文献   

16.
YARN是Hadoop的一个分布式的资源管理系统,用来提高分布式集群的内存、I/O、网络、磁盘等资源的利用率.然而,YARN的配置参数众多,要对其人工调优并获得最佳的性能费时费力.本文在现有的YARN资源调度器的基础上,结合了一种闭环反馈控制方法,可在集群运行状态下动态地对MapReduce (MR)作业数进行优化,省去了人工调整参数的过程.实验表明,在YARN的容量调度器和公平调度器的基础上使用该方法,相比于默认配置,MR作业完成时间分别减少53%和14%左右.  相似文献   

17.
在大规模的Hadoop集群中,良好的任务调度策略对提高数据本地性、减小网络传输开销、减少作业执行时间以及提高集群的作业吞吐量都有着重要的影响。本文针对Hadoop架构中Reduce任务的数据本地性较低问题,提出了一种基于延迟调度策略的Reduce任务调度优化算法,通过提高Reduce任务的数据本地性来减少作业执行时间以及提高作业吞吐量,该算法在Hadoop架构的Early Shuffle阶段,使用多级延迟调度策略来提高Reduce任务的数据本地性。最后重写原生公平调度器代码实现了该调度算法,并与原生公平调度器进行了对比实验分析,实验结果表明该算法明显减少了作业执行时间,提高了集群的作业吞吐量。  相似文献   

18.
为了能有效处理海量数据,进行关联分析、商业预测等,Hadoop分布式云计算平台应运而生。但随着Hadoop的广泛应用,其作业调度方面的不足也显现出来,现有的多种作业调度器存在参数设置复杂、启动时间长等缺陷。借助于人工蜂群算法的自组织性强、收敛速度快的优势,设计并实现了能实时检测Hadoop内部资源使用情况的资源感知调度器。相比于原有的作业调度器,该调度器具有参数设置少、启动速度快等优势。基准测试结果表明,该调度器在异构集群上,调度资源密集型作业比原有调度器快10%~20%左右。  相似文献   

19.
This paper addresses the problem of minimizing the scheduling length (make-span) of a batch of jobs with different arrival times. A job is described by a direct acyclic graph (DAG) of parallel tasks. The paper proposes a dynamic scheduling method that adapts the schedule when new jobs are submitted and that may change the processors assigned to a job during its execution. The scheduling method is divided into a scheduling strategy and a scheduling algorithm. We also propose an adaptation of the Heterogeneous Earliest-Finish-Time (HEFT) algorithm, called here P-HEFT, to handle parallel tasks in heterogeneous clusters with good efficiency without compromising the makespan. The results of a comparison of this algorithm with another DAG scheduler using a simulation of several machine configurations and job types shows that P-HEFT gives a shorter makespan for a single DAG but scores worse for multiple DAGs. Finally, the results of the dynamic scheduling of a batch of jobs using the proposed scheduler method showed significant improvements for more heavily loaded machines when compared to the alternative resource reservation approach.  相似文献   

20.
MapReduce has emerged as a popular programming model in the field of data-intensive computing. This is due to its simplistic design, which provides ease of use for programmers, and its framework implementations such as Hadoop, which have been adopted by large business and technology companies. In this paper we make some improvements to the Hadoop MapReduce framework by introducing algorithms that are suitable for heterogeneous environments. The goal is to efficiently perform data-intensive computing in heterogeneous environments. The need for these adaptations derives from the fact that, following the framework design proposed by Google, Hadoop is optimized to run in large homogeneous clusters. Hence we propose MRA++, a new MapReduce framework design that considers the heterogeneity of nodes during data distribution, task scheduling and job control. MRA++establishes a training task to gather information prior to the data distribution. However, we show that the delay introduced in the setup phase is offset by the effectiveness of the mechanisms and algorithms, that achieve performance gains of more than 70% in 10 Mbps networks.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号