首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到17条相似文献,搜索用时 218 毫秒
1.
Hadoop MapReduce并行计算框架被广泛应用于大规模数据并行处理.近年来,由于其能较好地处理大规模数据,Hadoop MapReduce也被越来越多地使用在查询应用中.为了能够处理大规模数据集,Hadoop的基本设计更多地强调了数据的高吞吐率.然而在处理对短作业响应性能有较高要求的查询应用时,Hadoop MapReduce并行计算框架存在明显不足.为了提升Hadoop对于短作业的执行效率,对原有的Hadoop MapReduce作出以下3点优化:1)通过优化原有的setup和cleanup任务的执行方式,成功地缩短了作业初始化环境准备和作业结束环境清理的时间;2)将首次任务分配从"拉"模式转变为"推"模式;3)将作业执行过程中JobTracker和TaskTrackers之间的控制消息通信从现有的周期性心跳机制中分离出来,采用即时传递机制.最后,采用一种典型的基于MapReduce并行化的查询应用BLAST,对优化工作进行了评估.各种不同类型BLAST作业的测试实验表明,与现有的标准Hadoop相比,优化后的Hadoop平均执行性能提升约23%.  相似文献   

2.
当今云计算环境下,Hadoop已经成为大数据处理的事实标准。然而云计算具有大规模、高复杂和动态性的特点,容易导致故障的发生,影响Hadoop上运行的作业。虽然Hadoop具有内置的故障检测和恢复机制,但云环境中不同节点负载大小的变化,被调度的作业仍然导致失败。针对此问题提出自响应故障感知的检测调度方法,对异构环境负载能力的不同,而做出服务器快节点和慢节点的判断,把作业分配调度到合适的节点上执行,调整任务决策来尽可能的防止任务失败的发生。最后在Hadoop框架下与基本调度器进行实验性能比较,结果显示该方法减少作业失败率最高达19%,并缩短了作业执行时间,同时也减少CPU和内存的使用。  相似文献   

3.
Hadoop平台作为一个开源的在集群上运行大型数据库处理的框架受到了各个公司的青睐,然而要在Hadoop集群上运行一个作业必须手动设置将近200多个复杂的参数,如何设置这些参数对普通用户来说是非常困难的,该文针对这个问题提出了一种基于策略选择的抽样算法,通过在Hadoop中加入策略感知层,实验结果表明改进的Hadoop框架可以自动优化设置这些复杂的参数,从而提高整个系统的运行效率。  相似文献   

4.
执行时间是作业调度的重要参考因素之一。通过分析Hadoop MapReduce环境作业的执行特征,提出了以map任务和reduce任务执行时间为输入,估算作业执行时间的方法。该方法在一定假设条件下,借助作业预执行来获取map任务和reduce任务的执行时间。实验结果表明,该方法估算作业执行时间的误差率小于7%。  相似文献   

5.
为提高Hadoop作业调度的效率,增加云平台的吞吐率,提出了一种基于Hadoop云计算平台的作业调度算法。该算法在加权轮转调度算法的基础上,针对MapReduce的运行特点,增加了改进map任务本地性调度的因素,使得作业调度仍然保持了相对的公平性,并通过提高轮转周期内的map任务数据本地性,减少了任务的执行时间。实验结果证明,该调度算法与加权轮转调度算法相比,较好地提高了任务本地执行的比例,缩短了云计算系统内作业的总执行时间。  相似文献   

6.
针对Hadoop异构集群中计算和数据资源的不一致分布所导致的调度性能较低的缺点,设计了一种基于Hadoop集群和改进Late算法的并行作业调度算法;首先,介绍了基于Hadoop框架和Map-Reduce模型的调度原理,然后,在经典的Late调度算法的基础上,对Map任务和Reduce任务的各阶段执行时间进度比例进行存储和更新,为了进一步地提高调度效率,将慢任务迁移到本地化节点或离数据资源较近的物理节点上,并给了基于改进Late算法的作业调度流程;为了验证文中方法,在Hadoop集群系统上测试,设定1个为Jobtracker主控节点和7个为TaskTracker节点,实验结果表明文中方法能实现异构集群的作业调度,且与其它方法比较,具有较低的预测误差和较高的调度效率。  相似文献   

7.
在大规模的Hadoop集群中,良好的任务调度策略对提高数据本地性、减小网络传输开销、减少作业执行时间以及提高集群的作业吞吐量都有着重要的影响。本文针对Hadoop架构中Reduce任务的数据本地性较低问题,提出了一种基于延迟调度策略的Reduce任务调度优化算法,通过提高Reduce任务的数据本地性来减少作业执行时间以及提高作业吞吐量,该算法在Hadoop架构的Early Shuffle阶段,使用多级延迟调度策略来提高Reduce任务的数据本地性。最后重写原生公平调度器代码实现了该调度算法,并与原生公平调度器进行了对比实验分析,实验结果表明该算法明显减少了作业执行时间,提高了集群的作业吞吐量。  相似文献   

8.
由Apache软件基金会开发的Hadoop分布式系统基础架构,作为一个主流的云计算平台,其核心框架之一的MapReduce性能已经成为一个研究热点,其中对于Shuffle阶段的优化,使用Combine优化机制是关键.文章详细介绍了MapReduce计算框架及Shuffle流程;分别从机理简介、执行时机、运行条件三方面详细阐述了如何利用Combine优化机制;通过搭建Hadoop集群,运用MapReduce分布式算法测试实验数据.实验结果充分证明,正确地运用Combine优化机制能显著提高MapReduce框架的性能.  相似文献   

9.
基于蚁群优化算法的服务网格的作业调度   总被引:9,自引:0,他引:9  
提出了利用蚁群算法来优化服务网格的作业调度系统的方法和一个两层的作业调度模型,该模型可以在网格的动态和异构环境下实现对作业执行时间的预测,然后根据作业的预测执行时间并利用蚁群优化算法使适应函数取得最小值,从而得到最优化的作业调度。基于开发的校园网格实验床,通过实验显示该方法可以优化服务网格的性能,减少作业的平均执行时问,提高系统的吞吐率。  相似文献   

10.
针对Hadoop和Spark等大数据分析系统中无先验知识任务的高效执行问题,设计了基于累计工作量(CRW)的任务调度器CRWScheduler。该调度器根据CRW将任务在低权重队列与高权重队列间切换;在为作业分配资源时,同时考虑到作业所在的队列和其瞬时占用资源量,无需作业先验知识即显著提升系统性能。基于Apache Hadoop YARN实现了CRWScheduler原型,在28个节点的基准测试集群上的实验表明,与YARN的公平调度机制相比,作业流时间(JFT)平均降低21%,其中95百分位的作业流时间(JFT)最多降低了35%,并且在与任务级调度程序协作时可获得进一步的性能提升。  相似文献   

11.
As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high-throughput of data than on low-latency of job execution. However, today more and more big data applications developed with MapReduce require quick response time. As a result, improving the performance of MapReduce jobs, especially for short jobs, is of great significance in practice and has attracted more and more attentions from both academia and industry. A lot of efforts have been made to improve the performance of Hadoop from job scheduling or job parameter optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First of all, by analyzing the job and task execution mechanism in MapReduce framework we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism for accelerating performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time cost of MapReduce jobs, especially for short jobs. Experimental results show that compared to the standard Hadoop, SHadoop can achieve stable performance improvement by around 25% on average for comprehensive benchmarks without losing scalability and speedup. Our optimization work has passed a production-level test in Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort that explores on optimizing the execution mechanism inside map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve the job execution performance.  相似文献   

12.

MapReduce framework is an effective method for big data parallel processing. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job scheduling. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate the job execution time due to the large number and high volume of the submitted jobs, limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, first by providing the job profiling tool, we obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool called job submission and monitoring tool is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than an unoptimized case.

  相似文献   

13.
基于MapReduce的程序被越来越多地应用于大型数据分析的应用中.Apache Hadoop是最常用的开源MapReduce模型之一.程序运行时间的缩短对于MapReduce程序以及所有数据处理应用而言至关重要,而能够准确估算MapReduce程序的执行时间是优化程序的重要环节.本文定义了一个在Hadoop2.x版本...  相似文献   

14.
MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of such paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem, at the same time, that may provide reasonably accurate job response time estimation at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they could not be applied to Hadoop 2.x due to fundamental architectural changes and dynamic resource allocation in Hadoop 2.x. Thus, the proposed solution is based on an existing performance model for Hadoop 1.x, but taking into consideration architectural changes and capturing the execution flow of a MapReduce job by using queuing network model. This way, the cost model reflects the intra-job synchronization constraints that occur due the contention at shared resources. The accuracy of our solution is validated via comparison of our model estimates against measurements in a real Hadoop 2.x setup.  相似文献   

15.
Spark Streaming作为主流的开源分布式流分析框架,性能优化是目前的研究热点之一。在Spark Streaming性能优化中,业务场景下的配置参数优化是其性能提升的重要因素。在Spark Streaming系统中,可配置的参数有200多个,对参数调优人员的经验要求较高,未经优化的参数配置会影响流作业执行性能。因此,针对Spark Streaming的参数配置优化问题,提出一种基于深度强化学习的Spark Streaming参数优化方法(DQN-SSPO),将Spark Streaming参数优化配置问题转化为深度强化学习模型训练中的最大回报获得问题,并提出权重状态空间转移方法来增加模型训练获得高反馈奖励的概率。在3种典型的流分析任务上进行实验,结果表明经参数优化后Spark Streaming上的流作业性能在总调度时间上平均缩减27.93%,在总处理时间上平均缩减42%。  相似文献   

16.
Data‐intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data‐intensive applications for massive data sets and an execution framework for large‐scale data processing on clusters of commodity servers. While fault tolerance, easy programming structure, and high scalability are considered strong points of MapReduce; however its configuration parameters must be fine‐tuned to the specific deployment, which makes it more complex in configuration and performance. This paper explains tuning of the Hadoop configuration parameters, which directly affect MapReduce's job workflow performance under various conditions to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) Extending a data redistribution technique in order to find the high‐performance nodes, (2) Utilizing the number of map/reduce slots in order to make it more efficient in terms of execution time, and (3) Developing a new hybrid routing schedule shuffle phase in order to define the scheduler task while memory management level is reduced.  相似文献   

17.
对Hadoop平台的作业调度算法进行了研究, 提出了支持作业类型区分的多队列调度优化算法。优化算法支持根据节点当前的负载情况分配不同类型的作业, 以提高节点的资源利用率; 允许作业队列的资源在闲置时被其他作业队列占用; 在原作业队列需要时可以被即时回收, 即回收过程支持任务抢占; 采用共享队列列表和非共享队列列表的逻辑划分来防止乒乓效应。Hadoop平台的性能测试结果表明, 优化算法相比系统默认算法在作业调度的执行效率、执行平稳性等方面都有了显著的提升。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号