首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
在异构Hadoop集群场景中, 为了缓和由于纠删码和副本存储模式混合使用, 以及服务器节点本身实时算力差异造成的MapReduce作业处理效率低下的问题, 本文实现了一种根据数据存储情况和节点实时负载来在多并发场景下动态调节MapReduce作业任务分配情况的调度策略. 该策略通过修改当前Hadoop框架中的数据存储选址策略并对节点任务并发量进行动态控制, 在多作业并发时实现更加均衡的作业间资源分配. 实验结果表明, 相较于Hadoop默认的两种作业调度策略, 本文提出的调度模式能够将作业完成时间缩短约17%, 并有效避免部分作业面临的饥饿现象.  相似文献   

2.

MapReduce framework is an effective method for big data parallel processing. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job scheduling. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate the job execution time due to the large number and high volume of the submitted jobs, limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, first by providing the job profiling tool, we obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool called job submission and monitoring tool is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than an unoptimized case.

  相似文献   

3.
The MapReduce framework has become the de facto standard for big data processing due to its attractive features and abilities. One is that it automatically parallelizes a job into multiple tasks and transparently handles task execution on a large cluster of commodity machines. The increasing heterogeneity of distributed environments may result in a few straggling tasks, which prolong job completion. Speculative execution is proposed to mitigate stragglers. However, the existing speculative execution mechanism could not work efficiently as many speculative tasks are still slower than their original tasks. In this paper, we explore an approach to increase the efficiency of speculative execution, and further improve MapReduce performance. We propose the Partial Speculative Execution (PSE) strategy to make speculative tasks start from the checkpoint. By leveraging the checkpoint of original tasks, PSE can eliminate the costs of re-reading, re-copying, and re-computing the processed data. We implement PSE in Hadoop, and evaluate its performance in terms of job completion time and the efficiency of speculative execution under several kinds of classical workloads. Experimental results show that, in heterogeneous environments with stragglers, PSE completes jobs 56 % faster than that with no speculation and 12 % faster than that with LATE, an improved speculative execution algorithm. In addition, on average PSE can improve the efficiency of speculative execution by 24 % compared to LATE.  相似文献   

4.
执行时间是作业调度的重要参考因素之一。通过分析Hadoop MapReduce环境作业的执行特征,提出了以map任务和reduce任务执行时间为输入,估算作业执行时间的方法。该方法在一定假设条件下,借助作业预执行来获取map任务和reduce任务的执行时间。实验结果表明,该方法估算作业执行时间的误差率小于7%。  相似文献   

5.
廖彬  张陶  于炯  孙华 《计算机科学》2015,42(11):178-183
在数据量规模剧增的背景下,大数据处理过程中产生的高能耗问题亟待解决,而能耗模型是研究提高能耗效率方法的基础。利用传统的能耗模型计算MapReduce作业执行能耗面临诸多挑战,在对大数据计算模型MapReduce的集群结构、作业的任务分解及任务与资源映射模型分析建模的基础上,提出基于作业历史运行信息的MapReduce能耗预测模型。通过对不同作业历史运行信息的分析,得到DataNode运行不同任务时的计算能力及能耗特性,继而实现在MapReduce作业执行前对作业能耗的预测。实验结果验证了能耗预测模型的可行性,并通过对能耗预测准确率调节因子的修正,能够达到提高能耗模型的预测准确度的目的。  相似文献   

6.
MapReduce编程模型被广泛应用于大数据处理平台,而一个有效的任务调度算法对模型的运行效率至关重要。将MapReduce工作流的Map和Reduce阶段分别拆解为若干个有先后序限定关系的作业,每个作业再拆解为多个任务。之后基于计算集群的可用资源和任务异构性,构建面向作业和任务的2级有向无环图(DAG)模型,同时提出基于2级优先级排序的异构调度算法2-MRHS。算法的第1阶段进行优先级排序,即对作业和任务分别进行优先权值计算,再汇总得到任务的调度队列;第2阶段进行任务分配,即基于最快完成时间将每个任务所包含的数据块子任务分配给最适合的计算结点。采用大批量随机生成的DAG模型进行实验,结果表明与其他相关算法相比,本文算法有更短的调度长度(makespan)且更加稳定。  相似文献   

7.
MapReduce has emerged as a popular programming model in the field of data-intensive computing. This is due to its simplistic design, which provides ease of use for programmers, and its framework implementations such as Hadoop, which have been adopted by large business and technology companies. In this paper we make some improvements to the Hadoop MapReduce framework by introducing algorithms that are suitable for heterogeneous environments. The goal is to efficiently perform data-intensive computing in heterogeneous environments. The need for these adaptations derives from the fact that, following the framework design proposed by Google, Hadoop is optimized to run in large homogeneous clusters. Hence we propose MRA++, a new MapReduce framework design that considers the heterogeneity of nodes during data distribution, task scheduling and job control. MRA++establishes a training task to gather information prior to the data distribution. However, we show that the delay introduced in the setup phase is offset by the effectiveness of the mechanisms and algorithms, that achieve performance gains of more than 70% in 10 Mbps networks.  相似文献   

8.
It is a fact that the attention of research community in computer science, business executives, and decision makers is drastically drawn by big data. As the volume of data becomes bigger, it needs performance‐oriented data‐intensive processing frameworks such as MapReduce, which can scale computation on large commodity clusters. Hadoop MapReduce processes data in Hadoop Distributed File System as jobs scheduled according to YARN fair scheduler and capacity scheduler. However, with advancement and dynamic changes in hardware and operating environments, the performance of clusters is greatly affected. Various efforts in literature have been made to address the issues of heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run‐time scheduling. In this paper, we present a survey to discuss various research efforts made so far to improve Hadoop MapReduce scheduling. We classify scheduling algorithms and techniques proposed in the literature so far based on their addressing areas and present a taxonomy. Furthermore, we also discuss various aspects of open issues and challenges in the scheduling of MapReduce to improve its performance. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

9.
Apache Hadoop becomes ubiquitous for cloud computing which provides resources as services for multi-tenant applications. YARN (a.k.a. MapReduce 2.0) is one of the key features in the second-generation Hadoop, which provides resource management and scheduling for large-scale MapReduce environments. Two enormous challenges in the YARN scheduler are the abilities to automatically tailor and control resource allocations to different jobs for achieving their Service Level Agreements (SLAs), and minimize energy consumption of the overall cloud computing system. In this work, we propose an SLA-aware energy-efficient scheduling scheme which allocates appropriate amount of resources to MapReduce applications with YARN architecture. In our task scheduling policy, We consider the data locality information to save the MapReduce network traffic. Furthermore, the slack time between the actual execution time of completed tasks and expected completion time of the application is utilized to improve the energy-efficiency of the system. An online userspace governor-based dynamic voltage and frequency scaling (DVFS) scheme is designed in the YARN per-application ApplicationMaster to dynamically change the CPU frequency for upcoming tasks given the slack time from previous completed tasks. Experimental evaluation shows that our proposed scheme outperforms the existing MapReduce scheduling policies in terms of both resource ultization and energy-efficiency.  相似文献   

10.
当今云计算环境下,Hadoop已经成为大数据处理的事实标准。然而云计算具有大规模、高复杂和动态性的特点,容易导致故障的发生,影响Hadoop上运行的作业。虽然Hadoop具有内置的故障检测和恢复机制,但云环境中不同节点负载大小的变化,被调度的作业仍然导致失败。针对此问题提出自响应故障感知的检测调度方法,对异构环境负载能力的不同,而做出服务器快节点和慢节点的判断,把作业分配调度到合适的节点上执行,调整任务决策来尽可能的防止任务失败的发生。最后在Hadoop框架下与基本调度器进行实验性能比较,结果显示该方法减少作业失败率最高达19%,并缩短了作业执行时间,同时也减少CPU和内存的使用。  相似文献   

11.
MapReduce is regarded as an adequate programming model for large-scale data-intensive applications. The Hadoop framework is a well-known MapReduce implementation that runs the MapReduce tasks on a cluster system. G-Hadoop is an extension of the Hadoop MapReduce framework with the functionality of allowing the MapReduce tasks to run on multiple clusters. However, G-Hadoop simply reuses the user authentication and job submission mechanism of Hadoop, which is designed for a single cluster. This work proposes a new security model for G-Hadoop. The security model is based on several security solutions such as public key cryptography and the SSL protocol, and is dedicatedly designed for distributed environments. This security framework simplifies the users authentication and job submission process of the current G-Hadoop implementation with a single-sign-on approach. In addition, the designed security framework provides a number of different security mechanisms to protect the G-Hadoop system from traditional attacks.  相似文献   

12.
MapReduce and its open source implementation, Hadoop, have gained widespread adoption for parallel processing of big data jobs. Since the number of such big data jobs is also rapidly rising, reducing their energy consumption is increasingly more important to reduce environmental impact as well as operational costs. Prior work by Mashayekhy et al. (IEEE Trans. Parallel Distributed Syst. 26, 2720–2733, 2016), has tackled the problem of energy-aware scheduling of a single MapReduce job but we provide a far more efficient heuristic in this paper. We first model the problem as an Integer Linear Program to find the optimal solution using ILP solvers. Then we present a task-based greedy scheduling algorithm, TGSAVE, to select a slot for each task to minimize the total energy consumption of the MapReduce job for big data applications in heterogeneous environments without significant performance loss while satisfying the service level agreement (SLA). We perform several experiments on a Hadoop cluster to measure characteristics of tasks for nine different applications to evaluate our proposed algorithm. The results show that the total energy consumption of MapReduce jobs obtained by TGSAVE is up to 35% less than that achieved by EMRSA proposed in Mashayekhy et al. (IEEE Trans. Parallel Distributed Syst. 26, 2720–2733, 2016), its closest rival, for same workloads. Besides, TGSAVE is capable of finding a solution in same order of time for up to 74% tighter deadlines than the tightest deadline that EMRSA can find a feasible one. On average, TGSAVE solution is approximately 1.4% far from the optimal solution, and it can meet deadlines as tight as 12%, on average, above the energy-oblivious minimum makespan in the benchmarks we examined.  相似文献   

13.
迭代式计算是一类重要的大数据分析应用.在分布式计算框架MapReduce上实现迭代计算时,计算会被分解成多个作业并按作业依存关系顺序运行,这使得程序与分布式文件系统(DFS)有多次交互而影响程序执行时间.对这些交互相关数据的缓存会降低与DFS的交互时间,进而提升程序总体的性能.考虑到集群中的大量内存在多数情况下会处于空闲状态,提出了一种使用内存缓存的迭代式应用编程框架MemLoop.该系统从作业提交API、调度算法、缓存管理模块实现缓存管理以充分利用内存缓存迭代间可驻留数据与迭代内依存数据.我们将此框架与已有相关框架进行了比较,实验结果表明该框架能够提升迭代程序的性能.  相似文献   

14.
iMapReduce: A Distributed Computing Framework for Iterative Computation   总被引:2,自引:0,他引:2  
Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This poses demand of distributed computing frameworks for processing massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative process that requires to parse data sets iteratively. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with the separated map and reduce functions, and provides the support of automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of creating new MapReduce jobs repeatedly, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop, and show that iMapReduce can achieve up to 5 times speedup over Hadoop for implementing iterative algorithms.  相似文献   

15.
MapReduce is a popular parallel data-processing system, and task scheduling is one of the kernel techniques in MapReduce. In many applications, users have requirements that their MapReduce jobs should be completed before specific deadlines. Hence, in this paper, a novel scheduling algorithm based on the most effective sequence (SAMES) is proposed for deadline-constraint jobs in MapReduce. First, according to the characteristics of MapReduce, we propose a novel sequence-based execution strategy for MapReduce jobs and a new concept, the effective sequence (ES). Then, we design some efficient approaches for finding ESes and choose the most effective sequence (MES) for job execution. We also propose methods for MES-updates and exception handling. Finally, we verify the effectiveness of SAMES through experiments. The experimental results show that SAMES is an efficient scheduling algorithm for deadline-constraint jobs in MapReduce.  相似文献   

16.
YARN is a resource management system widely used in Hadoop. It supports MapReduce, Spark, Storm and other computing frameworks, and has become the core component of big data ecology. However, in Hadoop YARN’s existing resource scheduler, a resource guarantee mechanism based on resource reservation, will produce resource fragmentations, leading to a waste of resources. In order to improve the resource utilization and throughput of the cluster, this paper proposes a resource allocation mechanism based on reservation and backfill. In this mechanism, based on the priority of the job, it decides whether to make a reservation to the resource and introduce a backfill strategy to backfill the resource without affecting the execution of the reservation job. Experiments show that the resource scheduling mechanism based on reserved backfill can effectively improve the resource utilization and throughput of Hadoop YARN cluster.  相似文献   

17.
针对人脸识别算法研究过程中测试效率低下的问题,基于分布式技术,设计并实现了通用的分布式大数据测试平台。为了提高人脸识别算法的大数据测试的执行效率,提高测试结果统计计算的执行效率,基于RabbitMQ设计分布式并行执行架构,利用Hadoop集群的MapReduce框架进行分布式并行计算。利用Java语言的Spring框架开发测试平台,将测试代码与测试图片托管于Hadoop集群的HDFS文件系统,实现了测试业务与测试平台的分离,提高了平台的通用性。该测试平台不仅实现了单个测试任务的分布式执行而且满足多个测试任务同时执行,可对测试任务以及测试相关的代码与数据进行有效的管理。与传统测试方法相比,该平台测试效率提高10余倍,测试图片的数量越大测试效率提升越明显。该测试平台具有业务通用性、容量可扩展性,对于其他人工智能算法的大量数据测试具有借鉴意义与参考价值。  相似文献   

18.
王玢  吴雅婧  阳小龙  孙奇福 《软件学报》2017,28(12):3385-3398
目前大数据处理过程较少关注任务所处理数据间的依赖关系,在任务执行过程中可能产生大量数据迁移,影响数据处理效率。为减少数据迁移,提升任务执行性能,从数据关联性及数据本地性两个角度出发,提出了一种数据关联性驱动的大数据处理任务优化调度方案:D3S2(Data-Dependency-Driven Scheduling Scheme)。D3S2由两部分组成:(1)数据关联性感知的数据优化放置机制(DAPM:Dependency-Aware Placement Mechanism),根据日志信息挖掘数据关联性,进而将强关联的数据聚合并放置于相同机架上,减少了跨机架的数据迁移;(2)数据迁移代价感知的任务优化调度机制(TASM:Transfer-Aware Scheduling Mechanism),完成数据放置后,以数据本地性为约束,对任务进行统一调度,最小化任务执行过程中的数据迁移代价。DAPM和TASM互相提供决策依据,以任务执行代价最小化为目标不断迭代调整调度方案,直至达到最优任务调度方案。之后在Hadoop平台上进行了验证,实验结果表明,较之原生Hadoop,在不增加作业完成时间的基础上,D3S2减少了作业执行过程中的数据迁移量。关键词:数据关联性;数据本地性;数据放置;任务调度;迁移代价感知中图法分类号:TP311  相似文献   

19.
提出一种面向异构云计算环境的截止时间约束的MapReduce作业调度方法。使用加权偶图建模MapReduce作业调度问题,将Map任务及Reduce任务与资源槽分为2个节点集合,连接2个节点集合的边的权重为任务在资源槽上的执行时间。进而,使用整数线性规划求解最小加权偶图匹配,从而得到任务到资源槽的调度方案。本文考虑了云计算环境下异构节点任务处理时间的差异性,在线动态评估和调整任务的截止时间,从而提升了MapReduce作业处理的性能。实验结果表明,所提出的方法缩短了作业数据访问的时间,最小化了截止时间冲突的作业数量。  相似文献   

20.
为了解决当前Hadoop集群在异构资源环境下固有的调度分配方法的不足,提出了一种基于节点能力的自适应调度算法NCAS(node capacity adaptive scheduling)。首先,NCAS算法根据节点性能、任务特征计算得到调度因子;然后,由调度因子确定各节点应分得的数据量与任务槽数;最后,将数据和任务多分给快节点同时少分给慢节点。实验结果表明,与传统的调度算法相比,NCAS算法大幅度减少了备份任务的启动数量,明显减少了作业完成时间,提升了任务执行效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号