首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
MapReduce是一个能够对大规模数据进行分布式处理的框架,目前被各个领域广泛应用。在提供MapReduce服务的集群中,如何保证不同优先级用户的截止时间限定是MapReduce作业调度问题的一个挑战。针对这一问题,提出了一个基于排队网络的多优先级作业调度算法(MPSA)。首先分析和归纳了基于MapReduce模型的算法,提出了三种常见模式,采用Jackson排队网络对基于MapReduce模型的算法建立了数学模型,应用该网络模型可以求出不同优先级队列对资源的需求;随后使用AR(1)模型进行预测,使算法可以动态地适应不同的用户访问量;利用二分查找算法,分步计算出不同优先级在map阶段和reduce阶段分配的槽位数;最后实现了在MapReduce模型中应用的实时调度算法。实验结果表明,与传统的FIFO和公平调度算法相比,本文提出的算法在用户到达率和任务规模变化的情况下,可以更加有效地满足不同优先级用户的截止时间限定。  相似文献   

2.
In this paper, the online and offline scheduling schemes towards maritime Cyber Physical Systems (CPSs), to transmit video packets generating from the interior of vessel. During the sailing from the origin port to destination port, the video packets could be delivered via the infostations shoreside. The video packets have their respective release times, deadlines, weights and processing time. The video packets only could be successfully transmitted before their deadlines. A mathematic job-machine problem is mapped. Facing distinguished challenges with unique characteristics imposed in maritime scenario, we focus on the heterogeneous networking and resource optimal scheduling technology to provide valuable insights on the data transmission scheduling via this system. We aim to maximize the weight of delivered packets totally, three algorithms, an offline algorithm, an online ADMISSION Algorithm with no bounded processing times, as well as Exponential-Capacity Algorithm with bounded processing times are developed. Moreover, we induct the approximation ratio and competitive ratios of the proposed algorithms respectively. Finally, we verify the performance of the potential solutions for resource scheduling through comparison simulation.  相似文献   

3.

MapReduce framework is an effective method for big data parallel processing. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job scheduling. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate the job execution time due to the large number and high volume of the submitted jobs, limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, first by providing the job profiling tool, we obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool called job submission and monitoring tool is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than an unoptimized case.

  相似文献   

4.
MapReduce in MPI for Large-scale graph algorithms   总被引:1,自引:0,他引:1  
  相似文献   

5.
云计算为大数据处理提供了一种强大而高效的解决方案.在此模式下,数据管理者(Data Manager,DM)可以租用多个数据中心以实时处理地理分散的数据.然而,由于数据产生的动态性以及资源价格的波动性,将数据迁移至哪些数据中心并提供合适的计算资源来处理它们成为DM低成本处理多源数据的一大问题.本文首先将以上问题转换成联合随机优化问题,然后利用李雅普诺夫(Lyapunov)优化框架将原问题分解成两个独立的子问题进行求解,最后基于求解结果设计在线算法.理论分析表明,所提算法可不断趋近线下最优解并能够保证数据处理时延.在WorldCup98和Youtube数据集上的实验验证了理论分析结果的正确性以及本方法的优越性.  相似文献   

6.
随着大规模的MapReduce集群广泛地用于大数据处理,特别是当有多个任务需要使用同一个Hadoop集群时,一个关键问题是如何最大限度地减少集群的工作时间,提高MapReduce作业的服务效率。可将多个MapReduce作业当做一个调度任务建模,观察发现多个任务的总完工时间和任务的执行顺序有密切关系。 研究目标是设计作业调度系统分析模型,最小化一批MapReduce作业的总完工时间。提出一个更好的调度策略和实现方法, 使整个调度系统符合经典Johnson算法的条件, 从而可使用经典Johnson算法在线性时间内获取总完工时间的最优解。同时,针对需要使用两个或多个资源池进行平衡的问题, 提出了一种线性时间解决方案, 优于已知的近似模拟方案。该理论模型可应用于提高系统响应速度、节能和负载均衡等方面, 对应的应用实例提供了证实。  相似文献   

7.
Adapting scientific computing problems to clouds using MapReduce   总被引:1,自引:0,他引:1  
Cloud computing, with its promise of virtually infinite resources, seems to suit well in solving resource greedy scientific computing problems. To study this, we established a scientific computing cloud (SciCloud) project and environment on our internal clusters. The main goal of the project is to study the scope of establishing private clouds at the universities. With these clouds, students and researchers can efficiently use the already existing resources of university computer networks, in solving computationally intensive scientific, mathematical, and academic problems. However, to be able to run the scientific computing applications on the cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. This paper summarizes the challenges associated with reducing iterative algorithms to the MapReduce model. Algorithms used by scientific computing are divided into different classes by how they can be adapted to the MapReduce model; examples from each such class are reduced to the MapReduce model and their performance is measured and analyzed. The study mainly focuses on the Hadoop MapReduce framework but also compares it to an alternative MapReduce framework called Twister, which is specifically designed for iterative algorithms. The analysis shows that Hadoop MapReduce has significant trouble with iterative problems while it suits well for embarrassingly parallel problems, and that Twister can handle iterative problems much more efficiently. This work shows how to adapt algorithms from each class into the MapReduce model, what affects the efficiency and scalability of algorithms in each class and allows us to judge which framework is more efficient for each of them, by mapping the advantages and disadvantages of the two frameworks. This study is of significant importance for scientific computing as it often uses complex iterative methods to solve critical problems and adapting such methods to cloud computing frameworks is not a trivial task.  相似文献   

8.
Reliability and real-time requirements bring new challenges to the energy-constrained wireless sensor networks, especially to the industrial wireless sensor networks. Meanwhile, the capacity of wireless sensor networks can be substantially increased by operating on multiple nonoverlapping channels. In this context, new routing, scheduling, and power control algorithms are required to achieve reliable and real-time communications and to fully utilize the increased bandwidth in multichannel wireless sensor networks. In this paper, we develop a distributed and online algorithm that jointly solves multipath routing, link scheduling, and power control problem, which can adapt automatically to the changes in the network topology and offered load. We particularly focus on finding the resource allocation that realizes trade-off among energy consumption, end-to-end delay, and network throughput for multichannel networks with physical interference model. Our algorithm jointly considers 1) delay and energy-aware power control for optimal transmission radius and rate with physical interference model, 2) throughput efficient multipath routing based on the given optimal transmission rate between the given source-destination pairs, and 3) reliable-aware and throughput efficient multichannel maximal link scheduling for time slots and channels based on the designated paths, and the new physical interference model that is updated by the optimal transmission radius. By proving and simulation, we show that our algorithm is provably efficient compared with the optimal centralized and offline algorithm and other comparable algorithms.  相似文献   

9.
MapReduce大数据处理平台与算法研究进展   总被引:1,自引:1,他引:0  
本文综述了近年来基于MapReduce编程模型的大数据处理平台与算法的研究进展。首先介绍了12个典型的基于MapReduce的大数据处理平台,分析对比它们的实现原理和适用场景,抽象它们的共性。随后介绍基于MapReduce的大数据分析算法,包括搜索算法、数据清洗/变换算法、聚集算法、连接算法、排序算法、偏好查询、最优化算法、图算法、数据挖掘算法。将这些算法按MapReduce实现方式分类,分析影响这算法性能的因素。最后,将大数据处理算法抽象为外存算法,并对外存算法的特征加以梳理,提出了普适的外存算法性能优化方法的研究思路和研究问题,以供研究人员参考。具体包括优化外存算法的磁盘I/O,优化外存算法的局部性,以及设计增量式迭代算法。现有大数据处理平台和算法研究多集中在基于资源分配和任务调度的平台动态性能优化、特定算法并行化、特定算法性能优化等领域,本文提出的外存算法性能优化属于静态优化方法,是现有研究的良好补充,为研究人员提供了广阔的研究空间。  相似文献   

10.
提出一种GPU集群下用户服务质量QoS感知的深度学习研发平台上的动态任务调度方法.采用离线评估模块对深度学习任务进行离线评测并构建计算性能预测模型.在线调度模块基于性能预测模型,结合任务的预期QoS,共同开展任务放置和任务执行顺序的调度.在一个分布式GPU集群实例上的实验表明,该方法相比其他基准策略能够实现更高的QoS保证率和集群资源利用率.  相似文献   

11.
在异构Hadoop集群场景中, 为了缓和由于纠删码和副本存储模式混合使用, 以及服务器节点本身实时算力差异造成的MapReduce作业处理效率低下的问题, 本文实现了一种根据数据存储情况和节点实时负载来在多并发场景下动态调节MapReduce作业任务分配情况的调度策略. 该策略通过修改当前Hadoop框架中的数据存储选址策略并对节点任务并发量进行动态控制, 在多作业并发时实现更加均衡的作业间资源分配. 实验结果表明, 相较于Hadoop默认的两种作业调度策略, 本文提出的调度模式能够将作业完成时间缩短约17%, 并有效避免部分作业面临的饥饿现象.  相似文献   

12.

Recent trends in big data have shown that the amount of data continues to increase at an exponential rate. This trend has inspired many researchers over the past few years to explore new research direction of studies related to multiple areas of big data. The widespread popularity of big data processing platforms using MapReduce framework is the growing demand to further optimize their performance for various purposes. In particular, enhancing resources and jobs scheduling are becoming critical since they fundamentally determine whether the applications can achieve the performance goals in different use cases. Scheduling plays an important role in big data, mainly in reducing the execution time and cost of processing. This paper aims to survey the research undertaken in the field of scheduling in big data platforms. Moreover, this paper analyzed scheduling in MapReduce on two aspects: taxonomy and performance evaluation. The research progress in MapReduce scheduling algorithms is also discussed. The limitations of existing MapReduce scheduling algorithms and exploit future research opportunities are pointed out in the paper for easy identification by researchers. Our study can serve as the benchmark to expert researchers for proposing a novel MapReduce scheduling algorithm. However, for novice researchers, the study can be used as a starting point.

  相似文献   

13.
MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on mapped data. In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers the statistics on their sizes in the process of mapping. The micropartitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps, micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micropartition scheduling for allocation commitment. We present an optimal solution with the time complexity of O(K · N2), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with the time complexity of O(K · N ln N). Then, we propose a minmax programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.  相似文献   

14.
15.
MapReduce是云计算中重要的批数据处理框架,多任务共享MapReduce机群并满足任务实时性要求是调度算法急需解决的问题。提出两阶段实时调度算法,将调度划分为任务间调度和任务内调度。对于任务间调度,使用抽样法和经验值法确定子任务执行时间,利用该参数建立资源分配模型,动态确定任务优先级进行调度;对于子任务使用延迟调度策略进行调度,保证计算的本地性。实验结果显示,两阶段实时调度算法相比公平调度算法和FIFO算法,在保证吞吐量的同时能够满足任务实时性要求。  相似文献   

16.
Apache Hadoop becomes ubiquitous for cloud computing which provides resources as services for multi-tenant applications. YARN (a.k.a. MapReduce 2.0) is one of the key features in the second-generation Hadoop, which provides resource management and scheduling for large-scale MapReduce environments. Two enormous challenges in the YARN scheduler are the abilities to automatically tailor and control resource allocations to different jobs for achieving their Service Level Agreements (SLAs), and minimize energy consumption of the overall cloud computing system. In this work, we propose an SLA-aware energy-efficient scheduling scheme which allocates appropriate amount of resources to MapReduce applications with YARN architecture. In our task scheduling policy, We consider the data locality information to save the MapReduce network traffic. Furthermore, the slack time between the actual execution time of completed tasks and expected completion time of the application is utilized to improve the energy-efficiency of the system. An online userspace governor-based dynamic voltage and frequency scaling (DVFS) scheme is designed in the YARN per-application ApplicationMaster to dynamically change the CPU frequency for upcoming tasks given the slack time from previous completed tasks. Experimental evaluation shows that our proposed scheme outperforms the existing MapReduce scheduling policies in terms of both resource ultization and energy-efficiency.  相似文献   

17.
It is a fact that the attention of research community in computer science, business executives, and decision makers is drastically drawn by big data. As the volume of data becomes bigger, it needs performance‐oriented data‐intensive processing frameworks such as MapReduce, which can scale computation on large commodity clusters. Hadoop MapReduce processes data in Hadoop Distributed File System as jobs scheduled according to YARN fair scheduler and capacity scheduler. However, with advancement and dynamic changes in hardware and operating environments, the performance of clusters is greatly affected. Various efforts in literature have been made to address the issues of heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run‐time scheduling. In this paper, we present a survey to discuss various research efforts made so far to improve Hadoop MapReduce scheduling. We classify scheduling algorithms and techniques proposed in the literature so far based on their addressing areas and present a taxonomy. Furthermore, we also discuss various aspects of open issues and challenges in the scheduling of MapReduce to improve its performance. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

18.
MapReduce Job的调度机制一直是学术研究的热点。在分析MapReduce数据流调度模型的基础上,提出一种面向MapReduce数据流的公平调度方法FlowS。该方法采用数据流池来分配资源以保证MapReduce数据流的隔离性,并且采用数据流池动态构建算法来确保资源的公平分配。实验表明,该调度方法可以有效提高Hadoop集群对MapReduce数据流的处理效率。  相似文献   

19.
针对MapReduce中允许map和shuffle阶段重叠的优化模型需要自适应性的问题,提出了基于此模型的机器学习的资源调度算法,利用贝叶斯分类器依据作业对系统资源的需求和系统环境的匹配程度对作业进行调度,并不断更新分类器,使其具有自适应性,考虑了map和shuffle的重叠阶段。通过模拟实验验证,改进后的算法能够提高MapReduce系统的性能,获得更好的平均响应时间。  相似文献   

20.
The core business of many companies depends on the timely analysis of large quantities of new data. MapReduce clusters that routinely process petabytes of data represent a new entity in the evolving landscape of clouds and data centers. During the lifetime of a data center, old hardware needs to be eventually replaced by new hardware. The hardware selection process needs to be driven by performance objectives of the existing production workloads. In this work, we present a general framework, called Ariel, that automates system administrators’ efforts for evaluating different hardware choices and predicting completion times of MapReduce applications for their migration to a Hadoop cluster based on the new hardware. The proposed framework consists of two key components: (i) a set of microbenchmarks to profile the MapReduce processing pipeline on a given platform, and (ii) a regression-based model that establishes a performance relationship between the source and target platforms. Benchmarking and model derivation can be done using a small test cluster based on new hardware. However, the designed model can be used for predicting the jobs’ completion time on a large Hadoop cluster and be applied for its sizing to achieve desirable service level objectives (SLOs). We validate the effectiveness of the proposed approach using a set of twelve realistic MapReduce applications and three different hardware platforms. The evaluation study justifies our design choices and shows that the derived model accurately predicts performance of the test applications. The predicted completion times of eleven applications (out of twelve) are within 10% of the measured completion times on the target platforms.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号