期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Rong Gu Xiaoliang Yang Jinshuang Yan Yuanhao Sun Bing Wang Chunfeng Yuan Yihua Huang 《Journal of Parallel and Distributed Computing》2014

As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high-throughput of data than on low-latency of job execution. However, today more and more big data applications developed with MapReduce require quick response time. As a result, improving the performance of MapReduce jobs, especially for short jobs, is of great significance in practice and has attracted more and more attentions from both academia and industry. A lot of efforts have been made to improve the performance of Hadoop from job scheduling or job parameter optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First of all, by analyzing the job and task execution mechanism in MapReduce framework we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism for accelerating performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time cost of MapReduce jobs, especially for short jobs. Experimental results show that compared to the standard Hadoop, SHadoop can achieve stable performance improvement by around 25% on average for comprehensive benchmarks without losing scalability and speedup. Our optimization work has passed a production-level test in Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort that explores on optimizing the execution mechanism inside map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve the job execution performance. 相似文献

2.

混合存储模式下MapReduce作业调度

杨振宇牛天洋吕敏《计算机系统应用》2023,32(3):70-85

在异构Hadoop集群场景中, 为了缓和由于纠删码和副本存储模式混合使用, 以及服务器节点本身实时算力差异造成的MapReduce作业处理效率低下的问题, 本文实现了一种根据数据存储情况和节点实时负载来在多并发场景下动态调节MapReduce作业任务分配情况的调度策略. 该策略通过修改当前Hadoop框架中的数据存储选址策略并对节点任务并发量进行动态控制, 在多作业并发时实现更加均衡的作业间资源分配. 实验结果表明, 相较于Hadoop默认的两种作业调度策略, 本文提出的调度模式能够将作业完成时间缩短约17%, 并有效避免部分作业面临的饥饿现象. 相似文献

3.

Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Gandomi Abolfazl Movaghar Ali Reshadi Midia Khademzadeh Ahmad 《The Journal of supercomputing》2020,76(9):7177-7203

MapReduce framework is an effective method for big data parallel processing. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job scheduling. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate the job execution time due to the large number and high volume of the submitted jobs, limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, first by providing the job profiling tool, we obtain the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool called job submission and monitoring tool is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than an unoptimized case.

相似文献

4.

一种适应数据与计算密集型任务的私有云系统实现研究*

杨志豪赵太银姚兴苗李磊《计算机应用研究》2011,28(2):620-623

与公有云计算相比,针对数据与计算双重密集型任务的私有云计算系统对计算效率和系统管理效率提出了更高的要求,目前的公有云计算系统显得过于复杂和繁琐,因此需要一种简便易用的能够适应数据与计算密集型任务的私有云计算系统实现。借鉴公有云计算的相关理论和实现方法,提出了一种针对数据与计算双重密集型任务的私有云计算系统实现方案。该方案通过作业文件描述用户的计算任务,确定计算任务的计算模型和计算的输入输出文件;针对私有云的特点,简化Google云计算系统的MapReduce并行处理框架,得到更加直观的数据计算模型;自动连相似文献

5.

MRA++: Scheduling and data placement on MapReduce for heterogeneous environments

《Future Generation Computer Systems》2015

MapReduce has emerged as a popular programming model in the field of data-intensive computing. This is due to its simplistic design, which provides ease of use for programmers, and its framework implementations such as Hadoop, which have been adopted by large business and technology companies. In this paper we make some improvements to the Hadoop MapReduce framework by introducing algorithms that are suitable for heterogeneous environments. The goal is to efficiently perform data-intensive computing in heterogeneous environments. The need for these adaptations derives from the fact that, following the framework design proposed by Google, Hadoop is optimized to run in large homogeneous clusters. Hence we propose MRA++, a new MapReduce framework design that considers the heterogeneity of nodes during data distribution, task scheduling and job control. MRA++establishes a training task to gather information prior to the data distribution. However, we show that the delay introduced in the setup phase is offset by the effectiveness of the mechanisms and algorithms, that achieve performance gains of more than 70% in 10 Mbps networks. 相似文献

6.

基于MapReduce的分布式光线跟踪的设计与实现 总被引：1，自引：0，他引：1

下载免费PDF全文

郑欣杰朱程荣熊齐邦《计算机工程》2007,33(22):83-85

提出了基于MapReduce架构实现分布式光线跟踪渲染的方案。该方案基于Hadoop实现,利用MapReduce架构简化了分布式程序设计。使用分布式计算进行光线跟踪,充分利用了现有低端硬件设备的处理能力。实验表明,该方案通过并行计算大大加快了渲染速度。相似文献

7.

基于最小生成树的大规模数据分类模型及其MapReduce实现

黄鑫罗军《集成技术》2013,2(2):69-82

数据的快速增长,为我们提供了更多的信息,然而,也对传统信息获取技术提出了挑战。这篇论文提出了MCMM算法,它是基于MapReduce的大规模数据分类模型的最小生成树（MST）的算法。它可以看做是介于传统的KNN方法和基于聚类分类方法之间的模型,旨在克服这两种方法的不足并能处理大规模的数据。在这一模型中,训练集作为有权重的无向完全图来处理。顶点是对象,两点之间边的权重是对象间的距离。这一距离,不同于欧几里得距离,它是一个特定的距离度量。这样,可以找到图中最小生成树集,其中,图中每棵树代表一个类。为了降低时间复杂度,提取了每棵树中最具代表性的点来代表该树。这些压缩了的点集,可以通过计算无标签对象和它们之间的距离,来进行分类。MCMM模型基于MapReduce实现并且部署在Hadoop平台。该模型可扩展处理大规模的数据,是因为Hadoop支持数据密集分布应用,并且这些应用可以和数以千计的节点和数据一起运作。另外,MapReduce 和Hadoop能在由商品机组成的集群上很好的运行。MCMM模型使用云平台并且通过使用MapReduce 和Hadoop进行云计算是有益处的。实验采用的数据集包括从UCI数据库得到的真实数据和一些模拟数据,实验使用了4000个集群。实验表明,MCMM模型在精确度和扩展性上优于KNN和其他一些经常使用的基础分类方法。相似文献

8.

适应大规模数据处理的动态服务私有云系统

汪竹梅林李磊赵太银胡光岷《计算机应用》2012,32(4):1009-1012

为适应私有云环境下数据量大、计算密集、流程复杂的计算任务需求,借鉴公有云计算的相关理论与技术,结合私有云环境的特点,提出了一种适应大规模数据处理的动态服务私有云系统实现方案。该方案使用作业文件描述计算任务,以作业逻辑结构动态构建处理工作流程;通过数据流驱动服务请求,引入MapReduce并行框架进行大规模数据处理。实验结果表明：该方案能够正确有效地处理数据量大、计算密集、流程复杂的计算任务,显著提升处理效率,具有很高的实用性。相似文献

9.

Cloud-based automatic test data generation framework

《Journal of Computer and System Sciences》2016,82(5):712-738

Designing test cases is one of the most crucial activities in software testing process. Manual test case design might result in inadequate testing outputs due to lack of expertise and/or skill requirements. This article delivers automatic test data generation framework by effectively utilizing soft computing technique with Apache Hadoop MapReduce as the parallelization framework. We have evaluated and analyzed statistically our proposed framework using real world open source libraries. The experimental results conducted on Hadoop cluster with ten nodes are effective and our framework significantly outperforms other existing cloud-based testing models. 相似文献

10.

A MapReduce-based scalable discovery and indexing of structured big data

《Future Generation Computer Systems》2017

Various methods and techniques have been proposed in past for improving performance of queries on structured and unstructured data. The paper proposes a parallel B-Tree index in the MapReduce framework for improving efficiency of random reads over the existing approaches. The benefit of using the MapReduce framework is that it encapsulates the complexity of implementing parallelism and fault tolerance from users and presents these in a user friendly way. The proposed index reduces the number of data accesses for range queries and thus improves efficiency. The B-Tree index on MapReduce is implemented in a chained-MapReduce process that reduces intermediate data access time between successive map and reduce functions, and improves efficiency. Finally, five performance metrics have been used to validate the performance of proposed index for range search query in MapReduce, such as, varying cluster size and, size of range search query coverage on execution time, the number of map tasks and size of Input/Output (I/O) data. The effect of varying Hadoop Distributed File System (HDFS) block size and, analysis of the size of heap memory and intermediate data generated during map and reduce functions also shows the superiority of the proposed index. It is observed through experimental results that the parallel B-Tree index along with a chained-MapReduce environment performs better than default non-indexed dataset of the Hadoop and B-Tree like Global Index (Zhao et al., 2012) in MapReduce. 相似文献

11.

MARIANE: Using MApReduce in HPC environments

《Future Generation Computer Systems》2014

MapReduce is increasingly becoming a popular programming model. However, the widely used implementation, Apache Hadoop, uses the Hadoop Distributed File System (HDFS), which is currently not directly applicable to a majority of existing HPC environments such as Teragrid and NERSC that support other distributed file systems. On such resourceful High Performance Computing (HPC) infrastructures, the MapReduce model can rarely make use of full resources, as special circumstances must be created for its adoption, or simply limited resources must be isolated to the same end. This paper not only presents a MapReduce implementation directly suitable for such environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems’ functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPC environments, but also shows better performance in such settings. This paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in HPC environments over Apache Hadoop in a data intensive setting at the National Energy Research Scientific Computing Center (NERSC). 相似文献

12.

基于MapReduce的高能物理数据分析系统

臧冬松霍菁梁栋《计算机工程》2014,(2):1-5

将MapReduce思想引入到高能物理数据分析中,提出一个基于Hadoop框架的高能物理数据分析系统。通过建立事例的TAG信息数据库,将需要进一步分析的事例数减少2~3个数量级,从而减轻I/O压力,提高分析作业的效率。利用基于TAG信息的事例预筛选模型以及事例分析的MapReduce模型,设计适用于ROOT框架的数据拆分、事例读取、结果合并等MapReduce类库。在北京正负电子对撞机实验上进行系统实现后,将其应用于一个8节点实验集群上进行测试,结果表明,该系统可使4×106个事例的分析时间缩短23%,当增加节点个数时,每秒钟能够并发分析的事例数与集群的节点数基本呈正比,说明事例分析集群具有良好的扩展性。相似文献

13.

iMapReduce: A Distributed Computing Framework for Iterative Computation 总被引：2，自引：0，他引：2

Yanfeng Zhang Qixin Gao Lixin Gao Cuirong Wang 《Journal of Grid Computing》2012,10(1):47-68

Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This poses demand of distributed computing frameworks for processing massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative process that requires to parse data sets iteratively. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with the separated map and reduce functions, and provides the support of automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of creating new MapReduce jobs repeatedly, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop, and show that iMapReduce can achieve up to 5 times speedup over Hadoop for implementing iterative algorithms. 相似文献

14.

最小化多MapReduce任务总完工时间的分析模型及其应用

田文洪陈瑜王心阳薛瑞尼赵勇《计算机工程与科学》2014,36(4):571-578

随着大规模的MapReduce集群广泛地用于大数据处理,特别是当有多个任务需要使用同一个Hadoop集群时,一个关键问题是如何最大限度地减少集群的工作时间,提高MapReduce作业的服务效率。可将多个MapReduce作业当做一个调度任务建模,观察发现多个任务的总完工时间和任务的执行顺序有密切关系。研究目标是设计作业调度系统分析模型,最小化一批MapReduce作业的总完工时间。提出一个更好的调度策略和实现方法, 使整个调度系统符合经典Johnson算法的条件, 从而可使用经典Johnson算法在线性时间内获取总完工时间的最优解。同时,针对需要使用两个或多个资源池进行平衡的问题, 提出了一种线性时间解决方案, 优于已知的近似模拟方案。该理论模型可应用于提高系统响应速度、节能和负载均衡等方面, 对应的应用实例提供了证实。相似文献

15.

Adapting scientific computing problems to clouds using MapReduce 总被引：1，自引：0，他引：1

Satish Narayana Srirama Author VitaePelle JakovitsAuthor Vitae Eero Vainikko Author Vitae 《Future Generation Computer Systems》2012,28(1):184-192

Cloud computing, with its promise of virtually infinite resources, seems to suit well in solving resource greedy scientific computing problems. To study this, we established a scientific computing cloud (SciCloud) project and environment on our internal clusters. The main goal of the project is to study the scope of establishing private clouds at the universities. With these clouds, students and researchers can efficiently use the already existing resources of university computer networks, in solving computationally intensive scientific, mathematical, and academic problems. However, to be able to run the scientific computing applications on the cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. This paper summarizes the challenges associated with reducing iterative algorithms to the MapReduce model. Algorithms used by scientific computing are divided into different classes by how they can be adapted to the MapReduce model; examples from each such class are reduced to the MapReduce model and their performance is measured and analyzed. The study mainly focuses on the Hadoop MapReduce framework but also compares it to an alternative MapReduce framework called Twister, which is specifically designed for iterative algorithms. The analysis shows that Hadoop MapReduce has significant trouble with iterative problems while it suits well for embarrassingly parallel problems, and that Twister can handle iterative problems much more efficiently. This work shows how to adapt algorithms from each class into the MapReduce model, what affects the efficiency and scalability of algorithms in each class and allows us to judge which framework is more efficient for each of them, by mapping the advantages and disadvantages of the two frameworks. This study is of significant importance for scientific computing as it often uses complex iterative methods to solve critical problems and adapting such methods to cloud computing frameworks is not a trivial task. 相似文献

16.

用于Hadoop2.x的MapReduce性能评估模型

吴岳《计算机系统应用》2021,30(2):219-225

基于MapReduce的程序被越来越多地应用于大型数据分析的应用中.Apache Hadoop是最常用的开源MapReduce模型之一.程序运行时间的缩短对于MapReduce程序以及所有数据处理应用而言至关重要,而能够准确估算MapReduce程序的执行时间是优化程序的重要环节.本文定义了一个在Hadoop2.x版本... 相似文献

17.

消息代理机制下的MapReduce数据流优化

葛君伟蒋仙方义秋《计算机工程与应用》2013,49(5)

MapReduce编程模型是广泛应用于云计算环境下处理海量数据的一种并行计算框架。然而该框架下的面向数据密集型计算,集群节点间的数据传输依赖性较强,造成节点间的消息处理负载过重。提出基于消息代理机制的MapReduce改进模型,优化数据流。经实验数据表明,基于消息代理机制的MapReduce框架能提高数据密集型应用上的负载均衡。相似文献

18.

基于MapReduce的并行石漠化CA模型

张学锋余利胡宝清严国全李博《计算机工程与应用》2013,49(16):40-42

针对石漠化演化模拟预测CA模型在单机上训练和运行时间较长的问题。给出了MapReduce编程模型实现的并行化石漠化CA模型,并在用普通PC搭建的Hadoop集群上进行研究实验。实验结果表明,在Hadoop集群上实现的MapReduce并行化石漠化CA模型具有较好的加速比。相似文献

19.

云平台下图数据处理技术

刘超唐郑望姚宏胡成玉梁庆中《计算机应用》2015,35(1):43-47

针对Hadoop云平台下MapReduce计算模型在处理图数据时效率低下的问题,提出了一种类似谷歌Pregel的图数据处理计算框架--MyBSP.首先,分析了MapReduce的运行机制及不足之处;其次,阐述了MyBSP框架的结构、工作流程及主要接口;最后,在分析PageRank图处理算法原理的基础上,设计并实现了基于MyBSP框架的PageRank算法.实验结果表明,基于MyBSP框架的图数据处理算法与基于MapReduce的算法相比,迭代处理的性能提升了1.9~3倍.MyBSP算法的执行时间减少了67%,能够满足图数据高效处理的应用前景. 相似文献

20.

A hadoop based platform for natural language processing of web pages and documents

《Journal of Visual Languages and Computing》2015

The rapid and extensive pervasion of information through the web has enhanced the diffusion of a huge amount of unstructured natural language textual resources. A great interest has arisen in the last decade for discovering, accessing and sharing such a vast source of knowledge. For this reason, processing very large data volumes in a reasonable time frame is becoming a major challenge and a crucial requirement for many commercial and research fields. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in the recent years, since they introduced significant improvements for computing performance in data-intensive contexts, such as Big Data mining and analysis. Natural Language Processing, and particularly the tasks of text annotation and key feature extraction, is an application area with high computational requirements; therefore, these tasks can significantly benefit of parallel architectures. This paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in a parallel fashion. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, called MapReduce. In the specific, we implemented a MapReduce adaptation of a GATE application and framework (a widely used open source tool for text engineering and NLP). A validation is also offered in using the solution for extracting keywords and keyphrase from web documents in a multi-node Hadoop cluster. Evaluation of performance scalability has been conducted against a real corpus of web pages and documents. 相似文献