Similar Documents
20 similar documents found.
1.
With increasingly inexpensive storage and growing processing power, the cloud has rapidly become the environment of choice for storing and analyzing data for a variety of applications. Most large-scale data computations in the cloud rely heavily on the MapReduce paradigm and its Hadoop implementation. Nevertheless, this exponential growth in popularity has significantly impacted power consumption in cloud infrastructures. In this paper, we focus on MapReduce processing and investigate the impact of dynamically scaling the frequency of compute nodes on the performance and energy consumption of a Hadoop cluster. To this end, a series of experiments is conducted to explore the implications of Dynamic Voltage and Frequency Scaling (DVFS) settings on power consumption in Hadoop clusters. By enabling the various existing DVFS governors (i.e., performance, powersave, ondemand, conservative, and userspace) in a Hadoop cluster, we observe significant variation in performance and power consumption across applications: each DVFS setting proves sub-optimal for several representative MapReduce applications. Furthermore, our results reveal that the current CPU governors do not exactly reflect their design goals and may even become ineffective at managing power consumption in Hadoop clusters. This study aims to provide a clearer understanding of the interplay between performance and power management in Hadoop clusters, and thereby offers useful insight into designing power-aware techniques for Hadoop systems.
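As a concrete illustration of the knob these experiments toggle, the minimal Java sketch below switches the Linux cpufreq governor on every core of a node by writing to the standard sysfs interface (root required). The GovernorSwitcher class is our own illustrative helper, not code from the paper.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

/** Minimal sketch: set a Linux cpufreq governor (e.g. "ondemand", "powersave")
 *  on every core of a worker node. Requires root; paths follow the standard
 *  Linux cpufreq sysfs layout. Hypothetical helper, not from the paper. */
public class GovernorSwitcher {
    public static void setGovernor(String governor) throws IOException {
        try (DirectoryStream<Path> cpus =
                 Files.newDirectoryStream(Paths.get("/sys/devices/system/cpu"), "cpu[0-9]*")) {
            for (Path cpu : cpus) {
                Path knob = cpu.resolve("cpufreq/scaling_governor");
                if (Files.exists(knob)) {
                    Files.write(knob, List.of(governor)); // e.g. "conservative"
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        setGovernor(args.length > 0 ? args[0] : "ondemand");
    }
}
```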

2.
iMapReduce: A Distributed Computing Framework for Iterative Computation
Iterative computation is pervasive in applications such as data mining, web ranking, graph analysis, and online social network analysis. These iterative applications typically involve massive data sets containing millions or billions of records, which creates demand for distributed computing frameworks that can process massive data sets on clusters of machines. MapReduce is one such framework. However, MapReduce lacks built-in support for iterative processes that must parse data sets repeatedly. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify an iterative computation with separate map and reduce functions, and provides support for automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of repeatedly creating new MapReduce jobs, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop and show that iMapReduce can achieve up to a 5x speedup over Hadoop for iterative algorithms.
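To make the pain point concrete, here is a minimal sketch of the kind of client-side driver that plain Hadoop forces on users for iterative algorithms: one job per iteration, with convergence tested at the client between submissions. The class, placeholder mapper/reducer, and paths are hypothetical, not iMapReduce code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical client-side driver of the kind the abstract describes:
 *  without built-in iteration support, the user submits one MapReduce job
 *  per iteration and checks convergence between jobs. */
public class IterativeDriver {

    // Placeholder stages: the real iterative logic (e.g. rank propagation) is elided.
    public static class StepMapper extends Mapper<Object, Text, Text, Text> { }
    public static class StepReducer extends Reducer<Text, Text, Text, Text> { }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("iter/0");
        for (int i = 1; i <= 50; i++) {                    // hard iteration cap
            Path output = new Path("iter/" + i);
            Job job = Job.getInstance(conf, "iterative-step-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(StepMapper.class);
            job.setReducerClass(StepReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);      // static data re-read every pass
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) System.exit(1);
            if (converged(input, output)) break;           // user-written convergence test
            input = output;
        }
    }

    // Placeholder: e.g. compare successive outputs under a tolerance.
    static boolean converged(Path prev, Path cur) { return false; }
}
```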

3.
Nowadays, many organizations analyze their data with the MapReduce paradigm, most of them using the popular Apache Hadoop framework. As the data size managed by MapReduce applications steadily increases, the need to improve Hadoop performance grows as well. Existing modifications of Hadoop (e.g., the Mellanox Unstructured Data Accelerator) attempt to improve performance by changing some of its underlying subsystems. However, they are not always capable of coping with all of Hadoop's performance bottlenecks, or they hinder its portability. Furthermore, new frameworks like Apache Spark or DataMPI can achieve good performance improvements, but they do not keep compatibility with existing MapReduce applications. This paper proposes Flame-MR, a new event-driven MapReduce architecture that increases Hadoop performance by avoiding memory copies and pipelining data movements, without modifying application source code. A performance evaluation on two representative systems (an HPC cluster and a public cloud platform) provides experimental evidence of significant performance increases, reducing execution time by up to 54% on the Amazon EC2 cloud.

4.
The HaLoop approach to large-scale iterative data analysis
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms, among which MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph analysis, and model fitting. This paper (an extended version of the VLDB 2010 paper "HaLoop: Efficient Iterative Data Processing on Large Clusters," PVLDB 3(1):285–296, 2010) presents HaLoop, a modified version of the Hadoop MapReduce framework designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler that exploits these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets: compared with Hadoop, HaLoop improved runtimes by a factor of 1.85 on average and shuffled only 4% as much data between mappers and reducers in the applications we tested.

5.
黄鑫  罗军 《集成技术》2013,2(2):69-82
The rapid growth of data provides us with more information, yet it also challenges traditional information-retrieval techniques. This paper proposes MCMM, a minimum spanning tree (MST) based algorithm for large-scale data classification on MapReduce. It can be viewed as a model sitting between the traditional KNN method and cluster-based classification, aiming to overcome the shortcomings of both while handling large-scale data. In this model, the training set is treated as a weighted undirected complete graph: vertices are objects, and the weight of an edge is the distance between its two objects, where the distance is a task-specific metric rather than the Euclidean distance. A set of minimum spanning trees is then found in the graph, each tree representing one class. To reduce time complexity, the most representative points of each tree are extracted to stand for that tree, and unlabeled objects are classified by computing their distances to these condensed point sets. MCMM is implemented on MapReduce and deployed on the Hadoop platform. The model scales to large data sets because Hadoop supports data-intensive distributed applications that can operate across thousands of nodes, and MapReduce and Hadoop run well on clusters of commodity machines, so MCMM benefits from cloud computing via MapReduce and Hadoop. The experimental data sets include real data from the UCI repository and synthetic data containing 4,000 clusters. Experiments show that MCMM outperforms KNN and other commonly used baseline classification methods in both accuracy and scalability.
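As a rough illustration of the model's final step, the sketch below classifies an unlabeled object by its distance to the condensed representative points of each class (one MST per class). The distance metric, data layout, and the MST/representative-extraction stages are elided, and the class itself is hypothetical, not the paper's code.

```java
import java.util.List;
import java.util.Map;

/** Sketch of MCMM's classification step: each class (one MST) has been
 *  condensed to a small set of representative points; an unlabeled object
 *  is assigned the class of its nearest representative. */
public class MstClassifier {
    /** Task-specific metric; Euclidean used here only as a stand-in. */
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    /** reps maps class label -> representative points extracted from its MST. */
    static String classify(double[] x, Map<String, List<double[]>> reps) {
        String best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, List<double[]>> e : reps.entrySet()) {
            for (double[] r : e.getValue()) {
                double d = distance(x, r);
                if (d < bestDist) { bestDist = d; best = e.getKey(); }
            }
        }
        return best;
    }
}
```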

6.
Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive for the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the excessive configuration parameters in Hadoop impose unexpected challenges on running various workloads effectively. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that performs poorly, either because they have no idea how these configurations influence performance or because they are not even aware that the configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive the relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
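As a toy illustration of the final step, the self-contained sketch below fits a linear model mapping configuration features to runtime by ordinary least squares via the normal equations; the feature choice and the helper class are illustrative, not the paper's 45-metric methodology.

```java
/** Toy sketch of the regression idea: fit runtime = w.features by ordinary
 *  least squares. Features might be, e.g., map slots, sort-buffer size, and
 *  input size; these are assumptions, not the paper's metrics. */
public class ConfigRegression {
    // Solve (X^T X) w = X^T y by Gaussian elimination; X is n x p.
    static double[] fit(double[][] X, double[] y) {
        int n = X.length, p = X[0].length;
        double[][] A = new double[p][p + 1];      // augmented normal-equation matrix
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++) A[i][j] += X[k][i] * X[k][j];
            for (int k = 0; k < n; k++) A[i][p] += X[k][i] * y[k];
        }
        for (int col = 0; col < p; col++) {       // naive elimination, no pivoting
            for (int row = col + 1; row < p; row++) {
                double f = A[row][col] / A[col][col];
                for (int j = col; j <= p; j++) A[row][j] -= f * A[col][j];
            }
        }
        double[] w = new double[p];               // back substitution
        for (int i = p - 1; i >= 0; i--) {
            double s = A[i][p];
            for (int j = i + 1; j < p; j++) s -= A[i][j] * w[j];
            w[i] = s / A[i][i];
        }
        return w;
    }
}
```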

7.
MapReduce is regarded as an adequate programming model for large-scale data-intensive applications. The Hadoop framework is a well-known MapReduce implementation that runs MapReduce tasks on a cluster system. G-Hadoop extends the Hadoop MapReduce framework so that MapReduce tasks can run on multiple clusters. However, G-Hadoop simply reuses the user authentication and job submission mechanism of Hadoop, which was designed for a single cluster. This work proposes a new security model for G-Hadoop. The model builds on security solutions such as public key cryptography and the SSL protocol, and is designed specifically for distributed environments. The security framework simplifies user authentication and job submission in the current G-Hadoop implementation with a single-sign-on approach, and it provides a number of different security mechanisms to protect the G-Hadoop system from traditional attacks.

8.
Data-intensive applications process large volumes of data using parallel processing methods. MapReduce is a programming model designed for data-intensive applications on massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. While fault tolerance, a simple programming structure, and high scalability are considered strong points of MapReduce, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning complex. This paper explains the tuning of the Hadoop configuration parameters that directly affect MapReduce job-workflow performance under various conditions, with the goal of achieving maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, we present a model consisting of three main modules: (1) extending a data redistribution technique to find the high-performance nodes, (2) utilizing the number of map/reduce slots to reduce execution time, and (3) developing a new hybrid routing schedule for the shuffle phase that defines the scheduler task while reducing the memory-management level.
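For flavor, the sketch below sets a few standard Hadoop 2.x shuffle- and parallelism-related parameters of the kind such tuning targets. The property names are real Hadoop keys, but the values are made up and would have to come from profiling a concrete deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

/** Illustrative only: per-job knobs that this kind of tuning adjusts.
 *  Property names are standard Hadoop 2.x keys; the values are invented. */
public class TunedJobFactory {
    public static Job newTunedJob(Configuration conf) throws Exception {
        conf.setInt("mapreduce.job.reduces", 16);            // reduce-task parallelism
        conf.setInt("mapreduce.task.io.sort.mb", 256);       // map-side sort buffer
        conf.setInt("mapreduce.task.io.sort.factor", 64);    // merge streams in shuffle
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.7f);
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        return Job.getInstance(conf, "tuned-job");
    }
}
```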

9.
The ability to handle very large amounts of image data is important for image analysis, indexing, and retrieval applications. Sadly, scalability aspects are often ignored or glossed over in the literature, especially with respect to the intricacies of actual implementation details. In this paper we present a case study showing how a standard bag-of-visual-words image indexing pipeline can be scaled across a distributed cluster of machines. To achieve scalability, we investigate the optimal combination of hybridisations of the MapReduce distributed computational framework, which allows the components of the analysis and indexing pipeline to be effectively mapped and run on modern server hardware. We then demonstrate the scalability of the approach in practice with a set of image analysis and indexing tools built on top of the Apache Hadoop MapReduce framework. The tools used in our experiments are freely available as open-source software, and the paper fully describes the nuances of their implementation.

10.
Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, instead aim at the long-held dream of moving compute to data. This kind of data-aware scheduling, typified by Hadoop MapReduce, has been successfully implemented in the Map phase, where each Map task is sent to the compute node holding the corresponding input data chunk. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, which makes it difficult to handle complex logic such as pipelines or workflows, and it lacks built-in support and optimization for input datasets shared among multiple applications and/or jobs. Performance can be improved significantly when knowledge of shared and frequently accessed data is taken into scheduling decisions. To enhance workflow management in modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is designed to be not only data-aware but also shared-data-aware. It identifies the most frequently shared data at both the task level and the job level, and replicates them to each compute node for data locality. It also supports user-defined multiple Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows an execution-runtime speedup exceeding 4x over a traditional MapReduce implementation, with a manageable time overhead.

11.
In heterogeneous Hadoop clusters, the mixed use of erasure-coded and replicated storage, together with real-time differences in the computing power of server nodes, lowers the efficiency of MapReduce job processing. To mitigate this, this paper implements a scheduling strategy that dynamically adjusts MapReduce task allocation under multi-job concurrency according to where data is stored and the real-time load of each node. The strategy modifies the data-placement policy of the current Hadoop framework and dynamically controls per-node task concurrency, achieving a more balanced allocation of resources among concurrent jobs. Experimental results show that, compared with Hadoop's two default job-scheduling policies, the proposed scheduling mode shortens job completion time by about 17% and effectively avoids the starvation that some jobs would otherwise suffer.

12.
MapReduce is an increasingly popular programming model. However, its most widely used implementation, Apache Hadoop, relies on the Hadoop Distributed File System (HDFS), which is not directly applicable to many existing HPC environments, such as TeraGrid and NERSC, that support other distributed file systems. On such resource-rich High Performance Computing (HPC) infrastructures, the MapReduce model can rarely use the full resources: either special circumstances must be created for its adoption, or limited resources must be isolated for it. This paper not only presents a MapReduce implementation directly suited to such environments, but also exposes the design choices that yield better performance in those settings. By leveraging the functions of the underlying distributed file systems and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) both allows the model to be used in a growing number of HPC environments and performs better in such settings. This paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains of our approach over Apache Hadoop in a data-intensive setting at the National Energy Research Scientific Computing Center (NERSC).

13.
Various methods and techniques have been proposed in the past for improving the performance of queries on structured and unstructured data. This paper proposes a parallel B-Tree index in the MapReduce framework to improve the efficiency of random reads over existing approaches. The benefit of using the MapReduce framework is that it encapsulates the complexity of implementing parallelism and fault tolerance and presents them to users in a friendly way. The proposed index reduces the number of data accesses for range queries and thus improves efficiency. The B-Tree index is implemented as a chained-MapReduce process, which reduces intermediate data-access time between successive map and reduce functions and further improves efficiency. Five performance metrics are used to validate the proposed index for range-search queries in MapReduce: the effect of varying the cluster size and the range-query coverage on execution time, the number of map tasks, and the size of Input/Output (I/O) data. The effect of varying the Hadoop Distributed File System (HDFS) block size, along with an analysis of heap-memory size and the intermediate data generated during the map and reduce functions, also shows the superiority of the proposed index. Experimental results show that the parallel B-Tree index in a chained-MapReduce environment performs better than both the default non-indexed Hadoop dataset and a B-Tree-like global index (Zhao et al., 2012) in MapReduce.
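The chained-MapReduce process above is the paper's own design; for reference, stock Hadoop ships a generic chaining facility, ChainMapper/ChainReducer, which pipes one mapper's output directly into the next inside a single task and thereby avoids an intermediate materialization. A minimal sketch with placeholder mappers (our own illustration, not the paper's index code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

/** Sketch of Hadoop's stock chaining API: TraverseMapper's output feeds
 *  FilterMapper in-process, so no extra job or HDFS write sits between
 *  the two stages. Both mapper bodies are placeholders. */
public class ChainedJob {
    public static class TraverseMapper extends Mapper<LongWritable, Text, Text, Text> { }
    public static class FilterMapper extends Mapper<Text, Text, Text, Text> { }

    public static Job build(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "chained-range-scan");
        ChainMapper.addMapper(job, TraverseMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        ChainMapper.addMapper(job, FilterMapper.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        return job;
    }
}
```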

14.
Big data has drawn intense attention from the computer science research community, business executives, and decision makers. As data volumes grow, performance-oriented data-intensive processing frameworks such as MapReduce are needed to scale computation on large commodity clusters. Hadoop MapReduce processes data in the Hadoop Distributed File System as jobs scheduled by the YARN fair scheduler or capacity scheduler. However, with dynamic changes in hardware and operating environments, cluster performance is greatly affected. Various efforts in the literature address heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run-time scheduling. In this paper, we present a survey of the research efforts made so far to improve Hadoop MapReduce scheduling. We classify the scheduling algorithms and techniques proposed in the literature according to the areas they address and present a taxonomy. Furthermore, we discuss open issues and challenges in MapReduce scheduling for improving its performance. Copyright © 2015 John Wiley & Sons, Ltd.

15.
The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide easy access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this study, we aim to minimize the cloud-computing cost of bioinformatics data analysis by optimizing the Hadoop parameters identified as significant. When MapReduce-based bioinformatics tools are used in the cloud, the default settings often lead to resource underutilization and wasted expense. We choose k-mer counting, a representative application used in a large number of NGS data-analysis tools, as our case study. Experimental results show that, with fine-tuned parameters, we achieve a 4x overall speedup compared with the original performance under the default settings. This paper presents an exemplary case of tuning MapReduce-based bioinformatics applications in the cloud and documents the key parameters that can yield significant performance benefits.
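For context, k-mer counting follows the classic word-count pattern; the hypothetical mapper below emits every k-length substring of each read with a count of one, leaving the summing combiner/reducer and full FASTA parsing aside. The tuned parameter values from the study are not reproduced here.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of the k-mer counting pattern the study tunes: each map() emits
 *  every k-length substring of a read with count 1; a standard summing
 *  reducer/combiner totals them. */
public class KmerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int K = 21;                 // illustrative k
    private static final IntWritable ONE = new IntWritable(1);
    private final Text kmer = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String read = line.toString().trim();
        if (read.isEmpty() || read.startsWith(">")) return;  // skip FASTA headers
        for (int i = 0; i + K <= read.length(); i++) {
            kmer.set(read.substring(i, i + K));
            ctx.write(kmer, ONE);
        }
    }
}
```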

16.
荀亚玲  张继福  秦啸 《软件学报》2015,26(8):2056-2073
MapReduce is an effective programming model for large-scale data-intensive applications, noted for its simple programming interface, easy scalability, and good fault tolerance, and it has been widely and successfully applied in parallel and distributed computing. Because MapReduce scales computation out to large clusters of machines, sensible data placement becomes one of the key factors affecting the performance of a MapReduce cluster system, including energy consumption, resource utilization, communication and I/O cost, response time, reliability, and throughput. This paper first analyzes the default data-placement strategy of Hadoop, the canonical implementation of the MapReduce programming model, and then discusses the key issues in designing a data-placement strategy under the MapReduce framework and the criteria for evaluating one. It then surveys and analyzes the state of research on optimizing data-placement strategies in MapReduce cluster environments, and finally summarizes directions for future work on data placement in such environments.

17.
The sequential constraints in the execution mechanism of the MapReduce distributed computing model on the Hadoop platform waste computing resources. To raise the degree of fine-grained parallel data processing on each worker node, this paper combines the model with Java shared-memory multithreading and proposes an optimized distributed parallel computing model, MapReduce+OpenMP, that couples coarse-grained and fine-grained parallelism. The model's performance and efficiency are verified on a four-node Hadoop cluster by analyzing taxi GPS trajectory data sets of various sizes. Experimental results show that the MapReduce+OpenMP distributed parallel computing model does improve computing efficiency on large data sets and is an effective refinement and optimization of the Hadoop big-data analysis and processing model.
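A rough Java-side analogue of this coarse-plus-fine-grained idea is Hadoop's stock MultithreadedMapper, which runs several map() threads inside one task JVM; the sketch below (with a placeholder GPS mapper) shows the wiring, though the paper itself couples MapReduce with OpenMP rather than this class.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

/** Sketch: intra-node fine-grained parallelism on top of MapReduce's
 *  inter-node parallelism, using Hadoop's stock MultithreadedMapper.
 *  GpsTrackMapper is a placeholder for the real per-record logic. */
public class HybridJob {
    public static class GpsTrackMapper extends Mapper<LongWritable, Text, Text, Text> { }

    public static Job build(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "hybrid-gps-analysis");
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, GpsTrackMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8);   // fine-grained threads per task
        return job;
    }
}
```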

18.
The MapReduce computing framework is widely used in large-scale data-analysis applications. Although it offers elastic scalability and fine-grained fault tolerance, its performance is often unsatisfactory. MapReduce can achieve better performance by allocating more compute nodes, but doing so is not cost-effective: users want MapReduce to provide higher computing efficiency while retaining elastic scalability and fine-grained fault tolerance. This paper proposes and evaluates a method for dynamically optimizing the sort performance of the map phase; the test results show that the method improves MapReduce benchmark performance.

19.
With the development of big data, Hadoop has become one of the important tools for big-data processing. In practice, Hadoop's I/O operations constrain improvements in system performance. Hadoop usually reduces I/O by compressing data in software, but software compression is slow, so a hardware compression accelerator is used in its place. Because Hadoop runs on the Java virtual machine, it cannot directly invoke an underlying I/O hardware compression accelerator. This work solves the two key problems, obtaining the data to be compressed from the Hadoop system and streaming that data to the I/O hardware compression accelerator, by implementing Hadoop compressor/decompressor classes and designing a C++ dynamic link library, thereby integrating the accelerator into the Hadoop framework. Experimental results show that the accelerator achieves a compression speed of 15.9 bytes per second per hertz, and integrating it improves Hadoop system performance by a factor of two.
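The integration pattern described here, a Hadoop-side compressor class calling into a C++ shared library, might look like the following JNI sketch. The library name and native signature are hypothetical, and in a full integration this class would sit behind Hadoop's org.apache.hadoop.io.compress Compressor/CompressionCodec interfaces.

```java
/** Sketch of the JNI bridge pattern the paper describes: a Hadoop-side
 *  compressor class loads a C++ shared library and hands buffers to the
 *  hardware accelerator through a native method. Library name and native
 *  signature are hypothetical. */
public class HwAccelCompressor {
    static {
        System.loadLibrary("hwcompress");   // hypothetical libhwcompress.so (C++)
    }

    // Implemented in C++ via JNI; drives the I/O hardware compression accelerator.
    private native int nativeCompress(byte[] src, int srcLen, byte[] dst);

    /** Compress one buffer; returns the number of bytes written to dst. */
    public int compress(byte[] src, int srcLen, byte[] dst) {
        return nativeCompress(src, srcLen, dst);
    }
}
```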

20.
Traditional collaborative-filtering recommendation techniques fall short in big-data environments. To address this, a personalized recommendation method based on cloud computing is proposed: the large data set and the recommendation computation are decomposed across multiple machines for parallel processing. After recasting the classic ItemCF algorithm in MapReduce form, a parallel recommendation engine was built on the open-source Hadoop framework, and its effectiveness was verified through recommendation tasks on a commercially deployed English-training platform. Experimental results show that using cloud computing to process massive data on a cluster greatly improves the scalability of a recommendation system.
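As an illustration of how ItemCF decomposes onto MapReduce, the hypothetical first stage below groups items by user and emits item-pair co-occurrence counts, the raw material of item-item similarity; a follow-up job would sum the pairs and compute similarities. The input format and field layout are assumptions, not the cited system's.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of one plausible first stage of a MapReduce-style ItemCF engine:
 *  from lines of "user<TAB>item", group items by user, then emit item-pair
 *  co-occurrence counts. */
public class CoOccurrenceStage {
    public static class UserItemsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable off, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");     // assumed: user \t item
            if (f.length == 2) ctx.write(new Text(f[0]), new Text(f[1]));
        }
    }

    public static class PairCountReducer extends Reducer<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void reduce(Text user, Iterable<Text> items, Context ctx)
                throws IOException, InterruptedException {
            List<String> seen = new ArrayList<>();
            for (Text t : items) seen.add(t.toString());
            for (int i = 0; i < seen.size(); i++)          // every item pair co-rated
                for (int j = i + 1; j < seen.size(); j++)  // by this user counts once
                    ctx.write(new Text(seen.get(i) + "," + seen.get(j)), ONE);
        }
    }
}
```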
