Similar Articles
20 similar articles found
1.
The core business of many companies depends on the timely analysis of large quantities of new data. MapReduce clusters that routinely process petabytes of data represent a new entity in the evolving landscape of clouds and data centers. During the lifetime of a data center, old hardware needs to be eventually replaced by new hardware. The hardware selection process needs to be driven by performance objectives of the existing production workloads. In this work, we present a general framework, called Ariel, that automates system administrators’ efforts for evaluating different hardware choices and predicting completion times of MapReduce applications for their migration to a Hadoop cluster based on the new hardware. The proposed framework consists of two key components: (i) a set of microbenchmarks to profile the MapReduce processing pipeline on a given platform, and (ii) a regression-based model that establishes a performance relationship between the source and target platforms. Benchmarking and model derivation can be done using a small test cluster based on new hardware. However, the designed model can be used for predicting the jobs’ completion time on a large Hadoop cluster and be applied for its sizing to achieve desirable service level objectives (SLOs). We validate the effectiveness of the proposed approach using a set of twelve realistic MapReduce applications and three different hardware platforms. The evaluation study justifies our design choices and shows that the derived model accurately predicts performance of the test applications. The predicted completion times of eleven applications (out of twelve) are within 10% of the measured completion times on the target platforms.
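To make the prediction step concrete, the following sketch uses scikit-learn in place of whatever regression tool Ariel actually uses, with invented phase names and timings. It fits a linear relationship between per-phase durations measured by microbenchmarks on the source platform and the corresponding durations on the target platform, then applies it to a production job's profile.

```python
# Sketch: regress target-platform phase durations on source-platform ones,
# then predict a new job's completion time on the new hardware.
# Phase names and all numbers are illustrative, not taken from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

# Microbenchmark measurements: rows = benchmark runs,
# columns = durations (s) of read / map / shuffle / reduce / write phases.
source = np.array([[12.0, 30.0, 18.0, 22.0, 9.0],
                   [25.0, 61.0, 35.0, 44.0, 17.0],
                   [50.0, 118.0, 72.0, 90.0, 36.0],
                   [80.0, 190.0, 115.0, 140.0, 58.0]])
target = np.array([[9.0, 21.0, 14.0, 16.0, 7.0],
                   [19.0, 43.0, 27.0, 31.0, 13.0],
                   [37.0, 83.0, 55.0, 63.0, 26.0],
                   [60.0, 134.0, 88.0, 99.0, 42.0]])

model = LinearRegression().fit(source, target)

# Phase profile of a production job measured on the old hardware.
job_on_source = np.array([[40.0, 95.0, 60.0, 70.0, 30.0]])
predicted_phases = model.predict(job_on_source)
print("predicted completion time on new hardware: %.1f s" % predicted_phases.sum())
```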

2.
The combination of the Hadoop MapReduce programming model and cloud computing allows biological scientists to analyze next-generation sequencing (NGS) data in a timely and cost-effective manner. Cloud computing platforms remove the burden of IT facility procurement and management from end users and provide ease of access to Hadoop clusters. However, biological scientists are still expected to choose appropriate Hadoop parameters for running their jobs. More importantly, the available Hadoop tuning guidelines are either obsolete or too general to capture the particular characteristics of bioinformatics applications. In this study, we aim to minimize the cloud computing cost spent on bioinformatics data analysis by optimizing the extracted significant Hadoop parameters. When using MapReduce-based bioinformatics tools in the cloud, the default settings often lead to resource underutilization and wasteful expenses. We choose k-mer counting, a representative application used in a large number of NGS data analysis tools, as our study case. Experimental results show that, with the fine-tuned parameters, we achieve a total of 4× speedup compared with the original performance (using the default settings). This paper presents an exemplary case for tuning MapReduce-based bioinformatics applications in the cloud, and documents the key parameters that could lead to significant performance benefits.
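k-mer counting itself maps naturally onto MapReduce. A minimal Hadoop-Streaming-style sketch is shown below; the k value and the tuning options named in the trailing comment are illustrative placeholders, not the tuned settings reported in the paper.

```python
# Sketch: k-mer counting written for Hadoop Streaming. Each map task emits
# (k-mer, 1) pairs; reduce tasks sum the counts.
import sys
from itertools import groupby

K = 15  # k-mer length (illustrative)

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        seq = line.strip()
        if not seq or seq.startswith(">"):   # skip FASTA headers
            continue
        for i in range(len(seq) - K + 1):
            stdout.write("%s\t1\n" % seq[i:i + K])

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    parsed = (line.rstrip("\n").split("\t") for line in stdin)
    for kmer, group in groupby(parsed, key=lambda kv: kv[0]):
        stdout.write("%s\t%d\n" % (kmer, sum(int(v) for _, v in group)))

# When submitted with hadoop-streaming, tuning would be passed as -D options,
# e.g. -D mapreduce.job.reduces=... -D mapreduce.task.io.sort.mb=...
if __name__ == "__main__":
    (mapper if sys.argv[1:] == ["map"] else reducer)()
```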

3.
Nowadays, many organizations analyze their data with the MapReduce paradigm, most of them using the popular Apache Hadoop framework. As the data size managed by MapReduce applications is steadily increasing, the need for improving the Hadoop performance also grows. Existing modifications of Hadoop (e.g., Mellanox Unstructured Data Accelerator) attempt to improve performance by changing some of its underlying subsystems. However, they are not always able to cope with all of its performance bottlenecks, or they hinder its portability. Furthermore, new frameworks like Apache Spark or DataMPI can achieve good performance improvements, but they do not keep compatibility with existing MapReduce applications. This paper proposes Flame-MR, a new event-driven MapReduce architecture that increases Hadoop performance by avoiding memory copies and pipelining data movements, without modifying the source code of the applications. The performance evaluation on two representative systems (an HPC cluster and a public cloud platform) has shown experimental evidence of significant performance increases, reducing the execution time by up to 54% on the Amazon EC2 cloud.

4.
MapReduce is emerging as a prominent tool for big data processing. Data locality is a key feature in MapReduce that is extensively leveraged in data-intensive cloud systems: it avoids network saturation when processing large amounts of data by co-allocating computation and data storage, particularly for the map phase. However, our studies with Hadoop, a widely used MapReduce implementation, demonstrate that the presence of partitioning skew (a variation in the intermediate keys' frequencies, in their distribution across different data nodes, or in both) causes a huge amount of data transfer during the shuffle phase and leads to significant unfairness in the reduce input among different data nodes. As a result, applications suffer severe performance degradation due to the long data transfer during the shuffle phase along with the computation skew, particularly in the reduce phase. In this paper, we develop a novel algorithm named LEEN for locality-aware and fairness-aware key partitioning in MapReduce. LEEN embraces an asynchronous map and reduce scheme. All buffered intermediate keys are partitioned according to their frequencies and the fairness of the expected data distribution after the shuffle phase. We have integrated LEEN into Hadoop. Our experiments demonstrate that LEEN can efficiently achieve higher locality and reduce the amount of shuffled data. More importantly, LEEN guarantees fair distribution of the reduce inputs. As a result, LEEN achieves a performance improvement of up to 45% on different workloads.
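The fairness half of the idea can be sketched as a greedy bin-packing of keys onto reducers by frequency. This is a simplification for illustration, not the published LEEN algorithm, which additionally weighs each key's locality across nodes.

```python
# Sketch of frequency-aware key partitioning: assign keys to reducers greedily,
# heaviest key first, always to the currently lightest reducer, which balances
# the reduce input sizes.
import heapq

def partition_keys(key_freq, num_reducers):
    loads = [(0, r) for r in range(num_reducers)]   # min-heap of (load, reducer)
    heapq.heapify(loads)
    assignment = {}
    for key, freq in sorted(key_freq.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(loads)
        assignment[key] = reducer
        heapq.heappush(loads, (load + freq, reducer))
    return assignment

print(partition_keys({"a": 900, "b": 500, "c": 450, "d": 100}, 2))
# {'a': 0, 'b': 1, 'c': 1, 'd': 0} -> reducer loads 1000 vs 950
```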

5.
MapReduce and its open source implementation, Hadoop, have gained widespread adoption for parallel processing of big data jobs. Since the number of such big data jobs is rapidly rising, reducing their energy consumption is increasingly important for cutting environmental impact as well as operational costs. Prior work by Mashayekhy et al. (IEEE Trans. Parallel Distributed Syst. 26, 2720–2733, 2016) tackled the problem of energy-aware scheduling of a single MapReduce job; we provide a far more efficient heuristic in this paper. We first model the problem as an Integer Linear Program to find the optimal solution using ILP solvers. Then we present a task-based greedy scheduling algorithm, TGSAVE, to select a slot for each task to minimize the total energy consumption of the MapReduce job for big data applications in heterogeneous environments without significant performance loss while satisfying the service level agreement (SLA). We perform several experiments on a Hadoop cluster to measure characteristics of tasks for nine different applications to evaluate our proposed algorithm. The results show that the total energy consumption of MapReduce jobs obtained by TGSAVE is up to 35% less than that achieved by EMRSA, its closest rival, proposed in Mashayekhy et al., for the same workloads. Besides, TGSAVE is capable of finding a feasible solution, in the same order of time, for deadlines up to 74% tighter than the tightest deadline for which EMRSA can find one. On average, the TGSAVE solution is approximately 1.4% away from the optimal solution, and it can meet deadlines as tight as 12%, on average, above the energy-oblivious minimum makespan in the benchmarks we examined.
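A toy version of a greedy, deadline-constrained slot selection might look like the following; all speeds, powers, and task sizes are invented, and this is not the published TGSAVE algorithm.

```python
# Toy sketch in the spirit of energy-aware slot selection: take each task
# (largest first) and place it on the slot with the lowest energy cost among
# the slots that can still finish it before the deadline.
def greedy_schedule(tasks, slots, deadline):
    """tasks: {name: work units}; slots: {name: (speed, power in watts)}."""
    finish = {s: 0.0 for s in slots}
    plan, energy = {}, 0.0
    for task, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
        feasible = [s for s, (speed, _) in slots.items()
                    if finish[s] + work / speed <= deadline]
        if not feasible:
            raise ValueError("deadline too tight for task %s" % task)
        best = min(feasible, key=lambda s: work / slots[s][0] * slots[s][1])
        runtime = work / slots[best][0]
        finish[best] += runtime
        energy += runtime * slots[best][1]
        plan[task] = best
    return plan, energy

plan, joules = greedy_schedule({"m1": 60, "m2": 40, "r1": 80},
                               {"fast": (2.0, 150.0), "slow": (1.0, 60.0)},
                               deadline=90.0)
print(plan, joules)   # the heaviest task lands on the low-power slot
```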

6.
Apache Hadoop has become ubiquitous in cloud computing, which provides resources as services for multi-tenant applications. YARN (a.k.a. MapReduce 2.0) is one of the key features of the second-generation Hadoop, providing resource management and scheduling for large-scale MapReduce environments. Two major challenges for the YARN scheduler are the ability to automatically tailor and control resource allocations to different jobs for achieving their Service Level Agreements (SLAs), and to minimize energy consumption of the overall cloud computing system. In this work, we propose an SLA-aware, energy-efficient scheduling scheme which allocates an appropriate amount of resources to MapReduce applications within the YARN architecture. In our task scheduling policy, we consider data locality information to reduce MapReduce network traffic. Furthermore, the slack time between the actual execution time of completed tasks and the expected completion time of the application is utilized to improve the energy efficiency of the system. An online userspace-governor-based dynamic voltage and frequency scaling (DVFS) scheme is designed in the YARN per-application ApplicationMaster to dynamically change the CPU frequency for upcoming tasks given the slack time from previously completed tasks. Experimental evaluation shows that our proposed scheme outperforms the existing MapReduce scheduling policies in terms of both resource utilization and energy efficiency.
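The slack-driven frequency choice can be illustrated with a small helper: given the remaining time budget left over by earlier tasks, pick the lowest available frequency that still fits. Frequencies and times are invented; the actual scheme drives the Linux userspace governor from the per-application ApplicationMaster.

```python
# Sketch of slack-based frequency selection: if earlier tasks finished ahead of
# schedule, run the next task at the lowest frequency that still fits inside
# the remaining time budget; otherwise run at the maximum frequency.
def pick_frequency(freqs_mhz, est_runtime_at_max, time_budget):
    f_max = max(freqs_mhz)
    for f in sorted(freqs_mhz):                  # lowest first
        scaled = est_runtime_at_max * f_max / f  # runtime grows as f drops
        if scaled <= time_budget:
            return f
    return f_max                                 # no slack: run flat out

# Task expected to take 40 s at 2400 MHz; previous tasks left a 55 s budget.
print(pick_frequency([1200, 1600, 2000, 2400], 40.0, 55.0))  # -> 2000
```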

7.
The efficient use of energy is essential to address concerns of cost and sustainability. Many data centers contain MapReduce clusters to process Big Data applications. A large number of machines and fault tolerance capabilities make MapReduce clusters energy inefficient. In this paper, we present a Configurator based on performance and energy models to improve the energy efficiency of MapReduce systems. Our solution is novel as it takes into account the dependence of the performance and energy consumption of a cluster on MapReduce parameters. While this dependence is known, we are the first to model it and design a Configurator to optimize these parameter settings for maximizing the energy efficiency of MapReduce systems. Our empirical evaluations show that the Configurator can result in up to 50% improvement in the energy efficiency of typical MapReduce applications in two architecturally different clusters.

8.
To address the waste of computing resources caused by the sequential constraints in the execution mechanism of the MapReduce distributed computing model on the Hadoop platform, this work optimizes the model from the perspective of improving fine-grained parallel data processing on each execution node, combining it with Java shared-memory multithreading techniques, and proposes MapReduce+OpenMP, a distributed parallel computing model that combines coarse-grained and fine-grained parallelism. The model's performance and efficiency are validated by analyzing taxi GPS trajectory data of different sizes on a four-node Hadoop cluster. Experimental results show that the MapReduce+OpenMP distributed parallel computing model indeed improves computing efficiency on large data sets and is an effective refinement and optimization of the big data analysis and processing model on the Hadoop platform.

9.
With the recent emergence of cloud computing based services on the Internet, MapReduce and distributed file systems like HDFS have emerged as the paradigm of choice for developing large scale data intensive applications. Given the scale at which these applications are deployed, minimizing power consumption of these clusters can significantly cut down operational costs and reduce their carbon footprint, thereby increasing the utility from a provider's point of view. This paper addresses energy conservation for clusters of nodes that run MapReduce jobs. The proposed algorithm dynamically reconfigures the cluster based on the current workload and turns cluster nodes on or off when the average cluster utilization rises above or falls below administrator-specified thresholds, respectively. We evaluate our algorithm using the GridSim toolkit and our results show that the proposed algorithm achieves an energy reduction of 33% under average workloads and up to 54% under low workloads.
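The threshold policy can be sketched as a simple controller; the node structure, thresholds, and the power on/off hooks below are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the threshold controller: when average utilization of active nodes
# stays above the upper threshold, power on an idle node; when it falls below
# the lower threshold, pick the least-loaded active node and power it off.
def rebalance(nodes, upper=0.8, lower=0.3):
    active = [n for n in nodes if n["on"]]
    util = sum(n["util"] for n in active) / len(active)
    if util > upper:
        for n in nodes:
            if not n["on"]:
                n["on"] = True          # would trigger wake-on-LAN / IPMI here
                break
    elif util < lower and len(active) > 1:
        victim = min(active, key=lambda n: n["util"])
        victim["on"] = False            # would drain / decommission from HDFS first
    return nodes

cluster = [{"id": i, "on": i < 3, "util": u}
           for i, u in enumerate([0.9, 0.85, 0.92, 0.0])]
print(rebalance(cluster))               # high utilization -> node 3 is powered on
```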

10.
Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive for the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy-to-use, efficient and reliable for data-intensive computations, the excessive configuration parameters in Hadoop impose unexpected challenges on running various workloads with a Hadoop cluster effectively. Consequently, developers who have less experience with the Hadoop configuration system may devote significant effort to writing an application with poor performance, either because they have no idea how these configurations would influence the performance, or because they are not even aware that these configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics, deriving relationships between workload characteristics and the corresponding performance under different Hadoop configurations. Regression models are also constructed that attempt to predict the performance of various workloads under different Hadoop configurations. Several non-intuitive relationships between workload characteristics and performance are revealed through our analysis, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
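A compressed version of such an analysis pipeline, with random stand-in data in place of the real 45 metrics and the actual configurations, could look like this in scikit-learn.

```python
# Sketch of the analysis pipeline: reduce a table of workload metrics with PCA,
# then fit a regression that predicts job runtime from the principal components
# plus the Hadoop configuration. All data is synthetic stand-in.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
metrics = rng.normal(size=(60, 45))          # 60 workload runs x 45 metrics
config = rng.integers(1, 9, size=(60, 3))    # e.g. reducer count and two other knobs
runtime = metrics[:, 0] * 30 + config[:, 0] * 12 + rng.normal(scale=5, size=60)

features = np.hstack([metrics, config])
model = make_pipeline(PCA(n_components=8), LinearRegression())
model.fit(features, runtime)
print("R^2 on training data:", model.score(features, runtime))
```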

11.

Cancer classification is one of the main steps in the patient healing process. This fact drives modern clinical researchers to use advanced bioinformatics methods for cancer classification. Cancer classification is usually performed using gene expression data gained in microarray experiments together with advanced machine learning methods. A microarray experiment generates a huge amount of data, and processing it via machine learning methods represents a big challenge. In this study, a two-step classification paradigm which merges genetic algorithm feature selection and machine learning classifiers is utilized. The genetic algorithm is built in the spirit of the MapReduce programming model, which makes the algorithm highly scalable on a Hadoop cluster. In order to improve its performance, the proposed algorithm is extended into a parallel algorithm which processes microarray data in a distributed manner using the Hadoop MapReduce framework. In this paper, the algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor) and its accuracy reached 100% with fewer than 25 selected features. The proposed cloud computing-based MapReduce parallel genetic algorithm performed well on gene expression data. In addition, the scalability of the suggested algorithm is virtually unlimited because of the underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be effectively applied to real-world microarray data in the cloud environment. In addition, the Hadoop MapReduce framework demonstrates a substantial decrease in computation time.
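The division of labor between map (fitness evaluation of each candidate feature subset) and reduce (selection of survivors) can be sketched as follows; the classifier and synthetic data are stand-ins for the GEMS microarray sets, and the actual system runs these steps as Hadoop tasks rather than local function calls.

```python
# Conceptual sketch of MapReduce-style genetic feature selection: the map phase
# scores each chromosome (a candidate feature subset) independently, so it can
# be farmed out to Hadoop tasks; the reduce phase keeps the fittest individuals.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=200, n_informative=10,
                           random_state=0)
rng = np.random.default_rng(0)

def map_fitness(chromosome):
    """One 'map' call: fitness = cross-validated accuracy of the feature subset."""
    cols = np.flatnonzero(chromosome)
    score = cross_val_score(KNeighborsClassifier(3), X[:, cols], y, cv=3).mean()
    return score, chromosome

def reduce_select(scored, keep):
    """The 'reduce' step: keep the best individuals for the next generation."""
    return sorted(scored, key=lambda s: -s[0])[:keep]

population = [rng.random(200) < 0.1 for _ in range(20)]   # ~20 features each
survivors = reduce_select([map_fitness(c) for c in population], keep=5)
print("best accuracy:", survivors[0][0])
```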


12.
MapReduce is regarded as an adequate programming model for large-scale data-intensive applications. The Hadoop framework is a well-known MapReduce implementation that runs MapReduce tasks on a cluster system. G-Hadoop is an extension of the Hadoop MapReduce framework with the functionality of allowing MapReduce tasks to run on multiple clusters. However, G-Hadoop simply reuses the user authentication and job submission mechanism of Hadoop, which is designed for a single cluster. This work proposes a new security model for G-Hadoop. The security model is based on several security solutions, such as public key cryptography and the SSL protocol, and is specifically designed for distributed environments. The security framework simplifies the user authentication and job submission process of the current G-Hadoop implementation with a single-sign-on approach. In addition, the designed security framework provides a number of different security mechanisms to protect the G-Hadoop system from traditional attacks.

13.
黄鑫, 罗军. 《集成技术》, 2013, 2(2): 69–82.
The rapid growth of data provides us with more information but also challenges traditional information retrieval techniques. This paper proposes the MCMM algorithm, a minimum spanning tree (MST) based classification model for large-scale data built on MapReduce. It can be viewed as an intermediate model between the traditional KNN method and cluster-based classification, aiming to overcome the shortcomings of both while handling large-scale data. In this model, the training set is treated as a weighted undirected complete graph: vertices are objects, and the weight of an edge between two vertices is the distance between the corresponding objects, which is a task-specific distance metric rather than the Euclidean distance. A set of minimum spanning trees is then found in the graph, where each tree represents a class. To reduce time complexity, the most representative points of each tree are extracted to represent that tree. Unlabeled objects are classified by computing their distances to these compressed point sets. The MCMM model is implemented with MapReduce and deployed on the Hadoop platform. It scales to large data sets because Hadoop supports data-intensive distributed applications that can run with thousands of nodes and large volumes of data; moreover, MapReduce and Hadoop run well on clusters of commodity machines. The MCMM model thus benefits from cloud computing by using MapReduce and Hadoop on a cloud platform. The data sets used in the experiments include real data obtained from the UCI repository and some simulated data, and the experiments used 4,000 clusters. The experiments show that the MCMM model outperforms KNN and some other commonly used baseline classification methods in both accuracy and scalability.
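A rough, single-machine sketch of the MST idea is given below, using Euclidean distance instead of the paper's task-specific metric and SciPy instead of a Hadoop job; the representative-selection rule is also a simplification.

```python
# Rough sketch of the MST idea behind MCMM: build a minimum spanning tree over
# the training points of each class, keep a few high-degree vertices as that
# class's representatives, and assign unlabeled points to the class of the
# nearest representative.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def representatives(points, n_reps=3):
    mst = minimum_spanning_tree(cdist(points, points)).toarray()
    degree = (mst > 0).sum(axis=0) + (mst > 0).sum(axis=1)
    return points[np.argsort(-degree)[:n_reps]]

def classify(x, reps_by_class):
    return min(reps_by_class,
               key=lambda c: cdist([x], reps_by_class[c]).min())

rng = np.random.default_rng(1)
reps = {"A": representatives(rng.normal(0, 1, (40, 2))),
        "B": representatives(rng.normal(5, 1, (40, 2)))}
print(classify(np.array([4.5, 4.8]), reps))   # expected: "B"
```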

14.
In recent years, energy-efficient database systems have become a research topic in the database field. CPU dynamic voltage and frequency scaling (DVFS) is an effective dynamic power-saving technique. This work explores the potential relationships among the performance, energy consumption, and power of query operations in the PostgreSQL database under different ACPI governors, and investigates the relationship between dynamic power management and the energy efficiency of database systems. By running the database generated by the TPC-H benchmark and its corresponding 22 queries, it summarizes the influence of the governors on the various operations involved in database query processing. The experimental results show that dynamic power management of a DBMS via DVFS is effective, that different query-processing operations have their own characteristics, and that exploiting these characteristics to design more efficient governors is a promising direction.
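Switching and inspecting the cpufreq governor before each TPC-H run can be done through the standard Linux sysfs interface; a small helper might look like this (writing requires root, and the PostgreSQL invocation in the comment is only a placeholder).

```python
# Sketch: read and switch the cpufreq governor for all CPUs via sysfs before
# running a benchmark query against PostgreSQL.
from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu")

def current_governors():
    return {p.parent.parent.name: p.read_text().strip()
            for p in CPUFREQ.glob("cpu[0-9]*/cpufreq/scaling_governor")}

def set_governor(name):
    for p in CPUFREQ.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        p.write_text(name)        # e.g. "performance", "powersave", "ondemand"

if __name__ == "__main__":
    print(current_governors())
    # set_governor("powersave")   # then run, e.g.: psql -f tpch_q01.sql ...
```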

15.
Energy consumption has been one of the most critical issues in the Chip Multiprocessor (CMP). Using the Dynamic Voltage and Frequency Scaling (DVFS), a CMP system can achieve a balance between the performance and the energy-efficiency. In this paper, we propose a three-phase discrete DVFS algorithm for a CMP system dedicated to applications where the period of the applications’ task graph is smaller than the deadline of tasks. In these applications, multiple task graphs are unrolled and then concatenated together to form a new task graph. The proposed DVFS algorithm is applied to the newly formed task graph to stretch tasks’ execution time, lower operating frequencies of processors and achieve the system power efficiency. Experimental results show that the proposed algorithm reduces the energy dissipation by 25% on average, compared to previous DVFS approaches.

16.
MapReduce is a programming model from Google for cluster-based computing in domains such as search engines, machine learning, and data mining. MapReduce provides automatic data management and fault tolerance to improve programmability of clusters. MapReduce’s execution model includes an all-map-to-all-reduce communication, called the shuffle, across the network bisection. Some MapReductions move large amounts of data (e.g., as much as the input data), stressing the bisection bandwidth and introducing significant runtime overhead. Optimizing such shuffle-heavy MapReductions is important because (1) they include key applications (e.g., inverted indexing for search engines and data clustering for machine learning) and (2) they run longer than shuffle-light MapReductions (e.g., 5x longer). In MapReduce, the asynchronous nature of the shuffle results in some overlap between the shuffle and map. Unfortunately, this overlap is insufficient in shuffle-heavy MapReductions. We propose MapReduce with communication overlap (MaRCO) to achieve nearly full overlap via the novel idea of including reduce in the overlap. While MapReduce lazily performs reduce computation only after receiving all the map data, MaRCO employs eager reduce to process partial data from some map tasks while overlapping with other map tasks’ communication. MaRCO’s approach of hiding the latency of the inevitably high shuffle volume of shuffle-heavy MapReductions is fundamental for achieving performance. We implement MaRCO in Hadoop’s MapReduce and show that on a 128-node Amazon EC2 cluster, MaRCO achieves 23% average speed-up over Hadoop for shuffle-heavy MapReductions.
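The eager-reduce idea can be illustrated with a word-count stand-in: the reducer folds each partial map output into a running aggregate as it arrives instead of waiting for all of them. MaRCO itself is implemented inside Hadoop, so this is only a conceptual sketch.

```python
# Conceptual sketch of lazy vs. eager reduce. Both produce the same result;
# the difference is that eager_reduce consumes each partial map output as it
# arrives, so reduce work overlaps with the remaining maps' shuffle.
from collections import Counter

def lazy_reduce(map_outputs):
    total = Counter()
    all_parts = list(map_outputs)   # wait for every map task to finish first
    for part in all_parts:
        total.update(part)
    return total

def eager_reduce(map_outputs):
    total = Counter()
    for part in map_outputs:        # fold in each partial result on arrival,
        total.update(part)          # while other map tasks are still running
    return total

parts = (Counter({"the": 3, "map": 1}), Counter({"the": 2, "reduce": 4}))
assert lazy_reduce(iter(parts)) == eager_reduce(iter(parts))
print(eager_reduce(iter(parts)))
```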

17.
Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications for massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. Fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce; however, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning more complex. This paper explains tuning of the Hadoop configuration parameters, which directly affect the performance of a MapReduce job's workflow under various conditions, in order to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to find the high-performance nodes, (2) utilizing the number of map/reduce slots in order to make execution more efficient in terms of execution time, and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler's task while reducing the memory management overhead.

18.
Adapting scientific computing problems to clouds using MapReduce
Cloud computing, with its promise of virtually infinite resources, seems well suited to solving resource-hungry scientific computing problems. To study this, we established a scientific computing cloud (SciCloud) project and environment on our internal clusters. The main goal of the project is to study the scope of establishing private clouds at universities. With these clouds, students and researchers can efficiently use the already existing resources of university computer networks to solve computationally intensive scientific, mathematical, and academic problems. However, to be able to run scientific computing applications on the cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. This paper summarizes the challenges associated with reducing iterative algorithms to the MapReduce model. Algorithms used in scientific computing are divided into different classes by how they can be adapted to the MapReduce model; examples from each such class are reduced to the MapReduce model and their performance is measured and analyzed. The study mainly focuses on the Hadoop MapReduce framework but also compares it to an alternative MapReduce framework called Twister, which is specifically designed for iterative algorithms. The analysis shows that Hadoop MapReduce has significant trouble with iterative problems while it suits embarrassingly parallel problems well, and that Twister can handle iterative problems much more efficiently. This work shows how to adapt algorithms from each class to the MapReduce model and what affects the efficiency and scalability of algorithms in each class, and, by mapping the advantages and disadvantages of the two frameworks, it allows us to judge which framework is more efficient for each of them. This study is of significant importance for scientific computing, which often uses complex iterative methods to solve critical problems; adapting such methods to cloud computing frameworks is not a trivial task.
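The iteration problem is easy to see with k-means: every iteration is a full map (assign points) plus reduce (recompute centroids) pass, and plain Hadoop restarts a job and re-reads the input for each one, which is the overhead Twister's long-lived tasks avoid. Below is a pure-Python stand-in for one such pass, looped; the data and iteration count are invented.

```python
# Sketch of an iterative algorithm expressed as repeated map/reduce passes:
# each loop iteration corresponds to one full MapReduce job in plain Hadoop.
import numpy as np

def map_assign(points, centroids):
    # map: emit (nearest centroid index) for every input point
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def reduce_update(points, assignment, k):
    # reduce: one group per centroid, new centroid = mean of assigned points
    return np.array([points[assignment == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
centroids = points[rng.choice(len(points), 2, replace=False)]
for _ in range(10):                      # 10 iterations = 10 Hadoop jobs
    centroids = reduce_update(points, map_assign(points, centroids), 2)
print(centroids.round(2))
```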

19.
胡持, 杨庚, 杨倍思, 闵兆娥. 《计算机应用》, 2015, 35(12): 3408–3412.
Based on the distributed nature of cloud computing, and combining homomorphic encryption with the MapReduce parallel framework in a Hadoop environment, a parallel homomorphic encryption scheme built on the MapReduce computing framework is proposed. A concrete parallel homomorphic encryption algorithm is implemented, and the security and correctness of the scheme are analyzed theoretically. Experiments on a 16-core computing cluster show that the speedup of data encryption can reach 13. The experimental results indicate that the MapReduce-based homomorphic encryption scheme can effectively reduce data encryption time and is well suited to real-time-oriented applications.
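The parallelization pattern can be sketched with the python-paillier (`phe`) additively homomorphic library standing in for the paper's own scheme and a local process pool standing in for MapReduce: each worker encrypts one chunk, mirroring what a map task would do on Hadoop.

```python
# Sketch: split the data into chunks and encrypt each chunk in a separate
# worker process, analogous to map tasks encrypting HDFS blocks in parallel.
from multiprocessing import Pool
from phe import paillier

def encrypt_chunk(args):
    pub, chunk = args
    return [pub.encrypt(x) for x in chunk]      # one 'map task' per chunk

if __name__ == "__main__":
    public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
    data = list(range(1, 101))
    chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
    with Pool(4) as pool:
        parts = pool.map(encrypt_chunk, [(public_key, c) for c in chunks])
    encrypted = [c for part in parts for c in part]
    # additive homomorphism: the sum of ciphertexts decrypts to the sum of data
    total = sum(encrypted[1:], encrypted[0])
    print(private_key.decrypt(total) == sum(data))   # True
```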

20.
The rapid development of big data and cloud computing technologies provides technical support for mining the rich scientific and economic value of meteorological data, and has promoted the wide application of Hadoop, its file storage system (HDFS, Hadoop Distributed File System), and its distributed computing model in meteorological data processing. Because meteorological data exhibit the 4V characteristics of big data, new data processing algorithms are also needed to improve processing efficiency. Based on a study of the principles of decision tree algorithms, a random forest model is built on the Hadoop cloud platform, offering a new way to apply data mining algorithms on cloud platforms. The meteorological big data cloud platform is designed around the CART (Classification And Regression Trees) mining algorithm, adopts the Hadoop system architecture and the MapReduce workflow, and is deployed as a cluster. The overall architecture is divided into an infrastructure layer, a data management and processing layer, and an application layer; the design reduces the time needed to build decision trees and provides platform functions such as efficient processing, mining, and analysis of meteorological data.
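The tree-per-block idea behind such a platform can be sketched as follows, with synthetic records in place of the meteorological data and local calls in place of map and reduce tasks.

```python
# Sketch of the forest-on-partitions idea: each map task trains one CART tree
# on its own data block, and the reduce step collects the trees into a forest
# that predicts by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           random_state=0)
blocks = np.array_split(np.arange(len(X)), 10)        # 10 stand-in 'HDFS blocks'

# map phase: one CART tree per block
forest = [DecisionTreeClassifier(max_depth=8, random_state=0).fit(X[idx], y[idx])
          for idx in blocks]

# reduce phase: majority vote over the collected trees
def predict(forest, X_new):
    votes = np.stack([tree.predict(X_new) for tree in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)

accuracy = (predict(forest, X) == y).mean()
print("training-set accuracy of the 10-tree forest: %.3f" % accuracy)
```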
