20 similar documents retrieved.
1.
MapReduce has emerged as a popular programming model in the field of data-intensive computing, owing to its simple design, which makes it easy for programmers to use, and to framework implementations such as Hadoop, which have been adopted by large business and technology companies. In this paper we improve the Hadoop MapReduce framework by introducing algorithms suited to heterogeneous environments, with the goal of performing data-intensive computing efficiently in such environments. The need for these adaptations stems from the fact that, following the framework design proposed by Google, Hadoop is optimized to run in large homogeneous clusters. We therefore propose MRA++, a new MapReduce framework design that takes the heterogeneity of nodes into account during data distribution, task scheduling and job control. MRA++ establishes a training task to gather information prior to data distribution. We show that the delay this introduces in the setup phase is offset by the effectiveness of the mechanisms and algorithms, which achieve performance gains of more than 70% on 10 Mbps networks.
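The heterogeneity-aware data distribution the abstract describes can be illustrated with a minimal sketch: splits are handed out in proportion to each node's measured capacity rather than uniformly. This is not MRA++'s actual algorithm; the node names and capacity figures are invented for illustration.

```python
# Hypothetical sketch: assign input splits in proportion to node capacity,
# as a heterogeneity-aware alternative to uniform block placement.

def distribute_splits(splits, node_capacity):
    """Greedily assign splits so each node approaches its capacity share."""
    total = sum(node_capacity.values())
    # Target number of splits per node, proportional to its capacity.
    target = {n: c / total * len(splits) for n, c in node_capacity.items()}
    assigned = {n: [] for n in node_capacity}
    for s in splits:
        # Give the next split to the node furthest below its target share.
        node = max(target, key=lambda n: target[n] - len(assigned[n]))
        assigned[node].append(s)
    return assigned

plan = distribute_splits(list(range(10)),
                         {"fast": 4.0, "medium": 2.0, "slow": 1.0})
print({n: len(v) for n, v in plan.items()})
```

With these (invented) capacities, the fast node receives the most splits and the slow node the fewest, which is the intuition behind capacity-aware placement.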
2.
In this paper, we present a new MapReduce framework, called Grex, designed to leverage general purpose graphics processing units (GPUs) for parallel data processing. Grex provides several new features. First, it supports a parallel split method to tokenize input data of variable sizes, such as words in e-books or URLs in web documents, in parallel using GPU threads. Second, Grex evenly distributes data to map/reduce tasks to avoid data partitioning skews. In addition, Grex provides a new memory management scheme to enhance the performance by exploiting the GPU memory hierarchy. Notably, all these capabilities are supported via careful system design without requiring any locks or atomic operations for thread synchronization. The experimental results show that our system is up to 12.4× and 4.1× faster than two state-of-the-art GPU-based MapReduce frameworks for the tested applications.
3.
With the rapid development of cloud computing, the ever-growing scale of IT resources has made energy consumption an increasingly prominent problem. To reduce the high energy consumption incurred by the MapReduce programming model, this paper studies the resource-consumption characteristics of Map/Reduce tasks and the relationship between those characteristics and energy efficiency, aiming to find a resource model that can guide resource allocation and task scheduling and thereby optimize energy efficiency. The paper argues that a task's energy efficiency is independent of the amount of resources allocated to it, but depends on the ratio among the amounts of the various resources allocated, and that there exists an "optimal resource ratio" at which energy efficiency is maximized. On this basis, the paper first proposes general resource and energy-efficiency models, proves the relationship between the optimal resource ratio and energy efficiency at the model level, and quantifies idle resources and idle energy consumption; it then analyzes the MapReduce programming model and adapts the general resource-ratio model to MapReduce. Using an abstract producer-consumer model of the data, the optimal resource ratio for Map/Reduce tasks is derived. Finally, experiments confirm the existence of the optimal resource ratio from the perspectives of both task energy efficiency and idle energy consumption; based on the experimental results, the MapReduce execution process is partitioned and the optimal resource ratios of some Map/Reduce tasks are given. The proposal and derivation of the optimal resource ratio should facilitate research on task-scheduling and resource-allocation algorithms based on it, and thus improve the energy efficiency of Map/Reduce tasks.
4.
Clustering is a useful data mining technique which groups data points such that the points within a single group have similar characteristics, while the points in different groups are dissimilar. Density-based clustering algorithms such as DBSCAN and OPTICS are one kind of widely used clustering algorithms. As there is an increasing trend of applications dealing with vast amounts of data, clustering such big data is a challenging problem. Recently, parallelizing clustering algorithms on a large cluster of commodity machines using the MapReduce framework has received a lot of attention. In this paper, we first propose a new density-based clustering algorithm, called DBCURE, which is robust in finding clusters with varying densities and suitable for parallelization with MapReduce. We next develop DBCURE-MR, a parallelized version of DBCURE using MapReduce. While traditional density-based algorithms find each cluster one by one, our DBCURE-MR finds several clusters together in parallel. We prove that both DBCURE and DBCURE-MR find the clusters correctly based on the definition of density-based clusters. Our experimental results with various data sets confirm that DBCURE-MR finds clusters efficiently without being sensitive to clusters with varying densities, and that it scales up well with the MapReduce framework.
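To make the density-based notion concrete, here is a minimal DBSCAN-style sketch (not the paper's DBCURE algorithm): a point with at least `min_pts` neighbours within `eps` seeds a cluster, which then grows through density-reachable points; unreachable points are labelled noise. The sample points are invented.

```python
# Minimal DBSCAN-style density clustering sketch on 1-D points.

def dbscan(points, eps, min_pts):
    labels = {}                     # point index -> cluster id (-1 = noise)
    cluster = 0

    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point
            continue
        labels[i] = cluster
        frontier = [j for j in seeds if j != i]
        while frontier:
            j = frontier.pop()
            if j in labels:
                if labels[j] == -1:
                    labels[j] = cluster   # noise reclaimed as a border point
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:        # core point: keep expanding
                frontier.extend(nb)
        cluster += 1
    return labels

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.9]
labels = dbscan(pts, eps=0.3, min_pts=2)
print(labels)
```

Here the two dense groups become two clusters and the isolated point at 9.9 is labelled noise; DBCURE's contribution, per the abstract, is handling varying densities and parallelizing this kind of expansion.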
5.
6.
Rong Gu, Xiaoliang Yang, Jinshuang Yan, Yuanhao Sun, Bing Wang, Chunfeng Yuan, Yihua Huang 《Journal of Parallel and Distributed Computing》2014
As a widely used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high throughput of data than on low latency of job execution. However, more and more big data applications developed with MapReduce require quick response time. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great significance in practice and has attracted more and more attention from both academia and industry. A lot of effort has been made to improve the performance of Hadoop at the job-scheduling or job-parameter-optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework we reveal two critical limitations to job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost during the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism for accelerating performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims at shortening the execution time of MapReduce jobs, especially short jobs. Experimental results show that compared to standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH).
To the best of our knowledge, this work is the first effort to explore optimizing the execution mechanism inside the map/reduce tasks of a job. The advantage is that it can complement job scheduling optimizations to further improve job execution performance.
7.
Design and Implementation of a Job Scheduling Algorithm for Multi-User MapReduce Clusters
As more enterprises use data-intensive cluster computing systems such as Hadoop and Dryad to implement a growing range of applications, there is an increasing need for multiple users to share a MapReduce cluster, which both reduces the cost of building separate clusters and lets users share more large data sets. Building on the fair scheduling algorithm and combining slot-allocation delay and priority techniques, this paper proposes an improved algorithm that achieves better data locality and improves overall system performance metrics such as throughput and response time; at the same time, to support differentiated commercial service, user-level priorities are set to guarantee the completion of urgent jobs.
8.
As applications expand, large-scale graph data keeps emerging, and how to analyze graphs with huge numbers of vertices has become a focus of research. The sheer number of vertices and the complexity of the analysis mean that graph analysis tasks must be completed in parallel on multiple machines using the MapReduce platform. On this platform, existing PageRank algorithms must scan and transmit the complete state of all pages in every iteration, and the I/O and network-transfer overhead severely limits efficiency. This paper therefore proposes GCPR (Graph-clustering PageRank), a graph-partitioning-based method for accelerating PageRank on MapReduce. GCPR applies graph partitioning and two-level data compression to the iterative PageRank computation, which not only reduces the I/O and network-transfer overhead of the intermediate Map-to-Reduce stage (one of the main bottlenecks of MapReduce computation) but also balances computing resources. Experiments show that GCPR greatly improves the efficiency of PageRank computation on the MapReduce platform.
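The per-iteration state that GCPR tries to avoid shipping can be seen in the textbook map/reduce formulation of PageRank below: each iteration the map phase emits every page's rank shares and the reduce phase re-aggregates them. This is plain PageRank, not the paper's optimized GCPR; the tiny graph is invented.

```python
# One PageRank iteration expressed as map and reduce functions.
from collections import defaultdict

DAMPING = 0.85

def map_phase(graph, ranks):
    """Emit (target, rank share) pairs for every outgoing link."""
    for page, links in graph.items():
        for target in links:
            yield target, ranks[page] / len(links)

def reduce_phase(pairs, n_pages):
    """Sum incoming shares and apply the damping formula."""
    sums = defaultdict(float)
    for target, share in pairs:
        sums[target] += share
    return {p: (1 - DAMPING) / n_pages + DAMPING * s for p, s in sums.items()}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1 / 3 for p in graph}
for _ in range(20):
    new = reduce_phase(map_phase(graph, ranks), len(graph))
    # Pages with no in-links (none here) keep the baseline rank.
    ranks = {p: new.get(p, (1 - DAMPING) / len(graph)) for p in graph}
print(ranks)
```

Every iteration re-emits the full rank vector across the map/reduce boundary, which is exactly the I/O and shuffle cost the abstract identifies as the bottleneck on large graphs.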
9.
A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud
《Journal of Computer and System Sciences》2014,80(5):1008-1020
In big data applications, data privacy is one of the biggest concerns, because processing large-scale privacy-sensitive data sets often requires computation resources provisioned by public cloud services. Sub-tree data anonymization is a widely adopted scheme to anonymize data sets for privacy preservation. Top-Down Specialization (TDS) and Bottom-Up Generalization (BUG) are two ways to fulfill sub-tree anonymization. However, existing approaches to sub-tree anonymization fall short of parallelization capability, and thereby lack scalability in handling big data in the cloud. Moreover, either TDS or BUG alone suffers from poor performance for certain values of the k-anonymity parameter. In this paper, we propose a hybrid approach that combines TDS and BUG for efficient sub-tree anonymization over big data. Further, we design MapReduce algorithms for the two components (TDS and BUG) to gain high scalability. Experimental evaluation demonstrates that the hybrid approach significantly improves the scalability and efficiency of the sub-tree anonymization scheme over existing approaches.
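The generalization step underlying sub-tree anonymization can be sketched in a few lines: attribute values are lifted to their parent in a taxonomy tree until every equivalence class holds at least k records. The taxonomy and records below are invented, and the paper's TDS/BUG search strategies are not reproduced; this only illustrates the basic operation.

```python
# Toy sub-tree (taxonomy-based) generalization toward k-anonymity.
from collections import Counter

# Hypothetical occupation taxonomy: leaves generalize upward to "ANY".
PARENT = {"nurse": "medical", "doctor": "medical",
          "clerk": "clerical", "lawyer": "legal",
          "medical": "ANY", "clerical": "ANY", "legal": "ANY"}

def generalize(records, k):
    values = list(records)
    while True:
        counts = Counter(values)
        # Values appearing in fewer than k records must be generalized.
        rare = {v for v, c in counts.items() if c < k and v != "ANY"}
        if not rare:
            return values
        values = [PARENT[v] if v in rare else v for v in values]

out = generalize(["nurse", "doctor", "clerk", "clerk"], k=2)
print(out)
```

Here "nurse" and "doctor" are each unique, so both are lifted to "medical", after which every value occurs at least twice; TDS and BUG differ in whether they search this lattice of generalizations from the top down or the bottom up.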
10.
MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.
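The overlap idea can be sketched with a thread and a queue standing in for MPI ranks and messages: the reducer consumes partial intermediate pairs while the mapper is still producing them, rather than waiting for the map phase to finish. This is a simplified single-machine analogue of the paper's pipeline, using WordCount (one of its benchmark applications) as the example.

```python
# Pipelined map/reduce overlap sketch: reducer consumes while mapper produces.
import threading
import queue
from collections import defaultdict

def mapper(docs, out):
    for doc in docs:
        for word in doc.split():
            out.put((word, 1))       # stream pairs as soon as they exist
    out.put(None)                    # end-of-stream marker

def reducer(inp, result):
    while True:
        item = inp.get()
        if item is None:
            break
        word, count = item
        result[word] += count        # running aggregation, pipelined

docs = ["the quick fox", "the lazy dog", "the fox"]
pairs = queue.Queue()
counts = defaultdict(int)
t = threading.Thread(target=mapper, args=(docs, pairs))
t.start()
reducer(pairs, counts)               # runs concurrently with the mapper
t.join()
print(dict(counts))
```

In the paper's setting the queue is replaced by MPI message exchange between map and reduce processes, but the scheduling benefit is the same: a slow map phase no longer serializes the whole job.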
11.
Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework
Nitesh Maheshwari, Radheshyam Nanduri, Vasudeva Varma 《Future Generation Computer Systems》2012,28(1):119-127
With the recent emergence of cloud computing based services on the Internet, MapReduce and distributed file systems like HDFS have emerged as the paradigm of choice for developing large scale data intensive applications. Given the scale at which these applications are deployed, minimizing power consumption of these clusters can significantly cut down operational costs and reduce their carbon footprint—thereby increasing the utility from a provider’s point of view. This paper addresses energy conservation for clusters of nodes that run MapReduce jobs. The algorithm dynamically reconfigures the cluster based on the current workload and turns cluster nodes on or off when the average cluster utilization rises above or falls below administrator specified thresholds, respectively. We evaluate our algorithm using the GridSim toolkit and our results show that the proposed algorithm achieves an energy reduction of 33% under average workloads and up to 54% under low workloads.
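The threshold rule described in the abstract can be sketched as a simple control loop: a node is powered on when average utilization exceeds an upper threshold and powered off when it falls below a lower one. The threshold values and workload trace below are invented, and the data-placement side of the paper's algorithm is omitted.

```python
# Hypothetical sketch of threshold-based cluster reconfiguration.

def reconfigure(active_nodes, total_nodes, load, low=0.3, high=0.8):
    """Return the new active-node count for the given total load."""
    util = load / active_nodes
    if util > high and active_nodes < total_nodes:
        active_nodes += 1          # scale out under pressure
    elif util < low and active_nodes > 1:
        active_nodes -= 1          # consolidate and power a node down
    return active_nodes

nodes = 4
for load in [3.8, 3.8, 0.5, 0.5, 0.5]:
    nodes = reconfigure(nodes, total_nodes=8, load=load)
print(nodes)
```

On this trace the cluster grows to 5 nodes under the heavy load and then shrinks back to 2 as load falls, which is the consolidation behaviour that produces the reported energy savings under low workloads.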
12.
YARM: An Efficient and Scalable MapReduce-Based Semantic Reasoning Engine
With the rapid development of the Semantic Web, RDF semantic data is emerging in large quantities. A major problem in reasoning over large-scale RDF data is its computational cost: the computation takes a very long time to complete. Traditional single-machine reasoning engines clearly cannot handle semantic data at this scale. On the other hand, existing MapReduce-based large-scale reasoning engines lack optimizations for execution efficiency in distributed and parallel environments, so reasoning still takes a long time; moreover, most existing engines fall short in scalability and cannot keep up with the growth of large-scale semantic data. To address these shortcomings in execution efficiency and scalability, this paper proposes YARM, a parallel MapReduce-based semantic reasoning algorithm and engine. To achieve efficient reasoning in distributed and parallel environments, YARM makes four optimizations: (1) a well-designed data-partitioning model and parallel algorithm reduce communication overhead between compute nodes; (2) the execution order of reasoning rules is optimized to speed up inference; (3) a simple deduplication strategy avoids launching extra jobs to process duplicate data; (4) a new MapReduce-based parallel reasoning algorithm is designed and implemented. Experiments on real and large-scale synthetic data sets show that YARM runs about 10 times faster than the latest MapReduce-based reasoning engines, while exhibiting better data and system scalability.
13.
As Grids rapidly expand in size and complexity, the task of benchmarking and testing, interactive or unattended, quickly becomes unmanageable. In this article we describe the difficulties of testing/benchmarking resources in large Grid infrastructures and we present the software architecture implementation of GridBench, an extensible tool for testing, benchmarking and ranking of Grid resources. We give an overview of GridBench services and tools, which support the easy definition, invocation and management of tests and benchmarking experiments. We also show how the tool can be used in the analysis of benchmarking results and how the measurements can be used to complement the information provided by Grid Information Services and used as a basis for resource selection and user-driven resource ranking. In order to illustrate the usage of the tool, we describe scenarios for using the GridBench framework to perform test/benchmark experiments and analyze the results.
14.
Christos Dimou, Andreas L. Symeonidis, Pericles A. Mitkas 《Expert Systems with Applications》2009,36(4):7630-7643
Driven by the urgent need to thoroughly identify and accentuate the merits of agent technology, we present in this paper MEANDER, an integrated framework for evaluating the performance of agent-based systems. The proposed framework is based on the Agent Performance Evaluation (APE) methodology, which provides guidelines and representation tools for performance metrics, measurement collection and aggregation of measurements. MEANDER comprises a series of integrated software components that implement and automate various parts of the methodology and assist evaluators in their tasks. The main objective of MEANDER is to integrate performance evaluation processes into the entire development lifecycle, while clearly separating any evaluation-specific code from the application code at hand. In this paper, we describe in detail the architecture and functionality of the MEANDER components and test its applicability on an existing multi-agent system.
15.
HyDB: An Efficient SaaS Architecture Integrating MapReduce and Databases
With the rapid growth of data and the rise of cloud computing, Software as a Service (SaaS) marks the rise of on-demand computing applications. Efficient and economical SaaS has led many enterprises to move large-scale data analysis services from high-end servers running parallel databases to cheaper clusters of low-end servers in a shared-nothing architecture. This paper proposes HyDB, an efficient and economical SaaS architecture that integrates MapReduce and databases to provide efficient query services over massive structured, semi-structured and unstructured data. By studying data storage and query models, it presents a complete data storage and query service scheme, gives a queue-based job scheduling algorithm, and supports a fast-response mode for reduced data queries. Finally, scalability experiments show that the architecture offers good load performance, query performance and fault tolerance, and can provide users with high-quality data services.
16.
Despite the importance given to the computational efficiency of multibody system (MBS) simulation tools, there is a lack of standard benchmarks to measure the performance of these kinds of software applications. Benchmarking is done on an individual basis: different sets of problems are used, and the procedures and conditions considered to measure computational efficiency are also different. In this scenario, it becomes almost impossible to compare the performance of the different available simulation methods in an objective and quantitative way. This work proposes a benchmarking system for MBS simulation tools. The structure of the benchmark problem collection is defined, and a group of five problems involving rigid bodies is proposed. For these problems, documentation and validated reference solutions in standard formats have been generated, and a procedure to measure the computational efficiency of a given simulation software is described. Finally, the benchmarking system has been applied to evaluate the performance of two different simulation tools: ADAMS/Solver, a popular general-purpose commercial MBS simulation tool, and a custom Fortran code implementation of an Index-3 augmented Lagrangian formulation with projections combined with the implicit single-step trapezoidal rule as integration scheme. Results show that the proposed problems are able to reach the limits of the tested simulation methods, and therefore they can be considered good benchmark problems.
17.
18.
An efficient approach to human motion modeling for the verification of human-centric product design and manufacturing in virtual environments
The efficient and reliable human-integrated design of products and processes is a major goal of the manufacturing industry. Thus, numerous human-related product functionality and manufacturability aspects need to be verified, using simulation, in the context of the product development procedures. The natural representation of human motions in virtual environments is crucial for the reliability of the simulation results. In this context, the paper presents an efficient approach to human motion analysis and modeling with respect to the anthropometric parameters, based on real motion data. Statistical methods employed for the analysis of the data and additive motion models are derived. The models are capable of predicting human motion and driving digital humans in product and worker simulation environments. A specific test case is presented to demonstrate the application of the suggested methodology on a real industrial problem.
19.
Accurate measurement and modeling of network performance is important for predicting and optimizing the running time of high-performance computing applications. Although the LogP family of models has proven to be a valuable tool for assessing the communication performance of parallel architectures, non-intrusive LogP parameter assessment of real systems remains a difficult task. Based on an analysis of accuracy and contention properties of existing measurement methods, we develop a new low-overhead measurement method which also assesses protocol changes in the underlying transport layers. We use the gathered parameters to simulate LogGP models of collective operations and demonstrate the errors in common benchmarking methods for collective operations. The simulations provide new insight into the nature of collective algorithms and their pipelining properties. We show that the error of conventional benchmark methods can grow linearly with the system size.
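The LogGP point-to-point cost model behind this abstract can be written as a one-line helper: a k-byte message costs roughly o + (k - 1)·G + L + o (send overhead, per-byte gap for the payload, wire latency, receive overhead). This is the standard LogGP formulation, not the paper's measurement method, and the parameter values below are illustrative rather than measured.

```python
# LogGP estimate of one-way point-to-point message time.
# L: latency, o: per-message CPU overhead, G: gap per byte, k: message bytes.

def loggp_time(k, L, o, G):
    """Estimated one-way time of a k-byte message under LogGP."""
    return o + (k - 1) * G + L + o

# Illustrative parameters (microseconds / microseconds-per-byte).
small = loggp_time(k=1, L=5.0, o=1.0, G=0.01)
large = loggp_time(k=1024, L=5.0, o=1.0, G=0.01)
print(small, large)
```

The (k - 1)·G term is what makes large messages bandwidth-bound while small messages are dominated by L and o, which is why accurate per-parameter measurement matters when simulating collective operations with such models.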
20.
Requirements Engineering - Model-based testing (MBT) is a method that supports the design and execution of test cases by models that specify the intended behaviors of a system under test. While...