Similar Documents
 20 similar documents found (search time: 31 ms)
1.
邹裕 《计算机系统应用》2016,25(11):216-220
To address the high computation time of analyzing and extracting knowledge from massive data, a Hadoop-based knowledge extraction algorithm is proposed. Combining Hadoop's parallel processing capability with its distributed storage characteristics, this paper designs a knowledge extraction framework compatible with different prototype reduction methods. The reduction methods are parallelized with the MapReduce programming model, and a prototype-reduction combination rule with high classification accuracy and fast computation is designed. Experiments on real, large UCI data sets show that the framework speeds up nearest-neighbor classification by two orders of magnitude.
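A rough sketch of how such a map-side reduction step might look in Hadoop's Java API; the class names, the keep-every-k-th condensing rule, and the record layout are illustrative stand-ins, not the paper's actual prototype-reduction rules:

```java
// Hypothetical sketch: each mapper condenses its input split into a local
// prototype set; a reducer merges the partial sets into the final one.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PrototypeReductionJob {

  /** Map: buffer the local split, then emit a condensed prototype set. */
  public static class CondenseMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    private final List<String> samples = new ArrayList<>();

    @Override
    protected void map(Object key, Text value, Context ctx) {
      samples.add(value.toString());          // one labeled sample per line
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      // Placeholder for any prototype-reduction rule (e.g. CNN condensing):
      // here we simply keep every 10th sample to stand in for the real rule.
      for (int i = 0; i < samples.size(); i += 10) {
        ctx.write(NullWritable.get(), new Text(samples.get(i)));
      }
    }
  }

  /** Reduce: union the partial prototype sets into the final set. */
  public static class MergeReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text prototype : values) {
        ctx.write(NullWritable.get(), prototype);
      }
    }
  }
}
```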

2.
Hadoop, a distributed computing framework for processing massive data, has been widely adopted, but it has shortcomings when processing graph-structured data. Because graph data is strongly coupled, results cannot be obtained in a single MapReduce pass: iterative computation is required, and a single iteration may even need multiple MapReduce jobs. Restarting MapReduce jobs is expensive, and static data may be transferred unnecessarily during iteration. Building on Hadoop, this paper proposes a map-side storage strategy: static data is stored on the map side, and the computation that combines static and dynamic data is completed there, reducing the total running time of the whole iterative computation. Experiments on a modified Hadoop platform, compared against the original iteration scheme, show that the map-side storage strategy reduces running time to a certain extent.
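The closest stock-Hadoop approximation of this map-side storage idea is the distributed cache plus Mapper.setup(): static data is shipped to each map task once and held in memory, so only the dynamic per-iteration state flows through the job. A minimal sketch, with an assumed file name and record layout:

```java
// A minimal sketch, assuming static graph data was added to the distributed
// cache (e.g. via Job#addCacheFile("hdfs:/path/static.dat#static.dat")) and
// appears in the task's working directory under the symlink name.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StaticDataMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> staticData = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    URI[] cached = ctx.getCacheFiles();
    if (cached != null && cached.length > 0) {
      try (BufferedReader in = new BufferedReader(new FileReader("static.dat"))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] kv = line.split("\t", 2);   // nodeId \t adjacency list
          staticData.put(kv[0], kv[1]);
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // value carries only the *dynamic* per-iteration state, e.g. "nodeId\trank"
    String[] kv = value.toString().split("\t", 2);
    String neighbors = staticData.get(kv[0]);
    if (neighbors != null) {
      ctx.write(new Text(kv[0]), new Text(kv[1] + "|" + neighbors));
    }
  }
}
```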

3.
As a widely used parallel computing framework for big data processing, the Hadoop MapReduce framework puts more emphasis on high throughput of data than on low latency of job execution. However, more and more big data applications developed with MapReduce require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great significance in practice and has attracted increasing attention from both academia and industry. Many efforts have been made to improve the performance of Hadoop at the job-scheduling or job-parameter-optimization level. In this paper, we explore an approach to improving the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations on job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost of the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared with standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort to optimize the execution mechanism inside the map/reduce tasks of a job; its advantage is that it can complement job-scheduling optimizations to further improve job execution performance.

4.
Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, instead aim to achieve the long-held dream of moving compute to data. This kind of data-aware scheduling, typified by Hadoop MapReduce, has been successfully implemented in the Map phase, where each Map task is sent to the compute node holding the corresponding input data chunk. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, which makes complex logic such as pipelines or workflows difficult to handle. Meanwhile, it lacks built-in support and optimization for input datasets shared among multiple applications and/or jobs; performance can improve significantly when knowledge of shared and frequently accessed data informs scheduling decisions. To enhance workflow management in modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is not only data-aware but also shared-data-aware: it identifies the most frequently shared data at both the task and job level and replicates it to each compute node for data locality. It also supports user-defined multiple Map and Reduce functions, allowing users to orchestrate the required data-flow logic. Mathematically, we prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that the execution runtime speedup exceeds 4X compared with a traditional MapReduce implementation, with a manageable time overhead.

5.
Iterative computation is ubiquitous in big data processing, yet traditional MapReduce does not support it explicitly. In recent years, researchers have extended and improved the original MapReduce and developed several iterative MapReduce frameworks that better support iterative computation for big data processing. This paper surveys these iterative MapReduce programming frameworks, describing the research results in some detail, presenting the basic idea of each, analyzing their characteristics, strengths, and weaknesses, and comparing the techniques they adopt. Future development trends of iterative MapReduce are also discussed.
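As context for the survey, this is the baseline those frameworks improve on: a plain-Hadoop driver that resubmits one full MapReduce job per iteration, paying job-startup and HDFS round-trip costs on every pass. A sketch with identity map/reduce placeholders and a stubbed convergence test:

```java
// Plain-Hadoop iteration baseline. Mapper/Reducer are Hadoop's identity
// implementations standing in for real per-iteration logic; converged() is
// a stub for an application-specific test.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    int maxIterations = 20;

    for (int i = 0; i < maxIterations; i++) {
      Path output = new Path(args[1] + "/iter-" + i);
      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setInputFormatClass(KeyValueTextInputFormat.class); // Text keys
      job.setMapperClass(Mapper.class);     // identity map (placeholder)
      job.setReducerClass(Reducer.class);   // identity reduce (placeholder)
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) System.exit(1);

      // The per-iteration job startup and this HDFS round trip are exactly
      // the overheads iterative MapReduce frameworks aim to eliminate.
      if (converged(FileSystem.get(conf), output)) break;
      input = output;                       // feed this pass into the next
    }
  }

  /** Placeholder: compare successive outputs or read a job counter. */
  private static boolean converged(FileSystem fs, Path out) {
    return false;
  }
}
```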

6.
MapReduce is regarded as an adequate programming model for large-scale data-intensive applications. The Hadoop framework is a well-known MapReduce implementation that runs MapReduce tasks on a cluster system. G-Hadoop is an extension of the Hadoop MapReduce framework that allows MapReduce tasks to run on multiple clusters. However, G-Hadoop simply reuses the user authentication and job submission mechanism of Hadoop, which was designed for a single cluster. This work proposes a new security model for G-Hadoop. The model is based on several security solutions, such as public key cryptography and the SSL protocol, and is designed specifically for distributed environments. This security framework simplifies the user authentication and job submission process of the current G-Hadoop implementation with a single-sign-on approach. In addition, the designed security framework provides a number of different security mechanisms to protect the G-Hadoop system from traditional attacks.
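The abstract does not give the protocol details; as a hedged illustration, the public-key primitive such a single-sign-on design rests on can be shown with the standard java.security API: the user signs a job-submission token once, and any cluster holding the public key verifies it without a per-cluster login. The token format below is invented for the example.

```java
// Not G-Hadoop's actual protocol -- just the sign-once/verify-anywhere
// primitive a public-key single-sign-on scheme is built from.
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignOnTokenDemo {
  public static void main(String[] args) throws Exception {
    KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
    gen.initialize(2048);
    KeyPair userKeys = gen.generateKeyPair();

    // Token a user might attach to a job submission (fields illustrative).
    byte[] token = "user=alice;job=wordcount;ts=1700000000"
        .getBytes(StandardCharsets.UTF_8);

    // Sign once with the user's private key...
    Signature signer = Signature.getInstance("SHA256withRSA");
    signer.initSign(userKeys.getPrivate());
    signer.update(token);
    byte[] sig = signer.sign();

    // ...and any participating cluster verifies with the public key.
    Signature verifier = Signature.getInstance("SHA256withRSA");
    verifier.initVerify(userKeys.getPublic());
    verifier.update(token);
    System.out.println("token accepted: " + verifier.verify(sig));
  }
}
```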

7.
Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications over massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. Fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce; however, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning complex. This paper explains the tuning of the Hadoop configuration parameters that directly affect the performance of a MapReduce job workflow under various conditions. On the basis of the empirical data we collected, it became apparent that three main methodologies affect the execution time of MapReduce running on cluster systems. Therefore, we present a model consisting of three main modules: (1) extending a data redistribution technique to find the high-performance nodes, (2) utilizing the number of map/reduce slots to shorten execution time, and (3) developing a new hybrid routing schedule for the shuffle phase to define the scheduler task while reducing the memory-management load.
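To make the tuning concrete, the sketch below sets a few real Hadoop 2.x configuration properties of the kind such studies adjust (reducer count, shuffle buffers, per-task memory); the specific values are illustrative, not recommendations from the paper.

```java
// Real Hadoop 2.x property names; the values are illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConfig {
  public static Job buildJob() throws Exception {
    Configuration conf = new Configuration();

    // Parallelism: reducer count roughly sized to the cluster's reduce slots.
    conf.setInt("mapreduce.job.reduces", 16);

    // Shuffle: larger map-side sort buffer, more parallel copier threads.
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);

    // Memory: per-task container sizes (MB).
    conf.setInt("mapreduce.map.memory.mb", 1536);
    conf.setInt("mapreduce.reduce.memory.mb", 3072);

    return Job.getInstance(conf, "tuned-job");
  }
}
```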

8.

Cancer classification is one of the main steps in the patient healing process, which compels modern clinical researchers to use advanced bioinformatics methods for it. Cancer classification is usually performed on gene expression data obtained from microarray experiments using advanced machine learning methods. A microarray experiment generates a huge amount of data, and processing it with machine learning methods is a big challenge. In this study, a two-step classification paradigm that merges genetic algorithm feature selection with machine learning classifiers is used. The genetic algorithm is built in the MapReduce programming style, which makes it highly scalable on a Hadoop cluster. To improve its performance, the algorithm is extended into a parallel algorithm that processes microarray data in a distributed manner using the Hadoop MapReduce framework. The algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor), and its accuracy reached 100% with fewer than 25 selected features. The proposed cloud-computing-based MapReduce parallel genetic algorithm performed well on gene expression data, and it scales with the underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be effectively applied to real-world microarray data in the cloud environment, and that the Hadoop MapReduce framework yields a substantial decrease in computation time.
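The piece of a genetic algorithm that parallelizes most naturally in MapReduce is fitness evaluation, since each chromosome can be scored independently. A hedged sketch in which each input record is one chromosome encoded as a gene-selection bitmask; evaluateAccuracy() is a stub standing in for the real classifier:

```java
// Sketch of distributed GA fitness evaluation: the mapper scores one
// chromosome per record and emits (chromosome, fitness). The penalty weight
// and the accuracy stub are illustrative, not the paper's exact design.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FitnessMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable key, Text chromosome, Context ctx)
      throws IOException, InterruptedException {
    String mask = chromosome.toString().trim();   // e.g. "0110100...1"
    int selected = 0;
    for (char c : mask.toCharArray()) if (c == '1') selected++;

    // Fitness: classifier accuracy on the selected genes, with a small
    // penalty favoring compact feature subsets (weight illustrative).
    double fitness = evaluateAccuracy(mask) - 0.001 * selected;
    ctx.write(chromosome, new DoubleWritable(fitness));
  }

  /** Stub: a real implementation would train/score a classifier here. */
  private double evaluateAccuracy(String mask) {
    return 0.5;
  }
}
```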


9.
黄鑫  罗军 《集成技术》2013,2(2):69-82
Rapidly growing data provides us with more information but also challenges traditional information retrieval techniques. This paper proposes MCMM, a minimum spanning tree (MST) based algorithm for large-scale data classification built on MapReduce. It can be viewed as a model between the traditional KNN method and clustering-based classification, aiming to overcome the shortcomings of both while handling large-scale data. In this model, the training set is treated as a weighted undirected complete graph: vertices are objects, and the weight of the edge between two vertices is the distance between the objects, a task-specific distance measure rather than the Euclidean distance. A set of minimum spanning trees is then found in the graph, each tree representing one class. To reduce time complexity, the most representative points of each tree are extracted to stand for that tree, and unlabeled objects are classified by computing their distances to these compressed point sets. MCMM is implemented on MapReduce and deployed on Hadoop. The model scales to large data because Hadoop supports data-intensive distributed applications that can run with thousands of nodes, and MapReduce and Hadoop run well on clusters of commodity machines, so MCMM benefits from using a cloud platform through MapReduce and Hadoop. The experimental data sets include real data from the UCI repository and some simulated data, and the experiments used 4000 clusters. They show that MCMM outperforms KNN and other commonly used baseline classification methods in accuracy and scalability.
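The MST construction at the heart of MCMM can be illustrated with Prim's algorithm over a complete graph defined by a pluggable distance function; the paper uses a task-specific metric, so the Euclidean distance in this sketch is only a placeholder.

```java
// Prim's MST over a complete graph given by a distance function.
import java.util.Arrays;
import java.util.function.BiFunction;

public class MstSketch {
  /** Returns parent[i] = index of i's parent edge in a minimum spanning tree. */
  static int[] primMst(double[][] points,
                       BiFunction<double[], double[], Double> dist) {
    int n = points.length;
    double[] best = new double[n];       // cheapest known edge into the tree
    int[] parent = new int[n];
    boolean[] inTree = new boolean[n];
    Arrays.fill(best, Double.POSITIVE_INFINITY);
    Arrays.fill(parent, -1);
    best[0] = 0;

    for (int step = 0; step < n; step++) {
      int u = -1;
      for (int v = 0; v < n; v++)        // pick the cheapest fringe vertex
        if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
      inTree[u] = true;
      for (int v = 0; v < n; v++) {      // relax edges out of u
        double d = dist.apply(points[u], points[v]);
        if (!inTree[v] && d < best[v]) { best[v] = d; parent[v] = u; }
      }
    }
    return parent;
  }

  public static void main(String[] args) {
    double[][] pts = {{0, 0}, {1, 0}, {5, 5}, {6, 5}};
    int[] parent = primMst(pts, (a, b) ->
        Math.hypot(a[0] - b[0], a[1] - b[1]));
    System.out.println(Arrays.toString(parent));   // prints [-1, 0, 1, 2]
  }
}
```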

10.
Iterative computation is an important class of big data analysis applications. When iterative computation is implemented on the distributed computing framework MapReduce, it is decomposed into multiple jobs that run in sequence according to their dependencies, so the program interacts with the distributed file system (DFS) many times, which lengthens execution time. Caching the data involved in these interactions reduces DFS interaction time and thus improves overall program performance. Observing that a large amount of cluster memory is idle most of the time, this paper proposes MemLoop, a programming framework for iterative applications that uses in-memory caching. The system implements cache management through its job-submission API, scheduling algorithm, and cache-management module, making full use of memory to cache data that persists across iterations as well as dependent data within an iteration. We compared this framework with existing related frameworks; experimental results show that it improves the performance of iterative programs.

11.
As Internet users and content grow exponentially, large-scale similarity computation places ever higher demands on algorithmic efficiency. To improve execution efficiency, this paper analyzes the execution deficiencies of the algorithm under the MapReduce architecture and, exploiting Spark's suitability for iterative and interactive tasks, ports the algorithm from MapReduce to Spark based on a two-dimensional partitioning algorithm; parameter tuning, memory optimization, and other methods further improve execution efficiency. Experiments with two data sets on three clusters of different sizes show that, compared with MapReduce, the algorithm on Spark runs on average 4.715 times faster and consumes on average only 24.86% of Hadoop's energy, an energy-efficiency improvement of roughly 4x.

12.
An important property of today's big data processing is that the same computation is often repeated on datasets that evolve over time, such as web and social network data. While repeating full computation over the entire dataset is feasible with distributed computing frameworks such as Hadoop, it is obviously inefficient and wastes resources. In this paper, we present HadUP (Hadoop with Update Processing), a modified Hadoop architecture tailored to large-scale incremental processing with conventional MapReduce algorithms. Several approaches have been proposed to achieve a similar goal using task-level memoization, but task-level memoization detects changes to datasets at a coarse-grained level, which often makes such approaches ineffective. Instead, HadUP detects and computes changes to datasets at a fine-grained level using a deduplication-based snapshot differential algorithm (D-SD) and update propagation. As a result, it provides high performance, especially in environments where task-level memoization yields no benefit. HadUP requires only a small amount of extra programming effort because it can reuse the code of Hadoop map and reduce functions, so developing HadUP applications is quite easy.

13.
A distributed swarm pattern mining algorithm for spatio-temporal trajectory big data   (Cited by: 1; self-citations: 0; others: 1)
To meet the need for swarm pattern mining over spatio-temporal trajectory big data, an efficient distributed swarm pattern mining algorithm based on MapReduce is proposed. First, the concept of an object-set-closed swarm pattern based on maximal moving-object sets is introduced, and the serial mining algorithm is optimized with a minimal time-support set. Second, a parallel mining model for swarm patterns is proposed: exploiting the time-domain independence of swarm patterns, the clustering step and the pattern mining over time subdomains are parallelized. Third, a distributed parallel mining algorithm based on a chained MapReduce architecture is designed, which completes parallel swarm pattern mining quickly in four stages. Finally, the effectiveness and efficiency of the distributed algorithm are verified and analyzed on the Hadoop platform with a large real traffic trajectory data set.

14.
To address the waste of computing resources caused by the sequential constraints in the execution of the MapReduce distributed computing model on the Hadoop platform, this paper optimizes the model from the perspective of fine-grained parallel data processing on each worker node, combining it with Java shared-memory multithreading, and proposes MapReduce+OpenMP, a distributed parallel computing model that combines coarse and fine granularity. The model's performance and efficiency were verified by analyzing taxi GPS trajectory data of various sizes on a four-node Hadoop cluster. The experimental results show that the MapReduce+OpenMP distributed parallel computing model indeed improves computational efficiency on large data sets and is an effective refinement and optimization of the Hadoop big data analysis and processing model.
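Hadoop itself ships a pure-Java analogue of this fine-grained layer: the stock MultithreadedMapper runs a thread pool inside each map task, so every node can exploit its cores while MapReduce keeps the coarse-grained split. A sketch follows; InnerMapper is a placeholder for the real, thread-safe map logic, and the paper's actual design pairs MapReduce with OpenMP-style threading rather than this class.

```java
// Coarse grain: MapReduce splits; fine grain: a thread pool per map task
// via Hadoop's stock MultithreadedMapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJob {
  /** Placeholder map logic; must be thread-safe, as threads share one task. */
  public static class InnerMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text(value.toString()), new LongWritable(1));
    }
  }

  public static Job buildJob() throws Exception {
    Job job = Job.getInstance(new Configuration(), "mt-map");
    job.setJarByClass(MultithreadedJob.class);
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, InnerMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);  // threads per map task
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    return job;
  }
}
```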

15.
The HaLoop approach to large-scale iterative data analysis   (Cited by: 1; self-citations: 0; others: 1)
The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph analysis, and model fitting. This paper (an extended version of the VLDB 2010 paper "HaLoop: Efficient Iterative Data Processing on Large Clusters," PVLDB 3(1):285-296, 2010) presents HaLoop, a modified version of the Hadoop MapReduce framework designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler that exploits these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets: compared with Hadoop, HaLoop improved runtimes by a factor of 1.85 on average and shuffled only 4% as much data between mappers and reducers in the applications we tested.

16.
It is difficult to express the parallelism present in complex computations using existing higher-level abstractions such as MapReduce and Dryad. These computations include applications from a wide variety of domains, such as artificial intelligence, decision tree algorithms, association rule mining, recommender systems, graph algorithms, clustering algorithms, compute-intensive scientific workflows, and optimization algorithms. Their execution graphs introduce new challenges in programmer expressibility and runtime performance, such as iterative and recursive computations and a shared communication model. We propose an extension to MapReduce, called Generate-Map-Reduce (GMR), targeted at modeling these applications. GMR introduces a new Generate abstraction into the MapReduce framework that captures recursive computations. The runtime also supports iterative jobs and a distributed communication model through shared data structures. We illustrate recursive computations in GMR by modeling complex applications such as simulated annealing, A* search, and adaptive quadrature computation, which require recursive spawning of new tasks to handle varying degrees of parallelism. The GMR runtime supports caching of data common across iterations in memory and on local disks. We illustrate how this caching achieves significant speedups for iterative computations by modeling k-means clustering. Copyright © 2013 John Wiley & Sons, Ltd.

17.
Computing urban road travel times has long been one of the core problems in intelligent transportation systems research; accurate and efficient travel-time computation can effectively assist road management and reduce congestion. Facing the huge and rapidly growing volume of urban road traffic detection data, integrating distributed computing into the traditional travel-time computation problem has become a pressing issue. Based on massive road license-plate recognition data, this paper designs an algorithm for computing measured urban road travel times under the MapReduce programming model and implements it on Hadoop, supporting travel-time computation over user-defined road-segment sets for different time periods. Experiments show that, compared with the traditional approach, the MapReduce-based computation is more than ten times faster.
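The core per-segment aggregation such a system needs can be sketched as a single MapReduce job, under the assumption that upstream processing has already turned consecutive plate sightings into records of the form segmentId, hour of day, and travel seconds (the record layout is an assumption, not the paper's format).

```java
// Averages observed travel time per (segment, hour) bucket; input records
// are assumed to be "segmentId \t hourOfDay \t travelSeconds" lines.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TravelTimeJob {
  public static class SegmentMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");
      // key = segment + hour bucket, value = one observed travel time
      ctx.write(new Text(f[0] + ":" + f[1]),
                new DoubleWritable(Double.parseDouble(f[2])));
    }
  }

  public static class AvgReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> times, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0;
      long n = 0;
      for (DoubleWritable t : times) { sum += t.get(); n++; }
      ctx.write(key, new DoubleWritable(sum / n));  // mean travel time
    }
  }
}
```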

18.
When users store data in big data platforms, the integrity of outsourced data is a major concern for data owners because they lack direct control over the data. However, existing remote data auditing schemes for big data platforms apply only to static data. To verify the integrity of dynamic data on a Hadoop big data platform, we present a dynamic auditing scheme that meets Hadoop's special requirements. Concretely, a new data structure, the Data Block Index Table, is designed to support dynamic data operations on HDFS (Hadoop Distributed File System), including appending, inserting, deleting, and modifying. Combined with the MapReduce framework, a dynamic auditing algorithm is designed to audit the data on HDFS concurrently. Analysis shows that the proposed scheme is secure enough to resist forgery, replacement, and replay attacks on a big data platform, and it is efficient in both computation and communication.
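The abstract does not spell out the table's layout; the sketch below is one plausible shape for such an index table, not the paper's exact structure: one entry per block with a version counter and an authentication tag, so each dynamic operation touches only the affected entries.

```java
// Hypothetical Data Block Index Table shape (illustrative, not from the paper).
import java.util.ArrayList;
import java.util.List;

public class BlockIndexTable {
  public static final class Entry {
    long blockId;     // logical block identifier
    long version;     // bumped on every modification (anti-replay)
    byte[] tag;       // verifier-checkable authentication tag for the block

    Entry(long blockId, long version, byte[] tag) {
      this.blockId = blockId; this.version = version; this.tag = tag;
    }
  }

  private final List<Entry> entries = new ArrayList<>();
  private long nextBlockId = 0;

  /** Append: add a new block at the end of the file. */
  public void append(byte[] tag) {
    entries.add(new Entry(nextBlockId++, 1, tag));
  }

  /** Insert: splice a new block in at position i; later entries just shift. */
  public void insert(int i, byte[] tag) {
    entries.add(i, new Entry(nextBlockId++, 1, tag));
  }

  /** Modify: replace block i's tag and bump its version. */
  public void modify(int i, byte[] newTag) {
    Entry e = entries.get(i);
    e.tag = newTag;
    e.version++;
  }

  /** Delete: drop block i; the ids of the remaining blocks are unchanged. */
  public void delete(int i) {
    entries.remove(i);
  }
}
```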

19.
In heterogeneous Hadoop clusters, the mixed use of erasure-coded and replica storage modes, together with real-time differences in the computing power of server nodes, lowers MapReduce job processing efficiency. To mitigate this, this paper implements a scheduling strategy that dynamically adjusts MapReduce task assignment in highly concurrent scenarios according to data placement and real-time node load. The strategy modifies the data-placement policy of the current Hadoop framework and dynamically controls per-node task concurrency, achieving a more balanced allocation of resources among concurrent jobs. Experimental results show that, compared with Hadoop's two default job-scheduling policies, the proposed scheduling mode shortens job completion time by about 17% and effectively avoids the starvation some jobs would otherwise face.

20.
With the development of cloud computing, many MapReduce runtime systems have been developed, such as Hadoop, Phoenix, and Twister. Intuitively, Hadoop is highly scalable and stable and suits large-scale offline applications; Phoenix is fast and suits data-intensive tasks; Twister is a lightweight iterative system well suited to iterative applications. Different applications perform differently on different MapReduce runtime systems. By measuring the performance of various applications on these systems, this paper provides an experimental comparison and performance analysis, offering a basis for choosing a suitable parallel programming model for big data processing.
