Similar literature
Found 20 similar documents (search time: 15 ms)
1.
MapReduce in MPI for Large-scale graph algorithms   Cited by: 1 (self-citations: 0, citations by others: 1)

2.
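A distributed affinity propagation clustering algorithm based on MapReduce   Cited by: 2 (self-citations: 0, citations by others: 2)
With the rapid development of information technology, data volumes are growing sharply, and processing large-scale data is very challenging. Many parallel algorithms have been proposed, such as MapReduce-based distributed K-means clustering and distributed spectral clustering. Affinity propagation (AP) clustering overcomes the limitations of K-means clustering, but it performs poorly on massive data. To cluster massive data effectively, a MapReduce-based distributed affinity propagation clustering algorithm, DisAP, is proposed. The algorithm first randomly partitions the data points into subsets of similar size and sparsifies each subset in parallel using AP clustering. It then merges the sparsified data from all subsets and runs AP clustering again; the resulting exemplars serve as the cluster centers for all data points. Experiments on synthetic data, face image data, the IRIS data set, and large-scale data sets show that DisAP adapts well to data scale and effectively reduces clustering time while preserving the clustering quality of AP.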

3.
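To simplify the offline data processing workflow of the Jiangmen neutrino experiment and reduce resource consumption, a general-purpose software system for data processing in distributed computing environments is proposed. Communication and data exchange between nodes are implemented on the Message Passing Interface, and a Master/Worker architecture manages the life cycle of computing jobs, including job splitting, allocation of computing resources, and task execution and monitoring. Test results show that the system scales well and that the data it produces are consistent with the data produced by manually running the job scripts for the simulation software step by step.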

4.
MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it provides an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing after the map function has terminated. If the map function is slow for any reason, the whole running time suffers. In this paper, we propose MapReduce overlapping using MPI, an adapted structure of the MapReduce programming model for fast data-intensive processing. Our implementation runs the map and reduce functions concurrently by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.
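The overlapping idea in this abstract can be illustrated with a small WordCount sketch. It is only an illustration under stated assumptions: Python threads and a `queue.Queue` stand in for MPI processes and messages, and the names `mapper` and `overlapped_wordcount` are hypothetical, not taken from the paper.

```python
import queue
import threading
from collections import Counter

_DONE = object()  # sentinel marking the end of one mapper's output


def mapper(lines, out_q):
    """Emit a partial word count per line, then signal completion."""
    for line in lines:
        out_q.put(Counter(line.split()))
    out_q.put(_DONE)


def overlapped_wordcount(chunks):
    """Run one mapper thread per chunk while reducing concurrently."""
    q = queue.Queue()
    threads = [threading.Thread(target=mapper, args=(c, q)) for c in chunks]
    for t in threads:
        t.start()
    totals, finished = Counter(), 0
    # The reducer (this thread) consumes partial results as they arrive
    # instead of waiting for every mapper to terminate first.
    while finished < len(threads):
        item = q.get()
        if item is _DONE:
            finished += 1
        else:
            totals.update(item)
    for t in threads:
        t.join()
    return dict(totals)
```

Calling `overlapped_wordcount([["a b a"], ["b c"]])` merges the per-line counts from both mappers into a single dictionary of word totals.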

5.
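High time and space complexity, together with insufficient memory on physical machines, prevents traditional clustering algorithms from effectively analyzing large-scale network data. To address this, a distributed clustering algorithm for network data is proposed on top of the MapReduce distributed model. Following MRC theory, the number of MapReduce rounds is bounded to control the time spent in shuffling; in-mapper combining is used to limit network traffic; and when intermediate results are merged, only communities are merged, without considering the nodes inside them, to limit memory overhead. Experiments on synthetically generated data in a cluster show that the algorithm achieves good speedup and scalability as the data size and the cluster size grow.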

6.
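An improved Canopy-Kmeans algorithm based on MapReduce   Cited by: 2 (self-citations: 0, citations by others: 2)
To address the randomness of Canopy selection in the distributed Canopy-Kmeans algorithm, the algorithm is improved using the "min-max principle", which avoids blind Canopy selection. The algorithm is parallelized with the MapReduce framework so that it can fully use the computing and storage capacity of a cluster and thus suit massive-data scenarios. The improved algorithm is evaluated experimentally on the clustering of massive Internet news data. The results show that, compared with choosing Canopies at random, the method clearly improves classification accuracy and noise resistance, and it shows a considerable performance advantage on massive data.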
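One common reading of a "min-max principle" for choosing centers is farthest-first selection: each new center is the point whose minimum distance to the centers chosen so far is largest. The sketch below shows that reading only; the abstract does not give the authors' exact procedure, and the function name `minmax_centers` and the choice of the first point as seed are assumptions.

```python
import math


def minmax_centers(points, k):
    """Pick up to k centers: each new one maximizes its minimum
    distance to the centers already chosen (farthest-first)."""
    centers = [points[0]]  # assumption: seed with the first point
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centers
```

The input must be a non-empty list of equal-length coordinate tuples; `math.dist` requires Python 3.8 or later.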

7.
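k-modes is a representative clustering algorithm for categorical data. The implementation of k-modes is first improved: by updating the frequency counts of each attribute value in a cluster whenever a data object is assigned to that cluster, the new cluster centers can be computed in a single pass over all data objects. To let k-modes handle large-scale categorical data, the algorithm is implemented on the Hadoop platform with the MapReduce parallel computing model. Experiments show that, on large amounts of data, parallel k-modes greatly shortens clustering time compared with serial k-modes and achieves good speedup.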
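The single-pass update described here can be sketched as follows. This is a minimal serial illustration, assuming Hamming distance as the dissimilarity measure; the names `hamming` and `kmodes_pass` are hypothetical, and the Hadoop/MapReduce parallelization is not shown.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes_pass(records, modes):
    """One pass over the data: assign each record and update counts."""
    n_attrs = len(modes[0])
    # counts[c][j] tallies the values of attribute j seen in cluster c
    counts = [[Counter() for _ in range(n_attrs)] for _ in modes]
    labels = []
    for rec in records:
        c = min(range(len(modes)), key=lambda i: hamming(rec, modes[i]))
        labels.append(c)
        for j, value in enumerate(rec):
            counts[c][j][value] += 1  # updated at assignment time
    # New mode = most frequent value per attribute; keep old mode if empty
    new_modes = [
        tuple(cnt[j].most_common(1)[0][0] if cnt[j] else old[j]
              for j in range(n_attrs))
        for cnt, old in zip(counts, modes)
    ]
    return labels, new_modes
```

Because the counts are maintained during assignment, no second scan of the records is needed to derive the new modes; iterating `kmodes_pass` until the modes stop changing gives the full algorithm.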

8.
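To address the large differences in completion time among the Reduce tasks in the MapReduce model, the factors affecting Reduce task completion time are analyzed, data skew across Reduce task nodes is identified, and an improved MapReduce model, MBR (Map-Balance-Reduce), is proposed. An added Balance task balances the intermediate data produced by the Map tasks so that the data assigned to the Reduce task nodes is fairly even, which keeps the completion times of the Reduce tasks roughly the same. Simulation results show that, after the Balance task, the intermediate data produced by the Map tasks is distributed fairly evenly among the Reduce task nodes, balancing the computation and reducing the execution time of the whole job to some extent.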
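The abstract does not say how the Balance task distributes the intermediate data. As one possible illustration, the sketch below uses a greedy largest-first heuristic: keys are sorted by the size of their intermediate data and each is given to the reducer with the smallest load so far. The heuristic, the function name `balance`, and the `key_sizes` input are all assumptions.

```python
import heapq


def balance(key_sizes, n_reducers):
    """Assign keys to reducers greedily, largest key first, always
    choosing the reducer with the smallest load so far."""
    heap = [(0, r) for r in range(n_reducers)]  # (load, reducer id)
    assignment = {}
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + size, r))
    loads = [0] * n_reducers
    for key, r in assignment.items():
        loads[r] += key_sizes[key]
    return assignment, loads
```

With default hash partitioning, a few heavy keys can overload a single reducer; a size-aware assignment of this kind is one way to even out the per-reducer load.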

9.
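Based on the characteristics of RapidIO networks, the layered design of MPICH2 and the MPI communication methods built on the TCP and SCTP network protocols are analyzed. By redefining the CH3 layer under ADI3, a RapidIO-based MPI device layer is designed and implemented, establishing a communication channel from MPI to RapidIO and realizing multi-stream communication. Experiments on machines fitted with RapidIO network cards show that this dedicated MPI device layer outperforms the Ethernet emulator in bandwidth and latency, and performs even better for large data transfers.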

10.
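To monitor cluster systems effectively, a simple and practical parallel task model is built on the Message Passing Interface (MPI) parallel library. Cluster monitoring in this task model, the structure of the node load-balancing evaluation model, and data collection on Linux clusters are described in detail. Experiments show that the model is simple to configure, has low resource overhead, and interferes little with the cluster system.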

11.
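Based on the characteristics of clusters and the properties of clustering, the feasibility of parallelizing clustering is explored theoretically and then verified experimentally. The results show that these improvements yield fairly good performance.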

12.
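Because large-scale data is stored in a distributed fashion, finding the top-k most similar data pairs is difficult. To address this, a MapReduce-based scheme for querying the k most similar pairs is proposed. The scheme first divides all data pairs into groups, for which an all-pairs grouping algorithm and a core-pairs grouping algorithm are proposed. It computes the k most similar pairs within each group separately and then selects the k pairs with the highest similarity from the per-group results, thereby correctly determining the k most similar pairs. Experiments on synthetic and real data evaluate the algorithms by varying the number of pairs k and the number of machines s. The results show that increasing the number of machines s improves running efficiency and scalability, while changes in k have little effect on the MapReduce-based algorithms.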
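The group-then-merge step can be sketched in a few lines. This is a serial illustration under stated assumptions: pairs are dealt into groups round-robin rather than by the paper's all-pairs or core-pairs grouping algorithms, Jaccard similarity on sets is an arbitrary choice, and the names `similarity` and `topk_pairs` are hypothetical.

```python
import heapq
from itertools import combinations


def similarity(a, b):
    """Jaccard similarity of two sets (an illustrative choice)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def topk_pairs(items, k, n_groups):
    """Global top-k similar pairs via per-group top-k, then a merge."""
    pairs = list(combinations(range(len(items)), 2))
    # "Map": deal the pairs round-robin into n_groups groups
    groups = [pairs[g::n_groups] for g in range(n_groups)]
    local = []
    for group in groups:
        scored = [(similarity(items[i], items[j]), i, j) for i, j in group]
        local.extend(heapq.nlargest(k, scored))  # local top-k per group
    # "Reduce": the global top-k is contained in the union of local top-k
    return heapq.nlargest(k, local)
```

The merge is correct because any pair in the global top-k must also rank in the top-k of whichever group it falls into, so the result does not depend on the number of groups.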

13.
This paper proposes a new hash-based indexing method to speed up fingerprint identification in large databases. A Locality-Sensitive Hashing (LSH) scheme has been designed relying on Minutiae Cylinder-Code (MCC), which has proved very effective in mapping a minutiae-based representation (position/angle only) into a set of fixed-length transformation-invariant binary vectors. A novel search algorithm has been designed thanks to the derivation of a numerical approximation for the similarity between MCC vectors. Extensive experiments have been carried out to compare the proposed approach against 15 existing methods over all the benchmarks typically used for fingerprint indexing. In spite of the smaller set of features used (top performing methods usually combine more features), the new approach outperforms existing ones in almost all of the cases.

14.
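To address the long execution time and poor clustering results of the original k-means method under MapReduce, a MapReduce-based divide-and-conquer k-means clustering method is proposed. Divide and conquer is used to handle large data sets: the whole data set is split into smaller chunks that are stored in the main memory of each machine; the chunks are spread across the available machines, and each chunk is clustered independently by the machine it is assigned to; the minimum weighted distance determines the cluster to which each data point should be assigned, and convergence is then checked. Experiments show that, compared with traditional k-means and streaming k-means, the proposed method is faster and yields better results.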

15.
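With the rapid growth in the number of users and the volume of data, traditional collaborative filtering algorithms based on constructing a similarity matrix are inefficient. To address this, a parallel similarity-matrix construction algorithm based on the MapReduce framework is proposed. Using an improved locality-sensitive hashing (LSH) algorithm, the item set is partitioned into disjoint groups, and within-group similarity...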

16.
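With the arrival of the big-data era, data volume and complexity have risen sharply, and Skyline query result sets have become so large that they cannot give users precise information. MapReduce, as a parallel computing framework, is widely used in big-data processing. This paper proposes a result optimization algorithm based on dominance count under the MapReduce framework (MR-DMN), which addresses the optimization of Skyline result sets in big-data environments. Extensive experiments show that the algorithm has good time and space efficiency.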
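One way to shrink a Skyline result using dominance counts is to rank each skyline point by how many points it dominates and keep the highest-ranked ones. The sketch below shows that idea serially; it is not the MR-DMN algorithm itself, and it assumes smaller values are better in every dimension. The names `dominates` and `ranked_skyline` are hypothetical.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (smaller values are assumed to be better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def ranked_skyline(points, top_n):
    """Return the top_n skyline points ordered by dominance count."""
    skyline = [p for p in points
               if not any(dominates(q, p) for q in points)]
    # Dominance count: how many points each skyline point dominates
    counts = {p: sum(dominates(p, q) for q in points) for p in skyline}
    return sorted(skyline, key=lambda p: -counts[p])[:top_n]
```

Points must be hashable tuples. This brute-force version compares every pair of points, so it is quadratic in the number of points and only suitable for illustration.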

17.
To effectively utilize information stored in a digital image library, effective image indexing and retrieval techniques are essential. This paper proposes an image indexing and retrieval technique based on the compressed image data using vector quantization (VQ). By harnessing the characteristics of VQ, the proposed technique is able to capture the spatial relationships of pixels when indexing the image. Experimental results illustrate the robustness of the proposed technique and also show that its retrieval performance is higher than that of existing color-based techniques.

18.
Image indexing and retrieval based on color histograms   Cited by: 4 (self-citations: 0, citations by others: 4)
While general object recognition is difficult, it is relatively easy to capture various primitive properties such as color distributions, prominent regions and their topological features from an image and use them to narrow down the search space when attempting to retrieve images by content from an image database. In this paper, we present an image database in which images are indexed and retrieved based on color histograms. We first address the problems inherent in color histograms created by the conventional method, and then propose a new method to create histograms which are compact in size and insensitive to minor illumination variations such as highlight, shape, etc. A powerful indexing scheme is introduced in which each histogram of an image is encoded into a numerical key and stored in a two-layered tree structure. This approach turns the problem of histogram matching, which is computation intensive, into index key search, so as to realize quick data access in a large-scale image database. Two types of user interfaces, query by user-provided sample images and query by combination of system-provided templates, are provided to meet various user requests. Various experimental evaluations demonstrate the effectiveness of the image database system.

19.
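Commonly used personalized recommender system models are usually based on collaborative filtering or on content, and some are based on association rules. These algorithms do not consider the order of transactions, yet in many applications this order matters. This paper proposes a simple recommendation model based on sequential patterns and, with large-scale data processing in mind, combines it with the MapReduce programming model. This simple model can be used to complement conventional personalized recommender systems.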
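A very small order-aware recommender in the MapReduce style can be sketched as follows. It is an assumption-laden illustration, not the model from the paper: it only counts length-2 sequential patterns (item `a` immediately followed by item `b`), and the names `map_phase`, `reduce_phase` and `recommend` are hypothetical.

```python
from collections import Counter, defaultdict


def map_phase(sequence):
    """Emit ((prev_item, next_item), 1) for each consecutive pair."""
    return [((a, b), 1) for a, b in zip(sequence, sequence[1:])]


def reduce_phase(emitted):
    """Sum the counts per (prev_item, next_item) key."""
    totals = Counter()
    for key, one in emitted:
        totals[key] += one
    return totals


def recommend(totals, last_item, n=1):
    """Recommend the n items that most often follow last_item."""
    followers = defaultdict(Counter)
    for (a, b), c in totals.items():
        followers[a][b] = c
    return [item for item, _ in followers[last_item].most_common(n)]
```

In use, `map_phase` is applied to each user's ordered transaction list, the emitted pairs are concatenated and passed to `reduce_phase`, and `recommend` is then queried with the user's most recent item.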

20.
Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without input data, which causes significant data access delay. Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, prefetching input data into memory is an effective way to improve data locality. However, deciding what and when to prefetch still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a prefetching-service-based task scheduler that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on the currently pending tasks and then preload the needed data into memory without delaying the launch of new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. The experimental results show that the method reduces the number of map tasks that incur remote data delay and improves the performance of Hadoop clusters.
