20 similar documents were found; search took 15 ms.
1.
MapReduce in MPI for Large-scale graph algorithms
Steven J. Plimpton, Karen D. Devine 《Parallel Computing》2011,37(9):610-632
2.
A distributed affinity propagation clustering algorithm based on MapReduce
With the rapid development of information technology, data volumes are growing dramatically, which makes large-scale data processing highly challenging. Many parallel algorithms have been proposed, such as MapReduce-based distributed k-means clustering and distributed spectral clustering. Affinity propagation (AP) clustering overcomes the limitations of k-means, but it performs poorly on massive data. To cluster massive data effectively, a MapReduce-based distributed affinity propagation clustering algorithm, DisAP, is proposed. The algorithm first partitions the data points randomly into subsets of similar size and sparsifies each subset in parallel with AP clustering; the sparsified data from all subsets are then merged and clustered again with AP, and the resulting exemplars serve as the cluster centers for all data points. Experiments on synthetic data, face-image data, the IRIS data set, and large-scale data sets show that DisAP adapts well to data of different scales and substantially reduces clustering time while preserving the quality of AP clustering.
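The two-stage structure described in the abstract above translates into a compact sketch. The following Python fragment approximates the DisAP pipeline with scikit-learn's AffinityPropagation; the subset count, AP parameters, and toy data are illustrative assumptions, and this is a single-machine sketch rather than the authors' MapReduce implementation.

```python
# Sketch only: single-machine approximation of a DisAP-style two-stage AP;
# subset count, AP parameters, and data are illustrative assumptions.
import numpy as np
from sklearn.cluster import AffinityPropagation

def two_stage_ap(X, n_subsets=4, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_subsets)   # random, similar-sized subsets

    # Stage 1: sparsify each subset down to its AP exemplars (run in parallel in DisAP).
    exemplars = []
    for part in parts:
        ap = AffinityPropagation(random_state=seed).fit(X[part])
        exemplars.append(X[part][ap.cluster_centers_indices_])
    exemplars = np.vstack(exemplars)

    # Stage 2: cluster the merged exemplars again; the result gives the global centers.
    centers = AffinityPropagation(random_state=seed).fit(exemplars).cluster_centers_

    # Assign every original point to its nearest global center.
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in ([0, 0], [6, 6], [0, 6])])
    centers, labels = two_stage_ap(X)
    print(len(centers), "clusters found")
```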
3.
4.
MapReduce is a programming model proposed to simplify large-scale data processing. In contrast, the message passing interface (MPI) standard is extensively used for algorithmic parallelization, as it accommodates an efficient communication infrastructure. In the original implementation of MapReduce, the reduce function can only start processing following termination of the map function. If the map function is slow for any reason, this will affect the whole running time. In this paper, we propose MapReduce overlapping using MPI, which is an adapted structure of the MapReduce programming model for fast intensive data processing. Our implementation is based on running the map and the reduce functions concurrently in parallel by exchanging partial intermediate data between them in a pipeline fashion using MPI. At the same time, we maintain the usability and the simplicity of MapReduce. Experimental results based on three different applications (WordCount, Distributed Inverted Indexing and Distributed Approximate Similarity Search) show a good speedup compared to the earlier versions of MapReduce such as Hadoop and the available MPI-MapReduce implementations.
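The overlapping map/reduce pipeline sketched in the abstract above can be illustrated with mpi4py. In the toy word count below, mapper ranks push partial counts to reducer ranks after every chunk, so reduction proceeds while mapping is still running; the rank split, chunk size, partitioning function, and the in-memory input are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: overlapped map/reduce word count with mpi4py; rank split,
# chunking, and the in-memory input are illustrative assumptions.
# Run with e.g.: mpiexec -n 4 python overlap_wordcount.py
import zlib
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n_mappers = size // 2
reducers = list(range(n_mappers, size))

if rank < n_mappers:                                   # ---- mapper ranks ----
    lines = ["to be or not to be"] * 50                # stand-in for this mapper's input split
    chunk = 10
    for start in range(0, len(lines), chunk):
        counts = Counter(w for line in lines[start:start + chunk] for w in line.split())
        partial = {r: {} for r in reducers}
        for word, c in counts.items():                 # deterministic word -> reducer partitioning
            partial[reducers[zlib.crc32(word.encode()) % len(reducers)]][word] = c
        for r in reducers:                             # push partial counts immediately (pipeline)
            comm.send(partial[r], dest=r, tag=1)
    for r in reducers:
        comm.send(None, dest=r, tag=1)                 # end-of-stream marker
else:                                                  # ---- reducer ranks ----
    total, finished = Counter(), 0
    while finished < n_mappers:
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=1)
        if msg is None:
            finished += 1
        else:
            total.update(msg)                          # reduce while mappers are still mapping
    print(f"reducer {rank}: {total.most_common(3)}")
```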
5.
6.
An improved Canopy-Kmeans algorithm based on MapReduce
Mao Dianhui 《计算机工程与应用》2012,48(27):22-26,68
To address the randomness of Canopy selection in the distributed Canopy-Kmeans algorithm, the algorithm is improved with the "min-max principle", which avoids blind Canopy selection. The algorithm is then parallelized with the MapReduce framework so that it can fully exploit the computing and storage capacity of a cluster and handle massive-data scenarios. Using the clustering of massive online news articles as the application background, the improved algorithm is evaluated experimentally. The results show that, compared with the random Canopy-selection strategy, the method clearly improves classification accuracy and noise resistance, and it shows a significant performance advantage when processing massive data.
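The "min-max principle" for Canopy selection mentioned above can be sketched in a few lines: each new center is the point whose distance to the nearest already-chosen center is largest, and selection stops once that distance falls below a threshold. The threshold and toy data below are illustrative assumptions.

```python
# Sketch only: "min-max principle" Canopy center selection; the stopping
# threshold and toy data are illustrative assumptions.
import numpy as np

def minmax_canopy_centers(X, stop_dist):
    centers = [0]                                      # start from an arbitrary point
    d_min = np.linalg.norm(X - X[0], axis=1)           # distance to the nearest chosen center
    while True:
        cand = int(d_min.argmax())                     # farthest point from every chosen center
        if d_min[cand] < stop_dist:                    # stop once no point is far from all centers
            break
        centers.append(cand)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[cand], axis=1))
    return X[centers]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in ([0, 0], [5, 5], [0, 5])])
    print(minmax_canopy_centers(X, stop_dist=3.0))     # roughly one Canopy center per cluster
```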
7.
k-modes is a representative clustering algorithm for categorical data. The implementation of k-modes is first improved: by updating the frequency counts of each attribute value in a cluster whenever a data object is assigned to it, the new cluster centers can be computed in a single pass over all data objects. To enable k-modes to handle large-scale categorical data, the algorithm is implemented with the MapReduce parallel computing model on the Hadoop platform. Experiments show that, when processing large data sets, parallel k-modes greatly reduces clustering time compared with serial k-modes and achieves a good speedup.
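The single-pass center update described above can be sketched as follows: as each object is assigned to its nearest mode, per-cluster frequency counts of every attribute value are updated, so the new modes can be read off without re-scanning the data. The initial modes and toy categorical data are illustrative assumptions, and the sketch is serial rather than the Hadoop implementation.

```python
# Sketch only: serial single-pass k-modes center update; initial modes and the
# toy categorical data are illustrative assumptions.
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def kmodes_one_pass(objects, modes):
    k, n_attrs = len(modes), len(objects[0])
    # freq[c][j] counts the attribute-j values of the objects assigned to cluster c
    freq = [[Counter() for _ in range(n_attrs)] for _ in range(k)]
    labels = []
    for obj in objects:
        c = min(range(k), key=lambda i: hamming(obj, modes[i]))
        labels.append(c)
        for j, v in enumerate(obj):                    # update the counts at assignment time
            freq[c][j][v] += 1
    # new mode of each cluster = most frequent value of each attribute (no second pass needed)
    new_modes = [tuple(freq[c][j].most_common(1)[0][0] if freq[c][j] else modes[c][j]
                       for j in range(n_attrs)) for c in range(k)]
    return new_modes, labels

data = [("red", "S"), ("red", "M"), ("blue", "L"), ("blue", "L"), ("red", "S")]
print(kmodes_one_pass(data, modes=[("red", "S"), ("blue", "L")]))
```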
8.
To address the large differences in completion time among Reduce tasks in the MapReduce model, the factors that affect Reduce-task completion time are analyzed and the data-skew problem on Reduce nodes is identified. An improved MapReduce model, MBR (Map-Balance-Reduce), is proposed. By adding a Balance task that rebalances the intermediate data produced by the Map tasks, the data assigned to the Reduce nodes become more even, so the Reduce tasks finish at roughly the same time. Simulation results show that, after the Balance task, the intermediate data produced by the Map tasks are distributed to the Reduce nodes fairly evenly, achieving balanced computation and reducing the overall job execution time to a certain extent.
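One way to picture the Balance stage described above is as a greedy repartitioning of intermediate keys by their record counts before the reduce phase; the sketch below uses largest-first bin packing, which is an assumption about the balancing policy, with illustrative key counts.

```python
# Sketch only: greedy key-level rebalancing before the reduce phase; the key
# counts and reducer number are illustrative assumptions about the Balance policy.
import heapq

def balance_partition(key_counts, n_reducers):
    """Assign each key to the currently lightest reducer, largest keys first."""
    heap = [(0, r) for r in range(n_reducers)]         # (records assigned so far, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for key, cnt in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        assignment[key] = r
        heapq.heappush(heap, (load + cnt, r))
    return assignment

# Skewed intermediate data from the map phase (records per key).
counts = {"k0": 900, "k1": 800, "k2": 700, "k3": 300, "k4": 250,
          "k5": 200, "k6": 150, "k7": 120, "k8": 100, "k9": 80}
plan = balance_partition(counts, n_reducers=3)
loads = {}
for key, r in plan.items():
    loads[r] = loads.get(r, 0) + counts[key]
print(plan)
print(loads)                                           # reducer loads end up close to the mean
```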
9.
Targeting the characteristics of the RapidIO network, the layered design of MPICH2 and the MPI communication methods built on the TCP and SCTP protocols are analyzed. By redefining the CH3 layer under ADI3, an MPI device layer based on RapidIO is designed and implemented, establishing a communication channel from MPI to RapidIO and realizing multi-stream communication. Experiments on machines equipped with RapidIO network cards show that this dedicated MPI device layer outperforms Ethernet emulation in both bandwidth and latency, and its advantage is even greater for large-volume communication.
10.
To monitor cluster systems effectively, a simple and practical parallel task model is built on the Message Passing Interface (MPI) parallel library. The cluster monitoring component, the node load-balancing evaluation model, and Linux cluster data collection in this task model are described in detail. Experiments show that the model is easy to configure, has low resource overhead, and causes little interference to the cluster system.
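A minimal mpi4py sketch in the spirit of such an MPI-based monitoring model: every rank samples its local load average and rank 0 gathers the samples. The metric and report format are illustrative assumptions.

```python
# Sketch only: per-node load sampling gathered over MPI; the metric and the
# report format are illustrative assumptions.
# Run with e.g.: mpiexec -n 4 python mpi_monitor.py
import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
sample = {
    "host": socket.gethostname(),
    "load1": os.getloadavg()[0],                       # 1-minute load average (Unix only)
}
report = comm.gather(sample, root=0)                   # low-overhead collective to the monitor rank
if comm.Get_rank() == 0:
    for r, s in enumerate(report):
        print(f"rank {r} on {s['host']}: load1 = {s['load1']:.2f}")
```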
11.
Based on the characteristics of clusters and of clustering algorithms, the feasibility of parallelizing clustering is examined theoretically and then verified experimentally; the results show that the proposed improvements yield fairly good performance.
12.
Because large-scale data are usually stored in a distributed manner, finding the k most similar pairs among them is difficult. To address this problem, a MapReduce-based top-k similar-pair query scheme is proposed. The scheme first partitions all data pairs into groups, for which an all-pairs grouping algorithm and a core-pairs grouping algorithm are proposed; the top-k similar pairs within each group are computed separately, and the k pairs with the highest similarity are then selected from the per-group results, so that the globally most similar k pairs are determined correctly. Experiments on synthetic and real data vary the number of result pairs k and the number of machines s. The results show that increasing the number of machines s improves the running efficiency and scalability of the algorithm, while changes in k have little effect on the MapReduce-based algorithm.
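The group-then-merge structure described above is easy to sketch: every candidate pair lands in exactly one group, each group keeps its local top-k, and the global top-k is taken over the union of the local results, so no qualifying pair can be lost. The similarity function, grouping rule, and data below are illustrative assumptions.

```python
# Sketch only: group-then-merge top-k similar pairs; the similarity function,
# grouping rule, and data are illustrative assumptions.
import heapq
from itertools import combinations

def similarity(a, b):
    return -abs(a - b)                                 # toy similarity: closer numbers are more similar

def topk_pairs(points, k, n_groups):
    sim = lambda p: similarity(points[p[0]], points[p[1]])
    pairs = list(combinations(range(len(points)), 2))
    groups = [pairs[g::n_groups] for g in range(n_groups)]    # every pair lands in exactly one group
    local = []
    for grp in groups:                                 # each group could be handled by one reducer
        local.extend(heapq.nlargest(k, grp, key=sim))  # local top-k of the group
    return heapq.nlargest(k, local, key=sim)           # global top-k over the merged local results

points = [1.0, 1.1, 5.0, 5.05, 9.0, 9.4]
print(topk_pairs(points, k=2, n_groups=3))             # the two closest pairs overall
```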
13.
Cappelli R, Ferrara M, Maltoni D 《IEEE Transactions on Pattern Analysis and Machine Intelligence》2011,33(5):1051-1057
This paper proposes a new hash-based indexing method to speed up fingerprint identification in large databases. A Locality-Sensitive Hashing (LSH) scheme has been designed relying on Minutiae Cylinder-Code (MCC), which proved to be very effective in mapping a minutiae-based representation (position/angle only) into a set of fixed-length transformation-invariant binary vectors. A novel search algorithm has been designed thanks to the derivation of a numerical approximation for the similarity between MCC vectors. Extensive experiments have been carried out to compare the proposed approach against 15 existing methods over all the benchmarks typically used for fingerprint indexing. In spite of the smaller set of features used (top-performing methods usually combine more features), the new approach outperforms existing ones in almost all of the cases.
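For readers unfamiliar with LSH over fixed-length binary vectors, the sketch below shows the generic bit-sampling construction for Hamming distance: each hash table keys vectors by a random subset of bit positions, and near-duplicates collide in at least one table with high probability. This only illustrates the general indexing style; it does not reproduce the MCC-specific hash functions or the similarity approximation of the paper, and all parameters are illustrative.

```python
# Sketch only: generic bit-sampling LSH for fixed-length binary vectors
# (Hamming distance); it does not reproduce the MCC-specific hashing or the
# similarity approximation of the paper, and all parameters are illustrative.
import random
from collections import defaultdict

class BitSamplingLSH:
    def __init__(self, dim, n_tables=8, bits_per_table=12, seed=0):
        rng = random.Random(seed)
        self.masks = [rng.sample(range(dim), bits_per_table) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, vec):
        return [tuple(vec[i] for i in mask) for mask in self.masks]

    def add(self, vec, item_id):
        for table, key in zip(self.tables, self._keys(vec)):
            table[key].append(item_id)

    def query(self, vec):
        candidates = set()
        for table, key in zip(self.tables, self._keys(vec)):
            candidates.update(table[key])              # collide in at least one table -> candidate
        return candidates

rng = random.Random(1)
db = [[rng.randint(0, 1) for _ in range(64)] for _ in range(200)]
index = BitSamplingLSH(dim=64)
for i, vec in enumerate(db):
    index.add(vec, i)
probe = db[7][:]
probe[3] ^= 1                                          # near-duplicate of vector 7 (1 bit flipped)
print(7 in index.query(probe))
```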
14.
To address the long execution time and suboptimal clustering results of the original k-means algorithm in MapReduce, a divide-and-conquer k-means clustering method based on MapReduce is proposed. The large data set is split into smaller blocks stored in the main memory of individual machines; distributed across the available machines, each block is clustered independently by its assigned machine; the cluster to which a data point should be assigned is determined by the minimum weighted distance, and convergence is then checked. Experimental results show that, compared with the traditional k-means and streaming k-means clustering methods, the proposed method is faster and gives better results.
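The divide-and-conquer scheme above can be sketched as: split the data into blocks, cluster each block independently, then cluster the per-block centers once more, weighting each center by its block cluster size. The block count, k, use of scikit-learn's KMeans, and toy data are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch only: divide-and-conquer k-means with per-block clustering and a
# weighted merge step; block count, k, and data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def divide_and_conquer_kmeans(X, k, n_blocks=4, seed=0):
    blocks = np.array_split(X, n_blocks)               # each block small enough for one machine
    centers, weights = [], []
    for block in blocks:                               # cluster every block independently
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(block)
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    centers = np.vstack(centers)
    weights = np.concatenate(weights).astype(float)
    # Merge step: weighted k-means over the per-block centers gives the global centers.
    final = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(centers, sample_weight=weights)
    return final.cluster_centers_, final.predict(X)    # assign points by minimum distance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, (500, 2)) for c in ([0, 0], [4, 4], [0, 4])])
    centers, labels = divide_and_conquer_kmeans(X, k=3)
    print(np.round(centers, 2))
```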
15.
With the rapid growth of user numbers and data volume, traditional collaborative filtering algorithms based on constructing a similarity matrix are inefficient. To address this problem, a parallel similarity-matrix construction algorithm under the MapReduce framework is proposed. Using an improved locality sensitive hashing (LSH) algorithm, the item set is partitioned into disjoint groups, and, within the MapReduce framework, intra-group simil…
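The grouping idea can be sketched with a plain random-hyperplane LSH: items are bucketed by their signature and similarities are computed only inside each bucket, so the full item-by-item matrix is never materialized. The signature length and toy item vectors are illustrative; the paper's improved LSH and its MapReduce grouping details are not reproduced here.

```python
# Sketch only: bucket items by a random-hyperplane LSH signature and compute
# cosine similarities inside each bucket; signature length and the toy item
# vectors are illustrative assumptions.
import numpy as np
from collections import defaultdict
from itertools import combinations

def grouped_similarities(items, n_planes=6, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, items.shape[1]))
    signatures = (items @ planes.T > 0).astype(int)    # one bit per hyperplane
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        buckets[tuple(sig)].append(idx)
    sims = {}
    for members in buckets.values():                   # each bucket maps to one reduce group
        for i, j in combinations(members, 2):
            a, b = items[i], items[j]
            sims[(i, j)] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return sims

items = np.random.default_rng(1).random((50, 8))       # 50 items, 8-dimensional rating profiles
print(len(grouped_similarities(items)), "pairs computed instead of", 50 * 49 // 2)
```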
16.
17.
To effectively utilize information stored in a digital image library, effective image indexing and retrieval techniques are essential. This paper proposes an image indexing and retrieval technique based on the compressed image data using vector quantization (VQ). By harnessing the characteristics of VQ, the proposed technique is able to capture the spatial relationships of pixels when indexing the image. Experimental results illustrate the robustness of the proposed technique and also show that its retrieval performance is higher compared with existing color-based techniques.
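A rough sketch of VQ-style indexing: a codebook is learned from small image blocks, each image is indexed by the histogram of codewords its blocks quantize to, and retrieval ranks images by histogram distance. The block size, codebook size, and random stand-in images are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch only: VQ-style indexing with a k-means codebook over image blocks;
# block size, codebook size, and the random stand-in images are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def blocks(img, bs=4):
    h, w = (s - s % bs for s in img.shape)
    return img[:h, :w].reshape(h // bs, bs, w // bs, bs).swapaxes(1, 2).reshape(-1, bs * bs)

def build_vq_index(images, n_codewords=16, seed=0):
    codebook = KMeans(n_clusters=n_codewords, n_init=10, random_state=seed)
    codebook.fit(np.vstack([blocks(im) for im in images]))
    def histogram(im):                                 # image signature = codeword usage histogram
        h = np.bincount(codebook.predict(blocks(im)), minlength=n_codewords)
        return h / h.sum()
    index = [histogram(im) for im in images]
    def query(im):                                     # retrieval = nearest histogram (L1 distance)
        q = histogram(im)
        return int(np.argmin([np.abs(q - h).sum() for h in index]))
    return query

rng = np.random.default_rng(0)
images = [rng.random((32, 32)) for _ in range(10)]
query = build_vq_index(images)
print(query(images[3] + rng.normal(0, 0.01, (32, 32))))   # a perturbed copy of image 3 retrieves 3
```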
18.
Image indexing and retrieval based on color histograms
While general object recognition is difficult, it is relatively easy to capture primitive properties such as color distributions, prominent regions, and their topological features from an image, and to use them to narrow down the search space when retrieving images by content from an image database. In this paper, we present an image database in which images are indexed and retrieved based on color histograms. We first address the problems inherent in color histograms created by the conventional method, and then propose a new method to create histograms that are compact in size and insensitive to minor illumination variations such as highlights and shading. A powerful indexing scheme is introduced in which each histogram of an image is encoded into a numerical key and stored in a two-layered tree structure. This approach turns the problem of histogram matching, which is computation intensive, into an index-key search, so as to realize quick data access in a large-scale image database. Two types of user interface, query by user-provided sample images and query by combinations of system-provided templates, are provided to meet various user requests. Experimental evaluations demonstrate the effectiveness of the image database system.
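The "histogram encoded into a numerical key" idea can be sketched as follows: a coarse color histogram is quantized and packed into one integer, images are grouped by that key (the upper layer), and only images sharing the key are compared with full histogram intersection (the lower layer). Bin counts, quantization levels, and the flat dictionary standing in for the two-layered tree are illustrative assumptions.

```python
# Sketch only: a coarse RGB histogram packed into one integer key, with a flat
# dictionary standing in for the paper's two-layered tree; bin counts and
# quantisation levels are illustrative assumptions.
import numpy as np
from collections import defaultdict

BINS, LEVELS = 2, 8                                    # 2 bins per channel, 8 levels per bin share

def coarse_histogram(img):
    h, _ = np.histogramdd(img.reshape(-1, 3), bins=(BINS, BINS, BINS), range=[(0, 256)] * 3)
    return h.ravel() / h.sum()

def histogram_key(hist):
    levels = np.minimum((hist * LEVELS).astype(int), LEVELS - 1)
    key = 0
    for level in levels:                               # pack the quantised bin shares into one integer
        key = key * LEVELS + int(level)
    return key

index = defaultdict(list)                              # numerical key -> [(image id, histogram)]
rng = np.random.default_rng(0)
images = {i: rng.integers(0, 256, (16, 16, 3)) for i in range(20)}
for i, img in images.items():
    index[histogram_key(coarse_histogram(img))].append((i, coarse_histogram(img)))

probe = coarse_histogram(images[5])
candidates = index[histogram_key(probe)]               # key lookup replaces full histogram matching
best = max(candidates, key=lambda c: np.minimum(c[1], probe).sum())
print(best[0])                                         # histogram intersection picks image 5
```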
19.
20.
Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without their input data, which causes significant data-access delay. Data locality is therefore becoming one of the most critical factors affecting the performance of MapReduce clusters. Since machines in MapReduce clusters have large, often underutilized memory capacities, prefetching input data into memory is an effective way to improve data locality. However, deciding what and when to prefetch still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a prefetching-service-based task scheduler that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on the current pending tasks and then preload the needed data into memory without delaying the launch of new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. The experimental results show that the method reduces the number of map tasks that suffer remote-data delay and improves the performance of Hadoop clusters.
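A toy sketch of the predict-then-prefetch idea: for each pending map task, the node expected to run it next is taken to be the least-loaded one, and that node "prefetches" the task's input block into its memory cache before the task launches. The scheduling heuristic, cache model, and task list are simplified illustrative assumptions, not the HPSO implementation.

```python
# Sketch only: predict-then-prefetch toy model; the least-loaded-node heuristic,
# the cache model, and the task list are illustrative assumptions.
from collections import deque

nodes = {name: {"queue": deque(), "cache": set()} for name in ("node1", "node2", "node3")}
pending = [{"task": f"map-{i}", "block": f"blk-{i % 5}"} for i in range(9)]

def predict_node(nodes):
    """Predict where the next pending task will run: the node with the shortest queue."""
    return min(nodes, key=lambda n: len(nodes[n]["queue"]))

for task in pending:
    target = predict_node(nodes)
    nodes[target]["cache"].add(task["block"])          # preload the input block before launch
    nodes[target]["queue"].append(task["task"])        # then schedule the task on that node

for name, state in nodes.items():
    print(name, list(state["queue"]), "cached:", sorted(state["cache"]))
```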