Similar literature
 20 similar documents found
1.
Approximate Distributed K-Means Clustering over a Peer-to-Peer Network   Total citations: 4, self-citations: 0, citations by others: 4
Data-intensive Peer-to-Peer (P2P) networks are finding an increasing number of applications, and data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments, since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms that approximate the result of the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and produces clusterings by “local” synchronization only. The second uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared to their centralized counterparts at a modest communication cost.
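The idea of “local” synchronization can be illustrated with a small sketch. It is not the paper's algorithm: it assumes, for illustration only, that each peer summarizes its local data as per-cluster coordinate sums and counts, exchanges those summaries with its immediate neighbors, and recomputes centroids from the combined summaries. The names `local_stats` and `merge_with_neighbors` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Stats = List[Tuple[Point, int]]  # per cluster: (coordinate sums, point count)


def local_stats(points: List[Point], centroids: List[Point]) -> Stats:
    """Assign each local point to its nearest centroid and summarize the result."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Index of the nearest centroid by squared Euclidean distance.
        j = min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [(tuple(s), n) for s, n in zip(sums, counts)]


def merge_with_neighbors(own: Stats, neighbor_stats: List[Stats],
                         old_centroids: List[Point]) -> List[Point]:
    """Recompute centroids from a peer's own summary plus its neighbors' summaries."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        total = [0.0] * len(old)
        n = 0
        for stats in [own] + neighbor_stats:
            s, c = stats[j]
            n += c
            total = [t + x for t, x in zip(total, s)]
        # Keep the old centroid if no peer in the neighborhood saw this cluster.
        new_centroids.append(tuple(t / n for t in total) if n else old)
    return new_centroids
```

Only the summaries travel between peers, never the raw points, which is what keeps the communication cost modest in this kind of scheme.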

2.
Distributed data mining applies techniques to mine distributed data sources while avoiding the need to first collect the data at a central site. This has significant appeal when communication cost and privacy restrict traditional centralized methods. Although distributed data mining has developed on many fronts, we still lack models that abstract the process by showing similarities and contrasts between the different methods. In this paper, we introduce two abstract models for distributed clustering in peer-to-peer environments with different goals. The first is the Locally optimized Distributed Clustering (LDC) model, which aims at achieving better local clusters at each node and is facilitated by collaboration through sharing of summarized cluster information. The second is the Globally optimized Distributed Clustering (GDC) model, which aims at achieving one global clustering solution that approximates centralized clustering. We also report on concrete realizations of the two models that show their benefits, through application in text mining. The LDC model is realized through the Collaborative P2P Clustering algorithm, while the GDC model is realized through the Hierarchically distributed P2P Clustering algorithm. In the former, we show that peer collaboration results in a significant increase in local clustering quality. The process uses cluster summarization to exchange information between peers. In the latter, we target scalability by structuring the P2P network hierarchically and devise a distributed variant of the k-means algorithm to compute one set of clusters across the hierarchy. We demonstrate the effectiveness of both methods through experimental results and make recommendations on when to use each method.

3.
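Research on Semantic-Based Clustering Algorithms in Super-Peer Networks   Total citations: 1, self-citations: 0, citations by others: 1
In P2P networks, using the semantic information of shared data to partition the network into semantic clusters is an effective way to improve search performance and network scalability. However, existing semantic clustering methods based on classification hierarchies rarely consider load balancing among clusters, which inevitably degrades network performance. This paper therefore proposes two self-organizing semantic clustering algorithms for classification-hierarchy semantic spaces: the semantic-first clustering algorithm SFCA and the load-balance-first clustering algorithm LBFCA. Both algorithms dynamically partition the network into semantic clusters according to the network load, and they preserve both the semantic relationships of the data within each cluster and the load balance among clusters. Experiments show that the two clustering algorithms perform and scale well.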

4.
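Distributed clustering algorithms for P2P (peer-to-peer) networks use the computing and storage capacity of each node, together with the network's bandwidth, to spread the time and space complexity of the algorithm across the nodes. This makes it feasible to process and analyze massive distributed data and overcomes the data-processing limits of traditional centralized clustering algorithms that run on a single server. This paper proposes a distributed K-means clustering algorithm based on node confidence radius. The algorithm computes the density of the data distribution on each node, identifies the dense and sparse regions of same-class data on the node, and from these determines a clustering confidence radius that guides the next clustering step. Experiments show that the algorithm effectively reduces the number of iterations and saves network bandwidth, while its clustering results are close to those of centralized clustering.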

5.
Recently, peer-to-peer (P2P) search has become popular on the Web as an alternative to centralized search because of its high scalability and low deployment cost. However, P2P search systems are known to suffer from peer dynamics, such as frequent node joins and leaves and document changes, which cause serious performance degradation. This paper presents the architecture of a P2P search system that supports full-text search in an overlay network with peer dynamics. This architecture, named HAPS, consists of two layers of peers. The upper layer is a DHT (distributed hash table) network interconnected by super peers (which we refer to as hubs). Each hub maintains distributed data structures called search directories, which can be used to guide queries and to control the search cost. The bottom layer consists of clusters of ordinary peers (called providers), which receive queries and return relevant results. Extensive experimental results indicate that HAPS performs searches effectively and efficiently. In addition, a performance comparison shows that HAPS outperforms a flat structured system and a hierarchical unstructured system in an environment with peer dynamics.

6.
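Drawing on ideas from information theory, this paper analyzes the procedure of the hierarchy-based K-means clustering algorithm (HKMA). The algorithm first clusters the documents with a hierarchical method and uses the resulting number of clusters as the value of k for k-means; k-means clustering then refines that initial result. Experimental results show that HKMA's execution time is better overall than that of the k-means algorithm, and that it grows more slowly as the data volume increases.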

7.
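When the number of clusters grows, K-means clustering quality suffers if unsuitable initial centers are chosen. To address this problem, a fast K-means clustering algorithm with partial iteration (PIFKM+?) is proposed. Building on K-means clustering, the algorithm repeatedly looks for clusters that can be split and clusters that can be removed, and re-clusters only the local data affected, which lowers the time complexity of updating the whole clustering and improves clustering quality. When the number of clusters is large, PIFKM+? can update the clustering quickly, is insensitive to the initial cluster centers, and improves clustering accuracy. Comparative experiments against K-means and K-means++ on synthetic and real data sets verify the algorithm's accuracy, efficiency, and scalability, and statistical analysis of the results shows that the gain in clustering accuracy comes without much loss of time efficiency.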

8.
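Distributed Affinity Propagation Clustering Based on MapReduce   Total citations: 2, self-citations: 0, citations by others: 2
With the rapid development of information technology, data volumes are growing sharply and large-scale data processing is very challenging. Many parallel algorithms have been proposed, such as MapReduce-based distributed K-means clustering and distributed spectral clustering. Affinity propagation (AP) clustering overcomes limitations of K-means clustering but performs poorly on massive data. To cluster massive data effectively, a MapReduce-based distributed affinity propagation clustering algorithm, DisAP, is proposed. The algorithm first randomly partitions the data points into subsets of similar size and sparsifies each subset in parallel with AP clustering. It then merges the sparsified subsets and runs AP clustering again, and the resulting exemplars serve as the cluster centers for all data points. Experiments on synthetic data, face image data, the IRIS data set, and large-scale data sets show that DisAP adapts well to data size and effectively reduces clustering time while preserving the clustering quality of AP.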

9.
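By introducing the idea of upper and lower approximations, rough K-means has become an effective algorithm for clustering problems with fuzzy boundaries. Derived algorithms such as rough fuzzy K-means and fuzzy rough K-means describe the uncertainty of boundary objects in finer detail and improve clustering results. However, when iteratively computing cluster means, these algorithms do not fully account for factors that affect clustering accuracy, such as the distance between each cluster's data objects and the mean center, or how densely the data are distributed in the neighborhood. To address this, a locally adaptive density measure is proposed to describe the spatial characteristics of data objects within a cluster, and a rough K-means clustering algorithm based on this measure is presented. Worked examples verify the effectiveness of the algorithm.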

10.
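Traditional clustering methods rarely consider security factors together with their impact on network performance. To address this, a clustering method based on trust relationships is proposed. Drawing on the spreading-activation model of human memory, the method can automatically partition the whole network using only limited local information. Experimental results show that the accuracy of the proposed method is very close to that of centralized clustering methods. The method can therefore improve the security of an Ad Hoc network while also improving its performance.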

11.
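To address the difficulty of clustering massive data under a centralized system framework, an optimized K-means clustering algorithm based on MapReduce is proposed. The algorithm uses the MapReduce parallel programming framework, introduces Canopy clustering to improve the selection of initial K-means centers, and improves the communication and computation patterns of the iteration process. Experimental results show that the algorithm effectively improves clustering quality, runs efficiently, and scales well, making it suitable for clustering massive data.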
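As a rough illustration of how Canopy clustering can seed K-means, the sketch below runs the standard two-threshold Canopy pass on a single machine and returns one center per canopy. It is an assumption-laden sketch, not the abstract's MapReduce implementation: the function name `canopy_centers`, the Euclidean distance, and the use of each canopy's mean as its center are all choices made here for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def canopy_centers(points: List[Point], t1: float, t2: float) -> List[Point]:
    """Return one center per canopy, for use as initial K-means centers.

    t1 is the loose threshold (canopy membership) and t2 the tight threshold
    (removal from the candidate list); t1 must be larger than t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    centers: List[Point] = []
    while remaining:
        seed = remaining.pop(0)
        # Every remaining point within t1 of the seed belongs to this canopy.
        members = [seed] + [p for p in remaining if math.dist(p, seed) < t1]
        dim = len(seed)
        centers.append(tuple(sum(m[d] for m in members) / len(members)
                             for d in range(dim)))
        # Points within t2 of the seed cannot start a canopy of their own.
        remaining = [p for p in remaining if math.dist(p, seed) >= t2]
    return centers
```

The number of canopies also suggests a value for k, so the thresholds `t1` and `t2` replace k as the parameters that have to be tuned.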

12.
Recently, peer-to-peer (P2P) search technique has become popular in the Web as an alternative to centralized search due to its high scalability and low deployment-cost. However, P2P search systems are known to suffer from the problem of peer dynamics, such as frequent node join/leave and document changes, which cause serious performance degradation. This paper presents the architecture of a P2P search system that supports full-text search in an overlay network with peer dynamics. This architecture, namely H...

13.
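朱强 (Zhu Qiang), 孙玉强 (Sun Yuqiang). 《计算机应用》 (Journal of Computer Applications), 2014, 34(9): 2505-2509
Sensor nodes have limited resources, and high communication overhead consumes a large amount of battery power. To reduce the communication overhead of distributed stream data classification algorithms, an efficient distributed stream data clustering algorithm is proposed. It has two phases: online local clustering and offline global collaborative clustering. The online local clustering algorithm clusters each stream data source locally and sends the serialized clustering result to a coordinator node; once the coordinator has received the local clustering information from the different stream sources, it performs global clustering. The experiments show that as the window size increases, the time spent sending data stays constant while the clustering time and total time grow linearly; that is, the execution time of the proposed algorithm is not affected by the sliding window width or the number of clusters. The algorithm's accuracy is close to that of the centralized algorithm, and its communication overhead is far lower than that of related distributed algorithms. The experimental results show that the algorithm scales well and can be applied to clustering large-scale distributed stream data sources.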
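The two-phase structure can be sketched as follows. The abstract does not say which clustering method each phase uses, so this sketch substitutes a simple one-pass "leader" clustering with a fixed radius in both phases and JSON as the serialization format; `leader_cluster`, `local_phase`, `global_phase`, and `radius` are hypothetical names introduced here, not the authors'.

```python
import json
import math
from typing import List, Tuple

Summary = Tuple[List[float], int]  # (centroid, number of points it represents)


def leader_cluster(items: List[Summary], radius: float) -> List[Summary]:
    """One-pass clustering: merge each weighted item into the first cluster within radius."""
    clusters: List[Summary] = []
    for point, weight in items:
        for i, (centroid, n) in enumerate(clusters):
            if math.dist(point, centroid) <= radius:
                total = n + weight
                # Weighted mean keeps the centroid exact for the merged points.
                merged = [(c * n + x * weight) / total
                          for c, x in zip(centroid, point)]
                clusters[i] = (merged, total)
                break
        else:
            clusters.append((list(point), weight))
    return clusters


def local_phase(window: List[List[float]], radius: float) -> str:
    """Online phase on one stream source: cluster the window, serialize the summaries."""
    return json.dumps(leader_cluster([(p, 1) for p in window], radius))


def global_phase(messages: List[str], radius: float) -> List[Summary]:
    """Offline phase on the coordinator: cluster the summaries from all sources."""
    items = [(centroid, n) for msg in messages for centroid, n in json.loads(msg)]
    return leader_cluster(items, radius)
```

Each source sends one short message per window regardless of how many points the window holds, which is consistent with the abstract's observation that sending time stays constant as the window grows.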

14.
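Applying K-means clustering in a recommender system reduces dimensionality effectively, but clustering quality often depends on the chosen initial centers, and once a target cluster is selected the recommendation process uses only that cluster and ignores the others. To address these two problems, a parallel recommendation algorithm based on bisecting K-means clustering with a full binary tree is proposed. The algorithm first applies bisecting K-means repeatedly, using intra-cluster cohesion as the splitting threshold, to form a full binary tree. It then assigns users to the K leaf nodes (clusters) by level-order traversal. Finally, it uses the MapReduce framework to make recommendation predictions for the K clusters in parallel. Experiments on MovieLens show that the algorithm substantially improves recommendation accuracy while also improving system scalability.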

15.
With the evolution of a large number of social networking sites in which users share information at various levels in a Peer-to-Peer (P2P) manner, there is a need for efficient P2P collaborative mechanisms that achieve efficiency and accuracy at each level. To achieve a high level of accuracy and scalability, a distributed collaborative filtering (CF) approach for P2P service selection and recovery is proposed in this paper. The proposed approach differs from traditional centralized approaches in that both user and network views are modelled, and an estimate of the service recovery time is included in case some services fail during execution. A novel Context Aware P2P Service Selection and Recovery (CAPSSR) algorithm is proposed. To filter the contents relevant to user needs, the algorithm includes a new Distributed Filtering Metric (DFM), which selects contents based on the user input. The performance of the proposed algorithm is evaluated against a traditional centralized algorithm with respect to scalability and accuracy. The results show that the proposed approach is better than existing approaches in terms of accuracy and scalability.

16.
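Research and Implementation of a Collaborative Filtering Recommendation Algorithm Based on P2P Networks   Total citations: 1, self-citations: 0, citations by others: 1
Collaborative filtering is one of the most effective information filtering techniques in current e-commerce recommender systems. The greatest weakness of traditional collaborative filtering is scalability: as the numbers of users and items grow, computational complexity rises quickly, which causes scalability problems in large e-commerce systems. This paper proposes a collaborative filtering recommendation method based on a P2P network that uses peer-to-peer computing to manage the user database and predict ratings. The system takes full advantage of peer-to-peer computing in P2P networks and adopts a routing algorithm based on multiple spanning trees. Experimental data show that the P2P-based distributed collaborative filtering method has better scalability and prediction accuracy than traditional centralized algorithms.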

17.
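Host Clustering in Computational Grids Based on Network Performance   Total citations: 7, self-citations: 0, citations by others: 7
Network host clustering is a new technique that has emerged with the development of grid task scheduling, and the time efficiency and result accuracy of host clustering algorithms based on network performance still need improvement. To address this, a practical and efficient density-based heuristic algorithm for clustering hosts in a computational grid is proposed. Performance analysis from several angles and large-scale simulation experiments demonstrate that the algorithm not only has good time efficiency but also performs well overall in terms of valid result clusters, average coefficient of variation, and average advantage ratio (平均优势比).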

18.
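An improvement to the cluster formation mechanism of the LEACH routing protocol is proposed. In the cluster setup phase, it combines distributed and centralized cluster formation and uses a cluster-count selection mechanism in which the optimal number of clusters changes dynamically with network conditions, so that the network obtains a more reasonable cluster partition at a lower level of energy consumption.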

19.
Ad hoc networks consist of wireless hosts that communicate with each other in the absence of a fixed infrastructure. Such networks cannot rely on centralized and organized network management. The clustering problem consists of partitioning network nodes into non-overlapping groups called clusters. Clusters give a hierarchical organization to the network that facilitates network management and increases its scalability. In a weight-based clustering algorithm, the clusterheads are selected according to their weight (a node’s parameter). The higher the weight of a node, the more suitable this node is for the role of clusterhead. In ad hoc networks, the amount of bandwidth, memory space or battery power of a node could be used to determine weight values. A self-stabilizing algorithm, regardless of the initial system configuration, converges to legitimate configurations without external intervention. Due to this property, self-stabilizing algorithms tolerate transient faults and they are adaptive to any topology change. In this paper, we present a robust self-stabilizing weight-based clustering algorithm for ad hoc networks. The robustness property guarantees that, starting from an arbitrary configuration, after one asynchronous round, the network is partitioned into clusters. After that, the network stays partitioned during the convergence phase toward a legitimate configuration where the clusters verify the “ad hoc clustering properties”.
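To make the weight-based selection rule concrete, the sketch below simulates a greedy version of it centrally. It is not the self-stabilizing protocol the abstract describes (there is no asynchronous message passing or fault recovery here); it only shows the outcome such a rule aims for: nodes are considered from heaviest to lightest, a node joins the heaviest neighboring clusterhead if one exists, and otherwise becomes a clusterhead itself. The function name and the tie-breaking by node id are assumptions.

```python
from typing import Dict, Set


def weight_based_clusters(neighbors: Dict[str, Set[str]],
                          weight: Dict[str, float]) -> Dict[str, str]:
    """Map every node to its clusterhead; a clusterhead maps to itself."""
    head_of: Dict[str, str] = {}
    # Heaviest nodes decide first; ties are broken by node id.
    for node in sorted(neighbors, key=lambda n: (-weight[n], n)):
        heads = [m for m in neighbors[node] if head_of.get(m) == m]
        if heads:
            # Join the heaviest clusterhead among the neighbors.
            head_of[node] = min(heads, key=lambda m: (-weight[m], m))
        else:
            # No neighboring clusterhead: this node takes the role.
            head_of[node] = node
    return head_of
```

With this rule no two clusterheads are neighbors and every ordinary node is adjacent to its clusterhead, so the clusters are non-overlapping and each has diameter at most two hops.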

20.
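Network Traffic Classification Based on Community Detection in Complex Networks   Total citations: 1, self-citations: 0, citations by others: 1
With the rapid growth of networks and the continual emergence of new applications, traffic classification and application identification based on port-number mapping or payload analysis can no longer meet application needs. A flow correlation network model is built with flows as nodes and the similarity between flows' statistical features as edges. Newman's fast community detection algorithm (NFCD) is applied to this model to partition it into communities, which yields a clustering of the flows and thus a classification of the network traffic. The method is compared with two earlier unsupervised traffic classification algorithms (K-Means and DBSCAN). Experimental results show that NFCD achieves higher accuracy, produces better clustering, and is not affected by input parameters.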
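The pipeline can be sketched in two steps: build the flow graph, then greedily merge communities to increase modularity. Several details below are assumptions, since the abstract does not give them: cosine similarity with a fixed `threshold` decides which flows are linked, and the merging loop is a naive simplification of Newman's fast algorithm that recomputes modularity for every candidate merge and stops at the first merge that does not improve it.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_flow_graph(features: List[Sequence[float]], threshold: float) -> List[Edge]:
    """Link two flows when their feature vectors are at least `threshold` similar."""
    return [(i, j) for i, j in combinations(range(len(features)), 2)
            if cosine(features[i], features[j]) >= threshold]


def modularity(edges: List[Edge], communities: List[Set[int]]) -> float:
    """Q = sum over communities of (fraction of edges inside) - (fraction of edge ends)^2."""
    m = len(edges)
    if m == 0:
        return 0.0
    q = 0.0
    for com in communities:
        inside = sum(1 for i, j in edges if i in com and j in com)
        ends = sum((i in com) + (j in com) for i, j in edges)
        q += inside / m - (ends / (2 * m)) ** 2
    return q


def greedy_communities(n: int, edges: List[Edge]) -> List[Set[int]]:
    """Start from singletons and repeatedly apply the merge that raises Q the most."""
    communities = [{i} for i in range(n)]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (a, b)]
            merged.append(communities[a] | communities[b])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:
            break
        best_q, communities = q, merged
    return communities
```

Each resulting community is treated as one traffic class. The number of classes falls out of the modularity criterion rather than being supplied up front as k is for K-Means, although this sketch does still depend on the similarity `threshold` chosen when building the graph.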
