首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 140 毫秒
1.
近年来,图数据聚类在学术界引起了广泛的关注,许多优秀的聚类方法,如模块度优化算法、谱聚类,以及基于密度的聚类算法在图数据上取得了很好的效果。SCAN是一种著名的基于密度的图聚类算法,该算法不仅能够找出图中的聚类,而且还能够发现不同聚类间的Hub节点,以及图中的离群点。然而,该算法存在两方面的局限性:首先,在大规模图数据上,该算法需要耗费大量的时间用于计算图中每条边的结构相似性;另一方面,该算法存在两个参数[ε]和[μ],并且对这两个参数比较敏感。为了解决其局限性,提出了一种基于OpenMP的并行算法来求解节点相似性,并且提出了两种有效的负载均衡策略;其次,提出一种基于三角形的新型图结构聚类算法TSCAN。该模型能够有效降低算法对参数的敏感性,而且还能够发现重叠以及更稠密的社区。在多个大规模数据集上实验发现,基于多核的并行算法能够达到近乎线性的加速比,而且TSCAN算法对参数不敏感,能有效发现重叠社区。  相似文献   

2.
社交关系的数据挖掘一直是大图数据研究领域中的热门问题。图聚类算法如SCAN(Structural clustering algorithm for networks)虽可迅速地从海量图数据中获得关系紧密的社区结构,但这类社区往往只表示了社交对象的聚集,无法反馈对象间的真实社交关系,如家庭成员、同事、同学等。要获取对象间真实的社交关系,需要更多维度地挖掘现实中社交对象间复杂的交互关系。对象间的交互维度很多,例如:通话、见面、微信、Email等,而传统SCAN等聚类算法仅能够挖掘单维度的交互数据。本文在研究社交对象间的多维社交关系图数据与传统图结构聚类算法的基础上,提出了一种有效的子空间聚类算法SCA(Subspace Cluster Algorithm),首次对多维度下子空间的图结构聚类进行研究,目的是探索如何通过图数据挖掘发现对象间真实的社交关系。SCA算法遵循自底向上的原则,能够发现社交图数据中所有子空间的聚类集。为了提升SCA的运行速度,我们利用其子空间聚类单调性进行了性能优化,进而提出了剪枝算法SCA+。最后,我们进行了大规模的性能测试实验,以及真实数据的案例研究,其结果验证了算法的效率和效用。  相似文献   

3.
杜鹃  张卓  曹建春 《计算机应用与软件》2021,38(11):288-294,313
提出一种基于快速无偏分层图抽样的MapReduce负载平衡方法.将聚类算法融合到MapReduce连接操作中,提出MapReduce并行聚类连接算法的实现方法;根据聚类结果动态调整抽样率的无偏分层图抽样算法,从而实现连接操作目标数据的准确、平衡抽样.通过合成数据集和真实数据集下的数据处理实验,与Hash连接算法及基于NS抽样的聚类算法进行对比,验证了所提出的算法方案在不同数据倾斜程度下都具有良好的负载平衡性能,其运行效率也没有因为新采样算法的采用而受到影响.  相似文献   

4.
针对以大数据为中心的信息开放共享平台,如何从嵌入大规模噪声结构的网络中解码出网络的真实结构,进一步在挖掘关联信息的过程中得到较为准确的挖掘结果的问题,提出基于结构熵的聚类方法实现对图中节点关联程度的划分。提出了计算二维结构信息的求解算法和基于熵减原则的模块划分算法,对图结构中节点划分得到对应的模块;利用 K 维结构信息算法对已划分的模块做进一步的划分,实现对图结构中节点的聚类;通过实例分析表明,所提出的图聚类方法不仅能够反映图结构的真实结构,而且可以有效地挖掘出图结构中节点之间的关联程度。同时对比了其他3种聚类方法,实验表明该方法在执行时间上具有更高的效率和保证聚类结果的可靠性。  相似文献   

5.
在众多聚类算法中,谱聚类作为一种代表性的图聚类算法,由于其对复杂数据分布的适应性强、聚类效果好等优点而受到人们的广泛关注.然而,由于其高计算时间复杂度难以应用于处理大规模数据.为提高谱聚类算法在大规模数据集上的可用性,提出关键节点选择的快速图聚类算法.该算法包含三个重要步骤:第一,提出一种充分考虑抱团性和分离性的快速节点重要性评价方法;第二,选择关键节点代替原数据集构建二分图,通过奇异值分解获得数据的近似特征向量;第三,集成多次的近似特征向量,提高近似谱聚类结果的鲁棒性.该算法将时间复杂度由谱聚类原有的O(n3)降低到O(t(n+2n2)),增强了其在大规模数据集上的可用性.通过该算法与其他七个具有代表性的谱聚类算法在五个Benchmark数据集上进行的实验分析,比较结果展示了该算法相比其他算法能够更加高效地识别数据中的复杂类结构.  相似文献   

6.
一种高效的属性图聚类方法   总被引:1,自引:0,他引:1  
吴烨  钟志农  熊伟  陈荦  景宁 《计算机学报》2013,36(8):1704-1713
图是描述现实世界各类复杂系统的一种普适模型,且许多实际应用中的图是大规模的.图的聚类是理解、分析和可视化大规模图的关键技术之一.现实世界的图往往包含丰富的属性信息,如何综合结构和属性信息进行属性图的聚类是一个新的挑战.大多数的现有方法或者将结构和属性转化为距离,基于传统方法进行聚类;或者只考虑某一方面聚类.文中结合信息论中最小长度原则,基于遗传算法,提出一种高效的属性图聚类方法GA-AGC.通过对属性图聚类问题建模,转化为最小描述长度原则问题;扩展标签传播方法作为遗传算法初始化方法,结合编码减小的局部变异方法,提出一种解决属性图聚类的遗传算法.文中方法无需设定聚类的数目,算法复杂度近似线性于结点和边的数目.真实数据集上的实验验证了算法的有效性和高效性.  相似文献   

7.
针对传统谱聚类算法难以应用于大规模高光谱图像,以及现有的改进谱聚类算法对大规模高光谱图像的处理效果不佳的问题,为降低聚类数据的复杂度,以降低聚类过程的计算成本从而多方面提升聚类性能,提出一种基于超像素锚图二重降维的高光谱聚类算法。首先,对高光谱数据进行主成分分析(PCA)处理,并针对高光谱图像的区域特性对其进行基于超像素切割的降维;其次,通过构造锚图的思想对上一步所得数据进行锚点的选取,并构建邻接锚图来实现二重降维,从而进行谱聚类;同时,为去除算法运行中人为调节参数的环节,在构建锚图时采用一种去除高斯核的无核锚图构造方式以实现自动构图。在Indian Pines数据集和Salinas数据集上的实验结果表明所提算法在保证可用性与低耗时的前提下可提高聚类的整体效果,从而验证了所提算法能提高聚类的质量与性能。  相似文献   

8.
余晓山  吴扬扬 《计算机应用》2014,34(6):1595-1599
针对传统的层次聚类算法在处理大规模文本时可扩展性不足的问题,提出基于MapReduce编程模型的并行化文本层次聚类算法。将基于文本向量分量组特征统计的垂直数据划分算法应用于MapReduce的数据分发,将MapReduce的排序特性应用于合并点的选择,使得算法更加高效,同时有利于提高聚类精度。实验结果表明了利用该算法进行大规模文本聚类的有效性及良好的可扩展性。  相似文献   

9.
黄鑫  罗军 《集成技术》2013,2(2):69-82
数据的快速增长,为我们提供了更多的信息,然而,也对传统信息获取技术提出了挑战。这篇论文提出了MCMM算法,它是基于MapReduce的大规模数据分类模型的最小生成树(MST)的算法。它可以看做是介于传统的KNN方法和基于聚类分类方法之间的模型,旨在克服这两种方法的不足并能处理大规模的数据。在这一模型中,训练集作为有权重的无向完全图来处理。顶点是对象,两点之间边的权重是对象间的距离。这一距离,不同于欧几里得距离,它是一个特定的距离度量。这样,可以找到图中最小生成树集,其中,图中每棵树代表一个类。为了降低时间复杂度,提取了每棵树中最具代表性的点来代表该树。这些压缩了的点集,可以通过计算无标签对象和它们之间的距离,来进行分类。MCMM模型基于MapReduce实现并且部署在Hadoop平台。该模型可扩展处理大规模的数据,是因为Hadoop支持数据密集分布应用,并且这些应用可以和数以千计的节点和数据一起运作。另外,MapReduce 和Hadoop能在由商品机组成的集群上很好的运行。MCMM模型使用云平台并且通过使用MapReduce 和Hadoop进行云计算是有益处的。实验采用的数据集包括从UCI数据库得到的真实数据和一些模拟数据,实验使用了4000个集群。实验表明,MCMM模型在精确度和扩展性上优于KNN和其他一些经常使用的基础分类方法。  相似文献   

10.
潘振君  梁成  张化祥 《计算机应用》2021,41(12):3438-3446
针对多视图数据分析易受原始数据集噪声干扰,以及需要额外的步骤计算聚类结果的问题,提出一种基于一致图学习的鲁棒多视图子空间聚类(RMCGL)算法。首先,在各个视图下学习数据在子空间中的潜在鲁棒表示,并基于该表示得到各视图的相似度矩阵。随后,基于得到的多个相似度矩阵学习一个统一的相似度图。最后,通过对相似度图对应的拉普拉斯矩阵添加秩约束,确保得到的相似度图具有最优的聚类结构,并可直接得到最终的聚类结果。该过程在一个统一的优化框架中完成,能同时学习潜在鲁棒表示、相似度矩阵和一致图。RMCGL算法的聚类精度(ACC)在BBC、100leaves和MSRC数据集上比基于图的多视图聚类(GMC)算法分别提升了3.36个百分点、5.82个百分点和5.71个百分点。实验结果表明,该算法具有良好的聚类效果。  相似文献   

11.
In recent years, many information networks have become available for analysis, including social networks, road networks, sensor networks, biological networks, etc. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods mainly focus on the topological structures, but largely ignore the vertex properties which are often heterogeneous. Recently, a new graph clustering algorithm, SA-cluster, has been proposed which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As part of the clustering refinement, the graph edge weights are iteratively adjusted to balance the relative importance between structural and attribute similarities. As a consequence, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances which are affected by the edge weight update. In order to improve the efficiency and scalability of SA-cluster, in this paper, we propose an efficient algorithm In-Cluster to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. We further design parallel matrix computation techniques on a multicore architecture. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.  相似文献   

12.
As we are in the big data age, graph data such as user networks in Facebook and Flickr becomes large. How to reduce the visual complexity of a graph layout is a challenging problem. Clustering graphs is regarded as one of effective ways to address this problem. Most of current graph visualization systems, however, directly use existing clustering algorithms that are not originally developed for the visualization purpose. For graph visualization, a clustering algorithm should meet specific requirements such as the sufficient size of clusters, and automatic determination of the number of clusters. After identifying the requirements of clustering graphs for visualization, in this paper we present a new clustering algorithm that is particularly designed for visualization so as to reduce the visual complexity of a layout, together with a strategy for improving the scalability of our algorithm. Experiments have demonstrated that our proposed algorithm is capable of detecting clusters in a way that is required in graph visualization.  相似文献   

13.
Graph traversals are in the basis of many distributed algorithms. In this paper, we use graph relabelling systems to encode two basic graph traversals which are the broadcast and the convergecast. This encoding allows us to derive formal, modular and simple encoding for many distributed graph algorithms. We illustrate this method by investigating the distributed computation of a breadth-first spanning tree and the distributed computation of a minimum spanning tree. Our formalism allows to focus on the correctness of a distributed algorithm rather than on the implementation and the communication details.  相似文献   

14.
大规模数据下复杂网络的算法分析面临复杂度高的挑战,为此引入图稀疏的思想,在保持原始图性质的情况下以一定的精度在稀疏图上实现了高效的算法分析。图稀疏算法是一种保留顶点、对边稀疏采样的方法。按照相应算法分析所需要的原始图性质,提出图稀疏的边度量方式。文中系统回顾了4种边度量下的图稀疏采样方法:生成图稀疏、边连通图稀疏、聚类图稀疏、边传播性图稀疏,归纳了不同边度量方式下图稀疏的优缺点和适应性,并进一步讨论了动态图流稀疏化的最新研究进展。最后,总结了图稀疏领域有待解决的问题并展望了未来的研究方向。  相似文献   

15.
图聚类是指把图中相对连接紧密的顶点及其相关的边分组形成一个子图的过程,在包括机器学习、数据挖掘、模式识别、图像分析及生物信息等领域有着广泛应用。但是,随着大数据时代的到来,图数据海量增长。面对广泛的大规模图计算需求,由于图结构本身的不规则性,单机算法运行效率低下,用传统的并行计算方法进行图计算难以获得高性能。使用线性代数的方法在Combinatorial BLAS上实现了同辈压力(Peer Pressure)图聚类的分布式算法,首先将该图聚类的算法转换为对稀疏矩阵的运算,从而结构化表示图的不规则数据结构及接入模式,然后基于MPI编程模型将其并行实现。实验结果表明,在并行处理规模达到43亿的由稀疏矩阵表示的超大规模图时,基于线性代数表示的同辈压力图聚类算法在曙光超级计算机上取得了较高的并行性能及良好的可扩展性,在64个核上获得了40.1的并行加速。  相似文献   

16.
Graph similarity is an important notion with many applications. Graph edit distance is one of the most flexible graph similarity measures available. The main problem with this measure is that in practice it can only be computed for small graphs due to its exponential time complexity. This paper addresses the high complexity of graph edit distance computations. Specifically, we present CSI_GED, a novel edge-centric approach for computing graph edit distance through common sub-structure isomorphisms enumeration. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing graph edit distance; it outperforms the state-of-the-art methods by over three orders of magnitude. It also shows that CSI_GED scales the computation gracefully to larger and distant graphs on which current methods fail to run. Moreover, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.  相似文献   

17.
Graph matching and graph edit distance have become important tools in structural pattern recognition. The graph edit distance concept allows us to measure the structural similarity of attributed graphs in an error-tolerant way. The key idea is to model graph variations by structural distortion operations. As one of its main constraints, however, the edit distance requires the adequate definition of edit cost functions, which eventually determine which graphs are considered similar. In the past, these cost functions were usually defined in a manual fashion, which is highly prone to errors. The present paper proposes a method to automatically learn cost functions from a labeled sample set of graphs. To this end, we formulate the graph edit process in a stochastic context and perform a maximum likelihood parameter estimation of the distribution of edit operations. The underlying distortion model is learned using an Expectation Maximization algorithm. From this model we finally derive the desired cost functions. In a series of experiments we demonstrate the learning effect of the proposed method and provide a performance comparison to other models.  相似文献   

18.
基于MapReduce的分布式近邻传播聚类算法   总被引:2,自引:0,他引:2  
随着信息技术迅速发展,数据规模急剧增长,大规模数据处理非常具有挑战性.许多并行算法已被提出,如基于MapReduce的分布式K平均聚类算法、分布式谱聚类算法等.近邻传播(affinity propagation,AP)聚类能克服K平均聚类算法的局限性,但是处理海量数据性能不高.为有效实现海量数据聚类,提出基于MapReduce的分布式近邻传播聚类算法——DisAP.该算法先将数据点随机划分为规模相近的子集,并行地用AP聚类算法稀疏化各子集,然后融合各子集稀疏化后的数据再次进行AP聚类,由此产生的聚类代表作为所有数据点的聚类中心.在人工合成数据、人脸图像数据、IRIS数据以及大规模数据集上的实验表明:DisAP算法对数据规模有很好的适应性,在保持AP聚类效果的同时可有效缩减聚类时间.  相似文献   

19.
In this paper, we present a practical algorithm to extract a curve skeleton of a 3D shape. The core of our algorithm comprises coupled processes of graph contraction and surface clustering. Given a 3D shape represented by a triangular mesh, we first construct an initial skeleton graph by directly copying the connectivity and geometry information from the input mesh. Graph contraction and surface clustering are then performed iteratively. The former merges certain graph nodes based on computation of an approximate centroidal Voronoi diagram, seeded by subsampling the graph nodes from the previous iteration. Meanwhile, a coupled surface clustering process serves to regularize the graph contraction. Constraints are used to ensure that extremities of the graph are not shortened undesirably, to ensure that skeleton has the correct topological structure, and that surface clustering leads to an approximately-centered skeleton of the input shape. These properties lead to a stable and reliable skeleton graph construction algorithm.Experiments demonstrate that our skeleton extraction algorithm satisfies various desirable criteria. Firstly, it produces a skeleton homotopic with the input (the genus of both shapes agree) which is both robust (results are stable with respect to noise and remeshing of the input shape) and reliable (every boundary point is visible from at least one curve-skeleton location). It can also handle point cloud data if we first build an initial skeleton graph based on k-nearest neighbors. In addition, a secondary output of our algorithm is a skeleton-to-surface mapping, which can e.g. be used directly for skinning animation.Highlights(1) An algorithm for curve skeleton extraction from 3D shapes based on coupled graph contraction and surface clustering. (2) The algorithm meets various desirable criteria and can be extended to work for incomplete point clouds.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号