20 similar articles found (search time: 31 ms)
Sun Chenchen, Shen Derong, Kou Yue, Nie Tiezheng, Yu Ge. Journal of Software (《软件学报》), 2016, 27(9): 2303-2319
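Entity resolution is an important aspect of data quality and is indispensable for big data processing. Existing work on entity resolution focuses on similarity algorithms for data objects, blocking techniques, and supervised entity resolution, while the match decision problem in unsupervised entity resolution has rarely been addressed. This paper proposes a clustering algorithm for entity resolution to fill this gap. A weighted similarity graph is built from the data objects and their similarities. During clustering, a random walk with restart on the similarity graph is used to compute the similarity between a cluster and a node dynamically. The basic logic of the clustering is that a cluster iteratively absorbs the node nearest to it. A data object ordering method is proposed to optimize the clustering order and improve clustering accuracy, and an optimized method for computing the stationary probability distribution of the random walk is proposed to reduce the cost of the clustering algorithm. Comparative experiments on real and synthetic data sets verify the effectiveness of the algorithm.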
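The cluster-to-node similarity described in this abstract can be illustrated with a plain random walk with restart. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the function names `rwr_scores` and `nearest_node`, the restart probability 0.15, the fixed iteration count, and the toy graph are all hypothetical, and the paper's ordering method and optimized stationary-distribution computation are not reproduced.

```python
def rwr_scores(weights, seeds, restart=0.15, iters=200):
    """Random walk with restart on a weighted similarity graph.

    weights: dict node -> {neighbor: similarity weight}; every node is
             assumed to have at least one neighbor with positive weight.
    seeds:   nodes of the current cluster; the walker restarts uniformly
             among them.
    Returns a dict node -> approximate stationary visiting probability.
    """
    nodes = list(weights)
    restart_vec = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(restart_vec)
    for _ in range(iters):
        # With probability `restart` the walker jumps back to the cluster.
        nxt = {v: restart * restart_vec[v] for v in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            # Otherwise it follows an edge with probability proportional to its weight.
            for v, w in weights[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / total
        p = nxt
    return p


def nearest_node(weights, cluster):
    """Return the node outside `cluster` with the highest restart-walk score."""
    scores = rwr_scores(weights, cluster)
    outside = {v: s for v, s in scores.items() if v not in cluster}
    return max(outside, key=outside.get)


# Hypothetical similarity graph: a chain with one weak link between c and d.
graph = {
    "a": {"b": 0.9},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"b": 0.8, "d": 0.1},
    "d": {"c": 0.1, "e": 0.9},
    "e": {"d": 0.9},
}
print(nearest_node(graph, {"a"}))
```

A cluster would then grow by repeatedly calling `nearest_node` and adding the returned node until some stopping rule applies; that rule is not given in the abstract and is left out here.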

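Existing sampling methods for online social networks (OSNs) cannot be applied effectively to social networks with low connectivity, and they suffer from problems such as a sample average vertex degree that deviates severely from that of the original network and oversampling of vertices. To address these problems, this paper builds on the Metropolis-Hastings random walk (MHRW) sampling method and introduces a double-jump strategy, a parallel mechanism, and a vertex buffer, proposing a jump unbiased parallel vertex sampling (JPS) method. Online social network data sets are modeled as social graphs consisting of vertices and edges for simulated sampling, and the Python/Matplotlib plotting library is used to plot the attributes of the sampled vertices. Experimental results show that the method can be applied more effectively to social graphs of different connectivity strengths, improves the vertex update rate during sampling, reduces the average degree deviation of the sampled vertices, and converges faster.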
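As background for this abstract, the baseline MHRW sampler it builds on can be sketched as follows. This is the standard Metropolis-Hastings random walk only, under the usual acceptance rule min(1, deg(u)/deg(v)); the double-jump strategy, parallel mechanism, and vertex buffer of JPS are not described in enough detail here to reproduce. The function name `mhrw_sample` and the toy star graph are assumptions for illustration.

```python
import random


def mhrw_sample(adj, start, steps, seed=0):
    """Baseline Metropolis-Hastings random walk over an undirected graph.

    adj: dict vertex -> list of neighbors (every vertex has at least one).
    Returns the sequence of visited vertices, including repeats when a
    proposed move is rejected.
    """
    rng = random.Random(seed)
    current = start
    visited = [current]
    for _ in range(steps):
        candidate = rng.choice(adj[current])
        # Accept with probability min(1, deg(current) / deg(candidate)),
        # which removes the bias of a simple random walk toward high degrees.
        if rng.random() < min(1.0, len(adj[current]) / len(adj[candidate])):
            current = candidate
        visited.append(current)
    return visited


# Hypothetical star graph: one hub and four leaves.
star = {
    "hub": ["l1", "l2", "l3", "l4"],
    "l1": ["hub"],
    "l2": ["hub"],
    "l3": ["hub"],
    "l4": ["hub"],
}
walk = mhrw_sample(star, "hub", 50)
print(len(set(walk)), "distinct vertices visited")
```

On a graph like this a simple random walk would sit on the hub half of the time; the rejection step is what pushes the visiting frequencies toward uniform.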

Two-level random walk semi-supervised clustering   Total citations: 3, self-citations: 0, citations by others: 3
He Ping, Xu Xiaohua, Lu Lin, Chen Ling. Journal of Software (《软件学报》), 2014, 25(5): 997-1013
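Semi-supervised clustering aims to partition all data points into clusters according to user-provided must-link and cannot-link constraints, so as to obtain clustering results that are more accurate and better match user requirements. Most current semi-supervised clustering algorithms modify existing clustering algorithms or incorporate metric learning so that the clustering result agrees with the pairwise constraints as much as possible, but they rarely consider the explicit degree of influence that pairwise constraints exert on the surrounding unconstrained data. This paper proposes a two-level random walk semi-supervised clustering algorithm consisting of a lower-level random walk on vertices and a higher-level random walk on components. The lower-level random walk computes the range and degree of influence of each selected constrained vertex on the other vertices, which is called a component. The higher-level random walk then propagates each pairwise constraint over the components with adaptive strength and combines their influence on every vertex into a cluster indicator matrix. Experimental results on UCI data sets and large real-world data sets show that the algorithm is more accurate than other semi-supervised clustering algorithms and also fairly efficient.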

Clustering networks plays a key role in many scientific fields, from Biology to Sociology and Computer Science. Some clustering approaches are called global because they exploit knowledge about the whole network topology. Conversely, so-called local methods require only partial knowledge of the network topology. Global approaches yield accurate results but do not scale well to large networks; local approaches, by contrast, are less accurate but computationally fast. We propose CONCLUDE (COmplex Network CLUster DEtection), a new clustering method that couples the accuracy of global approaches with the scalability of local methods. CONCLUDE generates random, non-backtracking walks of finite length to compute the importance of each edge in keeping the network connected, i.e., its edge centrality. Edge centralities allow vertices to be mapped onto points of a Euclidean space and all-pairs distances between vertices to be computed; those distances are then used to partition the network into clusters.

Node similarity measurement algorithm for structure-attribute balanced graphs   Total citations: 1, self-citations: 0, citations by others: 1
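Node similarity is an important foundation of graph clustering algorithms. In existing structure-attribute graph clustering methods, the limitations of the traditional graph model require repeated matrix multiplications to adjust the weights of attribute edges, which makes the algorithms inefficient. To solve this problem, the concept of a structure-attribute balanced graph is proposed, and a random walk model is used to measure the similarity between vertices in the structure-attribute balanced graph GB in a unified way. Compared with existing methods, this method can measure not only the similarity between directly connected vertices but also the similarity between vertices that are not directly connected but are joined by paths of different lengths. It does not increase the size of the original similarity matrix, which saves a large amount of storage space and improves execution efficiency.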

Parallel updates of minimum spanning trees (MSTs) have been studied in the past. These updates allowed a single change in the underlying graph, such as a change in the cost of an edge or an insertion of a new vertex. Multiple update problems for MSTs are concerned with handling more than one such change. In the sequential case multiple update problems may be solved using repeated applications of an efficient algorithm for a single update. However, for efficiency reasons, parallel algorithms for multiple update problems must consider all changes to the underlying graph simultaneously. In this paper we describe parallel algorithms for updating an MST when k new vertices are inserted or deleted in the underlying graph, when the costs of k edges are changed, or when k edge insertions and deletions are performed. For multiple vertex insertion update, our algorithm achieves time and processor bounds of O(log n · log k) and nk/(log n · log k), respectively, on a CREW parallel random access machine. These bounds are optimal for dense graphs. A novel feature of this algorithm is a transformation of the previous MST and k new vertices to a bipartite graph which enables us to obtain the above-mentioned bounds.

Let G=(V,E) be a weighted undirected graph, with non-negative edge weights. We consider the problem of efficiently computing approximate distances between all pairs of vertices in G. While many efficient algorithms are known for this problem in unweighted graphs, not many results are known for this problem in weighted graphs. Zwick (J. Assoc. Comput. Mach. 49:289–317, 2002) showed that for any fixed ε>0, stretch 1+ε distances (a path in G between u, v ∈ V is said to be of stretch t if its length is at most t times the distance between u and v in G) between all pairs of vertices in a weighted directed graph on n vertices can be computed in $\tilde{O}(n^{\omega})$ time, where ω<2.376 is the exponent of matrix multiplication and n is the number of vertices. It is known that finding distances of stretch less than 2 between all pairs of vertices in G is at least as hard as Boolean matrix multiplication of two n×n matrices. Here we show that all pairs stretch 2+ε distances for any fixed ε>0 in G can be computed in expected time $O(n^{9/4})$. This algorithm uses a fast rectangular matrix multiplication subroutine. We also present a combinatorial algorithm (that is, it does not use fast matrix multiplication) with expected running time $O(n^{9/4})$ for computing all-pairs stretch 5/2 distances in G. This combinatorial algorithm will serve as a key step in our all-pairs stretch 2+ε distances algorithm.
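The stretch condition defined in this abstract can be written as a single inequality. The symbols below are assumed notation, not taken from the paper: $\delta_G(u,v)$ is the exact distance in G and $\hat{\delta}(u,v)$ is the length of the reported path. The lower bound holds because any path between u and v is at least as long as a shortest one.

```latex
% Assumed notation: \delta_G = exact distance, \hat{\delta} = reported estimate.
\delta_G(u,v) \;\le\; \hat{\delta}(u,v) \;\le\; t \cdot \delta_G(u,v)
\qquad \text{for all } u, v \in V
```

With t = 1+ε this is the guarantee attributed to Zwick, and with t = 2+ε or t = 5/2 it is the guarantee of the two algorithms in the abstract.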

Pattern Recognition, 2014, 47(2): 820-832
A key issue in semi-supervised clustering is how to utilize the limited but informative pairwise constraints. In this paper, we propose a new graph-based constrained clustering algorithm, named SCRAWL. It is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertices (i.e., data points) into constrained and unconstrained ones, according to whether they appear in the pairwise constraints. For every constrained vertex, its influence range, or the degree of influence it exerts on the unconstrained vertices, is encapsulated in an intermediate structure called a component. The edge set between each pair of components determines the affecting scope of the pairwise constraints. In the higher-level random walk, SCRAWL enforces the pairwise constraints on the components, so that the constraint influence can be propagated to the unconstrained edges. Finally, we combine the cluster memberships of all the components to obtain the cluster assignment for each vertex. The promising experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our method.

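The minimum vertex cover problem is a widely applied NP-hard problem, and an incremental attribute reduction method is presented for it. First, the minimum vertex cover problem is transformed into the minimum attribute reduction problem of a decision table. Then, using the idea of incremental attribute reduction, an incremental attribute reduction algorithm is proposed that updates the minimum vertex cover as the number of edges in the graph grows. The time complexity of this algorithm is lower than that of computing the minimum vertex cover of the whole graph. For large-scale graphs, the minimum vertex cover can be updated dynamically as edges are added, which reduces the running time of solving the minimum vertex cover problem by attribute reduction. Experimental results show the feasibility and effectiveness of the algorithm.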

When searching for a marked vertex in a graph, Szegedy’s usual search operator is defined by using the transition probability matrix of the random walk with absorbing barriers at the marked vertices. Instead of using this operator, we analyze searching with Szegedy’s quantum walk by using reflections around the marked vertices, that is, the standard form of quantum query. We show that the probability of finding a marked vertex in the complete graph can be boosted to 1. Numerical simulations suggest that the success probability can be improved for other graphs, like the two-dimensional grid. We also prove that, for a certain class of graphs, we can express Szegedy’s search operator, obtained from the absorbing walk, using the standard query model.

Community detection methods based on random walks are widely adopted in various network analysis tasks. They can capture structural and attribute information while alleviating the effect of noise. Though random walks on plain networks have been studied before, in real-world networks, nodes are often not pure vertices but have different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network and bring opportunities to random-walk-based analysis. However, node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures. Accordingly, attributed community detection based on random walks is challenging, as it requires joint modelling of graph structures and node attributes. To bridge this gap, we propose Community detection with Attributed random walk via Seed replacement (CAS). Our model overcomes the limitation of directly using the original network topology while ignoring the attribute information. In particular, the algorithm consists of four stages to better identify communities: (1) select initial seed nodes in the network; (2) capture the better-quality seed replacement path set; (3) generate the structure-attribute interaction transition matrix and perform the colored random walk; (4) utilize the parallel conductance to expand the communities. Experiments on synthetic and real-world networks demonstrate the effectiveness of CAS.

Efficiently searching for top-k representative vertices is crucial for understanding the structure of large dynamic graphs. Recent studies show that communities formed by a vertex with high local clustering coefficient and its neighbours can achieve enhanced information propagation speed as well as disease transmission speed. However, local clustering coefficient, which measures the cliquishness of a vertex in its local neighbourhood, prefers vertices with small degrees. To remedy this issue, in this paper we propose a new ranking measure, weighted clustering coefficient (WCC) of vertices, by integrating both local clustering coefficient and degree. WCC not only inherits the properties of local clustering coefficient but also approximately measures the density (i.e., average degree) of its neighbourhood subgraph. Thus, vertices with higher WCC are more likely to be representative. We study efficiently computing and monitoring top-k representative vertices based on WCC over large dynamic graphs. To reduce the search space, we propose a series of heuristic upper bounds for WCC to prune a large portion of disqualifying vertices from the search space. We also develop an approximation algorithm by utilizing Flajolet-Martin sketch to trade acceptable accuracy for enhanced efficiency. An efficient incremental algorithm dealing with frequent updates in dynamic graphs is explored as well. Extensive experimental results on a variety of real-life graph datasets demonstrate the efficiency and effectiveness of our approaches.
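The local clustering coefficient that this abstract starts from can be computed directly from adjacency sets. The sketch below covers only that standard quantity (the fraction of a vertex's neighbour pairs that are themselves connected); the WCC formula is not given in the abstract, so it is deliberately not reproduced. The function name `local_clustering` and the toy graph are assumptions for illustration.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are connected to each other.

    adj: dict vertex -> set of neighbours (undirected, no self-loops).
    """
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # the coefficient is conventionally 0 when no pair exists
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical graph: a triangle x-y-z attached to a degree-4 vertex h.
adj = {
    "x": {"y", "z"},
    "y": {"x", "z", "h"},
    "z": {"x", "y"},
    "h": {"y", "p", "q", "r"},
    "p": {"h", "q"},
    "q": {"h", "p", "r"},
    "r": {"h", "q"},
}
for v in ("x", "h"):
    print(v, len(adj[v]), local_clustering(adj, v))
```

In this toy graph the degree-2 vertex `x` scores higher than the degree-4 vertex `h`, which is the small-degree preference the abstract sets out to correct.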

We propose a novel distributed algorithm to cluster graphs. The algorithm recovers the solution obtained from spectral clustering without the need for expensive eigenvalue/eigenvector computations. We prove that, by propagating waves through the graph, a local fast Fourier transform yields the local component of every eigenvector of the Laplacian matrix, thus providing clustering information. For large graphs, the proposed algorithm is orders of magnitude faster than random walk based approaches. We prove the equivalence of the proposed algorithm to spectral clustering and derive convergence rates. We demonstrate the benefit of using this decentralized clustering algorithm for community detection in social graphs, accelerating distributed estimation in sensor networks and efficient computation of distributed multi-agent search strategies.

Because of its wide application, the subgraph matching problem has been studied extensively during the past decade. However, most existing solutions assume that a data graph is a vertex/edge-labeled graph (i.e., each vertex/edge has a simple label). These solutions build structural indices by considering the vertex labels. However, some real graphs contain rich-content vertices such as user profiles in social networks and HTML pages on the World Wide Web. In this study, we consider the problem of subgraph matching using a more general scenario. We build a structural index that does not depend on any vertex content. Based on the index, we design a holistic subgraph matching algorithm that considers the query graph as a whole and finds one match at a time. In order to further improve efficiency, we propose a “partial evaluation and assembly” framework to find subgraph matches over large graphs. Last but not least, our index has light maintenance overhead. Therefore, our method can work well on dynamic graphs. Extensive experiments on real graphs show that our method outperforms the state-of-the-art algorithms.

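Most existing overlapping community detection algorithms start directly from the similarity of adjacent links and cannot effectively use the multi-level link information of the network. This paper therefore proposes LDM, an overlapping community detection algorithm based on a link distance matrix. First, a link-node-link random walk model is used to exploit multi-level link information effectively. Second, fuzzy clustering is applied to the link distance matrix to obtain link communities. Finally, the overlapping community structure is adjusted and optimized according to the extended modularity. Experimental results on synthetic and real-world networks show that the proposed algorithm effectively improves the accuracy of overlapping community detection.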

Li Shuguang, Zhou Tong. Computer Science (《计算机科学》), 2011, 38(11): 241-244
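The bounded clustering problem arises from System S, a distributed stream processing system developed at IBM Research. The input is an undirected graph with vertex weights and edge weights, together with a set of designated vertices called terminals. A subset of the vertex set is called a cluster. The cost of a cluster is the total weight of its vertices plus the total weight of the edges on its boundary. The bounded clustering problem asks for a clustering of all vertices such that the cost of each cluster does not exceed a given budget B, each cluster contains at most one terminal, and the total cost of all clusters is minimized. For the bounded clustering problem on graphs of bounded treewidth, a pseudo-polynomial time exact algorithm is given. Modifying this algorithm with a rounding technique yields a (1+ε)-approximate solution in polynomial time in which the cost of each cluster is at most (1+ε)B, where ε is an arbitrarily small positive number. If each cluster is further required to contain exactly one terminal, the algorithm yields a (1+ε)-approximate solution in polynomial time in which the cost of each cluster is at most (2+ε)B.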

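This paper studies algorithms for graph clustering. For partition-based graph clustering, it focuses on comparing methods for computing the distance between vertices and their effect on the clustering result. Because the vertices of a social network graph have no coordinates, Euclidean distance and Manhattan distance cannot be used. With the k-medoids clustering algorithm, the shortest distance and the random walk distance are each used to partition the social network graph built from the DBLP data set into subgraphs, and experimental data are used to compare the two algorithms. The experiments show that the shortest distance algorithm produces better clusters and achieves good partitioning results.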
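The combination of k-medoids with shortest-path distance mentioned in this abstract can be sketched on a small unweighted graph. This is a minimal illustration under assumptions, not the paper's implementation: the graph is assumed connected and unweighted, distances come from breadth-first search, the names `bfs_distances` and `k_medoids` and the toy graph are hypothetical, and the random walk distance variant is not shown.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop-count shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def k_medoids(adj, medoids, max_iter=20):
    """Alternate between assigning vertices and re-picking medoids.

    adj:     dict vertex -> list of neighbours (graph assumed connected).
    medoids: distinct initial medoid vertices.
    Returns a dict medoid -> list of vertices assigned to it.
    """
    dist = {v: bfs_distances(adj, v) for v in adj}
    medoids = list(medoids)
    for _ in range(max_iter):
        # Assignment step: each vertex joins its closest medoid.
        clusters = {m: [] for m in medoids}
        for v in adj:
            nearest = min(medoids, key=lambda m: dist[v][m])
            clusters[nearest].append(v)
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][v] for v in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Hypothetical graph: two triangles joined by the edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
clusters = k_medoids(graph, ["a", "b"])
print(clusters)
```

Because vertices in a social graph have no coordinates, the only inputs here are the adjacency lists; swapping `bfs_distances` for another vertex-to-vertex distance would be the place to try the random walk alternative.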

Building k-connected neighborhood graphs for isometric data embedding   Total citations: 2, self-citations: 0, citations by others: 2
Isometric data embedding using geodesic distance requires the construction of a connected neighborhood graph so that the geodesic distance between every pair of data points can be estimated. This paper proposes an approach for constructing k-connected neighborhood graphs. The approach works by applying a greedy algorithm to add each edge, in a nondecreasing order of edge length, to a neighborhood graph if the end vertices of the edge are not yet k-connected on the graph. The k-connectedness between vertices is tested using a network flow technique by assigning every vertex a unit flow capacity. This approach is applicable to a wide range of data. Experiments show that it gives better estimation of geodesic distances than other approaches, especially when the data are undersampled or nonuniformly distributed.

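Recommendation is an important way to increase user activity in applications such as social networks, but the huge number of nodes and the complex relationships between them make recommendation in social networks challenging. Random walk is a strategy that can solve this kind of recommendation problem effectively, but traditional random walk algorithms do not fully consider the differences in influence between adjacent nodes. This paper proposes an FP-Growth-based random walk recommendation method on graphs. Based on the graph structure of the social network, it uses the FP-Growth algorithm to mine the frequency between adjacent nodes, builds the transition probability matrix from these frequencies to perform the random walk computation, and finally obtains a ranking of friend importance and makes recommendations. The method keeps the ability of random walk methods to alleviate data sparsity while also accounting for the differences between node connections. Experimental results show that the proposed method achieves better recommendation performance than the traditional random walk algorithm.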

Liu Jingshu, Wang Li, Liu Jinglei. Journal of Computer Applications (《计算机应用》), 2005, 40(12): 3413-3422
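To address the excessive time consumed by eigendecomposition in traditional spectral clustering when the number of samples is large, a fast spectral clustering algorithm that requires no eigendecomposition is proposed, which reduces time cost through multiplicative update iterations. First, the Nyström method is used for random sampling to establish the relationship between the sampled matrix and the original matrix. Second, the indicator matrix is updated iteratively based on the multiplicative update principle. Finally, the correctness and convergence of the algorithm are analyzed theoretically. The algorithm was tested on five widely used real-world data sets and three synthetic data sets. Experimental results show that on the real-world data sets the average normalized mutual information (NMI) of the proposed algorithm is 0.45, which is 12.50% higher than that of k-means clustering, and its running time is 61.73 s, which is 61.13% lower than that of traditional spectral clustering. It also outperforms hierarchical clustering, which verifies the effectiveness of the algorithm.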
