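Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization
Cite this article:Shen Guowei, Yang Wu, Wang Wei, Yu Miao, Dong Guozhong. Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization[J]. 计算机研究与发展 (Journal of Computer Research and Development), 2016, 53(2): 459-466. DOI: 10.7544/issn1000-1239.2016.20148284
Authors:Shen Guowei  Yang Wu  Wang Wei  Yu Miao  Dong Guozhong
Affiliation:1.(Research Center of Information Security, Harbin Engineering University, Harbin 150001) (shenguowei@hrbeu.edu.cn)
Funding:National High Technology Research and Development Program of China (863 Program) (2012AA012802); National Natural Science Foundation of China (61170242)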
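Abstract:A heterogeneous information network contains multiple types of entities and relations. As the data scale grows, the different entity types grow at unbalanced rates and the heterogeneous relational data become extremely sparse, which leads to high time complexity and low accuracy for clustering algorithms. To address these problems, a two-stage co-clustering algorithm based on correlation matrix decomposition, FNMTF-CM, is proposed. In the first stage, the correlations among the entities of the smaller type are extracted to build a correlation matrix, and a partition indicator matrix is obtained by symmetric nonnegative matrix factorization. Compared with the original relational matrix, the correlation matrix is denser and smaller. In the second stage, the partition indicator matrix is used as input to the tri-factorization of the relational matrix, so that the partition indicator matrix of the other entity type can be solved quickly. Experiments on standard benchmark data sets and heterogeneous relational data sets show that the accuracy and performance of the algorithm are overall better than those of traditional co-clustering algorithms based on nonnegative matrix factorization.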

Keywords:heterogeneous network  co-clustering  nonnegative matrix factorization  large-scale data  correlation matrix

Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization
Shen Guowei, Yang Wu, Wang Wei, Yu Miao, Dong Guozhong. Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization[J]. Journal of Computer Research and Development, 2016, 53(2): 459-466. DOI: 10.7544/issn1000-1239.2016.20148284
Authors:Shen Guowei  Yang Wu  Wang Wei  Yu Miao  Dong Guozhong
Affiliation:1.(Research Center of Information Security, Harbin Engineering University, Harbin 150001)
Abstract:A heterogeneous information network contains multiple types of entities and the interactive relations among them. Several co-clustering algorithms have been proposed to mine the underlying structure of the different entity types. However, as the data scale increases, the entity types grow at unbalanced rates and the heterogeneous relational data become extremely sparse. To address this problem, we propose FNMTF-CM, a two-step co-clustering algorithm based on correlation matrix decomposition. In the first step, a correlation matrix is built from the correlations among entities of the smaller type and decomposed by symmetric nonnegative matrix factorization into the indicating matrix of that type. The correlation matrix is denser and smaller than the original heterogeneous relationship matrix, so the algorithm can process large-scale heterogeneous data while maintaining high precision. In the second step, the indicating matrix of the smaller type is used directly as input, so the tri-factorization of the heterogeneous relational matrix is very fast. Experiments on artificial and real-world heterogeneous data sets show that the accuracy and performance of FNMTF-CM are superior to those of traditional co-clustering algorithms based on nonnegative matrix factorization.
Keywords:heterogeneous network  co-clustering  nonnegative matrix factorization  large-scale data  correlation matrix
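
The two steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' FNMTF-CM implementation: the abstract does not give the construction of the correlation matrix or the update rules, so the sketch assumes (1) the rows of the relation matrix `R` are the smaller entity type, (2) the correlation matrix is the co-occurrence matrix `R @ R.T`, (3) symmetric NMF is solved with a damped multiplicative update, and (4) the second step alternates a least-squares solve for the middle factor `S` with hard nearest-prototype assignment of the columns. All function names and parameters are hypothetical.

```python
import numpy as np

def symmetric_nmf(A, k, iters=500, beta=0.5, seed=0):
    """Approximate A ~ H @ H.T with H >= 0 (damped multiplicative update; assumed solver)."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))
    for _ in range(iters):
        numer = A @ H
        denom = H @ (H.T @ H) + 1e-12  # small constant avoids division by zero
        H *= (1.0 - beta) + beta * numer / denom
    return H

def to_indicator(labels, k):
    """Turn integer cluster labels into a 0/1 indicator matrix of shape (len(labels), k)."""
    F = np.zeros((labels.size, k))
    F[np.arange(labels.size), labels] = 1.0
    return F

def cocluster_two_stage(R, k_rows, k_cols, iters=20, seed=0):
    """Sketch of a two-step co-clustering of R (rows: smaller entity type)."""
    # Step 1: correlation matrix of the smaller type (assumption: co-occurrence R R^T),
    # then symmetric NMF; each row entity takes the cluster with the largest weight.
    A = R @ R.T
    H = symmetric_nmf(A, k_rows, seed=seed)
    row_labels = H.argmax(axis=1)
    F = to_indicator(row_labels, k_rows)

    # Step 2: F stays fixed. Initial column labels come from each column's average
    # link strength to every row cluster (a heuristic chosen for this sketch).
    sizes = np.maximum(F.sum(axis=0), 1.0)
    profile = (F.T @ R) / sizes[:, None]
    col_labels = profile.argmax(axis=0) % k_cols
    for _ in range(iters):
        G = to_indicator(col_labels, k_cols)
        # Least-squares middle factor for R ~ F S G^T with F and G fixed.
        S = np.linalg.pinv(F) @ R @ np.linalg.pinv(G).T
        prototypes = F @ S  # one prototype column per column cluster
        # Squared distance from every column of R to every prototype.
        dist = ((R ** 2).sum(axis=0)[:, None]
                - 2.0 * (R.T @ prototypes)
                + (prototypes ** 2).sum(axis=0)[None, :])
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, col_labels):
            break
        col_labels = new_labels
    return row_labels, col_labels

# Toy relation matrix: 6 row entities, 8 column entities, two diagonal blocks.
R_toy = np.zeros((6, 8))
R_toy[:3, :4] = 1.0
R_toy[3:, 4:] = 1.0
toy_rows, toy_cols = cocluster_two_stage(R_toy, k_rows=2, k_cols=2)
print(toy_rows, toy_cols)
```

In this sketch the iterative factorization in the first step runs on a matrix whose side length is the size of the smaller entity type, and the second step never updates `F`, which mirrors the abstract's explanation of why the tri-factorization is fast. Cluster label numbers are arbitrary; only the grouping is meaningful.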