
FPC: 大规模网页的快速增量聚类
引用本文:余 钧,郭 岩,张 凯,刘 林,刘 悦,俞晓明,程学旗. FPC: 大规模网页的快速增量聚类[J]. 中文信息学报, 2016, 30(2): 182-188
作者姓名:余 钧  郭 岩  张 凯  刘 林  刘 悦  俞晓明  程学旗
作者单位:1. 中国科学院 计算技术研究所 中国科学院网络数据科学与技术重点实验室,北京 100190;
2. 中国科学院大学,北京 100190;
3. 中国信息安全评测中心,北京 100085
基金项目:国家973计划(2012CB316303,2013CB329602);国家863计划(2014AA015204);国家自然科学基金(61232010,61425016,61572473,61572467)
摘    要:面向结构相似的网页聚类是网络数据挖掘的一项重要技术。传统的网页聚类没有给出网页簇中心的表示方式,在计算点簇间和簇簇间相似度时需要计算多个点对的相似度,这种聚类算法一般比使用簇中心的聚类算法慢,难以满足大规模快速增量聚类的需求。针对此问题,该文提出一种快速增量网页聚类方法FPC(Fast Page Clustering)。在该方法中,先提出一种新的计算网页相似度的方法,其计算速度是简单树匹配算法的500倍;给出一种网页簇中心的表示方式,在此基础上使用Kmeans算法的一个变种MKmeans(Merge-Kmeans)进行聚类,在聚类算法层面上提高效率;使用局部敏感哈希技术,从数量庞大的网页类集中快速找出最相似的类,在增量合并层面上提高效率。

关 键 词:DOM树分层向量  网页簇中心  局部敏感哈希  快速增量聚类  

FPC: Fast Incremental Clustering for Large Scale Web Pages
YU Jun,GUO Yan,ZHANG Kai,LIU Lin,LIU Yue,YU Xiaoming,CHENG Xueqi. FPC: Fast Incremental Clustering for Large Scale Web Pages[J]. Journal of Chinese Information Processing, 2016, 30(2): 182-188
Authors:YU Jun  GUO Yan  ZHANG Kai  LIU Lin  LIU Yue  YU Xiaoming  CHENG Xueqi
Affiliation:1. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100190, China;
3. China Information Technology Security Evaluation Center, Beijing 100085, China
Funding:National Basic Research Program of China (973 Program) (2012CB316303, 2013CB329602); National High Technology Research and Development Program of China (863 Program) (2014AA015204); National Natural Science Foundation of China (61232010, 61425016, 61572473, 61572467)
Abstract:Structure-oriented web page clustering is an important technique in web data mining. Traditional methods do not define a representation of the web page cluster center, so computing the similarity between a point and a cluster, or between two clusters, requires computing the similarities of many point pairs. Such methods are generally slower than clustering algorithms that use cluster centers, and they cannot meet the demands of fast incremental clustering at large scale. To address this problem, this paper proposes a fast incremental web page clustering method, FPC (Fast Page Clustering). First, a new method for computing the similarity between two web pages is proposed, which is 500 times faster than the Simple Tree Matching algorithm. Second, a representation of the web page cluster center is given, and on this basis MKmeans (Merge-Kmeans), a variant of the Kmeans algorithm, is used for clustering, which improves efficiency at the level of the clustering algorithm. Third, locality-sensitive hashing is used to quickly find the most similar cluster in a very large set of clusters, which improves efficiency at the level of incremental merging.
Keywords:DOM tree layered vectors  web page cluster center  locality-sensitive hashing  fast incremental clustering
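
The three ingredients named in the abstract (a vector representation of a page's DOM structure, a cluster center, and locality-sensitive hashing) can be illustrated with a small sketch. This is not the paper's method: the abstract does not define the "DOM tree layered vector", so the sketch assumes a simple stand-in that counts tags per tree depth, takes the cluster center to be the mean of its members' vectors, and uses random-hyperplane hashing, a standard LSH scheme for cosine similarity. All function and class names are hypothetical.

```python
import math
import random
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DepthTagCounter(HTMLParser):
    """Counts (depth, tag) pairs while walking the markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[(self.depth, tag)] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def layered_vector(html):
    """Assumed representation: sparse vector of tag counts per depth."""
    parser = DepthTagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def center(vectors):
    """Assumed cluster center: component-wise mean of the member vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds the values key by key
    return {k: c / len(vectors) for k, c in total.items()}


def lsh_signature(vec, bits=16, seed=0):
    """Random-hyperplane signature: one sign bit per hyperplane.

    Each hyperplane weight is drawn from a generator seeded with a string,
    so the same feature always gets the same weight across runs.
    """
    sig = 0
    for i in range(bits):
        proj = 0.0
        for key, value in vec.items():
            weight = random.Random(f"{seed}:{i}:{key}").gauss(0.0, 1.0)
            proj += weight * value
        sig = (sig << 1) | int(proj >= 0)
    return sig


if __name__ == "__main__":
    a = layered_vector("<html><body><div><p>a</p><p>b</p></div></body></html>")
    b = layered_vector("<html><body><div><p>x</p><p>y</p></div></body></html>")
    c = layered_vector("<html><body><table><tr><td>z</td></tr></table></body></html>")
    print(cosine(a, b), cosine(a, c))
    print(lsh_signature(a) == lsh_signature(b))
```

In a sketch like this, a new page would be compared only against cluster centers whose signature falls in the same hash bucket, so the cost of an incremental merge does not grow with the total number of clusters. A practical system would typically split the signature into several bands to trade recall against bucket size.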