FPC: Fast Incremental Clustering for Large Scale Web Pages

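Cite this article:	YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi. FPC: Fast Incremental Clustering for Large Scale Web Pages[J]. Journal of Chinese Information Processing, 2016, 30(2): 182-188.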

Authors:	YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi

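Affiliation:	1. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100190; 3. China Information Technology Security Evaluation Center, Beijing 100085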

Funding:	National 973 Program (2012CB316303, 2013CB329602); National 863 Program (2014AA015204); National Natural Science Foundation of China (61232010, 61425016, 61572473, 61572467)

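Abstract:	Clustering web pages by structural similarity is an important technique in web data mining. Traditional web page clustering methods do not define a representation for the center of a page cluster, so computing point-to-cluster or cluster-to-cluster similarity requires computing the similarity of many point pairs. Such algorithms are generally slower than those that use cluster centers, and they struggle to meet the demands of large-scale, fast incremental clustering. To address this problem, this paper proposes FPC (Fast Page Clustering), a fast incremental web page clustering method. The method first introduces a new way of computing web page similarity that runs 500 times faster than the Simple Tree Matching algorithm. It then defines a representation for web page cluster centers and, on that basis, clusters pages with MKmeans (Merge-Kmeans), a variant of the Kmeans algorithm, which improves efficiency at the clustering-algorithm level. Finally, it uses locality-sensitive hashing to quickly find the most similar cluster among a very large set of page clusters, which improves efficiency at the incremental-merging level.
Keywords:	DOM tree; layered vectors; web page cluster center; locality-sensitive hashing; fast incremental clustering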
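
The efficiency argument in the abstract (one comparison against a cluster center instead of many pairwise comparisons) can be illustrated with a minimal Python sketch. It is not the paper's MKmeans algorithm, whose details the abstract does not give. It assumes each page has already been turned into a fixed-length numeric vector, uses cosine similarity and a mean vector as the center, and applies a plain threshold rule; the names `Cluster`, `assign`, and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class Cluster:
    """A cluster summarized by its mean vector (assumed center representation)."""

    def __init__(self, dim):
        self.center = [0.0] * dim
        self.size = 0

    def add(self, vec):
        # Running mean: the center is updated without revisiting earlier members.
        self.size += 1
        self.center = [c + (x - c) / self.size for c, x in zip(self.center, vec)]


def assign(vec, clusters, threshold=0.8):
    """Put vec in the most similar cluster, or open a new one below the threshold.

    Each cluster costs exactly one similarity computation (against its center),
    regardless of how many pages it already contains.
    """
    best, best_sim = None, -1.0
    for cluster in clusters:
        sim = cosine(vec, cluster.center)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is None or best_sim < threshold:
        best = Cluster(len(vec))
        clusters.append(best)
    best.add(vec)
    return best
```

With this sketch, adding a page costs one similarity computation per existing cluster; a center-free method would instead compare the page against several members of every cluster.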
FPC: Fast Incremental Clustering for Large Scale Web Pages

YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi. FPC: Fast Incremental Clustering for Large Scale Web Pages[J]. Journal of Chinese Information Processing, 2016, 30(2): 182-188.

Authors:	YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi

Affiliation:	1. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100190, China; 3. China Information Technology Security Evaluation Center, Beijing 100085, China

Abstract:	Structure-oriented web page clustering is one of the most important techniques in web data mining. Traditional methods give no formal definition of the web page cluster center, so they must compute several pairwise similarities to obtain the similarity between a point and a cluster or between two clusters. These methods are much slower than clustering algorithms that use a cluster center, and they cannot meet the needs of large-scale, fast incremental web page clustering. To solve these issues, this paper proposes a fast incremental clustering method, FPC (Fast Page Clustering). In our method, a new approach is given to calculate the similarity between two web pages, which is 500 times faster than the Simple Tree Matching algorithm; then a formal representation of the web page cluster center is described, and a Kmeans-like clustering algorithm, MKmeans (Merge-Kmeans), is applied for fast clustering; moreover, we use locality-sensitive hashing to quickly find the most similar cluster in a large-scale cluster set, which improves the efficiency of incremental clustering.

Keywords:	DOM tree; layered vectors; web page cluster center; locality-sensitive hashing; fast incremental clustering
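
The hashing step mentioned in the abstract, finding the most similar cluster without scanning every cluster, can be sketched as follows. The abstract does not say which hash family FPC uses, so this sketch assumes the common random-hyperplane scheme over numeric center vectors. The class `HyperplaneLSH`, its parameters, and the cluster ids are hypothetical, not taken from the paper.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Buckets cluster centers by the sign pattern of random projections."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        # One random hyperplane (normal vector) per signature bit.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        self.buckets = defaultdict(list)

    def signature(self, vec):
        # Bit i records which side of hyperplane i the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, cluster_id, center):
        self.buckets[self.signature(center)].append((cluster_id, center))

    def candidates(self, vec):
        # Only centers sharing the query's bucket are returned for exact comparison.
        return self.buckets.get(self.signature(vec), [])


index = HyperplaneLSH(dim=3)
index.add("cluster-a", [1.0, 2.0, 3.0])
# A vector pointing in the same direction lands in the same bucket.
print(index.candidates([2.0, 4.0, 6.0]))
```

Vectors separated by a small angle tend to share a signature, so a new page is compared only against the few centers in its bucket. In practice several hash tables are usually combined to reduce missed neighbors; that refinement is omitted here.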
