A Fast Clustering Algorithm for Massive Data (一种海量数据快速聚类算法)

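Cite this article:	HE Qian, LI Shuang-fu, HUANG Huan, XU Hong. A Fast Clustering Algorithm for Massive Data[J]. Journal of Beijing University of Posts and Telecommunications, 2020, 43(3): 118-124. DOI: 10.13190/j.jbupt.2019-078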

Authors:	HE Qian, LI Shuang-fu, HUANG Huan, XU Hong

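Affiliation:	1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China; 2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China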

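Funding:	National Natural Science Foundation of China (61661015, 61967005); Guangxi Innovation-Driven Development Major Special Project (AA17202024); Guangxi Science and Technology Innovation Team Project (2019GXNSFGA245004)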

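Abstract:	To meet the demands of massive data processing, a grid-based fast K-means clustering algorithm (SPGK) is proposed. An algorithm for selecting the number of clusters based on grid centroids is designed: the data are partitioned into grid cells, the centroid of each cell is computed, and these centroids are used as the sample points for K-means clustering, which reduces the number of Euclidean distance computations in K-means. The algorithm is implemented on the Spark platform for parallel computation, which further improves its running efficiency. SPGK not only achieves good clustering quality but also reduces the number of Euclidean distance computations, making it suitable for fast clustering of massive data. Experimental results on datasets at the ten-million-record scale show that SPGK clearly outperforms the existing K-means++ and the recursive partitioning method based on K-means clustering.
Keywords:	fast clustering; Spark; optimal initial clustering point; grid partitioning
Received:	2019-05-11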
A Fast Clustering Algorithm for Massive Data

HE Qian,LI Shuang-fu,HUANG Huan,XU Hong. A Fast Clustering Algorithm for Massive Data[J]. Journal of Beijing University of Posts and Telecommunications, 2020, 43(3): 118-124. DOI: 10.13190/j.jbupt.2019-078

Authors:	HE Qian, LI Shuang-fu, HUANG Huan, XU Hong

Affiliation:	1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China; 2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China

Abstract:	To meet the requirements of massive data processing, a grid-based fast K-means clustering algorithm (SPGK) is proposed. An algorithm for selecting the optimal initial clustering points and the number of clusters is presented. The data are partitioned into grid cells, and the centroid of each cell is computed. These centroids are used as the sample points for K-means clustering, which reduces the number of Euclidean distance calculations in K-means. SPGK runs in parallel on the Spark platform, which further improves the running efficiency of the algorithm. SPGK not only achieves good clustering quality but also greatly reduces the number of Euclidean distance calculations, making it suitable for fast clustering of massive data. Experiments on datasets at the ten-million-record scale show that SPGK clearly outperforms the existing K-means++ and recursive-partition-based K-means clustering algorithms.

Keywords:	fast clustering; Spark; best initial clustering point; grid generation
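
The core idea in the abstract (partition the data into grid cells, then run K-means on the cell centroids instead of on every point) can be illustrated with a minimal single-machine sketch. This is not the authors' SPGK implementation: the function names `grid_centroids` and `kmeans` and the `cell_size` parameter are hypothetical, the number of clusters is fixed by hand rather than selected from the grid centroids, the initial centers are picked at random, every centroid counts equally regardless of how many points fall in its cell, and no Spark parallelism is involved.

```python
import math
import random
from collections import defaultdict


def grid_centroids(points, cell_size):
    """Bucket points into axis-aligned grid cells and return one
    centroid per non-empty cell (cell_size is an assumed parameter)."""
    sums = {}
    counts = defaultdict(int)
    for p in points:
        # The cell index is the floor of each coordinate divided by cell_size.
        key = tuple(math.floor(x / cell_size) for x in p)
        if key not in sums:
            sums[key] = [0.0] * len(p)
        for i, x in enumerate(p):
            sums[key][i] += x
        counts[key] += 1
    return [tuple(s / counts[k] for s in sums[k]) for k in sums]


def kmeans(samples, k, iters=50, seed=0):
    """Plain Lloyd's K-means with random initial centers.
    Assumption: the paper's own initial-point selection is not reproduced."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in samples:
            # One Euclidean distance per (sample, center) pair: this is the
            # cost that shrinks when samples are centroids, not raw points.
            nearest = min(range(k), key=lambda j: math.dist(s, centers[j]))
            groups[nearest].append(s)
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy data: two well-separated square blobs of 200 points each.
rng = random.Random(1)
points = [(2 * rng.random(), 2 * rng.random()) for _ in range(200)]
points += [(10 + 2 * rng.random(), 10 + 2 * rng.random()) for _ in range(200)]

centroids = grid_centroids(points, cell_size=1.0)
centers = sorted(kmeans(centroids, k=2))
print(len(points), "points reduced to", len(centroids), "grid centroids")
print(centers)
```

In this sketch each K-means iteration computes `k * len(centroids)` distances instead of `k * len(points)`, so the saving grows with the number of points per occupied cell. A coarser `cell_size` gives fewer centroids and less work, at the price of a rougher approximation of the data.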
