首页 | 本学科首页   官方微博 | 高级检索  
     

一种海量数据快速聚类算法
引用本文:何倩,李双富,黄焕,徐红. 一种海量数据快速聚类算法[J]. 北京邮电大学学报, 2020, 43(3): 118-124. DOI: 10.13190/j.jbupt.2019-078
作者姓名:何倩  李双富  黄焕  徐红
作者单位:1. 桂林电子科技大学 卫星导航定位与位置服务国家地方联合工程研究中心, 桂林 541004;
2. 广西交科集团有限公司, 南宁 530007
基金项目:国家自然科学基金项目(61661015,61967005);广西创新驱动重大专项项目(AA17202024);广西科技创新团队项目(2019GXNSFGA245004)
摘    要:为满足海量数据处理要求,提出了一种基于网格的K-means快速聚类算法(SPGK).设计基于网格质心的聚类簇个数选取算法,对数据进行网格划分得到每个网格的质心,将质心作为K-means聚类的样本点,从而减少K-means的欧氏距离计算次数.该算法基于Spark平台实现并行计算,进一步地提高了算法的运行效率.SPGK不但能够获得良好的聚类效果,而且缩减了欧氏距离计算次数,适用于海量数据的快速聚类.在千万级数据集上的实验结果表明,SPGK的性能明显优于现有的K-means++和基于K均值聚类的递归划分方法.

关 键 词:快速聚类  Spark  最佳聚类初始点  网格划分  
收稿时间:2019-05-11

A Fast Clustering Algorithm for Massive Data
HE Qian,LI Shuang-fu,HUANG Huan,XU Hong. A Fast Clustering Algorithm for Massive Data[J]. Journal of Beijing University of Posts and Telecommunications, 2020, 43(3): 118-124. DOI: 10.13190/j.jbupt.2019-078
Authors:HE Qian  LI Shuang-fu  HUANG Huan  XU Hong
Affiliation:1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China;
2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China
Abstract:To meet the requirements of massive data processing, a grid-based K-means fast clustering algorithm (SPGK) is proposed. Selection for optimal clustering initial point and the number of clusters algorithm is presented. The grids of different clusters are meshed to obtain the centroid of each grid. These centroid points are used as sample points for K-means clustering, thereby reducing the number of Euclidean distance calculations of K-means. SPGK realizes parallel computation based on Spark platform, which further improves the running efficiency of the algorithm. SPGK not only obtains good clustering effect but also greatly reduces the number of Euclidean distance calculations, which is suitable for fast clustering of mass data. With 10 millions of data, the experiments show that SPGK is superior to the existing K-means++ and recursive partition based K-means clustering algorithms obviously.
Keywords:fast clustering  Spark  best initial clustering point  grid generation  
点击此处可从《北京邮电大学学报》浏览原始摘要信息
点击此处可从《北京邮电大学学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号