
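Research on Parallelization Based on an Improved Canopy-Kmeans Algorithm
Cite this article: Wang Lin, Jia Junchen. Research on Parallelization Based on an Improved Canopy-Kmeans Algorithm[J]. Computer Measurement & Control, 2021, 29(2): 176-179.
Authors: Wang Lin, Jia Junchen
Affiliation: School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048, China (both authors)
Funding: Shaanxi Province Science and Technology Plan Key Project (2017ZDCXL-GY-05-03)
Abstract: With the rapid growth of Internet data, the original K-means algorithm can no longer meet the clustering needs of large-scale data. To address this, an improved Canopy-K-means clustering algorithm is proposed. First, to remedy the random selection of center points in the Canopy algorithm, the "maximum and minimum principle" is introduced to optimize the choice of Canopy center points. Next, the K-means algorithm is optimized using the triangle inequality, which reduces redundant distance calculations and speeds up convergence. Finally, the improved Canopy-K-means algorithm is parallelized on the MapReduce framework. The optimized algorithm is tested on a purpose-built Weibo dataset. The experimental results show that, across Weibo datasets of different sizes, the accuracy of the optimized algorithm is about 15% higher than that of K-means and about 7% higher than that of the original Canopy-K-means algorithm, and that its execution efficiency and scalability are also considerably improved.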

Keywords: Canopy-K-means algorithm; text clustering; maximum and minimum principle; triangle inequality; MapReduce
Received: 2020/6/22
Revised: 2020/7/7
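
The abstract names the "maximum and minimum principle" for choosing Canopy center points but does not spell out the procedure. The sketch below shows one common reading of that idea (farthest-first selection: each new center is the point whose minimum distance to the already chosen centers is largest). The function name `max_min_centers`, the choice of the first point as the starting center, and the sample points are all assumptions made for illustration, not details taken from the paper.

```python
import math

def max_min_centers(points, k):
    """Pick k seed centers; each new seed maximizes its minimum distance
    to the seeds already chosen (assumed reading of the max-min principle)."""
    centers = [points[0]]  # assumption: start from the first point
    # min_dist[i] = distance from points[i] to its nearest chosen center
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        centers.append(points[idx])
        # Refresh nearest-center distances with the newly added center.
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centers

# Hypothetical 2-D sample: two tight groups and one outlying point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 9.0)]
centers = max_min_centers(points, 3)
print(centers)
```

Compared with random selection, this spreads the initial centers apart, so two seeds are unlikely to land in the same dense region.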

Research on Parallelization Based on Improved Canopy-Kmeans Algorithm
Wang Lin, Jia Junchen. Research on Parallelization Based on Improved Canopy-Kmeans Algorithm[J]. Computer Measurement & Control, 2021, 29(2): 176-179.
Authors:Wang Lin  Jia Junchen
Affiliation:(School of Automation and Information Engineering,Xi'an University of Technology,Xi'an 710048,China)
Abstract: With the rapid growth of Internet data, the original K-means algorithm can no longer meet the clustering needs of large-scale data. To this end, an improved Canopy-K-means clustering algorithm is proposed. First, to address the random selection of center points in the Canopy algorithm, the "maximum and minimum principle" is introduced to optimize the selection of Canopy center points. Next, the K-means algorithm is optimized with the help of the triangle inequality, which reduces redundant distance calculations and accelerates convergence. Finally, the improved Canopy-K-means algorithm is parallelized on the MapReduce framework. The optimized Canopy-K-means algorithm is tested on a purpose-built Weibo dataset. The test results show that, across Weibo datasets of different sizes, the accuracy of the optimized algorithm is about 15% higher than that of the K-means algorithm and about 7% higher than that of the original Canopy-K-means algorithm. The execution efficiency and scalability of the algorithm are also considerably improved.
Keywords:Canopy-Kmeans algorithm  text clustering  maximum and minimum principle  triangle inequality  MapReduce
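
The abstract states that the triangle inequality is used to cut redundant distance calculations in K-means, without giving the exact rule. One standard way to do this, assumed here, is the bound d(c_a, c_b) >= 2 d(x, c_a) implies d(x, c_b) >= d(x, c_a): when two centers are far apart relative to the point's current best distance, the second center cannot be closer and its distance need not be computed. The function name `assign_with_pruning` and the sample data are hypothetical, and this covers only the assignment step of one iteration.

```python
import math

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping centers that the
    triangle inequality proves cannot be closer than the current best."""
    k = len(centers)
    # Center-to-center distances, computed once per iteration.
    cc = [[math.dist(a, b) for b in centers] for a in centers]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, math.dist(p, centers[0])
        for j in range(1, k):
            # d(c_best, c_j) >= 2 * d(p, c_best)  =>  d(p, c_j) >= d(p, c_best)
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = math.dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped

# Hypothetical 2-D sample data and centers.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 9.0)]
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 9.0)]
labels, skipped = assign_with_pruning(points, centers)
print(labels, skipped)
```

The pruning never changes the assignment; it only avoids point-to-center distance evaluations, which matters most when the vectors are high-dimensional, as with text features.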
This article is indexed in databases including VIP and Wanfang Data.