
Performance Comparison of Clustering Algorithms in Spark
Cite this article: HAI Mo, ZHANG You. Performance Comparison of Clustering Algorithms in Spark[J]. Computer Science, 2017, 44(Z6): 414-418.
Authors: HAI Mo and ZHANG You
Affiliation: School of Information, Central University of Finance and Economics, Beijing 100081, China; Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu 610054, China and School of Information Systems & Management, Heinz College, Carnegie Mellon University, Pittsburgh 999039, USA
Funding: Supported by the Open Project of the Network and Data Security Key Laboratory of Sichuan Province (NDSMS201604) and the Young Teachers Development Fund of Central University of Finance and Economics (QJJ1634).
Abstract: Experiments compare the performance of three typical clustering algorithms in Spark (K-means, bisecting K-means and Gaussian mixture) in terms of runtime, speedup, scalability and sizeup. The results show that: 1) as the number of nodes increases, the runtime of all three algorithms on datasets of a hundred megabytes or more decreases markedly; 2) when the dataset is larger than 500MB, the speedup of all three algorithms improves markedly and grows approximately linearly with the number of nodes; 3) the scalability of all three algorithms decreases as the number of nodes increases, and when the dataset is larger than 500MB, the bisecting K-means algorithm has the worst scalability of the three; 4) when the dataset is larger than 100MB, the sizeup of the Gaussian mixture algorithm is much higher than that of the K-means and bisecting K-means algorithms.

Keywords: Spark  K-means clustering  Bisecting K-means clustering  Gaussian mixture clustering  Runtime  Speedup  Scalability  Sizeup
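
The abstract names speedup, scalability and sizeup without defining them. The sketch below shows how such ratios are commonly computed from measured runtimes in the parallel-clustering literature. The definitions are an assumption; the paper's exact formulas are not given in this record, and the function and parameter names are hypothetical.

```python
# Illustrative sketch only: these are commonly used definitions of the three
# ratio metrics, NOT necessarily the exact formulas used in the paper.
# All function and parameter names are hypothetical.


def _check_positive(*runtimes: float) -> None:
    """Reject non-positive runtimes, which would make the ratios meaningless."""
    for t in runtimes:
        if t <= 0:
            raise ValueError("runtimes must be positive")


def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset: runtime on 1 node divided by runtime on m nodes.

    A value close to m indicates near-linear speedup.
    """
    _check_positive(t_one_node, t_m_nodes)
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime on 1 node with the base dataset divided by runtime on
    m nodes with an m-times larger dataset.

    A value close to 1 means the cluster absorbs the extra data well.
    """
    _check_positive(t_base, t_scaled)
    return t_base / t_scaled


def sizeup(t_base_data: float, t_larger_data: float) -> float:
    """Fixed number of nodes: runtime on a k-times larger dataset divided
    by runtime on the base dataset.

    A larger value means runtime grows faster with dataset size.
    """
    _check_positive(t_base_data, t_larger_data)
    return t_larger_data / t_base_data
```

Under these assumed definitions, a speedup near the node count corresponds to the "approximately linear" growth reported in result 2), and a scaleup that falls below 1 as nodes are added corresponds to the declining scalability in result 3).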