Performance Comparison of Clustering Algorithms in Spark

Cite as:	HAI Mo, ZHANG You. Performance Comparison of Clustering Algorithms in Spark[J]. Computer Science, 2017, 44(Z6): 414-418.

Authors:	HAI Mo and ZHANG You

Affiliation:	School of Information, Central University of Finance and Economics, Beijing 100081, China; Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu 610054, China; and School of Information Systems & Management, Heinz College, Carnegie Mellon University, Pittsburgh 999039, USA

Funding:	This work was supported by the Open Project of the Network and Data Security Key Laboratory of Sichuan Province (NDSMS201604) and the Young Teachers Development Fund of the Central University of Finance and Economics (QJJ1634).

Abstract:	The performance of three typical clustering algorithms in Spark (K-means, Bisecting K-means, and Gaussian Mixture) was compared experimentally in terms of runtime, speedup, scalability, and sizeup. The results show the following. 1) As the number of nodes increases, the runtime of all three algorithms decreases markedly on datasets of hundreds of megabytes or more. 2) When the dataset is larger than 500MB, the speedup of all three algorithms improves markedly and grows approximately linearly with the number of nodes. 3) The scalability of all three algorithms decreases as the number of nodes increases; when the dataset is larger than 500MB, Bisecting K-means has the worst scalability of the three. 4) When the dataset is larger than 100MB, the sizeup of the Gaussian Mixture algorithm is much higher than that of the K-means and Bisecting K-means algorithms.

Keywords:	Spark; K-means clustering; Bisecting K-means clustering; Gaussian Mixture clustering; Runtime; Speedup; Scalability; Sizeup
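
The abstract names speedup, scalability, and sizeup without defining them. The sketch below uses the definitions commonly found in evaluations of parallel data mining algorithms; treating them as the paper's definitions is an assumption, since the full text is not reproduced here. The function names and the sample runtimes are hypothetical and are not taken from the paper.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Fixed dataset: runtime on 1 node divided by runtime on m nodes.

    A value close to m indicates near-linear speedup.
    """
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime for (1 node, data D) divided by runtime for (m nodes, data m*D).

    A value close to 1 means the system absorbs proportional growth well.
    """
    return t_base / t_scaled


def sizeup(t_base_data: float, t_n_times_data: float) -> float:
    """Fixed node count: runtime on data n*D divided by runtime on data D."""
    return t_n_times_data / t_base_data


# Hypothetical runtimes in seconds, keyed by node count (illustrative only).
runtimes_by_nodes = {1: 120.0, 2: 64.0, 4: 35.0}

# Speedup relative to the single-node run for each node count.
speedups = {
    m: speedup(runtimes_by_nodes[1], t) for m, t in runtimes_by_nodes.items()
}
```

Under these assumed definitions, a scalability value that falls as nodes are added matches the trend the abstract reports in result 3.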
