Spark-based parallel density peak clustering algorithm

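Cite this article:	Sun Weipeng (孙伟鹏). Spark-based parallel density peak clustering algorithm [J]. Application Research of Computers (计算机应用研究), 2020, 37(1): 163-166, 171.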

Author:	Sun Weipeng (孙伟鹏)

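Author affiliation:	School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122; Software Engineering Center, No. 702 Research Institute, China Shipbuilding Industry Corporation, Wuxi, Jiangsu 214082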

Funding:	National Natural Science Foundation of China; Youth Innovation Fund

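Abstract:	The FSDP clustering algorithm must traverse the entire data set to compute the local density and minimum distance of each data object, which makes its overall time complexity high. To address this problem, a Spark-based parallel FSDP clustering algorithm, SFSDP, is proposed. First, the data set to be clustered is divided by spatial grid partitioning into multiple data partitions of relatively balanced size. Then, an improved FSDP clustering algorithm is run in parallel on the data in each partition. Finally, the local clusters of the partitions are merged to produce the global clusters. Experimental results show that, compared with FSDP, SFSDP can effectively cluster large-scale data sets and performs well in both accuracy and scalability.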
Keywords:	clustering; density peak; spatial partitioning; parallel; Spark
Received:	2018/4/24
Revised:	2019/11/29
Spark-based parallel density peak clustering algorithm

Affiliation:	School of IoT Engineering, Jiangnan University

Abstract:	To address the high overall time complexity of the FSDP clustering algorithm, which must traverse the entire data set when calculating the local density and minimum distance of data objects, this paper presents a Spark-based parallel FSDP clustering algorithm called SFSDP. First, the algorithm divides the data set into multiple data partitions of relatively balanced size by spatial grid partitioning. Then, it uses an improved FSDP clustering algorithm to cluster the data in each partition in parallel. Finally, it generates the global clusters by merging the local clusters across partitions. Experimental results show that, compared with the FSDP algorithm, SFSDP can effectively cluster large-scale data sets and performs well in terms of accuracy and scalability.

Keywords:	clustering; density peak; space division; parallel; Spark
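
As a rough illustration of the quantities named in the abstract, the Python sketch below computes a local density and a minimum distance for each point, using the commonly described cutoff-kernel form of density peak clustering, and assigns 2-D points to square grid cells. It is not the paper's SFSDP implementation: the function names `grid_partition` and `density_and_delta`, the fixed `cell_size`, and the cutoff `d_c` are assumptions made for illustration, and neither the paper's improved FSDP variant nor its partition balancing and cluster merging steps are reproduced here.

```python
import math
from collections import defaultdict


def grid_partition(points, cell_size):
    """Group 2-D points by the square grid cell that contains them.

    Hypothetical helper: a fixed cell_size is assumed here, whereas the
    paper aims for partitions of relatively balanced size.
    """
    cells = defaultdict(list)
    for p in points:
        key = (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))
        cells[key].append(p)
    return dict(cells)


def density_and_delta(points, d_c):
    """Return (rho, delta) for every point in one partition.

    rho[i]   : number of other points closer to point i than the cutoff d_c
    delta[i] : distance from point i to the nearest point of higher density;
               a point with no denser point gets its largest distance instead
    """
    n = len(points)
    # Full pairwise distance matrix: this is the quadratic cost that
    # motivates splitting the data into partitions.
    dist = [[math.dist(p, q) for q in points] for p in points]

    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Points with both a large `rho` and a large `delta` are the usual candidates for cluster centers. In a Spark setting, each cell returned by `grid_partition` could be handed to a separate task that calls `density_and_delta`; how local clusters are then merged across cell boundaries is not specified in the abstract and is left out of this sketch.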
This article is indexed in Wanfang Data and other databases.