Spark框架优化的大规模谱聚类并行算法 Spark framework based optimized large-scale spectral clustering parallel algorithm期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Spark框架优化的大规模谱聚类并行算法

引用本文：	崔艺馨,陈晓东.Spark框架优化的大规模谱聚类并行算法[J].计算机应用,2020,40(1):168-172.

作者姓名：	崔艺馨陈晓东

作者单位：	太原工业学院网络与信息中心, 太原 030008

摘要：	为解决谱聚类在大规模数据集上存在的计算耗时和无法聚类等性能瓶颈制约，提出了基于Spark技术的大规模数据集谱聚类的并行化算法。首先，通过单向循环迭代优化相似矩阵的构建，避免重复计算；然后，通过位置变换和标量乘法替换来优化Laplacian矩阵的构建与正规化，降低存储需求；最后，采用近似特征向量计算来进一步减少计算量。不同测试数据集上的实验结果表明：随着测试数据集的规模增加，所提算法的单向循环迭代和近似特征值计算的运行时间呈线性增长，增长缓慢，其近似特征向量计算与精确特征向量计算取得相近的聚类效果，并且算法在大规模数据集上表现出良好的可扩展性。在获得较好的谱聚类性能的基础上，改进算法提高了运行效率，有效缓解了谱聚类的计算耗时及无法聚类问题。
关键词：	大规模谱聚类相似矩阵稀疏化单向循环迭代近似特征向量分布式Spark并行计算
收稿时间：	2019-06-21
修稿时间：	2019-09-22
Spark framework based optimized large-scale spectral clustering parallel algorithm

CUI Yixin,CHEN Xiaodong.Spark framework based optimized large-scale spectral clustering parallel algorithm[J].journal of Computer Applications,2020,40(1):168-172.

Authors:	CUI Yixin CHEN Xiaodong

Affiliation:	Network and Information Center, Taiyuan Institute of Technology, Taiyuan Shanxi 030008, China

Abstract:	To solve the performance bottlenecks such as time-consuming computation and inability of clustering in spectral clustering on large-scale datasets, a spectral clustering parallelization algorithm suitable for large-scale datasets was proposed based on Spark technology. Firstly, the similar matrices were constructed through one-way loop iteration to avoid double counting. Then, the construction and normalization of Laplacian matrices were optimized by position transformation and scalar multiplication replacement in order to reduce the storage requirements. Finally, the approximate eigenvector calculation was used to further reduce the computational cost. The experimental results on different test datasets show that, as the size of test dataset increases, the proposed algorithm has the running time of one-way loop iteration and the approximate eigenvector calculation increased linearly with slow speed, the clustering effects of approximate eigenvector calculation are similar to those of exact eigenvector calculation, and the algorithm shows good extensibility on large-scale datasets. On the basis of obtaining better spectral clustering performance, the improved algorithm increases operation efficiency, and effectively alleviates high computational cost and the problem of clustering.

Keywords:	large-scale spectral clustering similar matrix sparsification one-way loop iteration approximate eigenvector distributed Spark parallel computing
本文献已被维普万方数据等数据库收录！
	点击此处可从《计算机应用》浏览原始摘要信息
	点击此处可从《计算机应用》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏