首页 | 本学科首页   官方微博 | 高级检索  
     

Spark框架优化的大规模谱聚类并行算法
引用本文:崔艺馨,陈晓东.Spark框架优化的大规模谱聚类并行算法[J].计算机应用,2020,40(1):168-172.
作者姓名:崔艺馨  陈晓东
作者单位:太原工业学院 网络与信息中心, 太原 030008
摘    要:为解决谱聚类在大规模数据集上存在的计算耗时和无法聚类等性能瓶颈制约,提出了基于Spark技术的大规模数据集谱聚类的并行化算法。首先,通过单向循环迭代优化相似矩阵的构建,避免重复计算;然后,通过位置变换和标量乘法替换来优化Laplacian矩阵的构建与正规化,降低存储需求;最后,采用近似特征向量计算来进一步减少计算量。不同测试数据集上的实验结果表明:随着测试数据集的规模增加,所提算法的单向循环迭代和近似特征值计算的运行时间呈线性增长,增长缓慢,其近似特征向量计算与精确特征向量计算取得相近的聚类效果,并且算法在大规模数据集上表现出良好的可扩展性。在获得较好的谱聚类性能的基础上,改进算法提高了运行效率,有效缓解了谱聚类的计算耗时及无法聚类问题。

关 键 词:大规模谱聚类  相似矩阵稀疏化  单向循环迭代  近似特征向量  分布式Spark并行计算  
收稿时间:2019-06-21
修稿时间:2019-09-22

Spark framework based optimized large-scale spectral clustering parallel algorithm
CUI Yixin,CHEN Xiaodong.Spark framework based optimized large-scale spectral clustering parallel algorithm[J].journal of Computer Applications,2020,40(1):168-172.
Authors:CUI Yixin  CHEN Xiaodong
Affiliation:Network and Information Center, Taiyuan Institute of Technology, Taiyuan Shanxi 030008, China
Abstract:To solve the performance bottlenecks such as time-consuming computation and inability of clustering in spectral clustering on large-scale datasets, a spectral clustering parallelization algorithm suitable for large-scale datasets was proposed based on Spark technology. Firstly, the similar matrices were constructed through one-way loop iteration to avoid double counting. Then, the construction and normalization of Laplacian matrices were optimized by position transformation and scalar multiplication replacement in order to reduce the storage requirements. Finally, the approximate eigenvector calculation was used to further reduce the computational cost. The experimental results on different test datasets show that, as the size of test dataset increases, the proposed algorithm has the running time of one-way loop iteration and the approximate eigenvector calculation increased linearly with slow speed, the clustering effects of approximate eigenvector calculation are similar to those of exact eigenvector calculation, and the algorithm shows good extensibility on large-scale datasets. On the basis of obtaining better spectral clustering performance, the improved algorithm increases operation efficiency, and effectively alleviates high computational cost and the problem of clustering.
Keywords:large-scale spectral clustering  similar matrix sparsification  one-way loop iteration  approximate eigenvector  distributed Spark parallel computing
本文献已被 维普 万方数据 等数据库收录!
点击此处可从《计算机应用》浏览原始摘要信息
点击此处可从《计算机应用》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号