
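Research on a Principal Component Analysis Algorithm Based on MapReduce
Cite this article: YI Xiu-shuang, LIU Yong, LI Jie, WANG Xing-wei. Research on a Principal Component Analysis Algorithm Based on MapReduce[J]. Computer Science, 2017, 44(2): 65-69.
Authors: YI Xiu-shuang, LIU Yong, LI Jie, WANG Xing-wei
Affiliations: School of Computer Science and Engineering, Northeastern University, Shenyang 110819; School of Computer Science and Engineering, Northeastern University, Shenyang 110819; School of Computer Science and Engineering, Northeastern University, Shenyang 110819; School of Software, Northeastern University, Shenyang 110819
Funding: This work was supported by the National Science Fund for Distinguished Young Scholars.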
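Abstract: With the growing popularity of the MapReduce parallel framework, the parallelization of data mining algorithms has become an active research topic, and the parallelization of principal component analysis (PCA) has attracted increasing attention. A survey of existing work on parallel PCA shows that these algorithms are not fully parallelized, particularly in the eigenvalue computation. The PCA workflow consists of two stages: computing the correlation coefficient matrix, and the singular value decomposition (SVD) of that matrix. Using MapReduce, currently the most popular parallel framework, together with the QR decomposition of matrices, a parallel implementation of SVD is proposed. Randomly generated double-precision floating-point matrices of different dimensions are used to measure the efficiency gain of the parallel SVD over the traditional serial algorithm and to analyze its efficiency. The parallel SVD is then integrated into PCA, and a parallel procedure for computing the correlation coefficient matrix is also proposed, so that both stages of PCA are fully parallelized. Using matrices of different dimensions, the proposed parallel PCA is compared in running speed with an existing partially parallelized PCA and with conventional serial PCA, and the speedup of the fully parallelized algorithm is analyzed. The results show that the proposed algorithm takes considerably less time once the data set reaches a sufficiently large scale.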
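The first stage named in the abstract, computing the correlation coefficient matrix, fits the map/reduce pattern because it depends only on sums that can be accumulated chunk by chunk. The following is a minimal single-process sketch of that idea, not the authors' implementation: the function names `map_partial`, `reduce_partial`, and `correlation_matrix` are hypothetical, the input is assumed to be a list of row chunks, and every column is assumed to have non-zero variance.

```python
from functools import reduce
from math import sqrt

def map_partial(rows):
    """Mapper (hypothetical): summarize one chunk of rows as
    (row count, column sums, sums of pairwise products)."""
    d = len(rows[0])
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            sums[i] += r[i]
            for j in range(d):
                cross[i][j] += r[i] * r[j]
    return len(rows), sums, cross

def reduce_partial(a, b):
    """Reducer (hypothetical): add two partial summaries element-wise."""
    n1, s1, c1 = a
    n2, s2, c2 = b
    sums = [x + y for x, y in zip(s1, s2)]
    cross = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
    return n1 + n2, sums, cross

def correlation_matrix(chunks):
    """Combine all chunk summaries, then turn the totals into correlations.
    Assumes no column is constant (otherwise its standard deviation is 0)."""
    n, sums, cross = reduce(reduce_partial, map(map_partial, chunks))
    d = len(sums)
    mean = [s / n for s in sums]
    cov = [[cross[i][j] / n - mean[i] * mean[j] for j in range(d)]
           for i in range(d)]
    std = [sqrt(cov[i][i]) for i in range(d)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(d)]
            for i in range(d)]
```

Because `reduce_partial` is associative and commutative, the chunk summaries can be combined in any order, which is the property a real MapReduce job would rely on. The one-pass formula used here (mean of products minus product of means) can lose precision on data with large means; a production job would likely center or rescale the data first.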

Keywords: Principal component analysis; Singular value decomposition; MapReduce
Received: 2015/11/13
Revised: 2016/9/3

Research of Distributed Principle Components Analysis Algorithm Based on MapReduce
YI Xiu-shuang, LIU Yong, LI Jie and WANG Xing-wei. Research of Distributed Principle Components Analysis Algorithm Based on MapReduce[J]. Computer Science, 2017, 44(2): 65-69.
Authors: YI Xiu-shuang, LIU Yong, LI Jie and WANG Xing-wei
Affiliation: School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; and School of Software, Northeastern University, Shenyang 110819, China
Abstract: With the growing popularity of the MapReduce parallel framework, the parallelization of data mining algorithms has become an active area of research, and the principal component analysis (PCA) algorithm is receiving increasing attention as well. Summarizing recent work on parallel PCA, we found that existing algorithms are not fully parallelized, especially in the computation of the matrix eigenvalues. The PCA algorithm consists of two stages: solving for the correlation coefficient matrix, and the singular value decomposition (SVD) of that matrix. By combining MapReduce, currently the most popular parallel framework, with the QR decomposition of matrices, this paper proposes a new way to parallelize SVD. Experiments on randomly generated double-precision floating-point matrices of different dimensions compare the parallel SVD with the traditional serial algorithm to show the efficiency improvement. We then integrate the parallel SVD into PCA and propose a parallel procedure for computing the correlation coefficient matrix, so that both stages of PCA are parallelized. Finally, we compare the proposed algorithm with an existing partially parallelized PCA algorithm and with conventional PCA on matrices of different dimensions, and analyze its speedup ratio. The results show that the proposed algorithm consumes less time when processing massive data sets.
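The abstract says the SVD stage is built on the QR decomposition but gives no further detail. One standard way QR yields a spectrum is the unshifted QR iteration: factor A = QR, form RQ, and repeat until the diagonal settles. For a symmetric positive semidefinite matrix such as a correlation matrix, the eigenvalues found this way are also its singular values. The sketch below is a serial illustration under those assumptions, not the paper's MapReduce method; `qr_decompose`, `matmul`, and `qr_eigenvalues` are hypothetical names, and the input is assumed to be symmetric with full rank.

```python
from math import sqrt

def qr_decompose(a):
    """Modified Gram-Schmidt QR of a square, full-rank matrix: a = q * r."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the orthonormal columns found so far.
        for k, q in enumerate(q_cols):
            r[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - r[k][j] * qi for vi, qi in zip(v, q)]
        r[j][j] = sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / r[j][j] for vi in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def qr_eigenvalues(a, iterations=200):
    """Unshifted QR iteration: repeatedly replace a with r * q.
    Returns the diagonal of the final matrix, largest value first."""
    for _ in range(iterations):
        q, r = qr_decompose(a)
        a = matmul(r, q)
    return sorted((a[i][i] for i in range(len(a))), reverse=True)
```

The fixed iteration count is a simplification; a practical version would stop when the off-diagonal entries fall below a tolerance and would typically add shifts to speed up convergence. In a distributed setting the expensive steps are the factorization and the matrix product, which is where a MapReduce formulation would apply.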
Keywords: Principal component analysis; SVD; MapReduce