首页 | 本学科首页   官方微博 | 高级检索  
     

基于MapReduce框架的分布式软K段主曲线算法
引用本文:胡作梁张红云.基于MapReduce框架的分布式软K段主曲线算法[J].数据采集与处理,2017,32(3):507-515.
作者姓名:胡作梁张红云
作者单位:1.同济大学计算机科学与技术系,上海,201804;2.同济大学嵌入式系统与服务计算教育部重点实验室,上海,201804
摘    要:传统的主曲线算法在小规模数据集上能获得良好的效果,但单节点的计算和存储能力都不能满足海量数据主曲线的提取要求,而算法分布式并行化是目前解决该类问题最有效的途径之一。本文提出基于MapReduce框架的分布式软K段主曲线算法 (Distributed soft k-segments principal curve,DisSKPC)。首先,基于分布式K-Means算法,采用递归粒化方法对数据集进行粒化,以确定粒的大小并保证粒中数据的关联性。然后调用软K段主曲线算法计算每个粒数据的局部主成分线段,并提出用噪声方差来消除在高密集、高曲率的数据区域可能产生的过拟合线段。最后借助哈密顿路径和贪婪算法连接这些局部主成分线段,形成一条通过数据云中间的最佳曲线。实验结果表明,本文所提出的DisSKPC算法具有良好的可行性和扩展性。

关 键 词:分布式并行化  主曲线  数据粒化  MapReduce

Distributed Soft K-Segments Algorithm for Principal Curves Based on MapReduce
Affiliation:1.Department of Computer Science and Technology, Tongji University, Shanghai, 201804, China; 2.Key Laboratory of Embedded Systems and Service Computing of Ministry of Education,Tongji University, Shanghai, 201804, China
Abstract:The traditional principal curves algorithm can obtain good results on small datasets. But the computing and storage resources of a single node cannot meet the requirements of the extraction of principal curves on massive datasets. Distributed parallel computing is one of the most effective way to solve the problems. Therefore, we proposed a distributed soft K-segments algorithm for principal curves based on MapReduce, named DisSKPC. First, we recursively granulated all the numerical data into information granules to limit each granular size and ensure the relevance of the data in the granules using the distributed K-Means algorithm. Then we calculated the local principal component segments of each granule and eliminated over-fitting segments that may arise in the area of high-density and high-curvature using the noise variance. Finally, we connected these local principal component segments using the Hamiltonian path and greedy algorithm, forming a best curve through the middle of the data cloud. Experimental results demonstrate the feasibility and scalability of the proposed DisSKPC algorithm.
Keywords:distributed parallel  principal curves  data granulation  MapReduce
点击此处可从《数据采集与处理》浏览原始摘要信息
点击此处可从《数据采集与处理》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号