Parallel fuzzy C-means clustering algorithm in Spark
Cite this article: WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli. Parallel fuzzy C-means clustering algorithm in Spark[J]. Journal of Computer Applications, 2016, 36(2): 342-347. DOI: 10.11772/j.issn.1001-9081.2016.02.0342
Authors: WANG Guilan  ZHOU Guoliang  SA Churila  ZHU Yongli
Affiliation: Network and Information Management Center, North China Electric Power University, Baoding Hebei 071003, China
Funding: Fundamental Research Funds for the Central Universities (13MS103); Natural Science Foundation of Hebei Province (F2014502069).
Received: 2015-08-29
Revised: 2015-09-13
Abstract: Clustering algorithms must handle ever-larger data sets under ever-tighter timeliness requirements, which places higher demands on their adaptability to big data and on their performance. To address this problem, a fuzzy C-means (FCM) algorithm named Spark-FCM is proposed for the Spark distributed in-memory computing platform. First, each matrix is partitioned horizontally into a set of vectors and stored in a distributed manner, so that different vectors reside on different nodes. Then, based on the computational characteristics of the FCM algorithm, common matrix operations, including multiplication, transpose and addition, are designed to be distributed and cache-sensitive. Finally, the Spark-FCM algorithm is designed on top of these matrix operations and the features of the Spark platform. Its main data structures are stored as distributed matrices, so little data moves between nodes and every step runs as a distributed computation. Tests in stand-alone and cluster environments show that the algorithm has good scalability and adapts to large-scale data sets: its performance is linearly related to the data volume, and performance in the cluster environment is 2 to 3 times higher than in the stand-alone environment.
Keywords: Spark; fuzzy C-means (FCM); matrix computing; in-memory computing
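
The row-partitioned computation described in the abstract can be illustrated with a small sketch. This is not the authors' Spark-FCM implementation: the abstract does not give the update formulas, so the textbook FCM membership and center updates are assumed, plain Python lists stand in for Spark partitions, and the function names (`memberships`, `partial_sums`, `merge`, `fcm`) are hypothetical. Each partition holds some of the row vectors and produces per-cluster partial sums; only those small sums are combined across partitions, which is one way to keep data movement between nodes low.

```python
import math
from functools import reduce


def memberships(x, centers, m=2.0):
    """Fuzzy membership of one row vector x in each cluster (textbook FCM rule)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    # A point that coincides with one or more centers is shared among them.
    zeros = [j for j, d in enumerate(d2) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(centers))]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(partition, centers, m=2.0):
    """Map step: weighted sums over the row vectors held by one partition."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]  # sum of u^m * x per cluster
    den = [0.0] * k                        # sum of u^m per cluster
    for x in partition:
        u = memberships(x, centers, m)
        for j in range(k):
            w = u[j] ** m
            den[j] += w
            for t in range(dim):
                num[j][t] += w * x[t]
    return num, den


def merge(a, b):
    """Reduce step: element-wise addition of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Iterate center updates until the largest center shift drops below tol."""
    for _ in range(max_iter):
        num, den = reduce(merge, (partial_sums(p, centers, m) for p in partitions))
        new_centers = [[v / d for v in row] for row, d in zip(num, den)]
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Calling `fcm(partitions, initial_centers)` with `partitions` as a list of lists of row vectors returns the final centers. Up to floating-point rounding the result does not depend on how the rows are split, because `merge` is plain addition. In a Spark setting the same map and reduce steps would presumably run over an RDD of row vectors, but that mapping is an assumption here rather than something the abstract spells out.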