
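Parallel clustering algorithm based on grid density and locality-sensitive hash functions
Cite this article: Mao Yimin, Tao Tao, Cao Wenliang. Parallel clustering algorithm based on grid density and locality-sensitive hash functions[J]. Application Research of Computers (计算机应用研究), 2021, 38(5): 1422-1427.
Authors: Mao Yimin, Tao Tao, Cao Wenliang
Affiliations: School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000; Department of Computer Engineering, Dongguan Polytechnic, Dongguan, Guangdong 518172
Funding: National Key Research and Development Program of China (2018YFC1504705); National Natural Science Foundation of China (41562019); Guangdong Provincial Characteristic Innovation (Natural Science) Project for Regular Higher Education Institutions (2019GKTSCX142, 2017GKTSCX101).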
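Abstract: Partition-based clustering algorithms in big data settings suffer from sensitivity to the initial centers, high communication overhead between nodes, and low cluster efficiency. To address these problems, this paper proposes PBGDLSH-MR, a parallel clustering algorithm based on grid density and locality-sensitive hash functions. First, a grid density strategy (GDS) is applied to the initial dataset to obtain the initial center points, which avoids the initial-center sensitivity caused by random selection. Second, data partitioning based on locality-sensitive hash functions (DP-LSH) projects strongly related data objects into the same sub-dataset to obtain the data partitions on the map side, and a similarity measure formula (SI) is designed to evaluate the partitioning result, which reduces the communication overhead between nodes. Next, an adaptive grouping strategy (AGS) is designed to handle data skew in the data partitions, which improves cluster efficiency. Finally, the cluster centers are mined in parallel with the MapReduce computing model to produce the final clustering result. Experimental results show that PBGDLSH-MR achieves better clustering quality and effectively improves the efficiency of parallel computation in big data environments.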

Keywords: big data; parallel clustering; grid density; hash function; MapReduce
Received: 2020/4/24
Revised: 2021/4/12
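
The abstract names a grid density strategy (GDS) for choosing initial centers but does not give its details. The following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: the space is cut into axis-aligned cells of a fixed side length, density is the point count per cell, and the centroids of the k densest cells serve as initial centers. The function name grid_density_centers and the parameters cell_size and k are hypothetical.

```python
from collections import defaultdict
from math import floor


def grid_density_centers(points, cell_size, k):
    """Return up to k initial centers: centroids of the k densest grid cells.

    Assumption (not from the paper): density = number of points in a cell.
    """
    cells = defaultdict(list)
    for p in points:
        # Map each point to the integer index of the cell that contains it.
        key = tuple(floor(x / cell_size) for x in p)
        cells[key].append(p)

    # Densest cells first; ties are broken by cell index so the result is deterministic.
    densest = sorted(cells.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:k]

    centers = []
    for _, members in densest:
        dim = len(members[0])
        centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return centers


# Toy data: a dense group near the origin, a smaller group near (5, 5), one outlier.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.4), (9.0, 0.5)]
centers = grid_density_centers(points, cell_size=1.0, k=2)
```

Because the centers come from dense regions instead of random draws, repeated runs on the same data start from the same positions, which is the kind of initial-center sensitivity the abstract says GDS is meant to avoid.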

Partitioning-based clustering algorithm using grid density and locality sensitive hash function based on MapReduce
Mao Yimin, Tao Tao and Cao Wenliang. Partitioning-based clustering algorithm using grid density and locality sensitive hash function based on MapReduce[J]. Application Research of Computers, 2021, 38(5): 1422-1427.
Authors: Mao Yimin, Tao Tao and Cao Wenliang
Affiliation: School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China; Dept. of Computer Engineering, Dongguan Polytechnic, Dongguan, Guangdong 518172, China
Abstract: To address the sensitivity to initial centers, the high communication overhead between nodes, and the low cluster efficiency of partitioning-based clustering algorithms for big data, this paper proposes a partitioning-based clustering algorithm using grid density and locality sensitive hash functions based on MapReduce, named PBGDLSH-MR. Firstly, it applies a grid density strategy (GDS) to the initial dataset to obtain the initial cluster centers, which avoids the sensitivity caused by selecting the initial centers at random. Secondly, it proposes data partitioning based on locality sensitive hash functions (DP-LSH) to map closely related data objects into the same sub-dataset and obtain the data partitions on the map side. It also designs a formula, SI (similarity improvement), to evaluate the data partitioning results, which reduces the communication overhead between nodes. In addition, it designs an adaptive grouping strategy (AGS) to handle data skew in the data partitions, which improves cluster efficiency. Finally, based on MapReduce, it mines the cluster centers in parallel to generate the final clustering results. The experimental results show that PBGDLSH-MR produces better clustering results and parallelizes more efficiently on big data.
Keywords: big data; parallel clustering; grid density; hash functions; MapReduce
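
The abstract describes DP-LSH as using locality sensitive hash functions to place closely related objects in the same sub-dataset, without stating which hash family is used. As an illustration only, the sketch below assumes the standard random-projection family for Euclidean distance, h(v) = floor((a·v + b) / w), and uses the concatenation of several such hashes as the partition key. The names make_lsh and partition and the parameters num_hashes, width and seed are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict
from math import floor


def make_lsh(dim, num_hashes, width, seed=0):
    """Build a signature function from num_hashes random-projection hashes."""
    rng = random.Random(seed)
    # Each hash has a Gaussian direction a and an offset b drawn from [0, width).
    params = [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
        for _ in range(num_hashes)
    ]

    def signature(v):
        # h(v) = floor((a . v + b) / width), concatenated over all hashes.
        return tuple(
            floor((sum(a_i * x_i for a_i, x_i in zip(a, v)) + b) / width)
            for a, b in params
        )

    return signature


def partition(points, signature):
    """Group points by LSH signature; each group is one candidate sub-dataset."""
    buckets = defaultdict(list)
    for p in points:
        buckets[signature(p)].append(p)
    return dict(buckets)


points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.4), (9.0, 0.5)]
sig = make_lsh(dim=2, num_hashes=2, width=4.0, seed=42)
parts = partition(points, sig)
```

With this family, points that lie close together are more likely to share a signature than distant ones, so each bucket can be handed to one map task. A larger width or fewer hashes gives coarser partitions; the skew between bucket sizes that results is the problem the abstract assigns to AGS.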
This article is indexed in VIP, Wanfang Data, and other databases.