首页 | 本学科首页   官方微博 | 高级检索  
     

基于MapReduce的高维数据频繁项集挖掘
引用本文:赵欣灿,朱云,毛伊敏.基于MapReduce的高维数据频繁项集挖掘[J].计算机工程,2022,48(3):81-89.
作者姓名:赵欣灿  朱云  毛伊敏
作者单位:1. 江西理工大学 理学院, 江西赣州 341000;2. 江西理工大学 信息工程学院, 江西赣州 341000
基金项目:国家重点研发计划(2018YFC1504705);;国家自然科学基金(41562019);;江西省教育厅科技项目(GJJ151528,GJJ151531);
摘    要:传统的数据挖掘算法在面向大规模高维数据的挖掘过程中,存在数据特征捕捉准确率低、节点负载不均衡、数据交互频繁、频繁项集紧凑化程度低等问题。提出基于MapReduce的并行挖掘算法PARDG-MR,结合高维数据特征,设计基于维度粒化算法和负载均衡算法的DGPL策略,并对数据进行预处理,以解决高维复杂数据特征属性捕捉困难及数据划分中节点负载不均衡的问题。通过构建基于PJPFP-Tree树的频繁项集并行挖掘策略PARM,实现频繁项集的并行化分组过程,从而提高数据处理的运行效率。在此基础上,提出基于剪枝前缀推论的整合节点剪枝算法PJPFP,提高频繁项集挖掘过程中的剪枝效率,增强频繁项集的紧凑化程度。在Webdocs、NDC、Gisette 3个数据集上的实验结果表明,相比PFP-growth、PWARM、MRPrePost算法,该算法的运行时间平均缩短了约20%,能够有效提高数据挖掘效率且降低内存空间。

关 键 词:高维数据  频繁项集  维度粒化  并行化  候选剪枝策略  
收稿时间:2020-12-10
修稿时间:2021-03-09

Frequent Itemset Mining of High-Dimensional Data Based on MapReduce
ZHAO Xincan,ZHU Yun,MAO Yimin.Frequent Itemset Mining of High-Dimensional Data Based on MapReduce[J].Computer Engineering,2022,48(3):81-89.
Authors:ZHAO Xincan  ZHU Yun  MAO Yimin
Affiliation:1. School of Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China;2. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
Abstract:In the mining process of large-scale high-dimensional data,the traditional data mining algorithm has some problem,such as low accuracy of data feature capture,unbalanced node load,frequent data interaction,and low compactness of frequent itemset.Therefore,this paper proposes a parallel mining algorithm,PARDG-MR which is based on MapReduce. By combining the characteristics of high-dimensional data,a DGPL strategy based on the dimensional granulation algorithm and load balancing algorithm are designed.Data are preprocessed to solve the problems of difficult feature attribute capture of high-dimensional complex data and unbalanced node load in the data division. The parallel grouping process of frequent itemset is realized by constructing a parallel mining strategy PARM of frequent itemset based on the PJPFP-Tree to improve the operation efficiency of data processing.On this basis,it proposes an integrated node pruning algorithm PJPFP based on the pruning prefix inference,which improves the pruning efficiency in the process of frequent itemset mining to enhance the compactness of frequent itemset and improve the overall mining efficiency of the algorithm.The experimental results on Webdocs,NDC,and Gisette data sets show that compared with PFP-growth,PWARM and MRPrePost algorithms,the running time of PARDG-MR is shorter by approximately 20% on average,therefore,it is more effective and efficient in data mining.
Keywords:high-dimensional data  frequent itemset  dimensional granulation  parallel  candidate pruning strategy
本文献已被 维普 万方数据 等数据库收录!
点击此处可从《计算机工程》浏览原始摘要信息
点击此处可从《计算机工程》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号