
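FP-Growth Frequent Itemset Mining Algorithm for Big Data Based on the Spark Framework
Cite this article: SHAO Liang, HE Xing-zhou, SHANG Jun-na. FP-Growth Frequent Itemset Mining Algorithm for Big Data Based on the Spark Framework [J]. Application Research of Computers, 2018, 35(10).
Authors: SHAO Liang  HE Xing-zhou  SHANG Jun-na
Affiliations: Zhejiang College of Construction; Zhejiang University of Technology; Hangzhou Dianzi University
Funding: National Natural Science Foundation of China (No. 166223123); Zhejiang Provincial Natural Science Foundation (No. jg20160405)
Abstract: To address frequent itemset mining in big data, a parallel FP-Growth frequent itemset mining algorithm based on the Spark framework is proposed. First, following the vertical-layout idea, the data are arranged vertically by transaction identifier, which avoids having to scan the entire dataset. Then, a frequent pattern tree is built with the FP-Growth algorithm and the frequent 1-itemsets are generated. Next, the support of each itemset is computed by scanning the vertical dataset, so that infrequent items can be identified and deleted from the dataset to reduce its size. Finally, the frequent k-itemsets are generated through an iterative process. Experimental results on standard datasets show that the algorithm mines frequent itemsets effectively and has a substantial advantage in execution time.

Keywords: big data  frequent itemset mining  Spark framework  FP-Growth algorithm  vertical layout
Received: 2017/5/11
Revised: 2018/8/29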

Frequent Item sets Mining Algorithm for big data based on FP-Growth and Spark Framework
SHAO Liang, HE Xing-zhou and SHANG Jun-na. Frequent Item sets Mining Algorithm for big data based on FP-Growth and Spark Framework [J]. Application Research of Computers, 2018, 35(10).
Authors: SHAO Liang, HE Xing-zhou and SHANG Jun-na
Affiliation: Zhejiang College of Construction; Zhejiang University of Technology; Hangzhou Dianzi University
Abstract: To address frequent itemset mining in big data, a parallel frequent itemset mining algorithm based on FP-Growth and the Spark framework is proposed. First, the data are arranged vertically by transaction identifier, which removes the need to scan the entire dataset. Then, the FP-Growth algorithm is used to construct the frequent pattern tree and generate the frequent 1-itemsets. Next, the support of each itemset is calculated by scanning the vertical dataset, so that infrequent items can be identified and deleted from the dataset to reduce its size. Finally, an iterative process generates the frequent itemsets. Experimental results on standard datasets show that the algorithm mines frequent itemsets effectively and has a substantial advantage in execution time.
Keywords: Big data  Frequent itemset mining  Spark framework  FP-Growth algorithm  Vertical layout
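
The vertical-layout, support-counting and pruning steps named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation and it builds no FP-tree: it is a plain-Python, tidset-intersection illustration, and the function names `to_vertical` and `mine_frequent_itemsets` and the parameter `min_support` (an absolute count) are assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def to_vertical(transactions):
    """Map each item to the set of transaction ids (tidset) that contain it."""
    vertical = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def mine_frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    vertical = to_vertical(transactions)

    # Frequent 1-itemsets: items below min_support are dropped here,
    # so later levels never touch them (the data-reduction step).
    current = {
        frozenset([item]): tids
        for item, tids in vertical.items()
        if len(tids) >= min_support
    }

    frequent = {}
    while current:
        frequent.update({itemset: len(tids) for itemset, tids in current.items()})

        # Join pairs of frequent k-itemsets into (k+1)-item candidates.
        next_level = {}
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Support of the union is the size of the tidset intersection,
            # so no rescan of the original transactions is needed.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        current = next_level

    return frequent
```

In a Spark setting the tidsets would presumably be distributed across partitions; the sketch only shows why the vertical layout lets support be computed by set intersection instead of by repeated scans of the full dataset.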