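A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets
Cite this article: YANG Yang, DING Jiaman, LI Haibin, JIA Lianyin, YOU Jinguo, JIANG Ying. A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets[J]. Information and Control, 2019, 48(3): 257-264. DOI: 10.13976/j.cnki.xk.2019.8371
Authors: YANG Yang  DING Jiaman  LI Haibin  JIA Lianyin  YOU Jinguo  JIANG Ying
Affiliation: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
Funding: National Natural Science Foundation of China (51467007, 61562054, 61462050)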
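Abstract: How to improve the performance of frequent pattern mining on massive uncertain datasets is a current research focus. Most traditional algorithms use a single measure, such as expectation, probability, or weight, as the support of an itemset, and in a big-data setting, algorithms that consider both probability-based and weight-based support struggle to remain efficient. This paper therefore proposes UWEFP, a Spark-based frequent pattern mining algorithm for uncertain datasets. First, to account for both the probability and the weight of each item, the algorithm computes the maximum probability weight value of each 1-itemset and prunes accordingly. Then, to reduce repeated scans of the dataset, it draws on the strengths of the Spark framework and uses a novel UWEFP-tree structure, which has FP-tree characteristics, to build and mine the pattern tree. Finally, the algorithm is validated experimentally on UCI datasets in a Spark environment. The experimental results show that the proposed method improves efficiency while preserving the mining results.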

Keywords: uncertain data  data mining  frequent patterns  Spark
Received: 2018-07-25

A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets
YANG Yang,DING Jiaman,LI Haibin,JIA Lianyin,YOU Jinguo,JIANG Ying. A Spark-based Frequent Patterns Mining Algorithm for Uncertain Datasets[J]. Information and Control, 2019, 48(3): 257-264. DOI: 10.13976/j.cnki.xk.2019.8371
Authors:YANG Yang  DING Jiaman  LI Haibin  JIA Lianyin  YOU Jinguo  JIANG Ying
Affiliation:Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Abstract:Improving the performance of frequent pattern mining on massive uncertain datasets has become an active research topic. Most traditional algorithms use a single measure (expectation, probability, or weight) as the support of an itemset, and algorithms that consider both probability and weight struggle to remain efficient when big data are involved. We therefore propose UWEFP, a Spark-based algorithm that mines frequent patterns from uncertain datasets according to expected weight. To account for both the probability and the weight of each item, UWEFP first computes the maximum probability weight value of each 1-itemset and prunes accordingly. To reduce the number of scans over the dataset, it then builds and mines a novel UWEFP-tree, a structure with FP-tree characteristics that is designed to exploit the strengths of the Spark framework. Finally, the algorithm is validated on UCI datasets in a Spark environment. The experimental results show that the proposed algorithm improves efficiency while preserving the mining results.
Keywords:uncertain data  data mining  frequent patterns  Spark  
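
The abstract does not define its support measure, so the following is only a minimal sketch of how probability and weight can be combined when scoring itemsets in uncertain data. It rests on assumptions that are not taken from the paper: item probabilities within a transaction are treated as independent, the weight of an itemset is the mean of its item weights, and single items are pruned with a simple upper bound (largest item weight times the item's expected support). That bound is not necessarily the quantity the paper calls the maximum probability weight value. All function names are hypothetical, and the sketch runs on plain Python lists rather than Spark.

```python
from math import prod


def expected_support(itemset, transactions):
    """Expected support: sum over transactions of the product of the
    existence probabilities of the itemset's items (independence assumed)."""
    return sum(
        prod(t[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


def weighted_expected_support(itemset, transactions, weights):
    """Assumed measure: mean item weight times expected support."""
    avg_weight = sum(weights[i] for i in itemset) / len(itemset)
    return avg_weight * expected_support(itemset, transactions)


def prune_items(transactions, weights, min_support):
    """Keep only items whose upper bound reaches the threshold.

    Any itemset X containing item i has expected support no larger than
    that of {i} and mean weight no larger than the largest item weight,
    so max_weight * expected_support({i}) bounds the score of X from above.
    """
    max_weight = max(weights.values())
    items = {i for t in transactions for i in t}
    return {
        i for i in items
        if max_weight * expected_support([i], transactions) >= min_support
    }


# Toy uncertain dataset: each transaction maps an item to its probability.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.8, "c": 0.4},
    {"b": 0.6, "c": 0.1},
]
weights = {"a": 1.0, "b": 0.5, "c": 0.2}

survivors = prune_items(transactions, weights, min_support=1.0)
score_ab = weighted_expected_support(["a", "b"], transactions, weights)
```

With this toy data, item c is pruned because its bound (0.5) falls below the threshold of 1.0, while a and b survive as candidates for longer itemsets. In a Spark setting the per-item sums would typically be computed with a distributed aggregation rather than a Python loop.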
This article is indexed in VIP (维普) and other databases.