
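An Efficient Frequent Itemset Mining Algorithm Based on the MapReduce Model
Cite this article: ZHU Kun, HUANG Rui-zhang, ZHANG Na-na. An Efficient Frequent Itemset Mining Algorithm Based on the MapReduce Model [J]. Computer Science (计算机科学), 2017, 44(7): 31-37.
Authors: ZHU Kun, HUANG Rui-zhang, ZHANG Na-na
Affiliations (the same for all three authors): College of Computer Science and Technology, Guizhou University, Guiyang 550025; Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025
Funding: This work was supported by the National Natural Science Foundation of China (61462011, 61202089), the Specialized Research Fund for the Doctoral Program of Higher Education (20125201120006), and the Guizhou University Research Project for Introduced Talent (2011015).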
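Abstract: With the rapid development of Internet technology and the fast growth in its number of users, many network service companies must process terabytes of data or more every day. In today's big data era, how to mine useful information is becoming an important problem. Data mining algorithms are widely used in many fields. Frequent itemset mining is one of the most common and most important applications of data mining, and Apriori is the most typical algorithm for mining frequent itemsets from a large dataset. However, when the dataset is large or a single host is used, memory is consumed quickly and computation time rises sharply, so the algorithm performs poorly. Distributed and parallel computing based on MapReduce has been proposed in response. This paper proposes an improved MMRA (Matrix MapReduce Algorithm), which mines all frequent k-itemsets by converting blocked data into matrices, and compares it with two existing algorithms (the one-phase algorithm and the k-phase algorithm). Hadoop MapReduce is used as the experimental platform; parallel and distributed computing offers a potential solution for processing large datasets. Experimental results show that the improved algorithm outperforms the other two.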

Keywords: Hadoop  MapReduce  Distributed computing  Data mining  Frequent itemset mining  Apriori algorithm
Received: 2016/8/20
Revised: 2016/10/25

Efficient Frequent Patterns Mining Algorithm Based on MapReduce Model
ZHU Kun, HUANG Rui-zhang and ZHANG Na-na. Efficient Frequent Patterns Mining Algorithm Based on MapReduce Model[J]. Computer Science, 2017, 44(7): 31-37.
Authors: ZHU Kun, HUANG Rui-zhang and ZHANG Na-na
Affiliation (the same for all three authors): College of Computer Science and Technology, Guizhou University, Guiyang 550025, China; Guizhou Provincial Key Laboratory of Public Big Data, Guiyang 550025, China
Abstract: With the rapid development of the Internet and the rapid growth of its user base, many Internet service companies have to manage terabytes of data or more every day. How to find useful information in this big data era is becoming an important problem. Data mining algorithms have been widely used in many fields, and finding frequent itemsets is one of the most common and important applications of data mining; Apriori is the most typical algorithm for finding frequent itemsets in a large transaction database. However, when the dataset is very large or only a single host is used, memory is exhausted quickly and computation time increases dramatically, which makes the algorithm inefficient. Parallel and distributed computing based on the MapReduce framework has been proposed to address this. This paper proposes an improved MMRA (Matrix MapReduce Algorithm), which converts the blocked data into matrices to find all frequent k-itemsets, and compares it with two existing algorithms (the one-phase algorithm and the k-phase algorithm). Hadoop MapReduce is adopted as the experimental platform; parallel and distributed computing offers a potential solution for processing vast amounts of data. Experimental results show that the proposed algorithm outperforms the other two algorithms.
Keywords:Hadoop  MapReduce  Distributed computing  Data mining  Frequent itemset mining  Apriori algorithm
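
The abstract states only that MMRA converts blocked data into matrices and mines all frequent k-itemsets under MapReduce; it does not give the details. The following is therefore a minimal illustrative sketch of that general idea, not the authors' algorithm. It assumes a 0/1 transaction-by-item matrix per block, Apriori-style level-wise candidate generation, and an in-process simulation of the map and reduce steps instead of a real Hadoop job. All names (`to_matrix`, `map_block`, `reduce_counts`, `next_candidates`, `mine`, `block_size`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def to_matrix(block, items):
    """Encode one block of transactions as a 0/1 matrix (rows: transactions, columns: items)."""
    return [[1 if item in txn else 0 for item in items] for txn in block]


def map_block(matrix, candidates, col):
    """Map step (simulated): per-block count of rows containing every item of each candidate."""
    counts = Counter()
    for cand in candidates:
        cols = [col[i] for i in cand]
        counts[cand] = sum(all(row[c] for c in cols) for row in matrix)
    return counts


def reduce_counts(partials, min_support):
    """Reduce step (simulated): sum per-block counts, keep candidates meeting the threshold."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return {cand: n for cand, n in total.items() if n >= min_support}


def next_candidates(frequent, k):
    """Apriori join and prune: k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({i for cand in frequent for i in cand})
    return [c for c in combinations(items, k)
            if all(s in frequent for s in combinations(c, k - 1))]


def mine(transactions, min_support, block_size=2):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    items = sorted({i for txn in transactions for i in txn})
    col = {item: j for j, item in enumerate(items)}
    blocks = [transactions[i:i + block_size]
              for i in range(0, len(transactions), block_size)]
    matrices = [to_matrix(b, items) for b in blocks]  # built once, reused in every pass

    result = {}
    candidates = [(i,) for i in items]
    k = 1
    while candidates:
        partials = [map_block(m, candidates, col) for m in matrices]
        frequent = reduce_counts(partials, min_support)
        result.update(frequent)
        k += 1
        candidates = next_candidates(frequent, k)
    return result


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(mine(data, min_support=3))
```

In this toy dataset each single item has support 4 and each pair has support 3, so all six are kept at `min_support=3`, while the triple `("a", "b", "c")` has support 2 and is dropped. The sketch builds each block's matrix once and reuses it at every level k, so later passes do not re-parse the raw transactions; whether MMRA organizes its passes this way is not stated in the abstract.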