
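Research on Parallelizing the TFIDF Algorithm on the Hadoop Platform
Cite this article: Wang Jingyu, Zhao Weiyan. Research on parallelizing the TFIDF algorithm on the Hadoop platform[J]. Computer Engineering & Science, 2014, 36(6): 1018-1022.
Authors: Wang Jingyu, Zhao Weiyan
Funding: National Natural Science Foundation of China (61163025); Natural Science Foundation of Inner Mongolia (2012MS0912); Scientific Research Project of the Inner Mongolia Department of Education (Njzy12110)
Abstract: To address the low efficiency of training and testing text classification algorithms on a single machine when the data set is large, this paper proposes a TFIDF text classification algorithm based on the Hadoop distributed platform and gives the concrete workflow of its implementation. A parallelized TFIDF text classification algorithm that takes the position of words within a document into account is implemented with the MapReduce programming model and compared with the traditional serial algorithm, with experiments run in both standalone and cluster modes. The experiments show that the parallelized TFIDF text classification algorithm classifies massive data quickly and effectively and improves the algorithm's performance.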

Keywords: text classification; MapReduce; parallelization; TFIDF algorithm
Received: 2012-12-22
Revised: 2014-06-25

Research on parallelizing the TFIDF algorithm based on Hadoop
WANG Jing yu, ZHAO Wei yan. Research on parallelizing the TFIDF algorithm based on Hadoop[J]. Computer Engineering & Science, 2014, 36(6): 1018-1022.
Authors: WANG Jing yu, ZHAO Wei yan
Affiliation: (1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083; 2. College of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China)
Abstract: To improve the efficiency of training and testing text classification algorithms on large data sets, a TFIDF text classification algorithm based on the Hadoop distributed platform is proposed, and its implementation process is given. Using the MapReduce programming model, a parallelized TFIDF text classification algorithm that takes word positions within a document into account is implemented. The parallelized algorithm is compared experimentally with the traditional serial algorithm in both standalone and cluster modes. The experimental results show that the parallelized TFIDF text classification algorithm classifies massive data quickly and effectively and improves the algorithm's performance.
Keywords: text classification; MapReduce; parallelization; TFIDF algorithm
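
As a rough illustration of how a TFIDF computation can be split into map and reduce phases, the following Python sketch simulates those phases in memory. It is not the authors' implementation: the abstract does not give the formulas, so the plain tf × log(N/df) weighting, the function names, and the position weight (`title_boost` applied to the first `title_len` tokens of a document) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def map_term_weights(doc_id, tokens, title_len=0, title_boost=1.0):
    """Map phase: emit ((term, doc_id), weight) for every token.

    Assumption: tokens among the first `title_len` positions count
    `title_boost` times; all other tokens count once.
    """
    for pos, term in enumerate(tokens):
        weight = title_boost if pos < title_len else 1.0
        yield (term, doc_id), weight


def reduce_sum(pairs):
    """Reduce phase: sum the values that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def tfidf(docs, title_len=0, title_boost=1.0):
    """Compute TFIDF scores for {doc_id: [token, ...]} in three jobs."""
    # Job 1: position-weighted count of each term in each document.
    emitted = []
    for doc_id, tokens in docs.items():
        emitted.extend(map_term_weights(doc_id, tokens, title_len, title_boost))
    term_doc_weight = reduce_sum(emitted)

    # Job 2: total weight per document, used to normalize term frequency.
    doc_total = reduce_sum(
        (doc_id, w) for (_, doc_id), w in term_doc_weight.items()
    )

    # Job 3: number of documents that contain each term.
    doc_freq = reduce_sum((term, 1) for (term, _) in term_doc_weight)

    n_docs = len(docs)
    return {
        (term, doc_id): (w / doc_total[doc_id]) * math.log(n_docs / doc_freq[term])
        for (term, doc_id), w in term_doc_weight.items()
    }


if __name__ == "__main__":
    docs = {
        "d1": ["hadoop", "mapreduce", "hadoop"],
        "d2": ["text", "mapreduce"],
    }
    for key, score in sorted(tfidf(docs).items()):
        print(key, round(score, 4))
```

On a real cluster each job would run as a separate MapReduce pass over distributed input; here the three passes are kept as plain function calls so that the data flow between them stays visible.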
This article is indexed in CNKI and other databases.