News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark
Authors: Zhuo Zhou, Jiaohua Qin, Xuyu Xiang, Yun Tan, Qiang Liu, Neal N. Xiong
Affiliation: 1. School of Computer & Software, Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing, 210044, China. 2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. 3. Department of Economics, Finance, Insurance and Risk Management, University of Central Arkansas, Conway, 72035, USA.
Abstract: Because text topic clustering is slow on a stand-alone architecture at big-data scale, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Since the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, the mapped word indexes cannot be traced back to the original words. This paper proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer; the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training. Finally, the text topic clustering is obtained. Experimental results show that, for large data samples, running on Spark improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word-frequency input, the proposed model achieves lower perplexity.
Keywords: News text topic clustering; Spark platform; CountVectorizer algorithm; TF-IDF algorithm; latent Dirichlet allocation model
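The reversibility issue described in the abstract can be illustrated with a small, Spark-free sketch in plain Python. It assumes the irreversible mapping comes from a hashing-based term-frequency step (an assumption; the abstract does not name the component) and contrasts it with a vocabulary-based count vector in the spirit of CountVectorizer. All function names, the smoothed IDF form, and the toy documents are hypothetical choices for illustration, not the authors' implementation.

```python
import math
import zlib
from collections import Counter


def hashed_tf(tokens, num_buckets):
    """Hashing trick: each token becomes a bucket index and the token is discarded."""
    counts = Counter()
    for tok in tokens:
        counts[zlib.crc32(tok.encode("utf-8")) % num_buckets] += 1
    return dict(counts)


def build_vocabulary(docs):
    """Keep every distinct token in a sorted list; list position is the term index."""
    return sorted({tok for doc in docs for tok in doc})


def count_vector(tokens, vocab_index):
    """Term counts keyed by vocabulary index (tokens outside the vocabulary are skipped)."""
    return dict(Counter(vocab_index[tok] for tok in tokens if tok in vocab_index))


def idf_weights(vectors, num_docs):
    """Smoothed IDF, log((N + 1) / (df + 1)); the exact form is an assumption here."""
    doc_freq = Counter()
    for vec in vectors:
        doc_freq.update(vec.keys())
    return {idx: math.log((num_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tfidf(vec, idf):
    """Scale each term count by its IDF weight."""
    return {idx: tf * idf[idx] for idx, tf in vec.items()}


# Toy tokenized "news" documents (hypothetical data).
docs = [
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
    ["stock", "prices", "rise"],
]

# Hashed features: only bucket numbers survive, so words cannot be recovered.
hashed = [hashed_tf(doc, num_buckets=16) for doc in docs]

# Vocabulary-based features: the index -> word table is kept alongside the vectors.
vocab = build_vocabulary(docs)
vocab_index = {tok: i for i, tok in enumerate(vocab)}
vectors = [count_vector(doc, vocab_index) for doc in docs]
idf = idf_weights(vectors, len(docs))
weighted = [tfidf(vec, idf) for vec in vectors]

# Restoring words from indexes is a plain lookup in the stored vocabulary.
restored = [{vocab[idx]: w for idx, w in vec.items()} for vec in weighted]
```

Because the vocabulary list is retained, any term index that a downstream topic model reports for a topic can be looked up in `vocab` to give readable topic words, which is not possible from the hashed bucket numbers alone.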
Source: Computers, Materials & Continua (English)