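Research and Implementation of a Clustering-Based Microblog Keyword Extraction Method
Cite this article:SUN Xing-dong, LI Ai-ping, LI Shu-dong. Research and Implementation of a Clustering-Based Microblog Keyword Extraction Method[J]. Netinfo Security, 2014(12): 27-31.
Author names:SUN Xing-dong  LI Ai-ping  LI Shu-dong
Author affiliation:College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073
Funding:National Key Technology R&D Program [2012BAH38800]; National Natural Science Foundation of China [61202362, 61262057]; China Postdoctoral Science Foundation [2013M542560]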
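Abstract:This paper proposes a clustering-based method for extracting keywords from microblogs. The experiment proceeds in three steps. First, the microblog text is preprocessed and segmented into words, and word weights are computed with the TF-IDF and TextRank algorithms; to suit the short-text nature of microblogs, a weighted calculation is applied when computing the word weights, and once the weights are obtained a clustering algorithm extracts the candidate keywords. Second, following n-gram language model theory with n = 2, a maximum left-neighbor probability and a maximum right-neighbor probability are defined and used to expand the candidate keywords. Third, the expanded keywords are filtered using the concepts of accessory variety and number of semantic units from the semantic extension model, which yields the final extraction result. The experimental results show that TextRank outperforms TF-IDF on short text and that the proposed method extracts microblog keywords effectively.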

Keywords:microblog  clustering algorithm  TF-IDF  TextRank  n-gram language model
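
The weighting part of the first step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the combination formula, so the linear mix with a hypothetical parameter `alpha`, the smoothed IDF, the co-occurrence window, and all function names here are assumptions. The clustering stage that selects candidate keywords is not shown.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Sum each word's TF-IDF over the corpus; docs is a list of token lists."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    scores = defaultdict(float)
    for doc in docs:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            # Smoothed IDF (an assumption) so ubiquitous words keep a small weight.
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1
            scores[word] += (count / len(doc)) * idf
    return dict(scores)


def text_rank(docs, window=2, damping=0.85, iterations=50):
    """PageRank-style scores on an undirected word co-occurrence graph."""
    neighbors = defaultdict(set)
    for doc in docs:
        for i, word in enumerate(doc):
            # Link each word to the words that follow it inside the window.
            for other in doc[i + 1 : i + window]:
                if other != word:
                    neighbors[word].add(other)
                    neighbors[other].add(word)
    vocab = {word for doc in docs for word in doc}
    rank = {word: 1.0 for word in vocab}
    for _ in range(iterations):
        new_rank = {}
        for word in vocab:
            incoming = sum(rank[v] / len(neighbors[v]) for v in neighbors[word])
            new_rank[word] = (1 - damping) + damping * incoming
        rank = new_rank
    return rank


def normalize(scores):
    """Scale scores so the largest value is 1.0."""
    top = max(scores.values(), default=0.0)
    return {w: (s / top if top else 0.0) for w, s in scores.items()}


def combined_weights(docs, alpha=0.5):
    """Linear mix of normalized TF-IDF and TextRank (alpha is hypothetical)."""
    tfidf = normalize(tf_idf(docs))
    rank = normalize(text_rank(docs))
    return {w: alpha * tfidf.get(w, 0.0) + (1 - alpha) * rank[w] for w in rank}


if __name__ == "__main__":
    posts = [
        ["weibo", "keyword", "extraction"],
        ["keyword", "clustering"],
        ["weibo", "clustering", "keyword"],
    ]
    for word, weight in sorted(combined_weights(posts).items()):
        print(word, round(weight, 3))
```

Normalizing both scores before mixing keeps either algorithm from dominating purely because of its scale; setting `alpha` to 0 or 1 recovers a single-algorithm ranking, which is one way to compare the two on short text.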

Research and Implementation of Micro-blog Keyword Extraction Method Based on Clustering
SUN Xing-dong,LI Ai-ping,LI Shu-dong.Research and Implementation of Micro-blog Keyword Extraction Method Based on Clustering[J].Netinfo Security,2014(12):27-31.
Authors:SUN Xing-dong  LI Ai-ping  LI Shu-dong
Affiliation:College of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
Abstract:This paper presents a clustering-based method for extracting keywords from microblogs. The method proceeds in three steps. First, the microblog text is preprocessed and segmented into words, and word weights are calculated with the TF-IDF and TextRank algorithms; to suit the short-text nature of microblogs, the two are combined into a weighted score, and candidate keywords are extracted with a clustering algorithm. Second, based on n-gram language model theory with n = 2, the maximum left-neighbor probability and maximum right-neighbor probability are defined and used to extend the candidate keywords into key phrases. Finally, the results are filtered according to the concepts of accessory variety and number of semantic units in the semantic extension model. The experimental results show that the method extracts microblog keywords effectively and that TextRank performs better than TF-IDF on short text.
Keywords:micro-blog keyword  clustering algorithm  TF-IDF  TextRank  n-gram language model
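
The second step, extending a candidate keyword with its most probable left and right neighbors, might look like the sketch below. The abstract does not state how the two probabilities are defined, so this assumes the maximum left-neighbor probability of a word w is the largest value of count(a, w) / count(w) over all words a that precede w, and symmetrically for the right side. The `threshold` parameter and all function names are hypothetical.

```python
from collections import Counter


def bigram_counts(docs):
    """Count unigrams and adjacent word pairs; docs is a list of token lists."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return unigrams, bigrams


def max_left_neighbor(word, unigrams, bigrams):
    """Most probable preceding word and its probability, or (None, 0.0)."""
    probs = {a: c / unigrams[word] for (a, b), c in bigrams.items() if b == word}
    if not probs:
        return None, 0.0
    best = max(probs, key=probs.get)
    return best, probs[best]


def max_right_neighbor(word, unigrams, bigrams):
    """Most probable following word and its probability, or (None, 0.0)."""
    probs = {b: c / unigrams[word] for (a, b), c in bigrams.items() if a == word}
    if not probs:
        return None, 0.0
    best = max(probs, key=probs.get)
    return best, probs[best]


def expand_keyword(word, unigrams, bigrams, threshold=0.5):
    """Attach a neighbor on each side when its probability reaches the threshold."""
    phrase = [word]
    left, p_left = max_left_neighbor(word, unigrams, bigrams)
    if left is not None and p_left >= threshold:
        phrase.insert(0, left)
    right, p_right = max_right_neighbor(word, unigrams, bigrams)
    if right is not None and p_right >= threshold:
        phrase.append(right)
    return phrase


if __name__ == "__main__":
    posts = [
        ["keyword", "extraction", "method"],
        ["keyword", "extraction", "result"],
        ["keyword", "extraction"],
    ]
    unigrams, bigrams = bigram_counts(posts)
    print(expand_keyword("extraction", unigrams, bigrams))
```

In the sample corpus "extraction" is always preceded by "keyword" but followed by varying words, so only the left neighbor is attached. The third step's filtering by accessory variety would then operate on phrases produced this way.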
This article is indexed in VIP (维普) and other databases.