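Research on Feature Expansion Strategies in Microblog Text Clustering
Cite this article: DUAN Xulei, ZHANG Yangsen, GUO Zhengbin. Research on feature expansion strategies in microblog text clustering [J]. Computer Engineering and Applications, 2017, 53(13): 90-94.
Authors: DUAN Xulei  ZHANG Yangsen  GUO Zhengbin
Affiliation: Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100192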
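Abstract: To address the high dimensionality and sparsity of microblog text, this paper compares text expansion strategies based on external knowledge bases such as Tongyici Cilin, trains Word2vec on a microblog corpus, and builds a vocabulary of context-related words for microblogs. Keywords in the microblog text stream are then expanded using a seed word list and microblog label information. Finally, methods are proposed for extracting microblog text keywords and for distinguishing similar words from related words among the word vectors. Experimental results show that the clustering of short microblog texts improves markedly after expansion with Word2vec related words and microblog labels.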

Keywords: microblog text  high-dimensional and sparse  keyword extraction  similar words  related words  feature expansion  clustering

Feature extension of cluster analysis based on Microblog
DUAN Xulei, ZHANG Yangsen, GUO Zhengbin. Feature extension of cluster analysis based on Microblog [J]. Computer Engineering and Applications, 2017, 53(13): 90-94.
Authors: DUAN Xulei  ZHANG Yangsen  GUO Zhengbin
Affiliation: Institute of Intelligence Information Processing, Beijing Information Science and Technology University, Beijing 100192, China
Abstract: Microblogs have become fertile ground for generating and spreading information. Microblog content, however, differs from news web pages and blog posts: its texts are high-dimensional and sparse, which poses great challenges for microblog text processing. Given these characteristics, this paper compares short-text expansion strategies based on HowNet and Cilin, proposes training Word2vec on a microblog corpus, and constructs a vocabulary of context-related words for microblogs. It then uses seed words and microblog label information to expand microblog text, and puts forward methods for extracting microblog text keywords and for distinguishing similar words from related words. Experiments show that Word2vec-based expansion performs better and that the clustering of microblog text improves significantly.
Keywords:Microblog text  high dimension and sparse  keyword extraction  similar words  related words  feature expansion  clustering  
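
The expansion idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the related-word table below is hypothetical and hand-written, standing in for the related words that would be looked up from Word2vec vectors trained on a microblog corpus, and the function names `expand` and `cosine` are assumptions made for this example. The sketch only shows why adding related words helps sparse short texts: two posts with no shared tokens get a nonzero similarity after expansion, while an unrelated post stays dissimilar.

```python
from collections import Counter
from math import sqrt

# Hypothetical related-word table (made-up entries). In the setting the
# abstract describes, these would come from Word2vec trained on microblogs.
RELATED_WORDS = {
    "earthquake": ["rescue", "disaster"],
    "rescue": ["earthquake", "disaster"],
    "match": ["football", "goal"],
    "goal": ["football", "match"],
}


def expand(tokens, related=RELATED_WORDS, top_n=2):
    """Append up to top_n related words for each token that has an entry."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(related.get(tok, [])[:top_n])
    return expanded


def cosine(a, b):
    """Cosine similarity between two bag-of-words token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# Three toy posts, already reduced to keywords.
post_a = ["earthquake", "news"]
post_b = ["rescue", "team"]
post_c = ["match", "tonight"]

before = cosine(post_a, post_b)                  # same topic, no shared tokens
after = cosine(expand(post_a), expand(post_b))   # same topic, after expansion
cross = cosine(expand(post_a), expand(post_c))   # different topics, after expansion

print(before, after, cross)
```

A clustering step would then operate on the expanded token lists instead of the raw ones; the choice of clustering algorithm is left open here because the abstract does not name one.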