
Improved K-means clustering algorithm combined semantic similarity of short text
Cite this article: QIU Yunfei, ZHAO Bin, LIN Mingming, WANG Wei. Improved K-means clustering algorithm combined semantic similarity of short text[J]. Computer Engineering and Applications (计算机工程与应用), 2016, 52(19): 78-83.
Authors: QIU Yunfei (邱云飞), ZHAO Bin (赵彬), LIN Mingming (林明明), WANG Wei (王伟)
Affiliation: School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
Abstract: Short text clustering faces three major challenges: the sparsity of feature keywords, the complexity of processing in high-dimensional space, and the comprehensibility of clusters. To address them, a K-means short text clustering algorithm improved with semantic information is proposed. The algorithm represents each short text as a set of words, which alleviates the sparsity of feature keywords. It obtains the initial cluster centers by mining the maximum frequent word sets of the short text collection, which overcomes the sensitivity of K-means to the initial cluster centers and makes the resulting clusters easier to interpret. It computes the similarity between documents with a semantic similarity measure combined with TF-IDF values, which avoids operations in high-dimensional space. Experimental results show that the short text clustering algorithm built from a semantic perspective outperforms traditional short text clustering algorithms.

Keywords: text mining; short text clustering; K-means algorithm; maximum frequent word set; HowNet; semantic similarity
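
The abstract names three steps: word-set representation, initial centers from maximum frequent word sets, and a TF-IDF-weighted semantic similarity. The following Python sketch illustrates how these could fit together; it is not the authors' implementation. Several points are assumptions: "maximum frequent word set" is read here as a frequent word set not contained in any larger frequent one, the weight `log(1 + N/df)` and the symmetric best-match similarity are illustrative choices, all function names are hypothetical, and `exact_match` is a stand-in for a HowNet-based word similarity, which is not reproduced here. Only the initialization and one assignment step are shown, since the abstract does not describe how centers are updated.

```python
from itertools import combinations
from math import log


def maximal_frequent_word_sets(docs, min_support):
    """Level-wise (Apriori-style) search over documents given as word sets.

    Returns the frequent word sets that have no frequent proper superset.
    """
    def support(word_set):
        return sum(1 for d in docs if word_set <= d)

    vocab = sorted({w for d in docs for w in d})
    current = [frozenset([w]) for w in vocab
               if support(frozenset([w])) >= min_support]
    frequent = []
    while current:
        frequent.extend(current)
        # Join step: merge two k-sets that differ in exactly one word.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
    return [s for s in frequent if not any(s < t for t in frequent)]


def idf_weights(docs):
    """Assumed weighting: log(1 + N / df). With word sets, tf is 1."""
    n = len(docs)
    df = {}
    for d in docs:
        for w in d:
            df[w] = df.get(w, 0) + 1
    return {w: log(1 + n / c) for w, c in df.items()}


def exact_match(u, v):
    """Placeholder word similarity; a HowNet-based measure would go here."""
    return 1.0 if u == v else 0.0


def weighted_set_similarity(a, b, weight, word_sim):
    """Symmetric, weight-averaged best-match similarity of two word sets."""
    def directed(x, y):
        total = sum(weight.get(w, 0.0) for w in x)
        if total == 0 or not y:
            return 0.0
        score = sum(weight.get(w, 0.0) * max(word_sim(w, v) for v in y)
                    for w in x)
        return score / total
    return 0.5 * (directed(a, b) + directed(b, a))


def assign(docs, centers, weight, word_sim):
    """One K-means-style assignment step: index of the most similar center."""
    return [max(range(len(centers)),
                key=lambda k: weighted_set_similarity(d, centers[k],
                                                      weight, word_sim))
            for d in docs]


if __name__ == "__main__":
    docs = [
        {"stock", "market", "price"},
        {"stock", "market", "fall"},
        {"football", "match", "goal"},
        {"football", "match", "win"},
    ]
    # Sort the centers so the cluster indices are reproducible.
    centers = sorted(maximal_frequent_word_sets(docs, min_support=2),
                     key=sorted)
    labels = assign(docs, centers, idf_weights(docs), exact_match)
    for d, k in zip(docs, labels):
        print(sorted(d), "->", sorted(centers[k]))
```

Because each initial center is itself a small set of words, it doubles as a readable label for its cluster, and the similarity is computed directly on word sets rather than on high-dimensional term vectors.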