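Research on Short-text Clustering Probability Model with Word Discrimination Ability
Cite this article: Niu Yanan. Research on Short-text Clustering Probability Model with Word Discrimination Ability[J]. Application Research of Computers, 2018, 35(12).
Author: Niu Yanan
Affiliation: School of Computer Science and Information Technology, Beijing Jiaotong University
Funding: National Natural Science Foundation of China (61473030, General Program)
Abstract: The widespread use of social media has made short-text clustering an important research topic. However, the high dimensionality and sparsity of short-text word vectors limit the effectiveness of traditional text clustering methods on short texts. Because words are so sparse, the ability of each word to discriminate between clusters becomes especially important for learning the cluster structure of short texts. We propose a probabilistic short-text clustering framework that learns word discriminative power, and we verify that this learning benefits cluster structure learning in three classic text clustering models: LDA (Latent Dirichlet Allocation), BTM (Biterm Topic Model) and GSDMM (Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model). The model parameters are estimated with a Gibbs sampling algorithm. Experimental results on real data sets show that probabilistic models with word discriminative power learning improve the clustering performance of the existing models.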

Keywords: short text clustering  probabilistic model  discriminative power
Received: 2017/7/10
Revised: 2017/9/6

Research on Short-text Clustering Probability Model with Word Discrimination Ability
Niu Yanan. Research on Short-text Clustering Probability Model with Word Discrimination Ability[J]. Application Research of Computers, 2018, 35(12).
Authors:Niu Yanan
Affiliation:School of Computer Science and Technology,Beijing Jiaotong University
Abstract: The widespread use of social media has made short-text clustering an important research topic. However, the high dimensionality and sparsity of short-text word vectors limit the effectiveness of conventional document clustering approaches on short texts, and because words are sparse, the ability of words to discriminate between clusters is especially important for short-text clustering. In this paper, we propose a probabilistic short-text clustering framework that learns word discriminative power, and we validate the effectiveness of this learning for cluster structure learning in LDA (Latent Dirichlet Allocation), BTM (Biterm Topic Model) and GSDMM (Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model). Experimental results on real data sets show that probabilistic models with word discriminative power learning can improve the clustering performance of the existing models.
Keywords: short text clustering  probabilistic model  discriminative power
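
For readers unfamiliar with GSDMM, one of the baseline models named in the abstract, the following is a minimal sketch of the standard collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model. It is not the author's method: it omits the word discriminative power learning that the paper adds, and the function name `gsdmm`, its parameters, and the default hyperparameters `alpha` and `beta` are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def gsdmm(docs, num_clusters, vocab_size, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns one cluster label per document. Illustrative sketch only.
    """
    rng = random.Random(seed)
    doc_counts = [Counter(doc) for doc in docs]
    m = [0] * num_clusters                          # documents per cluster
    n = [0] * num_clusters                          # tokens per cluster
    n_w = [Counter() for _ in range(num_clusters)]  # word counts per cluster

    def move(d, z, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster z.
        m[z] += sign
        n[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            n_w[z][w] += sign * c

    # Random initial assignment of every document to a cluster.
    labels = []
    for d in range(len(docs)):
        z = rng.randrange(num_clusters)
        labels.append(z)
        move(d, z, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, labels[d], -1)  # exclude d from the counts
            log_p = []
            for z in range(num_clusters):
                # Cluster popularity term (constant denominator dropped).
                lp = math.log(m[z] + alpha)
                # Word similarity term: numerator over the words of d.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(n_w[z][w] + beta + j)
                # Normalizer over the length of d.
                for i in range(len(docs[d])):
                    lp -= math.log(n[z] + vocab_size * beta + i)
                log_p.append(lp)
            # Sample a new cluster from the normalized conditional.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z_new = rng.choices(range(num_clusters), weights=weights)[0]
            labels[d] = z_new
            move(d, z_new, +1)
    return labels
```

Calling `gsdmm(docs, num_clusters=K, vocab_size=V)` on tokenized short texts returns one label per text. Each document is assigned to a single cluster, which is the modeling assumption that distinguishes this mixture model from LDA.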