Short Text Clustering Algorithm with Improved Feature Weight

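Cite this article:	MA Cun, GUO Rui-Feng, GAO Cen, SUN Yong. Short Text Clustering Algorithm with Improved Feature Weight[J]. Computer Systems & Applications, 2018, 27(9): 210-214

Authors:	MA Cun, GUO Rui-Feng, GAO Cen, SUN Yong

Affiliation:	University of Chinese Academy of Sciences, Beijing 100049; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168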

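Abstract:	Short text has long been an active research topic in natural language processing. Because short texts have sparse features and highly colloquial wording, clustering models for them suffer from high dimensionality, poor topic focus, and weak semantic information. To address these problems, this paper proposes a short text clustering algorithm with improved feature weights. First, a multi-factor weighting rule is defined: a comprehensive evaluation function is constructed from part-of-speech and symbol-based sentiment analysis, and feature words are selected by combining it with the relevance between each term and the text content. Next, the Skip-gram model (Continuous Skip-gram Model) is trained on a large-scale corpus to obtain word vectors that represent the semantics of the feature words. Finally, the RWMD algorithm is used to compute the similarity between short texts, and this similarity is applied in the K-Means algorithm for clustering. Clustering results on three test sets show that the algorithm effectively improves the accuracy of short text clustering.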
Keywords:	feature weight; sentiment analysis; word vector; RWMD distance
Received:	2018-01-27
Revised:	2018-03-07
Short Text Clustering Algorithm with Improved Feature Weight

MA Cun, GUO Rui-Feng, GAO Cen and SUN Yong. Short Text Clustering Algorithm with Improved Feature Weight[J]. Computer Systems & Applications, 2018, 27(9): 210-214

Authors:	MA Cun, GUO Rui-Feng, GAO Cen, SUN Yong

Affiliation:	University of Chinese Academy of Sciences, Beijing 100049, China; Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China and Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China

Abstract:	Short text research has long been a hot topic in natural language processing. Because short texts have sparse features and are highly colloquial, clustering models for them suffer from high dimensionality, poor topic focus, and unclear semantic information. To address these problems, this study proposes a short text clustering algorithm with improved feature weights. First, a multi-factor weighting rule is defined: a comprehensive evaluation function is constructed based on part-of-speech and symbolic sentiment analysis, and feature words are selected according to the relevance between each term and the text content. Then, the continuous skip-gram model is trained on a large-scale corpus to obtain word vectors that represent the semantics of the feature words. Finally, the RWMD algorithm is used to calculate the similarity between short texts, and the K-means algorithm is used to cluster them. Clustering results on three test sets show that the algorithm effectively improves the accuracy of short text clustering.

Keywords:	feature weight; sentiment analysis; word vector; RWMD distance
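
The similarity step named in the abstract can be illustrated with a minimal sketch of the Relaxed Word Mover's Distance (RWMD) in its commonly published form: each word's normalized weight is moved entirely to the nearest word of the other document, and the larger of the two one-sided costs is taken. This is not the authors' implementation. The function names, the plain term-frequency weights, and the toy 2-D vectors are assumptions made for illustration; the paper's multi-factor feature weights and trained skip-gram vectors would take their place. Both documents are assumed to be non-empty and every token is assumed to have a vector.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: each distinct word's share of the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(weights_from, words_to, vectors):
    """Move each source word's full weight to its nearest target word."""
    return sum(
        weight * min(euclidean(vectors[word], vectors[other]) for other in words_to)
        for word, weight in weights_from.items()
    )


def rwmd(doc_a, doc_b, vectors):
    """Relaxed Word Mover's Distance: the larger of the two one-sided costs."""
    weights_a, weights_b = nbow(doc_a), nbow(doc_b)
    return max(
        one_sided_cost(weights_a, weights_b, vectors),
        one_sided_cost(weights_b, weights_a, vectors),
    )


# Hypothetical 2-D embeddings, for illustration only (not trained vectors).
vectors = {
    "cat": (0.0, 0.0),
    "dog": (1.0, 0.0),
    "car": (0.0, 3.0),
    "bus": (0.0, 4.0),
}

print(rwmd(["cat", "dog"], ["dog", "cat"], vectors))  # 0.0
print(rwmd(["cat", "dog"], ["car", "bus"], vectors))  # 3.5
```

The abstract does not say how the RWMD values are combined with K-means centroids, so the clustering step is left out of this sketch.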
