Text feature extraction method based on the Labeled-LDA model


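Cite this article:	Wang Rui, Long Hua, Shao Yubin, Du Qingzhi. Text feature extraction method based on the Labeled-LDA model[J]. Electronic Measurement Technology, 2020, 0(1): 141-146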

Authors:	Wang Rui, Long Hua, Shao Yubin, Du Qingzhi

Affiliation:	Faculty of Information Engineering and Automation, Kunming University of Science and Technology

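Abstract:	To address the unclear topic identification that arises when the LDA topic model is used for text feature extraction, a text feature extraction method based on the Labeled-LDA model is proposed. The LDA topic model extracts the topic words of the latent topics in the text, and the TF-IDF algorithm extracts the keywords of each text category. The Simhash algorithm then computes the similarity between the extracted topic words and keywords, which identifies the category of each latent topic and yields the feature words. Experiments show that the combined method achieves higher classification accuracy than feature extraction with TF-IDF or the traditional LDA topic model: accuracy improves by 3.40%, recall by 4.40%, and the F value by 3.92%.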
Keywords:	Labeled-LDA; TF-IDF; Simhash; text feature extraction
Text feature extract method based on Labeled-LDA model

Wang Rui, Long Hua, Shao Yubin, Du Qingzhi. Text feature extract method based on Labeled-LDA model[J]. Electronic Measurement Technology, 2020, 0(1): 141-146

Authors:	Wang Rui, Long Hua, Shao Yubin, Du Qingzhi

Affiliation:	Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650000, China

Abstract:	To address the unclear topic identification of the LDA topic model in text feature extraction, this paper proposes a text feature extraction method based on the Labeled-LDA model. First, the LDA topic model extracts the topic words of the latent topics in the text, and the TF-IDF algorithm extracts the keywords of each text category. Second, the Simhash algorithm computes the similarity between the topic words and the keywords in order to find the category of each latent topic and extract the feature words. Experiments show that the combined method achieves higher classification accuracy than feature extraction based on TF-IDF or the traditional LDA topic model: accuracy increases by 3.40%, recall by 4.40%, and the F value by 3.92%.

Keywords:	Labeled-LDA; TF-IDF; Simhash; text feature extraction
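
The matching step described in the abstract, comparing the topic words of a latent topic with the keywords of each category by Simhash, might look roughly like the sketch below. It is an illustration and not the authors' implementation. The 64-bit fingerprint width, the MD5-based word hash, the similarity defined as one minus the normalized Hamming distance, and all example words and weights are assumptions. The weighted word lists stand in for the output of an LDA model and a TF-IDF ranking, which are not computed here.

```python
import hashlib
from typing import Mapping

BITS = 64  # fingerprint width; an assumption, the abstract does not state it


def word_hash(word: str) -> int:
    """Stable 64-bit hash of a word (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash(weighted_words: Mapping[str, float]) -> int:
    """Simhash fingerprint of a {word: weight} mapping."""
    totals = [0.0] * BITS
    for word, weight in weighted_words.items():
        h = word_hash(word)
        for i in range(BITS):
            # Each word votes +weight on bits set in its hash, -weight otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int) -> float:
    """1.0 for identical fingerprints, lower as the Hamming distance grows."""
    hamming = bin(a ^ b).count("1")
    return 1.0 - hamming / BITS


def assign_category(
    topic_words: Mapping[str, float],
    category_keywords: Mapping[str, Mapping[str, float]],
) -> str:
    """Return the category whose keyword fingerprint is closest to the topic's."""
    topic_fp = simhash(topic_words)
    return max(
        category_keywords,
        key=lambda c: similarity(topic_fp, simhash(category_keywords[c])),
    )


if __name__ == "__main__":
    # Hypothetical inputs: topic words with LDA probabilities,
    # category keywords with TF-IDF scores.
    topic = {"match": 0.30, "team": 0.25, "coach": 0.10}
    categories = {
        "sports": {"match": 0.40, "team": 0.35, "league": 0.15},
        "finance": {"stock": 0.40, "bank": 0.30, "market": 0.20},
    }
    print(assign_category(topic, categories))
```

Once a latent topic has been assigned a category this way, its topic words can serve as feature words for that category, which is the role the abstract gives to the similarity step.
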
This article is indexed in VIP (维普) and other databases.
