Text feature extraction method based on the Labeled-LDA model
Cite this article: Wang Rui, Long Hua, Shao Yubin, Du Qingzhi. Text feature extraction method based on the Labeled-LDA model[J]. Electronic Measurement Technology, 2020, 0(1): 141-146
Authors: Wang Rui  Long Hua  Shao Yubin  Du Qingzhi
Affiliation: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650000, China
Abstract: To address the unclear topic identification that arises when the LDA topic model is used for text feature extraction, this paper proposes a text feature extraction method based on the Labeled-LDA model. First, the LDA topic model extracts the topic words of the latent topics in the text, and the TF-IDF algorithm extracts keywords for each text category. Second, the Simhash algorithm computes the similarity between the extracted topic words and the category keywords, which identifies the category of each latent topic and yields the feature words. Experiments show that the combined method achieves higher classification accuracy than feature extraction based on TF-IDF or on the traditional LDA topic model alone: accuracy increased by 3.40%, recall by 4.40%, and the F value by 3.92%.
Keywords: Labeled-LDA  TF-IDF  Simhash  text feature extraction
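
The matching step described in the abstract (comparing LDA topic words with per-category TF-IDF keywords through Simhash) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `simhash`, `hamming_distance`, and `nearest_category`, the MD5-based word hash, the 64-bit fingerprint width, and the toy word weights are all hypothetical. In practice the topic-word weights would come from a trained LDA model and the keyword weights from TF-IDF scores.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Return a `bits`-wide Simhash fingerprint for a {word: weight} mapping.

    Assumption: each word is hashed with MD5 and truncated to `bits` bits
    (bits must be a multiple of 8 and at most 128).
    """
    totals = [0.0] * bits
    for word, weight in weighted_words.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[i] += weight if (word_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def nearest_category(topic_words, category_keywords, bits=64):
    """Label a topic with the category whose keyword fingerprint is closest."""
    topic_fp = simhash(topic_words, bits)
    distances = {
        category: hamming_distance(topic_fp, simhash(keywords, bits))
        for category, keywords in category_keywords.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical toy data: weights stand in for LDA word probabilities
# (topic) and TF-IDF scores (categories).
topic = {"match": 0.30, "team": 0.25, "coach": 0.20, "season": 0.10}
categories = {
    "sports": {"match": 0.9, "team": 0.8, "goal": 0.7, "coach": 0.6},
    "finance": {"stock": 0.9, "bank": 0.8, "market": 0.7, "fund": 0.6},
}
label, dists = nearest_category(topic, categories)
print(label, dists)
```

A smaller Hamming distance means the two weighted word sets are more similar, so the latent topic is assigned to the category with the smallest distance. How the feature words are then selected for the labeled topic is not specified in the abstract and is left out of this sketch.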
This article is indexed in VIP (维普) and other databases.