
Unsupervised feature selection method based on latent Dirichlet allocation model and mutual information
Cite this article: DONG Yuan-yuan, CHEN Ji-li, TANG Xiao-xia. Unsupervised feature selection method based on latent Dirichlet allocation model and mutual information[J]. Journal of Computer Applications (计算机应用), 2012, 32(8): 2250-2257. DOI: 10.3724/SP.J.1087.2012.02250
Authors: DONG Yuan-yuan (董元元), CHEN Ji-li (陈基漓), TANG Xiao-xia (唐小侠)
Affiliations: 1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China; 2. College of Science, Guilin University of Technology, Guilin, Guangxi 541004, China
Abstract: Feature selection based on mutual information (MI) suffers from category deficiency and a bias toward low-frequency words. To address both problems, the LDA-σ method is proposed. The method first extracts latent topics with the latent Dirichlet allocation (LDA) model, then uses the standard deviation of the "word-topic" MI as the feature evaluation function. In feature selection and classification experiments on the Reuters-21578 corpus, the micro-averaged F1 of LDA-σ reached as high as 0.9096, and its macro-averaged F1 outperformed the other algorithms, peaking at 0.7823. The experimental results indicate that LDA-σ can be applied to text feature selection.

Keywords: latent Dirichlet allocation (LDA) model; mutual information (MI); evaluation function
Received: 2012-01-09
Revised: 2012-03-04
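
The abstract describes the evaluation function only in words, so the following is a rough sketch of the idea rather than the authors' implementation. It assumes the word-topic MI takes the pointwise form log(p(w|z) / p(w)), that p(w|z) and p(z) come from an already fitted LDA model, and that words are ranked by the population standard deviation of this value across topics. The function names, the toy vocabulary, and the probabilities are all hypothetical.

```python
import math
from statistics import pstdev


def lda_sigma_scores(topic_word, topic_prior):
    """Score each word by the standard deviation of its word-topic MI.

    topic_word[k][v]: assumed p(word v | topic k) from a fitted LDA model
                      (must be strictly positive, as smoothed LDA output is).
    topic_prior[k]:   assumed p(topic k).
    """
    n_topics = len(topic_word)
    n_words = len(topic_word[0])
    scores = []
    for v in range(n_words):
        # Marginal word probability: p(w) = sum_k p(k) * p(w | k)
        p_w = sum(topic_prior[k] * topic_word[k][v] for k in range(n_topics))
        # Assumed pointwise MI between the word and each topic
        mi = [math.log(topic_word[k][v] / p_w) for k in range(n_topics)]
        # A word spread evenly over topics gets a score near zero
        scores.append(pstdev(mi))
    return scores


def select_features(vocab, scores, top_n):
    """Return the top_n words with the largest scores."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]


# Toy example with two topics and three words (made-up probabilities).
vocab = ["oil", "wheat", "said"]
topic_word = [
    [0.70, 0.05, 0.25],  # topic 0
    [0.05, 0.70, 0.25],  # topic 1
]
topic_prior = [0.5, 0.5]

scores = lda_sigma_scores(topic_word, topic_prior)
selected = select_features(vocab, scores, top_n=2)
```

In this toy setup, "said" is equally likely under both topics, so its MI is the same for every topic and its score is zero, while "oil" and "wheat" are each concentrated in one topic and are the two words selected. No class labels are needed at any step, which is what makes a criterion of this kind unsupervised.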
This article is indexed in CNKI, Wanfang Data, and other databases.