
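Imbalanced Text Categorization Method with SLDA Topic Model
Cite this article: TANG Huanling, LIU Yanhong, ZHENG Han, DOU Quansheng, LU Mingyu. Imbalanced Text Categorization Method with SLDA Topic Model[J]. Computer Engineering and Applications, 2021, 57(12): 144-154.
Authors: TANG Huanling  LIU Yanhong  ZHENG Han  DOU Quansheng  LU Mingyu
Affiliations: 1. School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005; 2. Co-innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Yantai, Shandong 264005; 3. Key Laboratory of Intelligent Information Processing in Universities of Shandong (Shandong Technology and Business University), Yantai, Shandong 264005; 4. School of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116026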
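Abstract: On datasets with a balanced label distribution and enough labeled samples, supervised classification algorithms usually achieve good results. In practice, however, label distributions are often imbalanced, and classification performance degrades as a result. To address this, the supervised topic model SLDA (Supervised LDA) is used to build a new imbalanced text classification algorithm, ITC-SLDA (Imbalanced Text Categorization based on Supervised LDA). Based on the SLDA topic model, an accurate mapping between topics and rare classes is established to improve classification accuracy on the minority classes. The SLDA model is then used to label unlabeled samples. A new method for computing the confidence of unlabeled samples and a class-constrained sampling strategy are proposed so that unlabeled samples are sampled effectively, which reduces the skew of the imbalanced text data and improves classification performance on it. Experimental results show that the proposed method clearly improves the Macro-F1 and G-mean values on imbalanced text classification tasks.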
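The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the common multi-class definitions (Macro-F1 as the unweighted mean of per-class F1 scores, G-mean as the geometric mean of per-class recalls); the abstract does not state which exact variants the paper uses, and the function names here are hypothetical.

```python
from math import prod


def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives and false negatives per class.

    Assumes every value in y_true and y_pred appears in `labels`.
    """
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def g_mean(y_true, y_pred, labels):
    """Geometric mean of per-class recalls (assumed multi-class form)."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = []
    for c in labels:
        tp, fn = counts[c]["tp"], counts[c]["fn"]
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return prod(recalls) ** (1 / len(recalls))


# Toy imbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(y_true, y_pred, [0, 1]))
print(g_mean(y_true, y_pred, [0, 1]))
```

Both measures drop sharply when a minority class is poorly recognized, which is why they are preferred over plain accuracy on skewed data.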

Keywords: supervised topic model  semi-supervised learning  imbalanced text  categorization

Imbalanced Text Categorization Method with SLDA Topic Model
TANG Huanling,LIU Yanhong,ZHENG Han,DOU Quansheng,LU Mingyu.Imbalanced Text Categorization Method with SLDA Topic Model[J].Computer Engineering and Applications,2021,57(12):144-154.
Authors:TANG Huanling  LIU Yanhong  ZHENG Han  DOU Quansheng  LU Mingyu
Abstract: Supervised categorization algorithms can yield good categorization performance on datasets with sufficient, balanced labels. However, many real-world categorization tasks suffer from the class imbalance problem, which is known to hinder the learning performance of categorization algorithms. This paper demonstrates that the SLDA (Supervised LDA) model is capable of addressing the class imbalance problem by sampling unlabeled instances. To yield better prediction performance on minority classes, the semantic relationship between topics and minority classes is derived from the SLDA topic model. An efficient way of calculating confidence and sampling valuable unlabeled instances is proposed. The proposed method efficiently reduces the skewness of imbalanced datasets and improves the categorization performance on minority classes. Experimental results show that the proposed method, the ITC-SLDA algorithm, can significantly improve Macro-F1 and G-mean values in imbalanced text categorization.
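The idea of sampling unlabeled instances under a class constraint can be sketched as follows. This is only an illustration under stated assumptions, not the ITC-SLDA procedure: the abstract does not give the paper's confidence formula or sampling rule, so the sketch substitutes a simple margin confidence (top probability minus runner-up) and a quota that tops each class up to the size of the largest class. The names `margin_confidence`, `class_constrained_sample`, and `min_conf` are hypothetical.

```python
from collections import Counter, defaultdict


def margin_confidence(probs):
    """Assumed confidence: gap between the two highest class probabilities."""
    ordered = sorted(probs, reverse=True)
    return ordered[0] - (ordered[1] if len(ordered) > 1 else 0.0)


def class_constrained_sample(labeled_y, unlabeled_probs, min_conf=0.5):
    """Pick confident pseudo-labeled instances, limited by a per-class quota.

    labeled_y       -- labels of the existing (imbalanced) training set
    unlabeled_probs -- one dict {class: probability} per unlabeled instance,
                       e.g. produced by some trained classifier
    Returns a sorted list of (instance_index, pseudo_label) pairs.
    """
    counts = Counter(labeled_y)
    target = max(counts.values())
    # Assumed constraint: a class may only grow up to the majority class size,
    # so the majority class itself receives no new samples.
    quota = {c: target - n for c, n in counts.items()}

    candidates = defaultdict(list)
    for idx, probs in enumerate(unlabeled_probs):
        label = max(probs, key=probs.get)
        conf = margin_confidence(list(probs.values()))
        if conf >= min_conf and quota.get(label, 0) > 0:
            candidates[label].append((conf, idx))

    selected = []
    for label, items in candidates.items():
        items.sort(reverse=True)  # most confident first
        selected.extend((idx, label) for _, idx in items[: quota[label]])
    return sorted(selected)


labeled_y = ["a", "a", "a", "b"]  # class "b" is the minority
unlabeled_probs = [
    {"a": 0.90, "b": 0.10},
    {"a": 0.10, "b": 0.90},
    {"a": 0.20, "b": 0.80},
    {"a": 0.05, "b": 0.95},
    {"a": 0.45, "b": 0.55},
]
print(class_constrained_sample(labeled_y, unlabeled_probs))
```

Adding the selected pairs to the training set moves the class counts toward balance, which is the skew reduction the abstract describes.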
Keywords:supervised topic model  semi-supervised learning  imbalanced text  categorization  
This article is indexed in Wanfang Data and other databases.