Weakly-supervised text classification with label semantic enhancement
Citation: Chengyu LIN, Lei WANG, Cong XUE. Weakly-supervised text classification with label semantic enhancement[J]. Journal of Computer Applications, 2023, 43(2): 335-342.
Authors: Chengyu LIN  Lei WANG  Cong XUE
Affiliation: Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Funding: Key Program of the National Natural Science Foundation of China (U1636220)
Abstract: To address the category vocabulary noise and label noise in weakly-supervised text classification, a weakly-supervised text classification model with label semantic enhancement was proposed. First, the category vocabulary was denoised on the basis of the contextual semantic representations of words, so as to construct a highly accurate category vocabulary. Then, a word-category prediction task based on the MASK mechanism was constructed to fine-tune the pre-trained model BERT (Bidirectional Encoder Representations from Transformers), so that the model learns the relationship between words and categories. Finally, a self-training module that incorporates label semantics was used to make full use of all the data and reduce the impact of label noise, converting word-level semantics into sentence-level semantics and thereby predicting the category of a text sequence accurately. Experimental results show that, compared with LOTClass (Label-name-Only Text Classification), the current state-of-the-art weakly-supervised text classification model, the proposed method improves classification accuracy by 5.29, 1.41 and 1.86 percentage points on the public THUCNews, AG News and IMDB datasets, respectively.
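
A minimal sketch, not the authors' released code, of the MASK-mechanism word-category prediction task described above: occurrences of (denoised) category-vocabulary words are replaced with [MASK], and BERT is fine-tuned to predict each masked word's category from its context, in the spirit of LOTClass's masked category prediction. The category vocabulary, model checkpoint and function names below are illustrative assumptions, not the paper's artifacts.

import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical denoised category vocabulary: word -> class index (e.g. 4 AG News classes)
category_vocab = {"soccer": 0, "election": 1, "stock": 2, "software": 3}
num_classes = 4
classifier = nn.Linear(bert.config.hidden_size, num_classes)

def masked_category_prediction_loss(text: str) -> torch.Tensor:
    # Tokenize, then replace every category-vocabulary token with [MASK] and
    # record its class index as the supervision target at that position.
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    input_ids = enc["input_ids"].clone()
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
    labels = torch.full_like(input_ids, -100)          # -100 = position ignored by the loss
    for i, tok in enumerate(tokens):
        if tok in category_vocab:                      # simplification: single-piece words only
            labels[0, i] = category_vocab[tok]
            input_ids[0, i] = tokenizer.mask_token_id
    hidden = bert(input_ids=input_ids,
                  attention_mask=enc["attention_mask"]).last_hidden_state
    logits = classifier(hidden)                        # (1, seq_len, num_classes)
    # Assumes the document contains at least one category-vocabulary word.
    return nn.functional.cross_entropy(logits.view(-1, num_classes),
                                       labels.view(-1), ignore_index=-100)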

Keywords: weakly-supervised text classification  BERT (Bidirectional Encoder Representations from Transformers)  MASK mechanism  label semantics  label noise  self-training
Received: 2022-01-06
Revised: 2022-03-22
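
The self-training step is only characterized at a high level in the abstract. A common formulation in this line of work (used by LOTClass-style and DEC-style self-training, and assumed here rather than taken from the paper) sharpens the model's own soft predictions into targets by squaring and renormalizing them, then minimizes the KL divergence between those targets and the current predictions. The label-semantic enhancement itself is the paper's contribution and is not reproduced in this sketch.

import torch
import torch.nn.functional as F

def soft_targets(p: torch.Tensor) -> torch.Tensor:
    # p: (N, C) predicted class probabilities for N unlabeled documents.
    # Squaring and renormalizing emphasizes confident predictions and
    # down-weights classes that already dominate the batch.
    weight = p ** 2 / p.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def self_training_loss(logits: torch.Tensor) -> torch.Tensor:
    log_p = F.log_softmax(logits, dim=1)
    q = soft_targets(log_p.exp().detach())             # targets are not back-propagated through
    return F.kl_div(log_p, q, reduction="batchmean")   # KL divergence between targets q and predictions p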
