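Popular science text classification model enhanced by knowledge graph (知识图谱增强的科普文本分类模型)
Cite this article: 唐望径, 许斌, 仝美涵, 韩美奂, 王黎明, 钟琦. 知识图谱增强的科普文本分类模型[J]. 计算机应用 (Journal of Computer Applications), 2022, 42(4): 1072-1078. DOI: 10.11772/j.issn.1001-9081.2021071278
Authors: 唐望径 (TANG Wangjing), 许斌 (XU Bin), 仝美涵 (TONG Meihan), 韩美奂 (HAN Meihuan), 王黎明 (WANG Liming), 钟琦 (ZHONG Qi)
Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing 100084
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Tsinghua Shenzhen International Graduate School, Shenzhen, Guangdong 518055
China Research Institute for Science Popularization, Beijing 100081
Funding: This work was supported by a commissioned cooperation project of the China Research Institute for Science Popularization.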
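Abstract (translated from the Chinese): Popular science text classification is the task of assigning popular science articles to categories in a popular science classification system. Popular science articles often run to more than 1,000 characters, which makes it hard for a model to focus on the key information and leads to poor classification performance in traditional models. To address this problem, a long-text classification model for popular science articles is proposed that uses a knowledge graph to perform two-level screening, reducing interference from topic-irrelevant information and improving classification performance. First, a four-step method is used to construct a knowledge graph for the popular science domain. Then, the knowledge graph serves as a distant supervisor, and a sentence filter is trained to filter out irrelevant information. Finally, an attention mechanism performs a further round of information screening on the filtered sentence set, yielding an attention-based topic classification model. Experimental results on the constructed Popular Science Classification Dataset (PSCD) show that the knowledge-enhanced text classification model based on the domain knowledge graph achieves a higher F1-Score: compared with the TextCNN and BERT models, the F1-Score improves by 2.88 and 1.88 percentage points respectively, which verifies the effectiveness of the knowledge graph for screening information in long texts.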

Keywords: popular science text classification; knowledge graph; two-level screening; long text classification; attention
Received: 2021-07-16
Revised: 2021-09-07

Popular science text classification model enhanced by knowledge graph
TANG Wangjing, XU Bin, TONG Meihan, HAN Meihuan, WANG Liming, ZHONG Qi. Popular science text classification model enhanced by knowledge graph[J]. Journal of Computer Applications, 2022, 42(4): 1072-1078. DOI: 10.11772/j.issn.1001-9081.2021071278
Authors: TANG Wangjing, XU Bin, TONG Meihan, HAN Meihuan, WANG Liming, ZHONG Qi
Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Tsinghua Shenzhen International Graduate School, Shenzhen, Guangdong 518055, China
China Research Institute for Science Popularization, Beijing 100081, China
Abstract: Popular science text classification aims to classify popular science articles according to a popular science classification system. Popular science articles often exceed 1,000 words, which makes it hard for a model to focus on the key points and causes poor classification performance in traditional models. To address this problem, a long text classification model that combines a knowledge graph with two-level screening was proposed to reduce the interference of topic-irrelevant information and improve classification performance. First, a four-step method was used to construct a knowledge graph for the popular science domain. Then, this knowledge graph was used as a distant supervisor, and a sentence filter was trained to filter out irrelevant information. Finally, an attention mechanism was used to further screen the information in the filtered sentence set, and an attention-based topic classification model was built. Experimental results on the constructed Popular Science Classification Dataset (PSCD) show that the knowledge-enhanced text classification model based on the domain knowledge graph achieves a higher F1-Score. Compared with the TextCNN model and the BERT (Bidirectional Encoder Representations from Transformers) model, the proposed model increases the F1-Score by 2.88 percentage points and 1.88 percentage points respectively, verifying the effectiveness of the knowledge graph for long text information screening.
Keywords: popular science text classification; knowledge graph; two-level screening; long text classification; attention
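
The two-level screening described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical: `TOY_KG`, `distant_labels`, `embed`, and `attention_pool` are invented for illustration, and the toy knowledge graph and example article are made up. Two simplifications are assumed: the weak labels derived from the knowledge graph are used directly as the filter (the paper instead trains a sentence filter on such labels), and sentence vectors are plain entity counts rather than learned representations.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge graph: topic -> entity names (not from the paper).
TOY_KG: Dict[str, Set[str]] = {
    "astronomy": {"black hole", "galaxy", "telescope"},
    "health": {"vaccine", "vitamin", "virus"},
}


def distant_labels(sentences: List[str], topic: str,
                   kg: Dict[str, Set[str]]) -> List[int]:
    """Level 1: weak label 1 if a sentence mentions any entity of the topic."""
    entities = kg[topic]
    return [int(any(e in s.lower() for e in entities)) for s in sentences]


def embed(sentence: str, vocab: List[str]) -> List[float]:
    """Toy sentence vector: how often each vocabulary entity occurs."""
    text = sentence.lower()
    return [float(text.count(e)) for e in vocab]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Level 2: dot-product attention over the kept sentence vectors."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors))
              for d in range(len(query))]
    return weights, pooled


# Made-up three-sentence article, screened for the "astronomy" topic.
article = [
    "A black hole bends light that passes near it.",
    "The author thanks the editors for their patience.",
    "A telescope array imaged the galaxy last year.",
]
labels = distant_labels(article, "astronomy", TOY_KG)
kept = [s for s, y in zip(article, labels) if y == 1]

vocab = sorted(e for ents in TOY_KG.values() for e in ents)
query = [1.0 if e in TOY_KG["astronomy"] else 0.0 for e in vocab]
weights, pooled = attention_pool([embed(s, vocab) for s in kept], query)
```

In this sketch the second sentence is dropped at the first level because it mentions no astronomy entity, and the attention weights then favor the sentence that mentions more topic entities. In a trained model the pooled vector would feed a topic classifier; that step is omitted here.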
This article is indexed in Wanfang Data and other databases.