Question answering model based on self-distillation and self-ensemble

Citation: Wang Tongjie, Li Ye. Question answering model based on self-distillation and self-ensemble [J]. Application Research of Computers, 2024, 41(1): 212-216.
Authors: Wang Tongjie, Li Ye
Affiliation: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology
Abstract: Knowledge distillation combined with pre-trained language models is one of the main approaches to building question-answering models. However, such methods suffer from inefficient knowledge transfer, time-consuming teacher-model training, and a capability mismatch between teacher and student models. To address these problems, this paper proposes SD-SE-BERT, a question-answering model based on self-distillation and self-ensemble. The self-ensemble is designed around a sliding-window mechanism; the student model is BERT; the teacher model is a weighted average of several student models obtained during training, with weights determined by their validation-set performance; and the loss function uses the ensembled output together with the ground-truth labels to guide the training of the student model in the current round. Experimental results on the SQuAD1.1 dataset show that SD-SE-BERT improves EM and F1 over the BERT baseline by 7.5 and 4.9 respectively, and outperforms other representative single and distilled models; compared with the fine-tuned large language model ChatGLM-6B, it improves EM by 4.5 and F1 by 2.5. This demonstrates that SD-SE-BERT can exploit the model's own supervision signal to improve its ability to combine features from different text data, without training a complex teacher model, thereby avoiding the mismatch between teacher and student models.
Keywords: question answering model; knowledge distillation; ensemble learning; BERT
Received: 2023-05-13
Revised: 2023-07-21

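To make the mechanism described in the abstract concrete, the sketch below illustrates one way a sliding-window self-ensemble and the corresponding self-distillation loss could be written in PyTorch. It is a minimal, hypothetical illustration only: the window size, the validation-score weighting, the mixing coefficient alpha, and the classification-style formulation (for SQuAD span extraction the same loss would be applied separately to the start- and end-position distributions) are assumptions for exposition, not details taken from the paper.

# Hypothetical sketch of the self-distillation / self-ensemble training signal
# described in the abstract; hyperparameters and exact formulation are assumed.
from collections import deque

import torch
import torch.nn.functional as F


class SlidingWindowSelfEnsemble:
    """Keep the soft outputs of the last `window` student snapshots, weighted by
    their validation scores, and combine them into a teacher distribution."""

    def __init__(self, window=3):
        self.snapshots = deque(maxlen=window)  # items: (val_score, soft_probs)

    def add(self, val_score, soft_probs):
        # soft_probs: probabilities produced by a past-round student on the batch.
        self.snapshots.append((float(val_score), soft_probs.detach()))

    def teacher_probs(self):
        if not self.snapshots:
            return None
        scores = torch.tensor([s for s, _ in self.snapshots])
        weights = scores / scores.sum()                         # weight by dev-set score
        stacked = torch.stack([p for _, p in self.snapshots])   # (window, batch, classes)
        return (weights.view(-1, 1, 1) * stacked).sum(dim=0)


def self_distillation_loss(student_logits, labels, teacher_probs, alpha=0.5):
    """Hard-label cross-entropy plus KL divergence to the self-ensembled teacher."""
    ce = F.cross_entropy(student_logits, labels)
    if teacher_probs is None:  # early rounds: no ensemble available yet
        return ce
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")
    return (1.0 - alpha) * ce + alpha * kl


# Toy usage: 4 examples, 10 candidate answer positions.
ensemble = SlidingWindowSelfEnsemble(window=3)
ensemble.add(val_score=80.1, soft_probs=F.softmax(torch.randn(4, 10), dim=-1))
ensemble.add(val_score=82.4, soft_probs=F.softmax(torch.randn(4, 10), dim=-1))
logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = self_distillation_loss(logits, labels, ensemble.teacher_probs())
loss.backward()

The point of the sketch is the combination of the two signals: the ground-truth labels supervise the current student directly, while the validation-weighted average of earlier student snapshots plays the role of the teacher, so no separate teacher model has to be trained.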