Incorporating Confusion Set Knowledge with Pointer Network for Chinese Grammatical Error Correction
Citation:LI Jiacheng, SHEN Jiayu, GONG Chen, LI Zhenghua, ZHANG Min. Incorporating Confusion Set Knowledge with Pointer Network for Chinese Grammatical Error Correction[J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(4): 29-38.
Authors:LI Jiacheng  SHEN Jiayu  GONG Chen  LI Zhenghua  ZHANG Min
Affiliation:School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
Funding:National Natural Science Foundation of China (62176173, 61876116)
Abstract:In Chinese Grammatical Error Correction (CGEC), substitution errors account for the largest share of errors in the datasets, yet no researcher has so far tried to incorporate phonological and visual similarity knowledge into neural GEC models. This paper makes two attempts to address the problem. First, it proposes a GEC model that incorporates confusion set knowledge through a pointer network. Specifically, the model builds on a sequence-to-edit (Seq2Edit) GEC model and uses the pointer network to incorporate phonological and visual similarity knowledge between Chinese characters. Second, in the training data preprocessing stage, i.e., when extracting edit sequences from erroneous-correct sentence pairs, it proposes a confusion-set-guided edit distance algorithm that better extracts substitution edits between phonologically and visually similar characters. Experimental results show that both improvements raise model performance and that their contributions are complementary; the proposed model achieves state-of-the-art results on the NLPCC 2018 evaluation dataset. Further analysis shows that, compared with the baseline Seq2Edit GEC model, most of the performance gain comes from the correction of substitution errors.
Keywords:grammatical error correction  confusion set  pointer network  
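
The idea of a confusion-set-guided edit distance can be illustrated with a small sketch. The abstract does not give the cost scheme, so the following is an assumption rather than the authors' algorithm: a standard Levenshtein alignment in which substituting two characters that share a confusion set entry costs 0.5 instead of 1.0. The names `extract_edits`, `sub_cost`, and `confusable_cost`, the tie-breaking order in the backtrace, and the toy sentence pair are all hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Edit = Tuple[str, str, str]  # (operation, source char, target char)


def sub_cost(a: str, b: str, confusion: Set[FrozenSet[str]],
             confusable_cost: float = 0.5) -> float:
    """Substitution cost: 0 for a match, discounted for a confusable pair."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in confusion:
        return confusable_cost  # assumed discount, not taken from the paper
    return 1.0


def extract_edits(src: str, tgt: str, confusion: Set[FrozenSet[str]],
                  confusable_cost: float = 0.5) -> List[Edit]:
    """Align src to tgt with a weighted edit distance and return the edits."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = float(i)  # delete every source character
    for j in range(1, m + 1):
        dp[0][j] = float(j)  # insert every target character
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1],
                                            confusion, confusable_cost),
                dp[i - 1][j] + 1.0,  # deletion
                dp[i][j - 1] + 1.0,  # insertion
            )

    # Backtrace; ties are broken in the order keep/substitute, delete, insert.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub_cost(
                src[i - 1], tgt[j - 1], confusion, confusable_cost):
            op = "keep" if src[i - 1] == tgt[j - 1] else "substitute"
            edits.append((op, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1.0:
            edits.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            edits.append(("insert", "", tgt[j - 1]))
            j -= 1
    edits.reverse()
    return edits


# Toy pair: 再 and 在 are phonologically similar; 明 is an unrelated extra character.
confusion_set = {frozenset(("再", "在"))}
guided = extract_edits("我再明家", "我在家", confusion_set)
plain = extract_edits("我再明家", "我在家", set())
```

With an empty confusion set the two alignments tie at cost 2, and this backtrace happens to pair 明 with 在; with the confusion set the alignment that substitutes 再 with 在 becomes strictly cheaper, so the extracted substitution edit is the confusable pair.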