
Chinese grammatical error correction method based on data augmentation and copy mechanism
Cite this article: WANG Quanbin, TAN Ying. Chinese grammatical error correction method based on data augmentation and copy mechanism[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 99-106.
Authors: WANG Quanbin, TAN Ying
Affiliation: School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
Funding: National Key R&D Program of China (2018AAA0100300, 2018AAA0102301); National Key Basic Research Program of China (2015CB352302); National Natural Science Foundation of China (61673025, 61375119); Beijing Natural Science Foundation (4162029).
Abstract: Chinese is one of the most widely used written languages, but because it differs fundamentally from Indo-European languages, learners of Chinese tend to make a wide range of grammatical errors. This paper proposes an automatic grammatical error correction method targeting the errors beginners make in written Chinese, such as typos and improper word order. First, we introduce a copy mechanism into the self-attention model to build a new model, C-Transformer, which maps an erroneous text sequence to its corrected counterpart. Second, starting from public datasets, we use sequence-to-sequence learning to generate different forms of erroneous text from correct text, and we design a filter for the generated errors based on fluency, semantic, and syntactic measures. Finally, drawing on the pictographic character of Chinese script, we build homograph and homophone word lists and construct additional error samples by mapping words through these lists to expand the training data. Experimental results show that the method corrects typos, improper word order, missing words, redundant words, and other errors well, and that it achieved the best results reported at the time on the standard test set for Chinese grammatical error correction.
Keywords: self-attention mechanism; copy mechanism; sequence-to-sequence learning; Chinese; grammatical error correction; neural networks; text generation; fluency
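The dictionary-based augmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion sets below are toy examples invented for illustration, the substitution is done per character rather than per word, and the names `corrupt`, `make_pairs`, and `rate` are hypothetical.

```python
import random

# Toy confusion sets (assumption): the paper's actual homophone and
# homograph word lists are not given in the abstract.
HOMOPHONES = {"在": ["再"], "的": ["地", "得"], "做": ["作"]}
SIMILAR_SHAPE = {"己": ["已"], "未": ["末"]}


def corrupt(sentence, confusion_sets, rate=0.3, rng=None):
    """Replace characters with confusable ones, each with probability `rate`."""
    rng = rng or random.Random()
    out = []
    for ch in sentence:
        # Gather every confusable candidate for this character.
        candidates = [c for table in confusion_sets for c in table.get(ch, [])]
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(correct_sentences, confusion_sets, rate=0.3, seed=0):
    """Build (erroneous, correct) training pairs, skipping unchanged sentences."""
    rng = random.Random(seed)
    pairs = []
    for target in correct_sentences:
        source = corrupt(target, confusion_sets, rate, rng)
        if source != target:
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    tables = [HOMOPHONES, SIMILAR_SHAPE]
    for src, tgt in make_pairs(["我在家做作业", "这是我自己的书"], tables, rate=0.5):
        print(src, "->", tgt)
```

Each pair keeps the original sentence as the correction target, so the synthetic data can be mixed with real learner data when training an error-to-correct sequence model. A word-level variant, closer to the word lists mentioned in the abstract, would segment the sentence first and look up whole words instead of characters.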