Domain Adaptation for Chinese Word Segmentation Using Partial Annotations
Citation: ZHU Yun, LI Zhenghua, HUANG Depeng, ZHANG Min. Domain Adaptation for Chinese Word Segmentation Using Partial Annotations [J]. Journal of Chinese Information Processing (中文信息学报), 2019, 33(9): 1-8.
Authors: ZHU Yun, LI Zhenghua, HUANG Depeng, ZHANG Min
Affiliation: School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
Funding: National Natural Science Foundation of China (61525205, 61876116)
Abstract: In recent years, neural Chinese word segmentation (WS) models have achieved very high performance on closed-domain text. However, performance drops sharply in the domain adaptation scenario, where the domain of the test data differs substantially from that of the training data. This paper attempts to improve cross-domain WS performance by using automatically collected, partially annotated WS data. First, we extend the current state-of-the-art BiLSTM-CRF WS model with a loss function that accommodates partially annotated data. Second, we propose a simple yet effective data selection method that extracts the data most relevant to the target domain from a large pool of partially annotated data. Finally, we find that data preprocessing and integrating traditional sparse features into the neural model both improve segmentation performance. Experiments on the SIGHAN Bakeoff 2010 and ZhuXian annotated test sets show that the proposed approaches effectively improve cross-domain WS performance, raising the F1 score of the baseline model by 3.6% on average.
Keywords: Chinese word segmentation; domain adaptation; partially annotated data
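
The abstract does not spell out the loss function for partially annotated data. One common formulation for CRF models, shown here only as a minimal sketch and not as the authors' implementation, maximizes the total probability of all tag sequences that agree with the known annotations: the loss is the log-partition over all paths minus the log-partition over the constrained paths. The four-tag scheme (B, M, E, S), the function names, and the plain-list score format are assumptions made for illustration; in a BiLSTM-CRF the emission scores would come from the BiLSTM.

```python
import math

# Assumed 4-tag scheme: Begin, Middle, End of a word, and Single-character word.
TAGS = ["B", "M", "E", "S"]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions, allowed):
    """Forward algorithm restricted to the tag ids in allowed[t] at position t."""
    n_tags = len(transitions)
    neg_inf = float("-inf")
    alpha = [emissions[0][k] if k in allowed[0] else neg_inf for k in range(n_tags)]
    for t in range(1, len(emissions)):
        new_alpha = []
        for k in range(n_tags):
            if k in allowed[t]:
                incoming = [alpha[j] + transitions[j][k] for j in range(n_tags)]
                new_alpha.append(emissions[t][k] + logsumexp(incoming))
            else:
                new_alpha.append(neg_inf)  # tag k contradicts the annotation
        alpha = new_alpha
    return logsumexp(alpha)


def partial_crf_loss(emissions, transitions, allowed):
    """Negative log of the probability mass on paths consistent with `allowed`."""
    everything = [set(range(len(transitions)))] * len(emissions)
    return (log_partition(emissions, transitions, everything)
            - log_partition(emissions, transitions, allowed))


# Toy usage: a 3-character sentence where only the first character is
# constrained (it must start a word, so its tag is B or S).
emissions = [[0.0] * 4 for _ in range(3)]    # placeholder scores
transitions = [[0.0] * 4 for _ in range(4)]  # placeholder scores
allowed = [{TAGS.index("B"), TAGS.index("S")}, set(range(4)), set(range(4))]
print(partial_crf_loss(emissions, transitions, allowed))
```

When every position allows exactly one tag, this reduces to the usual fully supervised CRF negative log-likelihood, so fully and partially annotated sentences can share one training objective. When no position is constrained, the loss is zero and the sentence contributes no gradient.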