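Chinese Word Segmentation Based on Joint Training of Heterogeneous Data
Citation:JIANG Meng, WANG Ziniu, GAO Jianling. Chinese Word Segmentation Based on Joint Training of Heterogeneous Data[J]. 电子科技 (Electronic Science and Technology), 2019, 32(4): 29-33.
Authors:JIANG Meng  WANG Ziniu  GAO Jianling
Affiliation:1. School of Big Data and Information Engineering, Guizhou University, Guiyang, Guizhou 550025; 2. Network and Information Management Center, Guizhou University, Guiyang, Guizhou 550025
Funding:Guizhou Provincial Science and Technology Fund (黔科合J字[2015]2045); Guizhou University Graduate Innovation Fund (研理工2017016)
Abstract:Chinese word segmentation is one of the key foundational technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. Deep learning models, however, need large-scale training data to perform well, while existing Chinese word segmentation corpora are relatively scarce and follow inconsistent annotation standards. This paper proposes a simple and effective method for handling heterogeneous data: two manually defined identifiers are added to the data from different corpora, and the processed data is used to jointly train a Chinese word segmentation model that combines a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). Experimental results show that the Bi-LSTM-CRF model jointly trained on heterogeneous data achieves better segmentation performance than models trained on a single corpus.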

Keywords:Chinese word segmentation  deep learning  Bi-LSTM-CRF  heterogeneous data  joint training  corpus
Received:2018-03-18

Chinese Word Segmentation Based on Joint Training of Heterogeneous Data
JIANG Meng,WANG Ziniu,GAO Jianling.Chinese Word Segmentation Based on Joint Training of Heterogeneous Data[J].Electronic Science and Technology,2019,32(4):29-33.
Authors:JIANG Meng  WANG Ziniu  GAO Jianling
Affiliation:1. School of Big Data & Information Engineering,Guizhou University,Guiyang 550025,China;2. Network and Information Management Center,Guizhou University,Guiyang 550025,China
Abstract:Chinese word segmentation is one of the key basic technologies in Chinese information processing, and segmentation methods based on deep learning models have attracted wide attention. However, deep learning models require large-scale training data to achieve good performance, while current Chinese word segmentation corpora are relatively scarce and their annotation standards differ. This paper proposes a simple and effective method for processing heterogeneous data. First, two artificially set identifiers are added to the data from different corpora; the processed data is then used for the joint training of a Bi-LSTM-CRF Chinese word segmentation model. Experimental results show that the Bi-LSTM-CRF model trained jointly on heterogeneous data achieves better segmentation performance than a model trained on a single corpus.
Keywords:Chinese word segmentation  deep learning  Bi-LSTM-CRF  heterogeneous data  joint training  corpus  
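
The abstract does not spell out how the two identifiers are attached, so the following Python sketch shows only one plausible reading: each sentence is wrapped in an opening and a closing token that name its source corpus, and the wrapped sentences from all corpora are pooled into a single training set. The function names, the `<corpus_a>`-style token format, the B/M/E/S tag scheme, and the choice to label the identifier tokens `S` are all assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[str], List[str]]


def to_bmes(words: List[str]) -> Example:
    """Turn a segmented sentence into characters and B/M/E/S tags.

    The B/M/E/S scheme is an assumption; the abstract does not name one.
    """
    chars: List[str] = []
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens left over from splitting
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags


def add_corpus_markers(chars: List[str], tags: List[str], corpus: str) -> Example:
    """Wrap one sentence in two hypothetical corpus identifier tokens."""
    open_tok, close_tok = f"<{corpus}>", f"</{corpus}>"
    # Assumption: each identifier is treated as a single-character word.
    return [open_tok] + chars + [close_tok], ["S"] + tags + ["S"]


def build_joint_training_set(corpora: Dict[str, List[List[str]]]) -> List[Example]:
    """Pool sentences from every corpus into one marked training set."""
    examples: List[Example] = []
    for name, sentences in corpora.items():
        for words in sentences:
            chars, tags = to_bmes(words)
            examples.append(add_corpus_markers(chars, tags, name))
    return examples


if __name__ == "__main__":
    # Two made-up corpora that segment the same string differently.
    toy = {"corpus_a": [["中文", "分词"]], "corpus_b": [["中文分词"]]}
    for chars, tags in build_joint_training_set(toy):
        print(list(zip(chars, tags)))
```

Under this reading, the identifier tokens let a single tagger see which segmentation standard applies to a sentence, so conflicting labels for the same character string in different corpora stay distinguishable during joint training.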
This article is indexed in 万方数据 (Wanfang Data) and other databases.