Dynamically Transfer Entity Span Information for Cross-domain Chinese Named Entity Recognition
Cite this article: WU Bing-Chao, DENG Cheng-Long, GUAN Bei, CHEN Xiao-Lin, ZAN Dao-Guang, CHANG Zhi-Jun, XIAO Zun-Yan, QU Da-Cheng, WANG Yong-Ji. Dynamically Transfer Entity Span Information for Cross-domain Chinese Named Entity Recognition[J]. Journal of Software, 2022, 33(10): 3776-3792.
Authors: WU Bing-Chao  DENG Cheng-Long  GUAN Bei  CHEN Xiao-Lin  ZAN Dao-Guang  CHANG Zhi-Jun  XIAO Zun-Yan  QU Da-Cheng  WANG Yong-Ji
Affiliation: Collaborative Innovation Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; National Science Library, Chinese Academy of Sciences, Beijing 100190, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Collaborative Innovation Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
Foundation item: National Key Research and Development Program of China (2017YFB1002303)
Abstract: Identifying the boundaries of Chinese named entities is difficult because Chinese text contains no separators between words. Furthermore, well-labeled corpora are hard to obtain in vertical domains such as the medical and financial domains, which makes Chinese named entity recognition (NER) in those domains even more challenging. To address these issues, this study proposes a cross-domain Chinese NER model that dynamically transfers entity span information (TES-NER). Cross-domain shared entity span information, which represents the scope of Chinese named entities, is transferred from a general domain (source domain) with sufficient corpus to the Chinese NER model of a vertical domain (target domain) through a dynamic fusion layer based on a gate mechanism. Specifically, TES-NER first builds a cross-domain shared entity span recognition module from a bidirectional long short-term memory (BiLSTM) network and a fully connected network (FCN), which identifies the shared entity span information used to determine the boundaries of Chinese named entities. Then, a Chinese NER module is constructed to identify domain-specific Chinese named entities by applying an independent character-based BiLSTM with a conditional random field (BiLSTM-CRF). Finally, a dynamic fusion layer is designed in which the gate mechanism dynamically determines how much of the cross-domain shared entity span information extracted by the entity span recognition module is transferred to the domain-specific NER model. The general domain (source domain) dataset is the news domain dataset (MSRA) with sufficient labeled corpus, and the vertical domain (target domain) datasets are the mixed domain (OntoNotes 5.0), financial domain (Resume), and medical domain (CCKS 2017) datasets, where the mixed domain dataset (OntoNotes 5.0) integrates six different vertical domains. Experimental results show that the F1 values of the proposed model are 2.18%, 1.68%, and 0.99% higher than those of BiLSTM-CRF on OntoNotes 5.0, Resume, and CCKS 2017, respectively.

Keywords: named entity recognition (NER)  transfer learning  cross-domain  dynamic fusion  bidirectional long short-term memory (BiLSTM) neural network
Received: October 16, 2020
Revised: December 15, 2020
