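基于半监督学习和规则相结合的中医古籍命名实体识别研究 (Research on Named Entity Recognition in Ancient Traditional Chinese Medicine Books Combining Semi-supervised Learning and Rules)
Cite this article: 包振山, 宋秉彦, 张文博, 孙超. 基于半监督学习和规则相结合的中医古籍命名实体识别研究[J]. 中文信息学报 (Journal of Chinese Information Processing), 2022, 36(6): 90-100
Authors: 包振山 (BAO Zhenshan)  宋秉彦 (SONG Bingyan)  张文博 (ZHANG Wenbo)  孙超 (SUN Chao)
Affiliations: 1. 北京工业大学 计算机学院 (College of Computer Science and Technology, Beijing University of Technology), Beijing 100124;
2. 首都医科大学 中医药学院 (School of Traditional Chinese Medicine, Capital Medical University), Beijing 100069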
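Funding: General Project of the Science and Technology Plan of the Beijing Municipal Education Commission (KM202110025021); Cui Xizhang Traditional Chinese Medicine Culture Inheritance Studio of the Beijing Traditional Chinese Medicine "薪火传承3+3工程" (Inheritance 3+3 Project); Capital Medical University Research Cultivation Fund (PYZ19167)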
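Abstract: Entity recognition in ancient traditional Chinese medicine (TCM) books has received little study, and most existing work relies on supervised learning. However, ancient books are poorly digitized, annotated corpora are scarce, the language is mostly classical Chinese, and the technical terminology keeps evolving, so existing methods cannot address these problems effectively. Therefore, after constructing a corpus of ancient TCM books and analyzing the entity names that appear in them, this paper proposes an entity recognition method for ancient TCM books that combines semi-supervised learning with rules. With the conditional random field model as the basic framework, supervised features such as word, part-of-speech, and dictionary features are introduced together with unsupervised semantic features obtained from word vectors. The recognition performance of different feature combinations is compared to determine the best semi-supervised model, which is then compared with other models. Next, a rule base built from the linguistic characteristics of ancient books is used for rule-based post-processing of the model output. The final F-score in the experiments reaches 83.18%, which demonstrates the effectiveness of the method.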

Keywords: semi-supervised learning  conditional random fields  named entity recognition  ancient traditional Chinese medicine books
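
The feature design named in the abstract (word, part-of-speech, dictionary, and word-vector features feeding a conditional random field) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the abstract does not say how word vectors are turned into CRF features, so this sketch assumes they have been discretized into cluster IDs, and the toy dictionary `HERB_DICT`, the cluster table `WORD_CLUSTER`, and the feature names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy resources. The paper's actual dictionary and
# word-vector-derived features are not given in the abstract.
HERB_DICT: Set[str] = {"甘草", "桂枝"}
# Assumption: word vectors were clustered beforehand (for example with
# k-means) and each word is mapped to the ID of its cluster.
WORD_CLUSTER: Dict[str, int] = {"甘草": 3, "桂枝": 3, "主": 7}


def token_features(sent: List[Tuple[str, str]], i: int) -> Dict[str, str]:
    """Build the feature dictionary for the i-th (word, POS) pair of a sentence."""
    word, pos = sent[i]
    feats = {
        "bias": "1",
        "word": word,                                # supervised: word identity
        "pos": pos,                                  # supervised: part of speech
        "in_dict": str(word in HERB_DICT),           # supervised: dictionary lookup
        "cluster": str(WORD_CLUSTER.get(word, -1)),  # unsupervised: cluster ID, -1 if unknown
    }
    # Context window of one token on each side.
    if i > 0:
        feats["-1:word"], feats["-1:pos"] = sent[i - 1]
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(sent) - 1:
        feats["+1:word"], feats["+1:pos"] = sent[i + 1]
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


def sentence_features(sent: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token of one segmented, POS-tagged sentence."""
    return [token_features(sent, i) for i in range(len(sent))]


if __name__ == "__main__":
    example = [("桂枝", "n"), ("主", "v"), ("咳逆", "n")]
    for feats in sentence_features(example):
        print(feats)
```

Lists of feature dictionaries of this shape are the input format accepted by CRF toolkits such as sklearn-crfsuite. Dropping or adding keys in `token_features` is one simple way to compare feature combinations, as the abstract describes.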

Named Entity Recognition in Traditional Chinese Medicine Books Combining Semi-supervised Learning and Rule-based Approach
BAO Zhenshan,SONG Bingyan,ZHANG Wenbo,SUN Chao. Named Entity Recognition in Traditional Chinese Medicine Books Combining Semi-supervised Learning and Rule-based Approach[J]. Journal of Chinese Information Processing, 2022, 36(6): 90-100
Authors:BAO Zhenshan  SONG Bingyan  ZHANG Wenbo  SUN Chao
Affiliation:1.College of Computer Science and Technology, Beijing University of Technology, Beijing 100124, China;
2.School of Traditional Chinese Medicine, Capital Medical University, Beijing 100069, China
Abstract: Named entity recognition in traditional Chinese medicine books has received little attention. Considering the difficulty and cost of annotating such specialized text in classical Chinese, this paper proposes a method for identifying traditional Chinese medicine entities that combines semi-supervised learning with rules. Under the framework of the conditional random fields model, supervised features such as lexical features and dictionary features are introduced together with unsupervised semantic features derived from word vectors. The best semi-supervised learning model is selected by comparing the performance of different feature combinations. Finally, the recognition results of the model are analyzed, and rule-based post-processing is built on the linguistic characteristics of ancient books. Experimental results show an F-score of 83.18%, which supports the validity of this method.
Keywords: semi-supervised learning  conditional random fields  named entity recognition  traditional Chinese medicine books
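
The rule-based post-processing step mentioned in the abstract can be sketched as a pass over the tag sequence predicted by the model. The abstract does not list the paper's rules, so both rules below, the suffix list `FORMULA_SUFFIXES`, and the entity type names `FORMULA` and `HERB` are hypothetical examples chosen for illustration, assuming BIO-style tags.

```python
from typing import List

# Hypothetical rule: prescription names in TCM texts often end in
# 汤 (decoction), 散 (powder), or 丸 (pill). Not taken from the paper.
FORMULA_SUFFIXES = ("汤", "散", "丸")


def post_process(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply two illustrative correction rules to predicted BIO tags."""
    fixed = list(tags)
    for i in range(len(tokens)):
        tok, tag = tokens[i], fixed[i]
        # Rule 1: a multi-character token with a formula suffix that the
        # model left untagged is relabeled as the start of a formula entity.
        if tag == "O" and len(tok) > 1 and tok.endswith(FORMULA_SUFFIXES):
            fixed[i] = "B-FORMULA"
        # Rule 2: an I- tag must continue an entity of the same type;
        # otherwise it is promoted to a B- tag to keep the sequence valid.
        elif tag.startswith("I-"):
            prev = fixed[i - 1] if i > 0 else "O"
            if prev[2:] != tag[2:]:
                fixed[i] = "B-" + tag[2:]
    return fixed


if __name__ == "__main__":
    tokens = ["服", "桂枝汤", "甘草", "二两"]
    predicted = ["O", "O", "I-HERB", "O"]
    print(post_process(tokens, predicted))
```

Keeping the rules separate from the statistical model means the rule base can grow as new error patterns are found, without retraining.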