
The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation
Cite as:SHU Lei, GUO Yiluan, WANG Huiping, ZHANG Xuetao, HU Renfen. The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation[J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(5): 21-30.
Authors:SHU Lei (舒蕾)  GUO Yiluan (郭懿鸾)  WANG Huiping (王慧萍)  ZHANG Xuetao (张学涛)  HU Renfen (胡韧奋)
Affiliation:1. Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875, China;
2. Institute for Advanced Study of the Humanities and Religion, Beijing Normal University, Beijing 100875, China;
3. College of Chinese Language and Culture, Beijing Normal University, Beijing 100875, China
Funding:National Natural Science Foundation of China (62006021); Beijing Social Science Fund, Young Academic Leaders Project (21DTR037)
Abstract:Ancient Chinese is dominated by monosyllabic words, and polysemy is especially prominent, which makes ancient texts challenging for modern readers to understand. To better support the analysis and discrimination of word senses in ancient Chinese, this study draws on the linguistic facts reflected in traditional dictionaries and corpora to design principles for dividing the senses of polysemous words, compiles sense-level knowledge for commonly used monosyllabic words, and annotates word senses in texts containing these polysemous words accordingly. The corpus currently contains 38,700 annotated sentences totaling more than 1,176,000 Chinese characters, enriching the language resources available for ancient Chinese. Experiments show that a BERT-based word sense disambiguation model trained on the corpus reaches an accuracy of about 80%. Furthermore, taking diachronic analysis of word sense evolution and the induction of sense families as case studies, the paper makes a preliminary exploration of how the corpus and word sense disambiguation techniques can be applied to the study of the language itself and to dictionary compilation.

Keywords:ancient Chinese  corpus  word sense annotation  word sense disambiguation
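
The abstract reports a BERT-based word sense disambiguation model but does not describe its architecture. As a rough illustration of the general idea only, the sketch below assigns a sense by nearest centroid over context vectors. Everything in it is an assumption made for illustration: the sense labels, the three-dimensional toy vectors (stand-ins for contextual embeddings that a real system might take from a BERT encoder), and the helper names. It is not the authors' method.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sense_centroids(labeled):
    """Average the context vectors that share a sense label."""
    groups = defaultdict(list)
    for vector, sense in labeled:
        groups[sense].append(vector)
    return {
        sense: [sum(column) / len(vectors) for column in zip(*vectors)]
        for sense, vectors in groups.items()
    }


def predict(vector, centroids):
    """Pick the sense whose centroid is most similar to the vector."""
    return max(centroids, key=lambda sense: cosine(vector, centroids[sense]))


def accuracy(examples, centroids):
    """Fraction of examples whose predicted sense matches the label."""
    correct = sum(predict(vector, centroids) == sense for vector, sense in examples)
    return correct / len(examples)


# Hypothetical toy data: invented vectors and invented sense labels for one
# polysemous word. Real vectors would come from an encoder, not be typed in.
train = [
    ([0.9, 0.1, 0.0], "sense_A"),
    ([0.8, 0.2, 0.1], "sense_A"),
    ([0.1, 0.9, 0.2], "sense_B"),
    ([0.0, 0.8, 0.3], "sense_B"),
]
test = [
    ([0.7, 0.3, 0.0], "sense_A"),
    ([0.2, 0.7, 0.1], "sense_B"),
]

centroids = sense_centroids(train)
print(predict([0.7, 0.3, 0.0], centroids))
print(accuracy(test, centroids))
```

The accuracy printed here describes only the toy data and has no bearing on the roughly 80% figure reported in the abstract.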