上古汉语分词及词性标注语料库的构建——以《淮南子》为范例 / The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi

上古汉语分词及词性标注语料库的构建——以《淮南子》为范例

Cite this article:	留金腾, 宋彦, 夏飞. 上古汉语分词及词性标注语料库的构建——以《淮南子》为范例[J]. 中文信息学报, 2013, 27(6): 6-16.

Authors:	留金腾, 宋彦, 夏飞

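Affiliation:	1. Department of Chinese, Translation & Linguistics, City University of Hong Kong, Hong Kong, China; 2. Hong Kong Community College, The Hong Kong Polytechnic University, Hong Kong, China; 3. Department of Linguistics, University of Washington, Seattle, Washington, USA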

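Abstract (translated from the Chinese):	This paper presents a word-segmented and part-of-speech (POS) tagged corpus of Archaic Chinese built on the text of Huainanzi, together with its construction process. The corpus was built by automatic segmentation and POS tagging combined with manual correction. The automatic stage uses domain adaptation to optimize the annotation models, which significantly improves performance on both segmentation and POS tagging. The paper analyzes the lexical characteristics of Archaic Chinese and, on that basis, describes a set of explicit lexical and morphological features that are applied in the automatic segmenter and tagger and that prove especially helpful for POS tagging. It summarizes and analyzes the errors that arise in automatic segmentation and POS tagging, and finally describes the distribution of words and POS tags across the whole corpus. The proposed method was validated in the annotation of Huainanzi and offers a reference for extending the work to other Archaic Chinese resources. The resulting Huainanzi corpus is also a useful resource for future research on Archaic Chinese.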
Keywords:	Archaic Chinese corpus; word segmentation; part-of-speech tagging; domain adaptation
The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi

LAU Kam tang, SONG Yan, XIA Fei. The Construction of a Segmented and Part-of-speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi[J]. Journal of Chinese Information Processing, 2013, 27(6): 6-16.

Authors:	LAU Kam tang, SONG Yan, XIA Fei

Affiliation:	1. Department of Chinese, Translation & Linguistics, City University of Hong Kong, Hong Kong, China; 2. Hong Kong Community College, The Hong Kong Polytechnic University, Hong Kong, China; 3. Department of Linguistics, University of Washington, Seattle, Washington, USA

Abstract:	In this paper, we present a word-segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, which combines automatic segmentation and tagging with manual correction as post-processing. We train the word segmenter and POS tagger on both Modern and Archaic Chinese labeled data, and further improve them with domain adaptation techniques and with linguistic and morphological features derived from the characteristics of Archaic Chinese. The experimental results show the effectiveness of our approach. In particular, the domain adaptation techniques and the added features significantly improve POS tagging performance. During manual correction, we categorize the errors resulting from the automatic segmentation and POS tagging process and investigate their sources. Finally, we give statistics on the distributions of words and POS tags in the resulting corpus. Our work is a preliminary study that could be easily extended to annotating other Archaic Chinese texts, and the resulting corpus is a valuable resource for research on the Archaic Chinese language.

Keywords:	Archaic Chinese corpus; word segmentation; part-of-speech tagging; domain adaptation
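
Illustrative sketch:	The abstract mentions training on both Modern and Archaic Chinese labeled data with domain adaptation, but this record does not name the specific technique. As a minimal sketch only, the Python below shows one well-known option, feature augmentation, in which every feature is duplicated into a shared copy and a domain-specific copy so that a single model can learn which cues transfer across domains. The function names, the domain labels `modern` and `archaic`, the character-window features, and the sample strings are all assumptions made for illustration; none of them come from the paper.

```python
from typing import Dict


def char_features(sentence: str, i: int) -> Dict[str, float]:
    """Hypothetical character-window features for the character at position i."""
    prev_char = sentence[i - 1] if i > 0 else "<BOS>"
    next_char = sentence[i + 1] if i + 1 < len(sentence) else "<EOS>"
    return {
        f"cur={sentence[i]}": 1.0,
        f"prev={prev_char}": 1.0,
        f"next={next_char}": 1.0,
    }


def augment_features(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Feature augmentation: emit a shared copy and a domain-specific copy of each feature."""
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"shared:{name}"] = value    # seen by examples from every domain
        augmented[f"{domain}:{name}"] = value  # seen only by examples from this domain
    return augmented


# Illustrative strings only; they are not taken from the corpus described above.
modern = augment_features(char_features("我们学习", 1), "modern")
archaic = augment_features(char_features("天地之道", 2), "archaic")
```

The augmented dictionaries could be passed to any linear or CRF-style tagger that accepts sparse feature maps. Weights on `shared:` features are learned from both domains, while `archaic:` weights are free to diverge where Archaic Chinese usage differs from Modern Chinese.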
