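Three Areas That Word Segmentation Guidelines Urgently Need to Supplement
Cite this article: LI Yu-mei, CHEN Xiao, JIANG Zi-xia, YI Jiang-yan, JIN Guang-jin, HUANG Chang-ning. 分词规范亟需补充的三方面内容[J]. 中文信息学报 (Journal of Chinese Information Processing), 2007, 21(5): 1-7.
Authors: LI Yu-mei  CHEN Xiao  JIANG Zi-xia  YI Jiang-yan  JIN Guang-jin  HUANG Chang-ning
Affiliations: 1. Institute of Applied Linguistics, Ministry of Education, Beijing 100010, China;
2. Microsoft Research Asia, Beijing 100080, China
Funding: National Key Basic Research Program of China (973 Program); Major Research Project of the State Language Commission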
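Abstract: This paper argues that, to improve the quality of word segmentation annotation in corpora, segmentation guidelines should be supplemented with three kinds of content: (1) detailed rules for annotating named entities (person names, location names, organization names); (2) detailed rules for annotating factoid strings (dates, times, percentages, etc.); and (3) detailed rules for resolving ambiguous strings. There are two reasons: many segmented corpora already treat named entities and factoid strings as segmentation units, and earlier segmentation guidelines almost never address ambiguity resolution. In fact, people's intuitions about ambiguous strings often differ, so the guidelines need to explain typical ambiguous strings explicitly. Practice shows that spelling out these three kinds of content in the guidelines avoids annotation errors and inconsistencies to a large extent.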

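Keywords: computer application  Chinese information processing  corpus  word segmentation guidelines  word segmentation disambiguation
Article ID: 1003-0077(2007)05-0003-05
Received: 2007-04-10
Revised: 2007-06-29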

Three Complements to Make Better Guideline of Chinese Word Segmentation
LI Yu-mei,CHEN Xiao,JIANG Zi-xia,YI Jiang-yan,JIN Guang-jin,HUANG Chang-ning.Three Complements to Make Better Guideline of Chinese Word Segmentation[J].Journal of Chinese Information Processing,2007,21(5):1-7.
Authors:LI Yu-mei  CHEN Xiao  JIANG Zi-xia  YI Jiang-yan  JIN Guang-jin  HUANG Chang-ning
Affiliation: 1. Institute of Applied Linguistics, Ministry of Education, P.R.C., Beijing 100010, China;
2. Microsoft Research Asia, Beijing 100080, China
Abstract: This paper proposes three additions to Chinese word segmentation guidelines that are essential for building high-quality segmented Chinese corpora: tagging rules for named entities (person names, location names, and organization names), tagging rules for factoids (dates, times, percentages, etc.), and disambiguation rules. These additions are needed because many corpora already treat named entities and factoids as segmentation units, while earlier segmentation guidelines seldom address disambiguation. In fact, people often have different intuitions about ambiguous strings, so segmentation guidelines need to explain typical cases. Our practice shows that specifying these rules in the guidelines helps reduce errors and inconsistencies in annotated corpora.
Keywords:computer application  Chinese information processing  corpus  guideline of Chinese word segmentation  word segmentation disambiguation
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.