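中文分词十年回顾 (Chinese Word Segmentation: A Decade Review)
Cite as: 黄昌宁 (HUANG Chang-ning), 赵海 (ZHAO Hai). 中文分词十年回顾 [J]. 中文信息学报 (Journal of Chinese Information Processing), 2007, 21(3): 8-19
Authors: 黄昌宁 (HUANG Chang-ning), 赵海 (ZHAO Hai)
Affiliations: 1. Microsoft Research Asia, Beijing 100080; 2. City University of Hong Kong, Hong Kong
Funding: National Natural Science Foundation of China (60621062); National 973 Program (2003CB317007, 2004CB318108)
Abstract: Over the past decade, and especially since the international Chinese word segmentation evaluation Bakeoff began in 2003, automatic Chinese word segmentation has made encouraging progress. The main results are: (1) the "segmentation guidelines + lexicon + segmented corpus" approach has given Chinese words in real text a computable definition, which is the basis for automatic segmentation by computer and for comparable evaluation; (2) practice has shown that segmentation systems based on handcrafted rules lose to systems based on statistical learning in evaluations; (3) evaluation on Bakeoff data shows that the loss in segmentation accuracy caused by out-of-vocabulary (OOV) words is at least five times that caused by segmentation ambiguity; (4) experiments show that the character-based tagging approach to statistical learning, which substantially improves OOV word recognition, outperforms earlier word-based (or dictionary-based) methods and has raised the accuracy of automatic segmentation systems to a new high.

Keywords: computer application; Chinese information processing; Chinese word segmentation; definition of words; out-of-vocabulary word recognition; character-based tagging approach to segmentation
Article ID: 1003-0077(2007)03-0008-012
Received: 2007-03-22
Revised: 2007-03-22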

Chinese Word Segmentation: A Decade Review
HUANG Chang-ning, ZHAO Hai. Chinese Word Segmentation: A Decade Review[J]. Journal of Chinese Information Processing, 2007, 21(3): 8-19
Authors: HUANG Chang-ning, ZHAO Hai
Affiliation: 1. Microsoft Research Asia, Beijing 100080, China; 2. City University of Hong Kong, Hong Kong, China
Abstract: Over the last decade, and especially since the First International Chinese Word Segmentation Bakeoff was held in July 2003, research on automatic Chinese word segmentation has advanced considerably. The main advances can be summarized as follows: (1) in the computational sense, Chinese words in real text have been well defined by "segmentation guidelines + lexicon + segmented corpus"; (2) practical results show that statistical segmentation systems outperform handcrafted rule-based systems; (3) evaluation on Bakeoff data shows that the accuracy drop caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) the better the OOV recognition, the higher the overall accuracy of the segmentation system, and statistical segmentation systems using the character-based tagging approach outperform any word-based system.
Keywords: computer application; Chinese information processing; Chinese word segmentation (CWS); definition of words; out-of-vocabulary (OOV) word recognition; character-based tagging approach to CWS
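
Note: the character-based tagging approach named in the abstract treats segmentation as assigning a position label to every character rather than looking words up in a dictionary. The sketch below illustrates only that data representation. The four-tag B/M/E/S set (begin, middle, end, single) and the function names `words_to_tags` and `tags_to_words` are assumptions for illustration; the abstract does not specify the tag set or the learner, and the statistical model that would predict the tags is omitted.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    The B/M/E/S tag set is an assumption; the abstract names no tag set.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their tags.

    A word is closed whenever an E or S tag is seen.
    """
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # flush an unterminated trailing word
        words.append(current)
    return words


segmented = ["中文", "分词", "十", "年", "回顾"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Because the labels are attached to characters, a word that is absent from any lexicon can still be recovered as long as its characters receive the right tags, which is consistent with the abstract's link between this approach and better OOV recognition.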
This article is indexed in 维普 (VIP), 万方数据 (Wanfang Data), and other databases.