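Chinese Person Name Recognition Combining Boundary Templates and Local Statistics
Cite this article:LI Zhong-guo, LIU Ying. Chinese Person Name Recognition Combining Boundary Templates and Local Statistics[J]. Journal of Chinese Information Processing, 2006, 20(5): 46-52
Authors:LI Zhong-guo  LIU Ying
Affiliation:Lab of Computational Linguistics, Department of Chinese Language and Literature, Tsinghua University
Funding:Tsinghua University research and teaching reform project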
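Abstract:This paper proposes a Chinese person name recognition algorithm based on document-level information. The words at the left and right boundaries of person names, together with the frequencies of characters used in names, are extracted from an annotated corpus and serve as the system's knowledge source. Recognition proceeds in two stages. First, frequency-weighted boundary templates identify candidate person names, and the recognized names are propagated across the whole document to recall names missed because of data sparseness. Then contextual local statistics and a few heuristic rules are applied to correct the boundaries of the recognized names. The algorithm has linear time complexity. In a large-scale open test (1,354 news reports, about 3.04 million characters, containing about 37,000 person names), precision was 94.52% and recall was 98.97%, a highly satisfactory result.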

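Keywords:computer application  Chinese information processing  person name recognition  named entity recognition  boundary template  local statistics  lexical analysis
Article number:1003-0077(2006)05-0044-07
Received:2005-09-14
Revised:2006-07-13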
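
The first stage described in the abstract (boundary-template matching followed by propagation over the whole document) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy boundary-word tables, the frequency threshold `MIN_SCORE`, the name-length limits, and the requirement that both a left and a right boundary word be present. The paper extracts its own templates from a tagged corpus and does not publish them on this page.

```python
import re

# Hypothetical toy knowledge source (word -> frequency next to a name).
# The paper derives its tables from a tagged corpus; these values are made up.
LEFT_BOUNDARY = {"记者": 120, "主席": 80, "教授": 60}
RIGHT_BOUNDARY = {"说": 200, "表示": 150, "指出": 90}
MIN_LEN, MAX_LEN = 2, 3   # assumed name length in characters
MIN_SCORE = 50            # assumed minimum template frequency


def find_candidates(text):
    """Return strings that sit between a known left and right boundary word."""
    found = set()
    for left, left_freq in LEFT_BOUNDARY.items():
        start = text.find(left)
        while start != -1:
            begin = start + len(left)
            for n in range(MIN_LEN, MAX_LEN + 1):
                name = text[begin:begin + n]
                rest = text[begin + n:]
                if len(name) < n:
                    continue  # ran off the end of the text
                for right, right_freq in RIGHT_BOUNDARY.items():
                    if rest.startswith(right) and min(left_freq, right_freq) >= MIN_SCORE:
                        found.add(name)
            start = text.find(left, start + 1)
    return found


def propagate(text, names):
    """Mark every occurrence of an already recognized name in the whole text."""
    spans = []
    for name in names:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), name))
    return sorted(spans)


text = "记者李明说,会议很成功。李明还介绍了计划。"
names = find_candidates(text)
print(names)                   # {'李明'}
print(propagate(text, names))  # [(2, 4, '李明'), (12, 14, '李明')]
```

In this toy text the second mention of 李明 has no boundary word on either side, so template matching alone would miss it; the propagation step recovers it, which is the recall effect the abstract attributes to spreading results across the document.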

Chinese Name Recognition Based on Boundary Templates and Local Frequency
LI Zhong-guo, LIU Ying. Chinese Name Recognition Based on Boundary Templates and Local Frequency[J]. Journal of Chinese Information Processing, 2006, 20(5): 46-52
Authors:LI Zhong-guo  LIU Ying
Affiliation:Lab of Computational Linguistics, Department of Chinese Language and Literature, Tsinghua University
Abstract:This paper proposes an effective algorithm for Chinese person name recognition. The left and right boundary words of person names and the frequencies of characters used in names are extracted from a tagged corpus and serve as the knowledge source for recognition. First, these boundary templates are used to find possible person names. The recognized names are then matched against the rest of the text to recover missed occurrences. Finally, local frequencies obtained from the whole text are used to check and correct the name boundaries. The algorithm runs in linear time, and a test on 1,354 news articles (3.04 million Chinese characters and 37,014 Chinese names in all) gives a precision of 94.52% and a recall of 98.97%, which is satisfactory in comparison with other published algorithms.
Keywords:computer application  Chinese information processing  person name recognition  named entity recognition  boundary template  local frequency  lexical analysis
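
The boundary-correction step mentioned in the abstract relies on local frequencies computed from the text being processed. The page does not give the actual statistic or the heuristic rules, so the following Python sketch only illustrates the general idea under a simplifying assumption: a candidate is trimmed on the right whenever the shorter string occurs strictly more often in the same text. The function name `correct_boundary` and the `min_len` parameter are hypothetical.

```python
def correct_boundary(text, candidate, min_len=2):
    """Trim trailing characters while the shorter string is more frequent.

    Assumed rule for illustration only: if dropping the last character
    yields a string that occurs strictly more often in the same text,
    the dropped character is treated as context, not part of the name.
    """
    best = candidate
    while len(best) > min_len:
        shorter = best[:-1]
        if text.count(shorter) > text.count(best):
            best = shorter
        else:
            break
    return best


text = "王小明说要来。王小明表示同意。王小明还在。"
print(correct_boundary(text, "王小明说"))  # 王小明
print(correct_boundary(text, "王小明"))    # 王小明
```

Here 王小明说 occurs once while 王小明 occurs three times, so the trailing 说 is dropped; 王小 is no more frequent than 王小明, so trimming stops there. A fuller version would also need to handle over-short candidates and errors at the left boundary, which this sketch leaves out.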
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.