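Effective Subsequence-Based Tagging for Chinese Word Segmentation (基于有效子串标注的中文分词)
Cite this article: ZHAO Hai, Chunyu Kit. Effective Subsequence-Based Tagging for Chinese Word Segmentation[J]. Journal of Chinese Information Processing (中文信息学报), 2007, 21(5): 8-13
Authors: ZHAO Hai (赵海), Chunyu Kit (揭春雨)
Affiliation: Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Funding: City University of Hong Kong SRG project; CERG research project funded by the Hong Kong Special Administrative Region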
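Abstract: Driven by the rise of learning methods and frameworks based on segmented corpora, Chinese word segmentation achieved notable breakthroughs in the first years of this century. In particular, since the international Chinese word segmentation evaluation, Bakeoff, was launched in 2003, statistical learning methods based on character tagging have attracted wide attention. This paper explores how to generalize this learning framework: it uses a more reliable algorithm to find longer tagging units for large-scale corpus learning in Chinese word segmentation, while addressing shortcomings of previous work. We propose a general framework for subsequence tagging that consists of two steps: first, an iterative maximum matching filtering algorithm that determines the effective subsequence lexicon; second, a bi-lexicon maximum matching algorithm that identifies subsequence units in a given text. The effectiveness of the method is verified on the Bakeoff-2005 evaluation corpora.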

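Keywords: computer application; Chinese information processing; Chinese word segmentation; subsequence-tagging-based word segmentation
Article ID: 1003-0077(2007)05-0008-06
Received: 2007-04-25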
Revised: 2007-06-25

Effective Subsequence-Based Tagging for Chinese Word Segmentation
ZHAO Hai, Chunyu Kit. Effective Subsequence-Based Tagging for Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2007, 21(5): 8-13
Authors: ZHAO Hai, Chunyu Kit
Affiliation: Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong SAR, China
Abstract: Research on automatic Chinese word segmentation has advanced rapidly in recent years, especially since the First International Chinese Word Segmentation Bakeoff, held in 2003. In particular, character-based tagging has achieved great success in this field. In this paper, we attempt to generalize this method to subsequence-based tagging. Our goal is to find longer tagging units through a reliable algorithm. We propose a two-step framework for this purpose. In the first step, an iterative maximum matching filtering algorithm is applied to obtain an effective subsequence lexicon; in the second step, a bi-lexicon based maximum matching algorithm is employed to identify subsequence units. The effectiveness of this approach is verified by our experiments on two closed test data sets from Bakeoff-2005.
Keywords: computer application; Chinese information processing; Chinese word segmentation (CWS); subsequence-based tagging approach to CWS
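
Both steps named in the abstract build on maximum matching against a lexicon. The abstract does not give the details of the iterative filtering or the bi-lexicon variant, so the following is only a minimal sketch of plain forward maximum matching, not the authors' algorithm. The function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Split text into units by greedy left-to-right longest match.

    Hypothetical helper for illustration only: characters that start no
    lexicon entry are emitted as single-character units.
    """
    if max_len is None:
        # Longest lexicon entry bounds the candidate window.
        max_len = max((len(w) for w in lexicon), default=1)
    max_len = max(1, max_len)  # guard so every step consumes at least one character

    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                units.append(candidate)
                i += length
                break
    return units


# Toy lexicon (an assumption, not data from the paper).
toy_lexicon = {"中文", "分词", "中文分词", "方法"}
print(forward_max_match("中文分词的方法", toy_lexicon))
# → ['中文分词', '的', '方法']
```

In a subsequence-based tagging setup, units such as these, rather than single characters, would be the items that receive tags; how the lexicon itself is selected is the job of the first step and is not covered by this sketch.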
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.