
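Character-based word-position tagging for Chinese word segmentation
Cite this article: YU Jiang-de, SUI Dan, FAN Xiao-zhong. Character-based word-position tagging for Chinese word segmentation[J]. Journal of Shandong University (Engineering Science), 2010, 40(5): 117-122
Authors: YU Jiang-de  SUI Dan  FAN Xiao-zhong
Affiliations: 1. School of Computer and Information Engineering, Anyang Normal University, Anyang 455002, Henan, China; 2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Funding: Specialized Research Fund for the Doctoral Program of Higher Education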
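Abstract: In recent years, character-based word-position tagging has greatly improved the performance of Chinese word segmentation. The method recasts segmentation as the problem of tagging each character with its position within a word and, supported by strong sequence labeling models, it has gradually become the main technical approach to Chinese word segmentation. The choice of feature templates is critical to this method. Using a four-tag word-position set and a conditional random field model, this paper further studies character-based word-position tagging for Chinese word segmentation. Closed tests were run on the corpora of the third and fourth international Chinese word segmentation Bakeoffs, and the effect of different feature template sets on segmentation performance was compared. The experiments show that the feature template set adopted here, TMPT-10′, gives better segmentation performance than the traditional feature template set.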

Keywords: Chinese word segmentation  conditional random fields  word-position tagging  feature template
Received: 2010-01-30

Word-position-based tagging for Chinese word segmentation
YU Jiang-de, SUI Dan, FAN Xiao-zhong. Word-position-based tagging for Chinese word segmentation[J]. Journal of Shandong University (Engineering Science), 2010, 40(5): 117-122
Authors:YU Jiang-de  SUI Dan  FAN Xiao-zhong
Affiliation:1. School of Computer and Information Engineering, Anyang Normal University, Anyang 455002, China;2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract: Word-position-based approaches have greatly improved the performance of Chinese word segmentation in recent years. These approaches treat Chinese word segmentation as a word-position tagging problem: each character is labeled with its position within a word. With the help of powerful sequence tagging models, the word-position-based method has become a mainstream technique in this field. Feature template selection is crucial in this method. We further study the technique using a four-tag word-position set and conditional random fields. Closed evaluations are performed on corpora from the third and fourth international Chinese word segmentation Bakeoffs, and comparative experiments are performed on different feature template sets. Experimental results show that the feature template set TMPT-10′ performs better than the traditional template set.
Keywords:Chinese word segmentation  conditional random fields  word-position tagging  feature template
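
The recasting of segmentation as word-position tagging can be illustrated with a minimal Python sketch. This page does not list the four tags, so the sketch assumes the commonly used set B (word-initial), M (word-medial), E (word-final), and S (single-character word). The function names `words_to_tags` and `tags_to_words` are hypothetical, and the sketch covers only the conversion between segmented text and tag sequences; it does not implement the conditional random field model or the TMPT-10′ feature templates, which are not described here.

```python
def words_to_tags(words):
    """Map a segmented sentence to one word-position tag per character.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the segmentation from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word closes on E (end of a multi-character word) or S (single).
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        # Tolerate an ill-formed tag sequence that ends mid-word.
        words.append(current)
    return words


# Example sentence chosen for illustration only.
words = ["汉语", "分词", "是", "基础性", "任务"]
chars = "".join(words)
tags = words_to_tags(words)
print(list(zip(chars, tags)))
print(tags_to_words(chars, tags))
```

In this framing, a sequence labeling model only has to predict one tag per character; the segmentation then follows deterministically from the tag sequence, as `tags_to_words` shows.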
This article is indexed in CNKI, Wanfang Data, and other databases.