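基于N-最短路径方法的中文词语粗分模型 (Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method)
Cite this article: 张华平, 刘群. 基于N-最短路径方法的中文词语粗分模型 [J]. 中文信息学报, 2002, 16(5): 3-9.
Authors: 张华平 (ZHANG Hua-ping), 刘群 (LIU Qun)
Affiliation: Software Laboratory, Institute of Computing Technology, Chinese Academy of Sciences
Funding: National Key Basic Research Program of China (G1998030507-4, G1998030510).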
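Abstract: Rough word segmentation during preprocessing is the foundation of Chinese lexical analysis as a whole, and it strongly affects the final recall, precision, and running efficiency. Rough segmentation must supply the subsequent stages with a small number of intermediate results that have high recall. This paper proposes a rough segmentation model based on the N-shortest-paths method, aiming to achieve both high recall and high efficiency. On this basis, word-frequency statistics are introduced to improve the original model, yielding a more practical statistical model. The authors ran rough segmentation experiments on a one-month corpus of the People's Daily (185,192 sentences in total). Measured per sentence, the 2-shortest-paths non-statistical model reaches a recall of 99.73%; with the 10-shortest-paths statistical model, an average of 6.12 rough segmentation results per sentence yields a recall as high as 99.94%, which is 15% higher than maximum matching and at least 6.4% higher than the best previous segmentation method. The average number of rough segmentation results is also reduced by a factor of 64 compared with full segmentation. The experimental results show that the N-shortest-paths method is a practical and effective means of rough word segmentation in preprocessing.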

Keywords: N-shortest-paths method; rough segmentation; Chinese lexical analysis
Revision date: 18 December 2001

Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method
ZHANG Hua-ping, LIU Qun. Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method [J]. Journal of Chinese Information Processing, 2002, 16(5): 3-9.
Authors: ZHANG Hua-ping, LIU Qun
Affiliation: Software Division, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, China
Abstract: As the very first step of Chinese word segmentation, rough segmentation tries to cover the correct segmentation with as few candidates as possible. This paper presents a rough segmentation model based on the N-shortest-paths method to achieve this goal. In addition, a statistical model can easily be obtained by attaching word frequencies to the edges of the word graph. Experiments were carried out on a one-month news corpus of 185,192 sentences from the People's Daily. Measured per sentence, the recall of the non-statistical model based on the 2-shortest-paths method is 99.73%. When the statistical model is applied, a recall as high as 99.94% can be reached with 6.12 candidates on average, nearly 6.4% higher than the best known approach and 15% higher than maximum matching segmentation. In addition, the average number of segmentation candidates is reduced by a factor of 64 compared with full segmentation. The results show that the N-shortest-paths method is effective for the task of rough segmentation.
Keywords: N-shortest-paths method; rough word segmentation; Chinese lexical analysis
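
The abstract describes the approach only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' implementation. It assumes that the word graph has one node per character boundary and one edge per lexicon word (plus every single character as a fallback), that the non-statistical model gives each edge a cost of 1, and that the statistical model uses the negative log of an add-one smoothed relative frequency. The function name `n_shortest_segmentations`, the toy lexicon, the frequency counts, and the smoothing scheme are all hypothetical, and the sketch simply keeps the `n` cheapest paths per node, which may differ from how the paper treats paths of equal length.

```python
import heapq
import math


def n_shortest_segmentations(sentence, lexicon, n=2, freq=None):
    """Return up to n lowest-cost segmentations as (cost, words) pairs.

    Nodes are the character boundaries 0..len(sentence). Every lexicon word
    spanning i..j is an edge, and every single character is a fallback edge.
    """
    total = sum(freq.values()) if freq else 0
    max_len = max((len(w) for w in lexicon), default=1)

    def cost(word):
        if not freq:
            return 1.0  # assumed non-statistical model: each word costs 1
        # assumed statistical model: -log of an add-one smoothed frequency
        return -math.log((freq.get(word, 0) + 1) / (total + len(lexicon) + 1))

    # best[j] holds up to n (cost, words) pairs for paths from node 0 to node j
    best = [[] for _ in range(len(sentence) + 1)]
    best[0] = [(0.0, [])]
    for j in range(1, len(sentence) + 1):
        candidates = []
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            if len(word) == 1 or word in lexicon:
                c = cost(word)
                candidates.extend((pc + c, ws + [word]) for pc, ws in best[i])
        # keeping only the n cheapest partial paths is enough for a DAG
        best[j] = heapq.nsmallest(n, candidates)
    return best[len(sentence)]


# Toy lexicon and counts, invented purely for illustration.
toy_lexicon = {"研究", "研究生", "生命", "命"}
toy_freq = {"研究": 50, "生命": 40, "研究生": 5, "命": 2}

for path_cost, words in n_shortest_segmentations("研究生命", toy_lexicon, n=2):
    print(path_cost, "/".join(words))
```

With unit costs, both two-word readings of the ambiguous string survive as candidates, which is the point of rough segmentation: the ambiguity is passed on to later stages instead of being resolved too early. Passing `freq=toy_freq` switches to the frequency-weighted costs and ranks the candidates by them.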
This article is indexed in CNKI, 维普 (VIP), 万方数据 (Wanfang Data), and other databases.