
Improved Katz smoothing algorithms with POS information
Cite this article: ZHAO Yan, WANG Xiao-long, XU Zhi-ming, LIU Bing-quan. Improved Katz smoothing algorithms with POS information [J]. Journal of Harbin Institute of Technology, 2007, 39(9): 1445-1448.
Authors: ZHAO Yan, WANG Xiao-long, XU Zhi-ming, LIU Bing-quan
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Funding: Key Program of the National Natural Science Foundation of China (60435020); National High Technology Research and Development Program of China (2002AA117010-09)
Abstract: This paper systematically analyzes existing N-gram smoothing algorithms and implements the Absolute, W-B (Witten-Bell), and Katz smoothing algorithms. The traditional Katz smoothing algorithm cannot discount probabilities when it handles certain fixed Chinese collocations. To resolve this problem, a new discount coefficient is constructed from part-of-speech (POS) information. Under the new coefficient, the discount becomes smaller as word frequency increases and larger as the number of following words increases, which is the behavior a smoothing algorithm expects of a discount coefficient. Experimental results show that the improved Katz smoothing algorithm lowers the cross entropy of the N-gram model and, when applied to Chinese word segmentation, raises the F-measure of the segmentation results.

Keywords: N-gram model; data sparseness; POS information; Katz smoothing
Article ID: 0367-6234(2007)09-1445-04
Revised: 2005-05-08
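
The abstract does not give the new POS-based discount coefficient, so the sketch below only illustrates the textbook Katz discount (Good-Turing based, with a count threshold `k`) that the paper sets out to improve. The function names, the threshold value, and the toy counts are assumptions made for illustration. One plausible reading of the problem the abstract mentions, not confirmed by it, is that counts above `k` are left undiscounted, so a context whose only followers are high-frequency words frees no probability mass for backoff.

```python
from collections import Counter


def katz_discounts(ngram_counts, k=5):
    """Textbook Katz discount ratios d_r for counts 1 <= r <= k.

    ngram_counts maps each n-gram to its observed count.
    Counts above k are treated as reliable and are not discounted.
    """
    # n[r] = number of distinct n-grams observed exactly r times
    n = Counter(ngram_counts.values())
    if n[1] == 0:
        raise ValueError("need at least one n-gram seen exactly once")
    common = (k + 1) * n[k + 1] / n[1]
    if common == 1:
        raise ValueError("degenerate count-of-counts; discount undefined")
    discounts = {}
    for r in range(1, k + 1):
        if n[r] == 0:
            continue  # no n-gram has this count, nothing to discount
        r_star = (r + 1) * n[r + 1] / n[r]  # Good-Turing adjusted count
        discounts[r] = (r_star / r - common) / (1 - common)
    return discounts


def backoff_mass(follower_counts, discounts):
    """Probability mass left for unseen followers of one context."""
    total = sum(follower_counts.values())
    # Counts without a discount entry (r > k) keep their full weight.
    kept = sum(discounts.get(c, 1.0) * c for c in follower_counts.values())
    return 1.0 - kept / total


if __name__ == "__main__":
    # Hypothetical toy data: six bigrams seen once, two seen twice, one seen 3 times.
    toy = {("w", i): 1 for i in range(6)}
    toy.update({("x", i): 2 for i in range(2)})
    toy[("y", 0)] = 3
    d = katz_discounts(toy, k=2)
    print(d)
    # A context with a single, frequent follower (a fixed collocation):
    print(backoff_mass({"partner": 50}, d))
```

In this toy setup the ratio `d_r` moves closer to 1 as `r` grows, so frequent n-grams lose less mass, and the single-follower context ends up with a backoff mass of exactly 0. That is the kind of case where an additional signal, such as the POS information used in the paper, would be needed to obtain a nonzero discount.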
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.