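Identification of Chinese Unknown Word Based on Decision Tree
Cite this article: QIN Wen, YUAN Chunfa. Identification of Chinese Unknown Word Based on Decision Tree [J]. Journal of Chinese Information Processing (中文信息学报), 2004, 18(1): 15-20.
Authors: QIN Wen, YUAN Chunfa
Affiliation: State Key Laboratory of Intelligent Technology and Systems; Department of Computer Science and Technology, Tsinghua University
Funding: National Natural Science Foundation of China; National Key Basic Research Program of China (973 Program)
Abstract: Unknown word recognition is one of the difficult problems in Chinese word segmentation. In the automatic segmentation of large-scale Chinese text, unknown words are a major cause of segmentation errors. This paper first treats unknown word recognition as a classification problem: the fragments left over after the segmenter has run are classified as either "merge" (合, combine into one unknown word) or "split" (分, separate into two single-character words). A decision tree is then used to solve this classification problem. Six kinds of knowledge are collected from a corpus and a Modern Chinese morpheme database: the probability that the first character forms a word in word-initial position, the probability that the second character forms a word in word-final position, the freedom of the first character, the freedom of the second character, mutual information, and the co-occurrence probability of single-character words. These are used as attributes to build the training set, and the decision tree is generated with the C4.5 algorithm. Applied where the segmenter has already recognized a certain number of unknown words but segmentation fragments remain, the method achieves a recall of 69.42% and a precision of 40.41% in an open test. The results indicate that decision-tree-based unknown word recognition is an approach worth further investigation.

Keywords: artificial intelligence; natural language processing; unknown word recognition; data mining; decision tree; C4.5 algorithm
Article ID: 1003-0077(2004)01-0014-06
Revised: May 8, 2003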
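
The abstract lists mutual information among the six attributes but does not give its formula. As a rough illustration only, the sketch below computes the pointwise mutual information of an adjacent character pair from raw corpus counts. The function name `pair_mutual_information`, the maximum-likelihood estimates, and the base-2 logarithm are all assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def pair_mutual_information(corpus, first, second):
    """Pointwise mutual information (in bits) of the adjacent pair (first, second).

    `corpus` is an iterable of strings. Probabilities are plain relative
    frequencies; this is an illustrative assumption, not the paper's estimator.
    """
    chars = Counter()
    bigrams = Counter()
    for sentence in corpus:
        chars.update(sentence)                       # single-character counts
        bigrams.update(zip(sentence, sentence[1:]))  # adjacent-pair counts

    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        # The pair never occurs together, so log2(0) is treated as -infinity.
        return float("-inf")

    p_xy = pair_count / sum(bigrams.values())
    p_x = chars[first] / sum(chars.values())
    p_y = chars[second] / sum(chars.values())
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    toy_corpus = ["abab", "ab"]  # hypothetical toy data, not from the paper
    print(pair_mutual_information(toy_corpus, "a", "b"))
```

A high value suggests that the two characters co-occur more often than chance would predict, which is the kind of evidence that would favour the "merge" class for a fragment.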

Identification of Chinese Unknown Word Based on Decision Tree
QIN Wen, YUAN Chunfa. Identification of Chinese Unknown Word Based on Decision Tree [J]. Journal of Chinese Information Processing, 2004, 18(1): 15-20.
Authors: QIN Wen, YUAN Chunfa
Affiliation: State Key Laboratory of Intelligent Technology & System, Dept. of Computer Science & Technology, Tsinghua University
Abstract: Unknown words are a major cause of segmentation errors in the automatic word segmentation of large Chinese texts, and recognizing them is one of the difficult problems in segmentation. This article first treats unknown word recognition as a classification problem: the fragments left after segmentation are divided into two categories, "combination" (the fragments are combined into one unknown word) and "segregation" (the fragments are kept as two single-character words). A decision tree is then used to solve this classification problem. Six kinds of knowledge are collected from a corpus and a modern Chinese morpheme database: the word-formation probability of the former character in front position, the word-formation probability of the latter character in end position, the freedom of the former character, the freedom of the latter character, mutual information, and the co-occurrence probability of single-character words. The training set is constructed using these as attributes, and the decision tree is generated with the C4.5 algorithm. The method is intended for the case where segmentation has already recognized some unknown words but segmentation fragments still remain. In an open test, its recall is 69.42% and its precision is 40.41%. The experimental results show that this decision-tree-based recognition method is worth further study.
Keywords: artificial intelligence; natural language processing; unknown word recognition; data mining; decision tree; C4.5 algorithm
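
C4.5 chooses splits by gain ratio, that is, information gain normalized by the split information. The sketch below evaluates a single binary threshold split on one numeric attribute for a two-class problem. The labels "merge" and "split", the attribute values, and the function names are hypothetical and chosen for illustration; this is not the authors' implementation and it omits the rest of C4.5 (threshold search, pruning, handling of missing values).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        # A split that leaves one side empty carries no information.
        return 0.0

    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = (
        entropy(labels)
        - weights[0] * entropy(left)
        - weights[1] * entropy(right)
    )
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


if __name__ == "__main__":
    # Hypothetical attribute values (e.g. a mutual-information score) and classes.
    scores = [0.1, 0.2, 0.8, 0.9]
    classes = ["split", "split", "merge", "merge"]
    print(gain_ratio(scores, classes, threshold=0.5))
```

In a full tree learner, this score would be computed for every candidate attribute and threshold, and the best-scoring split would become the next node.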
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.