Algorithm to recognize unknown Chinese words based on BBS corpus

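Cite this article:	DU Jing, XIONG Hai-ling. Algorithm to recognize unknown Chinese words based on BBS corpus [J]. Computer Engineering and Design, 2010, 31(3).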

Authors:	DU Jing, XIONG Hai-ling

Affiliation:	College of Computer and Information Science, Southwest University, Chongqing 400715, China

Funding:	National Natural Science Foundation of China project; Southwest University Doctoral Fund project

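Abstract:	To address the low efficiency of unknown word recognition in Chinese word segmentation, a new method for recognizing unknown Chinese words based on a forum (BBS) corpus is proposed. A web spider downloads forum pages to build a corpus, and the corpus is updated periodically so that its content stays current. A newly constructed statistic, MD (built from the Mutual Information function and the Duplicated Combination Frequency function), is used to segment the corpus and produce a candidate word list. Finally, unknown words are found by comparing the candidate word list with the original lexicon, and the recognized unknown words are added to the lexicon. Experimental results show that the method can effectively improve the efficiency of unknown word recognition.
Keywords:	unknown word; Chinese word segmentation; web spider; forum corpus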
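
The pipeline described in the abstract (score character combinations, build a candidate list, compare it against the lexicon) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the formula for MD, so the sketch substitutes plain pointwise mutual information over adjacent characters plus a minimum-frequency cutoff. The function names `candidate_words` and `unknown_words`, the thresholds, and the sample sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def candidate_words(corpus, min_pmi=1.0, min_freq=2):
    """Score adjacent character pairs with pointwise mutual information.

    Assumption: this stands in for the paper's MD statistic, whose exact
    definition is not given in the abstract.
    """
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    candidates = {}
    for pair, freq in pairs.items():
        if freq < min_freq:  # skip combinations that rarely repeat
            continue
        p_xy = freq / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            candidates[pair] = pmi
    return candidates


def unknown_words(candidates, lexicon):
    """Keep only the candidates that the existing lexicon does not contain."""
    return sorted(word for word in candidates if word not in lexicon)


# Hypothetical forum sentences and a hypothetical one-entry lexicon.
corpus = ["我喜欢给力的论坛", "这个论坛真给力", "给力的帖子"]
lexicon = {"论坛"}
print(unknown_words(candidate_words(corpus), lexicon))
```

On a corpus this small, pairs that are not real words can also pass the thresholds, so the cutoffs would need tuning on real data, and a second signal such as the combination frequency mentioned in the abstract may help filter them.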
Algorithm to recognize unknown Chinese words based on BBS corpus

DU Jing, XIONG Hai-ling. Algorithm to recognize unknown Chinese words based on BBS corpus [J]. Computer Engineering and Design, 2010, 31(3).

Authors:	DU Jing, XIONG Hai-ling

Affiliation:	DU Jing, XIONG Hai-ling (College of Computer and Information Science, Southwest University, Chongqing 400715, China)

Abstract:	To deal with the low efficiency of unknown word recognition in Chinese word segmentation, a new method based on a BBS corpus is presented. A web spider is used to download BBS web pages to build a corpus, and this corpus is updated periodically in order to keep it timely. The new statistic MD (constructed from the mutual information function and the duplicated combination frequency function) is used to segment the corpus and generate a candidate word list. By comparing candidate words list and the previous lexicon to...

Keywords:	unknown word; Chinese word segmentation; web spider; BBS corpus
This article is indexed in CNKI, Wanfang Data, and other databases.
