Information Retrieval Oriented Adaptive Chinese Word Segmentation System


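Cite this article:	CAO Yong-Gang, CAO Yu-Zhong, JIN Mao-Zhong, LIU Chao. Information Retrieval Oriented Adaptive Chinese Word Segmentation System[J]. Journal of Software, 2006, 17(3): 356-363.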

Authors:	CAO Yong-Gang, CAO Yu-Zhong, JIN Mao-Zhong, LIU Chao

Affiliation:	School of Computer Science and Engineering, BeiHang University, Beijing 100083, China

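Abstract:	New word recognition and ambiguity resolution are important factors affecting the accuracy of information retrieval systems. This paper proposes an adaptive Chinese word segmentation algorithm that is based on a statistical model and oriented toward information retrieval. Based on this algorithm, a new word segmentation system, BUAASEISEG, was designed and implemented. It can recognize new words of all kinds in any domain, resolve ambiguities, and segment words of any reasonable length. It uses an iterative bigram segmentation method: it gathers word frequency statistics online from the target document, then uses an offline word frequency dictionary or the inverted index of a search engine to filter candidate words and resolve ambiguities. On top of the statistical model, post-processing with a surname list, a measure word list, and a stop word list further improves accuracy. A comparative evaluation against the well-known ICTCLAS segmentation system, on news articles and research papers, shows that BUAASEISEG has a clear advantage in new word recognition and ambiguity resolution.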
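Keywords:	word segmentation system; word segmentation algorithm; information retrieval; new word recognition; ambiguity resolution
Received:	2005-08-02
Revised:	2005-10-11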
Information Retrieval Oriented Adaptive Chinese Word Segmentation System

CAO Yong-Gang, CAO Yu-Zhong, JIN Mao-Zhong and LIU Chao. Information Retrieval Oriented Adaptive Chinese Word Segmentation System[J]. Journal of Software, 2006, 17(3): 356-363.

Authors:	CAO Yong-Gang, CAO Yu-Zhong, JIN Mao-Zhong and LIU Chao

Affiliation:	School of Computer Science and Engineering, BeiHang University, Beijing 100083, China

Abstract:	New word recognition and ambiguity resolution have a vital effect on the precision of information retrieval. This paper presents an adaptive Chinese word segmentation algorithm based on a statistical model. A new word segmentation system, BUAASEISEG, is designed and implemented using this algorithm. BUAASEISEG can recognize new words in various domains, resolve ambiguities, and segment words of arbitrary length. It uses an iterative bigram method for segmentation. Candidate word selection and disambiguation are performed through online statistical analysis of the target article, combined with an offline word frequency dictionary or the inverted index of a search engine. On top of the statistical methods, post-processing with a stopword list, a quantity suffix word list, and a surname list further improves precision. A comparative evaluation against the well-known Chinese word segmentation system ICTCLAS, using news articles and papers as test text, shows that BUAASEISEG outperforms ICTCLAS in new word recognition and disambiguation.

Keywords:	word segmentation system; word segmentation algorithm; information retrieval; new word recognition; disambiguation
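
The abstract describes gathering frequency statistics online from the target document and growing words iteratively from pairs. The full text is not reproduced on this page, so the following is only a minimal sketch of that general idea and not the BUAASEISEG algorithm itself. The function name `segment`, the `min_count` threshold, the `max_rounds` limit, and the greedy left-to-right merge rule are all assumptions made for illustration. Dictionary or inverted-index filtering and list-based post-processing are omitted.

```python
from collections import Counter


def segment(text, min_count=2, max_rounds=3):
    """Toy iterative pair-merging segmenter (illustrative assumption,
    not the method published in the paper)."""
    units = list(text)  # start from single characters
    for _ in range(max_rounds):
        # Online statistics: count adjacent unit pairs in the target text itself.
        pairs = Counter(zip(units, units[1:]))
        merged, i, changed = [], 0, False
        while i < len(units):
            # Greedily merge a pair that recurs often enough in this document.
            if i + 1 < len(units) and pairs[(units[i], units[i + 1])] >= min_count:
                merged.append(units[i] + units[i + 1])
                i += 2
                changed = True
            else:
                merged.append(units[i])
                i += 1
        units = merged
        if not changed:  # no pair met the threshold, so stop iterating
            break
    return units


print(segment("信息检索系统和信息检索算法"))
# → ['信息检索', '系', '统', '和', '信息检索', '算', '法']
```

Because each round merges already-merged units, repeated strings longer than two characters (here 信息检索) can emerge without a dictionary. Units that occur only once in the document stay as single characters, which is where an external frequency source of the kind the abstract mentions would be needed.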
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.