一种基于生语料的领域词典生成方法 (Method of Special Domain Lexicon Construction Based on Raw Materials)

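Cite this article:	SUN Xia, ZHENG Qing-Hua, WANG Zhao-jing, ZHANG Su-Juan. 一种基于生语料的领域词典生成方法 [Method of Special Domain Lexicon Construction Based on Raw Materials][J]. 小型微型计算机系统 (Mini-micro Systems), 2005, 26(6): 1088-1092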

Authors:	SUN Xia (孙霞), ZHENG Qing-Hua (郑庆华), WANG Zhao-jing (王朝静), ZHANG Su-Juan (张素娟)

Affiliation:	Department of Computer Science, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China

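Funding:	Supported by the National Natural Science Foundation of China (60373105), the National Key Science and Technology Project of the Tenth Five-Year Plan (2001BA101A01), and the Excellent Young Teachers Program of the Ministry of Education.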

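Abstract:	To segment text accurately, a practical Chinese information processing system needs its own domain-specific lexicon. To address the shortcomings of existing lexicon construction methods, this paper proposes a method for constructing a domain lexicon. A general-purpose lexicon is first used to segment a raw (unannotated) domain corpus. A maximum matching algorithm based on segmentation units is then applied to obtain a set of candidate word strings, and this set is refined with rules to produce the final domain lexicon. The generation process is largely automatic, requires little manual intervention, and makes the lexicon easy to update. The domain lexicon generated by this method is already in use in the authors' self-developed "Web-based intelligent question answering system", where it has given good results.
Keywords:	domain lexicon; general-purpose lexicon; word frequency statistics; maximum matching
Article ID:	1000-1220(2005)06-1088-05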
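
The first step described in the abstract, segmenting raw text with a general-purpose lexicon, can be illustrated with a minimal sketch. It uses plain forward maximum matching, a standard dictionary-based technique. The abstract does not state which segmenter the authors used, so the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon below are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (illustrative, not the paper's code).

    At each position the longest lexicon entry is taken; a character that
    matches nothing becomes a single-character unit.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


# Hypothetical general-purpose lexicon, for illustration only.
general_lexicon = {"基于", "智能", "答疑", "系统"}
units = forward_max_match("基于智能答疑系统", general_lexicon)
unknown = forward_max_match("语料库", set())
```

With this toy lexicon the domain term 智能答疑系统 is split into three general-purpose words, which is the situation a domain lexicon is meant to repair: the resulting segmentation units are the input from which candidate domain words are assembled.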
Method of Special Domain Lexicon Construction Based on Raw Materials

SUN Xia,ZHENG Qing-Hua,WANG Zhao-jing,ZHANG Su-Juan. Method of Special Domain Lexicon Construction Based on Raw Materials[J]. Mini-micro Systems, 2005, 26(6): 1088-1092

Authors:	SUN Xia, ZHENG Qing-Hua, WANG Zhao-jing, ZHANG Su-Juan

Abstract:	A special domain lexicon is vital to any practical Chinese information processing system, especially for Chinese word segmentation. To address the limitations of current methods of special domain lexicon construction, this paper proposes a novel Chinese lexicon construction approach for word segmentation. A large amount of raw material for a given special domain is collected in advance, and the longest repeated string patterns are extracted from each raw material after word segmentation based on an open domain lexicon. Non-meaningful words are then trimmed from the candidate word set to improve word extraction accuracy, optimization rules are applied to filter the remaining non-meaningful words, and finally the special domain lexicon is constructed. The proposed method has been implemented and applied in the authors' Web answering system. The experimental results show that it is practical, effective, and extensible.

Keywords:	special domain lexicon; open domain lexicon; word frequency; maximum match
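
The candidate extraction and rule-based filtering outlined in the abstract might look roughly like the sketch below. The abstract gives no algorithmic detail, so everything here is an assumption: candidates are taken to be runs of adjacent segmentation units that repeat at least `min_freq` times, a shorter run is dropped when a longer run with the same count contains it, and the only "rule" shown is a hypothetical stop-unit filter. The names `candidate_strings`, `apply_rules`, `max_units`, `min_freq`, and `stop_units` are illustrative, not taken from the paper.

```python
from collections import Counter


def contains(long_gram, short_gram):
    """True if short_gram occurs as a contiguous run inside long_gram."""
    n = len(short_gram)
    return any(long_gram[i:i + n] == short_gram
               for i in range(len(long_gram) - n + 1))


def candidate_strings(segmented_sentences, max_units=3, min_freq=2):
    """Count runs of 2..max_units adjacent units and keep the longest repeats."""
    counts = Counter()
    for units in segmented_sentences:
        for n in range(2, max_units + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    repeated = {g: c for g, c in counts.items() if c >= min_freq}
    # Drop a run when a longer repeated run with the same count contains it.
    return {
        g: c for g, c in repeated.items()
        if not any(len(o) > len(g) and oc == c and contains(o, g)
                   for o, oc in repeated.items())
    }


def apply_rules(candidates, stop_units):
    """Assumed rule: discard candidates that start or end with a stop unit."""
    return {"".join(g): c for g, c in candidates.items()
            if g[0] not in stop_units and g[-1] not in stop_units}


# Toy corpus, already segmented with a general-purpose lexicon.
sentences = [
    ["智能", "答疑", "系统", "的", "设计"],
    ["基于", "智能", "答疑", "系统"],
    ["系统", "的", "实现"],
]
domain_lexicon = apply_rules(candidate_strings(sentences), stop_units={"的"})
```

On this toy corpus the run 系统的 repeats but is rejected by the stop-unit rule, and the two-unit runs 智能答疑 and 答疑系统 are absorbed into the longer run, leaving 智能答疑系统 with a count of 2 as the only domain lexicon entry.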
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
