
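A Domain Lexicon Generation Method Based on Raw Corpora
Cite this article:SUN Xia, ZHENG Qing-Hua, WANG Zhao-jing, ZHANG Su-Juan. A Domain Lexicon Generation Method Based on Raw Corpora[J]. 小型微型计算机系统 (Mini-micro Systems), 2005, 26(6): 1088-1092
Authors:SUN Xia  ZHENG Qing-Hua  WANG Zhao-jing  ZHANG Su-Juan
Affiliation:Department of Computer Science, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
Funding:Supported by the National Natural Science Foundation of China (60373105), the National "Tenth Five-Year Plan" Key Science and Technology Program (2001BA101A01), and the Ministry of Education Fund for Excellent Young Teachers.
Abstract:To segment text accurately, every practical Chinese information processing system needs its own dedicated domain lexicon. To address the shortcomings of existing lexicon construction methods, this paper proposes a method for constructing a domain lexicon: the raw corpus of the domain is first segmented with a general-purpose lexicon, a maximum matching algorithm based on segmentation units is then applied to obtain a set of candidate word strings, and the candidates are refined with rules to produce the final domain lexicon. The generation process is largely automatic, requires little manual intervention, and is easy to update. The domain lexicon generated by this method is currently used in our independently developed "Web-based Intelligent Question-Answering System", where it has achieved good results.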

Keywords:domain lexicon  general-purpose lexicon  word frequency statistics  maximum matching
Article ID:1000-1220(2005)06-1088-05

Method of Special Domain Lexicon Construction Based on Raw Materials
SUN Xia,ZHENG Qing-Hua,WANG Zhao-jing,ZHANG Su-Juan. Method of Special Domain Lexicon Construction Based on Raw Materials[J]. Mini-micro Systems, 2005, 26(6): 1088-1092
Authors:SUN Xia  ZHENG Qing-Hua  WANG Zhao-jing  ZHANG Su-Juan
Abstract:A special domain lexicon is vital to any practical Chinese information processing system, especially for Chinese word segmentation. To address the limitations of current methods of special domain lexicon construction, this paper proposes a novel Chinese lexicon construction approach for word segmentation. Starting from a large amount of raw, unannotated text collected in advance for a particular domain, each text is first segmented using an open domain lexicon, and the longest repeated string patterns are then extracted from it to form a candidate word set. Non-meaningful words are trimmed from the candidate set to improve extraction accuracy, optimization rules filter the remaining non-meaningful words, and the special domain lexicon is finally constructed. The proposed method has been implemented and applied in our Web answering system. The experimental results show that it is practical, effective, and extensible.
Keywords:special domain lexicon  open domain lexicon  word frequency  maximum match
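
The pipeline outlined in the abstract (segment raw text with a general lexicon, collect repeated runs of segmentation units as candidates, filter the candidates with rules) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the matching details or the optimization rules, so the function names, the three filtering rules, the `min_freq` threshold, the stop-unit list, and the toy lexicon and corpus are all assumptions made for the example.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching against a general-purpose lexicon.

    Characters not covered by any lexicon entry become single-character units.
    """
    units, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


def count_unit_runs(sentences, lexicon, max_units=3):
    """Count runs of 2..max_units adjacent units, never crossing a sentence."""
    counts = Counter()
    for sentence in sentences:
        units = forward_max_match(sentence, lexicon)
        for n in range(2, max_units + 1):
            for i in range(len(units) - n + 1):
                counts[tuple(units[i:i + n])] += 1
    return counts


def is_subrun(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def build_domain_lexicon(sentences, lexicon, stop_units, min_freq=2):
    counts = count_unit_runs(sentences, lexicon)
    # Rule 1 (assumed): keep only runs that repeat at least min_freq times.
    # Rule 2 (assumed): drop runs that start or end with a stop unit.
    kept = {run: c for run, c in counts.items()
            if c >= min_freq
            and run[0] not in stop_units and run[-1] not in stop_units}
    # Rule 3 (assumed): prefer the longest pattern; drop a run that sits
    # inside a longer kept run with the same frequency.
    longest = [run for run in kept
               if not any(len(other) > len(run) and kept[other] == kept[run]
                          and is_subrun(run, other) for other in kept)]
    return sorted("".join(run) for run in longest)


# Hypothetical toy data: a tiny general lexicon and three raw sentences.
general_lexicon = {"数据库", "系统", "支持", "事务", "处理", "依赖", "的"}
corpus = ["数据库系统支持事务处理", "事务处理依赖数据库系统", "数据库系统的事务处理"]
print(build_domain_lexicon(corpus, general_lexicon, stop_units={"的"}))
```

Working on segmentation units rather than raw characters keeps candidates aligned with word boundaries the general lexicon already knows, so a domain term such as a compound of two general words can surface as one repeated run.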
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.