
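Domain Term Extraction Based on Information Entropy and Word Activity
Cite this article: WANG Cheng, Lü Xue-qiang, WANG Hong-wei, WANG Tao. Domain term extraction based on information entropy and word activity[J]. Journal of Beijing Institute of Machinery, 2011, 0(5): 49-52, 58
Authors: WANG Cheng, Lü Xue-qiang, WANG Hong-wei, WANG Tao
Affiliation: Chinese Information Processing Research Center, Beijing Information Science and Technology University, Beijing 100101
Funding: National Science and Technology Major Project on Core Electronic Devices, High-End Generic Chips and Basic Software (2010ZX01042-002-002); National Natural Science Foundation of China (60872133); Beijing Natural Science Foundation (4092015)
Abstract: A domain term extraction method based on information entropy and word activity is proposed. The corpus is first preprocessed to extract candidate domain terms. The normalized between-class distribution (NCD) and normalized within-class distribution (NDD) are then computed for every candidate, and the candidates are filtered against thresholds on these two values. Finally, the common noise words among the two-character candidates are analyzed, and word activity is used to filter the two-character candidates. The method considers both the probability distribution of domain terms across categories and the internal features of the terms themselves. Experimental results show that the method achieves fairly good precision and recall in identifying domain terms.

Keywords: domain term extraction; domain term filtering; information entropy; word activity; knowledge acquisition; natural language processing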

Domain Terms extraction based on entropy and word activity
WANG Cheng, Lü Xue-qiang, WANG Hong-wei, WANG Tao. Domain Terms extraction based on entropy and word activity[J]. Journal of Beijing Institute of Machinery, 2011, 0(5): 49-52, 58
Authors: WANG Cheng, Lü Xue-qiang, WANG Hong-wei, WANG Tao
Affiliation: Chinese Information Processing Research Center, Beijing Information Science and Technology University, Beijing 100101
Abstract: An algorithm for domain term extraction based on entropy and word activity is presented. The corpus is preprocessed to extract candidate terms, the Normalization Corpus Distribution (NCD) and Normalization Domain Distribution (NDD) of each candidate are calculated, and the candidates are filtered by thresholds on NCD and NDD. Because noise words account for a relatively large proportion of the two-character candidate terms, these candidates are further filtered by word activity. The method takes into account both the probability distribution of domain terms across categories and their internal features. Experimental results show that the method achieves good precision and recall in identifying domain terms.
Keywords: domain terms extraction; terms filtration; entropy; word activity; knowledge acquisition; natural language processing
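
The threshold filtering step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the NCD and NDD formulas, so everything here is an assumption: the between-class score is taken to be the normalized Shannon entropy of a term's counts across categories, the within-class score is the normalized entropy of its counts across the documents of the target category, and a candidate is kept when the first is low and the second is high. The names `normalized_entropy`, `filter_candidates`, `max_between`, and `min_within` are hypothetical, and the word-activity filter is not modeled because its definition is not stated here.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a count vector, divided by log(n) so it lies in [0, 1]."""
    n = len(counts)
    total = sum(counts)
    if n < 2 or total == 0:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(n)


def filter_candidates(term_counts, domain, max_between, min_within):
    """Keep terms concentrated in few categories and spread evenly inside `domain`.

    term_counts maps term -> {category: [count in each document of that category]}.
    The thresholds and both scores are assumptions, not the paper's NCD/NDD formulas.
    """
    kept = []
    for term, by_category in term_counts.items():
        # Assumed between-class score: entropy over per-category totals.
        between = normalized_entropy([sum(docs) for docs in by_category.values()])
        # Assumed within-class score: entropy over documents of the target domain.
        within = normalized_entropy(by_category.get(domain, []))
        if between <= max_between and within >= min_within:
            kept.append(term)
    return kept


# Toy counts, invented purely for illustration.
toy_counts = {
    "entropy": {"nlp": [3, 3, 3], "sports": [0, 0, 0]},  # domain-specific, evenly used
    "the": {"nlp": [5, 5, 5], "sports": [5, 5, 5]},      # common to every category
    "typo": {"nlp": [9, 0, 0], "sports": [0, 0, 0]},     # confined to a single document
}
selected = filter_candidates(toy_counts, "nlp", max_between=0.5, min_within=0.8)
print(selected)
```

In this toy data, a word used equally in every category scores high on the between-class entropy and is rejected, while a word that appears in only one document of the target category scores low on the within-class entropy and is also rejected.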
This article is indexed in VIP (维普) and other databases.