
A Method of Automatic Keyword Extraction Based on Co-occurrence Model
Cite this article: XIAO Hong, XU Shao-hua. A Method of Automatic Keyword Extraction based on Co-occurrence Model[J]. Transactions of Shenyang Ligong University, 2009, 28(5): 38-41.
Authors: XIAO Hong, XU Shao-hua
Affiliation: School of Computer and Information Technology, Daqing Petroleum Institute, Daqing 163318, Heilongjiang, China
Funding: National Natural Science Foundation of China; Natural Science Foundation of Heilongjiang Province
Abstract: Keyword extraction is a key step in Chinese information processing. This paper proposes an effective method for automatic keyword extraction. First, an ordinary dictionary is extended: it is trained on a large set of training samples to produce an optimized dictionary whose entries carry TF×IDF values and mutual information (MI). Next, the text is segmented into words paragraph by paragraph using this dictionary, and the segmentation results are ranked by word frequency, weight, co-occurrence relationship, and mutual information to select candidate keywords. Finally, the candidates are merged according to their hypernyms and hyponyms, a threshold is set, and n of the resulting words are taken as the keywords of the article. Extraction experiments on a small test sample set show that the method improves the accuracy of keyword extraction to some extent and gives fairly satisfactory results.

Keywords: automatic keyword extraction; co-occurrence relationship; mutual information; TF×IDF
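
The abstract names the ingredients of the ranking step (TF×IDF weight, paragraph-level co-occurrence, mutual information) but not the formulas behind them. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the text is already segmented into word lists, uses the textbook definitions of TF×IDF and pointwise mutual information, and combines them with a hypothetical `mi_weight` parameter. The function names are illustrative, and the hypernym/hyponym merging step is omitted because the abstract gives no detail on it.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Return one {word: tf * idf} dict per pre-segmented document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        scores.append({w: (c / total) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores


def pmi(paragraphs):
    """Pointwise mutual information of word pairs that share a paragraph."""
    n = len(paragraphs)
    word_count = Counter()
    pair_count = Counter()
    for para in paragraphs:
        words = sorted(set(para))  # sorted so each pair has one canonical order
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    # log( p(a, b) / (p(a) * p(b)) ) with probabilities estimated per paragraph
    return {(a, b): math.log(c * n / (word_count[a] * word_count[b]))
            for (a, b), c in pair_count.items()}


def top_keywords(doc_index, docs, paragraphs, n=3, mi_weight=0.5):
    """Rank the words of one document by TF*IDF plus a co-occurrence bonus.

    The additive combination and mi_weight are assumptions made for this
    sketch; the abstract does not say how the factors are combined.
    """
    weights = tf_idf(docs)[doc_index]
    bonus = Counter()
    for (a, b), value in pmi(paragraphs).items():
        if value > 0:  # count only positively associated pairs
            bonus[a] += value
            bonus[b] += value
    ranked = sorted(weights,
                    key=lambda w: (-(weights[w] + mi_weight * bonus[w]), w))
    return ranked[:n]
```

Here `n` plays the role of the cut-off on the number of keywords. A word that appears in every document gets a TF×IDF of zero, so in this sketch it can reach the keyword list only through its co-occurrence bonus.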
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).