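Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity
Cite this article: Ghalip ABDUKERIM, LI Xiao. Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity[J]. Computer Science, 2016, 43(12): 36-40.
Authors: Ghalip ABDUKERIM, LI Xiao
Affiliations: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011; University of Chinese Academy of Sciences, Beijing 100039; Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, and Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011; University of Chinese Academy of Sciences, Beijing 100039
Funding: This work was supported by the Open Project of the Xinjiang Key Laboratory of Multi-language Information Technology (XJDX0905-2013-06).
Abstract: To address the classification of Uyghur-language text, this paper proposes a Uyghur keyword extraction and text classification method based on the TextRank algorithm and mutual information similarity. First, the input text is preprocessed to filter out non-Uyghur characters and stop words. Next, a TextRank algorithm weighted by the semantic similarity of words, word position, and word-frequency importance extracts the keyword set of the text. Finally, a mutual information similarity measure is used to compute the similarity between the keyword set of the input text and the keyword set of each class, which yields the classification. Experimental results show that the scheme extracts highly discriminative keywords, and that the average classification rate reaches 91.2% when the keyword set size is 1250.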

Keywords: Uyghur language; text classification; keyword extraction; TextRank algorithm; mutual information similarity
Received: 2016/3/23
Revised: 2016/5/21

Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity
Ghalip ABDUKERIM and LI Xiao. Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity[J]. Computer Science, 2016, 43(12): 36-40.
Authors: Ghalip ABDUKERIM and LI Xiao
Affiliation: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100039, China; Xinjiang Key Laboratory of Multi-language Information Technology, Urumqi 830046, China and Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100039, China
Abstract: This paper proposes a Uyghur keyword extraction and text classification scheme based on the TextRank algorithm and mutual information similarity, aimed at the classification of Uyghur-language text. First, the input document is preprocessed to filter out non-Uyghur characters and stop words. Then the keyword set of the text is extracted with a TextRank algorithm weighted by the semantic similarity of words, word position, and word-frequency importance. Finally, the similarity between the keyword set of the input text and the keyword set of each class is measured by mutual information similarity, and the text is classified accordingly. Experimental results show that the scheme extracts highly discriminative keywords, and that the average classification rate reaches 91.2% when the keyword set size is 1250.
Keywords: Uyghur language; Text categorization; Keyword extraction; TextRank algorithm; Mutual information similarity
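
The pipeline described in the abstract (keyword extraction with a weighted TextRank, then classification by similarity to per-class keyword sets) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting formulas or the exact mutual information similarity, so the node weight (frequency plus an early-position bonus), the pointwise-mutual-information score, and all function names here are hypothetical. The semantic-similarity weighting is omitted because it needs a lexical resource, and the English tokens stand in for preprocessed Uyghur tokens.

```python
import math
from collections import Counter, defaultdict


def textrank_keywords(tokens, top_k=5, window=3, damping=0.85, iterations=50):
    """Rank tokens with a weighted TextRank over a co-occurrence graph."""
    n = len(tokens)
    freq = Counter(tokens)
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)

    # Assumed node weight: relative frequency plus a bonus for appearing early.
    node_w = {t: freq[t] / n + (1.0 - first_pos[t] / n) for t in freq}

    # Undirected edges between distinct tokens co-occurring inside the window.
    edges = defaultdict(float)
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:i + window]:
            if a != b:
                w = node_w[a] * node_w[b]
                edges[(a, b)] += w
                edges[(b, a)] += w

    out_sum = defaultdict(float)
    incoming = defaultdict(list)
    for (a, b), w in edges.items():
        out_sum[a] += w
        incoming[b].append((a, w))

    # Standard TextRank update, run for a fixed number of iterations.
    score = {t: 1.0 for t in freq}
    for _ in range(iterations):
        score = {
            t: (1 - damping)
            + damping * sum(score[a] * w / out_sum[a] for a, w in incoming[t])
            for t in freq
        }
    ranked = sorted(freq, key=lambda t: (-score[t], t))
    return ranked[:top_k]


def pmi_table(labeled_keyword_sets):
    """Estimate PMI(keyword, class) from (label, keyword_set) training pairs."""
    n_docs = len(labeled_keyword_sets)
    class_count = Counter(label for label, _ in labeled_keyword_sets)
    term_count = Counter()
    joint = Counter()
    for label, keywords in labeled_keyword_sets:
        for t in set(keywords):
            term_count[t] += 1
            joint[(t, label)] += 1
    table = {
        (t, label): math.log(c * n_docs / (term_count[t] * class_count[label]))
        for (t, label), c in joint.items()
    }
    return table, sorted(class_count)


def classify(keywords, table, labels):
    """Pick the class whose summed positive PMI with the keywords is largest."""
    def similarity(label):
        return sum(max(table.get((t, label), 0.0), 0.0) for t in set(keywords))
    return max(labels, key=similarity)


# Toy usage with placeholder tokens.
doc = "football match team football goal team football".split()
keywords = textrank_keywords(doc, top_k=2)
training = [
    ("sports", {"football", "match"}),
    ("sports", {"football", "team"}),
    ("tech", {"software", "computer"}),
    ("tech", {"software", "network"}),
]
table, labels = pmi_table(training)
predicted = classify(keywords, table, labels)
```

In this sketch the per-class keyword sets are implicit in the PMI table; a closer reproduction would first build a fixed-size keyword set per class (the abstract reports results for a set size of 1250) and use the paper's own similarity definition.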