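A new feature-word selection algorithm for micro-blog short texts
Cite this article: HUANG Xian-ying, CHEN Hong-yang, LIU Ying-tao, XIONG Li-yuan. A new feature-word selection algorithm for micro-blog short texts [J]. Computer Engineering & Science, 2015, 37(9): 1761-1767.
Authors: HUANG Xian-ying, CHEN Hong-yang, LIU Ying-tao, XIONG Li-yuan
Affiliation: 1. College of Computer Science and Engineering, Chongqing University of Technology
Funding: National Natural Science Foundation of China (61173184); Science and Technology Program of the Chongqing Municipal Education Commission (KJ100821); Natural Science Foundation of the Chongqing Science and Technology Commission (CSTC2012jjA40030)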
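Abstract: The effective features of micro-blog short texts are sparse and hard to extract, which harms the accuracy of micro-blog text representation, classification, and clustering. To address this, we propose a feature-word selection algorithm for micro-blog short texts that combines statistical and semantic information. Based on part-of-speech (POS) combination matching rules, the algorithm builds a composite evaluation function from each term's TF-IDF, POS, and term-length factor, and combines it with the semantic relevance between the term and the text content to select feature words, so that the chosen words accurately represent the topic of the micro-blog short text. The new algorithm is combined with the Naive Bayes classification algorithm and evaluated on a micro-blog classification corpus. The results show that, compared with other traditional algorithms, the new algorithm yields higher classification precision on micro-blog short texts, indicating that the feature words it selects represent the topic of micro-blog short texts more accurately.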

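Keywords: micro-blog short text; feature-word selection; statistical and semantic information; POS combination; Naive Bayes classification algorithm
Received: 2014-10-28
Revised: 2015-09-25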

A novel algorithm for feature selection on micro-blog short texts
HUANG Xian-ying, CHEN Hong-yang, LIU Ying-tao, XIONG Li-yuan. A novel algorithm for feature selection on micro-blog short texts [J]. Computer Engineering & Science, 2015, 37(9): 1761-1767.
Authors: HUANG Xian-ying, CHEN Hong-yang, LIU Ying-tao, XIONG Li-yuan
Affiliation: College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China
Abstract: The valid features of micro-blog short texts are sparse and difficult to extract, which reduces the accuracy of text representation, classification, and clustering. We propose a novel feature selection algorithm for micro-blog short texts that combines statistical and semantic information. The algorithm builds an evaluation function from term frequency-inverse document frequency (TF-IDF), part of speech (POS), and term length, and combines it with the semantic relevance between each term and the text, so that the selected terms represent the meaning of the text more accurately. We integrate the new algorithm with the Naive Bayes classification algorithm. Experiments on an open micro-blog corpus show that it achieves a higher text-categorization precision than traditional strategies, indicating that the terms it selects represent the topic of micro-blog short texts more accurately.
Keywords: micro-blog short text; feature selection; statistics and semantic information; POS grouping; Naive Bayesian classification algorithm
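
The abstract names the ingredients of the evaluation function (TF-IDF, POS, and term length) but not its exact form. The Python sketch below illustrates only the statistical part, under loud assumptions: the three factors are simply multiplied, the POS weights in `POS_WEIGHT` are invented for illustration, and the term-length factor is the term length divided by the longest term in the text. The POS combination matching rules and the semantic relevance step are omitted because the abstract does not specify them. None of this should be read as the authors' actual formula.

```python
import math
from collections import Counter

# Hypothetical POS weights (n = noun, v = verb, a = adjective).
# The paper's real weights are not given in the abstract.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}
DEFAULT_POS_WEIGHT = 0.1  # assumed weight for any other POS tag


def tf_idf(docs):
    """Return one {term: TF-IDF} dict per document.

    docs is a list of documents; each document is a list of (term, pos) pairs.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({term for term, _ in doc})
    scores = []
    for doc in docs:
        term_freq = Counter(term for term, _ in doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores


def select_features(docs, k=2):
    """Pick the k highest-scoring terms of each document.

    Assumed score: TF-IDF * POS weight * (term length / longest term length).
    """
    selected = []
    for doc, weights in zip(docs, tf_idf(docs)):
        pos_of = dict(doc)  # last POS tag seen for each term
        max_len = max((len(term) for term in pos_of), default=1)
        combined = {
            term: w * POS_WEIGHT.get(pos_of[term], DEFAULT_POS_WEIGHT)
            * (len(term) / max_len)
            for term, w in weights.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(combined, key=lambda t: (-combined[t], t))
        selected.append(ranked[:k])
    return selected


# Two toy, pre-segmented and POS-tagged micro-blog posts (made-up data).
docs = [
    [("足球", "n"), ("比赛", "n"), ("看", "v"), ("精彩", "a")],
    [("股票", "n"), ("上涨", "v"), ("看", "v")],
]
features = select_features(docs, k=2)
```

In this toy run the verb 看 appears in both posts, so its inverse document frequency is zero and it is never selected. The selected terms could then be fed to a Naive Bayes classifier as the feature vocabulary, which is the evaluation setup the abstract describes.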
This article is indexed in Wanfang Data and other databases.