
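A Semi-supervised Method for Filtering Chinese Spam Microblog Posts (一种半监督的中文垃圾微博过滤方法)
Cite this article: YAO Ziyu (姚子瑜), TU Shouzhong (屠守中), HUANG Minlie (黄民烈), ZHU Xiaoyan (朱小燕). A semi-supervised method for filtering Chinese spam microblog posts [J]. Journal of Chinese Information Processing (中文信息学报), 2016, 30(5): 176-186.
Authors: YAO Ziyu (姚子瑜)  TU Shouzhong (屠守中)  HUANG Minlie (黄民烈)  ZHU Xiaoyan (朱小燕)
Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Funding: National Natural Science Foundation of China (61332007, 61272227)
Abstract (translated from the Chinese): Microblogs are among the most active information-sharing platforms in China and abroad, yet they are flooded with spam. Given the microblog data for a particular topic, filtering out the spam posts unrelated to that topic while keeping the topic-relevant posts has therefore become a pressing problem. This paper proposes a semi-supervised method for filtering Chinese microblog posts. Built on the naive Bayes classification model and the expectation-maximization (EM) algorithm, it implements a spam-filtering algorithm whose advantage is that it reaches fairly good filtering performance with only a small amount of labeled data. In experiments on more than 140,000 Sina Weibo posts covering ten topics, the proposed model achieves better accuracy and F-score than the naive Bayes and support vector machine models.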


Keywords (translated from the Chinese): spam microblog filtering  semi-supervised learning  EM algorithm  naive Bayes
  

A Semi-supervised Method for Filtering Chinese Spam Tweets
YAO Ziyu,TU Shouzhong,HUANG Minlie,ZHU Xiaoyan.A Semi-supervised Method for Filtering Chinese Spam Tweets[J].Journal of Chinese Information Processing,2016,30(5):176-186.
Authors:YAO Ziyu  TU Shouzhong  HUANG Minlie  ZHU Xiaoyan
Affiliation:Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract:Microblogging sites are among the most popular information-sharing platforms today. However, among the large number of posts published every day, spam is everywhere: users publish spam posts to advertise, broadcast, boast about their own products, and defame their competitors. Therefore, filtering spam tweets is a critical and fundamental problem. In this paper, we propose a semi-supervised algorithm based on Expectation Maximization and a Naive Bayesian Classifier (EM-NB), which filters spam tweets effectively using only a small amount of labeled data. Experimental results on more than 140 thousand tweets from Sina Weibo show that our method achieves higher accuracy and F-score than the baselines.
Keywords:spam tweet  naive bayesian classifier  expectation maximization  semi-supervised learning  
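
Illustrative note: the abstract names the technique (naive Bayes trained with Expectation Maximization on a few labeled and many unlabeled posts) but does not give its details. The following is only a minimal sketch of that general idea, not the authors' implementation. The function names (`train_nb`, `posterior`, `em_nb`), the Laplace smoothing constant, the fixed iteration count, and the toy tokens are all assumptions made for illustration; real Chinese posts would first need word segmentation, which is omitted here.

```python
import math
from collections import defaultdict


def train_nb(docs, weights, vocab, n_classes=2, alpha=1.0):
    """M-step: fit multinomial naive Bayes from (possibly soft) class weights.

    docs    -- list of token lists
    weights -- one list of per-class weights per document (rows sum to 1)
    alpha   -- Laplace smoothing constant (an assumption of this sketch)
    """
    class_mass = [alpha] * n_classes
    word_counts = [defaultdict(float) for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            class_mass[c] += w[c]
            for tok in doc:
                word_counts[c][tok] += w[c]
                totals[c] += w[c]
    z = sum(class_mass)
    log_prior = [math.log(m / z) for m in class_mass]
    denom = [totals[c] + alpha * len(vocab) for c in range(n_classes)]
    log_like = [
        {t: math.log((word_counts[c][t] + alpha) / denom[c]) for t in vocab}
        for c in range(n_classes)
    ]
    return log_prior, log_like


def posterior(doc, model):
    """E-step for one document: class probabilities under the current model."""
    log_prior, log_like = model
    # Tokens outside the training vocabulary are ignored.
    scores = [lp + sum(ll[t] for t in doc if t in ll)
              for lp, ll in zip(log_prior, log_like)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]


def em_nb(labeled, labels, unlabeled, n_classes=2, n_iter=10):
    """Semi-supervised naive Bayes: labeled docs keep hard labels,
    unlabeled docs receive soft labels that are re-estimated each round."""
    vocab = {t for d in labeled + unlabeled for t in d}
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, vocab, n_classes)  # start from labeled data only
    for _ in range(n_iter):
        soft = [posterior(d, model) for d in unlabeled]            # E-step
        model = train_nb(labeled + unlabeled, hard + soft,
                         vocab, n_classes)                         # M-step
    return model


# Toy usage: class 1 = spam, class 0 = topic-relevant (hypothetical tokens).
labeled = [["free", "prize", "click"], ["match", "goal", "team"]]
labels = [1, 0]
unlabeled = [["free", "click", "win"], ["team", "goal", "score"],
             ["prize", "win", "click"], ["match", "score", "team"]]
model = em_nb(labeled, labels, unlabeled)
print(posterior(["win", "prize"], model))
```

In this toy run the tokens "win" and "score" never appear in the labeled posts; the model can only associate them with a class through the unlabeled posts, which is the sense in which unlabeled data helps when labels are scarce.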