
A Semi-supervised Method for Filtering Chinese Spam Tweets

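Cite as:	YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan. A Semi-supervised Method for Filtering Chinese Spam Tweets[J]. Journal of Chinese Information Processing, 2016, 30(5): 176-186.

Authors:	YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan

Affiliation:	Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Funding:	National Natural Science Foundation of China (61332007, 61272227)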

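Abstract:	Microblogging is currently one of the most active information-sharing platforms in China and abroad, yet it is flooded with spam. How to filter out the spam tweets that are unrelated to a given topic from that topic's microblog data, while retaining the topic-relevant tweets, has therefore become a pressing problem. This paper proposes a semi-supervised filtering method for Chinese microblogs. Based on the naive Bayes classification model and the expectation-maximization (EM) algorithm, it implements a spam-tweet filtering algorithm whose advantage is that satisfactory filtering performance can be obtained with only a small amount of labeled data. In filtering experiments on more than 140,000 Sina Weibo posts across ten topics, the proposed model outperforms the naive Bayes and support vector machine models in accuracy and F-score.
Keywords:	spam tweet filtering; semi-supervised learning; EM algorithm; naive Bayes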
A Semi-supervised Method for Filtering Chinese Spam Tweets

YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan. A Semi-supervised Method for Filtering Chinese Spam Tweets[J]. Journal of Chinese Information Processing, 2016, 30(5): 176-186.

Authors:	YAO Ziyu, TU Shouzhong, HUANG Minlie, ZHU Xiaoyan

Affiliation:	Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract:	Microblogging sites are among the most popular information-sharing platforms today. However, among the large number of posts published every day, spam is everywhere: users publish spam posts to advertise, broadcast, boast about their own products, and defame their competitors. Filtering spam tweets is therefore a critical and fundamental problem. In this paper, we propose a semi-supervised algorithm based on Expectation Maximization and the Naive Bayesian Classifier (EM-NB), which is able to filter spam tweets effectively using only a small amount of labeled data. Experimental results on more than 140 thousand tweets from Sina Weibo show that our method achieves higher accuracy and F-score than the baselines.
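The general idea of combining naive Bayes with EM can be sketched as follows. This is a minimal illustration of the textbook semi-supervised scheme (train on the labeled posts, then alternately soft-label the unlabeled posts and retrain), not the authors' implementation: the function names, the Laplace smoothing constant `alpha`, the fixed iteration count, and the pre-tokenized English toy data are all assumptions. Real Weibo text would first need Chinese word segmentation.

```python
import math
from collections import Counter

def m_step(weighted_docs, vocab, alpha=1.0):
    """Fit multinomial naive Bayes from (tokens, [P(relevant), P(spam)]) pairs."""
    class_mass = [alpha, alpha]            # smoothed soft document counts
    word_mass = [Counter(), Counter()]     # soft word counts per class
    for tokens, post in weighted_docs:
        for c in (0, 1):
            class_mass[c] += post[c]
            for t in tokens:
                word_mass[c][t] += post[c]
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_lik = []
    for c in (0, 1):
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_lik.append({t: math.log((word_mass[c][t] + alpha) / denom)
                        for t in vocab})
    return log_prior, log_lik

def posterior(tokens, log_prior, log_lik):
    """Return [P(relevant | tokens), P(spam | tokens)]; unseen words are skipped."""
    scores = [log_prior[c] + sum(log_lik[c].get(t, 0.0) for t in tokens)
              for c in (0, 1)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, unlabeled, n_iter=10, alpha=1.0):
    """labeled: list of (tokens, label), label 1 = spam, 0 = topic-relevant.
    unlabeled: list of token lists. Returns (log_prior, log_lik)."""
    vocab = sorted({t for toks, _ in labeled for t in toks} |
                   {t for toks in unlabeled for t in toks})
    hard = [(toks, [1.0 - y, float(y)]) for toks, y in labeled]
    log_prior, log_lik = m_step(hard, vocab, alpha)       # supervised start
    for _ in range(n_iter):
        # E-step: soft-label the unlabeled posts with the current model.
        soft = [(toks, posterior(toks, log_prior, log_lik))
                for toks in unlabeled]
        # M-step: refit on labeled posts plus soft-labeled posts.
        log_prior, log_lik = m_step(hard + soft, vocab, alpha)
    return log_prior, log_lik

def is_spam(tokens, model):
    return posterior(tokens, *model)[1] > 0.5

# Toy data (hypothetical): two labeled posts, four unlabeled ones.
labeled = [(["buy", "cheap", "discount"], 1), (["match", "goal", "team"], 0)]
unlabeled = [["cheap", "discount", "sale"], ["team", "goal", "win"],
             ["sale", "buy"], ["win", "match"]]
model = em_nb(labeled, unlabeled)
```

In this toy setup the word "sale" never appears in a labeled post, so a purely supervised naive Bayes model has no evidence about it; the EM iterations let it pick up a spam association from the unlabeled posts in which it co-occurs with known spam words.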

Keywords:	spam tweet; naive Bayesian classifier; expectation maximization; semi-supervised learning
