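A Filtering Approach for Spam Discrimination and Content Similarity Double Detection for Microblog Text Stream
Cite this article: Wang Lin, Feng Shi, Xu Weili, Yang Zhuo, Wang Daling, Zhang Yifei. A Filtering Approach for Spam Discrimination and Content Similarity Double Detection for Microblog Text Stream [J]. Computer Applications and Software (计算机应用与软件), 2012, 29(8): 25-29, 94.
Authors: Wang Lin (王琳), Feng Shi (冯时), Xu Weili (徐伟丽), Yang Zhuo (杨卓), Wang Daling (王大玲), Zhang Yifei (张一飞)
Affiliations: 1. School of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning, China
2. School of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning, China; Key Laboratory of Medical Image Computing (Northeastern University), Ministry of Education, Shenyang 110819, Liaoning, China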
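Abstract: As a new medium through which users spread information, microblogs play an important role in initiating and spreading online public opinion. Users' intentional actions (posting advertisements) and unintentional actions (reposting) produce large numbers of noise microblogs and similar microblogs, which seriously hamper both public opinion analysis and ordinary browsing. Detecting these noise and similar microblogs in order to purify microblog data has therefore become a pressing problem. Based on statistical data, this paper analyzes the characteristics of noise microblogs and similar microblogs and proposes a filtering method for microblog text streams that combines noise discrimination with content similarity detection: noise microblogs are filtered by discriminating on features such as URL links, character rate, and high-frequency words, and similar microblogs are detected and removed by a two-stage content filter consisting of segment filtering and index filtering. Experiments show that the method purifies microblog data effectively and filters out similar and noise microblogs efficiently and accurately.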

Keywords: microblog; noise microblog; similar microblog; text stream; filtering
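
The noise-discrimination stage named in the abstract (URL links, character rate, high-frequency words) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define "character rate", so it is read here as the share of non-alphanumeric symbols in the text, and the word list, the thresholds, and the names `character_rate` and `is_noise` are all hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Hypothetical high-frequency spam terms; the paper derives its own list
# from statistics that the abstract does not report.
SPAM_WORDS = ("discount", "promotion", "free shipping")


def character_rate(text):
    """Share of non-whitespace characters that are neither letters nor digits.

    Assumed definition: the abstract names the feature without defining it.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    symbols = sum(1 for c in chars if not c.isalnum())
    return symbols / len(chars)


def is_noise(text, max_urls=1, max_char_rate=0.4, min_spam_hits=2):
    """Flag a post as noise if any one of the three features exceeds its threshold.

    All thresholds are placeholders chosen for illustration.
    """
    url_count = len(URL_RE.findall(text))
    body = URL_RE.sub(" ", text)  # score the remaining text without the URLs
    spam_hits = sum(1 for word in SPAM_WORDS if word in body.lower())
    return (
        url_count > max_urls
        or character_rate(body) > max_char_rate
        or spam_hits >= min_spam_hits
    )
```

Combining the three features with a simple OR keeps the sketch short; a real system would more likely tune the thresholds on labeled data or feed the features to a classifier.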

A FILTERING APPROACH FOR SPAM DISCRIMINATION AND CONTENT SIMILARITY DOUBLE DETECTION FOR MICROBLOG TEXT STREAM
Wang Lin, Feng Shi, Xu Weili, Yang Zhuo, Wang Daling, Zhang Yifei. A FILTERING APPROACH FOR SPAM DISCRIMINATION AND CONTENT SIMILARITY DOUBLE DETECTION FOR MICROBLOG TEXT STREAM [J]. Computer Applications and Software, 2012, 29(8): 25-29, 94.
Authors: Wang Lin, Feng Shi, Xu Weili, Yang Zhuo, Wang Daling, Zhang Yifei
Affiliations: 1. School of Information Science and Engineering, Northeastern University, Shenyang 110819, Liaoning, China; 2. Key Laboratory of Medical Image Computing (Northeastern University), Ministry of Education, Shenyang 110819, Liaoning, China
Abstract: As a new carrier for users' information dissemination, the microblog plays an increasingly important role in the emergence and propagation of Web public opinion. Large numbers of spam microblogs and similar microblogs, caused by users' deliberate (uploading advertisements) or inadvertent (reposting) operations, adversely affect both network public opinion analysis and user browsing. Detecting these spam and similar microblogs in order to purify microblog data has become a problem in urgent need of a solution. In this paper, the characteristics of spam microblogs and similar microblogs are analysed on the basis of statistical data, and a filtering approach that combines spam discrimination with content similarity detection is put forward for microblog text streams. The method filters spam microblogs by discriminating on features such as URL links, character rate, and high-frequency words, and it detects and eliminates similar microblogs through double content filtering: subsection-based filtering followed by index-based filtering. Experiments show that the method purifies microblog data effectively and filters out similar microblogs and spam microblogs efficiently and accurately.
Keywords: Microblog; Spam microblog; Similar microblog; Text stream; Filter
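
The double content filtering for similar microblogs could look roughly like the sketch below. The abstract gives no details of the subsection-based and index-based filters, so the two stages here are stand-ins chosen for illustration: an exact match on punctuation-delimited segments, followed by an inverted index over character bigrams with a Jaccard similarity check. The class `SimilarityFilter`, its method names, and the 0.7 threshold are hypothetical.

```python
import re
from collections import defaultdict

# Split on common Chinese and Western punctuation (an assumed segmentation rule).
SEGMENT_SPLIT = re.compile(r"[,。!?;、,.!?;]+")


def segments(text, min_len=4):
    """Punctuation-delimited segments long enough to be meaningful."""
    parts = (s.strip() for s in SEGMENT_SPLIT.split(text))
    return {s for s in parts if len(s) >= min_len}


def bigrams(text):
    """Character bigrams of the text with whitespace removed."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


class SimilarityFilter:
    def __init__(self, threshold=0.7):
        self.threshold = threshold      # placeholder Jaccard threshold
        self.seen_segments = set()      # stage 1: segments of kept posts
        self.index = defaultdict(set)   # stage 2: bigram -> ids of kept posts
        self.kept = []                  # bigram set of each kept post, by id

    def is_similar(self, text):
        # Stage 1 (segment filter): every segment has been seen before.
        segs = segments(text)
        if segs and segs <= self.seen_segments:
            return True
        # Stage 2 (index filter): fetch candidates that share a bigram,
        # then verify each candidate with Jaccard similarity.
        grams = bigrams(text)
        candidates = set()
        for g in grams:
            candidates |= self.index.get(g, set())
        for pid in candidates:
            other = self.kept[pid]
            if len(grams & other) / len(grams | other) >= self.threshold:
                return True
        return False

    def offer(self, text):
        """Return True if the post is kept, False if filtered as similar."""
        if self.is_similar(text):
            return False
        pid = len(self.kept)
        grams = bigrams(text)
        self.kept.append(grams)
        for g in grams:
            self.index[g].add(pid)
        self.seen_segments |= segments(text)
        return True
```

Posts are fed to `offer` one at a time in stream order. The cheap segment check catches verbatim reposts and reordered copies, while the index lookup limits the more expensive Jaccard comparison to posts that share at least one bigram with the newcomer.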
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.