
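Combining topic similarity with link weight for Web spam ranking detection
Cite this article: WEI Sha, ZHU Yan. Combining topic similarity with link weight for Web spam ranking detection[J]. Journal of Computer Applications, 2016, 36(3): 735-739.
Authors: WEI Sha, ZHU Yan
Affiliation: School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031
Funding: Sichuan Province Training Fund for Academic and Technical Leaders.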
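Abstract: Links from normal pages to spam pages on the Web degrade the detection performance of ranking algorithms such as Anti-TrustRank. To address this problem, a ranking algorithm named Topic Link Distrust Rank (TLDR) is proposed, in which topic similarity and link weight jointly regulate the propagation of distrust values between pages. First, a Latent Dirichlet Allocation (LDA) model is used to obtain the topic distribution of every page, and the topic similarity between linked pages is computed. Second, link weights are computed from the Web graph and combined with topic similarity to obtain a topic-link weight matrix. Then, the topic-link weights are used to regulate distrust propagation, improving the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms so that pages receive more reasonable distrust values. Finally, all pages are ranked by distrust value, and spam pages are detected by applying a threshold. Experiments on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, TLDR raises SpamFactor by 45% and 23.7%, F1-measure (at a threshold of 600) by 3.4 and 0.5 percentage points, and the spam ratio (in the first three buckets) by 15 and 10 percentage points, respectively. The TLDR algorithm, which combines topic with link weight, can therefore effectively improve Web spam detection performance.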
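The abstract does not give the formulas for topic similarity, link weight, or how the two are combined, so the following is only a minimal sketch under stated assumptions: cosine similarity between topic distributions, the raw link count as link weight, and their product as the combined weight. The function names and the toy pages are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions (lists of floats).

    ASSUMPTION: the paper's actual similarity measure is not stated in the abstract.
    """
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def topic_link_weights(links, topics):
    """Combine link weight with topic similarity for every link.

    links:  dict mapping (source, target) -> raw link count (assumed link weight)
    topics: dict mapping page -> topic distribution, e.g. produced by an LDA model
    ASSUMPTION: the combination is a plain product; the paper may differ.
    """
    return {
        (src, dst): count * cosine_similarity(topics[src], topics[dst])
        for (src, dst), count in links.items()
    }


# Toy data: one normal page and two spam pages with two topics each.
topics = {
    "good": [0.9, 0.1],
    "spam_a": [0.1, 0.9],
    "spam_b": [0.1, 0.9],
}
links = {("good", "spam_a"): 1, ("spam_b", "spam_a"): 2}
weights = topic_link_weights(links, topics)
```

In this toy data the good-to-spam link gets a much smaller weight than the spam-to-spam link, because the two endpoints of the former have dissimilar topic distributions.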

Keywords: Web spam detection; link spam; ranking algorithm; topic similarity; distrust propagation
Received: 2015-07-29
Revised: 2015-10-03

Combining topic similarity with link weight for Web spam ranking detection
WEI Sha, ZHU Yan. Combining topic similarity with link weight for Web spam ranking detection[J]. Journal of Computer Applications, 2016, 36(3): 735-739.
Authors: WEI Sha, ZHU Yan
Affiliation: School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031, China
Abstract: Good-to-bad links on the Web degrade the detection performance of ranking algorithms such as Anti-TrustRank. To address this issue, a distrust ranking algorithm named Topic Link Distrust Rank (TLDR) was proposed, which combines topic similarity with link weight to adjust distrust propagation. Firstly, the topic distribution of every page was obtained by Latent Dirichlet Allocation (LDA), and the topic similarity of linked pages was computed. Secondly, link weight was computed from the Web graph and combined with topic similarity to obtain the topic-link weight matrix. Then, the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms were improved by using the topic-link weight to adjust distrust propagation, so that pages receive more reasonable distrust scores. Finally, all the pages were ranked according to their distrust scores, and spam pages were detected by applying a threshold. The experimental results on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, the SpamFactor of TLDR is raised by 45% and 23.7%, the F1-measure (with a threshold of 600) is improved by 3.4 and 0.5 percentage points, and the spam ratio (in the top three buckets) is increased by 15 and 10 percentage points, respectively.
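The propagation and thresholding steps can be illustrated with a small sketch. It follows the general Anti-TrustRank idea of propagating distrust backwards along links from a spam seed set, with each page's distrust split among its in-linkers in proportion to a per-link weight. The normalization, the damping factor `alpha`, the top-k reading of the threshold, and all names and toy values are assumptions for illustration; the abstract does not specify them.

```python
def propagate_distrust(weights, pages, spam_seeds, alpha=0.85, iterations=50):
    """Weighted distrust propagation on the reversed Web graph.

    weights:    dict mapping (source, target) -> combined topic-link weight
    spam_seeds: set of pages known to be spam
    ASSUMPTION: a target's distrust is shared among the pages linking to it
    in proportion to the link weights; the paper's exact rule may differ.
    """
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}

    # Total weight of the links pointing at each target page.
    in_total = {}
    for (src, dst), w in weights.items():
        in_total[dst] = in_total.get(dst, 0.0) + w

    distrust = dict(seed)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for (src, dst), w in weights.items():
            if in_total[dst] > 0:
                # Distrust flows from the target back to the linking page.
                nxt[src] += alpha * distrust[dst] * w / in_total[dst]
        distrust = nxt
    return distrust


def detect_spam(distrust, top_k):
    """Rank pages by distrust and flag the top_k as spam (assumed threshold rule)."""
    ranked = sorted(distrust, key=distrust.get, reverse=True)
    return ranked[:top_k]


# Toy graph: a weak good-to-spam link and a strong spam-to-spam link.
pages = ["good", "spam_a", "spam_b"]
weights = {("good", "spam_a"): 0.2, ("spam_b", "spam_a"): 1.8}
scores = propagate_distrust(weights, pages, spam_seeds={"spam_a"})
flagged = detect_spam(scores, top_k=2)
```

Because the good-to-spam link carries a small weight, the normal page inherits far less distrust than the spam page that links to the same seed, which is the effect the abstract attributes to weighting the propagation.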
Keywords: Web spam detection; link-based spam; ranking algorithm; topic similarity; distrust propagation