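Combining topic similarity with link weight for Web spam ranking detection

Cite this article:	WEI Sha, ZHU Yan. Combining topic similarity with link weight for Web spam ranking detection[J]. Journal of Computer Applications, 2016, 36(3): 735-739.

Authors:	WEI Sha, ZHU Yan

Affiliation:	School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031

Funding:	Sichuan Province Academic and Technical Leaders Training Support Program.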

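Abstract:	Links from normal pages to spam pages on the Web degrade the detection performance of ranking algorithms such as Anti-TrustRank. To address this, a ranking algorithm is proposed that combines topic similarity with link weight to jointly regulate the propagation of page distrust values: Topic Link Distrust Rank (TLDR). First, a Latent Dirichlet Allocation (LDA) model is used to obtain the topic distribution of every page, and the topic similarity between linked pages is computed. Second, link weights are computed from the Web graph and combined with topic similarity to obtain a topic-link weight matrix. Then, the topic-link weights are used to regulate distrust propagation, improving the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms so that pages receive more reasonable distrust values. Finally, all pages are ranked by distrust value, and spam pages are detected by applying a threshold. Experiments on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, TLDR improves SpamFactor by 45% and 23.7% respectively, F1-measure (threshold of 600) by 3.4 and 0.5 percentage points respectively, and the spam ratio (first three buckets) by 15 and 10 percentage points respectively. The TLDR algorithm, which combines topic with link weight, therefore effectively improves Web spam detection performance.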
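The propagation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how topic similarity is measured, how link weights are derived, or how the two are combined, so the sketch assumes cosine similarity over topic distributions, raw link weights supplied by the caller, and a simple product to combine them. The function names, the damping factor `alpha`, and the toy graph are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic distributions (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_link_weights(link_weights, topics):
    """Scale each raw link weight by the topic similarity of its endpoints.

    Assumption: the two signals are combined by a plain product.
    """
    return {(src, dst): w * cosine(topics[src], topics[dst])
            for (src, dst), w in link_weights.items()}

def propagate_distrust(pages, weights, spam_seeds, alpha=0.85, iterations=50):
    """Anti-TrustRank-style propagation: distrust flows backward along links,
    from a link's target to its sources, split in proportion to edge weight."""
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    # Total weight of the links pointing at each page, used for normalisation.
    incoming = {}
    for (_, dst), w in weights.items():
        incoming[dst] = incoming.get(dst, 0.0) + w
    score = dict(seed)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for (src, dst), w in weights.items():
            if incoming[dst] > 0:
                nxt[src] += alpha * score[dst] * w / incoming[dst]
        score = nxt
    return score

# Toy graph (hypothetical): S is a known spam seed. C links to S and shares its
# topic; B is a normal page with an off-topic link to S; A links only to B.
topics = {"A": [0.1, 0.9], "B": [0.1, 0.9], "C": [0.8, 0.2], "S": [0.9, 0.1]}
link_weights = {("A", "B"): 1.0, ("B", "S"): 1.0, ("C", "S"): 1.0}

weights = topic_link_weights(link_weights, topics)
scores = propagate_distrust(list(topics), weights, spam_seeds={"S"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph the off-topic link from B to S carries a low topic-link weight, so B inherits much less distrust than the on-topic page C. That is the effect the abstract attributes to weighting propagation by topic similarity.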
Keywords:	Web spam detection; link spam; ranking algorithm; topic similarity; distrust propagation
Received:	2015-07-29
Revised:	2015-10-03
Combining topic similarity with link weight for Web spam ranking detection

WEI Sha, ZHU Yan. Combining topic similarity with link weight for Web spam ranking detection[J]. Journal of Computer Applications, 2016, 36(3): 735-739.

Authors:	WEI Sha, ZHU Yan

Affiliation:	School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031, China

Abstract:	To address the problem that good-to-bad links on the Web degrade the detection performance of ranking algorithms (e.g. Anti-TrustRank), a distrust ranking algorithm, Topic Link Distrust Rank (TLDR), was proposed that combines topic similarity with link weight to adjust distrust propagation. Firstly, the topic distribution of all pages was obtained by Latent Dirichlet Allocation (LDA), and the topic similarity of linked pages was computed. Secondly, link weight was computed from the Web graph and combined with topic similarity to obtain the topic-link weight matrix. Then, the Anti-TrustRank and Weighted Anti-TrustRank (WATR) algorithms were improved by using the topic-link weights to adjust distrust propagation, so that pages receive more reasonable distrust scores. Finally, all pages were ranked by distrust score, and spam pages were detected by applying a threshold. Experimental results on the WEBSPAM-UK2007 dataset show that, compared with Anti-TrustRank and WATR, TLDR raises SpamFactor by 45% and 23.7%, improves F1-measure (with a threshold of 600) by 3.4 and 0.5 percentage points, and increases the spam ratio (top three buckets) by 15 and 10 percentage points, respectively.
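The final detection and evaluation step can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it reads the "threshold" as a rank cutoff (the top-k pages by distrust score are flagged as spam) and uses equal-sized buckets, neither of which the abstract confirms. SpamFactor is left out because the abstract does not define it. All function names and the sample scores are hypothetical.

```python
def detect_spam(scores, threshold):
    """Rank pages by descending distrust score and flag the top `threshold`.

    Assumption: the threshold is a rank cutoff, not a score value.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, set(ranked[:threshold])

def f1_measure(predicted, actual_spam):
    """Harmonic mean of precision and recall for the flagged set."""
    true_positives = len(predicted & actual_spam)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual_spam)
    return 2 * precision * recall / (precision + recall)

def spam_ratio_per_bucket(ranked, actual_spam, n_buckets):
    """Split the ranking into consecutive buckets and report the spam share of each.

    Assumption: buckets hold an equal number of pages (the last may be smaller).
    """
    size = -(-len(ranked) // n_buckets)  # ceiling division
    buckets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [sum(page in actual_spam for page in bucket) / len(bucket)
            for bucket in buckets]

# Hypothetical distrust scores and ground-truth labels for six pages.
scores = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.3, "e": 0.2, "f": 0.1}
actual_spam = {"a", "b", "d"}

ranked, flagged = detect_spam(scores, threshold=3)
f1 = f1_measure(flagged, actual_spam)
ratios = spam_ratio_per_bucket(ranked, actual_spam, n_buckets=3)
print(flagged, f1, ratios)
```

A ranking that concentrates spam in the leading buckets is the behaviour that the reported "spam ratio (top three buckets)" measures, while the F1-measure summarises how well the cutoff separates spam from normal pages.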

Keywords:	Web spam detection; link-based spam; ranking algorithm; topic similarity; distrust propagation
