
Focused crawler system based on link structure and content similarity

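Cite this article:	NI Xian-gui, CAI Ming. Focused crawler system based on link structure and content similarity[J]. Computer Engineering and Design, 2008, 29(7): 1709-1711.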

Authors:	NI Xian-gui, CAI Ming

Affiliation:	School of Information Engineering, Southern Yangtze University (江南大学), Wuxi, Jiangsu 214122, China

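Abstract:	This paper introduces the architecture of a topic-focused Web crawler system based on link structure and content similarity, with emphasis on the algorithm that combines web page link structure and content similarity to compute page relevance. The algorithm counts the links from the seed page set to a crawled page and the links from the crawled page to the seed page set, computes the content similarity between the page content and the topic, and combines these into a relevance weight for the page. Authority pages or hub pages are then selected as seed pages, which improves the crawling efficiency of the focused crawler system and the precision of the crawled pages.
Keywords:	focused crawler; link structure; content similarity; vector space model; precision
Article ID:	1000-7024(2008)07-1709-02
Revised manuscript received:	April 10, 2007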
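
The relevance computation summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so everything specific here is an assumption: the function names, the raw term-frequency vector space model, the normalization of link counts by the seed set size, and the mixing weight `alpha` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_terms, topic_terms):
    """Cosine similarity of two term lists under a raw term-frequency VSM."""
    a, b = Counter(doc_terms), Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_seed_links(page, seeds, graph):
    """Return (links from seeds to page, links from page to seeds).

    `graph` maps a URL to the set of URLs it links to (assumed structure).
    """
    from_seeds = sum(1 for s in seeds if page in graph.get(s, set()))
    to_seeds = len(graph.get(page, set()) & set(seeds))
    return from_seeds, to_seeds


def relevance(page, seeds, graph, doc_terms, topic_terms, alpha=0.5):
    """Combine link structure and content similarity into one weight.

    Assumption: a linear mix with weight `alpha`; the paper may combine
    the two signals differently.
    """
    from_seeds, to_seeds = count_seed_links(page, seeds, graph)
    # Normalize link counts to [0, 1] by the largest possible count.
    link_score = (from_seeds + to_seeds) / (2 * len(seeds)) if seeds else 0.0
    content_score = cosine_similarity(doc_terms, topic_terms)
    return alpha * link_score + (1 - alpha) * content_score
```

In this sketch a page that many seeds link to behaves like an authority, and a page that links to many seeds behaves like a hub; both raise `link_score`, which matches the abstract's description only loosely.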
Focused crawler system based on combination of link structure and content similarity

NI Xian-gui, CAI Ming. Focused crawler system based on combination of link structure and content similarity[J]. Computer Engineering and Design, 2008, 29(7): 1709-1711.

Authors:	NI Xian-gui, CAI Ming

Affiliation:	NI Xian-gui, CAI Ming (School of Information Engineering, Southern Yangtze University, Wuxi 214122, China)

Abstract:	This paper introduces the architecture of a focused crawler system. The system combines web link structure analysis with content similarity to evaluate page relevance. The evaluation algorithm counts the links to a page from the seed repository and the links from that page back to the seed repository, and computes the page's content similarity to the topic. The seed selector then selects the page with the highest value as a seed page for crawling, which improves the harvest ratio and efficiency of the focused crawler system.

Keywords:	focused crawler; link structure; content similarity; VSM; harvest ratio
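
The seed selection step mentioned in the abstract might look roughly like the following sketch. It assumes relevance values have already been computed and stored in a dictionary keyed by URL; the function name `select_seeds` and the parameters `k` and `min_score` are hypothetical and do not come from the paper.

```python
import heapq


def select_seeds(scores, current_seeds, k=1, min_score=0.0):
    """Pick up to k highest-scoring pages that are not already seeds.

    `scores` maps URL -> relevance value (assumed precomputed).
    `min_score` is a hypothetical cutoff that keeps weak pages out.
    """
    candidates = [
        url for url, score in scores.items()
        if url not in current_seeds and score >= min_score
    ]
    return heapq.nlargest(k, candidates, key=lambda url: scores[url])
```

With `k=1` this follows the abstract's wording of selecting the single page with the highest value; a larger `k` is one possible way to grow the seed set faster.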
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
