
Improving the Web Page Crawling Algorithm in Focused Crawling

Affiliation:	School of Computer Science, Central South University of Forestry and Technology

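Abstract:	The rapid growth of the Internet poses a major challenge for finding and discovering information on the World Wide Web. For the topic- or domain-specific queries that most users pose, traditional general-purpose search engines often fail to return satisfactory result pages. Topic-oriented focused crawlers were proposed to overcome this shortcoming. This paper analyzes and compares the crawling methods of current focused crawlers in depth (mainly page analysis algorithms and page search strategies) and proposes an improved focused crawling algorithm. The method is based on interclass rules: it builds on the architecture of the baseline focused crawler, applies a naive Bayes classifier, and constructs rules from the statistical relationships of links between topic clusters in order to find "future reward" pages within a certain link distance. Experiments analyzing and evaluating the algorithm's performance show that it clearly improves the focused crawler's harvest rate and coverage.
Keywords:	baseline focused crawler; naive Bayes classifier; future reward rate; rule-based focused crawler; tunneling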
The Extension of Focused Crawling Strategy

Authors:	TAN Jun-shan CHEN Ke-qin

Abstract:	A focused crawler gathers relevant Web pages on a particular topic. In our work, we started with a focused-crawling approach designed by Soumen Chakrabarti, Martin van den Berg and Byron Dom, called the baseline crawler. Building on this crawler, we developed a rule-based crawler, which uses simple rules derived from interclass (topic) linkage patterns to decide its next move. This rule-based crawler also enhances the baseline crawler by supporting tunneling. Initial performance results show that this rule-based Web-crawling approach uses linkage statistics among topics to improve a baseline focused crawler's harvest rate and coverage.
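
The abstract does not spell out how the interclass rules are built or applied, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a classifier (for example a naive Bayes model, not shown here) has already assigned a topic class to each page, that a rule is an estimated probability that a page of one class links to a page of another class, and that tunneling can be approximated by the chance of reaching the target class within a small number of links. All names (`learn_rules`, `reach_prob`, `Frontier`) and the toy class labels are hypothetical.

```python
import heapq
from collections import defaultdict


def learn_rules(labeled_links):
    """Estimate P(child class | parent class) from (parent, child) class pairs.

    Assumption: the pairs come from a training crawl whose pages were
    labeled by some classifier; that classifier is not part of this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for parent, child in labeled_links:
        counts[parent][child] += 1
    rules = {}
    for parent, children in counts.items():
        total = sum(children.values())
        rules[parent] = {c: n / total for c, n in children.items()}
    return rules


def reach_prob(cls, target, rules, depth):
    """Chance that a walk over class transitions hits `target` within `depth` links.

    With depth > 1 an off-topic page can still score above zero, which is
    one simple way to model tunneling through irrelevant pages.
    """
    if depth == 0:
        return 0.0
    outgoing = rules.get(cls, {})
    prob = outgoing.get(target, 0.0)
    for nxt, p in outgoing.items():
        if nxt != target:
            prob += p * reach_prob(nxt, target, rules, depth - 1)
    return prob


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


# Toy training data: class of the linking page, class of the linked page.
toy_links = [
    ("dept", "course"), ("dept", "course"), ("dept", "home"),
    ("home", "dept"), ("home", "other"), ("other", "other"),
]
toy_rules = learn_rules(toy_links)

frontier = Frontier()
for url, page_class in [("u1", "home"), ("u2", "dept"), ("u3", "other")]:
    # Score each link by the class of the page it was found on.
    frontier.push(url, reach_prob(page_class, "course", toy_rules, depth=2))
```

In this toy setup a link found on a "home" page gets a nonzero score even though "home" pages never link directly to the target class, because "home" pages often lead to "dept" pages, which do. A crawler that scores only by the relevance of the current page would give such a link no credit.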

This article is indexed in CNKI and other databases.
