
Website Crawling for Specific Topics (领域相关的Web网站抓取方法)

Cite this article:	李刚, 周立柱, 郭奇, 林玲. 领域相关的Web网站抓取方法[J]. 计算机科学, 2007, 34(2): 137-140.

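Authors:	李刚 (LI Gang), 周立柱 (ZHOU Li-Zhu), 郭奇 (GUO Qi), 林玲 (LIN Ling)

Affiliation:	Department of Computer Science and Technology, Tsinghua University, Beijing 100084

Funding:	Major International Cooperation Project of the National Natural Science Foundation of China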

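Abstract:	This paper proposes a method for crawling domain-relevant Web sites that accurately collects the websites in a user's domain of interest at low cost. The method mainly improves on traditional focused crawler techniques. It first uses Meta-Search to improve on the way a traditional crawler fetches pages through link analysis, then uses heuristic search to greatly reduce the search cost, and achieves good precision by introducing a scoring method that evaluates domain relevance. The paper describes these algorithms in detail and verifies their efficiency and effectiveness through detailed experiments.
Keywords:	Meta-Search; focused crawler; heuristic search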
Website Crawling for Specific Topics

LI Gang, ZHOU Li-Zhu, GUO Qi, LIN Ling. Website Crawling for Specific Topics[J]. Computer Science, 2007, 34(2): 137-140.

Authors:	LI Gang, ZHOU Li-Zhu, GUO Qi, LIN Ling

Affiliation:	Department of Computer Science and Technology, Tsinghua University, Beijing 100084

Abstract:	In this paper, we propose a new approach to discovering Websites on a specific topic on the WWW with high precision and low cost. The approach improves on traditional focused crawler techniques. Unlike a common Web crawler, which traverses the Web graph formed by HTML pages and hyperlinks, our crawler uses Meta-Search to obtain the URLs of relevant pages, then applies a heuristic search method to reduce the search cost and topic-relevance rules to increase precision. The experimental results show that the presented approach is both effective and efficient.

Keywords:	Meta-Search; Focused crawler; Heuristic search
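
The pipeline outlined in the abstract (Meta-Search for seed URLs, heuristic search to limit cost, relevance scoring for precision) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not give the scoring rules or the search heuristic, so the term-overlap score, the URL-based priority, the per-site averaging, and every name here (`tokens`, `relevance`, `meta_search`, `crawl_sites`, the toy engines and pages) are hypothetical.

```python
import heapq
import re
from urllib.parse import urlparse


def tokens(text):
    """Lowercase alphanumeric tokens of a string (page text or URL)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(text, topic_terms):
    """Hypothetical score: fraction of topic terms that occur in the text."""
    found = tokens(text)
    return sum(term in found for term in topic_terms) / len(topic_terms)


def meta_search(engines, query):
    """Merge the URL lists of several engines, keeping first-seen order."""
    seen, merged = set(), []
    for engine in engines:
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def crawl_sites(seed_urls, fetch, topic_terms, threshold=0.5, budget=10):
    """Fetch the most promising URLs first; return hosts judged relevant."""
    # Assumed cheap heuristic: estimate relevance from the URL string
    # before paying the cost of a fetch.
    frontier = [(-relevance(url, topic_terms), url) for url in seed_urls]
    heapq.heapify(frontier)
    page_scores = {}  # host -> scores of the pages fetched from it
    fetched = 0
    while frontier and fetched < budget:
        _, url = heapq.heappop(frontier)
        fetched += 1
        text = fetch(url)
        if text is None:
            continue
        host = urlparse(url).netloc
        page_scores.setdefault(host, []).append(relevance(text, topic_terms))
    # Assumed rule: a site is relevant when its average page score
    # reaches the threshold.
    return sorted(
        host for host, scores in page_scores.items()
        if sum(scores) / len(scores) >= threshold
    )


# Toy in-memory stand-ins for the Web and for search engines (hypothetical).
pages = {
    "http://jazz.example.org/history": "the history of jazz music and jazz bands",
    "http://jazz.example.org/bands": "famous jazz bands",
    "http://shop.example.com/deals": "cheap deals on shoes",
}


def engine_a(query):
    return ["http://jazz.example.org/history", "http://shop.example.com/deals"]


def engine_b(query):
    return ["http://jazz.example.org/history", "http://jazz.example.org/bands"]


topic = ["jazz", "music"]
seeds = meta_search([engine_a, engine_b], "jazz music")
relevant_sites = crawl_sites(seeds, pages.get, topic)
print(relevant_sites)
```

In this sketch the priority queue is what keeps the cost down: with a small `budget`, URLs whose strings already mention the topic are fetched first and unpromising ones may never be fetched at all.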
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.