
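A Method for Crawling Domain-Specific Websites
Cite this article: LI Gang, ZHOU Li-Zhu, GUO Qi, LIN Ling. A Method for Crawling Domain-Specific Websites[J]. Computer Science, 2007, 34(2): 137-140.
Authors: LI Gang  ZHOU Li-Zhu  GUO Qi  LIN Ling
Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Funding: Major International Cooperation Project of the National Natural Science Foundation of China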
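Abstract: This paper proposes a method for crawling domain-specific Web sites that accurately collects the sites in a user's domain of interest at low cost. The method mainly improves on traditional Focused Crawler techniques. It first uses Meta-Search to improve on the traditional crawler's approach of fetching pages through link analysis, then uses heuristic search to greatly reduce the search cost, and, by introducing a scoring method that evaluates domain relevance, achieves good precision. The paper describes these algorithms in detail and verifies their efficiency and effectiveness through detailed experiments.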

Keywords: Meta-Search  Focused Crawler  Heuristic search

Website Crawling for Specific Topics
LI Gang, ZHOU Li-Zhu, GUO Qi, LIN Ling. Website Crawling for Specific Topics[J]. Computer Science, 2007, 34(2): 137-140.
Authors: LI Gang  ZHOU Li-Zhu  GUO Qi  LIN Ling
Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084
Abstract: In this paper, we propose a new approach to discovering Websites on a specific topic in the WWW with high precision and low cost. The approach improves on traditional Focused Crawler techniques. Unlike the common Web crawler, which traverses the Web graph composed of HTML pages and hyperlinks, our crawler uses Meta-Search to obtain the URLs of relevant pages, then uses a heuristic search method to reduce the search cost, and applies topic-relevance rules to increase precision. The experimental results show that the presented approach is both effective and efficient.
Keywords: Meta-Search  Focused crawler  Heuristic search
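
The pipeline named in the abstract (seed URLs from Meta-Search, heuristic search over candidate pages, a relevance score to decide what to keep) can be illustrated with a small best-first crawl. This is only a sketch under assumptions: the abstract does not give the paper's actual scoring rules or heuristics, so the term-overlap score, the rule that a link inherits its parent page's score, and all names (`relevance`, `crawl_sites`, `fake_fetch`, `FAKE_WEB`) are hypothetical stand-ins. The "web" is an in-memory dictionary, and the seed list stands in for results returned by a meta-search step.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def relevance(text: str, topic_terms: Set[str]) -> float:
    """Toy relevance score (assumption): fraction of topic terms found in the text."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def crawl_sites(
    seeds: List[str],
    fetch: Callable[[str], Page],
    topic_terms: Set[str],
    threshold: float = 0.5,
    budget: int = 10,
) -> List[Tuple[str, float]]:
    """Best-first crawl: always expand the most promising URL in the frontier."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant: List[Tuple[str, float]] = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        budget -= 1  # each fetch consumes one unit of the search budget
        text, links = fetch(url)
        score = relevance(text, topic_terms)
        if score >= threshold:
            relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Heuristic (assumption): a link is as promising as the page citing it.
                heapq.heappush(frontier, (-score, link))
    return relevant


# Hypothetical in-memory web used in place of real HTTP requests.
FAKE_WEB: Dict[str, Page] = {
    "a.example": ("database systems query index", ["b.example", "c.example"]),
    "b.example": ("query optimization for database engines", []),
    "c.example": ("cooking recipes and travel", ["d.example"]),
    "d.example": ("more travel photos", []),
}


def fake_fetch(url: str) -> Page:
    return FAKE_WEB.get(url, ("", []))


if __name__ == "__main__":
    for url, score in crawl_sites(["a.example"], fake_fetch, {"database", "query"}):
        print(url, score)
```

With a tight `budget`, the priority queue spends fetches on links found on high-scoring pages first, which is the general way a heuristic search reduces crawl cost; the paper's own heuristic may differ.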
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.