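Implementation and Evaluation of Incremental Crawling for Search Engines
Cite this article: LEI Kai, WANG Dong-hai. Implementation and Evaluation of Incremental Crawling for Search Engines[J]. Computer Engineering, 2008, 34(13): 78-80, 1.
Authors: LEI Kai, WANG Dong-hai
Affiliation: Center for Internet Research and Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055
Funding: Science and Technology Program of Shenzhen, Guangdong Province; Young Teachers Fund of Shenzhen Graduate School, Peking University
Abstract: To address the weaknesses of the traditional periodic, centralized crawler and the difficulties of incremental crawling, this paper proposes a predictive update strategy and presents an MD5-based algorithm for detecting Web page updates, a URL scheduling algorithm, and a URL caching algorithm. It describes the distributed architecture of the system's modules and evaluates the algorithms on a purpose-built test data set. The system has run on the Peking University TianWang search engine for more than half a year: the update cycle was shortened by 20 days, the change-prediction hit rate reached 79.4%, and timeliness, scalability, and stability improved.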

Keywords: incremental crawling; Web page change prediction; search engine

Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine
LEI Kai, WANG Dong-hai. Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine[J]. Computer Engineering, 2008, 34(13): 78-80, 1.
Authors: LEI Kai, WANG Dong-hai
Affiliation: Center for Internet Research and Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055
Abstract: This paper introduces an implementation of an incremental Web crawler that supports daily updates of a search engine covering millions of Web pages. After analyzing the weaknesses of the traditional periodic crawler and the difficulties of incremental crawling, it presents key strategies for predicting Web evolution and algorithms for locating changed Web pages based on MD5, for URL scheduling, and for URL caching. It then describes the implementation and evaluates the crawler system. The incremental crawler has been integrated with the TianWang search engine at Peking University for 6 months. The update cycle is reduced by 20 days, the accuracy of evolution prediction reaches 79.4%, and timeliness, extensibility, and stability are improved.
Keywords:incremental Crawler  Web evolution prediction  search engine
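
The abstract names MD5 as the basis for deciding whether a Web page has changed but gives no further detail. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes the digest is taken over the raw page bytes and that the previous digest of each URL is kept in an in-memory dictionary. The names `page_digest` and `ChangeDetector` are hypothetical.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return the MD5 hex digest of a page body (assumed: raw bytes, no normalization)."""
    return hashlib.md5(content).hexdigest()


class ChangeDetector:
    """Hypothetical helper that remembers the last digest seen for each URL."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def check(self, url: str, content: bytes) -> bool:
        """Record the new digest and return True if the page is new or has changed."""
        new = page_digest(content)
        old = self._digests.get(url)
        self._digests[url] = new
        return old != new


detector = ChangeDetector()
first = detector.check("http://example.com/", b"<html>v1</html>")   # first visit
second = detector.check("http://example.com/", b"<html>v1</html>")  # same bytes
third = detector.check("http://example.com/", b"<html>v2</html>")   # modified bytes
```

Under these assumptions, the first and third calls report a change and the second does not. A production crawler would likely need to strip volatile page regions (timestamps, counters) before hashing, since any byte difference alters the digest.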
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.