
Design and Implementation of a Topic-Specific Web Crawling System
Citation: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang. Design and Implementation of a Topic-Specific Web Crawling System [J]. Computer and Modernization, 2004, 0(10): 1-5, 14
Authors: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang
Affiliations: 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. College of Computer Science and Technology, Jiangxi Normal University, Nanchang, Jiangxi 330027, China
Funding: National Natural Science Foundation of China (79990580); National 973 Program (G1998030414)
Abstract: Many new crawling ideas have been proposed in recent years, and they share a common technique: focused crawling. A focused crawler analyzes its crawl boundary to find the links most relevant to the topic and avoids visiting irrelevant regions of the Web, which saves substantial hardware and network resources and keeps crawled pages fresher. To achieve this goal-directed crawling, this paper proposes two algorithms: a Web page filtering algorithm based on multi-layer classification, which experiments show achieves high accuracy and classifies markedly faster than ordinary classification algorithms; and a URL ordering algorithm based on Web structure, which makes full use of the structural characteristics of the Web and the distribution characteristics of Web pages.

Keywords: URL ordering  focused crawler  multi-layer classification  topic filtering
Article ID: 1006-2475(2004)10-0001-05

Research and Implementation of Intelligent Focused Crawler
HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang. Research and Implementation of Intelligent Focused Crawler [J]. Computer and Modernization, 2004, 0(10): 1-5, 14
Authors: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang
Affiliation: HU Zhuo-ying 1,2; XU Ke 1; WAN Zhong-ying 2,3; LU Yu-chang 1; DING Shu-liang 2 (1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. College of Computer Science and Technology, Jiangxi Normal University, Nanchang, Jiangxi 330027, China)
Abstract: Several new crawling ideas have been proposed in recent years; a technique common to them is focused crawling. A focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, this paper puts forward two algorithms: a Web page filtering algorithm based on a multi-layer classifier, which experimental results show has high accuracy and classifies faster than other classifiers, and a URL ordering algorithm based on Web structure, which makes full use of the structural characteristics of the Web and the distribution of Web pages.
Keywords:URL ordering  focused crawler  multi-layer classification  topic distillation
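The abstract's two ideas, a relevance filter that decides which pages to keep and a priority ordering that decides which URL to fetch next, can be sketched together as a minimal focused-crawl loop. This is an illustrative sketch only: the toy in-memory `WEB` graph, the threshold-based `is_relevant` stand-in for the paper's multi-layer classifier, and the parent-score link-priority heuristic are all assumptions, not the paper's actual algorithms.

```python
import heapq

# Toy in-memory "Web": page -> (topic-relevance score in [0, 1], outgoing links).
# Pages, scores, and links are invented for illustration.
WEB = {
    "seed": (0.9, ["a", "b"]),
    "a":    (0.8, ["c"]),
    "b":    (0.2, ["d"]),
    "c":    (0.7, []),
    "d":    (0.1, []),
}

def is_relevant(score, threshold=0.5):
    """Stand-in for the paper's multi-layer classifier: keep a page
    only if its relevance score passes a fixed threshold."""
    return score >= threshold

def crawl(seed, max_pages=10):
    """Focused crawl: a priority frontier always expands the URL with the
    highest estimated relevance first (the URL-ordering idea), and pages
    judged irrelevant are neither kept nor expanded (the filtering idea)."""
    frontier = [(-1.0, seed)]          # max-heap via negated priority
    seen, collected = {seed}, []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        score, links = WEB[url]
        if not is_relevant(score):
            continue                   # filtered out: its links are not followed
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Inherit the parent's score as a cheap link-priority estimate.
                heapq.heappush(frontier, (-score, link))
    return collected

print(crawl("seed"))                   # page "b" is filtered, so "d" is never reached
```

Because the frontier is a priority queue rather than FIFO, the crawler stays inside the relevant region of the toy graph: the low-scoring page "b" is dropped and its child "d" is never fetched.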
This article is indexed by CNKI, VIP, Wanfang Data, and other databases.