
Design and Implementation of a Topic-Specific Web Crawling System
Citation: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang. Design and Implementation of a Topic-Specific Web Crawling System [J]. Computer and Modernization, 2004, 0(10): 1-5, 14
Authors: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang
Affiliations: 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. College of Computer Science and Technology, Jiangxi Normal University, Nanchang, Jiangxi 330027, China
Funding: National Natural Science Foundation of China (79990580); National 973 Program (G1998030414)
Abstract: Many new crawling ideas have been proposed in recent years, and they share a common technique: focused crawling. A focused crawler analyzes its crawl boundary to find the links most relevant to the topic and avoids visiting irrelevant regions of the Web, which saves substantial hardware and network resources and keeps crawled pages fresher. To achieve this goal-directed crawling, this paper proposes two algorithms: a Web page filtering algorithm based on multi-layer classification, which experiments show achieves high accuracy and classifies markedly faster than ordinary classification algorithms; and a URL ordering algorithm based on Web structure, which makes full use of the structural characteristics of the Web and the distribution characteristics of Web pages.

Keywords: URL ordering  focused crawler  multi-layer classification  topic filtering
Article ID: 1006-2475(2004)10-0001-05

Research and Implementation of Intelligent Focused Crawler
HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang. Research and Implementation of Intelligent Focused Crawler [J]. Computer and Modernization, 2004, 0(10): 1-5, 14
Authors: HU Zhuo-ying, XU Ke, WAN Zhong-ying, LU Yu-chang, DING Shu-liang
Affiliation: HU Zhuo-ying 1,2; XU Ke 1; WAN Zhong-ying 2,3; LU Yu-chang 1; DING Shu-liang 2 (1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. College of Computer Science and Technology, Jiangxi Normal University, Nanchang, Jiangxi 330027, China)
Abstract: Several new crawling ideas have been proposed in recent years; a technique common to them is focused crawling. A focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, this paper puts forward two algorithms: a Web page filtering algorithm based on a multi-layer classifier, which experimental results show has high accuracy and classifies faster than other classifiers, and a URL ordering algorithm based on Web structure, which makes full use of the structural characteristics of the Web and the distribution of Web pages.
Keywords:URL ordering  focused crawler  multi-layer classification  topic distillation
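The abstract's two ideas, a relevance filter that decides which pages to keep and a priority ordering that decides which URL to fetch next, can be sketched together as a minimal focused-crawl loop. This is an illustrative sketch only: the toy in-memory `WEB` graph, the threshold-based `is_relevant` stand-in for the paper's multi-layer classifier, and the parent-score link-priority heuristic are all assumptions, not the paper's actual algorithms.

```python
import heapq

# Toy in-memory "Web": page -> (topic-relevance score in [0, 1], outgoing links).
# Pages, scores, and links are invented for illustration.
WEB = {
    "seed": (0.9, ["a", "b"]),
    "a":    (0.8, ["c"]),
    "b":    (0.2, ["d"]),
    "c":    (0.7, []),
    "d":    (0.1, []),
}

def is_relevant(score, threshold=0.5):
    """Stand-in for the paper's multi-layer classifier: keep a page
    only if its relevance score passes a fixed threshold."""
    return score >= threshold

def crawl(seed, max_pages=10):
    """Focused crawl: a priority frontier always expands the URL with the
    highest estimated relevance first (the URL-ordering idea), and pages
    judged irrelevant are neither kept nor expanded (the filtering idea)."""
    frontier = [(-1.0, seed)]          # max-heap via negated priority
    seen, collected = {seed}, []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        score, links = WEB[url]
        if not is_relevant(score):
            continue                   # filtered out: its links are not followed
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Inherit the parent's score as a cheap link-priority estimate.
                heapq.heappush(frontier, (-score, link))
    return collected

print(crawl("seed"))                   # page "b" is filtered, so "d" is never reached
```

Because the frontier is a priority queue rather than FIFO, the crawler stays inside the relevant region of the toy graph: the low-scoring page "b" is dropped and its child "d" is never fetched.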
This article is indexed by CNKI, VIP, Wanfang Data, and other databases.