
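Efficient WWW Information Gathering
Cite as: TIAN Fan-jiang, WANG Xi-dong, WANG Ding-xing. Efficient WWW Information Gathering [J]. Journal of Software (软件学报), 2001, 12(1): 33-40.
Authors: TIAN Fan-jiang, WANG Xi-dong, WANG Ding-xing
Affiliation: Department of Computer Science and Technology, Tsinghua University
Funding: This project is supported by the Foundation of IBM China Research Laboratory (IBM中国研究中心基金).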
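Abstract: As the information on the WWW grows ever richer, the need for efficient information gathering (IG) tools becomes increasingly urgent. Because network resources are very expensive, information gathering is a resource-bounded task. The main goal is to design an efficient information gathering method for a specific domain. The paper proposes methods for inferring a page's content without downloading it, designs different control strategies, and defines several quantitative measures of page download priority. An information gathering system, TH-Gatherer, was built, and different experiments were carried out to test the method. The experiments show that the content of candidate pages can be approximately inferred without actually downloading them, and that priority-based gathering with a hybrid measure achieves over four times the gathering efficiency of the breadth-first method commonly used by many current information gathering tools (including crawlers and offline browsing tools). The method is suitable for domain-specific information gathering under resource-bounded conditions.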

Keywords: resource-bounded information gathering; WWW application

Efficient World-Wide-Web Information Gathering
TIAN Fan-jiang, WANG Xi-dong and WANG Ding-xing. Efficient World-Wide-Web Information Gathering [J]. Journal of Software, 2001, 12(1): 33-40.
Authors: TIAN Fan-jiang, WANG Xi-dong and WANG Ding-xing
Abstract: As the information available through the World-Wide-Web becomes overwhelming, efficient information gathering (IG) tools are necessary. Because network resources are expensive, IG is a resource-bounded task. The main purpose of this paper is to find an efficient gathering method for a specific topic. This paper presents methods for predicting a page's content without downloading it, designs different control strategies, and defines several kinds of page downloading priority measures. An IG system, TH-Gatherer, was built to test the methods, and different experiments were carried out. The experiments show that the content of candidate pages can be predicted approximately without downloading them. When the priority-based gathering strategy and a hybrid measure are used, the gathering efficiency is four times that of the breadth-first search (BFS) strategy used by many current IG tools (including crawlers and off-line browsing tools). The method presented in this paper is suitable for resource-bounded, topic-specific information gathering.
Keywords: information gathering; World-Wide-Web application
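
The contrast the abstract draws between breadth-first gathering and priority-based gathering can be illustrated with a small sketch. It is not the TH-Gatherer implementation: the abstract does not say how page content is predicted or how the hybrid measure is defined, so the toy link graph `TOY_WEB`, the topic terms, and the anchor-text scoring function `predicted_score` below are all hypothetical stand-ins chosen for illustration only.

```python
import heapq
from collections import deque

# Hypothetical toy web: page -> (is_relevant, [(anchor_text, target_page), ...]).
TOY_WEB = {
    "home": (False, [("sports news", "sports"),
                     ("parallel computing lab", "lab"),
                     ("weather", "weather")]),
    "sports": (False, [("football scores", "football"), ("tennis", "tennis")]),
    "weather": (False, [("forecast", "forecast")]),
    "lab": (True, [("cluster computing papers", "papers"), ("staff", "staff")]),
    "papers": (True, []),
    "staff": (False, []),
    "football": (False, []),
    "tennis": (False, []),
    "forecast": (False, []),
}

# Assumed topic description; the paper's actual topic model is not given.
TOPIC_TERMS = {"parallel", "computing", "cluster"}


def predicted_score(anchor_text):
    """Guess relevance from the link's anchor text alone (assumed heuristic):
    the fraction of topic terms that appear in the anchor."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def gather_bfs(start, budget):
    """Download up to `budget` pages in breadth-first order."""
    queue, seen, downloaded = deque([start]), {start}, []
    while queue and len(downloaded) < budget:
        page = queue.popleft()
        downloaded.append(page)
        for _anchor, target in TOY_WEB[page][1]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return downloaded


def gather_priority(start, budget):
    """Download up to `budget` pages, highest predicted score first."""
    counter = 0  # insertion order breaks ties between equal scores
    heap = [(-1.0, counter, start)]
    seen, downloaded = {start}, []
    while heap and len(downloaded) < budget:
        _neg_score, _order, page = heapq.heappop(heap)
        downloaded.append(page)
        for anchor, target in TOY_WEB[page][1]:
            if target not in seen:
                seen.add(target)
                counter += 1
                # heapq is a min-heap, so push the negated score.
                heapq.heappush(heap, (-predicted_score(anchor), counter, target))
    return downloaded


def efficiency(downloaded):
    """Fraction of downloaded pages that are relevant to the topic."""
    relevant = sum(1 for page in downloaded if TOY_WEB[page][0])
    return relevant / len(downloaded)


if __name__ == "__main__":
    for name, gather in (("BFS", gather_bfs), ("priority", gather_priority)):
        pages = gather("home", budget=3)
        print(name, pages, round(efficiency(pages), 2))
```

On this toy graph with a budget of three downloads, the breadth-first order fetches one relevant page and the priority order fetches two. The numbers say nothing about the fourfold gain reported in the paper; they only show why ordering the frontier by a predicted score matters when the download budget is limited.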
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).