Focused Crawling Oriented Multi-Granular Priority Computation for URLs

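Cite this article:	CHEN Zhumin, MA Jun, HAN Xiaohui, LEI Jingsheng. Focused Crawling Oriented Multi-Granular Priority Computation for URLs[J]. Journal of Chinese Information Processing, 2009, 23(3): 31-39.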

Authors:	CHEN Zhumin, MA Jun, HAN Xiaohui, LEI Jingsheng

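Affiliations:	1. School of Computer Science and Technology, Shandong University, Jinan, Shandong 250101, China; 2. College of Information Science and Technology, Hainan University, Haikou, Hainan 570228, China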

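Funding:	Specialized Research Fund for the Doctoral Program of Higher Education; Shandong Province Key Science and Technology Project; Natural Science Foundation of Hainan Province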

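Abstract:	The performance of the focused crawler is critical to a vertical search system. Designing a focused crawler requires solving two problems: first, computing the relevance of the current page to a given topic; second, computing the visiting priority of the URLs waiting to be crawled. For the first problem, a relevance computation method is given that uses the page's topical text blocks and related link blocks. For the second, a priority computation method is given that is based on topical context and four different granularities (site level, page level, block level, and link level). On this basis, a focused crawling algorithm built on these methods is proposed. Experiments show that, without increasing time complexity, the new algorithm clearly outperforms three other classic crawling algorithms in precision and in sum of information.
Keywords:	computer application; Chinese information processing; focused crawling; priority computation; page segmentation; relevance computation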
Focused Crawling Oriented Multi-Granular Priority Computation for URLs

CHEN Zhumin, MA Jun, HAN Xiaohui, LEI Jingsheng. Focused Crawling Oriented Multi-Granular Priority Computation for URLs[J]. Journal of Chinese Information Processing, 2009, 23(3): 31-39.

Authors:	CHEN Zhumin, MA Jun, HAN Xiaohui, LEI Jingsheng

Affiliation:	1. School of Computer Science and Technology, Shandong University, Jinan, Shandong 250101, China; 2. College of Information Science and Technology, Hainan University, Haikou, Hainan 570228, China

Abstract:	The performance of the focused crawler is crucial to a vertical search engine. Two computational issues to be addressed in the design of focused crawlers are: (1) how to compute the relevance of the currently visited Web page to a given topic, and (2) how to compute the priorities of unvisited URLs in the queue. For the first issue, this paper describes the calculation of the relevance of a page to the topic based on the page's topical text blocks and related link blocks. For the second one, a novel approach is p...

Keywords:	computer application; Chinese information processing; focused crawling; URLs priority computation; page segmentation; relevance computation
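
The abstract describes a crawl frontier in which each unvisited URL receives a priority drawn from four granularities (site, page, block, and link). The sketch below illustrates that general idea only. The abstract does not give the paper's formulas, so the weighted sum, the weight values, and the names `url_priority` and `Frontier` are all assumptions introduced here for illustration, not the authors' method.

```python
import heapq

# Hypothetical weights for the four granularities named in the abstract.
# The abstract gives no values; these are placeholders.
WEIGHTS = {"site": 0.1, "page": 0.3, "block": 0.3, "link": 0.3}


def url_priority(scores, weights=WEIGHTS):
    """Combine per-granularity relevance scores (each assumed in [0, 1]).

    A weighted sum is an assumption made for illustration only.
    Granularities missing from `scores` count as 0.
    """
    return sum(weights[level] * scores.get(level, 0.0) for level in weights)


class Frontier:
    """Max-priority queue of unvisited URLs, built on heapq (a min-heap)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal priorities pop in insertion order

    def push(self, url, scores):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # Negate the priority because heapq pops the smallest item first.
        heapq.heappush(self._heap, (-url_priority(scores), self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Return (url, priority) for the highest-priority unvisited URL."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


# Usage with made-up scores: the URL with the stronger page, block, and
# link evidence is visited first.
frontier = Frontier()
frontier.push("http://example.com/a", {"site": 0.2, "page": 0.9, "block": 0.8, "link": 0.7})
frontier.push("http://example.com/b", {"site": 0.2, "page": 0.1, "block": 0.0, "link": 0.3})
first_url, first_priority = frontier.pop()
```

In a real crawler the four scores would come from the relevance computation over topical text blocks and related link blocks that the abstract mentions; here they are supplied by hand.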
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.