Focused Crawling Based on Genetic Algorithm (基于遗传算法的定题信息搜索策略)

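Cite this article:	XU Huan-qing (许欢庆), WANG Yong-cheng (王永成), SUN Qiang (孙强). Focused Crawling Based on Genetic Algorithm (基于遗传算法的定题信息搜索策略) [J]. Journal of Chinese Information Processing (中文信息学报), 2003, 17(1): 25-31.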

Authors:	XU Huan-qing (许欢庆), WANG Yong-cheng (王永成), SUN Qiang (孙强)

Affiliation:	Department of Computer Science, Shanghai Jiao Tong University (上海交通大学计算机系)

Funding:	National Natural Science Foundation of China (grant 60082003)

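Abstract:	Topic-specific retrieval restricts information retrieval to a particular subject domain and provides search services over the information in that domain. It is one of the directions in which next-generation search engines are developing. Its key technology is the crawling of topic-relevant information. This paper proposes a focused crawling strategy based on a genetic algorithm, which raises the chance that pages reached through pages of low content similarity are crawled and so widens the range of relevant pages searched. In addition, hints from hyperlink metadata are used to predict the topic relevance of linked pages, which speeds up crawling. Comparative crawling experiments show that the algorithm performs well.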
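Keywords:	computer application; Chinese information processing; topic-specific retrieval; focused crawling; genetic algorithm; hub; authority
Article ID:	1003-0077(2003)01-0025-07
Revision received:	March 18, 2002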
Focused Crawling Based on Genetic Algorithm

XU Huan-qing, WANG Yong-cheng, SUN Qiang. Focused Crawling Based on Genetic Algorithm [J]. Journal of Chinese Information Processing, 2003, 17(1): 25-31.

Authors:	XU Huan-qing, WANG Yong-cheng, SUN Qiang

Affiliation:	Department of Computer Science, Shanghai Jiao Tong University

Abstract:	The exponential growth of information on the WWW makes it increasingly difficult for general-purpose crawlers to crawl and index the entire Web. Rather than collecting and indexing all accessible web documents to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant to the crawl, and avoids irrelevant regions of the Web. In this paper, a new focused crawling approach based on a genetic algorithm is proposed. The method selectively seeks out pages that are relevant to a pre-defined set of topics using the genetic algorithm, increases the chance of crawling pages that are linked from pages with low content relevance, and broadens the crawler's search scope for relevant pages. Meanwhile, hyperlink metadata is used to predict the topic relevance of the linked page, which speeds up crawling. Comparative crawling experiments indicate that the approach performs well.

Keywords:	computer application; Chinese information processing; topic-specific retrieval; focused crawling; GA; hub; authority
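
The abstract names two ideas without giving their details: predicting a linked page's topic relevance from hyperlink metadata, and giving links behind weakly relevant pages a chance of being crawled. The Python sketch below illustrates both under stated assumptions and is not the authors' method. It assumes the metadata is the link's anchor text, scores it by simple term overlap, and picks links with roulette-wheel (fitness-proportional) selection, a selection operator commonly used in genetic algorithms; crossover and mutation are omitted. The names `anchor_relevance`, `select_links`, and the `floor` parameter are hypothetical.

```python
import random
import re


def anchor_relevance(anchor_text, topic_terms):
    """Fraction of topic terms found in a link's anchor text.

    Assumption: anchor text stands in for the "hyperlink metadata"
    mentioned in the abstract; the paper's actual measure is not given.
    """
    if not topic_terms:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", anchor_text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def select_links(frontier, topic_terms, k, rng, floor=0.05):
    """Pick up to k URLs by roulette-wheel selection without replacement.

    frontier: list of (url, anchor_text) pairs.
    floor:    positive constant added to every score (hypothetical), so
              links with zero predicted relevance keep a small chance of
              being crawled instead of being pruned outright.
    """
    pool = list(frontier)
    weights = [anchor_relevance(text, topic_terms) + floor for _, text in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        # Draw one index with probability proportional to its weight.
        i = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i)[0])
        weights.pop(i)
    return chosen


if __name__ == "__main__":
    topic = {"genetic", "algorithm"}
    frontier = [  # hypothetical URLs and anchor texts
        ("http://example.org/ga", "Genetic algorithm tutorial"),
        ("http://example.org/food", "Cooking recipes"),
        ("http://example.org/drift", "Genetic drift"),
    ]
    print(select_links(frontier, topic, k=2, rng=random.Random(0)))
```

A purely greedy crawler would never follow the "Cooking recipes" link; here it is still selected occasionally, which is the kind of widened search scope the abstract describes.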
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.