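Research and Implementation of a Focused Crawler Based on a Probabilistic Model
Cite this article: BAI Yu-zhao (白玉昭), LIANG Jiu-zhen (梁久祯). Research and implementation of a focused crawler based on a probabilistic model[J]. Computer Engineering & Science (计算机工程与科学), 2013, 35(1): 160-165.
Authors: BAI Yu-zhao (白玉昭), LIANG Jiu-zhen (梁久祯)
Affiliation: School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122
Funding: National Natural Science Foundation of China
Abstract: Building on several existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. It analyzes several kinds of feature information gathered during crawling and uses a probabilistic model to compute a priority value for each URL, which is then used to filter and rank URLs. The crawler addresses the weakness that most crawlers rely on a single crawling strategy. Unlike earlier focused crawlers, it uses a history evaluation metric and a web page quality metric in addition to the topic relevance metric, which handles the "topic drift" and "tunneling" problems well while preserving the quality of the collected resources. Several groups of experiments confirm its advantage in recall of on-topic pages and in average topic relevance.

Keywords: focused crawler; probabilistic model; URL filtering; URL ranking; priority value
Received: 2011-10-28
Revised: 2011-12-30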

Research and implementation for focused crawler based on probabilistic model
BAI Yu-zhao, LIANG Jiu-zhen. Research and implementation for focused crawler based on probabilistic model[J]. Computer Engineering & Science, 2013, 35(1): 160-165.
Authors: BAI Yu-zhao, LIANG Jiu-zhen
Affiliation: School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
Abstract: Building on a study of existing focused crawlers, this paper proposes a focused crawler that uses a probabilistic model. It analyzes various features obtained during the crawl and uses the probabilistic model to calculate a priority for each URL, which is then used to filter and sort URLs. The proposed crawler addresses the deficiency that most existing crawlers adopt only a single strategy for fetching web pages. Its distinguishing feature is that it considers not only topic relevance but also history evaluation and web page quality, so that the "topic drift" and "tunneling" problems are largely addressed and the quality of the collected resources is maintained. Experimental results show that, compared with other focused crawlers, the proposed crawler gathers more topic-relevant web pages while retrieving fewer pages, and achieves a higher average topic relevance.
Keywords: focused crawler; probabilistic model; URL filtering; URL ordering; priority value
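
The abstract describes scoring each URL from three signals (topic relevance, history evaluation, page quality) and then filtering and ordering URLs by that score. The following is a minimal Python sketch of that idea, not the authors' method: the abstract does not give the form of the probabilistic model, so a plain weighted average stands in for it here. The names `UrlFeatures`, `priority`, and `Frontier`, the weights, the threshold, and the example URLs are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlFeatures:
    """Three per-URL signals, each assumed to be scaled to [0, 1]."""
    topic_relevance: float  # how on-topic the link's context looks
    history_score: float    # how well earlier pages on this path performed
    page_quality: float     # quality estimate of the linking page


def priority(f: UrlFeatures, weights=(0.5, 0.3, 0.2)) -> float:
    """Stand-in for the paper's probabilistic model: a weighted average.

    The weights are illustrative assumptions, not values from the paper.
    """
    w_topic, w_hist, w_qual = weights
    return (w_topic * f.topic_relevance
            + w_hist * f.history_score
            + w_qual * f.page_quality)


class Frontier:
    """Crawl frontier that filters by a threshold and pops best-first."""

    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def push(self, url: str, features: UrlFeatures) -> bool:
        """Queue the URL; return False if it is filtered out."""
        p = priority(features)
        if p < self.threshold:
            return False
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-p, next(self._counter), url))
        return True

    def pop(self) -> str:
        """Return the queued URL with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier(threshold=0.3)
frontier.push("http://example.com/on-topic", UrlFeatures(0.9, 0.8, 0.7))
frontier.push("http://example.com/off-topic", UrlFeatures(0.1, 0.1, 0.1))
frontier.push("http://example.com/tunnel", UrlFeatures(0.2, 0.9, 0.5))
```

In this sketch the third URL has low topic relevance but stays in the queue because of its history score, which is one way a multi-signal priority can keep a crawler following a path through weakly relevant pages ("tunneling"), while the threshold drops links that score poorly on every signal.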
This article is indexed in CNKI, Wanfang Data, and other databases.