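A Multi-Topic Information Crawling Method Based on Search Strategy
Cite this article:ZHONG Zhao-man, LI Cun-hua, LIU Zong-tian, GUAN Yan. A multi-topic information crawling method based on search strategy[J]. Acta Electronica Sinica, 2014, 42(12): 2352-2358.
Authors:ZHONG Zhao-man  LI Cun-hua  LIU Zong-tian  GUAN Yan
Affiliation:1. School of Computer Engineering, Huaihai Institute of Technology, Lianyungang, Jiangsu 222000; 2. School of Computer, Shanghai University, Shanghai 200072
Abstract:To address the low efficiency of multi-topic information crawling, this paper investigates how the search results for topic rules differ between built-in search engines and general search engines, proposes splitting topic rules into atomic rules, and analyzes three relations between atomic rules: identical, interchangeable, and containing. Based on these relations, separate atomic-rule allocation strategies are designed for built-in search and general search, which both improves the precision of topic-specific crawling and reduces the number of searches. Because searching directly with atomic rules yields low precision, a sentence-group-based method for filtering information by its relevance to the topic is proposed. Experiments with 138 topic rules (8223 atomic rules after splitting), 14 built-in search engines, and 4 general search engines compare the total number of items crawled per unit time and the number of relevant items crawled. The results show that the proposed method performs well on both measures.

Keywords:multi-topic information crawling  atomic rules  built-in search  general search  relevance computing
Received:2013-09-24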

A Method of Multi-Topic Crawling Based on Search Strategy
ZHONG Zhao-man, LI Cun-hua, LIU Zong-tian, GUAN Yan. A Method of Multi-Topic Crawling Based on Search Strategy[J]. Acta Electronica Sinica, 2014, 42(12): 2352-2358.
Authors:ZHONG Zhao-man  LI Cun-hua  LIU Zong-tian  GUAN Yan
Affiliation:1. School of Computer Engineering, Huaihai Institute of Technology, Lianyungang, Jiangsu 222000, China; 2. School of Computer, Shanghai University, Shanghai 200072, China
Abstract:To address the low efficiency of multi-topic crawling, the difference in search results between built-in search engines (BSEs) and general search engines (GSEs) is investigated. Topic rules are divided into atomic rules, and three relations between atomic rules (equating, exchanging, and containing) are analyzed. Based on these relations, different allocation strategies are designed for BSEs and GSEs, which both improve the precision of topic-specific crawling and reduce the number of searches. Furthermore, a sentence-cluster-based method for computing the relevance between topics and documents is proposed to address the low precision of crawling directly with atomic rules. An experiment with 138 topic rules (comprising 8223 atomic rules), 14 BSEs, and 4 GSEs evaluates the total number of items crawled and the number of relevant items crawled per unit time. The results show that the proposed method performs well on both measures.
Keywords:multi-topic crawling  atomic rules  built-in search engines  general search engines  relevance computing
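
The abstract names the splitting of topic rules into atomic rules and the three relations between atomic rules, but it does not define the rule syntax or the relations. The following Python sketch is therefore only one plausible reading, not the authors' method. It assumes (hypothetically) that a topic rule is an AND of OR-groups of keywords, that "equating" means identical keyword sequences, "exchanging" means the same keywords in a different order, and "containing" means one keyword set is a proper subset of the other. All function names are illustrative.

```python
from itertools import product


def split_topic_rule(groups):
    """Expand an AND-of-OR-groups topic rule into atomic rules.

    Assumption: each atomic rule is one keyword picked from every group,
    represented as a tuple in group order.
    """
    return [tuple(combo) for combo in product(*groups)]


def relation(a, b):
    """Classify two atomic rules under the assumed relation definitions."""
    if a == b:
        return "equating"        # identical keyword sequences
    sa, sb = set(a), set(b)
    if sa == sb:
        return "exchanging"      # same keywords, different order
    if sa < sb or sb < sa:
        return "containing"      # one keyword set is a proper subset
    return "none"


def dedupe(atomic_rules):
    """Keep one representative per keyword set.

    Illustrates how equating and exchanging rules could share a single
    query; the paper's actual allocation strategies are not reproduced.
    """
    seen = set()
    kept = []
    for rule in atomic_rules:
        key = frozenset(rule)
        if key not in seen:
            seen.add(key)
            kept.append(rule)
    return kept
```

Under these assumptions, a rule with two OR-groups of two keywords each expands into four atomic rules, which is consistent in spirit with 138 topic rules yielding several thousand atomic rules. How the authors allocate atomic rules to BSEs and GSEs is not described in the abstract and is left out here.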
This article is indexed in Wanfang Data and other databases.