
Research on search strategy of web spider in topic-oriented search engines
Citation: SHI Baoming, HE Yuanxiang, WU Chongzheng. Research on search strategy of web spider in topic-oriented search engines[J]. Computer Engineering and Applications, 2014(2): 116-119, 128.
Authors: SHI Baoming  HE Yuanxiang  WU Chongzheng
Affiliations: 1. School of Electronics and Information Engineering, Lanzhou University of Arts and Science, Lanzhou 730000, China; 2. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
Funding: Gansu Lianhe University Research Capability Enhancement Program Project (No. 2012YBTS05).
Abstract: This paper addresses the low efficiency of traditional focused crawlers. A web spider always selects the most valuable links to visit, so how to focus the search around a given topic is a key problem. Traditional methods compute only the relevance of each link and ignore the relevance relationships among the unlabeled URLs awaiting analysis, which makes crawling inefficient. The paper proposes a relevance-discrimination algorithm based on a link model, which combines labeled seed URLs with the unlabeled URLs to be judged in order to determine the relevance of the unlabeled URLs, and it derives that the result is insensitive to the choice of initial values for the iteration. Experimental results show that the proposed method is more efficient than the relevance-discrimination methods of traditional web crawler algorithms.

Keywords: web spider  topic-oriented search engine  search strategy  Vector Space Model (VSM)
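
The abstract does not spell out the link model itself, so the following is only an illustrative sketch of the general idea it describes: labeled seed URLs keep fixed relevance scores, and each unlabeled URL is repeatedly rescored from the URLs it is linked with. It uses a generic label-propagation-style averaging rule, which is an assumption and not necessarily the authors' algorithm. The function name `propagate_relevance`, the example URLs, and the neighbor-averaging update are all hypothetical.

```python
def propagate_relevance(links, seed_scores, init=0.5, iterations=200):
    """Score unlabeled URLs by iterating over a link graph (illustrative only).

    links: dict mapping a URL to the URLs it links to; links are treated
           as undirected here (an assumption made for this sketch).
    seed_scores: dict of labeled seed URLs -> fixed relevance in [0, 1].
    init: starting score assigned to every unlabeled URL.
    """
    # Build symmetric neighbor sets from the link dictionary.
    neighbors = {}
    for src, dsts in links.items():
        neighbors.setdefault(src, set())
        for dst in dsts:
            neighbors.setdefault(dst, set())
            neighbors[src].add(dst)
            neighbors[dst].add(src)

    # Seeds start at their labels; unlabeled URLs start at `init`.
    scores = {url: seed_scores.get(url, init) for url in neighbors}

    for _ in range(iterations):
        updated = {}
        for url, nbrs in neighbors.items():
            if url in seed_scores or not nbrs:
                # Labeled seeds stay clamped; isolated URLs keep their score.
                updated[url] = scores[url]
            else:
                # Unlabeled URL: average the current scores of its neighbors.
                updated[url] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    return scores


# Hypothetical chain: on-topic seed - a - b - c - off-topic seed.
links = {
    "seed/on-topic": ["page/a"],
    "page/a": ["page/b"],
    "page/b": ["page/c"],
    "page/c": ["seed/off-topic"],
}
seeds = {"seed/on-topic": 1.0, "seed/off-topic": 0.0}

# Run the same iteration from two very different initial values.
from_low = propagate_relevance(links, seeds, init=0.0)
from_high = propagate_relevance(links, seeds, init=1.0)
```

In this toy graph both runs settle on the same scores for the unlabeled pages (about 0.75, 0.5, and 0.25 for `page/a`, `page/b`, and `page/c`), which illustrates the kind of insensitivity to the initial iteration value that the abstract reports. A crawler could then visit unlabeled URLs in descending score order.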
This article is indexed in CNKI, VIP (维普), and other databases.