
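Research and Implementation of a Distributed Crawler
Cite this article: MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing. Research and Implementation of a Distributed Crawler[J]. Computer Technology and Development, 2020(2): 192-196.
Authors: MA Lei  FENG Xi-wei  DOU Yu-zi  GAO Tian-zhu  ZHU Rui  WU Yan-bing
Affiliation: School of Computer and Communication Engineering, Liaoning Shihua University
Funding: Natural Science Foundation of Liaoning Province (20180550130)
Abstract: Data on the web holds a large amount of valuable information. To meet real project requirements for automatically collecting, parsing, and storing in a structured format large volumes of data from web pages, a distributed web crawler is proposed. It uses the Nutch crawler framework and the Zookeeper distributed coordination service, stores data in the high-performance key-value database Redis, and uses the Solr engine to index and display the crawled information. A page-information extraction algorithm streamlines the extraction process, and an optimized keyword-matching algorithm retrieves indicator-related data from the crawled data according to the given indicators. Building the distributed cluster, implementing the Nutch project, and collecting a large volume of data verified the feasibility of the Nutch-based distributed web crawler. Analysis of the page-parsing experiments, together with a comparison of multiple sets of experimental data between the Nutch-based distributed crawler and other crawlers, shows that the Nutch-based distributed crawler outperforms other traditional crawlers in both performance and accuracy.

Keywords: distributed cluster  NUTCH  SOLR  enterprise official website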

Research and Realization of Distributed Crawler Based on Nutch
MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing. Research and Realization of Distributed Crawler Based on Nutch[J]. Computer Technology and Development, 2020(2): 192-196.
Authors: MA Lei  FENG Xi-wei  DOU Yu-zi  GAO Tian-zhu  ZHU Rui  WU Yan-bing
Affiliation:(School of Computer and Communication Engineering,Liaoning Shihua University,Fushun 113001,China)
Abstract: Data on the web contains a great deal of valuable information. To meet practical project requirements for automatically collecting, parsing, and storing in a structured format large volumes of data from web pages, a distributed web crawler is proposed. It uses the Nutch crawler framework and the Zookeeper distributed coordination service, stores data in the high-performance key-value database Redis, and uses the Solr engine to index and display the crawled information. A page-information extraction algorithm streamlines the extraction process, and an optimized keyword-matching algorithm retrieves indicator-related data from the crawled data according to the specified indicators. Building the distributed cluster, implementing the Nutch project, and collecting a large volume of data verify the feasibility of the Nutch-based distributed web crawler. Analysis of the page-parsing experiments and a comparison of experimental data between the Nutch-based distributed crawler and other crawlers show that it outperforms traditional crawlers in both performance and accuracy.
Keywords:distributed cluster  Nutch  Solr  enterprise’s official website
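
The abstract mentions a keyword-matching step that pulls indicator-related data out of the crawled pages, but it gives no details of the algorithm. The following is only a rough illustration of the general idea, not the authors' method: a minimal Python sketch that assumes the crawled pages are already available as plain text and that each indicator is a plain keyword. The function names `split_sentences` and `match_indicators`, the sample URLs, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Split text on common Chinese and English sentence terminators.

    Simplification: a '.' inside a number or URL would also split the text.
    """
    parts = re.split(r"[。！？!?\.]\s*", text)
    return [p.strip() for p in parts if p.strip()]


def match_indicators(pages, indicators):
    """Map each indicator keyword to the (url, sentence) pairs that mention it.

    Matching is a case-insensitive substring test, an assumption made
    for this sketch rather than something stated in the abstract.
    """
    hits = defaultdict(list)
    for url, text in pages.items():
        for sentence in split_sentences(text):
            lowered = sentence.lower()
            for keyword in indicators:
                if keyword.lower() in lowered:
                    hits[keyword].append((url, sentence))
    return dict(hits)


# Hypothetical crawled pages: URL -> extracted plain text.
pages = {
    "http://example.com/about": (
        "Founded in 1998. Annual revenue reached 12 million. Contact us today!"
    ),
    "http://example.com/news": (
        "The company hired 40 engineers. Revenue grew again."
    ),
}

result = match_indicators(pages, ["revenue", "engineers"])
for keyword, matches in result.items():
    print(keyword, matches)
```

In a pipeline like the one the abstract describes, the input text would presumably come from the crawler's parsed output and the matches would be written to the storage layer; both steps are left out here because the abstract does not specify them.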
This article is indexed in VIP (维普) and other databases.