Research and Implementation of a Distributed Crawler

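Cite this article:	MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing. Research and Implementation of a Distributed Crawler[J]. Computer Technology and Development, 2020(2):192-196.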

Authors:	MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing

Affiliation:	School of Computer and Communication Engineering, Liaoning Shihua University

Funding:	Natural Science Foundation of Liaoning Province (20180550130)

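Abstract:	Data on the web holds a great deal of valuable information. To meet a practical project requirement for automatically collecting, parsing, and storing in a structured format large volumes of data from web pages, a distributed web crawler technique is proposed. It uses the Nutch crawler framework and the Zookeeper distributed coordination service, stores data in the high-performance key-value database Redis, and uses the Solr engine to index and display the crawled information clearly. A page-information extraction algorithm streamlines the extraction process, and an optimized keyword-matching algorithm retrieves indicator-related data from the crawled data according to the given indicators. Building the distributed cluster, implementing the Nutch project, and collecting a large volume of data verified the feasibility of a Nutch-based distributed web crawler. Analysis of the page-parsing experiments, together with a comparison of several sets of experimental data from the Nutch-based distributed crawler and other crawlers, shows that the Nutch-based distributed crawler outperforms other traditional crawlers in both performance and accuracy.
Keywords:	distributed cluster; Nutch; Solr; enterprise official website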
Research and Realization of Distributed Crawler Based on Nutch

MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing. Research and Realization of Distributed Crawler Based on Nutch[J]. Computer Technology and Development, 2020(2):192-196.

Authors:	MA Lei, FENG Xi-wei, DOU Yu-zi, GAO Tian-zhu, ZHU Rui, WU Yan-bing

Affiliation:	School of Computer and Communication Engineering, Liaoning Shihua University, Fushun 113001, China

Abstract:	Data on the web contains a great deal of valuable information. To meet a practical project requirement for automatically collecting, parsing, and storing in a structured format large volumes of data from web pages, a distributed web crawler technique is proposed. The Nutch crawler framework and the Zookeeper distributed coordination service are used together with the high-performance key-value database Redis, which stores the data. The Solr engine indexes and displays the crawled information clearly. A page-information extraction algorithm optimizes the process of extracting page information, and an optimized keyword-matching algorithm obtains indicator-related data from the crawled data according to the given indicators. The construction of the distributed cluster, the implementation of the Nutch project, and the collection of a large volume of data verify the feasibility of a Nutch-based distributed web crawler. Analysis of the page-parsing experiments, together with a comparison of several sets of experimental data from the Nutch-based distributed crawler and other crawlers, shows that the Nutch-based distributed crawler outperforms other traditional crawlers in both performance and accuracy.

Keywords:	distributed cluster; Nutch; Solr; enterprise's official website
This article is indexed in VIP and other databases.
