Research on Internet-Based Crawler Programs
Citation: GUO Yinfang, HAN Kai, GUO Fengming, WANG Guosheng, LI Xuemeng. Research on Internet-Based Crawler Programs [J]. 计算机应用文摘, 2022(2).
Authors: GUO Yinfang  HAN Kai  GUO Fengming  WANG Guosheng  LI Xuemeng
Affiliation: Taiyuan University, Taiyuan 030032, China
Funding: Taiyuan University College Student Innovation and Entrepreneurship Training Program (TYX2021020).
Abstract: With the rapid development of the Internet, "big data" has become a buzzword in the Internet technology industry. For anyone who needs to collect large amounts of data, a crawler is a very convenient tool. This paper introduces the principles of crawlers and methods for analyzing web pages, presents the Scrapy framework, and uses Scrapy to crawl data from a website. Data visualization tools are then used to process the data so that it can be analyzed more intuitively. Taking Lagou (拉勾网) as the crawl target, the paper summarizes the problems encountered during crawling and their solutions. In addition, the program is optimized with the Scrapy framework to improve crawling efficiency.

Keywords: focused crawler  search strategy  Scrapy framework  whole-site crawling  distributed crawling
This article is indexed in VIP (维普) and other databases.