Design and Implementation of a Multi-source Uniform-interface Crawler Framework
Cite this article: PAN Hongtao. Design and Implementation of a Multi-source Uniform-interface Crawler Framework[J]. Software Engineering (软件工程), 2021, 0(4)
Author: PAN Hongtao
Affiliation: Baoding Electric Power VOC.&TECH College, Baoding 071051, China
Abstract: The contest between crawler technology for deep web data and anti-crawler technology has swung back and forth as website technology, big data, and asynchronous transmission have developed. After a comprehensive comparison of current mainstream crawler and anti-crawler techniques, and with efficient development and fast crawling as the goals, this paper designs MUCrawler (Multi-source Uniform-interface Crawler), a Python framework that targets multiple website data sources and exposes crawler development through a uniform interface. Test results show that the framework not only gets past different anti-crawler techniques to obtain website data, but also performs well in development efficiency, robustness, and crawling efficiency.

Keywords: Python development  web crawler  browser behavior  HTTP (Hypertext Transfer Protocol) request
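
The abstract does not show MUCrawler's interface, so the following is only a minimal sketch of what "multiple data sources behind a uniform interface" could look like in Python. Every name here (`SourceAdapter`, `UnifiedCrawler`, `LineSource`, `CsvSource`, the example URLs, and the canned pages) is a hypothetical illustration, not the framework's actual API, and the HTTP layer is replaced by an injected `fetch` callable so the sketch runs offline.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List
from urllib.parse import quote


class SourceAdapter(ABC):
    """Site-specific logic (hypothetical interface, not MUCrawler's API)."""

    @abstractmethod
    def build_url(self, query: str) -> str:
        """Return the URL to request for a query."""

    @abstractmethod
    def parse(self, body: str) -> List[str]:
        """Extract records from a response body."""


class LineSource(SourceAdapter):
    """Assumed source that returns one record per line."""

    def build_url(self, query: str) -> str:
        return "https://lines.example.com/search?q=" + quote(query)

    def parse(self, body: str) -> List[str]:
        return [line.strip() for line in body.splitlines() if line.strip()]


class CsvSource(SourceAdapter):
    """Assumed source that returns comma-separated records."""

    def build_url(self, query: str) -> str:
        return "https://csv.example.org/find/" + quote(query)

    def parse(self, body: str) -> List[str]:
        return [item.strip() for item in body.split(",") if item.strip()]


class UnifiedCrawler:
    """Single entry point: callers name a source and pass a query."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # injected so HTTP details stay out of adapters
        self._adapters: Dict[str, SourceAdapter] = {}

    def register(self, name: str, adapter: SourceAdapter) -> None:
        self._adapters[name] = adapter

    def crawl(self, name: str, query: str) -> List[str]:
        adapter = self._adapters[name]
        return adapter.parse(self._fetch(adapter.build_url(query)))


# Canned responses stand in for real HTTP so the sketch runs offline.
FAKE_PAGES = {
    "https://lines.example.com/search?q=web%20crawler": "alpha\nbeta\n",
    "https://csv.example.org/find/web%20crawler": "gamma, delta",
}

crawler = UnifiedCrawler(fetch=FAKE_PAGES.__getitem__)
crawler.register("lines", LineSource())
crawler.register("csv", CsvSource())
```

With this layout, calling `crawler.crawl("lines", "web crawler")` and `crawler.crawl("csv", "web crawler")` goes through the same method even though each source has its own URL scheme and response format. Supporting another site would mean writing one more adapter rather than changing the calling code, which is one plausible reading of the development-efficiency claim.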
This article is indexed in VIP (维普) and other databases.