Study on Active Acquisition of Distributed Web Crawler Cluster
Citation: DONG Yu-long, YANG Lian-he, MA Xin. Study on Active Acquisition of Distributed Web Crawler Cluster[J]. Computer Science (计算机科学), 2018, 45(Z6): 428-432.
Authors: DONG Yu-long, YANG Lian-he, MA Xin
Affiliation: School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China (all three authors)
Abstract: To address the problems of processing efficiency, scalability, reliability, task allocation, and load balancing in current distributed web crawler methods, this paper proposes a distributed web crawler method in which nodes actively acquire tasks. A sub-control module is added to each sub-node to evaluate the node's load and running status and to actively request task queues from the central control node. On this basis, combined with a dynamic bidirectional priority task allocation algorithm, a distributed web crawler model is designed that features load balancing, tiered task allocation, rapid identification of abnormal nodes, and safe node exit. Practical tests show that the active-acquisition method can effectively use general-purpose platforms to build large-scale distributed crawler clusters.

Keywords: Active acquisition  Distributed crawler  Load balancing  Crawler framework  Multi-process  Dynamic priority
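
The pull-based idea in the abstract (a sub-control module on each node checks its own load and then asks the central control node for work) can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce their dynamic bidirectional priority algorithm, which the abstract does not specify. The class names `CentralController` and `SubControl`, the `capacity` parameter, the "lower number means higher priority" rule, and the example URLs are all assumptions made for illustration.

```python
import heapq
import itertools


class CentralController:
    """Hypothetical central control node: keeps pending URLs in a priority queue."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so tasks with equal priority are served in insertion order.
        self._counter = itertools.count()

    def add_task(self, url, priority):
        # Assumption of this sketch: a lower number means a higher priority.
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def request_tasks(self, max_tasks):
        """Hand out up to max_tasks of the highest-priority pending URLs."""
        batch = []
        while self._heap and len(batch) < max_tasks:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch

    def pending(self):
        return len(self._heap)


class SubControl:
    """Hypothetical sub-control module running on a crawler node."""

    def __init__(self, capacity):
        self.capacity = capacity  # most tasks this node is willing to hold at once
        self.local_queue = []

    def spare_capacity(self):
        # Stand-in for a real load evaluation (CPU, memory, bandwidth, ...).
        return max(0, self.capacity - len(self.local_queue))

    def pull(self, controller):
        """Actively ask the controller for work, sized by the node's current load."""
        wanted = self.spare_capacity()
        if wanted == 0:
            return 0  # node is fully loaded, so it does not request more
        batch = controller.request_tasks(wanted)
        self.local_queue.extend(batch)
        return len(batch)


controller = CentralController()
for i, prio in enumerate([2, 0, 1, 0, 2]):
    controller.add_task(f"https://example.com/page{i}", prio)

fast_node = SubControl(capacity=3)
slow_node = SubControl(capacity=1)
fast_node.pull(controller)
slow_node.pull(controller)
```

Because each node requests only as much work as its spare capacity allows, a slower node naturally receives fewer tasks, which is one simple way a pull model can yield load balancing without the controller tracking every node's state.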