Design and Implementation of Distributed Web Crawler for Open Access Journal

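Cite this article:	YANG Zhenxiong, CAI Zurui, CHEN Guohua, TANG Yong, ZHANG Long. Design and Implementation of Distributed Web Crawler for Open Access Journal [J]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2014(10): 1187-1194.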

Authors:	YANG Zhenxiong (杨镇雄), CAI Zurui (蔡祖锐), CHEN Guohua (陈国华), TANG Yong (汤庸), ZHANG Long (张龙)

Affiliation:	School of Computer, South China Normal University (华南师范大学计算机学院), Guangzhou 510000, China

Funding:	The National Natural Science Foundation of China under Grant No. 61272067; the National High Technology Research and Development Program of China under Grant No. 2013AA01A212; the National Key Technology R&D Program of China under Grant No. 2012BAH27F05; the Natural Science Foundation of Guangdong Province under Grant No. S2012030006242; the Major Scientific and Technological Project of Guangdong Province under Grant No. 2012A080104019; the Science and Technology Planning Project of Guangdong Province under Grant No. 2011B080100031.

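Abstract:	Open access (OA) journals are deep-web resources scattered across the Internet. Traditional search engines cannot index them and therefore cannot meet users' needs for OA journal resources, so these open resources are wasted. To address the problem of centrally collecting the OA journal resources dispersed across the World Wide Web, this paper proposes a distributed focused crawler architecture for OA journals. The architecture adopts a master-slave distributed design and proposes a method for extracting academic information from OA journal pages based on user-predefined rules. A master control node controls multiple crawl nodes that can be added or removed dynamically, and a plug-in mechanism based on the Chrome browser gives the distributed crawl nodes scalability and deployment flexibility.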
Keywords:	distributed crawler; open access journal; plug-in mechanism
Design and Implementation of Distributed Web Crawler for Open Access Journal

YANG Zhenxiong, CAI Zurui, CHEN Guohua, TANG Yong, ZHANG Long. Design and Implementation of Distributed Web Crawler for Open Access Journal [J]. Journal of Frontiers of Computer Science and Technology, 2014(10): 1187-1194.

Authors:	YANG Zhenxiong, CAI Zurui, CHEN Guohua, TANG Yong, ZHANG Long

Affiliation:	(School of Computer, South China Normal University, Guangzhou 510000, China)

Abstract:	Open access (OA) journals are deep-web resources dispersed across the Internet. Traditional search engines have difficulty indexing them, so users cannot reach OA journals directly through search engines, and these open resources go to waste. This paper proposes a novel focused Web crawler with a distributed architecture to collect the OA journal resources scattered throughout the Internet. The architecture adopts a master-slave design consisting of a master control center and multiple distributed crawler nodes, and it includes a method for extracting academic information from OA journals based on user-predefined rules. The crawler nodes can be added or removed dynamically, and they use a plug-in mechanism based on the Chrome browser to achieve scalability and deployment flexibility.

Keywords:	distributed Web crawler; open access journal; plug-in mechanism
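
The master-slave design and rule-based extraction summarized in the abstract can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the authors' implementation: the paper's crawl nodes are Chrome browser plug-ins, whereas this sketch runs in a single Python process with no networking. The names `Master`, `RULES`, and `extract`, the example URLs, and the regular expressions are all hypothetical.

```python
import re
from collections import deque


class Master:
    """Hypothetical master control center: owns the URL frontier and the node registry."""

    def __init__(self, seed_urls):
        self.frontier = deque(seed_urls)  # URLs waiting to be crawled
        self.seen = set(seed_urls)        # prevents queueing the same URL twice
        self.nodes = set()                # crawl nodes currently registered

    def register(self, node_id):
        """Add a crawl node at run time."""
        self.nodes.add(node_id)

    def unregister(self, node_id):
        """Remove a crawl node at run time."""
        self.nodes.discard(node_id)

    def next_task(self, node_id):
        """Return the next URL for a registered node, or None if there is no work."""
        if node_id not in self.nodes or not self.frontier:
            return None
        return self.frontier.popleft()

    def report(self, new_urls):
        """Accept links discovered by a node and queue only the unseen ones."""
        for url in new_urls:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)


# Hypothetical user-predefined rules: field name -> regex with one capture group.
RULES = {
    "title": r"<h1[^>]*>(.*?)</h1>",
    "authors": r'<meta name="authors" content="(.*?)"',
}


def extract(html, rules):
    """Apply each rule to the page and keep the first match for each field."""
    record = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        if match:
            record[field] = match.group(1).strip()
    return record


if __name__ == "__main__":
    master = Master(["http://example.org/journal"])
    master.register("node-1")
    print(master.next_task("node-1"))
    page = '<h1>Sample Article</h1><meta name="authors" content="A. Author">'
    print(extract(page, RULES))
```

Keeping the frontier and the node registry on the master means nodes can join or leave without any per-node configuration, which is one simple way to read the "dynamically adjustable" crawl nodes described above. Keeping the extraction rules as plain data is what lets a user supply them per journal site.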
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).
