
Efficient Algorithm for Crawling Ajax Web Pages

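Cite this article:	Li Huabo, Wu Lifa, Lai Haiguang, Zheng Chenghui, Huang Kangyu. Efficient Algorithm for Crawling Ajax Web Pages[J]. Journal of University of Electronic Science and Technology of China (Natural Science Edition), 2013, 42(1): 115-120.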

Authors:	Li Huabo (李华波), Wu Lifa (吴礼发), Lai Haiguang (赖海光), Zheng Chenghui (郑成辉), Huang Kangyu (黄康宇)

Affiliation:	1. College of Command Information System, PLA University of Science and Technology, Nanjing 210007

Funding:	Natural Science Foundation of Jiangsu Province (BK2010132)

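Abstract:	Generating and navigating Ajax pages requires executing client-side JavaScript code, so traditional web crawling algorithms cannot retrieve the full content of Ajax pages. This paper analyzes how Ajax works, describes the main problems in crawling Ajax pages, and proposes and implements an efficient web crawling algorithm for Ajax pages. The algorithm drives a client-side browser to generate page content dynamically and to perform page navigation, assigns an identification number to each crawled page, and generates a corresponding static page. Experimental results show that the proposed algorithm crawls significantly more Ajax pages than traditional methods, and that the dual duplicate-elimination strategy it adopts effectively reduces the algorithm's time cost.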
Keywords:	Ajax; crawling algorithm; duplicate-elimination strategy; search engine
Received:	2011-04-12
Efficient Algorithm for Crawling Ajax Web Pages

Affiliation:	1. College of Command Information System, PLAUST, Nanjing 210007

Abstract:	Both the generation of Ajax web pages and navigation between them require executing client-side JavaScript, so traditional crawling algorithms cannot extract the complete content of an Ajax page. In this paper, the working mode of Ajax is analyzed, the main problems of crawling Ajax web pages are elaborated, and an efficient algorithm for crawling Ajax pages is proposed and implemented. The algorithm controls a client browser to generate Ajax page content dynamically and to navigate between Ajax pages; it also assigns an identification number to each crawled page and generates a corresponding static page. Experimental results show that the proposed algorithm crawls clearly more Ajax pages than traditional methods, and that its dual replicas-detecting policy effectively reduces the time consumption of the algorithm.

Keywords:	Ajax; crawling algorithm; replicas-detecting policy; search engine
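
The abstract names three mechanisms: firing client-side events to reach new page states, assigning an identification number to each crawled page together with a static snapshot, and a dual duplicate-detection policy. The abstract does not give the algorithm itself, so the following is only a minimal Python sketch under stated assumptions, not the authors' implementation. The browser is replaced by a hypothetical in-memory table (`FakeApp`), and the names `fingerprint` and `crawl` are illustrative. The two checks shown (skipping an event signature that was already fired, and hashing the resulting DOM to recognize an already-seen page) are one plausible reading of "dual", not a description of the paper's policy.

```python
import hashlib
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for a browser-driven Ajax application:
# DOM snapshot -> {event signature -> DOM snapshot produced by firing it}.
FakeApp = Dict[str, Dict[str, str]]


def fingerprint(dom: str) -> str:
    """Hash a whitespace-normalized DOM snapshot (assumed notion of page identity)."""
    normalized = " ".join(dom.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def crawl(app: FakeApp, start_dom: str) -> Tuple[Dict[int, str], List[Tuple[int, str, int]]]:
    """Breadth-first crawl over page states reachable by firing events."""
    page_ids: Dict[str, int] = {}       # fingerprint -> page identification number
    snapshots: Dict[int, str] = {}      # identification number -> static snapshot
    fired_events: Set[str] = set()      # event signatures already executed
    transitions: List[Tuple[int, str, int]] = []

    def register(dom: str) -> Tuple[int, bool]:
        """Return (page id, is_new); assign the next id to an unseen page."""
        fp = fingerprint(dom)
        if fp in page_ids:
            return page_ids[fp], False
        page_id = len(page_ids) + 1
        page_ids[fp] = page_id
        snapshots[page_id] = dom
        return page_id, True

    register(start_dom)
    queue = deque([start_dom])
    while queue:
        dom = queue.popleft()
        src_id = page_ids[fingerprint(dom)]
        for event, next_dom in app.get(dom, {}).items():
            # Check 1 (assumption): do not re-fire an event signature seen before.
            if event in fired_events:
                continue
            fired_events.add(event)
            # Check 2 (assumption): recognize an already-crawled page by content hash.
            dst_id, is_new = register(next_dom)
            transitions.append((src_id, event, dst_id))
            if is_new:
                queue.append(next_dom)
    return snapshots, transitions
```

Calling `crawl(app, start_dom)` returns the numbered static snapshots and the list of `(source id, event, target id)` transitions. Check 1 trades completeness for speed: if the same event signature leads to different content from different pages, this sketch will miss the second result.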
This article is indexed in Wanfang Data and other databases.