
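Implementation of a Parallel PageRank Algorithm Based on MapReduce
Cite this article:PING Yu, XIANG Yang, ZHANG Bo, HUANG Yin-fei. Implementation of a Parallel PageRank Algorithm Based on MapReduce[J]. Computer Engineering, 2014(2):31-34,38.
Authors:PING Yu  XIANG Yang  ZHANG Bo  HUANG Yin-fei
Affiliations:[1] Department of Computer Science and Technology, Tongji University, Shanghai 201804 [2] College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234 [3] Shanghai Stock Exchange, Shanghai 200120
Funding:National Natural Science Foundation of China (61103069, 71170148); National Key Technology R&D Program of China (2012BAD35801); Shanghai Science and Technology Innovation Program (11DZl501703); Chenjia Town Smart Community and Intelligent Transportation Fund (1ldzl210600)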
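Abstract:The widespread use of distributed Web crawlers has caused the volume of data handled by search engines to grow geometrically. Faced with data at the TB or even PB level, the single-machine PageRank algorithm becomes inefficient because of its excessive CPU, I/O, and memory overhead. To address this, a parallel PageRank algorithm based on the MapReduce framework is proposed. Within one iteration of the algorithm, the Map function parses the Web page topology files and the Reduce function computes the page scores, which parallelizes the intermediate iterations of PageRank. The number of iterations is controlled by computing the global page score, which yields a more accurate page ranking. Experimental results show that the algorithm delivers good cluster performance and fast execution while preserving the overall page ranking accuracy of the original single-machine PageRank algorithm.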

Keywords:search engine  PageRank algorithm  MapReduce framework  parallel computing  Hadoop platform

Implementation of Parallel PageRank Algorithm Based on MapReduce
PING Yu,XIANG Yang,ZHANG Bo,HUANG Yin-fei.Implementation of Parallel PageRank Algorithm Based on MapReduce[J].Computer Engineering,2014(2):31-34,38.
Authors:PING Yu  XIANG Yang  ZHANG Bo  HUANG Yin-fei
Affiliation:1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; 2. College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China; 3. Shanghai Stock Exchange, Shanghai 200120, China
Abstract:The widespread use of distributed Web crawlers has greatly expanded the scale of Web data that search engines must handle. Because PageRank must process the topology of the entire page set, CPU, I/O, and memory limits become a serious bottleneck on a single machine when the data reaches the TB or PB level. To address this problem, this paper proposes a parallel PageRank algorithm based on MapReduce. In each iteration, the Map function parses the files containing the topology of the Web page graph and the Reduce function calculates the page scores, which parallelizes the intermediate iterations of PageRank. The global Web page score is used as the convergence criterion to control the number of iterations and obtain a more precise page ranking. Experimental results show that the algorithm achieves good cluster performance and fast execution speed while keeping the overall page ranking accuracy of the single-machine PageRank algorithm.
Keywords:search engine  PageRank algorithm  MapReduce framework  parallel computing  Hadoop platform
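
The iteration described in the abstract (Map parses the topology records, Reduce computes the page scores, a global score check decides when to stop) can be sketched as follows. This is a minimal single-process simulation and not the authors' Hadoop implementation: the key/value layout, the damping factor of 0.85, the stopping rule (total absolute change in scores below `tol`), and all function names are assumptions made for illustration. It also assumes every page has at least one out-link, so no rank is lost at dangling pages.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor; the abstract does not state one


def map_page(page, rank, out_links):
    """Map step: parse one topology record and emit (key, value) pairs."""
    # Re-emit the link list so the structure survives into the next iteration.
    yield page, ("links", out_links)
    # Split this page's current rank evenly among the pages it links to.
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def reduce_page(values, num_pages):
    """Reduce step: sum incoming contributions and compute the new score."""
    incoming = sum(payload for kind, payload in values if kind == "rank")
    return (1.0 - DAMPING) / num_pages + DAMPING * incoming


def run_iteration(graph, ranks):
    """Simulate one MapReduce job locally: map, group by key, reduce."""
    shuffled = defaultdict(list)
    for page, out_links in graph.items():
        for key, value in map_page(page, ranks[page], out_links):
            shuffled[key].append(value)
    return {page: reduce_page(shuffled[page], len(graph)) for page in graph}


def pagerank(graph, tol=1e-6, max_iter=100):
    """Repeat iterations until the global change in scores drops below tol."""
    ranks = {page: 1.0 / len(graph) for page in graph}
    iterations = 0
    for iterations in range(1, max_iter + 1):
        new_ranks = run_iteration(graph, ranks)
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in graph)
        ranks = new_ranks
        if delta < tol:  # assumed global convergence check
            break
    return ranks, iterations


if __name__ == "__main__":
    # Hypothetical three-page graph: A links to B and C, B to C, C to A.
    sample = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    scores, rounds = pagerank(sample)
    print(rounds, sorted(scores.items(), key=lambda kv: -kv[1]))
```

On a real cluster each call to `run_iteration` would correspond to one MapReduce job, with the driver program reading the aggregated score change between jobs to decide whether to launch another round.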
This article is indexed in CNKI, VIP, and other databases.