
Parallel Processing Algorithms for Facts Based on Hadoop Platform (基于Hadoop平台的事实并行处理算法)

Citation:	SUN Li, HE Gang, LI Ji-yun. Parallel Processing Algorithms for Facts Based on Hadoop Platform[J]. Computer Engineering (计算机工程), 2014(3): 59-62, 81.

Authors:	SUN Li (孙莉), HE Gang (何刚), LI Ji-yun (李继云)

Affiliation:	School of Computer Science and Technology, Donghua University, Shanghai 201620, China

Abstract:	Traditional extract, transform, load (ETL) tools are inefficient when faced with the massive fact data in a data warehouse. To address this, the paper approaches the problem from two angles, surrogate key lookup for the fact table and multi-granularity pre-aggregation of facts, and proposes a multi-way parallel lookup algorithm on slowly changing dimension tables and an algorithm that aggregates fact data at different granularities, both based on the Hadoop platform. The first algorithm takes both slowly changing dimensions and large dimensions into account. It uses the distributed cache to copy small dimension tables into the memory of every data node, and it partitions the fact data and the large dimension data with the same partition function, which solves the problem of insufficient memory. The multi-way surrogate key lookup is performed in the Map stage, which avoids the network delay caused by data transfer. The second algorithm adds a Merge stage after the Reduce stage, which effectively solves the problem of aggregating fact data at different granularities. Experimental results show that, compared with the Hive data warehouse, the two algorithms are more efficient at processing data warehouse fact data in parallel.

Keywords:	MapReduce model; dimension; fact; surrogate key; parallel lookup; aggregation
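
The two ideas in the abstract can be illustrated with a small in-process sketch. It is not the authors' implementation and does not run on Hadoop: plain Python dictionaries stand in for the distributed cache and for partitions, and every name, the sample schema (product as the small dimension, customer as the large one), the validity-interval layout of the slowly changing dimension rows, the partition count, and the day-to-month roll-up in the merge step are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date
import zlib

NUM_PARTITIONS = 2  # hypothetical partition count


def partition(business_key):
    """Deterministic partition function shared by facts and the large dimension."""
    return zlib.crc32(business_key.encode("utf-8")) % NUM_PARTITIONS


def lookup(versions, business_key, on_date):
    """Return the surrogate key of the dimension version valid on on_date."""
    for surrogate_key, valid_from, valid_to in versions.get(business_key, []):
        if valid_from <= on_date < valid_to:
            return surrogate_key
    return None  # no version covers this date


def map_lookup(facts, small_dim, big_dim_rows):
    """Map-side multi-way lookup: resolve both dimension keys without a shuffle."""
    # Co-partition the large dimension so each partition holds only its slice.
    big_slices = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
    for key, sk, start, end in big_dim_rows:
        big_slices[partition(key)][key].append((sk, start, end))

    resolved = []
    for product, customer, day, amount in facts:
        p = partition(customer)  # same function, so the fact meets its slice
        product_sk = lookup(small_dim, product, day)        # cached small dimension
        customer_sk = lookup(big_slices[p], customer, day)  # local slice only
        resolved.append((product_sk, customer_sk, day, amount))
    return resolved


def reduce_by_day(rows):
    """Reduce step: aggregate at the finest granularity (product, day)."""
    totals = defaultdict(float)
    for product_sk, _customer_sk, day, amount in rows:
        totals[(product_sk, day)] += amount
    return dict(totals)


def merge_to_month(daily):
    """Merge step: roll the daily totals up to (product, year, month)."""
    totals = defaultdict(float)
    for (product_sk, day), amount in daily.items():
        totals[(product_sk, day.year, day.month)] += amount
    return dict(totals)


# Hypothetical sample data: product P1 changes version on 2013-06-01.
small_dim = {"P1": [(10, date(2013, 1, 1), date(2013, 6, 1)),
                    (11, date(2013, 6, 1), date.max)]}
big_dim_rows = [("C1", 100, date(2013, 1, 1), date.max),
                ("C2", 200, date(2013, 1, 1), date.max)]
facts = [("P1", "C1", date(2013, 3, 5), 20.0),
         ("P1", "C2", date(2013, 3, 5), 5.0),
         ("P1", "C2", date(2013, 3, 9), 1.0),
         ("P1", "C1", date(2013, 7, 1), 7.5)]

resolved = map_lookup(facts, small_dim, big_dim_rows)
daily = reduce_by_day(resolved)
monthly = merge_to_month(daily)
```

Because the coarser totals are derived from the already reduced daily totals, the fact data is scanned once however many granularities are produced; whether the paper's Merge stage works this way is not stated in the abstract.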
This article is indexed in VIP (维普) and other databases.
