
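A Columnar OLAP Query Execution Engine Based on Triplet Storage
Cite this article: Zhu Yue-An, Zhang Yan-Song, Zhou Xuan, Wang Shan. A Columnar OLAP Query Execution Engine Based on Triplet Storage[J]. Journal of Software, 2014, 25(4): 753-767.
Authors: Zhu Yue-An, Zhang Yan-Song, Zhou Xuan, Wang Shan
Affiliations: Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education (Renmin University of China), Beijing 100872; School of Information, Renmin University of China, Beijing 100872; National Survey Research Center at Renmin University of China, Beijing 100872
Funding: National Science and Technology Major Project (Core Electronic Devices, High-end Generic Chips and Basic Software) (2010ZX01042-001-002); National Natural Science Foundation of China (61272138, 61232007); Graduate Scientific Research Fund of Renmin University of China (13XNH216)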
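Abstract: Combining big data with traditional data warehouse technology creates a need for real-time analytical processing of big data (volume + velocity). A data warehouse in this setting cannot rely heavily on optimizations with high storage cost, such as materialization and indexing; it must instead improve its real-time processing capability to cope with the large data volumes and complex analytical queries typical of big data analysis. These queries generally take the form of grouping and aggregation over the result of joining the fact table with dimension tables, so table joins and grouping/aggregation are the two key determinants of ROLAP (relational OLAP) performance. This paper studies the performance of OLAP queries over large-scale data on new hardware platforms and designs a new column-store OLAP query execution engine, CDDTA-MMDB (columnar direct dimensional tuple access-main memory database query execution engine). Its triplet-based materialization strategy lets CDDTA-MMDB reduce the number of accesses to base tables and intermediate data structures made by joins under the in-memory column-store model. First, CDDTA-MMDB decomposes a query into sub-queries on the dimension tables and the fact table. A sub-query that involves only filtering produces <surrogate key, Boolean value> pairs; otherwise it produces <surrogate key, key, value> triplets. Then a single scan of the fact table suffices: the foreign-key mapping function of the fact table locates the corresponding triplet or pair directly, and the filtering, join, or aggregation is completed. CDDTA-MMDB takes full account of the design principles of in-memory column-store databases and minimizes random memory access. Experimental results show that CDDTA-MMDB is efficient: compared with representative column-store databases, it is 2.5 times faster than MonetDB 5.5 and 5 times faster than the invisible join of C-store. In addition, CDDTA-MMDB achieves linear speedup on multi-core processors.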

Keywords: big data analysis; online analytical processing (OLAP); main-memory columnar database; join algorithm; materialization strategy
Received: 2013/10/13
Revised: 2014/1/27

Column-Oriented Query Execution Engine for OLAP Based on Triplet
ZHU Yue-An, ZHANG Yan-Song, ZHOU Xuan and WANG Shan. Column-Oriented Query Execution Engine for OLAP Based on Triplet[J]. Journal of Software, 2014, 25(4): 753-767.
Authors: ZHU Yue-An, ZHANG Yan-Song, ZHOU Xuan and WANG Shan
Affiliation: Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education (Renmin University of China), Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China; National Survey Research Center at Renmin University of China, Beijing 100872, China
Abstract: Integrating big data with traditional data warehouse (DW) techniques creates a demand for real-time big data analysis. This demand means that a DW cannot depend heavily on space-consuming optimizations such as materialization and indexing; instead, it must strengthen its real-time analysis capability to handle big data workloads, which typically issue complex queries over huge data volumes. Such queries usually apply grouping and aggregation operators to the result of joining the fact table with one or more dimension tables, so the join and grouping operations are often the performance bottlenecks. This paper studies OLAP performance on new hardware platforms in a big data environment and develops a new OLAP query execution engine for columnar storage, called CDDTA-MMDB (columnar direct dimensional tuple access for main memory database query execution engine). Its optimized materialization strategy reduces accesses to base tables and intermediate data structures during the join. CDDTA-MMDB decomposes a query into sub-queries on the fact table and on the dimension tables. If a sub-query on a dimension table serves only as a filter, it produces binary tuples of the form <surrogate,Boolean_value>; otherwise, it produces triplets of the form <surrogate,key,value>. The executor then scans the fact table in a single pass and uses the foreign keys in the fact table as a mapping function to access the binary tuples or triplets directly, which completes the join, filter and grouping operations. The design follows the principles of main-memory columnar databases throughout, in particular by keeping random memory access to a minimum. Experimental results show that the system is efficient: it can be 2.5 times faster than MonetDB 5.5 and 5 times faster than the invisible join used by C-store. Moreover, it scales linearly on multi-core processors.
Keywords: big data analysis; OLAP; main-memory columnar database; join algorithm; materialization
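
The execution flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The toy star schema, the column values, and the function names (`filter_pairs`, `group_triplets`, `one_pass_scan`) are all hypothetical. The sketch assumes that surrogate keys are dense integers starting at 0, so that a fact-table foreign key can index a dimension vector directly, and it reads the triplet's "key" and "value" as a grouping attribute name and its value, which is an interpretation rather than something the abstract states.

```python
from collections import defaultdict

# Hypothetical star schema held column-wise. Surrogate keys are assumed to be
# dense integers starting at 0, so a foreign key doubles as an array index.
date_year = [1997, 1997, 1998]                 # date dimension, column "year"
customer_region = ["ASIA", "EUROPE", "ASIA"]   # customer dimension, column "region"

fact_date_fk = [0, 1, 2, 2, 0]
fact_cust_fk = [0, 1, 2, 0, 2]
fact_revenue = [10, 20, 30, 40, 50]


def filter_pairs(column, predicate):
    """Filter-only sub-query: slot i is the Boolean of the pair <i, Boolean>."""
    return [predicate(v) for v in column]


def group_triplets(name, column, predicate):
    """Grouping sub-query: slot i holds (key, value) of the triplet
    <i, key, value>, or None when the dimension row fails the predicate."""
    return [(name, v) if predicate(v) else None for v in column]


def one_pass_scan(cust_ok, date_groups):
    """Scan the fact columns once, probing the dimension vectors by foreign key."""
    totals = defaultdict(int)
    for d_fk, c_fk, revenue in zip(fact_date_fk, fact_cust_fk, fact_revenue):
        if not cust_ok[c_fk]:          # filter via the <surrogate, Boolean> pair
            continue
        group = date_groups[d_fk]      # join via the <surrogate, key, value> triplet
        if group is None:
            continue
        _, year = group
        totals[year] += revenue        # aggregate in the same pass
    return dict(totals)


# Roughly: SELECT year, SUM(revenue) ... WHERE region = 'ASIA' AND year >= 1997
# GROUP BY year
cust_ok = filter_pairs(customer_region, lambda r: r == "ASIA")
date_groups = group_triplets("year", date_year, lambda y: y >= 1997)
result = one_pass_scan(cust_ok, date_groups)
print(result)
```

In this sketch the dimension vectors are built once per query and are small relative to the fact table, so the only large sequential access is the single fact-table scan; no hash table is built or probed for the join.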
This article is indexed by CNKI and other databases.