Relational algebraic operation algorithm on compressed data
Cite this article:DING Xinzhe, ZHANG Zhaogong, LI Jianzhong, TAN Long, LIU Yong. Relational algebraic operation algorithm on compressed data[J]. Journal of Computer Applications, 2016, 36(1): 21-26.
Authors:DING Xinzhe  ZHANG Zhaogong  LI Jianzhong  TAN Long  LIU Yong
Affiliation:1. College of Computer Science and Technology, Heilongjiang University, Harbin Heilongjiang 150080, China;2. College of Computer Science and Technology, Harbin Institute of Technology, Harbin Heilongjiang 150010, China
Funding:National Natural Science Foundation of China (81273649); Natural Science Foundation of Heilongjiang Province (F201434).
Abstract:To support operations on compressed data without prior decompression in massive data management, a new column-oriented compression method called CCA (Column Compression Algorithm) was proposed, based on the characteristics of column storage and under the assumption that the data follow a normal distribution. Firstly, the column data were classified by length; secondly, sampling was used to obtain frequently repeated prefixes; finally, dictionary coding was applied for compression. The Column Index (CI) and Column Reality (CR) were proposed as the compressed data structures to reduce the storage requirements of massive data, so that basic relational algebraic operations such as select, project and join were supported directly and efficiently on the compressed data. A prototype database system based on CCA, called D-DBMS (Ding-Database Management System), was implemented. Theoretical analysis and experimental results on 1 TB of data show that the proposed compression algorithm can significantly improve the storage efficiency and data manipulation performance for massive data. Compared with the BAP (Bit Address Physical) and TIDC (TupleID Center) methods, CCA improved the compression rate by 51% and 14%, and the running speed by 47% and 42%, respectively.

Keywords:massive data compression  Column Index (CI)  Column Reality (CR)  relational algebraic operation
Received:2015-07-27
Revised:2015-08-04
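
The three steps named in the abstract (classify by length, sample for frequent prefixes, dictionary-encode) can be illustrated with a minimal sketch. This is not the authors' CCA and does not reproduce the CI or CR structures, whose layouts the abstract does not describe. The function names, the fixed `prefix_len`, the `sample_size` and `top_k` parameters, and the `(prefix_code, remainder)` encoding are all hypothetical choices made for illustration only.

```python
import random
from collections import Counter, defaultdict


def build_prefix_dictionary(values, prefix_len=3, sample_size=100, top_k=4, seed=0):
    """Sample the values and map the most frequent prefixes to small integer codes.

    All parameters are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    counts = Counter(v[:prefix_len] for v in sample if len(v) >= prefix_len)
    return {prefix: code for code, (prefix, _) in enumerate(counts.most_common(top_k))}


def encode(value, prefix_dict, prefix_len=3):
    """Encode a value as (prefix_code, remainder); code None means no dictionary hit."""
    code = prefix_dict.get(value[:prefix_len])
    if code is None:
        return (None, value)
    return (code, value[prefix_len:])


def compress_column(column, prefix_len=3):
    """Group row ids by value length, then prefix-encode each group separately."""
    groups = defaultdict(list)
    for row, value in enumerate(column):
        groups[len(value)].append(row)
    compressed = {}
    for length, rows in groups.items():
        values = [column[r] for r in rows]
        prefix_dict = build_prefix_dictionary(values, prefix_len)
        encoded = [encode(v, prefix_dict, prefix_len) for v in values]
        compressed[length] = (prefix_dict, rows, encoded)
    return compressed


def select_equal(compressed, target, prefix_len=3):
    """Equality selection that compares encoded pairs and never decodes stored values."""
    group = compressed.get(len(target))
    if group is None:
        return []  # no stored value has this length, so nothing can match
    prefix_dict, rows, encoded = group
    encoded_target = encode(target, prefix_dict, prefix_len)
    return [row for row, item in zip(rows, encoded) if item == encoded_target]
```

In this sketch the query constant is encoded once and compared against the stored pairs, which is the general idea behind running a selection on dictionary-compressed data. Grouping by length also lets the selection skip every group whose length differs from that of the constant.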
This article is indexed in Wanfang Data and other databases.