
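Distributed deduplication storage system based on Hadoop platform
Cite this article: LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin. Distributed deduplication storage system based on Hadoop platform[J]. Journal of Computer Applications, 2016, 36(2): 330-335.
Authors: LIU Qing  FU Yinjin  NI Guiqiang  MEI Jianmin
Affiliation: College of Command Information System, PLA University of Science and Technology, Nanjing 210007
Funding: National 863 Program of China (2012AA01A509, 2012AA01A510); National Natural Science Foundation of China (61402518).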
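Abstract: To address the large amount of redundant data in data centers, and in particular the storage capacity wasted by backup data, a distributed data deduplication solution based on the Hadoop platform is proposed. The solution detects and eliminates redundant data within a given data set, which significantly reduces the storage capacity required and improves storage space utilization. Using two data management modes of the Hadoop big data processing platform, the Hadoop Distributed File System (HDFS) and the non-relational database HBase, a scalable distributed deduplication storage system is designed and implemented. In this system, the MapReduce parallel programming framework performs distributed parallel deduplication, HDFS stores the data that remains after deduplication, and an index table built in HBase provides efficient chunk index lookup. Finally, the system was tested with a data set of virtual machine image files; the results show that the Hadoop-based distributed deduplication system achieves high throughput and good scalability while maintaining a high deduplication ratio.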

Keywords: deduplication; distributed storage; Hadoop; HBase; Hadoop Distributed File System
Received: 2015-09-15
Revised: 2015-09-22

Distributed deduplication storage system based on Hadoop platform
LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin. Distributed deduplication storage system based on Hadoop platform[J]. Journal of Computer Applications, 2016, 36(2): 330-335.
Authors:LIU Qing  FU Yinjin  NI Guiqiang  MEI Jianmin
Affiliation:College of Command Information System, PLA University of Science and Technology, Nanjing Jiangsu 210007, China
Abstract: To address the large amount of redundant data in data centers, especially the storage space wasted by backup data, a deduplication prototype based on the Hadoop platform was proposed. Deduplication technology, which detects and eliminates redundant data within a particular data set, can greatly reduce the storage capacity required and improve the utilization of storage space. Using two big data management tools, Hadoop Distributed File System (HDFS) and the non-relational database HBase, a scalable distributed deduplication storage system was designed and implemented. In this system, the MapReduce parallel programming framework was responsible for parallel deduplication, and HDFS was responsible for storing the data after deduplication. The index table was stored in HBase for efficient chunk fingerprint indexing. The system was tested with sets of virtual machine image files. The results demonstrate that the Hadoop-based distributed deduplication system achieves high throughput and excellent scalability while guaranteeing a high deduplication rate.
Keywords: deduplication; distributed storage; Hadoop; HBase; Hadoop Distributed File System (HDFS)
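
The abstract describes chunk-level deduplication backed by a fingerprint index. A minimal single-process sketch of that idea follows; it is not the authors' implementation. The abstract does not state the chunking method or the hash function, so fixed-size chunking and SHA-1 are assumptions here. The class name `DedupStore` is hypothetical, and plain in-memory structures stand in for the HBase index table and for HDFS storage.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store; illustrative only, not the paper's system."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumption: fixed-size chunking
        self.index = {}               # fingerprint -> chunk id (stand-in for the HBase index table)
        self.chunks = []              # unique chunk payloads (stand-in for HDFS storage)
        self.recipes = {}             # file name -> ordered list of fingerprints
        self.logical_bytes = 0        # total bytes written before deduplication

    def put(self, name, data):
        """Split data into chunks and store only chunks not seen before."""
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()  # assumption: SHA-1 fingerprint
            if fp not in self.index:
                self.index[fp] = len(self.chunks)
                self.chunks.append(chunk)
            recipe.append(fp)
        self.recipes[name] = recipe
        self.logical_bytes += len(data)

    def get(self, name):
        """Rebuild a file from its fingerprint recipe."""
        return b"".join(self.chunks[self.index[fp]] for fp in self.recipes[name])

    def dedup_ratio(self):
        """Logical bytes written divided by bytes physically stored."""
        stored = sum(len(c) for c in self.chunks)
        return self.logical_bytes / stored if stored else 1.0


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    store.put("vm1.img", b"AAAABBBBCCCC")
    store.put("vm2.img", b"AAAABBBBDDDD")
    print(len(store.chunks), store.dedup_ratio())
```

In this toy run the two files share two of their three chunks, so four unique chunks are stored for 24 logical bytes. In a distributed setting the fingerprint lookup would go to a shared index rather than a local dictionary, which is the role the abstract assigns to HBase.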