Research on access optimization of small files in massive sample data sets
Original title: 海量样本数据集中小文件的存取优化研究
Citation: MA Zhen, HALIDAN Abudureyimu, LI Xitong. Research on access optimization of small files in massive sample data sets[J]. Computer Engineering and Applications (计算机工程与应用), 2018, 54(22): 80-84.
Authors: MA Zhen (马振)  HALIDAN Abudureyimu (哈力旦·阿布都热依木)  LI Xitong (李希彤)
Affiliation: School of Electrical Engineering, Xinjiang University, Urumchi 830047, China
Abstract: The Hadoop Distributed File System (HDFS) consumes a large amount of memory and reads inefficiently when it stores massive sample data sets, and the distributed database HBase develops access hotspots when the stored file names are highly repetitive and similar. To address both problems, this paper proposes an access optimization scheme tailored to the characteristics and types of sample data sets, covering the strategies for writing, reading, adding, deleting, and replacing small files. The scheme measures the boundary between large and small files for the given hardware configuration, merges small files into HDFS with a variable-scale stack algorithm that follows the directory structure of the sample data set, stores the file index in an HBase table using a row-key optimization strategy, and builds a prefetching mechanism on the Ehcache caching framework. Experimental results show that the scheme reduces the memory consumption of the master node, improves file read efficiency, and achieves efficient access to small files in massive sample data sets.

Keywords: Hadoop Distributed File System (HDFS)  small file  sample data set  cache prefetch  distributed database  HBase
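
The abstract names two mechanisms that are easier to picture with a small example: packing small files into larger merged files with an offset index, and shaping HBase row keys so that similar file names do not land on the same region. The sketch below is illustrative only and makes several assumptions. The abstract does not describe the variable-scale stack algorithm or the row-key strategy in detail, so the code substitutes a plain sequential packing and a generic hash-prefix ("salted") row key. The threshold and capacity values, the `merged_NNNN` naming, and all function names are hypothetical and do not come from the paper. No Hadoop or HBase client is used; the code only computes the index that would be stored.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical values: the paper measures the large/small boundary from the
# hardware configuration; these constants are placeholders for illustration.
SMALL_FILE_THRESHOLD = 4 * 1024 * 1024    # bytes
MERGED_FILE_CAPACITY = 64 * 1024 * 1024   # bytes


@dataclass(frozen=True)
class IndexEntry:
    """One index record: where a small file lives inside a merged file."""
    row_key: str
    merged_file: str
    offset: int
    length: int


def salted_row_key(path: str, buckets: int = 16) -> str:
    """Prefix the path with a hash-derived bucket number.

    Generic salting, assumed here as a stand-in for the paper's row-key
    optimization: near-identical names spread over `buckets` key ranges,
    and a reader can recompute the prefix from the path alone.
    """
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % buckets
    return f"{bucket:02d}_{path}"


def pack_small_files(files, directory,
                     threshold=SMALL_FILE_THRESHOLD,
                     capacity=MERGED_FILE_CAPACITY):
    """Assign each small file an (merged file, offset, length) slot.

    `files` is an iterable of (path, size_in_bytes) pairs from one directory.
    Files at or above `threshold` are returned separately so they can be
    stored in HDFS unmerged. This is simple sequential packing, not the
    paper's variable-scale stack algorithm.
    """
    index, large_files = [], []
    sequence, offset = 0, 0
    for path, size in sorted(files):
        if size >= threshold:
            large_files.append(path)
            continue
        # Start a new merged file when the current one would overflow.
        if offset > 0 and offset + size > capacity:
            sequence += 1
            offset = 0
        merged_name = f"{directory}/merged_{sequence:04d}"
        index.append(IndexEntry(salted_row_key(path), merged_name, offset, size))
        offset += size
    return index, large_files
```

Each `IndexEntry` corresponds to one row that could be written to an index table, keyed by `row_key`. To read a file back, a client would recompute `salted_row_key(path)`, fetch that row, and read `length` bytes at `offset` from `merged_file`.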