Research on access optimization of small files in massive sample data sets

Citation:	MA Zhen, HALIDAN Abudureyimu, LI Xitong. Research on access optimization of small files in massive sample data sets[J]. Computer Engineering and Applications, 2018, 54(22): 80-84.

Authors:	MA Zhen, HALIDAN Abudureyimu, LI Xitong

Affiliation:	School of Electrical Engineering, Xinjiang University, Urumchi 830047, China

Abstract:	The Hadoop Distributed File System (HDFS) consumes a large amount of memory and reads inefficiently when it stores massive sample data sets, and the distributed database HBase develops access hotspots when the stored file names are highly repetitive and similar. To address both problems, this paper takes the characteristics and types of sample data sets into account and proposes an access optimization scheme that improves the strategies for writing, reading, adding, deleting, and replacing small files in a sample data set. The scheme measures the size threshold that separates large files from small files according to the hardware configuration; merges small files along the directory structure of the sample data set with a variable-scale stack algorithm and stores the merged files in HDFS; stores the file index in an HBase table using a row-key optimization strategy; and builds a prefetching mechanism on the Ehcache caching framework. Experimental results show that the scheme reduces memory consumption on the master node, improves file read efficiency, and achieves efficient access to small files in massive sample data sets.

Keywords:	Hadoop Distributed File System (HDFS); small file; sample data set; cache prefetching; distributed database HBase
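
The merge-and-index idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it does not reproduce their variable-scale stack algorithm, and the hash-prefix ("salted") row key is a generic way to spread similar file names across key ranges, assumed here as a stand-in for their unspecified row-key strategy. An in-memory byte string stands in for the merged HDFS file and a dict stands in for the HBase index table. The names `SMALL_FILE_LIMIT`, `NUM_SALT_BUCKETS`, `salted_row_key`, `merge_small_files`, and `read_small_file` are all hypothetical.

```python
import io
import zlib
from typing import Dict, Iterable, Tuple

# Hypothetical values: the paper measures the large/small threshold per
# hardware configuration, and the bucket count here is arbitrary.
SMALL_FILE_LIMIT = 4 * 1024 * 1024
NUM_SALT_BUCKETS = 16


def salted_row_key(path: str, buckets: int = NUM_SALT_BUCKETS) -> str:
    """Prefix the path with a hash-derived bucket so that similar
    names do not sort next to each other in the index."""
    bucket = zlib.crc32(path.encode("utf-8")) % buckets
    return f"{bucket:02d}|{path}"


def merge_small_files(
    files: Iterable[Tuple[str, bytes]],
    limit: int = SMALL_FILE_LIMIT,
) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate files smaller than `limit` into one blob and
    record (offset, length) for each under its salted row key."""
    blob = io.BytesIO()
    index: Dict[str, Tuple[int, int]] = {}
    for path, data in files:
        if len(data) >= limit:
            continue  # large files would be stored on their own
        offset = blob.tell()
        blob.write(data)
        index[salted_row_key(path)] = (offset, len(data))
    return blob.getvalue(), index


def read_small_file(
    blob: bytes, index: Dict[str, Tuple[int, int]], path: str
) -> bytes:
    """Look up the file's slice in the index and cut it out of the blob."""
    offset, length = index[salted_row_key(path)]
    return blob[offset:offset + length]
```

Because the name node then tracks one merged file instead of many small ones, its metadata footprint shrinks, which is the memory effect the abstract reports; reading a single file costs one index lookup plus one ranged read.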
