Optimization Method for Storing Massive Small Files in Multi-modal Medical Data
Citation: ZENG Meng, ZOU Bei-Ji, ZHANG Wen-Sheng, YANG Xue-Bing, ZHU Cheng-Zhang. Optimization Method for Storing Massive Small Files in Multi-modal Medical Data[J]. Journal of Software, 2023, 34(3): 1451-1469.
Authors: ZENG Meng  ZOU Bei-Ji  ZHANG Wen-Sheng  YANG Xue-Bing  ZHU Cheng-Zhang
Affiliation: School of Computer Science and Engineering, Central South University, Changsha 410083, China; Hunan Engineering Research Center of Machine Vision and Intelligent Medicine (Central South University), Changsha 410083, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Literature and Journalism, Central South University, Changsha 410083, China; Hunan Engineering Research Center of Machine Vision and Intelligent Medicine (Central South University), Changsha 410083, China
Foundation items: Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102100); Science and Technology Program of Hunan Province (2017WK2074); Science and Technology Innovation Leading Plan for High-tech Industry of Hunan Province (2020GK2021)
Abstract: The Hadoop distributed file system (HDFS) is designed for storing and managing large files; when massive numbers of small files are stored and processed, it consumes a large amount of NameNode memory and access time, which becomes an important factor restricting HDFS performance. To address the massive-small-file problem in multi-modal medical data, a storage optimization method for massive small files based on two-layer hash coding and HBase is proposed. When small files are merged, an extendible hash function is used to build the buckets of the index file, so that the index file can be expanded dynamically as needed and file appending is supported. Within each bucket, an MWHC hash function is used to store the position of each file's index information in the index file; when a file is accessed, only the index information of the corresponding bucket, rather than that of all files, needs to be read, so a file can be located in O(1) time and lookup efficiency is improved. To meet the storage needs of multi-modal medical data, HBase is used to store the file index information, and an identifier column is set to mark the modality of each record, which facilitates the storage and management of data of different modalities and improves file reading speed. To further optimize storage performance, an LRU-based metadata prefetching mechanism is established, and the merged files are compressed with the LZ4 algorithm before being stored. Experiments comparing file access performance and NameNode memory usage show that, compared with the original HDFS, HAR, MapFile, TypeStorage, and HPF small-file storage schemes, the proposed method achieves shorter file access time and can improve the overall performance of HDFS when processing massive small files in multi-modal medical data.
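A minimal sketch of the extendible-hashing step used when small files are merged: index entries are placed into buckets through a directory that doubles when a bucket overflows, so new files can keep being appended. The class and field names (ExtendibleIndex, Bucket, BUCKET_CAPACITY) and the purely in-memory layout are illustrative assumptions; the paper's on-disk index-file format is not reproduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an extendible-hashing directory for index-file buckets (names are assumptions). */
public class ExtendibleIndex {
    private static final int BUCKET_CAPACITY = 4;       // illustrative value, not the paper's

    private static class Bucket {
        int localDepth = 1;
        Map<String, Long> entries = new HashMap<>();     // small file name -> offset in merged file
    }

    private int globalDepth = 1;
    private final List<Bucket> directory = new ArrayList<>();

    public ExtendibleIndex() {
        directory.add(new Bucket());
        directory.add(new Bucket());
    }

    private int dirIndex(String key) {
        // use the low-order globalDepth bits of the hash as the directory index
        return key.hashCode() & ((1 << globalDepth) - 1);
    }

    /** Append a new small file's index entry; split buckets and double the directory as needed. */
    public void put(String fileName, long offset) {
        Bucket b = directory.get(dirIndex(fileName));
        if (b.entries.size() < BUCKET_CAPACITY || b.entries.containsKey(fileName)) {
            b.entries.put(fileName, offset);
            return;
        }
        if (b.localDepth == globalDepth) {               // directory must double before the split
            directory.addAll(new ArrayList<>(directory));
            globalDepth++;
        }
        Bucket b1 = new Bucket(), b2 = new Bucket();     // split the overflowing bucket
        b1.localDepth = b2.localDepth = b.localDepth + 1;
        int highBit = 1 << b.localDepth;
        for (Map.Entry<String, Long> e : b.entries.entrySet()) {
            int idx = e.getKey().hashCode() & ((1 << b1.localDepth) - 1);
            ((idx & highBit) == 0 ? b1 : b2).entries.put(e.getKey(), e.getValue());
        }
        for (int i = 0; i < directory.size(); i++) {
            if (directory.get(i) == b) {
                directory.set(i, (i & highBit) == 0 ? b1 : b2);
            }
        }
        put(fileName, offset);                           // retry the insert after the split
    }

    public Long get(String fileName) {
        return directory.get(dirIndex(fileName)).entries.get(fileName);
    }
}
```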
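Inside each bucket, the method stores the position of every file's index entry with an MWHC hash function so that a lookup touches only that bucket. The sketch below shows only the query path and assumes the per-bucket g[] table has already been built (the hypergraph-peeling construction of MWHC hashing is omitted); the CRC32-based hash functions and the class name MwhcLookup are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Query side of an MWHC-style hash for one bucket; the construction step is omitted. */
public class MwhcLookup {
    private final int[] g;   // per-bucket table, assumed to be precomputed during merging
    private final int m;     // size of g[]
    private final int n;     // number of index entries in the bucket

    public MwhcLookup(int[] g, int n) {
        this.g = g;
        this.m = g.length;
        this.n = n;
    }

    private int hash(String key, int seed) {
        CRC32 crc = new CRC32();
        crc.update((seed + ":" + key).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % m);
    }

    /** Returns the slot of the key's index entry inside the bucket's index region, in O(1). */
    public int position(String fileName) {
        // MWHC-style evaluation: combine three precomputed g-values and reduce modulo n
        return (g[hash(fileName, 1)] + g[hash(fileName, 2)] + g[hash(fileName, 3)]) % n;
        // multiplying by the fixed entry size would give the byte offset within the bucket
    }
}
```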
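A sketch of how a small file's index entry could be written to and read back from HBase with an identifier column marking its modality, using the standard HBase client API. The table name small_file_index, the column family idx, the column qualifiers, and the sample values are assumptions, and running it requires a reachable HBase cluster; the paper does not spell out its exact schema here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Stores one small file's index entry in HBase, tagged with its modality (schema assumed). */
public class IndexStore {
    private static final byte[] CF = Bytes.toBytes("idx");           // column family (assumed)

    public static void putIndex(Table table, String fileName, String mergedFile,
                                long offset, int length, String modality) throws Exception {
        Put put = new Put(Bytes.toBytes(fileName));                   // row key: small file name
        put.addColumn(CF, Bytes.toBytes("merged"), Bytes.toBytes(mergedFile));
        put.addColumn(CF, Bytes.toBytes("offset"), Bytes.toBytes(offset));
        put.addColumn(CF, Bytes.toBytes("length"), Bytes.toBytes(length));
        put.addColumn(CF, Bytes.toBytes("modality"), Bytes.toBytes(modality)); // identifier column
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("small_file_index"))) {
            putIndex(table, "patient001_ct_0001.dcm", "merged_000017.dat", 4096L, 512 * 1024, "CT");
            Result r = table.get(new Get(Bytes.toBytes("patient001_ct_0001.dcm")));
            long offset = Bytes.toLong(r.getValue(CF, Bytes.toBytes("offset")));
            System.out.println("offset = " + offset);
        }
    }
}
```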
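The LRU-based metadata prefetching mechanism needs a bounded cache of recently used index entries; a minimal sketch backed by java.util.LinkedHashMap in access order is shown below. The class name MetadataCache and the eviction details are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Fixed-capacity LRU cache for prefetched file index entries (metadata). */
public class MetadataCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public MetadataCache(int capacity) {
        super(16, 0.75f, true);            // accessOrder = true gives least-recently-used ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;          // evict the least recently used entry once full
    }
}
```

On a cache miss, the entry would be fetched from HBase, and under a prefetching scheme the remaining entries of the same bucket could be loaded alongside it.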
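For compressing the merged files with LZ4, a sketch based on the lz4-java library (net.jpountz.lz4) is shown; the paper names only the LZ4 algorithm, so the library choice and the convention of keeping the original block length in the index are assumptions.

```java
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

/** Compresses a merged-file block with LZ4 before writing it to HDFS, and restores it on read. */
public class Lz4Block {
    private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

    public static byte[] compress(byte[] raw) {
        LZ4Compressor compressor = FACTORY.fastCompressor();
        return compressor.compress(raw);
    }

    /** The original length must be kept alongside the block (e.g., in the HBase index row). */
    public static byte[] decompress(byte[] compressed, int originalLength) {
        LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
        return decompressor.decompress(compressed, originalLength);
    }

    public static void main(String[] args) {
        byte[] raw = "example merged-file block".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed, raw.length);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```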

Keywords: multi-modal medical data  HDFS  HBase  small files  storage performance optimization
Received: 2021-06-17
Revised: 2021-11-25
