Design of RDD Persistence Method in Spark for SSDs
Citation: Lu Kezhong, Zhu Jinbin, Li Zhengmin, Sui Xiufeng. Design of RDD Persistence Method in Spark for SSDs[J]. Journal of Computer Research and Development, 2017, 54(6): 1381-1390. DOI: 10.7544/issn1000-1239.2017.20170108
Authors: Lu Kezhong  Zhu Jinbin  Li Zhengmin  Sui Xiufeng
Affiliations: 1. College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060; 2. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 511400; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029; 4. State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; 5. Strategic Studies Centre, Chinese Academy of Engineering, Beijing 100088 (kzlu@szu.edu.cn)
Foundation items: National High Technology Research and Development Program of China (863 Program) (2015AA015305); Natural Science Foundation of Guangdong Province (2014A030313553); Guangdong Province Industry-University-Research Cooperation Project (2013B090500055); Shenzhen Basic Research Program (JCYJ20150529164656096)
Abstract: Datacenters built on hybrid storage of solid-state drives (SSDs) and hard disk drives (HDDs) have become a high-performance platform for big data computing. Workloads should be able to persist data with different characteristics to SSD or HDD on demand, so as to improve overall system performance. Spark is an efficient big data computing framework widely used in industry, and it is especially well suited to applications with multiple rounds of iterative computation, because Spark can persist intermediate data in memory or on disk; persisting to disk removes the limit that insufficient memory capacity places on dataset size. However, the current Spark implementation provides no explicit SSD-oriented persistence interface. Although data can be distributed across different storage media in proportions set by configuration, users cannot designate the persistence medium of an RDD according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has become a bottleneck to further improving Spark's performance, and it severely limits how much of the hybrid storage system's performance can be realized. We therefore propose, for the first time, an SSD-oriented data persistence strategy. We examine the principle of data persistence in Spark, optimize Spark's persistence architecture for hybrid storage systems, and finally provide a dedicated persistence API that lets users explicitly and flexibly specify the persistence medium of an RDD. Experimental results on SparkBench show that the optimized Spark improves performance by 14.02% on average over the original version.

Keywords: big data  hybrid storage  solid-state drive (SSD)  Spark  persistence
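The abstract's core idea — letting the application, rather than a global configuration ratio, choose the persistence medium for each dataset — can be sketched outside of Spark as a small policy layer. The following is a minimal illustrative sketch only: the class, the storage-level names, and the SSD/HDD directory arguments are all hypothetical stand-ins, not Spark's actual `StorageLevel` API or the paper's implementation.

```python
import os
import pickle
import tempfile

# Illustrative storage levels, loosely modeled on Spark's StorageLevel.
# The SSD/HDD mount points are assumptions for this sketch, not Spark settings.
MEMORY_ONLY = "memory"
SSD_ONLY = "ssd"
HDD_ONLY = "hdd"


class PersistentDataset:
    """A toy stand-in for an RDD whose persistence medium is chosen per dataset."""

    def __init__(self, records, ssd_dir, hdd_dir):
        self.records = list(records)
        self.paths = {SSD_ONLY: ssd_dir, HDD_ONLY: hdd_dir}
        self.cache = None   # in-memory copy, if persisted to memory
        self.file = None    # on-disk copy, if persisted to SSD or HDD

    def persist(self, level):
        """Persist to the medium named by `level` — the caller, not a
        global config ratio, decides between SSD and HDD."""
        if level == MEMORY_ONLY:
            self.cache = list(self.records)
        else:
            self.file = os.path.join(self.paths[level], "part-00000.pkl")
            with open(self.file, "wb") as f:
                pickle.dump(self.records, f)
        return self

    def collect(self):
        """Read back from whichever medium the data was persisted to."""
        if self.cache is not None:
            return list(self.cache)
        if self.file is not None:
            with open(self.file, "rb") as f:
                return pickle.load(f)
        return list(self.records)


# Usage: hot, frequently re-read intermediate data goes to the SSD path.
ssd = tempfile.mkdtemp()  # pretend this directory is SSD-backed
hdd = tempfile.mkdtemp()  # pretend this directory is HDD-backed
ds = PersistentDataset(range(5), ssd, hdd).persist(SSD_ONLY)
print(ds.collect())  # -> [0, 1, 2, 3, 4]
```

In stock Spark, by contrast, the closest a user gets is `persist(StorageLevel.DISK_ONLY)`, which leaves the choice of physical device to the block manager's local directories; the paper's contribution is to surface that choice in the persistence API itself.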
