Helius: a lightweight big data processing system (轻量级大数据运算系统Helius)


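Citation:	DING Mengsu (丁梦苏), CHEN Shimin (陈世敏). Helius: a lightweight big data processing system [J]. Journal of Computer Applications (计算机应用), 2017, 37(2): 305-310. DOI: 10.11772/j.issn.1001-9081.2017.02.0305

Authors:	DING Mengsu (丁梦苏), CHEN Shimin (陈世敏)

Affiliation:	State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190

Funding:	"Hundred Talents Program" of the Chinese Academy of Sciences; General Program of the National Natural Science Foundation of China (61572468); Innovative Research Group Project of the National Natural Science Foundation of China (61521092).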

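Abstract:	Spark datasets are immutable, and Spark's dependence on the Java Virtual Machine (JVM) causes excessive overhead in code execution, memory management, and data serialization/deserialization. To address these shortcomings, a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. It also uses C/C++ to optimize memory management and network transfer, and adopts a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. Experimental results show that, over 5 iterations, Helius ran the PageRank algorithm in only 25.12% to 53.14% of the time taken by Spark, and ran TPC-H Q6 in only 57.37% of the time taken by Spark. For one iteration of PageRank, the IP data received and sent by the master node under Helius was about 40% and 15%, respectively, of that under Spark, and over a 200 s run the total memory used by Helius was about 25% of that used by Spark. The experimental results and analysis indicate that, compared with Spark, Helius saves memory, needs no serialization or deserialization, reduces network interaction, and simplifies fault tolerance.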
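Keywords:	in-memory computing; big data processing; distributed computing; directed acyclic graph (DAG) scheduling; fault tolerance and recovery
Received:	2016-08-12
Revised:	2016-10-22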
Helius: a lightweight big data processing system

DING Mengsu, CHEN Shimin. Helius: a lightweight big data processing system [J]. Journal of Computer Applications, 2017, 37(2): 305-310. DOI: 10.11772/j.issn.1001-9081.2017.02.0305

Authors:	DING Mengsu, CHEN Shimin

Affiliation:	State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China

Abstract:	To address the limitations of Spark, namely immutable datasets and the significant costs of code execution, memory management, and data serialization/deserialization imposed by the Java Virtual Machine (JVM) runtime environment, a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. Helius uses C/C++ to optimize memory management and network communication, and adopts a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. Experimental results showed that, over 5 iterations, the running time of PageRank on Helius was only 25.12% to 53.14% of that on Spark, and the running time of TPC-H Q6 on Helius was only 57.37% of that on Spark. For one iteration of PageRank, the incoming and outgoing IP data volumes at the master node under Helius were about 40% and 15%, respectively, of those under Spark, and the total memory consumed on the worker node under Helius was about 25% of that under Spark. Compared with Spark, Helius has the advantages of saving memory, eliminating the need for serialization and deserialization, reducing network interaction, and simplifying fault tolerance.
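
The running-time figures above are given as percentages of Spark's time; it can help to read them as speedup factors. The short Python sketch below does only that arithmetic. The function name `speedup` and the dictionary labels are illustrative and not taken from the paper; the three percentages are the ones quoted in the abstract.

```python
def speedup(time_percent: float) -> float:
    """Convert 'Helius time as a percentage of Spark time' into a speedup factor.

    Hypothetical helper for illustration; it is not part of Helius or Spark.
    """
    if time_percent <= 0:
        raise ValueError("time_percent must be positive")
    return 100.0 / time_percent


# Percentages quoted in the abstract (Helius running time relative to Spark).
reported = {
    "PageRank (low end of range)": 25.12,
    "PageRank (high end of range)": 53.14,
    "TPC-H Q6": 57.37,
}

for name, pct in reported.items():
    # A time ratio of p% corresponds to a speedup of 100/p.
    print(f"{name}: {pct}% of Spark's time, roughly {speedup(pct):.2f}x faster")
```

Read this way, the reported range for PageRank corresponds to roughly a 1.9x to 4.0x speedup, and TPC-H Q6 to roughly 1.7x. These are restatements of the quoted ratios, not additional measurements.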

Keywords:	in-memory computation; big data processing; distributed computation; Directed Acyclic Graph (DAG) scheduling; fault tolerance and recovery
