
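Helius: a lightweight big data processing system
Cite this article:DING Mengsu, CHEN Shimin. Helius: a lightweight big data processing system[J]. Journal of Computer Applications, 2017, 37(2): 305-310.
Authors:DING Mengsu  CHEN Shimin
Affiliation:State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
Funding:"Hundred Talents Program" of the Chinese Academy of Sciences; General Program of the National Natural Science Foundation of China (61572468); Innovative Research Groups Program of the National Natural Science Foundation of China (61521092).
Abstract:To address the shortcomings of Spark, namely immutable datasets and the excessive overhead in code execution, memory management, and data serialization/deserialization caused by its dependence on the Java Virtual Machine (JVM), a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. Helius also uses C/C++ to optimize memory management and network transmission, and adopts a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. Experimental results show that, over 5 iterations, Helius ran the PageRank algorithm in only 25.12% to 53.14% of the time taken by Spark, and ran TPC-H Q6 in only 57.37% of the time taken by Spark. For a single PageRank iteration, the amount of IP data received and sent by the master node under Helius was about 40% and 15%, respectively, of that under Spark, and over a 200 s run the total memory used by Helius was about 25% of that used by Spark. The experimental results and analysis indicate that, compared with Spark, Helius saves memory, needs no serialization or deserialization, reduces network interaction, and simplifies fault tolerance.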

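Keywords:in-memory computation  big data processing  distributed computation  Directed Acyclic Graph (DAG) scheduling  fault tolerance and recovery  
Received:2016-08-12
Revised:2016-10-22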

Helius: a lightweight big data processing system
DING Mengsu, CHEN Shimin. Helius: a lightweight big data processing system[J]. Journal of Computer Applications, 2017, 37(2): 305-310.
Authors:DING Mengsu  CHEN Shimin
Affiliation:Key Laboratory of Computer System and Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China
Abstract:To address the limitations of Spark, including immutable datasets and the significant costs of code execution, memory management, and data serialization/deserialization imposed by the Java Virtual Machine (JVM) runtime environment, a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. Helius uses C/C++ to optimize memory management and network communication, and uses a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. The experimental results showed that, over 5 iterations, the running time of PageRank in Helius was only 25.12% to 53.14% of that in Spark, and the running time of TPC-H Q6 in Helius was only 57.37% of that in Spark. For one iteration of PageRank, the IP incoming and outgoing data sizes of the master node in Helius were about 40% and 15%, respectively, of those in Spark, and the total memory consumed in the worker node in Helius was only 25% of that in Spark. Compared with Spark, Helius has the advantages of saving memory, eliminating the need for serialization and deserialization, reducing network interaction, and simplifying fault tolerance.
Keywords:in-memory computation  big data processing  distributed computation  Directed Acyclic Graph (DAG) scheduling  fault tolerance and recovery  