Interactive Data Preprocessing System Based on Spark
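Cite this article: ZHANG Lei, ZHU Feng, ZHONG Hua. Interactive Data Preprocessing System Based on Spark[J]. Computer Systems & Applications, 2016, 25(11): 84-89.
Authors: ZHANG Lei, ZHU Feng, ZHONG Hua
Affiliation: University of Chinese Academy of Sciences, Beijing 100049, China; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Funding: National Natural Science Foundation of China (U1435220)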
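Abstract: High-quality decisions depend on high-quality data, and data preprocessing is a critical step in data mining. Traditional data preprocessing systems are not well suited to big data environments. Enterprises currently rely mainly on Hadoop/Hive to preprocess massive data sets, but this approach is commonly slow, inefficient, and non-interactive. This paper proposes an interactive data preprocessing system based on Spark. The system provides a set of general-purpose data preprocessing components and supports component extension; data is presented as a spreadsheet, and the system records the user's processing steps and supports undo and redo. The paper describes the system architecture from four aspects: the data model, data preprocessing operations, the interactive execution engine, and the interactive front end. Finally, the system is validated on real medical stroke data, and the experimental results show that it can meet interactive processing requirements in big data scenarios.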

Keywords: data preprocessing; Spark; interactive; big data
Received: 3/9/2016
Revised: 4/8/2016

Interactive Data Preprocessing System Based on Spark
ZHANG Lei, ZHU Feng and ZHONG Hua. Interactive Data Preprocessing System Based on Spark[J]. Computer Systems & Applications, 2016, 25(11): 84-89.
Authors: ZHANG Lei, ZHU Feng and ZHONG Hua
Affiliation: University of Chinese Academy of Sciences, Beijing 100049, China; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract: High-quality decision-making depends on high-quality data, so data preprocessing is an essential phase of data analytics applications. Traditional data preprocessing systems cannot be applied directly in big data environments. To handle large-scale data, enterprises currently adopt Hadoop/Hive as a popular solution, but it has drawbacks such as poor performance and a lack of interactivity. To fill this gap, this paper proposes and implements an interactive data preprocessing system based on Spark. The system provides a set of common preprocessing operations as basic components and supports flexible user-defined extensions. For interactivity, the system presents data to users as spreadsheets and automatically records their operations to support undo and redo. We describe the architecture of the system from four aspects: the data model, data preprocessing operations, the interactive execution engine, and the interactive GUI. Finally, we conduct experiments with real stroke data, and the results show that the system can meet interactive demands in most big data scenarios.
Keywords: data preprocessing; Spark; interactive; big data