
Interactive Data Preprocessing System Based on Spark (基于Spark的交互式数据预处理系统)

Cite this article:	ZHANG Lei, ZHU Feng, ZHONG Hua. Interactive Data Preprocessing System Based on Spark[J]. Computer Systems & Applications, 2016, 25(11): 84-89.

Authors:	ZHANG Lei, ZHU Feng, ZHONG Hua

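Affiliations:	University of Chinese Academy of Sciences, Beijing 100049; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190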

Funding:	National Natural Science Foundation of China (U1435220)

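Abstract:	High-quality decisions depend on high-quality data, and data preprocessing is a crucial step in data mining. Traditional data preprocessing systems are poorly suited to big data environments. Enterprises currently rely mainly on Hadoop/Hive to preprocess massive data sets, but this approach commonly suffers from long run times, low efficiency, and a lack of interactivity. This paper proposes an interactive data preprocessing system based on Spark. The system provides a set of general-purpose data preprocessing components and supports extending them; data is presented as a spreadsheet; and the system records the user's processing steps and supports undo and redo. The paper describes the system architecture from four aspects: the data model, the data preprocessing operations, the interactive execution engine, and the interactive front end. Finally, the system is validated on real medical stroke data, and the experimental results show that it can meet interactive processing requirements in big data scenarios.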
Keywords:	data preprocessing; Spark; interactive; big data
Received:	2016-03-09
Revised:	2016-04-08
Interactive Data Preprocessing System Based on Spark

ZHANG Lei, ZHU Feng and ZHONG Hua. Interactive Data Preprocessing System Based on Spark[J]. Computer Systems & Applications, 2016, 25(11): 84-89.

Authors:	ZHANG Lei, ZHU Feng and ZHONG Hua

Affiliation:	University of Chinese Academy of Sciences, Beijing 100049, China; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China, Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China and Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Abstract:	High-quality decision-making depends on high-quality data, so data preprocessing is an essential phase of data analytics applications. Traditional data preprocessing systems cannot be applied directly in big data environments. To handle large-scale data, enterprises currently adopt Hadoop/Hive as a popular solution. However, this approach has several drawbacks, such as poor performance and a lack of interaction. To fill this gap, this paper proposes and implements an interactive data preprocessing system based on Spark. The system provides a set of common preprocessing operations as basic components and supports flexible user-defined extensions. To provide an interactive interface, the system presents data to users in the form of spreadsheets and automatically records users' operations to support undo and redo. We describe the architecture of the system from four aspects: the data model, the data preprocessing operations, the interactive execution engine, and the interactive GUI. Finally, we conduct experiments with real stroke data, and the results show that the system can meet interactive demands in most big data scenarios.

Keywords:	data preprocessing; Spark; interactive; big data
