A Workflow-Based Machine Learning Analysis Method on Spark
Cite this article: ZHAO Ling-Ling, LIU Jie, WANG Wei. A Workflow-Based Machine Learning Analysis Method on Spark [J]. Computer Systems & Applications, 2016, 25(12): 162-168.
Authors: ZHAO Ling-Ling, LIU Jie, WANG Wei
Affiliations: University of Chinese Academy of Sciences, Beijing 100190, China; Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Funding: National Natural Science Foundation of China (U1435220)
Abstract: By using in-memory distributed datasets, Spark is well suited to workloads that require many iterations, such as data mining and machine learning. However, developing directly on Spark is difficult for data analysts: Scala has a steep learning curve, code optimization and system deployment demand substantial experience, and low code reuse leads to much repeated work. This paper designs and implements a visual, workflow-style machine learning method based on Spark. On one hand, a component model is designed to capture the basic steps of machine learning, including data preprocessing, feature processing, model training, and validation/evaluation; on the other hand, a visual workflow modeling tool lets analysts design machine learning workflows, which the tool automatically translates into Spark code for efficient execution. The tool can greatly improve the efficiency of developing machine learning applications on the Spark platform. The paper introduces the tool's methodology and key techniques, and demonstrates its effectiveness through a case study.

Keywords: machine learning; data analysis; distributed; big data; Spark
Received: 2016-03-21
Revised: 2016-04-11

Method of Implementing Machine Learning Analysis with Workflow Based on Spark Platform
ZHAO Ling-Ling, LIU Jie, WANG Wei. Method of Implementing Machine Learning Analysis with Workflow Based on Spark Platform [J]. Computer Systems & Applications, 2016, 25(12): 162-168.
Authors: ZHAO Ling-Ling, LIU Jie, WANG Wei
Affiliation: University of Chinese Academy of Sciences, Beijing 100190, China; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract: By using resilient distributed datasets, Spark is well adapted to iterative algorithms, which are common in data mining and machine learning jobs. However, developing Spark applications is complicated for data analysts, on account of the steep learning curve of Scala, the rich experience required for code optimization and system deployment, and the duplicated work caused by low code reuse. We design and develop a machine learning tool with a visual workflow style based on Spark. We model the stages of machine learning as workflow modules, including data preprocessing, feature processing, model training, and validation. Meanwhile, a friendly user interface is provided to accelerate the design of machine learning workflow models for analysts, with the server end automatically translating modules into Spark jobs. This tool can greatly improve the efficiency of machine learning development on the Spark platform. We introduce the theoretical methods and key techniques in the paper, and demonstrate the tool's validity with a real-world case.
Keywords:machine learning  data analysis  distributed  big data  Spark
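The module chain the abstract describes (data preprocessing, feature processing, model training, validation) maps naturally onto Spark's ML Pipeline API, which is the kind of code such a workflow tool could emit. The sketch below is illustrative only: the toy dataset, column names, and the choice of logistic regression are assumptions, not details taken from the paper.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.SparkSession

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("workflow-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy input standing in for the analyst's dataset (hypothetical columns).
    val data = Seq(
      (1.2, 0.5, 2.0, 1.0),
      (0.1, 1.5, 0.3, 0.0),
      (0.9, 0.4, 1.8, 1.0),
      (0.2, 1.2, 0.5, 0.0)
    ).toDF("f1", "f2", "f3", "label")

    // Data preprocessing module: drop incomplete rows.
    val cleaned = data.na.drop()

    // Feature processing module: assemble raw columns, then standardize.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("rawFeatures")
    val scaler = new StandardScaler()
      .setInputCol("rawFeatures")
      .setOutputCol("features")

    // Model training module.
    val lr = new LogisticRegression().setLabelCol("label")

    // The workflow is compiled into a single Spark ML pipeline.
    val pipeline = new Pipeline().setStages(Array(assembler, scaler, lr))
    val Array(train, test) = cleaned.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = pipeline.fit(train)

    // Validation/evaluation module: area under the ROC curve on held-out data.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(model.transform(test))
    println(s"AUC = $auc")

    spark.stop()
  }
}
```

Each workflow module corresponds to one pipeline stage (or a DataFrame transformation), which is what makes automatic translation from a visual graph to Spark code tractable.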
