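A Big Data Pipeline Task Processing Method Based on Hierarchical Splitting and Aggregation
Cite this article: Chen Tianle, Pu Jun, Zhu Xiaojie, Cui Wenjuan, Feng Weihua, Wang Rui, Du Yi, Zhou Yuanchun. A Big Data Pipeline Task Processing Method Based on Hierarchical Splitting and Aggregation[J]. Frontiers of Data & Computing, 2019, 10(1): 3-11
Authors: Chen Tianle  Pu Jun  Zhu Xiaojie  Cui Wenjuan  Feng Weihua  Wang Rui  Du Yi  Zhou Yuanchun
Affiliations: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; Zhengzhou Tobacco Research Institute, China National Tobacco Corporation, Zhengzhou, Henan 450001
Funding: Major Science and Technology Program of China National Tobacco Corporation; Major Science and Technology Program of China National Tobacco Corporation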
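Abstract: In recent years, data of every type on the Internet has kept growing, and the scenarios in which data is applied have become ever broader. How to automatically integrate data of various types and feed it into different scenario platforms has become a widely shared concern. Industry usually relies on pipeline tools for task scheduling, yet most pipeline tools cannot split a task into multiple subtasks that execute in parallel. This paper therefore proposes a big data pipeline task processing method based on hierarchical splitting and aggregation: a splitting module first divides a pipeline task into multiple subtasks; a merging program then waits until every subtask has finished, merges the results, and finally produces the completion event for the whole task. The method makes it possible to process bounded datasets with a stream processing framework, which extends the usage scenarios of stream processing systems and improves code reuse in pipeline tools. Experiments show that the method greatly reduces the number of database I/O operations, so that it processes DBLP[1] data more than 7 times as fast as the traditional processing method in Apache NiFi[2].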

Keywords: big data  pipeline  batch processing  stream processing

A Hierarchical Splitting and Merging Task Management Method in Big Data Dataflow Processing
Chen Tianle, Pu Jun, Cui Wenjuan, Zhu Xiaojie, Feng Weihua, Wang Rui, Du Yi, Zhou Yuanchun. A Hierarchical Splitting and Merging Task Management Method in Big Data Dataflow Processing[J]. Frontiers of Data & Computing, 2019, 10(1): 3-11
Authors: Chen Tianle  Pu Jun  Cui Wenjuan  Zhu Xiaojie  Feng Weihua  Wang Rui  Du Yi  Zhou Yuanchun
Affiliation: (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhengzhou Tobacco Research Institute, China National Tobacco Corporation, Zhengzhou 450001, China)
Abstract: In recent years, the volume of data of all types on the Internet has kept growing, and data application scenarios have become increasingly broad. How to integrate different types of data from different sources has become a concern in the big data area. The most popular approach is to run tasks on a dataflow platform, but most such platforms cannot split a task into sub-tasks and run them in parallel. Therefore, in this paper, we propose a hierarchical splitting and merging method: first, a splitting module splits one task into sub-tasks; then a merging module waits for all sub-tasks to complete and merges their results, which signals the completion of the original task. This method makes it possible to process bounded datasets on a stream processing platform, which broadens the application scenarios of stream processing platforms and improves code reuse on dataflow platforms. We apply the method to DBLP[1] data; it reduces the number of database I/O operations, and the results indicate that it is more than 7 times as fast as the traditional method in Apache NiFi[2].
Keywords: big data  dataflow  batch processing  stream processing
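
The split-then-merge flow summarized in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation and does not use the Apache NiFi API. The names `Fragment`, `split`, `Merger`, and `run_task` are hypothetical, and the bookkeeping scheme (each fragment carries one `(index, count)` pair per split level) is an assumption chosen for illustration. The merger holds fragments until all siblings of a level have arrived, merges them into their parent, and repeats upward until the whole task is reassembled, which stands in for the completion event.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Fragment:
    """A piece of a task (hypothetical structure, for illustration only)."""
    task_id: str
    payload: tuple
    # One (index, count) pair per split level, outermost level first.
    lineage: tuple = ()


def split(frag: Fragment, size: int) -> list:
    """Split a fragment into children of at most `size` records each."""
    chunks = [frag.payload[i:i + size]
              for i in range(0, len(frag.payload), size)] or [()]
    return [Fragment(frag.task_id, chunk, frag.lineage + ((i, len(chunks)),))
            for i, chunk in enumerate(chunks)]


class Merger:
    """Collects sibling fragments and emits the parent once all have arrived."""

    def __init__(self) -> None:
        self._pending = {}  # (task_id, parent lineage) -> {index: payload}

    def offer(self, frag: Fragment) -> Optional[Fragment]:
        if not frag.lineage:
            raise ValueError("fragment was never split")
        parent, (index, count) = frag.lineage[:-1], frag.lineage[-1]
        key = (frag.task_id, parent)
        bucket = self._pending.setdefault(key, {})
        bucket[index] = frag.payload
        if len(bucket) < count:
            return None  # still waiting for siblings
        del self._pending[key]
        merged = tuple(item for i in range(count) for item in bucket[i])
        return Fragment(frag.task_id, merged, parent)


def run_task(task_id: str, records, sizes, process: Callable, seed: int = 0):
    """Split hierarchically, process leaves out of order, merge back up."""
    leaves = [Fragment(task_id, tuple(records))]
    for size in sizes:  # one split level per entry in `sizes`
        leaves = [child for frag in leaves for child in split(frag, size)]
    random.Random(seed).shuffle(leaves)  # simulate out-of-order completion
    merger = Merger()
    for leaf in leaves:
        done = Fragment(leaf.task_id,
                        tuple(process(r) for r in leaf.payload),
                        leaf.lineage)
        # Climb the hierarchy for as long as each level is complete.
        while done is not None and done.lineage:
            done = merger.offer(done)
        if done is not None:
            return done.payload  # whole task reassembled: completion event
    return None


if __name__ == "__main__":
    print(run_task("demo", range(10), sizes=(4, 2), process=lambda x: x * x))
```

Because siblings are merged by index rather than by arrival order, the reassembled result keeps the original record order even when subtasks finish out of order. A real pipeline would run the leaf processing concurrently and persist the pending state; both are omitted here to keep the sketch short.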
This article is indexed in databases including VIP (CQVIP) and Wanfang Data.