A parallel large dataset generator based on MPI
Citation: GE Xu-ran, LIU Yang, CHEN Zhi-guang, XIAO Nong. A parallel large dataset generator based on MPI[J]. Computer Engineering & Science, 2022, 44(7): 1152-1161.
Authors: GE Xu-ran  LIU Yang  CHEN Zhi-guang  XIAO Nong
Affiliation: (1. College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073; 2. School of Computer, Sun Yat-sen University, Guangzhou, Guangdong 510006, China)
Funding: National Key R&D Program of China (2018YFC1406205); National Natural Science Foundation of China (U1811461, 61872392); Natural Science Foundation of Guangdong Province (2018B0303120); Guangdong Province Basic and Applied Basic Research (2019B030302002)

Abstract: When big data processing and analysis algorithms are being optimized, their speed is often limited by the size of the dataset. When the dataset is too small, an algorithm's communication time often exceeds its actual computation time, so its real performance cannot be verified. A large dataset generator is therefore designed and implemented to provide benchmark datasets for parallel big data processing and analysis algorithms running on supercomputers. First, a parallel random number generator is built with MPI parallel programming. On top of it, artificial datasets with controllable scale and complexity are designed and implemented, mainly including classification and clustering datasets, regression datasets, manifold learning datasets, and factorization datasets. Second, the I/O system of the large dataset generator is designed. It provides interfaces for reading and writing datasets in parallel through MPI-I/O, defines the rules for distributing and mapping a dataset across processes, and uses point-to-point communication to exchange data between nodes. Experimental results show that the parallel large dataset generator effectively improves both the efficiency and the scale of data generation, and provides high-quality, large-scale test datasets for parallel big data processing and analysis algorithms.

Keywords: MPI  large dataset generator  I/O system  parallel big data processing algorithm  algorithm test
Received: 2021-11-19
Revised: 2022-01-06
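
The abstract names two ideas that a small example can make concrete: each process generates its own share of the dataset from its own random stream, and a mapping rule decides where each process's samples land in the shared output. The sketch below is illustrative only and is not the authors' implementation. It uses no MPI at all: ranks are simulated in a loop, a `bytearray` stands in for the shared file, and the seeding rule, the record layout, the Gaussian-blob dataset, and all function names are assumptions made for this example.

```python
import random
import struct

# Hypothetical fixed-size record: two float64 features + one int32 label.
RECORD = struct.Struct("<ddi")


def block_range(n_samples, n_ranks, rank):
    """Return the [start, stop) range of global sample indices owned by rank."""
    base, extra = divmod(n_samples, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def generate_block(n_samples, n_ranks, rank, base_seed=2022, n_classes=3):
    """Generate this rank's share of a 2-D Gaussian-blob classification set."""
    # Assumed seeding rule: one separately seeded stream per rank.
    rng = random.Random(base_seed * 1_000_003 + rank)
    start, stop = block_range(n_samples, n_ranks, rank)
    rows = []
    for i in range(start, stop):
        label = i % n_classes
        # Class centres sit 5 units apart along the diagonal.
        x = rng.gauss(5.0 * label, 1.0)
        y = rng.gauss(5.0 * label, 1.0)
        rows.append((x, y, label))
    return start, rows


def write_block(buffer, start, rows):
    """Write records at the byte offset implied by their global sample index."""
    for k, row in enumerate(rows):
        RECORD.pack_into(buffer, (start + k) * RECORD.size, *row)


def generate_dataset(n_samples, n_ranks):
    """Simulate every rank in turn; stands in for an actual MPI launch."""
    shared = bytearray(n_samples * RECORD.size)
    for rank in range(n_ranks):
        start, rows = generate_block(n_samples, n_ranks, rank)
        write_block(shared, start, rows)
    return shared


if __name__ == "__main__":
    data = generate_dataset(10, 4)
    for i in range(10):
        print(i, RECORD.unpack_from(data, i * RECORD.size))
```

In an MPI program, `rank` and `n_ranks` would come from the communicator, and each process would write its block at the computed byte offset with a parallel call such as `MPI_File_write_at` rather than into a local buffer. Seeding `random.Random` with nearby integers gives distinct streams but no formal independence guarantee, so a generator designed for parallel streams would be a better fit for real benchmarks.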