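Improved Parallel Deep Forest Algorithm Combining with Information Theory
Cite this article: 毛伊敏,耿俊豪,陈亮.结合信息论改进的并行深度森林算法[J].计算机工程与应用,2022,58(7):106-115.
Authors: MAO Yimin (毛伊敏)  GENG Junhao (耿俊豪)  CHEN Liang (陈亮)
Affiliations: 1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China 2. School of Applied Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
Funding: Science and Technology Project of the Jiangxi Provincial Department of Education; National Natural Science Foundation of China; National Key Research and Development Program of China
Abstract: Parallel deep forest algorithms applied to big data suffer from too many redundant and irrelevant features, unbalanced multi-grained scanning, and low parallel efficiency. To address these problems, this paper proposes IPDFIT (improved parallel deep forest based on information theory), a parallel deep forest algorithm for big data environments. First, the algorithm uses a hybrid dimension reduction strategy based on information theory, DRIT (dimension reduction based on information theory), to obtain a reduced data set, effectively cutting the number of redundant and irrelevant features. Second, an improved multi-grained scanning strategy, IMGSS (improved multi-grained scanning strategy), scans the samples so that every feature appears with the same frequency in the resulting data subsets, which avoids the effect of unbalanced multi-grained scanning on the deep forest model. Third, the random forest models in each cascade layer of the deep forest are trained in parallel on the MapReduce framework, and a sample weighting strategy, TSWS (the sample weighting strategy), evaluates the samples with the random forest models in the cascade and selects the poorly evaluated samples for training in the next layer. This progressively reduces the number of training samples in each layer and thereby improves the parallel efficiency of the algorithm. Experimental results show that the algorithm achieves better classification performance in big data environments, especially on data sets with many features.

Keywords: MapReduce framework  deep forest  DRIT strategy  IMGSS strategy  TSWS strategy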
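
The scanning imbalance mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' IMGSS implementation, which the abstract does not specify. It assumes a stride-1 sliding window over feature indices and uses a wrap-around (cyclic) window as one hypothetical way to make every feature appear equally often. The function name `window_counts` and the sizes used are illustrative.

```python
from collections import Counter


def window_counts(num_features, window, cyclic):
    """Count how often each feature index falls inside a stride-1 window."""
    counts = Counter()
    # A plain scan stops once the window reaches the last feature.
    # The cyclic variant (an assumption) starts a window at every
    # position and wraps around to the first features.
    if cyclic:
        starts = range(num_features)
    else:
        starts = range(num_features - window + 1)
    for start in starts:
        for offset in range(window):
            counts[(start + offset) % num_features] += 1
    return [counts[i] for i in range(num_features)]


plain = window_counts(6, 3, cyclic=False)
balanced = window_counts(6, 3, cyclic=True)
print(plain)     # [1, 2, 3, 3, 2, 1]
print(balanced)  # [3, 3, 3, 3, 3, 3]
```

In the plain scan the edge features are covered by fewer windows than the middle ones, which is the kind of imbalance the abstract describes. The cyclic variant is only one possible remedy.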

Improved Parallel Deep Forest Algorithm Combining with Information Theory
MAO Yimin,GENG Junhao,CHEN Liang.Improved Parallel Deep Forest Algorithm Combining with Information Theory[J].Computer Engineering and Applications,2022,58(7):106-115.
Authors:MAO Yimin  GENG Junhao  CHEN Liang
Affiliation:1.School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China 2.School of Applied Science, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China
Abstract: To address the problems of excessive redundant and irrelevant features, unbalanced multi-grained scanning, and low parallelization efficiency in parallel deep forest algorithms for big data, this paper proposes an improved parallel deep forest algorithm based on information theory, named IPDFIT. Firstly, a dimension reduction strategy based on information theory is presented to reduce the dimensionality of the original data set. Secondly, an improved multi-grained scanning strategy, IMGSS, is designed to ensure that each feature appears in the data subsets with the same frequency. Finally, to improve the parallel efficiency of the deep forest algorithm, a sample weighting strategy is proposed that evaluates each sample with the forests in the cascade. Based on the evaluation results, the algorithm selects the poorly evaluated samples to enter the next layer of training. The experimental results show that the IPDFIT algorithm achieves better classification results in a big data environment, especially for data sets with many features.
Keywords:MapReduce framework  deep forest  DRIT strategy  IMGSS strategy  TSWS strategy  
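
The idea of passing only poorly evaluated samples to the next cascade layer can be sketched as follows. This is an illustration under stated assumptions, not the TSWS strategy itself. It assumes that "poor evaluation" means the forest assigns a low probability to the sample's true class, and the `threshold` value of 0.8 and the function name `select_hard_samples` are hypothetical.

```python
def select_hard_samples(probabilities, labels, threshold=0.8):
    """Return indices of samples whose true-class probability is below threshold."""
    hard = []
    for index, (probs, label) in enumerate(zip(probabilities, labels)):
        # probs[label] is the probability the layer gave to the correct class.
        if probs[label] < threshold:
            hard.append(index)
    return hard


# Toy class-probability vectors for four samples (made-up values).
layer_probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.70, 0.30]]
true_labels = [0, 0, 1, 1]

next_layer = select_hard_samples(layer_probs, true_labels)
print(next_layer)  # [1, 3]
```

Samples 0 and 2 are already classified confidently and drop out, so the next layer trains on fewer samples, which matches the reduction in per-layer training data that the abstract credits for the efficiency gain.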
This article is indexed in 万方数据 (Wanfang Data) and other databases.