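Parallel deep forest algorithm based on mutual information and mixed weighting
Cite this article: Mao Yimin, Li Wenhao. Parallel deep forest algorithm based on mutual information and mixed weighting[J]. Application Research of Computers, 2024, 41(2).
Authors: Mao Yimin, Li Wenhao
Affiliation: School of Information Engineering, Jiangxi University of Science and Technology (both authors)
Funding: Key-Area Research and Development Program of Guangdong Province (2022B0101020002); Guangdong Provincial Key Improvement Project (2022ZDJS048)
Abstract: Parallel deep forest algorithms in big data environments suffer from too many irrelevant and redundant features, imbalanced multi-granularity scanning, insufficient classification performance, and low parallelization efficiency. To address these problems, this paper proposes a parallel deep forest algorithm based on mutual information and mixed weighting (PDF-MIMW). First, in the feature dimensionality reduction phase, a feature extraction strategy based on mutual information (FE-MI) filters the original features by combining measures of feature importance, interaction, and redundancy, removing the excess irrelevant and redundant features. Next, in the multi-granularity scanning phase, an improved multi-granularity scanning strategy based on padding (IMGS-P) pads the reduced features and randomly samples the subsequences produced by window scanning, keeping the multi-granularity scanning balanced. Then, in the cascade forest construction phase, a parallel sub-forest construction strategy (sub-forest construction strategy based on mixed weighting, SFC-MW) builds weighted sub-forests in parallel on the Spark framework, improving the model's classification performance. Finally, in the class vector merging phase, a load balancing strategy based on a hybrid particle swarm optimization algorithm (LB-HPSO) optimizes the load distribution across task nodes in the Spark framework, shortening the waiting time during class vector merging and improving the model's parallelization efficiency. Experiments show that PDF-MIMW classifies more accurately and trains more efficiently in big data environments.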

Keywords: Spark framework; parallel deep forest; mutual information; load balancing
Received: 2023/5/18
Revised: 2024/1/14
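
The abstract does not give the FE-MI scoring formula, so the following is only a minimal sketch of the general idea of filtering features with mutual information. It assumes discrete feature values and swaps in a generic greedy "relevance minus redundancy" rule (mRMR-style) in place of the paper's combined importance, interaction, and redundancy measure. The function names `mutual_information` and `select_features` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Return I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(columns, labels, k):
    """Greedily pick up to k feature names (assumed mRMR-style rule, not FE-MI).

    columns maps a feature name to its list of discrete values.
    Score = MI(feature, labels) - mean MI(feature, already selected features).
    """
    relevance = {name: mutual_information(col, labels) for name, col in columns.items()}
    selected = []
    while len(selected) < min(k, len(columns)):
        best_name, best_score = None, float("-inf")
        for name, col in columns.items():
            if name in selected:
                continue
            # Redundancy is zero until at least one feature has been chosen.
            redundancy = (
                sum(mutual_information(col, columns[s]) for s in selected) / len(selected)
                if selected
                else 0.0
            )
            score = relevance[name] - redundancy
            if score > best_score:  # ties keep the earliest feature
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: the label is determined by features "a" and "c"; "b" duplicates "a".
a = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * x + y for x, y in zip(a, c)]
chosen = select_features({"a": a, "b": list(a), "c": c}, labels, k=2)
```

In this toy case the duplicated feature `b` is as relevant as `a` but fully redundant with it, so the rule skips `b` and keeps `c`.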

Parallel deep forest algorithm based on mutual information and mixed weighting
Mao Yimin and Li Wenhao. Parallel deep forest algorithm based on mutual information and mixed weighting[J]. Application Research of Computers, 2024, 41(2).
Authors: Mao Yimin and Li Wenhao
Affiliation: School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi
Abstract: In big data environments, the parallel deep forest algorithm faces several challenges: an abundance of irrelevant and redundant features, imbalanced multi-granularity scanning, inadequate classification performance, and low parallelization efficiency. To tackle these issues, this paper proposes PDF-MIMW. First, in the dimensionality reduction phase, the algorithm introduces FE-MI, which filters the original feature set by combining feature importance, interaction, and redundancy metrics, thereby eliminating excess irrelevant and redundant features. Next, in the multi-granularity scanning phase, it proposes IMGS-P, which pads the reduced features and randomly samples the subsequences obtained after window scanning, thereby keeping the multi-granularity scanning balanced. Then, in the cascade forest construction phase, it puts forward SFC-MW, which uses the Spark framework to construct weighted sub-forests in parallel, thereby enhancing the model's classification performance. Finally, in the class vector merging phase, it designs a load balancing strategy based on a hybrid particle swarm optimization algorithm (LB-HPSO), which optimizes the load distribution among task nodes in the Spark framework, reducing the waiting time during class vector merging and improving the parallelization efficiency of the model. Experiments demonstrate that PDF-MIMW achieves better classification performance and higher training efficiency in big data environments.
Keywords: Spark framework; parallel deep forest; mutual information; load balancing
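
The imbalance that IMGS-P targets can be seen in a plain sliding window, where features at the edges of the vector fall into fewer windows than features in the middle. The abstract does not say what kind of padding the authors use, so the sketch below assumes wrap-around (circular) padding, which gives every feature the same number of windows when the stride is 1, followed by random sampling of the resulting subsequences. The names `padded_windows` and `sample_windows` are hypothetical.

```python
import random


def padded_windows(features, window, stride=1):
    """Slide a window over a circularly padded copy of `features`.

    Wrap-around padding is an assumption of this sketch, not the paper's
    stated scheme. With stride 1, each feature lands in exactly `window`
    subsequences.
    """
    if not 1 <= window <= len(features):
        raise ValueError("window must be between 1 and len(features)")
    padded = list(features) + list(features[: window - 1])
    return [padded[i : i + window] for i in range(0, len(features), stride)]


def sample_windows(windows, n_samples, seed=0):
    """Randomly keep at most n_samples subsequences, without replacement."""
    rng = random.Random(seed)
    return rng.sample(windows, min(n_samples, len(windows)))


features = [1, 2, 3, 4, 5]
windows = padded_windows(features, window=3)
# windows: [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 1], [5, 1, 2]
subset = sample_windows(windows, n_samples=2, seed=1)
```

Without the padding, the same input would yield only three windows, with feature 1 appearing once and feature 3 appearing three times.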