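Parallel deep forest algorithm based on Spark and NRSCA strategy
Cite this article: Mao Yimin, Liu Shaofen. Parallel deep forest algorithm based on Spark and NRSCA strategy[J]. Application Research of Computers, 2024, 41(1): 126-133
Authors: Mao Yimin, Liu Shaofen
Affiliations: 1. School of Information Engineering, Jiangxi University of Science and Technology; 2. School of Information Engineering, Shaoguan University
Funding: Shaoguan Science and Technology Project (220607154531533)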
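Abstract: Parallel deep forest faces several problems in big data environments: too many redundant and irrelevant features, low utilization of features at both ends, slow model convergence, and low parallel efficiency of the cascade forest. To address them, this paper proposes PDF-SNRSCA, a parallel deep forest algorithm based on Spark and the NRSCA strategy. First, the algorithm proposes a feature selection strategy based on neighborhood rough sets and the Fisher score (FS-NRS), which filters features by measuring their relevance and redundancy and effectively reduces the number of redundant and irrelevant features. Second, it proposes a scanning strategy based on random selection and equidistant extraction (S-RSEE), which ensures that every feature is used with equal probability and resolves the low utilization of features at both ends in multi-grained scanning. Finally, building on the Spark framework, it parallelizes cascade forest training and proposes a feature filtering mechanism based on an importance index (FFM-II), which filters out non-critical features and balances the dimensions of the enhanced class vectors and the original class vectors, thereby speeding up model convergence. It also designs a task scheduling mechanism based on SCA (TSM-SCA), which reassigns tasks to keep the cluster load balanced and resolves the low parallel efficiency of the cascade forest. Experiments show that PDF-SNRSCA effectively improves the classification performance of deep forest and substantially improves the efficiency of its parallel training.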

Keywords: parallel deep forest algorithm   Spark framework   neighborhood rough sets   sine cosine algorithm   multi-grained scanning
Received: 2023-05-11
Revised: 2023-12-15
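
The FS-NRS strategy named in the abstract combines neighborhood rough sets with the Fisher score, but the abstract does not say how the two are combined. As a minimal sketch of the Fisher score part only, the standard definition (between-class scatter divided by within-class scatter, computed per feature) can be written as below. The function names `fisher_scores` and `top_k_features` and the toy data are assumptions for illustration, not taken from the paper.

```python
from collections import defaultdict


def fisher_scores(rows, labels):
    """Return one Fisher score per feature column.

    score_j = sum_k n_k * (mean_kj - mean_j)^2 / sum_k n_k * var_kj,
    where k ranges over classes. A larger score means the feature
    separates the classes better.
    """
    n_features = len(rows[0])
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        overall_mean = sum(column) / len(column)
        between = 0.0  # between-class scatter for feature j
        within = 0.0   # within-class scatter for feature j
        for members in by_class.values():
            values = [row[j] for row in members]
            n_k = len(values)
            mean_k = sum(values) / n_k
            var_k = sum((v - mean_k) ** 2 for v in values) / n_k
            between += n_k * (mean_k - overall_mean) ** 2
            within += n_k * var_k
        # Zero within-class scatter: report infinity if the feature still
        # separates the classes, otherwise 0 (the feature is constant).
        if within == 0.0:
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_k_features(rows, labels, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_scores(rows, labels)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (hypothetical): feature 0 tracks the label, feature 1 does not.
rows = [[0.0, 5.0], [0.2, 1.0], [1.0, 5.0], [1.2, 1.0]]
labels = [0, 0, 1, 1]
print(top_k_features(rows, labels, 1))  # prints [0]
```

The Fisher score measures relevance one feature at a time, so it cannot detect redundancy between features on its own. The abstract lists both relevance and redundancy as FS-NRS criteria.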

Parallel deep forest algorithm based on Spark and NRSCA strategy
Mao Yimin and Liu Shaofen. Parallel deep forest algorithm based on Spark and NRSCA strategy[J]. Application Research of Computers, 2024, 41(1): 126-133
Authors: Mao Yimin and Liu Shaofen
Affiliation: School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi
Abstract: Parallel deep forest algorithms encounter several issues in big data environments: too many redundant and irrelevant features, low utilization of features at both ends, slow model convergence, and low parallel efficiency of the cascade forest. To address them, this paper proposes a parallel deep forest algorithm based on Spark and the NRSCA strategy (PDF-SNRSCA). First, the algorithm proposes a feature selection strategy (FS-NRS) based on neighborhood rough sets and the Fisher score, which measures the relevance and redundancy of features and filters them accordingly, effectively reducing the number of redundant and irrelevant features. Second, it proposes a scanning strategy based on random selection and equidistant extraction (S-RSEE), which ensures that all features are used with the same probability and solves the problem of low utilization of features at both ends in multi-grained scanning. Finally, building on the Spark framework, the algorithm parallelizes the training of the cascade forest and proposes a feature filtering mechanism based on an importance index (FFM-II), which filters out non-critical features and balances the dimensions of the enhanced class vectors and the original class vectors, thereby accelerating model convergence. The algorithm also designs a task scheduling mechanism based on SCA (TSM-SCA), which redistributes tasks to keep the cluster load balanced and solves the problem of low parallel efficiency of the cascade forest. Experiments show that PDF-SNRSCA effectively improves the classification performance of deep forest and greatly improves the efficiency of its parallel training.
Keywords: parallel deep forest algorithm   Spark framework   neighborhood rough sets   sine cosine algorithm   multi-grained scanning
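
The low utilization of features at both ends that S-RSEE targets comes from how a sliding window covers a feature vector in standard multi-grained scanning: positions near either end fall inside fewer windows than positions in the middle. The sketch below only counts that coverage. It does not implement S-RSEE, whose details the abstract does not give. The function name `window_coverage` and its parameters are hypothetical.

```python
def window_coverage(n_features, window, stride=1):
    """Count how many sliding windows include each feature position."""
    counts = [0] * n_features
    # Slide a fixed-size window from the left edge to the right edge.
    for start in range(0, n_features - window + 1, stride):
        for j in range(start, start + window):
            counts[j] += 1
    return counts


# 8 features scanned with a window of 3 and stride 1.
print(window_coverage(8, 3))  # prints [1, 2, 3, 3, 3, 3, 2, 1]
```

Here the first and last features each appear in one window while the middle features appear in three. That imbalance is the problem the abstract describes, and S-RSEE is the authors' remedy, making every feature equally likely to be used.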