Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark (Spark下基于PCA和分层选择的随机森林算法)

Cite this article:	LEI Chen (雷晨), MAO Yimin (毛伊敏). Spark下基于PCA和分层选择的随机森林算法[J]. 计算机工程与应用 (Computer Engineering and Applications), 2022, 58(6): 118-127.

Author names:	LEI Chen (雷晨), MAO Yimin (毛伊敏)

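Author affiliation:	School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China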

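Funding:	Science and Technology Project of the Jiangxi Provincial Department of Education; National Natural Science Foundation of China; National Key Research and Development Program of China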

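Abstract (translated from the Chinese, truncated in the source):	To address the problems of a large covariance matrix, insufficient coverage of subspace feature information, and high node communication overhead in the random forest algorithm under big data, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. For the initial feature set, ...
Keywords (translated from the Chinese):	random forest; Spark; principal component analysis (PCA); layer (stratified) sampling; error constraint; data partitioning; data reuse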
Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark

LEI Chen, MAO Yimin. Random Forest Algorithm Based on PCA and Hierarchical Selection Under Spark[J]. Computer Engineering and Applications, 2022, 58(6): 118-127.

Authors:	LEI Chen, MAO Yimin

Affiliation:	School of Information Engineering, Jiangxi University of Science & Technology, Ganzhou, Jiangxi 341000, China

Abstract:	In the context of big data, the random forest algorithm suffers from a large covariance matrix, insufficient coverage of subspace feature information, and high node communication overhead. To address these problems, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. For the initial feature set, a PCA-based matrix factorization strategy (MFS) is proposed to extract principal component features, which addresses the large covariance matrix that arises during feature transformation. Based on the extracted principal component features, an error-constrained hierarchical subspace construction algorithm (EHSCA) is proposed, which selects pheromone features hierarchically and constructs feature subspaces, addressing the insufficient coverage of subspace feature information. For the parallel training of decision trees in the Spark environment, a data reuse strategy (DRS) is designed: by vertically partitioning RDD data objects, it improves the data utilization rate in the distributed environment and thereby reduces the node communication overhead. Experimental results show that PLA-PRF achieves better classification performance and higher parallelization efficiency.

Keywords:	random forest; Spark; principal component analysis (PCA); layer sampling; error constraint; data partition; data reuse
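
The layered subspace selection mentioned in the abstract can be illustrated with a small sketch. This is not the paper's EHSCA: the abstract does not specify the error constraint or the layering rule, so the code below assumes a plain stratified scheme in which features are ranked by a score (for example, the explained variance of each principal component), split into equal-sized layers, and sampled from every layer so that each subspace covers high-, medium-, and low-scoring features. The function name `layered_subspace` and its parameters are hypothetical.

```python
import random


def layered_subspace(scores, subspace_size, n_layers=3, seed=None):
    """Return sorted feature indices sampled from every score layer.

    Illustrative assumption only: `scores` holds one importance value per
    feature (e.g. explained variance of a principal component). The paper's
    error-constrained rule is not reproduced here.
    """
    if not 1 <= subspace_size <= len(scores):
        raise ValueError("subspace_size must be between 1 and len(scores)")
    rng = random.Random(seed)

    # Rank feature indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Split the ranking into contiguous layers of near-equal size.
    n_layers = min(n_layers, len(ranked))
    base, extra = divmod(len(ranked), n_layers)
    layers, start = [], 0
    for k in range(n_layers):
        size = base + (1 if k < extra else 0)
        layers.append(ranked[start:start + size])
        start += size

    # Hand out the quota round-robin, starting with the top layer,
    # never asking a layer for more features than it holds.
    quotas = [0] * n_layers
    remaining, k = subspace_size, 0
    while remaining > 0:
        if quotas[k] < len(layers[k]):
            quotas[k] += 1
            remaining -= 1
        k = (k + 1) % n_layers

    # Sample without replacement inside each layer.
    chosen = []
    for layer, quota in zip(layers, quotas):
        chosen.extend(rng.sample(layer, quota))
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical explained-variance ratios for six principal components.
    variances = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
    print(layered_subspace(variances, subspace_size=3, seed=0))
```

Each decision tree would receive its own subspace by calling the function with a different seed. Compared with uniform random feature selection, sampling per layer guarantees that no subspace consists only of low-scoring features, which is the coverage concern the abstract raises.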
This article is indexed in Wanfang Data and other databases.