基于Spark的分层子空间权重树随机森林算法 Random forest algorithm using stratified subspaces and weighted trees based on Spark期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

基于Spark的分层子空间权重树随机森林算法

引用本文：	牛志华,屈景怡,吴仁彪. 基于Spark的分层子空间权重树随机森林算法[J]. 信号处理, 2017, 33(10): 1301-1307. DOI: 10.16798/j.issn.1003-0530.2017.10.004

作者姓名：	牛志华屈景怡吴仁彪

作者单位：	中国民航大学天津市智能信号与图像处理重点实验室

基金项目：	国家自然科学青年基金(11402294);天津市智能信号与图像处理重点实验室开放基金项目(2015AFS03);中国民航大学第六期波音基金项目(20160159209)

摘要：	高维数据的很多特征与类别的相关性弱，影响了随机森林的分类正确率。针对原始随机森林算法在高维数据上的分类问题，提出了一种分层子空间权重树随机森林算法。同时，传统的单机模式无法满足高维数据计算效率的需求，因此利用开源集群计算框架Spark在内存缓存和迭代计算上的优势，将所提算法在Spark上实现。所提算法采用以决策树为单位的分层抽样来生成特征子空间，在提高单棵决策树性能的同时，保证决策树之间的多样性；并且采用权重树的集成策略，使分类能力强的树在集成过程中影响力更大。通过在Mnist和Gisette数据集上的实验结果表明，相比原始随机森林算法、TWRF算法以及分层子空间随机森林算法，所提算法具有更好的正确率，提高了泛化误差性能，可扩展性良好，能够有效分类高维数据。
关键词：	高维数据随机森林算法决策树分层抽样权重树 Spark
收稿时间：	2017-03-20
Random forest algorithm using stratified subspaces and weighted trees based on Spark

Affiliation:	Tianjing Key Lab for Advanced Signal Processing, Civil Aviation University of China

Abstract:	For high dimensional data, a large portion of features are often not informative of the class of the objects, which affects the classification accuracy of the original random forest algorithm. In order to deal with the classification problem of the original random forest algorithm on high dimensional data, a random forest algorithm using stratified subspaces and weighted trees was proposed. Meanwhile, the traditional single-machine mode cannot meet the needs of computational efficiency of high dimensional data. Spark is a new cluster-computing framework. Therefore, the proposed algorithm was implemented on Spark to use its advantages in memory cache and iterative computation. In the paper, the decision tree was treated as a unit to adopt stratified sampling to generate feature subspaces, which could improve the performance of the decision trees among the forest and could ensure the diversity of them. Meanwhile, the integration strategy of weighted trees was used to make the trees with strong classification ability more influential in the integration process. The experiments on Mnist dataset and Gisette dataset show that the proposed algorithm has better performance than the original random forest algorithm and other two algorithms and has good scalability. The proposed algorithm could be an effective method for classifying high dimensional data.

Keywords:

	点击此处可从《信号处理》浏览原始摘要信息
	点击此处可从《信号处理》下载免费的PDF全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏