首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 312 毫秒
1.
提出一种改进随机森林算法(SP-RF).通过建立数据抽样索引表和随机特征索引表来实现随机森林算法在Spark上的并行化;通过计算随机森林算法中每个决策树的AUC值来给分类能力不同的决策树分配权重;提高随机森林算法在投票环节的分类精度.实验结果表明改进后的随机森林算法分类精度平均提高5%,运行时间平均减少25%以上.  相似文献   

2.
提出一种改进随机森林算法(SP-RF).通过建立数据抽样索引表和随机特征索引表来实现随机森林算法在Spark上的并行化;通过计算随机森林算法中每个决策树的AUC值来给分类能力不同的决策树分配权重;提高随机森林算法在投票环节的分类精度.实验结果表明改进后的随机森林算法分类精度平均提高5%,运行时间平均减少25%以上.  相似文献   

3.
师彦文  王宏杰 《计算机科学》2017,44(Z11):98-101
针对不平衡数据集的有效分类问题,提出一种结合代价敏感学习和随机森林算法的分类器。首先提出了一种新型不纯度度量,该度量不仅考虑了决策树的总代价,还考虑了同一节点对于不同样本的代价差异;其次,执行随机森林算法,对数据集作K次抽样,构建K个基础分类器;然后,基于提出的不纯度度量,通过分类回归树(CART)算法来构建决策树,从而形成决策树森林;最后,随机森林通过投票机制做出数据分类决策。在UCI数据库上进行实验,与传统随机森林和现有的代价敏感随机森林分类器相比,该分类器在分类精度、AUC面积和Kappa系数这3种性能度量上都具有良好的表现。  相似文献   

4.
随机森林是一种有效的集成学习算法,被广泛应用于模式识别中。为了得到更高的预测精度,需要对参数进行优化。提出了一种基于袋外数据估计的分类误差,利用改进的网格搜索算法对随机森林算法中的决策树数量和候选分裂属性数进行参数优化的随机森林算法。仿真结果表明,利用该方法优化得到的参数都能够使随机森林的分类效果得到一定程度的提高。  相似文献   

5.
为进一步提高随机森林算法分类准确率,提出一种基于决策边界的倾斜森林(oblique forests based on decision boundary,OFDB)分类算法.将决策边界与自适应权重融入随机森林算法框架,采用决策边界作为分裂准则,使原本垂直于数据空间的分裂准则变为倾斜的超平面,有效提高算法对数据空间结构的...  相似文献   

6.
随着大数据时代的到来,数据信息呈几何倍数增长。传统的分类算法将面临着极大的挑战。为了提高分类算法的效率,提出了一种基于弱相关化特征子空间选择的离散化随机森林并行分类算法。该算法在数据预处理阶段对数据集中的连续属性进行离散化。在随机森林抽取特征子空间阶段,利用属性向量空间模型计算属性间的相关性,构造弱相关化特征子空间,使所构建的决策树之间相关性降低,从而提高随机森林的分类效果;并通过研究随机森林的并行化策略,结合MapReduce框架,改进并实现了随机森林模型构建过程的双重并行化,进一步改善了算法的计算效率。  相似文献   

7.
《软件》2016,(7):75-79
不平衡数据集的分类问题是现今机器学习的一个热点问题。传统分类学习器以提高分类精度为准则导致对少数类识别准确率下降。本文首先综合描述了不平衡数据集分类问题的研究难点和研究进展,论述了对分类算法的评价指标,进而提出一种新的基于二次随机森林的不平衡数据分类算法。首先,用随机森林算法对训练样本学习找到模糊边界,将误判的多数类样本去除,改变原训练样本数据集结构,形成新的训练样本。然后再次使用随机森林对新训练样本数据进行训练。通过对UCI数据集进行实验分析表明新算法在处理不平衡数据集上在少数类的召回率和F值上有提高。  相似文献   

8.
杨丰瑞 《计算机应用研究》2020,37(9):2625-2628,2633
高维复杂数据处理是数据挖掘领域中的关键问题,针对现有特征选择分类算法存在的预测精确度失衡、整体分类效率低下等问题,提出了一种结合概率相关性和极限随机森林的特征选择分类算法(P-ERF)。该算法使用充分考虑特征之间相关性与P值结合的特征选择方式,避免了树节点分裂过程中造成的冗余性问题;并以随机树为基分类器、极限随机森林为整体框架,使P-ERF算法获得了更高的精准度和更好的泛化误差。实验结果表明,P-ERF算法相较于随机森林算法、极限随机森林算法,在数据集分类精度与整体性方面均得到良好的效果。  相似文献   

9.
为实现图书馆文献自动分类,提出一种随机森林算法的分类方法。首先,提出图书馆文献自动分类的思路;然后根据整体思路,建立数据集,并对数据集进行预处理和特征提取;最后搭建文献自动分类器,并比较随机森林分类器与常用的NB、SVM分类器的分类性能。结果表明:随机森林分类器对文献的分类准确率最高,且能够维持在95%左右,相比于NB、SVM自动分类法提高了3%;同时查准率和加速比上,随机森林分类器的性能也最优。由此得出,基于随机森林算法的分类能满足图书馆文献自动分类的要求,且分类准确率更好,分类效果更理想。  相似文献   

10.
《微型机与应用》2016,(3):28-30
随机森林可以产生高准确度的分类器,被广泛用于解决模式识别问题。然而,随机森林赋予每个决策树相同的权重,这在一定程度上降低了整个分类器的性能。为了解决这个问题,本文提出一种加权随机森林算法。该算法引入二次训练过程,提高分类正确率高的决策树投票权重,降低分类错误率高的决策树投票权重,从而提高整个分类器的分类能力。通过在不同数据集上的分类测试实验,证明了本文算法相比于传统的随机森林算法具有更强的分类性能。  相似文献   

11.
Random Forests receive much attention from researchers because of their excellent performance. As Breiman suggested, the performance of Random Forests depends on the strength of the weak learners in the forests and the diversity among them. However, in the literature, many researchers only considered pre-processing of the data or post-processing of the Random Forests models. In this paper, we propose a new method to increase the diversity of each tree in the forests and thereby improve the overall accuracy. During the training process of each individual tree in the forest, different rotation spaces are concatenated into a higher space at the root node. Then the best split is exhaustively searched within this higher space. The location where the best split lies decides which rotation method to be used for all subsequent nodes. The performance of the proposed method here is evaluated on 42 benchmark data sets from various research fields and compared with the standard Random Forests. The results show that the proposed method improves the performance of the Random Forests in most cases.  相似文献   

12.
Class decomposition describes the process of segmenting each class into a number of homogeneous subclasses. This can be naturally achieved through clustering. Utilising class decomposition can provide a number of benefits to supervised learning, especially ensembles. It can be a computationally efficient way to provide a linearly separable data set without the need for feature engineering required by techniques like support vector machines and deep learning. For ensembles, the decomposition is a natural way to increase diversity, a key factor for the success of ensemble classifiers. In this paper, we propose to adopt class decomposition to the state-of-the-art ensemble learning Random Forests. Medical data for patient diagnosis may greatly benefit from this technique, as the same disease can have a diverse of symptoms. We have experimentally validated our proposed method on a number of data sets that are mainly related to the medical domain. Results reported in this paper show clearly that our method has significantly improved the accuracy of Random Forests.  相似文献   

13.
改进的随机森林及其在遥感图像中的应用   总被引:1,自引:0,他引:1  
对于遥感图像训练样本获取难的问题,引入适用于小样本分类的随机森林算法。为了随机森林能在小样本情况下有更优的分类效果和更高的稳定性,在决策树基础上提出了一种更加随机的特征组合的方法,降低了决策树之间的相关性,从而降低了森林的泛化误差;引入人工免疫算法来对改进后的随机森林进行压缩优化,很好地权衡了森林规模和分类稳定性、精度的矛盾。通过UCI数据集的实验表明,改进的随机森林的有效性及其优化的模型的可行性,优化后森林的规模降低了,且有更高的分类精度。在遥感图像上与传统的方法进行了对比。  相似文献   

14.
为了从大量的电子邮件中检测垃圾邮件,提出了一个基于Hadoop平台的电子邮件分类方法。不同于传统的基于内容的垃圾邮件检测,通过在Map Reduce框架上统计分析邮件收发记录,提取邮件账号的行为特征。然后使用Map Reduce框架并行的实现随机森林分类器,并基于带有行为特征的样本训练分类器和分类邮件。实验结果表明,基于Hadoop平台的电子邮件分类方法大大提高了大规模电子邮件的分类效率。  相似文献   

15.
Predicting the fault-proneness labels of software program modules is an emerging software quality assurance activity and the quality of datasets collected from previous software version affects the performance of fault prediction models. In this paper, we propose an outlier detection approach using metrics thresholds and class labels to identify class outliers. We evaluate our approach on public NASA datasets from PROMISE repository. Experiments reveal that this novel outlier detection method improves the performance of robust software fault prediction models based on Naive Bayes and Random Forests machine learning algorithms.  相似文献   

16.
Elghazel  Haytham  Aussem  Alex 《Machine Learning》2015,98(1-2):157-180
Machine Learning - In this paper, we show that the way internal estimates are used to measure variable importance in Random Forests are also applicable to feature selection in unsupervised...  相似文献   

17.
随机森林与支持向量机分类性能比较   总被引:5,自引:0,他引:5  
黄衍  查伟雄 《软件》2012,(6):107-110
随机森林是一种性能优越的分类器。为了使国内学者更深入地了解其性能,通过将其与已在国内得到广泛应用的支持向量机进行数据实验比较,客观地展示其分类性能。实验选取了20个UCI数据集,从泛化能力、噪声鲁棒性和不平衡分类三个主要方面进行,得到的结论可为研究者选择和使用分类器提供有价值的参考。  相似文献   

18.
针对装备试验数据量有限和装备测试数据易缺失的现状,提出了一种基于集成学习的回归插补方法。以随机森林和XGBoost算法为回归器,通过设定快速填充基准和特征重要性评估策略的方法,改进数据子集重建和训练集与测试集的迭代划分策略,使用Optuna框架实现回归器超参数的自动优化,在某型导弹发射试验上进行实例验证。结果表明,使用集成学习算法的回归插补效果明显优于传统的统计量插补法以及KNN和BP神经网络,在不同缺失比例下的回归确定系数结果均保持在0.95以上,能有效解决装备小样本试验数据缺失的问题,并利用KEEL公测数据集验证了该方法的推广价值和通用性。  相似文献   

19.
We present an extensive empirical comparison between nineteen prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering and probability metrics over nineteen UCI benchmark data sets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, on the bias–variance decomposition and the effect of calibrating the models via Isotonic Regression on each performance metric. The selected data sets were already used in various empirical studies and cover different application domains. The source code and the detailed results of our study are publicly available.  相似文献   

20.
Currently, high dimensional data processing confronts two main difficulties: inefficient similarity measure and high computational complexity in both time and memory space. Common methods to deal with these two difficulties are based on dimensionality reduction and feature selection. In this paper, we present a different way to solve high dimensional data problems by combining the ideas of Random Forests and Anchor Graph semi-supervised learning. We randomly select a subset of features and use the Anchor Graph method to construct a graph. This process is repeated many times to obtain multiple graphs, a process which can be implemented in parallel to ensure runtime efficiency. Then the multiple graphs vote to determine the labels for the unlabeled data. We argue that the randomness can be viewed as a kind of regularization. We evaluate the proposed method on eight real-world data sets by comparing it with two traditional graph-based methods and one state-of-the-art semi-supervised learning method based on Anchor Graph to show its effectiveness. We also apply the proposed method to the subject of face recognition.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号