Similar literature
 Found 20 similar documents (search time: 10 ms)
1.
Empirical characterization of random forest variable importance measures   Total citations: 2, self-citations: 0, citations by others: 2
Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, often the goals are both to produce an accurate classifier and to uncover the predictive structure of the problem. Most machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight regarding the covariates that best contribute to the predictive structure. Other methods, such as linear discriminant analysis, require the predictor space be substantially reduced prior to deriving the classifier. A recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yield variable importance measures for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. Such goals are common among microarray studies, and therefore application of the RF methodology for the purpose of obtaining variable importance measures is demonstrated on a microarray data set.
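
The abstract above does not say which importance measure was studied. As a rough illustration only, the sketch below computes a permutation-style importance (the accuracy drop when one predictor column is shuffled), which is one of the measures random forests commonly report. The toy data, the stand-in `predict` rule, and all function names are assumptions made for this example; a real random forest would permute out-of-bag samples per tree rather than a fixed data set.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, seed=0):
    """Accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled copy.
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, permuted, labels))
    return importances

# Hypothetical toy data: feature 0 determines the class, feature 1 is noise.
n = 200
noise = random.Random(1)
rows = [[i / n, noise.random()] for i in range(n)]
labels = [int(r[0] > 0.5) for r in rows]

def predict(row):
    # Stand-in classifier (an assumption); a fitted forest would go here.
    return int(row[0] > 0.5)

importances = permutation_importance(predict, rows, labels)
```

In this toy setting the noise feature gets an importance of exactly zero because the stand-in classifier never reads it, while shuffling the true predictor lowers accuracy.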

2.
Using a number of measures for characterising the complexity of classification problems, we studied the comparative advantages of two methods for constructing decision forests – bootstrapping and random subspaces. We investigated a collection of 392 two-class problems from the UCI depository, and observed that there are strong correlations between the classifier accuracies and measures of length of class boundaries, thickness of the class manifolds, and nonlinearities of decision boundaries. We found characteristics of both difficult and easy cases where combination methods are no better than single classifiers. Also, we observed that the bootstrapping method is better when the training samples are sparse, and the subspace method is better when the classes are compact and the boundaries are smooth. Received: 03 November 2000, Received in revised form: 25 October 2001, Accepted: 04 January 2002

3.
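To address problems that current classification algorithms still have, such as poor scalability, poor tunability, and a lack of global optimization ability, this paper proposes an effective method for classification tasks in data mining: a decision-tree-based co-evolutionary classification algorithm. Experimental results show that the method achieves higher prediction accuracy and produces smaller rule sets.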

5.
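Decision tree algorithms are built recursively, so training is relatively inefficient, and an over-partitioned decision tree may overfit. This paper therefore proposes a model decision tree algorithm. It first uses the Gini index to recursively grow an incomplete decision tree on the training set, then applies a simple classification model to the impure pseudo-leaf nodes (non-leaf nodes whose samples do not all belong to the same class) to produce the final decision tree. Compared with the original decision tree algorithm, the resulting model decision tree improves training efficiency with no loss, or only a small loss, of accuracy. Experiments on standard data sets show that the proposed model decision tree is clearly faster than the decision tree algorithm and has some resistance to overfitting.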
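
The two-stage idea in the abstract above (a Gini-based incomplete tree, then a simple model at each impure pseudo-leaf) can be sketched roughly as follows. This is not the paper's implementation: the single split on one numeric feature, the nearest-class-mean model, and all names are assumptions chosen to keep the example short. It also assumes the feature takes at least two distinct values.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Threshold on one numeric feature that minimises the weighted Gini index."""
    best_t, best_score = None, float("inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def fit_leaf(xs, ys):
    """Pure node: constant label. Impure pseudo-leaf: nearest-class-mean model."""
    if gini(ys) == 0.0:
        label = ys[0]
        return lambda x: label
    means = {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
             for c in set(ys)}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def fit_model_tree(xs, ys):
    """One Gini split (an incomplete tree of depth 1) with a model in each child."""
    t = best_threshold(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    left_model = fit_leaf([x for x, _ in left], [y for _, y in left])
    right_model = fit_leaf([x for x, _ in right], [y for _, y in right])
    return lambda x: left_model(x) if x <= t else right_model(x)

# Hypothetical toy data: class "a" is separable, "b" and "c" share a branch.
model = fit_model_tree([1, 2, 3, 10, 11, 12, 13],
                       ["a", "a", "a", "b", "b", "c", "c"])
```

Here the tree stops after one split, leaving a branch that still mixes "b" and "c"; the simple model resolves that branch instead of further recursive splitting.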

6.
The regularized random forest (RRF) was recently proposed for feature selection by building only one ensemble. In RRF the features are evaluated on a part of the training data at each tree node. We derive an upper bound for the number of distinct Gini information gain values in a node, and show that many features can share the same information gain at a node with a small number of instances and a large number of features. Therefore, in a node with a small number of instances, RRF is likely to select a feature not strongly relevant. Here an enhanced RRF, referred to as the guided RRF (GRRF), is proposed. In GRRF, the importance scores from an ordinary random forest (RF) are used to guide the feature selection process in RRF. Experiments on 10 gene data sets show that the accuracy performance of GRRF is, in general, more robust than RRF when their parameters change. GRRF is computationally efficient, can select compact feature subsets, and has competitive accuracy performance, compared to RRF, varSelRF and LASSO logistic regression (with evaluations from an RF classifier). Also, RF applied to the features selected by RRF with the minimal regularization outperforms RF applied to all the features for most of the data sets considered here. Therefore, if accuracy is considered more important than the size of the feature subset, RRF with the minimal regularization may be considered. We use the accuracy performance of RF, a strong classifier, to evaluate feature selection methods, and illustrate that weak classifiers are less capable of capturing the information contained in a feature subset. Both RRF and GRRF were implemented in the “RRF” R package available at CRAN, the official R package archive.
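
The claim that many features share the same Gini information gain in a small node can be checked by brute force. The sketch below does not reproduce the paper's bound; it simply enumerates every possible binary feature over a hypothetical node of four instances and collects the distinct gain values. The node labels and function names are assumptions for this example.

```python
from collections import Counter
from itertools import product

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(feature, labels):
    """Parent Gini index minus the size-weighted Gini index of the children."""
    n = len(labels)
    gain = gini(labels)
    for value in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == value]
        gain -= len(subset) / n * gini(subset)
    return gain

# Hypothetical node with four instances, two per class.
labels = [0, 0, 1, 1]
# Every possible binary feature over these four instances (16 in total).
features = list(product([0, 1], repeat=len(labels)))
# Round to absorb floating-point noise before comparing gain values.
distinct_gains = {round(gini_gain(f, labels), 12) for f in features}
```

With so few instances the 16 candidate features collapse onto a handful of gain values, so ties between relevant and irrelevant features are unavoidable in such a node.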

7.
Decision tree regression for soft classification of remote sensing data   Total citations: 1, self-citations: 0, citations by others: 1
In recent years, decision tree classifiers have been successfully used for land cover classification from remote sensing data. Their implementation as a per-pixel based classifier to produce hard or crisp classification has been reported in the literature. Remote sensing images, particularly at coarse spatial resolutions, are contaminated with mixed pixels that contain more than one class on the ground. The per-pixel approach may result in erroneous classification of images dominated by mixed pixels. Therefore, soft classification approaches that decompose the pixel into its class constituents in the form of class proportions have been advocated. In this paper, we employ a decision tree regression approach to determine class proportions within a pixel so as to produce soft classification from remote sensing data. Classification accuracy achieved by decision tree regression is compared with those achieved by the most widely used maximum likelihood classifier, implemented in the soft mode, and a supervised version of the fuzzy c-means classifier. Root Mean Square Error (RMSE) and fuzzy error matrix based measures have been used for accuracy assessment of soft classification.
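
As a small illustration of the RMSE measure named in the abstract above, the sketch below compares predicted and reference class proportions. The proportions are made-up numbers, and how the paper aggregates over pixels and classes is not stated, so the flat list used here is an assumption.

```python
import math

def rmse(predicted, reference):
    """Root mean square error between two equal-length lists of proportions."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical class proportions for one mixed pixel (three classes).
predicted = [0.6, 0.3, 0.1]
reference = [0.5, 0.4, 0.1]
error = rmse(predicted, reference)
```

A value of zero would mean the estimated proportions match the reference exactly; larger values indicate a worse soft classification.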

8.
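牛琨, 陈俊亮, 张舒博. 《计算机工程》 (Computer Engineering), 2007, 33(10): 222-224
Using the maximal classification tree as a tool for analyzing empirical risk and structural risk, this paper studies the limit of decision tree classification accuracy. To address the difficulty of objectively evaluating the classification performance of decision tree models, it discusses the conditions under which a limit on decision tree classification accuracy exists and gives a method for computing that limit. With the maximal classification tree as the analytical tool, it presents four theorems on whether the accuracy limit exists under four distributions of empirical and structural risk, and discusses them from the perspectives of both machine learning theory and engineering modeling practice. Experiments verify the correctness of the theory.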

9.
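When important or sensitive customers experience a power outage, grid companies face considerable pressure, so accurately identifying outage-sensitive electricity customers and reducing the losses that outages cause them is of great significance. This paper proposes using an ant colony algorithm to optimize the decision tree algorithm, mainly in terms of attribute discretization, heuristic information, and pheromone updating. Simulation experiments on classification data from the UCI database, comparing against conventional SVM and decision tree methods, verify that the proposed method has higher classification accuracy. A further simulation comparison with conventional SVM and Logistic algorithms verifies that the proposed method is better suited to analyzing customers' outage sensitivity.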

10.
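The decision tree algorithm is a nonparametric, nonlinear supervised classification method. The study used a Landsat TM image from 1 August 2010 as the base remote sensing source, and took as its study area the region in central Chifeng, Inner Mongolia Autonomous Region, where Bairin Right Banner, Linxi County, Hexigten Banner, and Ongniud Banner meet. After the training sample set was revised and refined over several rounds, a total of 21 feature variables (the 6 original bands, NDVI, the first 3 principal components, 8 commonly used texture features, and 3 terrain features) were grouped into 5 different feature combinations. The typical decision tree algorithm C5.0 was used for remote sensing image classification experiments, and the results were compared with maximum likelihood classification. The results show that C5.0 decision tree classification outperforms maximum likelihood classification, especially when the feature combination is well chosen; it can make effective use of relevant ancillary information, so the final classification better meets user needs.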
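
NDVI, one of the 21 feature variables listed above, is the normalized difference of near-infrared and red reflectance, which for Landsat TM are bands 4 and 3. A minimal per-pixel sketch follows; the reflectance values and the choice to return 0.0 for an all-zero pixel are assumptions for this example, not details from the paper.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    denominator = nir + red
    if denominator == 0:
        # Convention chosen for this sketch: treat an all-zero pixel as 0.0.
        return 0.0
    return (nir - red) / denominator

# Hypothetical reflectances for a vegetated pixel (TM band 4 = NIR, band 3 = red).
vegetated = ndvi(nir=0.5, red=0.1)
# Hypothetical reflectances for a bare-soil pixel.
bare_soil = ndvi(nir=0.3, red=0.25)
```

Dense vegetation reflects strongly in the near infrared and absorbs red light, so it yields values closer to 1 than bare soil does, which is why the index is a useful input to a classifier.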

11.
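Forests are an important component of the ecological environment. As the climate warms, severe climatic and meteorological conditions are causing frequent forest fires worldwide, posing huge challenges to national economies and to firefighting and rescue; forest fires have become a major natural disaster globally. Consequently, visual modeling of forest scenes, 3D scene simulation, forest fire simulation, fire scene reproduction, prediction, and disaster assessment have become research hotspots in forestry virtual simulation. This paper reviews frontier techniques and algorithms including tree morphological structure modeling, large-scale reconstruction and real-time rendering of forest scenes, forest scene visualization, forest fire models, and forest fire simulation. It categorizes methods for representing the morphological structure of trees and vegetation and for realistic visual modeling, and discusses the strengths and weaknesses, complexity, real-time rendering efficiency, and applicable scenarios of the different visualization methods. Large-scale forest scene reconstruction techniques are classified into rule-based tree modeling methods and real-scene reconstruction methods based on stand characteristics. Forest fire models, and the simulation and spread of single-tree and multi-tree fires, are summarized in terms of physical, empirical, and semi-empirical models. The paper analyzes how different environmental and meteorological factors (such as terrain, humidity, and fuels) and forest distribution affect the ignition, diffusion, and spread of forest fires, compares and discusses the strengths and weaknesses of different algorithms, and looks ahead to future directions, open problems, and challenges for forest scene visualization and forest fire simulation. It provides a theoretical and methodological basis for building forest fire simulation and immersive, interactive digital-twin simulation systems based on real forest scenes; such a platform can rapidly construct forest scenes, simulate forest fires from different ignition sources, simulate fire spread, and predict fire behavior under different meteorological conditions, offering visual decision support for fire rescue command, fire damage assessment, and fire scene reconstruction.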

12.
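Forest classification in Yunnan Province based on the decision tree method   Total citations: 2, self-citations: 0, citations by others: 2
Forest classification is important for understanding the structure and function of forest ecosystems. Because Yunnan Province has complex terrain and forest types, the province's Landsat TM imagery was first divided into 16 regions corresponding to its 16 administrative divisions. Using a set of 18 variables composed of TM bands 1 to 5 and 7 together with vegetation indices, the tasseled cap transformation, the principal component transformation, and DEM, the variation in mean spectral values of the training samples and the relationship between spectral value and frequency were analyzed statistically. The optimal classification boundary points between classes were computed with an intersection formula and used to build decision trees, which separated all forest types in each region one by one; the results were merged into a distribution map of broadleaf forest, coniferous forest, and mixed coniferous-broadleaf forest for Yunnan Province. Finally, the results were compared with those of the maximum likelihood method of supervised classification. The supervised classification achieved an overall accuracy of 74.39% and a Kappa coefficient of 0.63, whereas the decision tree method achieved an overall accuracy of 86.61% and a Kappa coefficient of 0.80. This indicates that the decision tree method can extract Yunnan's forest types with high accuracy and thus provide baseline data for studies such as the retrieval of forest leaf area index and biomass in the region.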
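
The overall accuracy and Kappa coefficient quoted above are both derived from a confusion matrix. The sketch below shows the standard calculation on made-up counts; the matrix is hypothetical and is not the paper's data, and the function name is chosen for this example. The formula is undefined when chance agreement equals 1 (for instance, when only one class is present).

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical counts for three forest types (rows: reference, columns: predicted).
confusion = [
    [40, 5, 5],
    [4, 30, 6],
    [6, 4, 50],
]
overall_accuracy, kappa = accuracy_and_kappa(confusion)
```

Kappa discounts the agreement that would arise by chance, which is why it is lower than the overall accuracy for the same matrix.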

13.
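Design and application of the decision tree algorithm SLIQ in data mining   Total citations: 4, self-citations: 0, citations by others: 4
This paper analyzes SLIQ, an algorithm that uses the Gini index for attribute selection, and discusses feasible ways to improve its efficiency. The algorithm is applied to a generation bidding decision system for the electricity market; mining the bidding ability of power producers yields knowledge of practical importance for their decisions.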

14.
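杨杰, 叶晨洲, 黄欣. 《计算机仿真》 (Computer Simulation), 2000, 17(6): 19-20, 35
In many optimization problems the target value is continuous. For such problems, the target value is first discretized and a decision tree method is then used to extract rules. To some extent, compared with optimizing the continuous target directly, this can improve accuracy and make the results easier to interpret. To overcome the abrupt transitions introduced by partitioning into segments, the target value can instead be partitioned fuzzily before rules are extracted with the decision tree method, which can further improve accuracy.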
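
The contrast between a crisp and a fuzzy partition of a continuous target can be sketched as below. The abstract does not specify the membership functions, so the triangular memberships, the bin edges, the class centres, and the function names here are all assumptions made for illustration.

```python
def crisp_bin(y, edges):
    """Index of the interval between consecutive edges that contains y."""
    for i in range(1, len(edges) - 1):
        if y < edges[i]:
            return i - 1
    return len(edges) - 2

def fuzzy_memberships(y, centers):
    """Triangular memberships of y in classes centred at sorted `centers`."""
    memberships = [0.0] * len(centers)
    if y <= centers[0]:
        memberships[0] = 1.0
    elif y >= centers[-1]:
        memberships[-1] = 1.0
    else:
        for i in range(len(centers) - 1):
            lo, hi = centers[i], centers[i + 1]
            if lo <= y <= hi:
                # Linear interpolation between the two neighbouring classes.
                w = (y - lo) / (hi - lo)
                memberships[i], memberships[i + 1] = 1.0 - w, w
                break
    return memberships

# Hypothetical target range [0, 10] split into "low" and "high".
edges = [0.0, 5.0, 10.0]
centers = [2.5, 7.5]
crisp_low, crisp_high = crisp_bin(4.9, edges), crisp_bin(5.1, edges)
fuzzy_low, fuzzy_high = fuzzy_memberships(4.9, centers), fuzzy_memberships(5.1, centers)
```

Two nearly identical targets (4.9 and 5.1) land in different crisp classes, whereas their fuzzy memberships differ only slightly, which is the abruptness the fuzzy partition is meant to soften.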

15.
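Automatic diagnosis of electrocardiograms (ECG) has high clinical value. Based on a comparison of various classification algorithms, this paper proposes building an automatic ECG diagnosis system using a CBR model and describes several key issues in designing the ECG case base of the CBR model.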

16.
17.
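A decision tree is a cluster analysis method that uses a divide-and-conquer strategy, and the key to building one is selecting suitable attributes. Traditional decision trees are usually constructed from the perspective of maximizing information entropy, which does not discriminate well enough between the classification abilities of attributes. To address this shortcoming of traditional decision tree induction algorithms, this paper proposes a decision tree induction algorithm based on the Mahalanobis distance. The algorithm uses the Mahalanobis distance to distinguish the classification ability of different subsets of feature attributes. Experimental results show that the metric-based decision tree outperforms the traditional decision tree.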
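
The abstract does not describe how the Mahalanobis distance enters the attribute selection, so the sketch below only shows the distance itself, restricted to two features so that the covariance matrix can be inverted in closed form. The function names and sample points are assumptions for this example.

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of x from mean under covariance (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    quadratic = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quadratic)

# Hypothetical class sample and a query point.
class_mean, class_cov = mean_and_cov_2d([(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)])
distance = mahalanobis_2d((5.0, 5.0), class_mean, class_cov)
```

Unlike the Euclidean distance, this measure rescales each direction by the spread of the class, so a feature subset with tight, well-separated classes yields larger between-class distances.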

18.

This paper reflects on 20 years of behavioural research in telecommunications published in BIT. The past 20 years have seen major changes in telecommunications technology and applications. They have also seen the deregulation of telecommunications markets and the pervasive penetration of the working environment by networked systems. Papers published in BIT have reflected these changes. Some research topics have attracted continuing interest throughout this period, and two are reviewed briefly: the effect of network delays on users and the relative effectiveness of different media and user choices between them. In addition many new technical and theoretical developments have been reported. Two major theoretical trends have been the convergence between behavioural research in telecommunications and computing, and the rise in social-science-based research. The question whether published behavioural research has been able to influence the development of the technologies studied is considered. Finally, the paper speculates on future topics for research.

20.
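Recent advances in decision tree algorithms for data mining   Total citations: 27, self-citations: 1, citations by others: 27
The paper outlines the basic principles and advantages of traditional decision tree methods and points out their limitations when applied to data mining on very large data sets. It then summarizes, under five headings, the main recent advances in decision tree methods for data mining, and discusses the challenges these methods face and their development trends.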
