Similar Literature
20 similar documents found (search time: 31 ms)
1.
师彦文  王宏杰 《计算机科学》2017,44(Z11):98-101
To classify imbalanced data sets effectively, this paper proposes a classifier that combines cost-sensitive learning with the random forest algorithm. First, a new impurity measure is introduced that accounts not only for the total cost of a decision tree but also for the cost differences between samples at the same node. Next, the random forest procedure draws K samples from the data set to build K base classifiers. Decision trees are then grown with the classification and regression tree (CART) algorithm using the proposed impurity measure, forming a forest of decision trees. Finally, the forest classifies data by majority voting. Experiments on UCI data sets show that, compared with the traditional random forest and existing cost-sensitive random forest classifiers, the proposed classifier performs well on three measures: classification accuracy, AUC, and the Kappa coefficient.
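The abstract does not give the impurity formula itself; as a rough sketch of the idea only (the dictionary-based cost matrix and the pairwise weighting below are assumptions, not the authors' definition), a cost-weighted Gini impurity could look like:

```python
from collections import Counter

def cost_weighted_gini(labels, cost):
    """Gini-style impurity where each confusion (true class i, predicted
    class j) is weighted by a misclassification cost matrix cost[i][j].
    With all off-diagonal costs equal to 1 this reduces to ordinary Gini."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = {c: k / n for c, k in Counter(labels).items()}
    # Sum of cost-weighted probabilities of confusing class i with class j.
    return sum(p[i] * p[j] * cost[i][j]
               for i in p for j in p if i != j)

labels = [0, 0, 0, 1]
unit_cost = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}   # reduces to ordinary Gini
heavy_cost = {0: {0: 0, 1: 1}, 1: {0: 4, 1: 0}}  # missing class 1 costs 4x
print(cost_weighted_gini(labels, unit_cost))     # 0.375, same as plain Gini
print(cost_weighted_gini(labels, heavy_cost))    # larger: split is penalized
```

A splitter would then pick the split minimizing the weighted sum of this impurity over child nodes, so minority-class errors are discouraged more strongly.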

2.
Random forests (RF) are widely used in machine learning because they are robust to noise, achieve high predictive accuracy, and handle high-dimensional data. The model decision tree (MDT) is an accelerated decision tree algorithm: it speeds up training, but its accuracy degrades as the number of impure pseudo-leaf nodes grows. To address this problem, a model decision forest algorithm (MDF) is proposed to improve the classification accuracy of model decision trees. MDF uses MDT as the base classifier and, following the random forest idea, grows multiple model decision trees. The algorithm first obtains different sample subsets through rotation matrices, then trains a different model decision tree on each subset, combines the trees by voting, and finally outputs the classification given by the resulting model decision forest. Experiments on standard data sets show that the model decision forest is clearly more accurate than the model decision tree algorithm, and that MDF reaches good accuracy even with few trees, avoiding the higher time complexity that comes with growing more trees.

3.
Random forests build decision trees by sampling features on top of bootstrap sampling, trading individual tree accuracy for lower inter-tree correlation and thereby higher predictive accuracy. On large data sets, however, the correlation between trees remains high and the random forest underperforms. To solve this problem, an improved algorithm based on out-of-bag (OOB) prediction is proposed that raises the accuracy of the individual trees to improve the forest's predictions. The forest's OOB predictions are concatenated with the original features and the forest is retrained, which effectively lowers the VC dimension, empirical risk, and generalization risk of the trees, raises their accuracy, and ultimately improves the forest's predictive performance. More accurate trees, however, produce more similar predictions, and the increased inter-tree correlation hurts the forest's final performance. To counter this, a space-extension algorithm generates different features for different trees, lowering inter-tree correlation without significantly reducing tree accuracy. Experiments show that the algorithm improves average accuracy over the original random forest by 1.7% across 32 data sets, and under a corrected paired t-test it significantly outperforms the original random forest on 19 of them.

4.
Decision forest is an ensemble classification method that combines multiple decision trees in a manner that results in more accurate classifications. By combining multiple heterogeneous decision trees, a decision forest is effective in mitigating the noise that is often prevalent in real-world classification tasks. This paper presents a new genetic algorithm for constructing a decision forest in which each decision tree classifier is trained on a disjoint set of attributes. Moreover, we examine the effectiveness of using a Vapnik-Chervonenkis dimension bound for evaluating the fitness function of the decision forest. The new algorithm was tested on various datasets, and the obtained results were compared to other methods, indicating the superiority of the proposed algorithm. © 2008 Wiley Periodicals, Inc.

5.
To improve the ensemble classification accuracy of decision trees, this paper describes the rotation forest ensemble algorithm, which is based on feature transformation: the attribute set is randomly partitioned, principal component analysis is applied to subsamples drawn over each attribute subset, and new training data are constructed from the components, increasing the diversity of the base classifiers and improving predictive accuracy. On the Weka platform, Bagging, AdaBoost, and rotation forest were compared as ensembles of the J48 decision tree algorithm, both pruned and unpruned, using the average accuracy of 10 runs of 10-fold cross-validation as the criterion. The results show that rotation forest predicts more accurately than the other two algorithms, confirming that it is an effective ensemble method for decision tree classifiers.
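The core feature transformation described above can be sketched in a few lines of NumPy; this is a minimal illustration using an eigendecomposition-based PCA, not the Weka implementation (the per-learner class and bootstrap sampling steps of the full rotation forest algorithm are omitted):

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng):
    """Rotation-Forest-style transform: randomly partition the feature
    indices into n_subsets groups, run PCA on each group, and assemble
    the loadings into a block-diagonal (orthogonal) rotation matrix."""
    n_features = X.shape[1]
    idx = rng.permutation(n_features)
    groups = np.array_split(idx, n_subsets)
    R = np.zeros((n_features, n_features))
    for g in groups:
        Xg = X[:, g] - X[:, g].mean(axis=0)
        cov = np.atleast_2d(np.cov(Xg, rowvar=False))
        _, vecs = np.linalg.eigh(cov)      # PCA loadings for this group
        R[np.ix_(g, g)] = vecs
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rot = X @ R      # transformed training data for one base learner
```

Because each block of loadings is orthonormal, the assembled matrix is an orthogonal rotation of the feature space; a different random partition per base tree yields a different rotation and hence more diverse trees.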

6.
Operating-state evaluation judges how well a production process is running, given that production is otherwise normal. For complex industrial processes where quantitative and qualitative information coexist, this paper proposes a random-forest-based method for evaluating process operating states. To address the redundancy among trees in a random forest, the trees are grouped using mutual information and the best tree in each group is selected to form a new forest. To strengthen the influence of trees with high evaluation accuracy and weaken that of trees with low accuracy, weighted voting replaces traditional majority voting, yielding a mutual information weighted random forest (MIWRF) algorithm. For online evaluation, the method computes the probability that online data belong to each grade and, combined with the proposed online evaluation strategy, determines the operating-state grade of the current sample. To verify the algorithm, it was applied to a hydrometallurgical leaching process; experiments show that, compared with the traditional random forest, MIWRF reduces model complexity while improving the accuracy of operating-state evaluation.
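The weighted voting step described above can be sketched very compactly (the weights here are placeholder values; the abstract derives them from each tree's evaluation accuracy):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority vote: predictions[k] is tree k's predicted
    grade and weights[k] its accuracy-based voting weight."""
    score = defaultdict(float)
    for pred, w in zip(predictions, weights):
        score[pred] += w
    return max(score, key=score.get)

# Three of four trees say "good", but the most accurate tree says
# "poor" and its weight dominates the vote.
preds = ["good", "good", "good", "poor"]
weights = [0.2, 0.2, 0.2, 0.9]
print(weighted_vote(preds, weights))   # -> poor
```

With all weights equal this reduces to the traditional majority vote, which is what the weighting is meant to improve on.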

7.
A Decision Forest Learning Method Based on Relational Data Analysis
Multiple-classifier ensembles are attracting growing attention in pattern recognition. This paper proposes a multiple sub-model ensemble method based on decision forest construction: by assigning decision rules to each sample, it builds a decision forest rather than a single decision tree to automatically determine relatively independent sample subsets, and on that basis combines the models under a conditional independence assumption. The learning process requires no human intervention; it adaptively determines the number of trees and the structure of each subtree, letting each classifier play to its strengths on different samples and regions. Experimental results on UCI machine learning data sets and a case analysis verify the effectiveness of the method.

8.
Univariate decision trees are classifiers currently used in many data mining applications. Such a classifier discovers partitions of the input space via hyperplanes orthogonal to the attribute axes, producing a model that human experts can understand. One disadvantage of univariate decision trees is that they produce complex and inaccurate models when the decision boundaries are not orthogonal to the axes. In this paper we introduce the Fisher's tree, a classifier that takes advantage of the dimensionality reduction of Fisher's linear discriminant and uses the decomposition strategy of decision trees to arrive at an oblique decision tree. Our proposal generates an artificial attribute that is used to split the data recursively. The Fisher's decision tree induces oblique trees whose accuracy, size, number of leaves, and training time are competitive with other decision trees reported in the literature. We use more than ten publicly available data sets to demonstrate the effectiveness of our method.

9.
An automated method is presented for the design of linear tree classifiers, i.e. tree classifiers in which a decision based on a linear sum of features is carried out at each node. The method exploits the discriminability of Tomek links joining opposed pairs of data points in multidimensional feature space to produce a hierarchically structured piecewise linear decision function. The corresponding decision surface is optimized by a gradient descent that maximizes the number of Tomek links cut by each linear segment of the decision surface, followed by training each node's linear decision segment on the data associated with that node. Experiments on real data obtained from ship images and character images suggest that the accuracy of the tree classifier designed by this scheme is comparable to that of the k-NN classifier while providing much greater decision speeds, and that the trade-off between the speed and the accuracy in pattern classification can be controlled by bounding the number of features to be used at each node of the tree. Further experiments comparing the classification errors of our tree classifier with the tree classifier produced by the Mui/Fu technique indicate that our tree classifier is no less accurate and often much faster than the Mui/Fu classifier.

10.
The Number of Trees in a Random Forest
Random forest is an ensemble classifier. This paper analyzes the parameters that affect random forest performance and shows that the number of trees is critical. Methods for choosing the number of trees and for evaluating random forest performance are reviewed and summarized. Using classification accuracy as the evaluation criterion, experiments on UCI data sets analyze the relationship between the number of trees and the data set; the results show that for most data sets, 100 trees are enough to reach the required accuracy. Random forest is also compared, in terms of accuracy, with the support vector machine, a classifier with excellent performance; the results show that the classification performance of random forest is comparable to that of support vector machines.
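Why accuracy plateaus around a moderate number of trees can be illustrated with an idealized model the paper does not itself use: if trees voted independently, each correct with probability p, majority-vote accuracy would follow a binomial tail that rises steeply at first and then flattens as trees are added (real forests are correlated, so the gains flatten even sooner):

```python
from math import comb

def ensemble_accuracy(n_trees, p):
    """Probability that a majority vote of n_trees independent binary
    classifiers, each correct with probability p, is correct
    (Condorcet jury model; even-n ties broken by a fair coin)."""
    k_major = n_trees // 2 + 1
    acc = sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
              for k in range(k_major, n_trees + 1))
    if n_trees % 2 == 0:               # split vote: coin flip
        k = n_trees // 2
        acc += 0.5 * comb(n_trees, k) * p**k * (1 - p)**k
    return acc

for n in (1, 11, 101, 1001):
    print(n, round(ensemble_accuracy(n, 0.6), 4))
```

Most of the improvement over a single tree is already realized well before 1000 trees, consistent with the paper's observation that around 100 trees often suffice.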

11.
To address the low recognition rate on minority-class samples caused by imbalanced data, an algorithm is proposed that improves both oversampling and random forest through weighting strategies, reducing the impact of class imbalance from the data-preprocessing side and the algorithm side. In preprocessing, the Synthetic Minority Oversampling Technique (SMOTE) reduces the imbalance: each minority sample is assigned a weight based on its Euclidean distance to the remaining samples, so that each sample synthesizes a different number of new samples. On the algorithm side, the Kappa coefficient evaluates the classification performance of each trained decision tree in the random forest and assigns each tree a corresponding weight, giving trees with better classification ability a larger say in the voting stage and improving the forest's overall performance on imbalanced data. Experiments on KEEL data sets show that, compared with the unimproved algorithms, the improved algorithm raises both minority-class accuracy and overall classification performance.
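The SMOTE step the abstract builds on can be sketched as follows; this is generic SMOTE interpolation only, without the distance-based per-sample weighting that is the paper's contribution (the neighbour count k and sample counts below are illustrative):

```python
import numpy as np

def smote_sample(minority, n_new, rng, k=3):
    """Generate n_new synthetic minority samples SMOTE-style: pick a
    minority point, pick one of its k nearest minority neighbours,
    and interpolate at a random position on the segment between them."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest, excluding x itself
        x_nn = minority[rng.choice(nn)]
        gap = rng.random()                 # position along the segment
        out.append(x + gap * (x_nn - x))
    return np.array(out)

rng = np.random.default_rng(42)
minority = rng.normal(size=(10, 2))
synthetic = smote_sample(minority, n_new=15, rng=rng)
```

In the paper's weighted variant, n_new would differ per minority point, with more synthetic samples generated around points that are far from the rest of the data.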

12.
郭冰楠  吴广潮 《计算机应用》2019,39(10):2888-2892
In data sets of online loan users, the numbers of successful and failed borrowers are severely imbalanced, and traditional machine learning algorithms emphasize overall classification accuracy, yielding low prediction precision on successful borrowers. To address this, the class distribution is incorporated into the computation of the sensitivity function of a cost-sensitive decision tree, weakening the influence of the positive and negative sample counts on the misclassification cost and producing an improved cost-sensitive decision tree. Using this tree as the base classifier, the better-performing base classifiers are selected by classification accuracy and combined with the classifier generated in the final stage to form the final ensemble. Experiments show that, compared with existing algorithms commonly used for this problem (such as MetaCost, cost-sensitive decision trees, and AdaCost), the improved cost-sensitive decision tree lowers the overall misclassification rate when classifying online loan users and generalizes better.

13.
Neural networks and decision tree methods are two common approaches to pattern classification. While neural networks can achieve high predictive accuracy, the decision boundaries they form are highly nonlinear and generally difficult to comprehend. Decision trees, on the other hand, can be readily translated into a set of rules. In this paper, we present a novel algorithm for generating oblique decision trees that capitalizes on the strengths of both approaches. Oblique decision trees classify patterns by testing linear combinations of the input attributes; as a result, an oblique decision tree is usually much smaller than the univariate tree generated for the same domain. Our algorithm consists of two components, connectionist and symbolic: a three-layer feedforward neural network is constructed and pruned, a decision tree is then built from the hidden-unit activation values of the pruned network, and an oblique decision tree is obtained by expressing the activation values in terms of the original input attributes. We test our algorithm on a wide range of problems. The oblique decision trees generated by the algorithm preserve the high accuracy of the neural networks while keeping the explicitness of decision trees. Moreover, they outperform, in accuracy and tree size, both the univariate decision trees generated by the symbolic approach and oblique decision trees built by other approaches.

14.
Based on the Google Earth Engine (GEE) cloud computing platform, this study combines Sentinel-2 imagery, WorldClim bioclimatic data, SRTM terrain data, and forest management inventory data, and uses three machine learning algorithms, random forest (RF), support vector machine (SVM), and maximum entropy (MaxEnt), as component classifiers for dominant tree species classification with multi-source features and multi-classifier decision fusion. From the three component classifiers, two serial ensembles and three parallel Bayesian ensembles were built to determine the spatial distribution of ten major dominant tree species in the Shangri-La region of Yunnan. The classification results show that the overall accuracy of each of the three component classifiers was below 67.17%; the three parallel ensembles performed comparably, at about 72% overall accuracy; and the two serial ensembles exceeded 78.48%, with the MaxEnt-SVM serial ensemble achieving the best accuracy (OA: 80.66%, Kappa: 0.78), at least 13.49% above the component classifiers. The study shows that decision fusion is more accurate than the component classifiers for dominant tree species classification, effectively improves accuracy for species with small samples, and can be used for dominant tree species classification over large mountainous areas.

15.
This paper proposes a method for combining multiple tree classifiers based on both a classifier ensemble (bagging) and a dynamic classifier selection scheme (DCS). The proposed method consists of the following procedures: (1) building individual tree classifiers from bootstrap samples; (2) calculating the distance between every pair of trees; (3) clustering the trees by single-linkage clustering; (4) selecting two clusters in the local region in terms of accuracy and error diversity; and (5) voting the results of the tree classifiers selected from the two clusters. Empirical evaluation using publicly available data sets confirms the superiority of our proposed approach over other classifier combining methods.

16.
This paper proposes a general-purpose OCR development tool for building OCR software for various scripts; with user intervention it automatically completes the design of the recognizer, greatly reducing the effort of developing character recognition software. The system uses decision trees as the basic discriminator and combines multiple decision trees into a multi-scheme recognition system. The concepts of a design tree and a classifier designer are introduced to control the decision tree design process and to design the classifiers at the decision tree nodes, respectively. Finally, an experimental system is implemented, verifying the feasibility of the proposal and its design.

17.
In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems in classifying such data: overfitting, noisy instances, and class-imbalanced data. Rules are a well-known, human-interpretable way of representing data. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without global optimisation: the random subspace approach guards against overfitting, boosting handles noisy instances, and the ensemble of decision trees deals with class imbalance. The classifier uses two popular classification techniques, decision trees and the k-nearest-neighbour algorithm. Decision trees evolve classification rules from the training data, while k-nearest-neighbour analyses the misclassified instances and removes vagueness between contradictory rules. The classifier runs a series of k iterations to develop the rule set, paying more attention in each iteration to the instances misclassified previously, which gives it a boosting flavour. The paper focuses in particular on producing an optimal ensemble classifier that improves the prediction accuracy of DNA variant identification and classification. The performance of the proposed classifier is tested against well-established machine learning and data mining algorithms on genomic data (148 exome data sets) of Brugada syndrome and 10 real benchmark life-sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier achieves exemplary classification accuracy on different types of biological data. Overall, it offers good prediction accuracy for classifying new DNA variants, where noisy and misclassified variants are optimised to increase test performance.

18.
Enlarging the feature space of the base tree classifiers in a decision forest with informative features extracted from an additional predictive model is advantageous for classification tasks. In this paper, we empirically examine the performance of this type of decision forest with three base tree classifier models: (1) the full decision tree, (2) the eight-node decision tree, and (3) the two-node decision tree (decision stump). The hybrid decision forest with each of these base classifiers is trained on nine resampled training sets of different sizes. We examine these ensembles from several points of view: we study the bias-variance decomposition of their misclassification error, then investigate the amount of dependence and the degree of uncertainty among their base classifiers using information-theoretic measures. The experiments were designed to find (1) the optimal training set size for each base classifier and (2) which base classifier is optimal for this kind of decision forest. In the final comparison, we check whether the subsampled version of the decision forest outperforms the bootstrapped version. All experiments were conducted on 20 benchmark datasets from the UCI machine learning repository. The overall results clearly show that, with careful selection of the base classifier and training sample size, the hybrid decision forest can be an efficient tool for real-world classification tasks.

19.
This work is motivated by interest in forensic steganalysis, which aims to detect the presence of secret messages transmitted through a subliminal channel. A critical part of steganalyser design is the choice of stego-sensitive features and an efficient machine learning paradigm. The goals of this paper are: (1) to demonstrate that higher-order statistics of the Hausdorff distance, a dissimilarity metric, offer discriminative power between clean and stego audio; and (2) to achieve promising classification accuracy by realizing the proposed steganalyser with an evolving decision tree classifier. Stego-sensitive feature selection is carried out by the genetic algorithm (GA) component, and construction of the rule base is handled by the decision tree module. The objective function is designed to maximize the precision and recall of the classifier, enhancing the detection accuracy of the system with low-dimensional, informative features. An extensive experimental evaluation of the proposed system was conducted on a database of 4800 clean and stego audio files (generated with six different embedding schemes), using the family of six GA decision trees. The observed detection rate of over 90%, a promising score for a blind steganalyser, shows that the proposed scheme, with Hausdorff distance statistics as features and an evolving decision tree as classifier, is a state-of-the-art steganalyser that outperforms many previous steganalytic methods.

20.
With the advantages of being easy to understand and efficient to compute, the decision tree has long been one of the most popular classifiers. Decision trees constructed with existing approaches, however, tend to be huge and complex, and are consequently difficult to use in practical applications. In this study, we deal with tree complexity by allowing users to specify the number of leaf nodes, and then constructing a decision tree with maximum classification accuracy under that constraint. A new algorithm, the Size Constrained Decision Tree (SCDT), is proposed for constructing such a tree, paying close attention to how to use the limited number of leaf nodes efficiently. Experimental results show that the SCDT method successfully generates simpler decision trees and offers better accuracy.
