Similar Literature
20 similar documents found
1.
Rotation forest: A new classifier ensemble method (total citations: 8; self-citations: 0; by others: 8)
We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage individual accuracy and diversity within the ensemble simultaneously. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name "forest". Accuracy is sought by keeping all principal components and also by using the whole data set to train each base classifier. Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest. The results were favorable to Rotation Forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than those in AdaBoost and Random Forest, and more diverse than those in Bagging, sometimes more accurate as well.
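As an illustration of the rotation step described above, here is a minimal sketch in Python with scikit-learn. It builds a block-diagonal rotation from per-subset PCAs and trains one tree on the rotated features; the full algorithm's per-classifier bootstrap and class-sampling details (and PCA centering) are omitted, and the helper name `build_rotation` is ours:

```python
# Minimal sketch of one Rotation Forest base learner (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def build_rotation(X, n_subsets, rng):
    """Split features into n_subsets groups, fit a PCA per group (keeping
    all components, as the abstract prescribes), and assemble a
    block-diagonal rotation matrix."""
    n_features = X.shape[1]
    groups = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for g in groups:
        pca = PCA().fit(X[:, g])        # retain all principal components
        R[np.ix_(g, g)] = pca.components_.T
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

R = build_rotation(X, n_subsets=3, rng=rng)    # K = 3 axis rotations
tree = DecisionTreeClassifier().fit(X @ R, y)  # train on rotated features
print(tree.score(X @ R, y))
```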

2.
The ensembling of classifiers tends to improve predictive accuracy. To obtain an ensemble with N classifiers, one typically needs to run N learning processes. In this paper we introduce and explore Model Jittering Ensembling, where a single model is perturbed in order to obtain variants that can be used as an ensemble. We use sets of classification association rules as base classifiers. The two jittering ensembling methods we propose are Iterative Reordering Ensembling (IRE) and Post Bagging (PB). Both methods start by learning one rule set over a single run, and then produce multiple rule sets without relearning. Empirical results on 36 data sets are positive and show that both strategies tend to reduce error with respect to the single-model association rule classifier. A bias-variance analysis reveals that while both IRE and PB are able to reduce the variance component of the error, IRE is particularly effective in reducing the bias component. We show that Model Jittering Ensembling can offer a substantial speed-up with respect to multiple-model ensembling. We also compare Model Jittering with various state-of-the-art classifiers in terms of predictive accuracy and computational efficiency.

3.
Generalized additive models (GAMs) are a generalization of generalized linear models (GLMs) and constitute a powerful technique which has successfully proven its ability to capture nonlinear relationships between explanatory variables and a response variable in many domains. In this paper, GAMs are proposed as base classifiers for ensemble learning. Three alternative ensemble strategies for binary classification using GAMs as base classifiers are proposed: (i) GAMbag, based on Bagging; (ii) GAMrsm, based on the Random Subspace Method (RSM); and (iii) GAMens, a combination of both. In an experimental validation performed on 12 data sets from the UCI repository, the proposed algorithms are benchmarked against a single GAM and against decision-tree-based ensemble classifiers (i.e. RSM, Bagging, Random Forest, and the recently proposed Rotation Forest). From the results a number of conclusions can be drawn. First, the use of an ensemble of GAMs instead of a single GAM always leads to improved prediction performance. Second, GAMrsm and GAMens perform comparably, while both versions outperform GAMbag. Finally, the value of using GAMs as base classifiers in an ensemble instead of standard decision trees is demonstrated: GAMbag demonstrates performance comparable to ordinary Bagging, while GAMrsm and GAMens outperform RSM and Bagging, and these two GAM ensemble variations perform comparably to Random Forest and Rotation Forest. Sensitivity analyses are included for the number of member classifiers in the ensemble, the number of variables included in a random feature subspace, and the number of degrees of freedom for GAM spline estimation.
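The GAMrsm variant above can be sketched with scikit-learn alone, using a spline-plus-logistic pipeline as a stand-in for a proper GAM (the paper fits real GAMs; `fit_gam_rsm` and the subspace fraction are our illustrative choices):

```python
# Hedged sketch of the GAMrsm idea: random feature subspaces with an
# additive spline model standing in for a GAM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

def fit_gam_rsm(X, y, n_members=10, subspace=0.5, seed=0):
    rng = np.random.default_rng(seed)
    k = max(1, int(subspace * X.shape[1]))
    members = []
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], size=k, replace=False)
        model = make_pipeline(SplineTransformer(degree=3, n_knots=5),
                              LogisticRegression(max_iter=1000))
        members.append((feats, model.fit(X[:, feats], y)))
    return members

def predict_proba_rsm(members, X):
    # simple soft vote: average the member probabilities
    probs = [m.predict_proba(X[:, f])[:, 1] for f, m in members]
    return np.mean(probs, axis=0)
```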

4.
The amounts and types of remote sensing data have increased rapidly, and classifying these datasets has become more and more overwhelming for a single classifier in practical applications. In this paper, an ensemble algorithm based on Diversity Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) and Rotation Forest is proposed to solve the remote sensing image classification problem. In this ensemble algorithm, RBF neural networks are employed as base classifiers. Furthermore, an interpolation technique that preserves the original distribution is used to resample the input datasets. These resampled datasets are used to construct new classifiers in addition to the initial classifiers built by the Rotation Forest algorithm, and the change in classification error decides whether another new classifier is added. The diversity among the classifiers is thereby enhanced and the classification accuracy improved. The adaptability of the proposed algorithm is verified in experiments on standard datasets and an actual remote sensing dataset.
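The DECORATE-style growth loop the abstract builds on can be sketched as follows; Gaussian perturbation of training points stands in for the paper's distribution-preserving interpolation, and decision trees stand in for the RBF networks:

```python
# Rough sketch of a DECORATE-style growth loop: artificial examples are
# labeled against the current ensemble, and a new member is kept only if
# ensemble training error does not increase. Assumes binary 0/1 labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_predict(members, X):
    votes = np.mean([m.predict(X) for m in members], axis=0)
    return (votes >= 0.5).astype(int)

def decorate_grow(X, y, max_members=10, n_art=50, seed=0):
    rng = np.random.default_rng(seed)
    members = [DecisionTreeClassifier(random_state=0).fit(X, y)]
    err = np.mean(ensemble_predict(members, X) != y)
    while len(members) < max_members:
        # artificial points near the training distribution (stand-in for
        # the paper's distribution-preserving interpolation)
        X_art = (X[rng.integers(len(X), size=n_art)]
                 + rng.normal(scale=0.1, size=(n_art, X.shape[1])))
        y_art = 1 - ensemble_predict(members, X_art)  # oppositional relabeling
        cand = DecisionTreeClassifier(random_state=len(members)).fit(
            np.vstack([X, X_art]), np.concatenate([y, y_art]))
        new_err = np.mean(ensemble_predict(members + [cand], X) != y)
        if new_err <= err:
            members.append(cand)
            err = new_err
        else:
            break
    return members
```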

5.
The ensemble method is a powerful data mining paradigm, which builds a classification model by integrating multiple diversified component learners. Bagging is one of the most successful ensemble methods: it trains classifiers on bootstrap samples and aggregates their outputs into a single classifier. However, in bagging, bootstrapped training sets become more and more similar as redundancy increases. Besides redundancy, any training set is usually subject to noise, and it may also be imbalanced. Thus, each training instance has a different impact on the learning process. This paper explores some properties of the ensemble margin and its use in improving the performance of bagging. We introduce a new approach to measuring the importance of training data in learning, based on margin theory, and then propose a new bagging method that concentrates on critical instances. This method is more accurate than bagging and more robust than boosting. Compared to bagging, it reduces the bias while generally keeping the same variance. Our findings suggest that (a) examples with low margins tend to be more critical for classifier performance; (b) examples with higher margins tend to be more redundant; and (c) misclassified examples with high margins tend to be noisy examples. Our experimental results on 15 data sets show that the generalization error of bagging can be reduced by up to 2.5% and its resilience to noise strengthened by iteratively removing both typical and noisy training instances, reducing the training set size by up to 75%.
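A small sketch of one common ensemble-margin definition (votes for the winning class minus votes for the runner-up, over ensemble size); the paper studies margin variants in more depth, so treat the thresholds below as illustrative:

```python
# Sketch: unsupervised ensemble margin over a bagging ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X, y)

votes = np.array([est.predict(X) for est in bag.estimators_])

def column_margin(col):
    counts = np.bincount(col, minlength=2)
    top2 = np.sort(counts)[-2:]
    return (top2[1] - top2[0]) / len(col)   # winner minus runner-up share

margins = np.apply_along_axis(column_margin, 0, votes)
pred = bag.predict(X)
critical = margins < 0.2                        # low margin: near the boundary
suspect_noise = (pred != y) & (margins > 0.8)   # confidently wrong: likely noise
print(critical.sum(), suspect_noise.sum())
```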

6.
Web spam detection is one of the major challenges for search engines. This paper proposes a genetic-programming-based ensemble learning method (abbreviated GPENL) to detect web spam. The method first draws t different training sets from the original training set via undersampling; it then trains on the t training sets with c different classification algorithms, yielding t*c base classifiers; finally, genetic programming is used to derive the combination scheme for the t*c base classifiers. The new method not only fuses undersampling and ensemble learning to improve classification performance on imbalanced data sets, but also conveniently integrates base classifiers of different types. Experiments on the WEBSPAM-UK2006 dataset show that GPENL improves classification performance for both homogeneous and heterogeneous ensembles, with heterogeneous ensembles being more effective, and that GPENL achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority-voting ensembles, the EDKC algorithm, and the method based on Prediction Spamicity.

7.
To improve the performance of ensemble learning, an ensemble method combining Rotation Forest and MultiBoost is proposed: the rotation transformation idea of Rotation Forest is applied to the original data set, aiming to increase the diversity among classifiers, and MultiBoost is used to train base classifiers on the transformed data set, aiming to improve the accuracy of the base classifiers. Finally, simple majority voting fuses the decisions of the base classifiers into the output of the ensemble classifier. To verify the effectiveness of the method, experiments were carried out on public UCI data sets; the results show that the method achieves high classification accuracy.

8.
Several studies have demonstrated that ensemble classification algorithms, whereby multiple member classifiers are combined into one aggregated and powerful classification model, outperform single models. In this paper, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction. In Rotation Forests, feature extraction is applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost. In an experimental validation based on data sets from four real-life customer churn prediction projects, Rotation Forest and RotBoost are compared to a set of well-known benchmark classifiers. Moreover, variations of Rotation Forest and RotBoost are compared, implementing three alternative feature extraction algorithms: principal component analysis (PCA), independent component analysis (ICA) and sparse random projections (SRP). The performance of the rotation-based ensemble classifiers is found to depend upon: (i) the performance criterion used to measure classification performance, and (ii) the implemented feature extraction algorithm. In terms of accuracy, RotBoost outperforms Rotation Forest, but none of the considered variations offers a clear advantage over the benchmark algorithms. However, in terms of AUC and top-decile lift, the results clearly demonstrate the competitive performance of Rotation Forests compared to the benchmark algorithms. Moreover, ICA-based Rotation Forests outperform all other considered classifiers and are therefore recommended as a well-suited alternative classification technique for the prediction of customer churn that allows for improved marketing decision making.
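All three feature extractors compared above are available in scikit-learn, so a rotation-style member can swap them behind one interface. This sketch applies the extractor globally rather than per feature subset, which simplifies the real algorithm; `make_member` is our helper name:

```python
# Sketch: one base learner per extracted (rotated/projected) feature space.
from sklearn.decomposition import PCA, FastICA
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import SparseRandomProjection
from sklearn.tree import DecisionTreeClassifier

def make_member(kind, seed=0):
    extractor = {
        "PCA": PCA(),
        "ICA": FastICA(max_iter=1000, random_state=seed),
        "SRP": SparseRandomProjection(n_components=10, random_state=seed),
    }[kind]
    return make_pipeline(extractor, DecisionTreeClassifier(random_state=seed))

# e.g. an ICA-based pool: members = [make_member("ICA", s) for s in range(10)]
```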

9.
An ensemble co-training algorithm based on rotation forest (total citations: 1; self-citations: 0; by others: 1)
Ensemble co-training is a semi-supervised learning method that combines ensemble learning with the co-training algorithm; rotation forest is an ensemble learning method that uses feature extraction to build diversity among base classifiers. Building on existing ensemble co-training algorithms, this paper proposes a rotation-forest-based co-training algorithm, ROFCO. The method focuses on using unlabeled data to improve the diversity among base classifiers and the effect of feature extraction, so that the generalization error of the base classifiers stays the same or decreases while their diversity is maintained or even increased, improving the ensemble result. Experimental results show that the method achieves good performance.

10.
A dynamic classifier ensemble selection approach for noise data (total citations: 2; self-citations: 0; by others: 2)
Dynamic classifier ensemble selection (DCES) plays a strategic role in the field of multiple classifier systems. Real data to be classified often include a large amount of noise, so it is important to study the noise immunity of various DCES strategies. This paper introduces the group method of data handling (GMDH) to DCES and proposes a novel dynamic classifier ensemble selection strategy, GDES-AD, which considers both accuracy and diversity in the process of ensemble selection. We experimentally test GDES-AD and six other ensemble strategies on 30 UCI data sets in three cases: data sets without artificial noise, with class noise, and with attribute noise. Statistical analysis shows that GDES-AD has stronger noise immunity than the other strategies. In addition, we find that Random Subspace is more suitable for GDES-AD than Bagging. Further, bias-variance decomposition experiments on the classification errors of the various strategies show that the stronger noise immunity of GDES-AD is mainly due to its ability to better reduce the bias component of the error.

11.
Dynamic weighting ensemble classifiers based on cross-validation (total citations: 1; self-citations: 1; by others: 0)
Ensembles of classifiers constitute one of the main current directions in machine learning and data mining. It is accepted that ensemble methods can be divided into static and dynamic ones. Dynamic ensemble methods explore the use of different classifiers for different samples and therefore may achieve better generalization than static ensemble methods. However, most dynamic approaches based on the KNN rule require an additional portion of the training samples to be set aside for estimating the "local classification performance" of each base classifier. When training samples are scarce, this leads to a less accurate training model and unreliable estimates of the base classifiers' local performance, which in turn hurts the integrated performance. This paper presents a new dynamic ensemble model that introduces cross-validation into the evaluation of local performance and then dynamically assigns a weight to each component classifier. Experimental results on 10 UCI data sets demonstrate that when the training set is not large, the proposed method achieves better performance than several dynamic ensemble methods as well as some classical static ensemble approaches.
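The plumbing of cross-validation-based weighting can be sketched as below; note the paper assigns dynamic, per-sample weights from local performance estimates, while this sketch shows the simpler global-weight version of the same idea (`fit_weighted_ensemble` is our helper name):

```python
# Sketch: member competence from k-fold CV on the full training set
# (nothing withheld), used as a voting weight.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def fit_weighted_ensemble(X, y, members, cv=10):
    weights = np.array([cross_val_score(m, X, y, cv=cv).mean() for m in members])
    weights /= weights.sum()              # normalize into voting weights
    return [m.fit(X, y) for m in members], weights

def predict_weighted(fitted, weights, X):
    probs = sum(w * m.predict_proba(X) for w, m in zip(weights, fitted))
    return probs.argmax(axis=1)

members = [DecisionTreeClassifier(random_state=0), GaussianNB(),
           KNeighborsClassifier()]
# fitted, w = fit_weighted_ensemble(X_train, y_train, members)
```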

12.
方丁  王刚 《计算机系统应用》2012,21(7):177-181,248
With the rapid development of Web 2.0, more and more users are willing to share their opinions and experiences on the Internet. Such review information is growing explosively, and manual methods alone cannot cope with collecting and processing this massive amount of online information, so computer-based text sentiment classification has emerged, with improving classification accuracy as one of its key research goals. Since ensemble learning is an effective way to improve classification accuracy and has shown performance superior to single classifiers in many domains, this paper proposes a text sentiment classification method based on ensemble learning. Experimental results show that the three common ensemble methods, Bagging, Boosting, and Random Subspace, all improve the accuracy of the base classifiers, and that across different base classifiers Random Subspace is statistically superior to Bagging and Boosting. These results further verify the effectiveness of ensemble learning for text sentiment classification.
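A toy sketch of the comparison described above, with Random Subspace expressed as a feature-sampling BaggingClassifier (the tiny synthetic corpus is ours, purely for illustration):

```python
# Sketch: Bagging vs. Boosting vs. Random Subspace on a bag-of-words task.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

docs = ["great product, loved it", "terrible, waste of money",
        "excellent value", "awful experience"] * 25
labels = [1, 0, 1, 0] * 25

X = CountVectorizer().fit_transform(docs)
base = DecisionTreeClassifier()
methods = {
    "Bagging": BaggingClassifier(base, n_estimators=20, random_state=0),
    "Boosting": AdaBoostClassifier(n_estimators=20, random_state=0),
    "RandomSubspace": BaggingClassifier(base, n_estimators=20, bootstrap=False,
                                        max_features=0.5, random_state=0),
}
for name, clf in methods.items():
    print(name, cross_val_score(clf, X, labels, cv=5).mean())
```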

13.
To address the imbalanced classification and "curse of dimensionality" problems in web spam detection, a binary classifier ensemble algorithm based on random forest (RF) and undersampling is proposed. First, undersampling splits the majority class of the training set into multiple subsets, each of which is merged with the minority class to form multiple balanced training subsets; then a random forest classifier is trained on each balanced subset; finally, the random forest classifiers classify the test set, and majority voting determines the final class of each test sample. Experiments on the WEBSPAM UK-2006 dataset show that this ensemble algorithm outperforms the random forest algorithm and its Bagging and AdaBoost ensembles for web spam detection, improving accuracy, F1-measure, and area under the ROC curve (AUC) by at least 14%, 13%, and 11%, respectively. Compared with the results of the winning team of the Web Spam Challenge 2007, the ensemble improves the F1-measure by at least 1% and achieves the best AUC.
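The pipeline described above can be sketched directly: chunk the majority class, balance each chunk with the minority class, train one random forest per chunk, and majority-vote (binary 0/1 labels assumed; helper names are ours):

```python
# Sketch of the undersampling ensemble: one RF per balanced chunk.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_undersampled_rfs(X, y, n_chunks=5, seed=0):
    rng = np.random.default_rng(seed)
    maj, mino = (0, 1) if (y == 0).sum() > (y == 1).sum() else (1, 0)
    maj_idx = rng.permutation(np.flatnonzero(y == maj))
    min_idx = np.flatnonzero(y == mino)
    forests = []
    for chunk in np.array_split(maj_idx, n_chunks):
        idx = np.concatenate([chunk, min_idx])    # balanced training subset
        forests.append(RandomForestClassifier(random_state=seed).fit(X[idx], y[idx]))
    return forests

def vote(forests, X):
    preds = np.array([f.predict(X) for f in forests])
    return (preds.mean(axis=0) >= 0.5).astype(int)   # majority vote
```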

14.
The Rotation Forest classifier is a successful ensemble method for a wide variety of data mining applications. However, the way in which Rotation Forest transforms the feature space through PCA, although powerful, penalizes training and prediction times, making it infeasible for Big Data. In this paper, a MapReduce Rotation Forest and its implementation under the Spark framework are presented. The proposed MapReduce Rotation Forest behaves in the same way as the standard Rotation Forest, training the base classifiers on a rotated space, but using a functional implementation of the rotation that enables its execution in Big Data frameworks. Experimental results are obtained using different cloud-based cluster configurations. Bayesian tests are used to validate the method against two ensembles for Big Data: Random Forest and PCARDE classifiers. Our proposal incorporates the parallelization of both the PCA calculation and the tree training, providing a scalable solution that retains the performance of the original Rotation Forest and achieves a competitive execution time (on average more than 3 times faster at training than other PCA-based alternatives). In addition, extensive experimentation shows that by setting some parameters of the classifier (i.e., bootstrap sample size, number of trees, and number of rotations), the execution time is reduced with no significant loss of performance using a small ensemble.
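A toy sketch of the parallelization idea only, not the paper's implementation: broadcast the training data and build each rotated tree in its own Spark task (the single global PCA rotation here simplifies the per-subset rotation of the real method):

```python
# Toy sketch: train rotated trees in parallel Spark tasks.
import numpy as np
from pyspark.sql import SparkSession
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

spark = SparkSession.builder.appName("rotation-sketch").getOrCreate()
sc = spark.sparkContext

X = np.random.default_rng(0).normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)
data_bc = sc.broadcast((X, y))            # ship the data to every executor

def train_one(seed):
    X, y = data_bc.value
    rng = np.random.default_rng(seed)
    boot = rng.integers(len(X), size=len(X))   # bootstrap sample
    R = PCA().fit(X[boot]).components_.T       # one global rotation (simplified)
    tree = DecisionTreeClassifier(random_state=seed).fit(X[boot] @ R, y[boot])
    return tree, R

models = sc.parallelize(range(20), numSlices=4).map(train_one).collect()
spark.stop()
```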

15.
蔡铁  伍星  李烨 《计算机应用》2008,28(8):2091-2093
To construct diverse base classifiers for ensemble learning, a base-classifier construction method based on data discretization is proposed and applied to support vector machine ensembles. The method uses rough sets and the Boolean reasoning discretization algorithm to process the training set; it effectively removes irrelevant and redundant attributes and improves the accuracy and diversity of the base classifiers. Experimental results show that the method achieves better performance than the traditional ensemble learning algorithms Bagging and AdaBoost.
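A hedged sketch of the diversification idea, with scikit-learn's KBinsDiscretizer standing in for the rough-set/Boolean-reasoning discretization used in the paper: each member sees a differently discretized view of the data.

```python
# Sketch: discretize the inputs differently per member to induce diversity.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

def make_svm_members(bin_counts=(3, 5, 7, 9)):
    # probability=True so the members can be soft-voted downstream
    return [make_pipeline(KBinsDiscretizer(n_bins=b, encode="onehot-dense"),
                          SVC(probability=True))
            for b in bin_counts]
```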

16.
In text classification research, ensemble learning is an effective way to improve classifier performance, and Bagging is a popular ensemble learning algorithm. To address the problem that the weak classifiers in Bagging all carry equal weights, an improved Bagging algorithm is proposed that computes voting weights from the credibility of each weak classifier's predictions. Applying this to Attribute Bagging, an automatic Chinese text classifier was designed, using kNN as the base model for the weak classifiers, and evaluated on the news corpus provided by Sogou Labs. Experiments show that the algorithm achieves better classification accuracy than Attribute Bagging.

17.
Learning from imperfect (noisy) information sources is a challenging, real-world issue for many data mining applications. Common practices include data quality enhancement by applying data preprocessing techniques, or employing robust learning algorithms to avoid developing overly complicated structures that overfit the noise. The essential goal is to reduce noise impact and eventually enhance the learners built from noise-corrupted data. In this paper, we propose a novel corrective classification (C2) design, which incorporates data cleansing, error correction, Bootstrap sampling and classifier ensembling for effective learning from noisy data sources. C2 differs from existing classifier ensembling or robust learning algorithms in two aspects. On one hand, a set of diverse base learners constituting the C2 ensemble is constructed via a Bootstrap sampling process; on the other hand, C2 further improves each base learner by unifying error detection, correction and data cleansing to reduce noise impact. Being corrective, the classifier ensemble is built from data preprocessed and corrected by the data cleansing and correction modules. Experimental comparisons demonstrate that C2 is not only more accurate than the learner built from the original noisy sources, but also more reliable than Bagging [4] or the aggressive classifier ensemble (ACE) [56], which are two degenerated components/variants of C2. The comparisons also indicate that C2 is more stable than Boosting and DECORATE, which are two state-of-the-art ensembling methods. For real-world imperfect information sources (i.e. noisy training and/or test data), C2 is able to deliver more accurate and reliable prediction models than its peers.

18.
In traditional ensemble classification algorithms, the ensemble size is usually set to a fixed value, which can lead to lower classification accuracy. To address this problem, an accuracy-climbing ensemble classification algorithm (C-ECA) is proposed. First, instead of replacing the same number of worst-performing base classifiers with new ones, C-ECA updates the base classifiers based on accuracy and then determines the optimal ensemble size. Second, building on C-ECA, a climbing-based dynamic weighted ensemble classification algorithm (C-DWECA) is proposed. It introduces a weighting function that obtains the best weight for each base classifier when training on data streams with different characteristics, thereby improving the performance of the ensemble classifier. Finally, the Fast Hoeffding Drift Detection Method (FHDDM) is adopted to detect concept drift earlier and improve the final accuracy. Experimental results show that C-DWECA achieves accuracy of up to 97.44%; its average accuracy is about 40% higher than that of the Adaptable Diversity-based Online Boosting (ADOB) algorithm, and it also outperforms other compared algorithms such as Leveraging Bagging (LevBag) and Adaptive Random Forest (ARF).

19.
In ensemble learning, using averaging or voting as the combination strategy cannot fully exploit the information in the base classifiers, and setting base-classifier weights according to volatility is imprecise and inappropriate. These problems degrade the effectiveness of ensemble learning. To further improve ensemble performance, this paper proposes using the evidence reasoning (ER) rule as the combination strategy and setting the base-classifier weights with a diversity-based weighting method. First, the basic structure of the ensemble is built with several deep learning models as base classifiers and the ER rule as the combination strategy; then the diversity of each base classifier relative to the other base classifiers is computed with a diversity measure; finally, the diversities are normalized to obtain the base-classifier weights. Classification experiments on several image datasets show that the proposed method is more accurate and more stable than the other compared methods, demonstrating that it fully exploits the information in the base classifiers and that the diversity-based weighting is more precise.
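The diversity-weighting step can be sketched on its own (mean pairwise disagreement as the diversity measure, normalized into weights); the ER-rule fusion itself is not reproduced here, and `diversity_weights` is our helper name:

```python
# Sketch: normalize each member's mean pairwise disagreement into a weight.
import numpy as np

def diversity_weights(all_preds):
    """all_preds: (n_members, n_samples) array of hard predictions."""
    P = np.asarray(all_preds)
    n = len(P)
    disagree = np.array([
        np.mean([np.mean(P[i] != P[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    return disagree / disagree.sum()    # normalized base-classifier weights
```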

20.
Compared with full ensemble learning, ensemble pruning searches for the optimal subset among multiple classifiers, improving the generalization performance of the classifier and simplifying the ensemble process. Pareto ensemble pruning considers both classifier accuracy and ensemble size, treating both as optimization objectives. However, the Pareto ensemble pruning algorithm only considers the accuracy of the base classifiers and the ensemble size, ignoring the diversity among classifiers, which leads to high similarity between them. This paper proposes a Pareto ensemble pruning algorithm that incorporates diversity: it combines classifier diversity and accuracy into the first optimization objective and uses ensemble size as the second, achieving multi-objective optimization. Experiments show that, at comparable ensemble sizes, the improved algorithm achieves better performance than Pareto ensemble pruning thanks to the incorporation of diversity.
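The bi-objective comparison at the heart of the method can be sketched as a dominance test; the equal-weight combination of accuracy and diversity below is a placeholder for the paper's actual first objective:

```python
# Sketch: bi-objective comparison for Pareto pruning. Objective 1 combines
# accuracy and diversity (equal-weight sum as a placeholder); objective 2
# is ensemble size (smaller is better).
import numpy as np

def objectives(mask, acc, div, alpha=0.5):
    """mask: boolean selection over base classifiers; acc, div: per-member
    accuracy and diversity scores."""
    score = alpha * acc[mask].mean() + (1 - alpha) * div[mask].mean()
    return score, mask.sum()

def dominates(a, b):
    """a, b are (score, size) pairs; maximize score, minimize size."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

# the Pareto front keeps every candidate mask not dominated by another
```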
