Similar Documents
20 similar documents found.
1.
Ensemble methods have proven to be highly effective in improving the performance of base learners under most circumstances. In this paper, we propose a new algorithm that combines the merits of some existing techniques, namely bagging, arcing, and stacking. The basic structure of the algorithm resembles bagging, but the misclassification cost of each training point is repeatedly adjusted according to its observed out-of-bag vote margin. In this way, the method gains the advantage of arcing (building the classifiers the ensemble needs) without fixating on potentially noisy points. Computational experiments show that this algorithm performs consistently better than bagging and arcing with linear and nonlinear base classifiers. In view of the characteristics of the proposed algorithm, bacing, a hybrid ensemble learning strategy, which combines bagging with different versions of bacing, is proposed and studied empirically.
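A minimal sketch of the core loop described above, assuming a simple linear margin-to-cost rule with a bounded update and a decision-tree base learner; the abstract does not specify the exact update, so the `eta` parameter and the clipping are illustrative, not the authors' method:

```python
# Illustrative sketch, not the authors' code: bagging in which each point's
# misclassification cost is repeatedly adjusted from its out-of-bag vote margin.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bacing_like(X, y, n_rounds=50, eta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    cost = np.ones(n)                       # per-point misclassification cost
    votes = np.zeros((n, 2))                # OOB votes: column 0 wrong, 1 right
    models = []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True)        # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)            # out-of-bag points
        clf = DecisionTreeClassifier().fit(X[idx], y[idx], sample_weight=cost[idx])
        models.append(clf)
        pred = clf.predict(X[oob])
        votes[oob, (pred == y[oob]).astype(int)] += 1
        seen = votes.sum(axis=1) > 0
        margin = np.zeros(n)                # OOB vote margin in [-1, 1]
        margin[seen] = (votes[seen, 1] - votes[seen, 0]) / votes[seen].sum(axis=1)
        # low margin raises cost; clipping keeps noisy points from dominating
        cost = np.clip(1.0 + eta * (1.0 - margin), 0.1, 10.0)
    return models

def majority_vote(models, X):               # assumes integer class labels
    P = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, P)
```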

2.
Ensemble learning is one of the most widely used approaches to the class-imbalance problem. In conventional ensemble learning, the base classifiers are trained on an imbalanced training set; although researchers have examined resampling strategies for balancing the training set, there has been no way to select the resampling method or base classifier best suited to a given training set. A multi-armed bandit heterogeneous ensemble framework is developed as a solution to these issues. The framework uses the multi-armed bandit technique to pick the best combination of base classifier and resampling technique and thereby build a heterogeneous ensemble model. We first employ bagging to obtain training sets, then use the instances in each out-of-bag set as a validation set. The base classifier and resampler combination with the highest validation score is taken as the best model on that bagging subset and added to the model pool. The classification performance of the multi-armed bandit heterogeneous ensemble model is then assessed on 30 real-world imbalanced data sets gathered from UCI, KEEL, and HDDT. The experimental results demonstrate that, under the two assessment metrics of AUC and Kappa, the proposed heterogeneous ensemble model performs competitively with nine other state-of-the-art ensemble learning methods; these findings are confirmed by the Friedman test and Holm's post-hoc test.
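A sketch of the selection step under stated assumptions: each arm is a (resampler, base classifier) pair, arms are scored by AUC on the out-of-bag set, and UCB1 serves as the bandit policy. The arm set, the UCB1 choice, and the naive resamplers are stand-ins for whatever the paper actually uses; binary {0,1} labels with both classes present in each out-of-bag set are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def undersample(X, y, rng):                 # naive majority under-sampling
    classes, counts = np.unique(y, return_counts=True)
    keep = np.hstack([rng.choice(np.where(y == c)[0], counts.min(), replace=False)
                      for c in classes])
    return X[keep], y[keep]

def oversample(X, y, rng):                  # naive minority over-sampling
    classes, counts = np.unique(y, return_counts=True)
    keep = np.hstack([rng.choice(np.where(y == c)[0], counts.max(), replace=True)
                      for c in classes])
    return X[keep], y[keep]

ARMS = [(r, C) for r in (undersample, oversample)
               for C in (DecisionTreeClassifier, LogisticRegression)]

def bandit_ensemble(X, y, n_bags=30, seed=0):
    rng = np.random.default_rng(seed)
    pulls, rewards, pool = np.zeros(len(ARMS)), np.zeros(len(ARMS)), []
    for t in range(1, n_bags + 1):
        idx = rng.choice(len(y), len(y), replace=True)   # bagging subset
        oob = np.setdiff1d(np.arange(len(y)), idx)       # validation set
        if t <= len(ARMS):                  # UCB1: try each arm once,
            a = t - 1                       # then mean reward + exploration bonus
        else:
            a = int(np.argmax(rewards / pulls + np.sqrt(2 * np.log(t) / pulls)))
        resample, Clf = ARMS[a]
        Xb, yb = resample(X[idx], y[idx], rng)
        clf = Clf().fit(Xb, yb)
        auc = roc_auc_score(y[oob], clf.predict_proba(X[oob])[:, 1])
        pulls[a] += 1; rewards[a] += auc
        pool.append(clf)                    # chosen-arm model joins the pool
    return pool
```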

3.
Theory and experiments show that ensemble classifiers with a larger margin distribution on the training set generalize better. This paper introduces the margin concept into ensemble pruning and uses it to guide the design of pruning methods. On this basis, a metric (MBM) is constructed to evaluate the importance of a base classifier relative to the ensemble, and a greedy ensemble selection method (MBMEP) is proposed to reduce the size of the ensemble while improving its classification accuracy. Experiments on 30 randomly selected UCI data sets show that, compared with other state-of-the-art greedy ensemble selection algorithms, the sub-ensembles selected by MBMEP generalize better.
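The abstract does not define MBM itself, so the sketch below substitutes a generic gain in mean vote margin for it; only the greedy forward-selection skeleton should be read as faithful to MBMEP.

```python
import numpy as np

def mean_margin(right, wrong):              # mean vote margin of a sub-ensemble
    tot = np.maximum(right + wrong, 1)
    return np.mean((right - wrong) / tot)

def greedy_margin_pruning(preds, y, k):
    """preds: (n_classifiers, n_samples) predictions on a pruning set.
    Greedily grow a size-k sub-ensemble, each step adding the base
    classifier that most improves mean vote margin (a stand-in for MBM)."""
    right = np.zeros(preds.shape[1]); wrong = np.zeros(preds.shape[1])
    chosen, remaining = [], set(range(preds.shape[0]))
    for _ in range(k):
        gains = {j: mean_margin(right + (preds[j] == y), wrong + (preds[j] != y))
                 for j in remaining}
        best = max(gains, key=gains.get)
        chosen.append(best); remaining.discard(best)
        right += (preds[best] == y); wrong += (preds[best] != y)
    return chosen
```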

4.
In this paper we present a new approach to boosting for the construction of classifier ensembles. The approach uses the distribution given by the weighting scheme of boosting to construct a non-linear supervised projection of the original variables, instead of using the instance weights to train the next classifier. With this method we construct ensembles that achieve a better generalization error and are more robust to the presence of noise. It has been proved that AdaBoost improves the margins of the instances achieved by the ensemble, and its practical success has been partially explained by this margin-maximization property. However, in noisy problems, which are likely to occur in real-world applications, maximizing the margins of mislabeled instances or outliers can lead to poor generalization. We propose an alternative approach in which the weight distribution produced by the boosting algorithm is used to obtain a supervised projection, and the next classifier is then trained on the projected data using a uniform distribution over the training instances. The proposed approach is compared with three boosting techniques, namely AdaBoost, GentleBoost, and MadaBoost, showing improved performance on a large set of 55 problems from the UCI Machine Learning Repository and less sensitivity to noise in the class labels. The behavior of the proposed algorithm in terms of margin distribution and bias-variance decomposition is also studied.
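A rough one-round sketch with two loudly flagged substitutions: linear discriminant analysis fitted on a resample drawn from the boosting distribution stands in for the paper's non-linear supervised projection, and a generic exponential reweighting stands in for the specific boosting variant.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

def projection_boost_round(X, y, w, seed=0):
    """One round: the boosting distribution w selects the hard instances that
    shape a supervised projection; the next classifier is then trained on the
    projected data with a uniform distribution, as the abstract describes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
    proj = LinearDiscriminantAnalysis().fit(X[idx], y[idx])  # linear stand-in
    Z = proj.transform(X)
    clf = DecisionTreeClassifier(max_depth=3).fit(Z, y)      # uniform weights
    miss = clf.predict(Z) != y
    w = w * np.exp(miss.astype(float))       # AdaBoost-flavoured update
    return proj, clf, w / w.sum()

# usage: w = np.ones(len(y)) / len(y); then call once per boosting round
```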

5.
In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems in classifying such data: overfitting, noisy instances, and class imbalance. Rules are a well-known, human-interpretable way of representing data. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without global optimisation: the random subspace approach counters overfitting, boosting handles noisy instances, and the ensemble of decision trees deals with class imbalance. The classifier uses two popular classification techniques, decision trees and the k-nearest-neighbor algorithm. Decision trees are used to evolve classification rules from the training data, while k-nearest-neighbor is used to analyse the misclassified instances and remove ambiguity between contradictory rules. The classifier runs a series of k iterations to develop a set of classification rules from the training data, paying more attention to the misclassified instances in the next iteration, which gives it a boosting flavour. The paper particularly focuses on producing an optimal ensemble classifier that improves the prediction accuracy of DNA variant identification and classification. The performance of the proposed classifier is compared with well-established machine learning and data mining algorithms on genomic data (148 exome data sets) of Brugada syndrome and on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier achieves exemplary classification accuracy on different types of biological data. Overall, it offers good prediction accuracy for classifying new DNA variants, where noisy and misclassified variants are handled so as to increase test performance.
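One mechanical piece of such a pipeline, turning a decision tree trained on a random feature subspace into human-readable IF-THEN rules, can be sketched as below; the boosting iterations and the k-nearest-neighbor arbitration between contradictory rules are omitted, and the toy data are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_rules(clf, feat_names):
    """Walk a fitted sklearn tree and emit one IF-THEN rule per leaf."""
    t = clf.tree_
    rules = []
    def walk(node, conds):
        if t.children_left[node] == -1:                    # leaf node
            cls = int(np.argmax(t.value[node]))
            rules.append("IF " + " AND ".join(conds or ["TRUE"])
                         + f" THEN class={clf.classes_[cls]}")
            return
        name, thr = feat_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node],  conds + [f"{name} <= {thr:.3f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.3f}"])
    walk(0, [])
    return rules

# rules from one random subspace (toy data, illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)); y = (X[:, 2] + X[:, 7] > 0).astype(int)
subspace = rng.choice(10, size=4, replace=False)           # random subspace
clf = DecisionTreeClassifier(max_depth=3).fit(X[:, subspace], y)
print("\n".join(tree_rules(clf, [f"x{j}" for j in subspace])))
```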

6.
This paper presents the cluster-based ensemble classifier, an approach to generating an ensemble of classifiers using multiple clusters within the classified data. Clustering is used to partition the data set into multiple clusters of highly correlated data that are difficult to separate otherwise, and different base classifiers learn the class boundaries within the clusters. As the base classifiers engage different difficult-to-classify subsets of the data, their learning is more focused and accurate. A selection rather than fusion approach yields the final verdict on patterns of unknown class. The impact of clustering on the learning parameters and accuracy of a number of learning algorithms, including neural networks, support vector machines, decision trees, and the k-NN classifier, is investigated. A number of benchmark data sets from the UCI machine learning repository are used to evaluate the cluster-based ensemble classifier, and the experimental results demonstrate its superiority over bagging and boosting.
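A minimal sketch of the cluster-then-select scheme, assuming k-means as the clusterer and an arbitrary mix of base learners; each test point is routed to the classifier of its own cluster (selection rather than fusion).

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

class ClusterEnsemble:
    def __init__(self, n_clusters=3):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.base = [SVC(), DecisionTreeClassifier(), SVC(kernel="linear")]

    def fit(self, X, y):    # sketch assumes every cluster holds >= 2 classes
        lab = self.km.fit_predict(X)
        self.models = {}
        for c in range(self.km.n_clusters):
            m = clone(self.base[c % len(self.base)])
            # each base learner focuses on one hard-to-separate cluster
            self.models[c] = m.fit(X[lab == c], y[lab == c])
        return self

    def predict(self, X):
        lab = self.km.predict(X)    # selection, not fusion: route to one model
        out = np.empty(len(X), dtype=int)
        for c, m in self.models.items():
            if (lab == c).any():
                out[lab == c] = m.predict(X[lab == c])
        return out
```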

7.
Rotation Forest, an effective ensemble classifier generation technique, uses principal component analysis (PCA) to rotate the original feature axes so that different training sets can be formed for learning the base classifiers. This paper presents a variant of Rotation Forest that can be viewed as a combination of Bagging and Rotation Forest: Bagging is used to inject more randomness into Rotation Forest and so increase the diversity among ensemble members. Experiments conducted on 33 benchmark classification data sets from the UCI repository, with a classification tree as the base learning algorithm, demonstrate that the proposed method generally produces ensemble classifiers with lower error than Bagging, AdaBoost, and Rotation Forest. A bias-variance analysis of the error shows that the proposed method improves on the prediction error of a single classifier by reducing the variance term much more than the other ensemble procedures considered. Furthermore, results on data sets with artificial classification noise indicate that the new method is more robust to noise, and kappa-error diagrams are employed to investigate the diversity-accuracy patterns of the ensemble classifiers.
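A compressed sketch of one ensemble member under stated simplifications: a bootstrap sample supplies the Bagging randomness, the features are split into groups with a PCA fitted per group to form the rotation, and a tree is trained in the rotated space. Rotation Forest's per-group class and sample subsampling is omitted here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotate(X, groups, pcas):                # apply the blockwise PCA rotation
    return np.hstack([p.transform(X[:, g]) for g, p in zip(groups, pcas)])

def fit_member(X, y, n_groups=3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    boot = rng.choice(n, n, replace=True)                   # Bagging step
    groups = np.array_split(rng.permutation(d), n_groups)   # feature groups
    pcas = [PCA().fit(X[boot][:, g]) for g in groups]       # rotation per group
    tree = DecisionTreeClassifier().fit(rotate(X[boot], groups, pcas), y[boot])
    return tree, groups, pcas

def predict_ensemble(members, X):           # assumes integer class labels
    votes = np.stack([t.predict(rotate(X, g, p)) for t, g, p in members])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

# usage: members = [fit_member(X, y, seed=s) for s in range(10)]
```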

8.
A theoretical analysis of bagging as a linear combination of classifiers
We apply an analytical framework for linearly combined classifiers to ensembles generated by bagging. This provides an analytical model of the bagging misclassification probability as a function of ensemble size, which is a novel result in the literature. Experimental results on real data sets confirm the theoretical predictions and allow us to derive a novel, theoretically grounded guideline for choosing the bagging ensemble size. Furthermore, our results are consistent with explanations of bagging in terms of classifier instability and variance reduction, support the optimality of the simple average over the weighted average as a combining rule for ensembles generated by bagging, and apply to other randomization-based methods for constructing classifier ensembles. Although our results do not allow us to compare the bagging misclassification probability with that of an individual classifier trained on the original training set, we discuss how the considered theoretical framework could be exploited to this aim.
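The paper's model is analytical; as a quick empirical counterpart one can trace the test error of bagging against ensemble size, e.g. with the sketch below (the data set and base learner are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

# misclassification-probability estimate as a function of ensemble size
for n in (1, 5, 10, 25, 50, 100):
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n,
                            random_state=0).fit(Xtr, ytr)
    print(f"n_estimators={n:3d}  test error={1 - bag.score(Xte, yte):.4f}")
```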

9.
Because high-dimensional data usually contain redundancy and noise, covering models built directly on such data cannot adequately capture the data distribution, which degrades classifier performance. To address this, a multi-tree ensemble classification method based on pruned random subspaces is proposed. The method first generates multiple random subspaces and builds an independent minimum-spanning-tree covering model in each subspace. The one-class classifiers built in the subspaces are then pruned according to an evaluation criterion (the AUC value). Finally, the retained classifiers are fused into a single ensemble classifier by mean combination. Experimental results show that, compared with other direct covering classification models and the bagging algorithm, the multi-tree ensemble covering classifier achieves higher classification accuracy.

10.
We present attribute bagging (AB), a technique for improving the accuracy and stability of classifier ensembles induced using random subsets of features. AB is a wrapper method that can be used with any learning algorithm. It establishes an appropriate attribute-subset size and then randomly selects subsets of features, creating projections of the training set on which the ensemble classifiers are built; the induced classifiers are then used for voting. This article compares the performance of AB with bagging and other algorithms on a hand-pose recognition data set. AB gives consistently better results than bagging, in both accuracy and stability. The performance of ensemble voting in bagging and in AB as a function of the attribute-subset size and the number of voters, for both weighted and unweighted voting, is tested and discussed. We also demonstrate that ranking the attribute subsets by their classification accuracy and voting using only the best subsets further improves the performance of the ensemble.
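A minimal sketch, with the attribute-subset size fixed by hand rather than established by the wrapper search the article describes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def attribute_bagging(X, y, n_voters=25, subset_size=5, seed=0):
    """One classifier per random feature subset; rows are never resampled."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_voters):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        members.append((feats, DecisionTreeClassifier().fit(X[:, feats], y)))
    return members

def ab_predict(members, X):     # unweighted voting; integer labels assumed
    votes = np.stack([m.predict(X[:, f]) for f, m in members])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```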

11.
To remove redundant individuals in ensemble learning, a classifier ensemble algorithm that selects individuals via subgraphs is proposed. A batch of classifiers is trained, and a weighted complete undirected graph is constructed from the individual classifiers and the pairwise diversity between them; a subgraph method then selects a subset of highly diverse individuals to participate in the ensemble. Using support vector machines as base learners, experiments are conducted on multiple classification data sets and the method is compared with the common ensemble methods Bagging and AdaBoost; the results show that it achieves better ensemble performance.
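The abstract does not spell out the subgraph criterion; the sketch below builds the weighted complete graph from pairwise disagreement and extracts a heavy subgraph greedily, which is one plausible instantiation rather than the paper's exact algorithm.

```python
import numpy as np

def disagreement_graph(preds):
    """preds: (n_classifiers, n_samples) predictions on a validation set.
    Edge weight = fraction of samples on which two members disagree."""
    n = preds.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = np.mean(preds[i] != preds[j])
    return W

def select_diverse(W, k):
    """Greedy heavy-subgraph heuristic: start from the heaviest edge, then
    repeatedly add the node with maximal total weight to the chosen set."""
    i, j = np.unravel_index(np.argmax(W), W.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        rest = [v for v in range(len(W)) if v not in chosen]
        chosen.append(max(rest, key=lambda v: W[v, chosen].sum()))
    return chosen
```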

12.
Ensembles of classifiers trained on different parts of the input space generally give good results. AdaBoost, a popular boosting technique, is an iterative, gradient-based deterministic method for this purpose that minimizes an exponential loss function, whereas bagging is a random-search-based ensemble creation technique in which the training set of each classifier is selected arbitrarily. In this paper, a genetic-algorithm-based ensemble creation approach is proposed in which both the resampled training sets and the classifier prototypes evolve so as to maximize the combined accuracy. The resulting objective-function-based random search, guided by both ensemble accuracy and diversity, can be considered to share the basic properties of bagging and boosting. Experimental results show that the proposed approach provides better combined accuracies using fewer classifiers than AdaBoost.

13.
Boosting Algorithms for Parallel and Distributed Learning
The growing amount of available information and its distributed and heterogeneous nature have a major impact on the field of data mining. In this paper, we propose a framework of parallel and distributed boosting algorithms for efficiently integrating specialized classifiers learned over very large, distributed, and possibly heterogeneous databases that cannot fit into main memory. Boosting is a popular technique for constructing highly accurate classifier ensembles in which the classifiers are trained serially, with the weights on the training instances set adaptively according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared-memory systems with a small number of processors, with the objective of achieving maximal prediction accuracy in fewer iterations than boosting on a single processor; after all processors learn classifiers in parallel at each boosting round, the classifiers are combined according to the confidence of their predictions. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged, although it can also be used for parallel learning in which a massive data set is partitioned into disjoint subsets for more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site; the final classifier is constructed as an ensemble of all the classifier ensembles built on the disjoint data sets. Applied to several data sets, the proposed methods show that parallel boosting can achieve the same or even better prediction accuracy considerably faster than standard sequential boosting. The experiments also indicate that distributed boosting achieves comparable or slightly improved classification accuracy over standard boosting while requiring much less memory and computation time, since it uses smaller data sets.

14.
Yuan Weiwei, Guan Donghai, Zhu Qi, Ma Tinghuai. Neural Computing & Applications, 2018, 29(10): 673-683

Mislabeled training data, a kind of noise, exist in many applications. Because of their negative effects on learning, many filter techniques have been proposed to identify and eliminate them. The ensemble-learning-based filter (EnFilter), which employs ensemble classifiers, is the most widely used. In EnFilter, the noisy training data set is first divided into several subsets; each noisy subset is then checked by multiple classifiers trained on the other noisy subsets. Since the data used to train these classifiers are themselves noisy, the quality of the classifiers cannot be guaranteed, which may yield poor noise identification; the problem is more serious when the noise ratio in the training data set is high. To solve this problem, a straightforward but effective approach is proposed in this work: instead of using noisy data to train the classifiers, nearly noise-free (NNF) data are used, since they should produce more reliable classifiers. To this end, a novel NNF data extraction approach is also proposed. Experimental results on a set of benchmark data sets illustrate the utility of the proposed approach.
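The unanimity rule used below for extracting nearly noise-free data (keep only points that every cross-trained learner labels correctly) is an assumed stand-in for the paper's extraction approach; the surrounding cross-partition voting follows the EnFilter description.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

LEARNERS = (DecisionTreeClassifier, GaussianNB, KNeighborsClassifier)

def extract_nnf(X, y, n_splits=5):
    """Nearly-noise-free subset: instances that every cross-trained learner
    labels correctly (a unanimity rule, assumed here for illustration)."""
    ok = np.ones(len(y), dtype=bool)
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
        for L in LEARNERS:
            ok[te] &= (L().fit(X[tr], y[tr]).predict(X[te]) == y[te])
    return np.where(ok)[0]

def filter_noise(X, y, n_splits=5):
    """EnFilter-style pass, but with the voters trained on NNF data only."""
    nnf = extract_nnf(X, y, n_splits)
    votes = np.zeros(len(y))
    for L in LEARNERS:
        votes += (L().fit(X[nnf], y[nnf]).predict(X) != y)
    return np.where(votes > len(LEARNERS) / 2)[0]   # majority says mislabeled
```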


15.
A new method, PCARules, is proposed for building ensemble classifiers from rule-based base classifiers. Although the new method also decides the class of an unseen sample by weighted voting over base-classifier predictions, the way training sets are created for the base classifiers is completely different from bagging and boosting. Instead of creating data sets by sampling, the method randomly partitions the features into K subsets, applies PCA to obtain the principal components of each subset so as to form a new feature space, and maps all training data into that space to serve as the training set for a base classifier. Experiments on 30 randomly selected data sets from the UCI machine learning repository show that the algorithm not only significantly improves the performance of rule-based classification methods, but also achieves higher classification accuracy than traditional ensemble methods such as bagging and boosting on most of the data sets.

16.
In the class-imbalanced learning scenario, traditional machine learning algorithms that optimize overall accuracy tend to perform poorly, especially on the minority class, which is usually the class of interest. Many effective approaches have been proposed to solve this problem. Among them, bagging ensemble methods integrated with under-sampling techniques have demonstrated better performance than alternatives such as bagging integrated with over-sampling or cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measure to ensure the individual classification performance, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority-class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, designing a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag generates a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
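A sketch of one bag's selection step under heavy simplification: random search over balanced majority subsets replaces the evolutionary search, and the fitness (g-mean plus disagreement with the current pool) is a placeholder, since the paper's three factors are not spelled out in the abstract.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

def gmean(y_true, y_pred):      # geometric mean of per-class recalls
    return float(np.sqrt(np.prod(recall_score(y_true, y_pred, average=None))))

def eus_bag_round(X, y, Xv, yv, pool_preds, n_cand=20, seed=0):
    """Pick the best under-sampled majority subset for one bag; (Xv, yv) is a
    validation split and pool_preds holds earlier members' predictions on it."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    best, best_fit = None, -np.inf
    for _ in range(n_cand):     # random search standing in for EUS
        sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub])
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        pv = clf.predict(Xv)
        div = np.mean([np.mean(pv != q) for q in pool_preds]) if pool_preds else 0.0
        fit = gmean(yv, pv) + 0.5 * div     # placeholder fitness function
        if fit > best_fit:
            best, best_fit = (clf, pv), fit
    return best
```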

17.
Research on the Applications of Out-of-bag Samples
Zhang Chunxia, Guo Gao. Software (《软件》), 2011, (3): 1-4
Bagging ensembles greatly reduce the classification error of "weak" learning algorithms by combining unstable base classifiers, and out-of-bag samples are a natural by-product of bagging. Out-of-bag samples are now widely used to estimate the generalization error of bagging ensembles, to construct related ensemble classifiers, and for other purposes. This article surveys the applications of out-of-bag samples, describes the main research topics and their characteristics, and discusses possible future research directions.
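As a concrete example of the first application, scikit-learn's bagging implementation exposes the out-of-bag estimate directly (the data set and base learner here are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        oob_score=True, random_state=0).fit(X, y)
# each tree is scored only on the ~37% of points left out of its bootstrap,
# giving a generalization-error estimate without a separate test set
print(f"OOB accuracy estimate: {bag.oob_score_:.4f}")
```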

18.
In machine learning, class noise occurs frequently and degrades the classifier derived from the noisy data set. This paper presents two promising classifiers for this problem based on a probabilistic model proposed by Lawrence and Schölkopf (2001). The proposed algorithms are able to tolerate class noise and extend the earlier work of Lawrence and Schölkopf in two ways: first, we present a novel incorporation of their probabilistic noise model in the kernel Fisher discriminant; second, the distribution assumption previously made is relaxed in our work. The methods were investigated on simulated noisy data sets and a real-world comparative genomic hybridization (CGH) data set. The results show that the proposed approaches substantially improve standard classifiers on noisy data sets and achieve larger performance gains on non-Gaussian data sets and small data sets.

19.
Tan Qiaoyu, Yu Guoxian, Wang Jun, Guo Maozu. Journal of Software (《软件学报》), 2017, 28(11): 2851-2864
Weak-label learning is an important branch of multi-label learning; in recent years it has been widely studied and applied to problems such as completing and predicting the missing labels of multi-label samples. However, for high-dimensional data, which have large feature sets and are more likely to carry multiple semantic labels and suffer missing labels, existing weak-label learning methods are generally vulnerable to the noise and redundant features such data contain. To classify high-dimensional multi-label data accurately, a weak-label ensemble classification method based on maximizing the dependence between labels and features, EnWL, is proposed. EnWL first applies affinity propagation clustering repeatedly in the feature space of the high-dimensional data, each time selecting the cluster centers to form a representative feature subset and thereby reducing the interference of noise and redundant features; it then trains a semi-supervised multi-label classifier based on label-feature dependence maximization on each feature subset; finally, these classifiers are combined by voting to perform multi-label classification. Experimental results on a variety of high-dimensional data sets show that EnWL outperforms existing related methods on multiple evaluation metrics.
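The feature-subset construction step can be sketched with scikit-learn's AffinityPropagation applied to the transposed data matrix, so that features are clustered and the exemplars form the representative subset. Varying the preference parameter to obtain different subsets, substituting a plain multi-label tree for the dependence-maximization semi-supervised classifier, and assuming the clustering converges are all simplifications.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

def enwl_like(X, Y, preferences=(-50, -20, -5)):
    """X: (n_samples, n_features); Y: (n_samples, n_labels) binary matrix.
    One ensemble member per run of affinity propagation over the features."""
    members = []
    for p in preferences:   # different preferences -> different exemplar sets
        ap = AffinityPropagation(preference=p, random_state=0).fit(X.T)
        feats = ap.cluster_centers_indices_    # exemplar (center) features
        clf = MultiOutputClassifier(DecisionTreeClassifier()).fit(X[:, feats], Y)
        members.append((feats, clf))
    return members

def enwl_predict(members, X):
    # label-wise majority vote over the ensemble members
    votes = np.stack([m.predict(X[:, f]) for f, m in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```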

20.
Removing or filtering outliers and mislabeled instances before training a learning algorithm has been shown to increase classification accuracy, especially on noisy data sets. A popular approach is to remove any instance that is misclassified by a learning algorithm. However, the use of ensemble methods has also been shown to generally increase classification accuracy. In this paper, we extensively examine filtering and ensembling: we examine 9 learning algorithms, individually and ensembled together, as filtering algorithms, as well as the effects of filtering on those 9 learning algorithms across 54 data sets, and we compare the filtering results with a majority-voting ensemble. We find that the majority-voting ensemble significantly outperforms filtering unless there are high amounts of noise in the data set. Additionally, in most cases, using an ensemble of learning algorithms for filtering produces a greater increase in classification accuracy than using a single learning algorithm for filtering.
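The two strategies under comparison can be put side by side in a few lines; the three learners, the injected noise level, and the synthetic data below are arbitrary stand-ins for the paper's 9 algorithms and 54 data sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rng = np.random.default_rng(0)
flip = rng.random(len(ytr)) < 0.15          # inject 15% class noise
ytr = np.where(flip, 1 - ytr, ytr)

learners = [("dt", DecisionTreeClassifier()), ("nb", GaussianNB()),
            ("knn", KNeighborsClassifier())]

# Strategy 1: ensemble filtering. Drop points the ensemble misclassifies
# (cross-validated, so each point is judged by models not trained on it),
# then train a single learner on the cleaned set.
pred = cross_val_predict(VotingClassifier(learners), Xtr, ytr, cv=5)
keep = pred == ytr
filtered = DecisionTreeClassifier().fit(Xtr[keep], ytr[keep])

# Strategy 2: no filtering, just a majority-voting ensemble.
voted = VotingClassifier(learners).fit(Xtr, ytr)

print(f"filtered single learner: {filtered.score(Xte, yte):.3f}")
print(f"majority-vote ensemble:  {voted.score(Xte, yte):.3f}")
```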
