Similar Documents
A total of 20 similar documents were found (search time: 46 ms).
1.
Trimmed bagging   (Total citations: 1; self-citations: 0; citations by others: 1)
Bagging has been found to be successful in increasing the predictive performance of unstable classifiers. Bagging draws bootstrap samples from the training sample, applies the classifier to each bootstrap sample, and then averages over all obtained classification rules. The idea of trimmed bagging is to exclude the bootstrapped classification rules that yield the highest error rates, as estimated by the out-of-bag error rate, and to aggregate over the remaining ones. In this note we explore the potential benefits of trimmed bagging. On the basis of numerical experiments, we conclude that trimmed bagging performs comparably to standard bagging when applied to unstable classifiers such as decision trees, but yields better results when applied to more stable base classifiers, like support vector machines.
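For concreteness, here is a minimal Python sketch of the trimmed-bagging idea described in the abstract; the trimming fraction, the SVM base classifier and all other settings are illustrative choices of mine, not values from the paper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.datasets import make_classification

def trimmed_bagging(base_estimator, X, y, n_estimators=50, trim_fraction=0.25, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X)
    members = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[idx] = False                     # out-of-bag examples for this member
        clf = clone(base_estimator).fit(X[idx], y[idx])
        oob_err = 1.0 - clf.score(X[oob_mask], y[oob_mask]) if oob_mask.any() else 1.0
        members.append((oob_err, clf))
    members.sort(key=lambda m: m[0])              # keep the members with the lowest OOB error
    kept = [clf for _, clf in members[: int(n_estimators * (1 - trim_fraction))]]
    def predict(X_new):
        votes = np.stack([clf.predict(X_new) for clf in kept])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return predict

X, y = make_classification(n_samples=300, random_state=0)
predict = trimmed_bagging(SVC(), X, y, rng=0)
print("training accuracy of the trimmed bag:", (predict(X) == y).mean())
```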

2.
Research on Applications of Out-of-bag Samples   (Total citations: 3; self-citations: 0; citations by others: 3)
张春霞  郭高 《软件》2011,(3):1-4
Bagging ensembles substantially reduce the classification error of "weak" learning algorithms by combining unstable base classifiers, and out-of-bag samples arise as a natural by-product of the bagging procedure. Out-of-bag samples are now widely used to estimate the generalization error of bagging ensembles and to construct related ensemble classifiers. This paper surveys the applications of out-of-bag samples, describes the main research topics and their characteristics, and discusses possible directions for future work.

3.
Ensemble methods have proven to be highly effective in improving the performance of base learners under most circumstances. In this paper, we propose a new algorithm that combines the merits of some existing techniques, namely bagging, arcing, and stacking. The basic structure of the algorithm resembles bagging. However, the misclassification cost of each training point is repeatedly adjusted according to its observed out-of-bag vote margin. In this way, the method gains the advantage of arcing (building the classifiers the ensemble needs) without fixating on potentially noisy points. Computational experiments show that this algorithm performs consistently better than bagging and arcing with linear and nonlinear base classifiers. In view of the characteristics of bacing, a hybrid ensemble learning strategy that combines bagging and different versions of bacing is proposed and studied empirically.

4.
Bylander  Tom 《Machine Learning》2002,48(1-3):287-297
For two-class datasets, we provide a method for estimating the generalization error of a bag using out-of-bag estimates. In bagging, each predictor (single hypothesis) is learned from a bootstrap sample of the training examples; the output of a bag (a set of predictors) on an example is determined by voting. The out-of-bag estimate is based on recording the votes of each predictor on those training examples omitted from its bootstrap sample. Because no additional predictors are generated, the out-of-bag estimate requires considerably less time than 10-fold cross-validation. We address the question of how to use the out-of-bag estimate to estimate generalization error on two-class datasets. Our experiments on several datasets show that the out-of-bag estimate and 10-fold cross-validation have similar performance, but are both biased. We can eliminate most of the bias in the out-of-bag estimate and increase accuracy by incorporating a correction based on the distribution of the out-of-bag votes.
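A small illustration of the out-of-bag estimate itself (my own condensed version for a two-class problem, without the paper's vote-distribution correction): each member votes only on the training examples omitted from its bootstrap sample, and the aggregated OOB votes are compared with the true labels.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def oob_error_estimate(base_estimator, X, y, n_estimators=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros((n, 2))                       # per-example vote counts for classes 0/1
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)           # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)      # examples omitted from this sample
        clf = clone(base_estimator).fit(X[idx], y[idx])
        votes[oob, clf.predict(X[oob])] += 1       # record OOB votes only
    covered = votes.sum(axis=1) > 0                # examples that received at least one OOB vote
    oob_pred = votes[covered].argmax(axis=1)
    return np.mean(oob_pred != y[covered])

X, y = make_classification(n_samples=500, random_state=1)
print("OOB error estimate:", oob_error_estimate(DecisionTreeClassifier(), X, y))
```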

5.
The performance of m-out-of-n bagging with and without replacement is analyzed in terms of the sampling ratio (m/n). Standard bagging uses resampling with replacement to generate bootstrap samples of the same size as the original training set (m_wr = n). Without-replacement methods typically use half samples (m_wor = n/2). These choices of sampling size are arbitrary and need not be optimal in terms of the classification performance of the ensemble. We propose to use the out-of-bag estimates of the generalization accuracy to select a near-optimal value for the sampling ratio. Ensembles of classifiers trained on independent samples, whose size is such that the out-of-bag error of the ensemble is as low as possible, generally improve the performance of standard bagging and can be built efficiently.
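A rough sketch of the ratio-selection idea under my own simplifications: for each candidate sampling ratio m/n, an ensemble is trained on subsamples drawn without replacement and the ratio whose ensemble has the lowest out-of-bag error is kept. The candidate grid and ensemble size are arbitrary.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def oob_error_for_ratio(base, X, y, ratio, n_estimators=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(1, int(ratio * n))
    votes = np.zeros((n, len(np.unique(y))))
    for _ in range(n_estimators):
        idx = rng.choice(n, size=m, replace=False)     # without-replacement subsample of size m
        oob = np.setdiff1d(np.arange(n), idx)
        clf = clone(base).fit(X[idx], y[idx])
        votes[oob, clf.predict(X[oob])] += 1
    covered = votes.sum(axis=1) > 0
    return np.mean(votes[covered].argmax(axis=1) != y[covered])

X, y = make_classification(n_samples=400, random_state=2)
ratios = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
errors = [oob_error_for_ratio(DecisionTreeClassifier(), X, y, r) for r in ratios]
print("near-optimal sampling ratio:", ratios[int(np.argmin(errors))])
```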

6.
Bagging, boosting, rotation forest and random subspace methods are well-known re-sampling ensemble methods that generate and combine a diversity of learners using the same learning algorithm for the base classifiers. Boosting and rotation forest are considered stronger than bagging and random subspace methods on noise-free data. However, there are strong empirical indications that bagging and random subspace methods are much more robust than boosting and rotation forest in noisy settings. For this reason, in this work we build an ensemble of bagging, boosting, rotation forest and random subspace ensembles, each with 6 sub-classifiers, and use a voting methodology for the final prediction. We compared this with simple bagging, boosting, rotation forest and random subspace ensembles of 25 sub-classifiers, as well as with other well-known combining methods, on standard benchmark datasets, and the proposed technique had better accuracy in most cases.
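A hedged sketch of the "ensemble of ensembles" idea using scikit-learn; rotation forest is not available there, so the committee below combines only bagging, boosting and a random-subspace-style bagging. The sub-ensemble size of 6 follows the abstract; everything else is illustrative.

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

base = DecisionTreeClassifier()
members = [
    ("bagging", BaggingClassifier(base, n_estimators=6, random_state=0)),
    ("boosting", AdaBoostClassifier(n_estimators=6, random_state=0)),
    # random subspace method approximated by bagging over feature subsets
    ("subspace", BaggingClassifier(base, n_estimators=6, max_features=0.5,
                                   bootstrap=False, random_state=0)),
]
combined = VotingClassifier(estimators=members, voting="hard")   # majority vote over sub-ensembles

X, y = make_classification(n_samples=500, random_state=3)
print("5-fold CV accuracy:", cross_val_score(combined, X, y, cv=5).mean())
```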

7.
张震  胡捍英 《计算机工程》2005,31(15):160-161,171
Attribute Bagging (ABagging) is proposed to enhance kNN. ABagging obtains multiple training sets by resampling attributes rather than training instances. Since kNN is unstable with respect to attribute resampling, ABagging can effectively reduce the error rate of kNN. ABagging kNN is also far more resistant to irrelevant attributes than kNN, and it is faster than Bagging kNN. Experiments on UCI datasets demonstrate the effectiveness of ABagging kNN.
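A minimal sketch of attribute bagging for kNN as described above: each member is trained on a random subset of attributes rather than a bootstrap of instances, and predictions are combined by majority vote. The subset size (half of the features) and member count are my assumptions, not values from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

def abagging_knn(X, y, X_test, n_members=25, k=5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    votes = np.zeros((len(X_test), len(np.unique(y))))
    for _ in range(n_members):
        feats = rng.choice(d, size=max(1, d // 2), replace=False)   # resample attributes
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[:, feats], y)
        votes[np.arange(len(X_test)), clf.predict(X_test[:, feats])] += 1
    return votes.argmax(axis=1)                                     # majority vote

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=4)
print("training accuracy:", (abagging_knn(X, y, X) == y).mean())
```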

8.
Many classifiers achieve high levels of accuracy but have limited applicability in real-world situations because they do not lead to a greater understanding or insight into the way features influence the classification. In areas such as health informatics, a classifier that clearly identifies the influences on classification can be used to direct research and formulate interventions. This research investigates the practical applications of Automated Weighted Sum (AWSum), a classifier that provides accuracy comparable to other techniques whilst providing insight into the data. This is achieved by calculating a weight for each feature value that represents its influence on the class value. The merits of this approach in classification and insight are evaluated on Cystic Fibrosis and Diabetes datasets with positive results.

9.
The ensemble method is a powerful data mining paradigm, which builds a classification model by integrating multiple diversified component learners. Bagging is one of the most successful ensemble methods. It is made of bootstrap-inspired classifiers and uses these classifiers to get an aggregated classifier. However, in bagging, bootstrapped training sets become more and more similar as redundancy increases. Besides redundancy, any training set is usually subject to noise. Moreover, the training set might be imbalanced. Thus, each training instance has a different impact on the learning process. This paper explores some properties of the ensemble margin and its use in improving the performance of bagging. We introduce a new approach to measure the importance of training data in learning, based on margin theory. Then, a new bagging method concentrating on critical instances is proposed. This method is more accurate than bagging and more robust than boosting. Compared to bagging, it reduces the bias while generally keeping the same variance. Our findings suggest that (a) examples with low margins tend to be more critical for classifier performance; (b) examples with higher margins tend to be more redundant; and (c) misclassified examples with high margins tend to be noisy examples. Our experimental results on 15 various data sets show that the generalization error of bagging can be reduced by up to 2.5% and its resilience to noise strengthened by iteratively removing both typical and noisy training instances, reducing the training set size by up to 75%.
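A simplified sketch of how ensemble margins can be computed from a bagging ensemble and used to rank training instances; the margin here is the standard vote-fraction difference for the true class, and the thresholds are illustrative rather than the authors' algorithm.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, flip_y=0.1, random_state=5)   # 10% label noise
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

votes = np.stack([est.predict(X) for est in bag.estimators_])   # shape (n_estimators, n_samples)
frac_true = (votes == y).mean(axis=0)                            # vote fraction for the true class
margin = frac_true - (1.0 - frac_true)                           # two-class ensemble margin in [-1, 1]

critical = np.abs(margin) < 0.2          # low margin: near the decision boundary
noisy = margin < -0.6                    # confidently misclassified: suspected label noise
redundant = margin > 0.9                 # very high margin: likely redundant for learning
print("critical:", critical.sum(), "suspected noise:", noisy.sum(), "redundant:", redundant.sum())
```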

10.
《Pattern recognition letters》2003,24(1-3):455-471
Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given the same size partitions and bags, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve use of datasets that are too large to handle in the memory of the typical computer. Hence, bagging with samples the size of the data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions each of size 1/n (which will be memory resident during learning) in a distributed way results in a classifier which has a bagging-like performance gain. The use of distributed disjoint partitions in learning is significantly less complex and faster than bagging.
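A minimal sketch of the disjoint-partition committee: the data are split into n disjoint subsets, one classifier is learned per subset, and the committee predicts by majority vote (the partition count here is arbitrary).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def disjoint_partition_committee(X, y, n_partitions=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    members = []
    for part in np.array_split(order, n_partitions):     # disjoint, roughly equal subsets
        members.append(DecisionTreeClassifier().fit(X[part], y[part]))
    def predict(X_new):
        votes = np.stack([m.predict(X_new) for m in members])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    return predict

X, y = make_classification(n_samples=1000, random_state=6)
predict = disjoint_partition_committee(X, y)
print("training accuracy of the committee:", (predict(X) == y).mean())
```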

11.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major set oriented scheme: the training dataset is separated into two parts (a major set and a minor set). The classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it would be either physically impossible or time consuming to load the major set into the memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets. In this paper, we propose a solution to bridge the local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables are used to count the number of times it has been identified as noise by all data subsets. The instance with higher error values will have a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
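A rough sketch of the partition-based noise identification scheme with the majority threshold; a shallow decision tree stands in for the paper's "good rules", which is an assumption on my part.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, random_state=7)
y_noisy = y.copy()
flip = np.random.default_rng(7).choice(len(y), size=30, replace=False)
y_noisy[flip] ^= 1                                        # inject mislabeled examples

n_subsets = 5
parts = np.array_split(np.random.default_rng(0).permutation(len(X)), n_subsets)
error_count = np.zeros(len(X), dtype=int)
for part in parts:
    # local model built from one small subset, then used to evaluate the whole dataset
    model = DecisionTreeClassifier(max_depth=5).fit(X[part], y_noisy[part])
    error_count += (model.predict(X) != y_noisy)

majority_flag = error_count > n_subsets // 2              # "majority" threshold scheme
print("flagged as noise:", majority_flag.sum(),
      "true noise among flagged:", majority_flag[flip].sum())
```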

12.
Ensemble pruning deals with the selection of base learners prior to combination in order to improve prediction accuracy and efficiency. In the ensemble literature, it has been pointed out that for an ensemble classifier to achieve higher prediction accuracy, it is critical that it consist of accurate classifiers which are, at the same time, as diverse as possible. In this paper, a novel ensemble pruning method, called PL-bagging, is proposed. In order to attain the balance between diversity and accuracy of base learners, PL-bagging employs the positive Lasso to assign weights to base learners in the combination step. Simulation studies and theoretical investigation show that PL-bagging filters out redundant base learners while assigning higher weights to more accurate base learners. This improved weighting scheme further results in higher classification accuracy, and the improvement becomes even more significant as the ensemble size increases. The performance of PL-bagging was compared with state-of-the-art ensemble pruning methods for the aggregation of bootstrapped base learners using 22 real and 4 synthetic datasets. The results indicate that PL-bagging significantly outperforms state-of-the-art ensemble pruning methods such as Boosting-based pruning and Trimmed bagging.
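A hedged sketch of the PL-bagging combination step: bootstrap base learners are weighted by a Lasso regression constrained to non-negative coefficients (scikit-learn's Lasso with positive=True). The regularization strength and the use of class probabilities as regressors are my assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Lasso
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, random_state=8)
rng = np.random.default_rng(8)
n, B = len(X), 30

members, P = [], np.zeros((n, B))
for b in range(B):
    idx = rng.integers(0, n, size=n)                       # bootstrap sample
    clf = clone(DecisionTreeClassifier(max_depth=3)).fit(X[idx], y[idx])
    members.append(clf)
    P[:, b] = clf.predict_proba(X)[:, 1]                   # each column: one base learner's scores

lasso = Lasso(alpha=0.01, positive=True).fit(P, y)         # positive Lasso weights for the members
w = lasso.coef_
print("base learners kept (nonzero weight):", (w > 0).sum(), "of", B)

# A weighted ensemble score for new data X_new would then be:
# score = sum(w[b] * members[b].predict_proba(X_new)[:, 1] for b in range(B)) + lasso.intercept_
```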

13.
Many data mining applications have a large amount of data, but labeling data is often difficult, expensive, or time consuming, as it requires human experts for annotation. Semi-supervised learning addresses this problem by using unlabeled data together with labeled data to improve performance. Co-Training is a popular semi-supervised learning algorithm that assumes each example is represented by two or more redundantly sufficient sets of features (views) and, additionally, that these views are independent given the class. However, these assumptions are not satisfied in many real-world application domains. In this paper, a framework called Co-Training by Committee (CoBC) is proposed, in which an ensemble of diverse classifiers is used for semi-supervised learning that requires neither redundant and independent views nor different base learning algorithms. The framework is a general single-view semi-supervised learner that can be applied to any ensemble learner to build diverse committees. Experimental results of CoBC using Bagging, AdaBoost and the Random Subspace Method (RSM) as ensemble learners demonstrate that error diversity among classifiers leads to an effective Co-Training style algorithm that maintains the diversity of the underlying ensemble.
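A much-simplified, committee-style self-labeling sketch in the spirit of CoBC (my own condensed illustration, not the published algorithm): a bagging committee trained on the labeled pool repeatedly labels the unlabeled examples it is most confident about and retrains.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, random_state=9)
labeled = np.zeros(len(X), dtype=bool)
labeled[:60] = True                                    # small labeled pool; the rest is "unlabeled"
y_work = y.copy()

for _ in range(5):                                     # a few committee-labeling rounds
    committee = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                                  random_state=0).fit(X[labeled], y_work[labeled])
    proba = committee.predict_proba(X[~labeled])
    pick = np.argsort(proba.max(axis=1))[-20:]         # most confident unlabeled examples
    chosen = np.flatnonzero(~labeled)[pick]
    y_work[chosen] = proba[pick].argmax(axis=1)        # adopt the committee's labels
    labeled[chosen] = True

print("accuracy on all data after self-labeling:", (committee.predict(X) == y).mean())
```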

14.
Relief is a filter-type feature selection algorithm that greedily maximizes the instance margin of a nearest-neighbor classifier. Combining it with a local weighting approach, earlier authors proposed the Class Dependent RELIEF algorithm (CDRELIEF), which trains a separate feature weight vector for each class. This better reflects feature relevance, but the learned weights are only effective for measuring the relevance of features to a single class, so classification accuracy in practice is not high enough. To apply CDRELIEF to the classification process, this paper modifies the weight update procedure and assigns an instance weight to each training instance; by incorporating the instance weights into the weight update formula, data points far from the decision boundary and outliers are prevented from influencing the update, thereby improving classification accuracy. The proposed instance-weighted class-dependent RELIEF (IWCDRELIEF) is compared with CDRELIEF on several UCI two-class datasets. Experimental results show that the proposed algorithm clearly outperforms CDRELIEF.

15.
An Improved Bagging Algorithm for Imbalanced Datasets   (Total citations: 1; self-citations: 1; citations by others: 0)
When the traditional Bagging classification method is applied to imbalanced datasets, it can achieve high overall classification accuracy, but its accuracy on the minority class is poor. To improve classification accuracy on minority-class data, the SMOTE algorithm is used to process the minority-class examples in the sample set, and the weights of the individual examples in the Bagging algorithm are adjusted according to their class values. Confusion matrices and ROC curves show that the improved algorithm maintains overall classification accuracy while improving the classification accuracy on the minority class.
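A hedged sketch of the overall idea: oversample the minority class with SMOTE (here via the third-party imbalanced-learn package) and then train a bagging ensemble; the paper's additional per-class weight adjustment inside Bagging is not reproduced here.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report

# imbalanced two-class problem: roughly 90% majority, 10% minority
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=10)

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)    # synthesize minority examples first

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X_res, y_res)
print(classification_report(y, bag.predict(X)))            # per-class precision/recall
```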

16.
Accurate prediction of high performance concrete (HPC) compressive strength is a very important issue. In the last decade, a variety of modeling approaches have been developed and applied to predict HPC compressive strength from a wide range of variables, with varying success. The selection, application and comparison of decent modeling methods therefore remain a crucial task, subject to ongoing research and debate. This study proposes three different ensemble approaches: (i) single ensembles of decision trees (DT); (ii) a two-level ensemble approach which employs the same ensemble learning method twice in building ensemble models; (iii) a hybrid ensemble approach which integrates an attribute-based ensemble method (random subspaces, RS) and instance-based ensemble methods (bagging, Bag; stochastic gradient boosting, GB). A decision tree is used as the base learner of the ensembles and its results are benchmarked against the proposed ensemble models. The obtained results show that the proposed ensemble models noticeably improve the prediction accuracy of the single DT model. In terms of the average coefficient of determination, the best models for HPC compressive strength forecasting are GB–RS DT, RS–GB DT and GB–GB DT among the eleven proposed predictive models, respectively; in terms of the maximum coefficient of determination (R²max), the best models are GB–RS DT (R²=0.9520), GB–GB DT (R²=0.9456) and Bag–Bag DT (R²=0.9368), respectively.
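As one illustration of the hybrid two-level idea (not the authors' exact configurations), the sketch below wraps a random-subspace outer layer around stochastic gradient boosting of trees, i.e. an RS–GB-style regressor; all hyperparameters and the synthetic data are placeholders.

```python
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# inner layer: stochastic gradient boosting of regression trees
gb = GradientBoostingRegressor(n_estimators=100, subsample=0.8, random_state=0)
# outer layer: random subspaces (bagging over feature subsets, no instance bootstrap)
rs_gb = BaggingRegressor(gb, n_estimators=10, max_features=0.7,
                         bootstrap=False, random_state=0)

X, y = make_regression(n_samples=400, n_features=8, noise=10, random_state=11)
print("R^2 (5-fold CV):", cross_val_score(rs_gb, X, y, cv=5, scoring="r2").mean())
```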

17.
Given a graph G=(V,E) with strictly positive integer weights ω_i on the vertices i ∈ V, an interval coloring of G is a function I that assigns an interval I(i) of ω_i consecutive integers (called colors) to each vertex i ∈ V so that I(i) ∩ I(j) = ∅ for all edges {i,j} ∈ E. The interval coloring problem is to determine an interval coloring that uses as few colors as possible. Assuming that a strictly positive integer weight δ_ij is associated with each edge {i,j} ∈ E, a bandwidth coloring of G is a function c that assigns an integer (called a color) to each vertex i ∈ V so that |c(i) − c(j)| ≥ δ_ij for all edges {i,j} ∈ E. The bandwidth coloring problem is to determine a bandwidth coloring with minimum difference between the largest and the smallest colors used. We prove that an optimal solution of the interval coloring problem can be obtained by solving a series of bandwidth coloring problems. Computational experiments demonstrate that such a reduction can help to solve larger instances or to obtain better upper bounds on the optimal solution value of the interval coloring problem.

18.
An Ensemble Learning Method for Imbalanced Data Based on Sample Weight Updating   (Total citations: 1; self-citations: 0; citations by others: 1)
The problem of imbalanced data is pervasive in big data and machine learning applications such as medical diagnosis and anomaly detection. Researchers have proposed or adopted many methods for learning from imbalanced data, such as data sampling (e.g., SMOTE) and ensemble learning (e.g., EasyEnsemble). Oversampling methods may suffer from overfitting or low classification accuracy on boundary samples, while undersampling methods may lead to underfitting. This paper fuses the basic ideas of SMOTE, Bagging and Boosting and proposes the Rotation SMOTE algorithm, which indirectly increases the weights of minority-class samples by applying SMOTE to them during the Boosting process according to the predictions of the base classifiers. Drawing on the basic idea of Focal Loss, it further proposes the FocalBoost algorithm, which directly optimizes the AdaBoost weight-update strategy based on the predictions of the base classifiers. Experiments on multiple evaluation metrics over 11 imbalanced datasets from different application domains show that, compared with other imbalanced-data algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE achieves the highest recall on all datasets and the best or second-best G-mean and F1 score on most of them; compared with the original AdaBoost, FocalBoost achieves better performance metrics on 9 of the imbalanced datasets.

19.
Analytical and numerical results on M-variable generalized Bessel functions   (Total citations: 1; self-citations: 0; citations by others: 1)
Recently, some multivariable special functions have been obtained by generalizing functions of Bessel type. Here, we continue the treatment of these functions starting from J_n(x, y; i), which is of noticeable practical interest. Finally, we consider the cases of the functions J_n(x_1, x_2, ..., x_M) and the related modified version, I_n(x_1, x_2, ..., x_M), with two significant physical applications. Calculations of multivariable generalized Bessel functions are discussed and numerical results are given for J_n(x_1, x_2; i), with n = 0, 1, in a region of interest.

20.
The value difference metric (VDM) is one of the best-known and most widely used distance functions for nominal attributes. This work applies the instance-weighting technique to improve VDM. An instance-weighted value difference metric (IWVDM) is proposed here. Different from prior work, IWVDM uses naive Bayes (NB) to find weights for training instances. Because early work has shown that there is a close relationship between VDM and NB, some work on NB can be applied to VDM. The weight of a training instance x that belongs to the class c is assigned according to the difference between the conditional probability P̂(c|x) estimated by NB and the true conditional probability P(c|x), and the weight is adjusted iteratively. Compared with previous work, IWVDM has the advantage of reducing the time complexity of the process of finding weights, while simultaneously improving the performance of VDM. Experimental results on 36 UCI datasets validate the effectiveness of IWVDM.
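To make the quantity being weighted concrete, here is a small sketch of the value difference metric itself for nominal attributes (a standard formulation); the NB-based instance-weighting loop of IWVDM is not reproduced here.

```python
import numpy as np

def vdm_distance(X, y, a, b, q=2):
    """VDM between instances a and b over a nominal-attribute matrix X with labels y."""
    classes = np.unique(y)
    dist = 0.0
    for j in range(X.shape[1]):
        for c in classes:
            # conditional probabilities P(c | attribute j takes this value)
            pa = np.mean(y[X[:, j] == a[j]] == c) if np.any(X[:, j] == a[j]) else 0.0
            pb = np.mean(y[X[:, j] == b[j]] == c) if np.any(X[:, j] == b[j]) else 0.0
            dist += abs(pa - pb) ** q
    return dist

# toy nominal data: 3 attributes with values encoded as small integers
X = np.array([[0, 1, 2], [0, 1, 0], [1, 0, 2], [1, 0, 0], [0, 0, 1]])
y = np.array([0, 0, 1, 1, 0])
print("VDM(x0, x2) =", vdm_distance(X, y, X[0], X[2]))
```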

