Similar Documents
20 similar documents found (search time: 296 ms)
1.
Research on the Applications of Out-of-bag Samples (cited 3 times: 0 self-citations, 3 by others)
张春霞  郭高 《软件》2011,(3):1-4
Bagging ensembles combine unstable base classifiers to substantially reduce the classification error of a "weak" learning algorithm, and out-of-bag samples are a natural by-product of the bagging procedure. Out-of-bag samples are now widely used for estimating the generalization error of bagging ensembles, constructing related ensemble classifiers, and similar tasks. This article surveys the applications of out-of-bag samples, describes the main lines and characteristics of research on them, and discusses possible future research directions.
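The out-of-bag error estimate discussed in the abstract above can be sketched in a few lines of pure Python: each point is classified only by the bootstrap replicates that did not sample it, and the majority of those votes is compared against the true label. Everything below (the `train`/`predict` hooks, the toy 1-NN base learner, the 1-D data) is an illustrative assumption, not the paper's setup.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; on average ~36.8% of the
    points are left out-of-bag in each replicate."""
    return [rng.randrange(n) for _ in range(n)]

def oob_error(X, y, train, predict, n_estimators=25, seed=0):
    """Estimate generalization error from out-of-bag majority votes.
    `train` and `predict` are illustrative hooks for any base learner."""
    rng = random.Random(seed)
    n = len(X)
    votes = [{} for _ in range(n)]            # per-instance label votes
    for _ in range(n_estimators):
        idx = bootstrap_indices(n, rng)
        in_bag = set(idx)
        model = train([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:               # vote only where i was unseen
                label = predict(model, X[i])
                votes[i][label] = votes[i].get(label, 0) + 1
    scored = [(max(v, key=v.get), t) for v, t in zip(votes, y) if v]
    return sum(p != t for p, t in scored) / len(scored)

# Toy 1-NN base learner on 1-D data (purely illustrative).
def train_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

X = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
err = oob_error(X, y, train_1nn, predict_1nn)
```

Because the OOB votes come only from replicates that never saw the instance, `err` behaves like a built-in cross-validation estimate and needs no held-out set.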

2.
Learning from noisy data is a challenging task for data mining research. In this paper, we argue that for noisy data both the global bagging strategy and the local bagging strategy suffer from their own inherent disadvantages and thus cannot form accurate prediction models. Consequently, we present a Global and Local Bagging (Glocal Bagging, GB) approach to tackle this problem. GB assigns weight values to the base classifiers under two considerations: (1) for each test instance Ix, GB prefers bags close to Ix, which is the nature of the local learning strategy; (2) GB assigns larger weight values to the base classifiers with higher accuracy on the out-of-bag samples, which is the nature of the global learning strategy. Combining (1) and (2), GB assigns large weight values to the classifiers that are close to the current test instance Ix and have high out-of-bag accuracy. A diversity/accuracy analysis on synthetic datasets shows that GB improves the classifier ensemble's performance by increasing its base classifiers' accuracy. Moreover, a bias/variance analysis shows that GB's accuracy improvement mainly comes from a reduction of the bias error. Experimental results on 25 UCI benchmark datasets show that when the datasets are noisy, GB is superior to previously proposed bagging methods such as classical bagging, bragging, nice bagging, trimmed bagging, and lazy bagging.
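The two weighting criteria the abstract combines, locality to the test instance and out-of-bag accuracy, can be sketched as a normalized product of the two scores. The exponential distance kernel, the use of bag centroids, and the 1-D setting below are illustrative assumptions, not the paper's exact formula.

```python
import math

def glocal_weights(x, bag_centroids, oob_accuracies, bandwidth=1.0):
    """Combine the local criterion (closeness of each bag to the test
    instance x) with the global one (each base classifier's out-of-bag
    accuracy) as a normalized product of the two scores."""
    raw = [math.exp(-abs(x - c) / bandwidth) * acc
           for c, acc in zip(bag_centroids, oob_accuracies)]
    total = sum(raw)
    return [r / total for r in raw]

# Two bags with equal OOB accuracy: the one nearer x gets the larger weight.
w = glocal_weights(x=0.0, bag_centroids=[0.5, 5.0], oob_accuracies=[0.9, 0.9])
```

With equal OOB accuracies the weighting reduces to pure local learning; with identical centroids it reduces to pure global (accuracy-based) weighting, matching the abstract's two limiting cases.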

3.
Bootstrap aggregation, or bagging, is a method of reducing the prediction error of a statistical learner. The goal of bagging is to construct a new learner which is the expectation of the original learner with respect to the empirical distribution function. In nearly all cases, the expectation cannot be computed analytically, and bootstrap sampling is used to produce an approximation. The k-nearest neighbor learners are exceptions to this generalization, and exact bagging of many k-nearest neighbor learners is straightforward. This article presents computationally simple and fast formulae for exact bagging of k-nearest neighbor learners and extends exact bagging methods from the conventional bootstrap sampling (sampling n observations with replacement from a set of n observations) to bootstrap sub-sampling schemes (with and without replacement). In addition, a partially exact k-nearest neighbor regression learner is developed. The article also compares the prediction error associated with elementary and exact bagging k-nearest neighbor learners, and several other ensemble methods using a suite of publicly available data sets.
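For the 1-nearest-neighbor case under conventional bootstrap sampling, exact bagging has a simple closed form: the bagged prediction averages the neighbors, ordered by distance, with the probability that neighbor j is the closest point present in a bootstrap sample. The sketch below uses the standard multinomial-bootstrap identity for that probability; it is stated here as an illustration and is not the article's full development.

```python
def exact_bagging_weights_1nn(n):
    """Weight of the j-th nearest neighbor (j = 1..n) in exact bagging of
    a 1-NN learner under conventional bootstrap sampling: the probability
    that neighbors 1..j-1 are all absent from the bootstrap sample while
    neighbor j is present.  The terms telescope, so the weights sum to 1."""
    return [((n - j + 1) / n) ** n - ((n - j) / n) ** n
            for j in range(1, n + 1)]

w = exact_bagging_weights_1nn(10)
# w[0] = 1 - (9/10)**10 ≈ 0.6513: the nearest neighbor dominates
```

No bootstrap simulation is needed: the expectation over all bootstrap samples is computed exactly from these n weights, which is what makes k-NN learners the exception the abstract describes.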

4.
This research aims to evaluate the potential of ensemble learning (bagging, boosting, and modified bagging) for predicting microbially induced concrete corrosion in sewer systems from the data mining (DM) perspective. Particular focus is laid on ensemble techniques for network-based DM methods, including the multi-layer perceptron neural network (MLPNN) and radial basis function neural network (RBFNN), as well as tree-based DM methods, such as the chi-square automatic interaction detector (CHAID), classification and regression tree (CART), and random forests (RF). Hence, an interdisciplinary approach is presented by combining findings from material sciences and hydrochemistry with data mining analyses to predict concrete corrosion. The factors affecting concrete corrosion, such as time, gas temperature, gas-phase H2S concentration, relative humidity, pH, and exposure phase, are considered as the models' inputs. All 433 datasets are randomly split into training, validation, and testing sets to construct an individual model and twenty component models of boosting, bagging, and modified bagging for each DM base learner. Considering several model performance indices (e.g., root mean square error, RMSE; mean absolute percentage error, MAPE; correlation coefficient, r), the best ensemble predictive models are selected. The results obtained indicate that the prediction ability of the random forests DM model is superior to the other ensemble learners, followed by the ensemble Bag-CHAID method. On average, the ensemble tree-based models performed better than the ensemble network-based models; nevertheless, it was also found that taking advantage of ensemble learning enhances the general performance of individual DM models by more than 10%.

5.
《Pattern recognition letters》2003,24(1-3):455-471
Bagging forms a committee of classifiers by bootstrap aggregation of training sets from a pool of training data. A simple alternative to bagging is to partition the data into disjoint subsets. Experiments with decision tree and neural network classifiers on various datasets show that, given the same size partitions and bags, disjoint partitions result in performance equivalent to, or better than, bootstrap aggregates (bags). Many applications (e.g., protein structure prediction) involve use of datasets that are too large to handle in the memory of the typical computer. Hence, bagging with samples the size of the data is impractical. Our results indicate that, in such applications, the simple approach of creating a committee of n classifiers from disjoint partitions each of size 1/n (which will be memory resident during learning) in a distributed way results in a classifier which has a bagging-like performance gain. The use of distributed disjoint partitions in learning is significantly less complex and faster than bagging.
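The alternative the paper studies, a committee of n classifiers trained on disjoint 1/n-size partitions, reduces to a one-line split once the data is shuffled. The shuffle-then-stride scheme below is an illustrative sketch; the paper's distributed training and committee voting are not shown.

```python
import random

def disjoint_partitions(data, n_parts, seed=0):
    """Shuffle once, then split into n disjoint, roughly equal subsets,
    each small enough to stay memory-resident during learning."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]

# A committee of 4 learners would get one partition each; every training
# item appears in exactly one partition, unlike bootstrap bags.
parts = disjoint_partitions(range(100), n_parts=4)
```

Unlike a bootstrap bag, which duplicates some points and omits about 37% of them, each partition here sees every point exactly once across the committee, which is what makes the approach attractive for datasets too large for one machine's memory.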

6.
Trimmed bagging (cited 1 time: 0 self-citations, 1 by others)
Bagging has been found to be successful in increasing the predictive performance of unstable classifiers. Bagging draws bootstrap samples from the training sample, applies the classifier to each bootstrap sample, and then averages over all obtained classification rules. The idea of trimmed bagging is to exclude the bootstrapped classification rules that yield the highest error rates, as estimated by the out-of-bag error rate, and to aggregate over the remaining ones. In this note we explore the potential benefits of trimmed bagging. On the basis of numerical experiments, we conclude that trimmed bagging performs comparably to standard bagging when applied to unstable classifiers such as decision trees, but yields better results when applied to more stable base classifiers, like support vector machines.
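The trimming step described above reduces to ranking the bootstrapped classifiers by their out-of-bag error and keeping only the best ones. A minimal sketch, in which the 25% trim fraction and the string placeholders standing in for trained classifiers are assumptions:

```python
def trimmed_ensemble(classifiers, oob_errors, trim_fraction=0.25):
    """Rank bootstrapped classification rules by out-of-bag error and
    drop the worst `trim_fraction`; aggregation then uses the rest."""
    ranked = sorted(zip(classifiers, oob_errors), key=lambda pair: pair[1])
    keep = max(1, int(len(ranked) * (1 - trim_fraction)))
    return [clf for clf, _ in ranked[:keep]]

survivors = trimmed_ensemble(["c1", "c2", "c3", "c4"],
                             [0.10, 0.40, 0.05, 0.20])
# → ["c3", "c1", "c4"]; "c2" (highest OOB error) is trimmed
```

Because the OOB error comes for free from the bagging procedure itself, the trimming adds essentially no training cost over standard bagging.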

7.
This paper investigates the use of wavelet ensemble models for high performance concrete (HPC) compressive strength forecasting. More specifically, we first incorporate bagging and gradient boosting methods in building artificial neural network (ANN) ensembles: bagged artificial neural networks (BANN) and gradient boosted artificial neural networks (GBANN). The coefficient of determination (R²), mean absolute error (MAE) and root mean squared error (RMSE) statistics are used for performance evaluation of the proposed predictive models. Empirical results show that the ensemble models (R²_BANN = 0.9278, R²_GBANN = 0.9270) are superior to a conventional ANN model (R²_ANN = 0.9088). We then couple the discrete wavelet transform (DWT) with the ANN ensembles to enhance prediction accuracy. The study concludes that DWT is an effective tool for increasing the accuracy of the ANN ensembles (R²_WBANN = 0.9397, R²_WGBANN = 0.9528).

8.
A comparison of decision tree ensemble creation techniques (cited 3 times: 0 self-citations, 3 by others)
We experimentally evaluate bagging and seven other randomization-based approaches to creating an ensemble of decision tree classifiers. Statistical tests were performed on experimental results from 57 publicly available data sets. When cross-validation comparisons were tested for statistical significance, the best method was statistically more accurate than bagging on only eight of the 57 data sets. Alternatively, examining the average ranks of the algorithms across the group of data sets, we find that boosting, random forests, and randomized trees are statistically significantly better than bagging. Because our results suggest that using an appropriate ensemble size is important, we introduce an algorithm that decides when a sufficient number of classifiers has been created for an ensemble. Our algorithm uses the out-of-bag error estimate, and is shown to result in an accurate ensemble for those methods that incorporate bagging into the construction of the ensemble.
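The stopping idea in the abstract, growing the ensemble until the out-of-bag error estimate stops improving, can be sketched generically. The tolerance and patience parameters and the `oob_error_at_size` callback are illustrative assumptions, not the paper's algorithm.

```python
def grow_until_stable(oob_error_at_size, max_size=200, tol=1e-3, patience=5):
    """Add classifiers one at a time; stop once `patience` consecutive
    additions fail to improve the OOB error estimate by more than `tol`."""
    best, stale = float("inf"), 0
    for size in range(1, max_size + 1):
        err = oob_error_at_size(size)   # caller trains and estimates OOB error
        if err < best - tol:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                return size, best
    return max_size, best

# Toy error curve that flattens out near 0.1
size, err = grow_until_stable(lambda k: 0.1 + 1.0 / k)
```

Since the OOB estimate is a by-product of bagging, this rule sizes the ensemble without any extra validation data, which is why it applies only to methods that incorporate bagging into ensemble construction.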

9.
The estimation of quantiles in two-phase sampling with an arbitrary sampling design in each of the two phases is investigated. Several ratio- and exponentiation-type estimators are proposed that provide the optimum estimate of a quantile based on an optimum exponent α. Properties of these estimators are studied under a large-sample approximation, and the use of double sampling for stratification to estimate quantiles is also examined. The real performance of these estimators is evaluated for the three quartiles on the basis of data from two real populations using different sampling designs. The simulation study shows that the proposed estimators can be very satisfactory in terms of relative bias and efficiency.

10.
Ensemble methods have proven to be highly effective in improving the performance of base learners under most circumstances. In this paper, we propose a new algorithm that combines the merits of some existing techniques, namely bagging, arcing, and stacking. The basic structure of the algorithm resembles bagging. However, the misclassification cost of each training point is repeatedly adjusted according to its observed out-of-bag vote margin. In this way, the method gains the advantage of arcing (building the classifiers the ensemble needs) without fixating on potentially noisy points. Computational experiments show that this algorithm performs consistently better than bagging and arcing with linear and nonlinear base classifiers. In view of the characteristics of bacing, a hybrid ensemble learning strategy, which combines bagging and different versions of bacing, is proposed and studied empirically.

11.
This paper proposes an approach for embedding two complete binary trees (CBTs) into an n-dimensional star graph (S_n), and provides a fault-tolerant scheme for the trees. First, a CBT with height Σ_{m=2}^{n} ⌊log m⌋ is embedded into the S_n with dilation 3. The height of this CBT is very close to ⌊Σ_{m=2}^{n} log m⌋, the height of the largest possible CBT that can be embedded into the S_n. By shifting the first CBT with the generating-function product g2·g3·g4·g3, another CBT with height Σ_{m=2}^{n} ⌊log m⌋ can also be embedded into the S_n without conflicting with the first one. Moreover, if three-eighths of the nodes in the first CBT and all nodes in the second CBT are faulty, all of them can be recovered. Under the condition that a first CBT with smaller height (⌊Σ_{m=2}^{n} log m⌋ − 1) is embedded, all the replacement nodes will be free. As a consequence, even in the case that all nodes in the two trees are faulty, they can be recovered in the smallest number of recovery steps and only with dilation 5.

12.
Accurate prediction of high performance concrete (HPC) compressive strength is a very important issue. In the last decade, a variety of modeling approaches have been developed and applied to predict HPC compressive strength from a wide range of variables, with varying success. The selection, application and comparison of suitable modeling methods therefore remain a crucial task, subject to ongoing research and debate. This study proposes three different ensemble approaches: (i) single ensembles of decision trees (DT); (ii) a two-level ensemble approach which employs the same ensemble learning method twice in building ensemble models; (iii) a hybrid ensemble approach which integrates an attribute-based ensemble method (random subspaces, RS) with instance-based ensemble methods (bagging, Bag; stochastic gradient boosting, GB). A decision tree is used as the base learner of the ensembles, and its results are benchmarked against the proposed ensemble models. The obtained results show that the proposed ensemble models noticeably advance the prediction accuracy of the single DT model. In terms of the average coefficient of determination, the best models for HPC compressive strength forecasting among the eleven proposed predictive models are GB–RS DT, RS–GB DT and GB–GB DT, respectively; in terms of the maximum coefficient of determination (R²max), the best models are GB–RS DT (R² = 0.9520), GB–GB DT (R² = 0.9456) and Bag–Bag DT (R² = 0.9368), respectively.

13.
One of the most widely used approaches to the class-imbalance issue is ensemble learning. In the conventional ensemble learning approach, the base classifier is trained on an unbalanced training set. Although researchers have examined employing resampling strategies to balance the training set, it remains difficult to select the most suitable resampling method or base classifier for a given training set. A multi-armed bandit heterogeneous ensemble framework was developed as a solution to these issues. This framework employs the multi-armed bandit technique to pick the best base classifier and resampling techniques to build a heterogeneous ensemble model. To obtain training sets, we first employ the bagging technique. Then, we use the instances from the out-of-bag set as the validation set. In general, we consider the base classifier combination with the highest validation-set score to be the best model on the bagging subset and add it to the pool of models. The classification performance of the multi-armed bandit heterogeneous ensemble model is then assessed on 30 real-world imbalanced data sets gathered from UCI, KEEL, and HDDT. The experimental results demonstrate that, under the two assessment metrics of AUC and Kappa, the proposed heterogeneous ensemble model performs competitively with nine other state-of-the-art ensemble learning methods. At the same time, the experimental findings are confirmed by the statistical results of the Friedman test and Holm's post-hoc test.
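The selection step described above, repeatedly picking a (resampling method, base classifier) combination and scoring it on the out-of-bag validation set, has the shape of a classic multi-armed bandit loop. The epsilon-greedy policy, the reward definition, and the toy Gaussian rewards below are illustrative assumptions, not the paper's exact framework.

```python
import random

def select_arm(mean_reward, counts, epsilon, rng):
    """Epsilon-greedy step over candidate (resampler, classifier) pairs."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(counts))                         # explore
    return max(range(len(counts)), key=lambda a: mean_reward[a])  # exploit

def run_bandit(pull, n_arms, rounds=500, epsilon=0.1, seed=0):
    """Return the arm with the best running mean reward after `rounds` pulls."""
    rng = random.Random(seed)
    mean_reward, counts = [0.0] * n_arms, [0] * n_arms
    for _ in range(rounds):
        a = select_arm(mean_reward, counts, epsilon, rng)
        r = pull(a)                        # e.g. validation score on the OOB set
        counts[a] += 1
        mean_reward[a] += (r - mean_reward[a]) / counts[a]
    return max(range(n_arms), key=lambda a: mean_reward[a])

# Toy rewards: arm 2 (some hypothetical resampler+classifier pair) is truly best.
noise = random.Random(1)
best = run_bandit(lambda a: noise.gauss([0.60, 0.70, 0.80][a], 0.05), n_arms=3)
```

Reusing the out-of-bag instances as the bandit's validation set means the exploration consumes no extra labeled data beyond what bagging already sets aside.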

14.
Consider the problem of testing whether a context-free grammar is an (m, n)-BRC grammar. Let ‖G‖ denote the size of the grammar G. It is first shown that G is (m, n)-BRC if and only if G is (m₀, n)-BRC, where m₀ = 4·‖G‖²·(n + 1)². Deterministic and nondeterministic algorithms are then presented for testing whether an arbitrary grammar has the (m, n)-BRC property for fixed values of m and n. The running times of both algorithms are low-degree polynomials which are independent of m.

15.
Because prior information about the data distribution, parameters, and class labels is lacking, the correctness of some base clusterings cannot be guaranteed, which degrades the performance of cluster ensembles; moreover, different base clustering decisions contribute differently to the ensemble, and treating them equally limits the quality of the ensemble result. To address this problem, a selective K-means clustering ensemble algorithm based on random sampling (RS-KMCE) is proposed. The random sampling strategy in this algorithm prevents the selection of base clustering decisions from falling into local minima, and a combined evaluation score defined in terms of diversity and correctness helps the algorithm converge quickly to a good subset of base clusterings, improving ensemble performance. Experimental results on two synthetic databases and four UCI databases show that the clustering performance of RS-KMCE is superior to the K-means algorithm, the K-means clustering ensemble (KMCE), and the Bagging-based selective K-means clustering ensemble (BA-KMCE).

16.
The decision tree method has grown fast in the past two decades and its performance in classification is promising. The tree-based ensemble algorithms have been used to improve the performance of an individual tree. In this study, we compared four basic ensemble methods, that is, bagging tree, random forest, AdaBoost tree and AdaBoost random tree in terms of the tree size, ensemble size, band selection (BS), random feature selection, classification accuracy and efficiency in ecological zone classification in Clark County, Nevada, through multi-temporal multi-source remote-sensing data. Furthermore, two BS schemes based on feature importance of the bagging tree and AdaBoost tree were also considered and compared. We conclude that random forest or AdaBoost random tree can achieve accuracies at least as high as bagging tree or AdaBoost tree with higher efficiency; and although bagging tree and random forest can be more efficient, AdaBoost tree and AdaBoost random tree can provide a significantly higher accuracy. All ensemble methods provided significantly higher accuracies than the single decision tree. Finally, our results showed that the classification accuracy could increase dramatically by combining multi-temporal and multi-source data set.

17.
In this paper we present schemes for reconfiguration of embedded task graphs in hypercubes. Previous results, which use either a fault-tolerant embedding or an automorphism approach, can be expensive in terms of either the required number of spare nodes or the reconfiguration time. Using the free-dimension concept, we combine the above two approaches in schemes which can tolerate about n faulty nodes in the worst case while keeping task-migration time small. With an expansion-2 initial embedding, three distributed reconfiguration schemes are presented in this paper. The first scheme, applied to chains and rings, can tolerate any f ≤ n − 2 faulty nodes in an n-dimensional hypercube. The second and third schemes are applied to meshes or tori. For a mesh or torus of size 2^{m_1} × ··· × 2^{m_d}, the second scheme can tolerate any f ≤ m_i − 1 faulty nodes, where m_i is the largest dimension of the mesh and n = m_1 + ··· + m_d + 1. By embedding two copies of the mesh or torus in the cube, the third scheme can tolerate any f ≤ n − 1 faulty nodes, with the dilation of the embedding after reconfiguration degraded to 2. The third scheme is quite general and can be applied to any task graph.

18.
Constructing support vector machine ensemble (cited 30 times: 0 self-citations, 30 by others)
Hyun-Chul  Shaoning  Hong-Mo  Daijin  Sung 《Pattern recognition》2003,36(12):2757-2767
Even though the support vector machine (SVM) has been proposed as a learner with good generalization performance, the classification results of practically implemented SVMs are often far from the theoretically expected level, because implementations rely on approximation algorithms to cope with the high time and space complexity. To improve the limited classification performance of a real SVM, we propose to use an SVM ensemble with bagging (bootstrap aggregating) or boosting. In bagging, each individual SVM is trained independently on training samples randomly chosen via a bootstrap technique. In boosting, each individual SVM is trained on training samples chosen according to a sample probability distribution that is updated in proportion to the error on each sample. In both bagging and boosting, the trained individual SVMs are aggregated to make a collective decision in several ways, such as majority voting, least-squares estimation-based weighting, and a double-layer hierarchical combination. Various simulation results on IRIS data classification, hand-written digit recognition, and fraud detection show that the proposed SVM ensemble with bagging or boosting greatly outperforms a single SVM in terms of classification accuracy.

19.
The human visual system (HVS) is quite adept at swiftly detecting objects of interest in complex visual scenes. Simulating the human visual system to detect visually salient regions of an image has been one of the active topics in computer vision. Inspired by the random-sampling-based bagging ensemble learning method, an ensemble dictionary learning (EDL) framework for saliency detection is proposed in this paper. Instead of learning a universal dictionary, which requires a large number of training samples collected from natural images, multiple over-complete dictionaries are independently learned with a small portion of randomly selected samples from the input image itself, resulting in more flexible multiple sparse representations for each of the image patches. To boost the distinctness of salient patches from background regions, we present a reconstruction-residual-based method for dictionary atom reduction. Meanwhile, with the obtained multiple probabilistic saliency responses for each of the patches, their combination is finally carried out from the probabilistic perspective to achieve better predictive performance on salient regions. Experimental results on several open test datasets and some natural images demonstrate that the proposed EDL for saliency detection is highly competitive with some existing state-of-the-art algorithms.

20.
In this article, we propose a multivariate synthetic double sampling T² chart to monitor the mean vector of a multivariate process. The proposed chart combines the double sampling (DS) T² chart and the conforming run length (CRL) chart. On the whole, the proposed chart performs better than its standard counterparts, namely the Hotelling's T², DS T², and synthetic T² charts, in terms of the average run length (ARL) and average number of observations to sample (ANOS). The proposed chart also outperforms the multivariate exponentially weighted moving average (MEWMA) chart for moderate and large shifts, but the latter is more sensitive than the former towards small shifts. For a variable-sample-size chart like the synthetic DS T² chart, ANOS is a more meaningful performance measure than ARL: ANOS relates to the actual number of observations sampled, whereas ARL merely counts the number of sampling stages taken. Interpretation based on ARL is more complicated, since either n₁ or n₁ + n₂ observations are taken in each sampling stage.

