Building bagging on critical instances

Authors: Li Guo, Samia Boukir, Alexandre Aussem

Affiliation:
1. G&E Laboratory (EA 4592), Bordeaux INP, Pessac, France; Atos Worldline, Seclin, France
2. G&E Laboratory (EA 4592), Bordeaux INP, Pessac, France
3. LIRIS (UMR CNRS 5205), University of Lyon, Villeurbanne, France

Abstract: Ensemble methods are a powerful data mining paradigm that builds a classification model by combining multiple diverse component learners. Bagging is one of the most successful ensemble methods: it trains classifiers on bootstrap samples of the training set and aggregates their predictions into a single classifier. However, in bagging, the bootstrapped training sets become more and more similar as redundancy increases. Besides redundancy, training sets are usually subject to noise and may be class-imbalanced, so each training instance has a different impact on the learning process. This paper explores properties of the ensemble margin and their use in improving the performance of bagging. We introduce a new margin-based measure of the importance of training data in learning, and then propose a new bagging method that concentrates on critical instances. This method is more accurate than bagging and more robust than boosting; compared to bagging, it reduces the bias while generally keeping the same variance. Our findings suggest that (a) examples with low margins tend to be more critical to classifier performance; (b) examples with higher margins tend to be more redundant; and (c) misclassified examples with high margins tend to be noisy. Experimental results on 15 varied data sets show that the generalization error of bagging can be reduced by up to 2.5% and its resilience to noise strengthened by iteratively removing both typical and noisy training instances, reducing the training set size by up to 75%.

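To make the margin notion concrete, here is a minimal Python sketch of one common supervised ensemble-margin definition: votes for an instance's true class minus the strongest rival class, normalized by ensemble size. The function name and exact normalization are illustrative assumptions, not the paper's implementation; low margins flag critical instances, high positive margins flag redundant ones, and high margins on misclassified labels suggest label noise.

```python
from collections import Counter

def ensemble_margin(votes, true_label):
    """Supervised ensemble margin in [-1, 1]:
    (votes for the true class - max votes for any other class) / total votes.
    Low margin = critical instance; negative margin = misclassified by the ensemble."""
    counts = Counter(votes)
    v_true = counts.get(true_label, 0)
    v_rival = max((v for c, v in counts.items() if c != true_label), default=0)
    return (v_true - v_rival) / len(votes)

# Example: 10 base classifiers vote on one training instance
votes = ["a"] * 7 + ["b"] * 3
print(ensemble_margin(votes, "a"))  # 0.4 -> moderately confident, more redundant
print(ensemble_margin(votes, "b"))  # -0.4 -> misclassified by the ensemble
```

In a bagging ensemble, `votes` would collect each bootstrap-trained classifier's prediction for the instance; ranking the training set by this margin is one way to realize the instance-importance ordering the abstract describes.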
Keywords: Bagging; ensemble; instance importance; instance selection; margin