Improving on Bagging with Input Smearing

  • Eibe Frank
  • Bernhard Pfahringer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


Abstract

Bagging is an ensemble learning method that has proved to be a useful tool in the arsenal of machine learning practitioners. Commonly applied in conjunction with decision tree learners to build an ensemble of decision trees, it often leads to reduced prediction error compared to using a single tree. A single tree is built from a training set of size N. Bagging is based on the idea that, ideally, we would like to eliminate the variance due to a particular training set by combining trees built from all possible training sets of size N. However, in practice only one training set is available, and bagging simulates this platonic method by sampling with replacement from the original training data to form new training sets. In this paper we pursue the idea of sampling from a kernel density estimator of the underlying distribution to form new training sets, in addition to sampling from the data itself. This can be viewed as “smearing out” the resampled training data to generate new datasets, and the amount of “smear” is controlled by a parameter. We show that the resulting method, called “input smearing”, can lead to improved results when compared to bagging. We present results for both classification and regression problems.
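To make the idea concrete: under a Gaussian-kernel reading of the abstract's kernel density estimator, sampling from it amounts to resampling the training data and adding Gaussian noise to the numeric inputs. The sketch below is a minimal illustration of that reading, not the authors' implementation; the function names and the default value of the `smear` parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def input_smearing_ensemble(X, y, n_trees=50, smear=0.3, random_state=None):
    """Train a bagged ensemble of trees on "smeared" bootstrap samples.

    Each member is fit on a bootstrap sample whose numeric inputs are
    perturbed with Gaussian noise scaled, per attribute, by the attribute's
    standard deviation times `smear` (standing in for the paper's
    smearing parameter).
    """
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    noise_scale = smear * X.std(axis=0)        # per-attribute noise level
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # bootstrap sample, as in bagging
        X_smeared = X[idx] + rng.normal(0.0, noise_scale, size=(n, d))
        trees.append(DecisionTreeClassifier().fit(X_smeared, y[idx]))
    return trees


def ensemble_predict(trees, X):
    """Combine member predictions by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    predictions = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)
```

For regression, the same construction applies with a regression tree learner and averaging of the member predictions in place of the majority vote.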


Keywords: Ensemble Member, Base Learner, Relative Bias, Minority Class, Ensemble Generation

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Eibe Frank
  • Bernhard Pfahringer

  Department of Computer Science, University of Waikato, Hamilton, New Zealand
