Bagging Using Statistical Queries

  • Anneleen Van Assche
  • Hendrik Blockeel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)

Abstract

Bagging is an ensemble method that relies on random resampling of a data set to construct models for the ensemble. When only statistics about the data are available, but no individual examples, the straightforward resampling procedure cannot be implemented. The question is then whether bagging can somehow be simulated. In this paper we propose a method that, instead of computing certain heuristics (such as information gain) from a resampled version of the data, estimates the probability distribution of these heuristics under random resampling, and then samples from this distribution. The resulting method is not entirely equivalent to bagging because it ignores certain dependencies among statistics. Nevertheless, experiments show that this “simulated bagging” yields accuracy similar to that of bagging, while being equally efficient and more generally applicable.
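To make the idea concrete, the following is a minimal sketch in Python with NumPy, under the assumption that the heuristic is information gain computed from an attribute-value by class contingency table of counts. Under resampling n examples with replacement, each cell count c is marginally Binomial(n, c/n), so a "simulated" bootstrap can draw new counts directly from these distributions instead of resampling examples; sampling the cells independently ignores the dependencies among the statistics, which is the approximation the abstract mentions. This is not the authors' implementation, only an illustration of the resampling-free idea; the example table is hypothetical.

    import numpy as np

    def simulated_bootstrap_counts(table, rng):
        # Under resampling n examples with replacement, each cell count c
        # of the contingency table is marginally Binomial(n, c / n), so we
        # draw new counts directly instead of resampling examples. Sampling
        # the cells independently ignores the dependencies among the counts
        # (e.g. they need no longer sum exactly to n).
        table = np.asarray(table)
        n = table.sum()
        return rng.binomial(n, table / n)

    def information_gain(table):
        # Information gain of a split, computed from a contingency table
        # (rows = attribute values, columns = classes).
        def entropy(counts):
            p = counts[counts > 0] / counts.sum()
            return -(p * np.log2(p)).sum()
        n = table.sum()
        gain = entropy(table.sum(axis=0))         # class entropy before the split
        for row in table:
            gain -= row.sum() / n * entropy(row)  # weighted entropy after the split
        return gain

    rng = np.random.default_rng(0)
    table = np.array([[30, 10], [5, 25], [10, 20]])  # hypothetical counts
    for _ in range(3):
        print(information_gain(simulated_bootstrap_counts(table, rng)))

Each pass through the loop yields the heuristic value for one simulated resample; a learner working from statistical queries can use such sampled values in place of heuristics computed from an explicitly resampled data set.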

Keywords

Information Gain · Itemset Frequency · Decision Tree Algorithm · Statistical Query · Random Variate Generation

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Anneleen Van Assche (1)
  • Hendrik Blockeel (1)

  1. Computer Science Department, Katholieke Universiteit Leuven, Leuven, Belgium
