Robust Bayesian Linear Classifier Ensembles

  • Jesús Cerquides
  • Ramon López de Mántaras
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


Ensemble classifiers combine the classification results of several classifiers. Simple ensemble methods, such as uniform averaging over a set of models, usually provide an improvement over selecting the single best model. Probabilistic classifiers usually restrict the set of models that can be learnt in order to lower computational cost. In these restricted spaces, where incorrect modeling assumptions may be made, uniform averaging sometimes performs even better than Bayesian model averaging. Linear mixtures over sets of models provide a space that includes uniform averaging as a particular case. We develop two algorithms for learning maximum a posteriori weights for linear mixtures, one based on expectation maximization and one on constrained optimization. We provide a nontrivial example of the utility of these two algorithms by applying them to one-dependence estimators. We develop the conjugate distribution for one-dependence estimators and empirically show that uniform averaging is clearly superior to Bayesian model averaging for this family of models. We then show empirically that the maximum a posteriori linear mixture weights improve accuracy significantly over uniform aggregation.
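
The abstract does not give the update equations, so the following is only a minimal sketch of how maximum a posteriori weights for a linear mixture of fixed models could be learnt with expectation maximization. It uses the textbook EM update for mixing weights and assumes a symmetric Dirichlet prior with parameter alpha >= 1 and strictly positive model probabilities; it is not the authors' algorithm. The function names, the prior, and the toy probabilities are hypothetical and chosen for illustration.

```python
import math


def map_mixture_weights(probs, alpha=1.0, iters=200, tol=1e-10):
    """EM for the weights of a linear mixture of fixed models.

    probs[i][k] is the probability that model k assigns to the true
    class of training example i (assumed strictly positive).
    alpha is the parameter of an assumed symmetric Dirichlet prior
    (alpha >= 1); alpha == 1 reduces to maximum likelihood.
    """
    n_models = len(probs[0])
    # Start from uniform averaging, a particular case of the mixture.
    weights = [1.0 / n_models] * n_models
    for _ in range(iters):
        # The prior contributes (alpha - 1) pseudo-counts per model.
        counts = [alpha - 1.0] * n_models
        for row in probs:
            mix = sum(w * p for w, p in zip(weights, row))
            # E-step: responsibility of each model for this example.
            for k, (w, p) in enumerate(zip(weights, row)):
                counts[k] += w * p / mix
        # M-step: renormalise expected counts into new weights.
        total = sum(counts)
        new_weights = [c / total for c in counts]
        delta = max(abs(a - b) for a, b in zip(weights, new_weights))
        weights = new_weights
        if delta < tol:
            break
    return weights


def log_likelihood(probs, weights):
    """Conditional log-likelihood of the true classes under the mixture."""
    return sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in probs
    )


if __name__ == "__main__":
    # Toy data: two models, four examples (made-up probabilities).
    toy = [[0.9, 0.5], [0.8, 0.4], [0.7, 0.6], [0.2, 0.7]]
    learnt = map_mixture_weights(toy)
    print("weights:", learnt)
    print("uniform log-likelihood:", log_likelihood(toy, [0.5, 0.5]))
    print("learnt log-likelihood:", log_likelihood(toy, learnt))
```

Because the iteration starts at the uniform weights and EM never decreases its objective, the learnt weights fit the training data at least as well as uniform averaging under this objective. A larger alpha pulls the weights back toward uniform, which is one way to read uniform averaging as the strong-prior limit of the mixture.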



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jesús Cerquides (1)
  • Ramon López de Mántaras (2)
  1. Dept. de Matemática Aplicada i Análisi, Universitat de Barcelona
  2. Artificial Intelligence Research Institute (IIIA), Spanish Council for Scientific Research (CSIC)
