Machine Learning, Volume 107, Issue 8–10, pp 1303–1331

Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes

  • François Petitjean
  • Wray Buntine
  • Geoffrey I. Webb
  • Nayyar Zaidi
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2018 Journal Track


This paper introduces a novel parameter-estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest for both 0–1 loss and RMSE, albeit only on categorical datasets. As data assets become larger, entering the hyped world of “big”, efficient and accurate classification requires three main elements: (1) classifiers with low bias that can capture the fine detail of large datasets; (2) out-of-core learners that can learn from data without having to hold it all in main memory; and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is lowered to accurately model classification tasks, the accuracy of their parameter estimates also falls, because each parameter is estimated from ever-decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even with lower bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.
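The hierarchical idea behind this smoothing can be illustrated with a much-simplified sketch. The paper's actual method performs collapsed Gibbs sampling over an HDP hierarchy; the function below, its name, the data layout, and the single concentration parameter `m` are assumptions made purely for illustration. Each conditional probability estimate is shrunk recursively toward the estimate conditioned on one fewer parent, so cells of the probability table that see little data borrow strength from their coarser ancestors:

```python
from collections import Counter

def backoff_estimate(x, parents, data, values, m=1.0):
    """Hierarchical back-off estimate of P(x | parents).

    data    : list of (x_value, full_parent_tuple) observations
    parents : the queried parent assignment; assumed to be a prefix of
              each stored parent tuple, backed off by dropping the last
              parent first (an illustrative simplification of the HDP tree)
    m       : concentration-like parameter controlling shrinkage strength
    """
    if not parents:
        # Base of the hierarchy: smoothed marginal over the class values.
        counts = Counter(x_val for x_val, _ in data)
        n = len(data)
        return (counts[x] + m / len(values)) / (n + m)
    # Counts restricted to rows matching the full parent assignment.
    matching = [x_val for x_val, p in data if p[:len(parents)] == tuple(parents)]
    # Recursive prior: the same estimate with one fewer conditioning parent.
    prior = backoff_estimate(x, parents[:-1], data, values, m)
    n = len(matching)
    n_x = sum(1 for v in matching if v == x)
    # Shrink the empirical conditional frequency toward the coarser prior;
    # the result sums to 1 over `values` at every level of the hierarchy.
    return (n_x + m * prior) / (n + m)

# Example: both training rows with parents (0, 0) are 'a', so the smoothed
# estimate lies well above 0.5 but below the raw empirical value of 1.0.
data = [('a', (0, 0)), ('a', (0, 0)), ('b', (0, 1)), ('b', (1, 1))]
print(backoff_estimate('a', (0, 0), data, values=['a', 'b']))  # → 0.875
```

With only two matching observations, the raw estimate of 1.0 would be badly overconfident; the recursion pulls it toward the one-parent estimate (0.625), which is itself shrunk toward the marginal (0.5). This is the behaviour the HDP machinery delivers in a principled way, with the shrinkage weights learned rather than fixed by a single `m`.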


Bayesian network · Parameter estimation · Graphical models · Dirichlet processes · Smoothing · Classification



This work was supported by the Australian Research Council under awards DE170100037 and DP140100087. The authors would like to thank Joan Capdevila Pujol and the anonymous reviewers for helping us strengthen the original manuscript.



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Faculty of Information Technology, Monash University, Clayton, Australia
