Accurate parameter estimation for Bayesian network classifiers using hierarchical Dirichlet processes
This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result is that improved parameter estimation allows BNCs to outperform leading learning methods such as random forest on both 0–1 loss and RMSE, albeit only on categorical datasets. As data assets become larger, entering the hyped world of “big”, efficient and accurate classification requires three main elements: (1) low-bias classifiers that can capture the fine detail of large datasets; (2) out-of-core learners that can learn from data without holding it all in main memory; and (3) models that can classify new data very efficiently. The latest BNCs satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is lowered to model classification tasks more accurately, the accuracy of the parameter estimates also falls, because each parameter is estimated from ever-decreasing quantities of data. In this paper, we introduce the use of HDPs for accurate BNC parameter estimation even at low bias. We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with random forest in terms of prediction, while keeping the out-of-core capability and superior classification time.
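To see why low bias strains parameter estimation: a node with k parents, each taking v values, has a conditional probability table with v^k columns, so on average each column is estimated from n/v^k of the n training examples. The sketch below illustrates the intuition the paper exploits, namely that an estimate for a sparse full context can borrow strength from progressively shorter contexts. This recursive back-off is only a caricature of the HDP machinery; the function `backoff_estimate`, the fixed pseudo-count `m`, and the toy data are illustrative assumptions, whereas an HDP infers the appropriate degree of smoothing at every level of the hierarchy rather than fixing it.

```python
from collections import Counter, defaultdict

# Toy data: each record is (x, parent_context), where the context is the
# tuple of parent values for node x in the Bayesian network classifier.
data = [("a", ("p", "q")), ("a", ("p", "q")), ("b", ("p", "r")),
        ("a", ("s", "q")), ("b", ("s", "r")), ("b", ("s", "r"))]

values = sorted({x for x, _ in data})  # domain of x

# Count x against every prefix of the parent context, so that each level
# of the hierarchy (full context, shorter context, ..., empty) has a table.
counts = defaultdict(Counter)
for x, ctx in data:
    for k in range(len(ctx) + 1):
        counts[ctx[:k]][x] += 1

def backoff_estimate(x, ctx, m=1.0):
    """P(x | ctx), shrunk recursively toward the estimate for the context
    with the last parent dropped. The fixed pseudo-count m stands in for
    the concentration parameter an HDP would infer at each level."""
    if not ctx:  # root of the hierarchy: Laplace-smoothed marginal
        n = sum(counts[()].values())
        return (counts[()][x] + m / len(values)) / (n + m)
    n = sum(counts[ctx].values())
    prior = backoff_estimate(x, ctx[:-1], m)
    return (counts[ctx][x] + m * prior) / (n + m)

# The full context ("s", "q") occurs only once, yet the estimate stays
# reasonable because it borrows strength from ("s",) and ().
print(backoff_estimate("b", ("s", "q")))  # ~0.31 rather than a raw 0 or 1
```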
Keywords: Bayesian network · Parameter estimation · Graphical models · Dirichlet processes · Smoothing · Classification
This work was supported by the Australian Research Council under awards DE170100037 and DP140100087. The authors thank Joan Capdevila Pujol and the anonymous reviewers for their help in strengthening the original manuscript.