New Generation Computing

Volume 35, Issue 1, pp 5–27

On Model Selection, Bayesian Networks, and the Fisher Information Integral

Special Feature

Abstract

We study BIC-like model selection criteria and, in particular, their refinements that include a constant term involving the Fisher information matrix. We perform numerical simulations that enable increasingly accurate approximation of this constant in the case of Bayesian networks. We observe that for complex Bayesian network models, the constant term is a negative number with a very large absolute value that dominates the other terms for small and moderate sample sizes. For networks with a fixed number of parameters, d, the leading term in the complexity penalty, which is proportional to d, is the same. However, as we show, the constant term can vary significantly depending on the network structure even when the number of parameters is fixed. Based on our experiments, we conjecture that the distribution of the nodes’ outdegrees is a key factor. Furthermore, we demonstrate that the constant term can have a dramatic effect on model selection performance for small sample sizes.
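For context, the constant in question is the one appearing in Rissanen's (1996) asymptotic expansion of the normalized maximum likelihood (NML) code length, often called the Fisher information approximation (FIA). In standard notation, for a model with d parameters, sample size n, maximum likelihood estimate \hat\theta, and Fisher information matrix I(\theta), the expansion reads

-\log p_{\mathrm{NML}}(x^n) = -\log p\bigl(x^n \mid \hat\theta(x^n)\bigr) + \frac{d}{2}\log\frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\,\mathrm{d}\theta + o(1).

The first two terms give the familiar BIC-style penalty; the third is the constant term studied here.

As a toy illustration of how such a constant can be approximated by simulation, the sketch below estimates the log Fisher information integral by importance sampling for a single K-category multinomial, where \sqrt{\det I(\theta)} = \prod_k \theta_k^{-1/2} and the integral has the closed form \pi^{K/2}/\Gamma(K/2) to check against. This is a minimal sketch of the general Monte Carlo idea under those assumptions, not the authors' procedure for Bayesian networks; the function names are hypothetical.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def log_fisher_integral_mc(K, n_samples=200_000, alpha=0.75, seed=0):
        # Importance-sampling estimate of log ∫ sqrt(det I(θ)) dθ over the simplex,
        # with sqrt(det I(θ)) = ∏_k θ_k^(-1/2) for a K-category multinomial.
        # Proposal: Dirichlet(alpha, ..., alpha) with alpha < 1, which keeps the
        # weight variance finite near the simplex boundary.
        rng = np.random.default_rng(seed)
        theta = rng.dirichlet(np.full(K, alpha), size=n_samples)
        theta = np.clip(theta, 1e-300, None)                   # guard against underflow to 0
        log_q_const = gammaln(K * alpha) - K * gammaln(alpha)  # log normalizer of the proposal
        # log weight = log f(θ) - log q(θ), with f(θ) = ∏_k θ_k^(-1/2)
        log_w = (-0.5 - (alpha - 1.0)) * np.log(theta).sum(axis=1) - log_q_const
        return logsumexp(log_w) - np.log(n_samples)            # log of the average weight

    def log_fisher_integral_exact(K):
        # Closed form: ∫ ∏_k θ_k^(-1/2) dθ = π^(K/2) / Γ(K/2)
        return 0.5 * K * np.log(np.pi) - gammaln(K / 2)

    # The two values should agree to roughly two decimals, e.g. for K = 4:
    print(log_fisher_integral_mc(4), log_fisher_integral_exact(4))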

Keywords

Model selection · Bayesian networks · Fisher information approximation · NML · BIC


Copyright information

© Ohmsha, Ltd. and Springer Japan 2016

Authors and Affiliations

1. Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
