Advances in Data Analysis and Classification, Volume 10, Issue 3, pp 305–326

Marginal and simultaneous predictive classification using stratified graphical models

  • Henrik Nyman (corresponding author)
  • Jie Xiong
  • Johan Pensar
  • Jukka Corander
Regular Article


An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have recently been considered, and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those of unknown origin. Here we extend the theoretical results to predictive classifiers that acknowledge feature dependencies either through graphical models or through sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based solely on conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.
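As a rough illustration of the baseline the abstract refers to, the following sketch classifies a single test item with a marginal predictive rule under conditionally independent categorical features. It is not the authors' implementation and does not cover graphical or stratified graphical models. The symmetric Dirichlet prior with hyperparameter `alpha`, the uniform class prior, the toy data, and all function names are assumptions made for illustration. Under these assumptions the posterior predictive probability of value v for feature j in class c is (n_cjv + alpha) / (n_c + alpha * K_j), where K_j is the number of possible values of feature j.

```python
from collections import Counter
from math import log


def fit_counts(X, y):
    """Count feature values per class: counts[c][j][v] (hypothetical helper)."""
    counts = {}
    class_sizes = Counter(y)
    for row, c in zip(X, y):
        per_feature = counts.setdefault(c, [Counter() for _ in row])
        for j, v in enumerate(row):
            per_feature[j][v] += 1
    return counts, class_sizes


def log_predictive(x, c, counts, class_sizes, n_values, alpha=1.0):
    """Log posterior predictive of x in class c, assuming independent
    features and a symmetric Dirichlet(alpha) prior on each feature."""
    total = 0.0
    for j, v in enumerate(x):
        num = counts[c][j][v] + alpha                   # n_cjv + alpha
        den = class_sizes[c] + alpha * n_values[j]      # n_c + alpha * K_j
        total += log(num / den)
    return total


def classify_marginal(x, counts, class_sizes, n_values, alpha=1.0):
    """Assign one test item to the class with the highest predictive
    probability (uniform class prior assumed)."""
    return max(
        sorted(class_sizes),
        key=lambda c: log_predictive(x, c, counts, class_sizes, n_values, alpha),
    )


# Toy training data (made up): two binary features, two classes.
X_train = [[0, 0], [0, 1], [1, 1], [1, 0]]
y_train = ["a", "a", "b", "b"]
n_values = [2, 2]  # number of possible values per feature

counts, class_sizes = fit_counts(X_train, y_train)
label = classify_marginal([0, 0], counts, class_sizes, n_values)
```

A simultaneous classifier would instead score joint labelings of all test items at once, so that the test items also inform the predictive distribution; the sketch above handles each test item separately.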


Keywords

Classification, Context-specific independence, Graphical model, Predictive inference

Mathematics Subject Classification

62-09 62H30 62F15 



The authors would like to thank the editor and the anonymous reviewers for their constructive comments and suggestions on the original version of this paper. H.N. and J.P. were supported by the Foundation of Åbo Akademi University, as part of the grant for the Center of Excellence in Optimization and Systems Engineering. J.P. was also supported by the Magnus Ehrnrooth foundation. J.X. and J.C. were supported by the ERC Grant No. 239784 and Academy of Finland Grant No. 251170. J.X. was also supported by the FDPSS graduate school.

Supplementary material

11634_2015_199_MOESM1_ESM.pdf (47 kb)
Supplementary material 1 (pdf 47 KB)
11634_2015_199_MOESM2_ESM.xls (290 kb)
Supplementary material 2 (xls 291 KB)



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Henrik Nyman (1), corresponding author
  • Jie Xiong (2)
  • Johan Pensar (1)
  • Jukka Corander (2)
  1. Department of Mathematics and Statistics, Åbo Akademi University, Turku, Finland
  2. Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
