
Part of the book series: Springer Handbooks (SHB)

Abstract

Probabilistic methods are at the heart of machine learning. This chapter shows the links between core principles of information theory and probabilistic methods, with a short overview of historical and current examples of unsupervised and inferential models. Probabilistic models are introduced as a powerful idiom to describe the world, using random variables as building blocks held together by probabilistic relationships. The chapter discusses how such probabilistic interactions can be mapped to directed and undirected graph structures, namely Bayesian and Markov networks. We show how these networks are subsumed by the broader class of probabilistic graphical models, a general framework that provides concepts and methodological tools to encode, manipulate, and process probabilistic knowledge in a computationally efficient way. The chapter then introduces, in more detail, two topical methodologies that are central to probabilistic modeling in machine learning. First, it discusses latent variable models, a probabilistic approach to capturing complex relationships between a large number of observable and measurable events (data, in general), under the assumption that these are generated by an unknown, nonobservable process. We show how the parameters of a probabilistic model involving such nonobservable information can be efficiently estimated using the concepts underlying the expectation–maximization algorithm. Second, the chapter introduces a notable example of a latent variable model that is of particular relevance for representing the time evolution of sequence data: the hidden Markov model. The chapter ends with a discussion of advanced approaches for modeling complex data-generating processes comprising nontrivial probabilistic interactions between latent variables and observed information.
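As a concrete illustration of the latent variable estimation the abstract refers to, the sketch below applies the expectation–maximization (EM) idea to a two-component one-dimensional Gaussian mixture, where the unobserved component assignment of each data point plays the role of the latent variable. This is a minimal illustrative example, not code from the chapter; the function name, initialization, and all numerical values are our own assumptions.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Fit a two-component 1-D Gaussian mixture to data x with EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initial guesses for mixing weights, means, and variances.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the expected assignments.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Usage: data drawn from two Gaussians; EM recovers weights, means, and variances.
x = np.concatenate([np.random.normal(-2.0, 1.0, 300),
                    np.random.normal(3.0, 0.5, 700)])
print(em_gmm_1d(x))
```

The abstract also singles out the hidden Markov model for sequence data. A correspondingly small sketch of the forward recursion, which computes the likelihood of an observation sequence under a discrete HMM, is given next; again, the parameter values are made up purely for illustration and do not come from the chapter.

```python
def hmm_forward(pi0, A, B, obs):
    """Forward algorithm for a discrete HMM with initial distribution pi0,
    transition matrix A (A[i, j] = P(state j | state i)) and emission matrix
    B (B[i, k] = P(symbol k | state i))."""
    alpha = pi0 * B[:, obs[0]]          # alpha_1(i) = pi0(i) * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()                  # P(o_1, ..., o_T)

# Toy example with two hidden states and two observation symbols.
pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward(pi0, A, B, obs=[0, 1, 0]))
```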


Abbreviations

ARD: automatic relevance determination
BSS: blind source separation
EM: expectation maximization
FA: factor analysis
HMM: hidden Markov model
i.i.d.: independent, identically distributed
ICA: independent component analysis
IO-HMM: input/output hidden Markov model
KL: Kullback–Leibler
LDA: latent Dirichlet allocation
MAP: maximum a posteriori
ML: maximum likelihood
MLP: multilayer perceptron
MPE: most probable explanation
NMF: nonnegative matrix and tensor factorization
PCA: principal component analysis
pdf: probability density function
pLSA: probabilistic latent semantic analysis
SCNG: sparse coding neural gas
SNE: stochastic neighbor embedding
SVM: support vector machine
VB: variational Bayes


Author information

Correspondence to Davide Bacciu.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bacciu, D., Lisboa, P.J., Sperduti, A., Villmann, T. (2015). Probabilistic Modeling in Machine Learning. In: Kacprzyk, J., Pedrycz, W. (eds) Springer Handbook of Computational Intelligence. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43505-2_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-43505-2_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43504-5

  • Online ISBN: 978-3-662-43505-2

  • eBook Packages: Engineering
