Abstract
Probabilistic methods are at the heart of machine learning. This chapter shows links between core principles of information theory and probabilistic methods, with a short overview of historical and current examples of unsupervised and inferential models. Probabilistic models are introduced as a powerful idiom to describe the world, using random variables as building blocks held together by probabilistic relationships. The chapter discusses how such probabilistic interactions can be mapped to directed and undirected graph structures, namely Bayesian and Markov networks. We show how these networks are subsumed by the broader class of probabilistic graphical models, a general framework that provides concepts and methodological tools to encode, manipulate, and process probabilistic knowledge in a computationally efficient way. The chapter then introduces, in more detail, two topical methodologies that are central to probabilistic modeling in machine learning. First, it discusses latent variable models, a probabilistic approach for capturing complex relationships among a large number of observable and measurable events (data, in general), under the assumption that these are generated by an unknown, nonobservable process. We show how the parameters of a probabilistic model involving such nonobservable information can be efficiently estimated using the concepts underlying the expectation–maximization algorithm. Second, the chapter introduces a notable example of a latent variable model that is of particular relevance for representing the time evolution of sequence data: the hidden Markov model. The chapter ends with a discussion on advanced approaches for modeling complex data-generating processes comprising nontrivial probabilistic interactions between latent variables and observed information.
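To make the expectation–maximization idea mentioned above concrete, the following is a minimal sketch (not taken from the chapter) of EM applied to a two-component, one-dimensional Gaussian mixture, where the component assignment of each data point is the unobserved latent variable; the function names `em_gmm` and `gaussian_pdf` and the initialization choices are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Univariate Gaussian density evaluated element-wise on x.
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=100):
    # Rough initialization of mixing weights, means, and variances (assumed values).
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point,
        # i.e., the expected value of the latent assignment variable.
        resp = np.stack([pi[k] * gaussian_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft (expected) assignments.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Usage on synthetic data drawn from two known Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm(data))
```

Each iteration alternates a probabilistic inference step over the latent assignments with a maximum-likelihood update of the parameters, which is the general EM pattern the chapter develops for latent variable models and, in its sequential form, for hidden Markov models.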
Abbreviations
- ARD: automatic relevance determination
- BSS: blind source separation
- EM: expectation maximization
- FA: factor analysis
- HMM: hidden Markov model
- i.i.d.: independent and identically distributed
- ICA: independent component analysis
- IO-HMM: input/output hidden Markov model
- KL: Kullback–Leibler
- LDA: latent Dirichlet allocation
- MAP: maximum a posteriori
- ML: maximum likelihood
- MLP: multilayer perceptron
- MPE: most probable explanation
- NMF: nonnegative matrix and tensor factorization
- PCA: principal component analysis
- pdf: probability density function
- pLSA: probabilistic latent semantic analysis
- SCNG: sparse coding neural gas
- SNE: stochastic neighbor embedding
- SVM: support vector machine
- VB: variational Bayes
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bacciu, D., Lisboa, P.J., Sperduti, A., Villmann, T. (2015). Probabilistic Modeling in Machine Learning. In: Kacprzyk, J., Pedrycz, W. (eds) Springer Handbook of Computational Intelligence. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43505-2_31
DOI: https://doi.org/10.1007/978-3-662-43505-2_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43504-5
Online ISBN: 978-3-662-43505-2
eBook Packages: Engineering