
Part of the book series: Springer Handbooks (SHB)

Abstract

Probabilistic methods are at the heart of machine learning. This chapter shows the links between core principles of information theory and probabilistic methods, with a short overview of historical and current examples of unsupervised and inferential models. Probabilistic models are introduced as a powerful idiom to describe the world, using random variables as building blocks held together by probabilistic relationships. The chapter discusses how such probabilistic interactions can be mapped to directed and undirected graph structures, namely Bayesian and Markov networks. We show how these networks are subsumed by the broader class of probabilistic graphical models, a general framework that provides concepts and methodological tools to encode, manipulate, and process probabilistic knowledge in a computationally efficient way. The chapter then introduces, in more detail, two topical methodologies that are central to probabilistic modeling in machine learning. First, it discusses latent variable models, a probabilistic approach to capturing complex relationships between a large number of observable and measurable events (data, in general), under the assumption that these are generated by an unknown, nonobservable process. We show how the parameters of a probabilistic model involving such nonobservable information can be efficiently estimated using the concepts underlying the expectation–maximization algorithm. Second, the chapter introduces a notable example of a latent variable model that is of particular relevance for representing the time evolution of sequence data: the hidden Markov model. The chapter ends with a discussion of advanced approaches for modeling complex data-generating processes comprising nontrivial probabilistic interactions between latent variables and observed information.
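As a concrete illustration of the latent variable estimation the abstract refers to, the sketch below applies the expectation–maximization (EM) idea to a two-component one-dimensional Gaussian mixture, where the unobserved component assignment of each data point plays the role of the latent variable. This is a minimal illustrative example, not code from the chapter; the function name, initialization, and all numerical values are our own assumptions.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Fit a two-component 1-D Gaussian mixture to data x with EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initial guesses for mixing weights, means, and variances.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the expected assignments.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Usage: data drawn from two Gaussians; EM recovers weights, means, and variances.
x = np.concatenate([np.random.normal(-2.0, 1.0, 300),
                    np.random.normal(3.0, 0.5, 700)])
print(em_gmm_1d(x))
```

The abstract also singles out the hidden Markov model for sequence data. A correspondingly small sketch of the forward recursion, which computes the likelihood of an observation sequence under a discrete HMM, is given next; again, the parameter values are made up purely for illustration and do not come from the chapter.

```python
def hmm_forward(pi0, A, B, obs):
    """Forward algorithm for a discrete HMM with initial distribution pi0,
    transition matrix A (A[i, j] = P(state j | state i)) and emission matrix
    B (B[i, k] = P(symbol k | state i))."""
    alpha = pi0 * B[:, obs[0]]          # alpha_1(i) = pi0(i) * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()                  # P(o_1, ..., o_T)

# Toy example with two hidden states and two observation symbols.
pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward(pi0, A, B, obs=[0, 1, 0]))
```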


Abbreviations

ARD: automatic relevance determination
BSS: blind source separation
EM: expectation maximization
FA: factor analysis
HMM: hidden Markov model
i.i.d.: independent, identically distributed
ICA: independent component analysis
IO-HMM: input/output hidden Markov model
KL: Kullback–Leibler
LDA: latent Dirichlet allocation
MAP: maximum a posteriori
ML: maximum likelihood
MLP: multilayer perceptron
MPE: most probable explanation
NMF: nonnegative matrix and tensor factorization
PCA: principal component analysis
pdf: probability density function
pLSA: probabilistic latent semantic analysis
SCNG: sparse coding neural gas
SNE: stochastic neighbor embedding
SVM: support vector machine
VB: variational Bayes


Author information

Correspondence to Davide Bacciu.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bacciu, D., Lisboa, P.J., Sperduti, A., Villmann, T. (2015). Probabilistic Modeling in Machine Learning. In: Kacprzyk, J., Pedrycz, W. (eds) Springer Handbook of Computational Intelligence. Springer Handbooks. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43505-2_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-662-43505-2_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-43504-5

  • Online ISBN: 978-3-662-43505-2

  • eBook Packages: Engineering
