We give a tutorial and overview of the field of unsupervised learning from the perspective of statistical modeling. Unsupervised learning can be motivated from information-theoretic and Bayesian principles. We briefly review basic models in unsupervised learning, including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models, state-space models, and many variants and extensions. We derive the EM algorithm and give an overview of fundamental concepts in graphical models and of inference algorithms on graphs. This is followed by a quick tour of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC), the Laplace approximation, BIC, variational approximations, and expectation propagation (EP). The aim of this chapter is to provide a high-level view of the field. Along the way, many state-of-the-art ideas and future directions are reviewed.
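To make the EM algorithm mentioned above concrete, the following is a minimal sketch of EM for a two-component one-dimensional mixture of Gaussians: the E-step computes posterior responsibilities, and the M-step performs weighted maximum-likelihood parameter updates. The synthetic data, initialisation scheme, and fixed iteration count are illustrative choices, not taken from the chapter.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, vectorised over components."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=50):
    # Crude initialisation: place the two means at the data quartiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
        r = pi * gaussian_pdf(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var

# Illustrative data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
pi, mu, var = em_gmm(x)
```

Each iteration is guaranteed not to decrease the data log-likelihood, which is the key property of EM derived in the chapter; the same E-step/M-step structure carries over to factor analysis, HMMs, and the other latent-variable models reviewed.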


Keywords: Hidden Markov Model · Markov Chain Monte Carlo · Bayesian Information Criterion · Independent Component Analysis · Unsupervised Learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London, UK
