Probabilistic and Bayesian Networks


Abstract

The Bayesian network model was introduced by Pearl in 1985 [147]. Bayesian networks are the best-known family of graphical models in artificial intelligence (AI). They are probabilistic models that combine probability theory and graph theory, and they provide a powerful tool for representing knowledge and reasoning about partial beliefs under uncertainty.
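To make the combination of graph structure and probability concrete, the following minimal sketch (not taken from the chapter; the three-node network Cloudy → Rain → WetGrass and all probability values are illustrative assumptions) factorizes a joint distribution over a directed acyclic graph and answers a query by enumeration:

```python
# Minimal sketch of a Bayesian network: Cloudy -> Rain -> WetGrass,
# with the joint factored as P(C, R, W) = P(C) * P(R | C) * P(W | R).
# Network structure and numbers are illustrative, not from the chapter.
from itertools import product

P_C = {True: 0.5, False: 0.5}                      # prior P(C)
P_R_given_C = {True: {True: 0.8, False: 0.2},      # P(R | C = True)
               False: {True: 0.1, False: 0.9}}     # P(R | C = False)
P_W_given_R = {True: {True: 0.9, False: 0.1},      # P(W | R = True)
               False: {True: 0.2, False: 0.8}}     # P(W | R = False)

def joint(c, r, w):
    """Joint probability obtained from the network's factorization."""
    return P_C[c] * P_R_given_C[c][r] * P_W_given_R[r][w]

def posterior_rain_given_wet():
    """P(Rain = True | WetGrass = True), by summing the joint over Cloudy."""
    num = sum(joint(c, True, True) for c in (True, False))
    den = sum(joint(c, r, True) for c, r in product((True, False), repeat=2))
    return num / den

if __name__ == "__main__":
    print(f"P(Rain = True | WetGrass = True) = {posterior_rain_given_wet():.3f}")
```

Running the script updates the belief in Rain after observing wet grass; the same factorize-and-sum pattern underlies the exact and approximate inference algorithms cited in the references below.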

Keywords

Entropy · Manifold · Covariance · Coherence

References

  1. Abbeel, P., Koller, D., & Ng, A. Y. (2006). Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7, 1743–1788.
  2. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
  3. Ahn, J.-H., Oh, J.-H., & Choi, S. (2007). Learning principal directions: Integrated-squared-error minimization. Neurocomputing, 70, 1372–1381.
  4. Akiyama, Y., Yamashita, A., Kajiura, M., & Aiso, H. (1989). Combinatorial optimization with Gaussian machines. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 533–540), Washington, DC, USA.
  5. Andrieu, C., de Freitas, N., & Doucet, A. (2001). Robust full Bayesian learning for radial basis networks. Neural Computation, 13, 2359–2407.
  6. Archambeau, C., & Verleysen, M. (2007). Robust Bayesian clustering. Neural Networks, 20, 129–138.
  7. Archambeau, C., Delannay, N., & Verleysen, M. (2008). Mixtures of robust probabilistic principal component analyzers. Neurocomputing, 71, 1274–1282.
  8. Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (pp. 21–30).
  9. Attias, H. (1999). Independent factor analysis. Neural Computation, 11, 803–851.
  10. Azencott, R., Doutriaux, A., & Younes, L. (1993). Synchronous Boltzmann machines and curve identification tasks. Network, 4, 461–480.
  11. Bauer, E., Koller, D., & Singer, Y. (1997). Update rules for parameter estimation in Bayesian networks. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (pp. 3–13).
  12. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1), 164–171.
  13. Beinlich, I., Suermondt, H., Chavez, R., & Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for Bayesian networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine (pp. 247–256).
  14. Benavent, A. P., Ruiz, F. E., & Saez, J. M. (2009). Learning Gaussian mixture models with entropy-based criteria. IEEE Transactions on Neural Networks, 20(11), 1756–1771.
  15. Binder, J., Koller, D., Russell, S., & Kanazawa, K. (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244.
  16. Bouchaert, R. R. (1994). Probabilistic network construction using the minimum description length principle. Technical report UU-CS-1994-27. The Netherlands: Department of Computer Science, Utrecht University.
  17. Bouguila, N., & Ziou, D. (2007). High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 1716–1731.
  18. Boutemedjet, S., Bouguila, N., & Ziou, D. (2009). A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(8), 1429–1443.
  19. Bradley, P. S., Fayyad, U. M., & Reina, C. A. (1998). Scaling EM (expectation-maximization) clustering to large databases. Technical report MSR-TR-98-35, Microsoft Research.
  20. Breese, J. S., & Heckerman, D. (1996). Decision-theoretic troubleshooting: A framework for repair and experiment. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (pp. 124–132), Portland, OR.
  21. Bromberg, F., Margaritis, D., & Honavar, V. (2006). Efficient Markov network structure discovery using independence tests. In Proceedings of the SIAM Conference on Data Mining (SDM).
  22. Buntine, W. (1991). Theory refinement of Bayesian networks. In B. D. D'Ambrosio, P. Smets, & P. P. Bonissone (Eds.), Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence (pp. 52–60). San Mateo, CA: Morgan Kaufmann.
  23. de Campos, C. P., & Ji, Q. (2011). Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12, 663–689.
  24. Celeux, G., & Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3), 315–332.
  25. Centeno, T. P., & Lawrence, N. D. (2006). Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research, 7, 455–491.
  26. Chan, K., Lee, T.-W., & Sejnowski, T. J. (2003). Variational Bayesian learning of ICA with missing data. Neural Computation, 15, 1991–2011.
  27. Chang, R., & Hancock, J. (1966). On receiver structures for channels having memory. IEEE Transactions on Information Theory, 12(4), 463–468.
  28. Chatzis, S. P., & Demiris, Y. (2011). Echo state Gaussian process. IEEE Transactions on Neural Networks, 22(9), 1435–1445.
  29. Chatzis, S. P., & Kosmopoulos, D. I. (2011). A variational Bayesian methodology for hidden Markov models utilizing Student's-t mixtures. Pattern Recognition, 44(2), 295–306.
  30. Chen, X.-W., Anantha, G., & Lin, X. (2008). Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm. IEEE Transactions on Knowledge and Data Engineering, 20(5), 1–13.
  31. Cheng, J., Greiner, R., Kelly, J., Bell, D., & Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1), 43–90.
  32. Cheng, S.-S., Fu, H.-C., & Wang, H.-M. (2009). Model-based clustering by probabilistic self-organizing maps. IEEE Transactions on Neural Networks, 20(5), 805–826.
  33. Cheung, Y. M. (2005). Maximum weighted likelihood via rival penalized EM for density mixture clustering with automatic model selection. IEEE Transactions on Knowledge and Data Engineering, 17(6), 750–761.
  34. Chickering, D. M. (1996). Learning Bayesian networks is NP-complete. In D. Fisher & H. Lenz (Eds.), Learning from data: Artificial intelligence and statistics (Vol. 5, pp. 121–130). Berlin: Springer-Verlag.
  35. Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507–554.
  36. Chickering, D. M., Heckerman, D., & Meek, C. (2004). Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5, 1287–1330.
  37. Chien, J.-T., & Hsieh, H.-L. (2013). Nonstationary source separation using sequential and variational Bayesian learning. IEEE Transactions on Neural Networks and Learning Systems, 24(5), 681–694.
  38. Choudrey, R. A., & Roberts, S. J. (2003). Variational mixture of Bayesian independent component analyzers. Neural Computation, 15, 213–252.
  39. Cohen, I., Bronstein, A., & Cozman, F. G. (2001). Adaptive online learning of Bayesian network parameters. Technical report HPL-2001-156. Palo Alto, CA: HP Laboratories.
  40. Cohn, I., El-Hay, T., Friedman, N., & Kupferman, R. (2010). Mean field variational approximation for continuous-time Bayesian networks. Journal of Machine Learning Research, 11, 2745–2783.
  41. Constantinopoulos, C., Titsias, M. K., & Likas, A. (2006). Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1013–1018.
  42. Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.
  43. Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
  44. Cowell, R. (2001). Conditions under which conditional independence and scoring methods lead to identical selection of Bayesian network models. In J. Breese & D. Koller (Eds.), Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (pp. 91–97). San Mateo, CA: Morgan Kaufmann.
  45. Dagum, P., & Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1), 141–154.
  46. Dagum, P., & Luby, M. (1997). An optimal approximation algorithm for Bayesian inference. Artificial Intelligence, 93, 1–27.
  47. Darwiche, A. (2001). Constant-space reasoning in dynamic Bayesian networks. International Journal of Approximate Reasoning, 26(3), 161–178.
  48. Dauwels, J., Korl, S., & Loeliger, H.-A. (2005). Expectation maximization as message passing. In Proceedings of the IEEE International Symposium on Information Theory (pp. 1–4), Adelaide, Australia.
  49. Dawid, A. P. (1992). Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing, 2, 25–36.
  50. de Campos, L. M., & Castellano, J. G. (2007). Bayesian network learning algorithms using structural restrictions. International Journal of Approximate Reasoning, 45, 233–254.
  51. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39(1), 1–38.
  52. Du, K.-L., & Swamy, M. N. S. (2010). Wireless communication systems. Cambridge, UK: Cambridge University Press.
  53. El-Hay, T., Friedman, N., & Kupferman, R. (2008). Gibbs sampling in factorized continuous-time Markov processes. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence.
  54. Elidan, G., & Friedman, N. (2005). Learning hidden variable networks: The information bottleneck approach. Journal of Machine Learning Research, 6, 81–127.
  55. Engel, A., & Van den Broeck, C. (2001). Statistical mechanics of learning. Cambridge, UK: Cambridge University Press.
  56. Ephraim, Y., & Merhav, N. (2002). Hidden Markov processes. IEEE Transactions on Information Theory, 48(6), 1518–1569.
  57. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 625–660.
  58. Fan, Y., Xu, J., & Shelton, C. R. (2010). Importance sampling for continuous time Bayesian networks. Journal of Machine Learning Research, 11, 2115–2140.
  59. Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972–976.
  60. Friedman, N. (1997). Learning Bayesian networks in the presence of missing values and hidden variables. In D. Fisher (Ed.), Proceedings of the 14th International Conference on Machine Learning (pp. 125–133). San Mateo, CA: Morgan Kaufmann.
  61. Galland, C. C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355–380.
  62. Gandhi, P., Bromberg, F., & Margaritis, D. (2008). Learning Markov network structure using few independence tests. In Proceedings of the SIAM International Conference on Data Mining (SDM) (pp. 680–691).
  63. Gao, B., Woo, W. L., & Dlay, S. S. (2012). Variational regularized 2-D nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 23(5), 703–716.
  64. Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
  65. Gelly, S., & Teytaud, O. (2005). Bayesian networks: A better than frequentist approach for parametrization, and a more accurate structural complexity measure than the number of parameters. In Proceedings of CAP, Nice, France.
  66. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
  67. Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3, 679–707.
  68. Ghahramani, Z., & Beal, M. (1999). Variational inference for Bayesian mixtures of factor analysers. In Advances in neural information processing systems (Vol. 12). Cambridge, MA: MIT Press.
  69. Glauber, R. J. (1963). Time-dependent statistics of the Ising model. Journal of Mathematical Physics, 4, 294–307.
  70. Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
  71. Handschin, J. E., & Mayne, D. Q. (1969). Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. International Journal of Control, 9(5), 547–559.
  72. Hammersley, J. M., & Morton, K. W. (1954). Poor man's Monte Carlo. Journal of the Royal Statistical Society Series B, 16, 23–38.
  73. Hartman, E. (1991). A high storage capacity neural network content-addressable memory. Network, 2, 315–334.
  74. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
  75. Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
  76. Heckerman, D. (1996). A tutorial on learning with Bayesian networks. Microsoft technical report MSR-TR-95-06, March 1995.
  77. Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197–243.
  78. Hennig, P., & Kiefel, M. (2013). Quasi-Newton methods: A new direction. Journal of Machine Learning Research, 14, 843–865.
  79. Heskes, T. (2004). On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16, 2379–2413.
  80. Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
  81. Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
  82. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
  83. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
  84. Hojen-Sorensen, P. A. d. F. R., Winther, O., & Hansen, L. K. (2002). Mean-field approaches to independent component analysis. Neural Computation, 14, 889–918.
  85. Holmes, C. C., & Mallick, B. K. (1998). Bayesian radial basis functions of variable dimension. Neural Computation, 10(5), 1217–1233.
  86. Huang, Q., Yang, J., & Zhou, Y. (2008). Bayesian nonstationary source separation. Neurocomputing, 71, 1714–1729.
  87. Huang, J. C., & Frey, B. J. (2011). Cumulative distribution networks and the derivative-sum-product algorithm: Models and inference for cumulative distribution functions on graphs. Journal of Machine Learning Research, 12, 301–348.
  88. Huang, S., Li, J., Ye, J., Fleisher, A., Chen, K., Wu, T., et al. (2013). A sparse structure learning algorithm for Gaussian Bayesian network identification from high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6), 1328–1342.
  89. Huda, S., Yearwood, J., & Togneri, R. (2009). A stochastic version of expectation maximization algorithm for better estimation of hidden Markov model. Pattern Recognition Letters, 30, 1301–1309.
  90. Ilin, A., & Raiko, T. (2010). Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research, 11, 1957–2000.
  91. Ihler, A. T., Fisher, J. W., III, & Willsky, A. S. (2005). Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6, 905–936.
  92. Jensen, F. V., Lauritzen, S. L., & Olesen, K. G. (1990). Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4, 269–282.
  93. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233.
  94. Kalisch, M., & Buhlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8, 613–636.
  95. Kam, M., & Cheng, R. (1989). Convergence and pattern stabilization in the Boltzmann machine. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 1, pp. 511–518). San Mateo, CA: Morgan Kaufmann.
  96. Kappen, H. J., & Rodriguez, F. B. (1998). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10, 1137–1156.
  97. Khreich, W., Granger, E., Miri, A., & Sabourin, R. (2010). On the memory complexity of the forward-backward algorithm. Pattern Recognition Letters, 31, 91–99.
  98. Kinouchi, O., & Caticha, N. (1992). Optimal generalization in perceptrons. Journal of Physics A: Mathematical and General, 25, 6243–6250.
  99. Kjaerulff, U. (1995). dHugin: A computational system for dynamic time-sliced Bayesian networks. International Journal of Forecasting, 11(1), 89–113.
  100. Koivisto, M., & Sood, K. (2004). Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5, 549–573.
  101. Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
  102. Kurita, N., & Funahashi, K. I. (1996). On the Hopfield neural networks and mean field theory. Neural Networks, 9, 1531–1540.
  103. Lam, W., & Bacchus, F. (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10, 269–293.
  104. Lam, W., & Segre, A. M. (2002). A distributed learning algorithm for Bayesian inference networks. IEEE Transactions on Knowledge and Data Engineering, 14(1), 93–105.
  105. Langari, R., Wang, L., & Yen, J. (1997). Radial basis function networks, regression weights, and the expectation-maximization algorithm. IEEE Transactions on Systems, Man, and Cybernetics Part A, 27(5), 613–623.
  106. Lappalainen, H., & Honkela, A. (2000). Bayesian nonlinear independent component analysis by multilayer perceptrons. In M. Girolami (Ed.), Advances in independent component analysis (pp. 93–121). Berlin: Springer-Verlag.
  107. Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10, 1–40.
  108. Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420), 1098–1108.
  109. Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society Series B, 50(2), 157–224.
  110. Lazaro, M., Santamaria, I., & Pantaleon, C. (2003). A new EM-based training algorithm for RBF networks. Neural Networks, 16, 69–77.
  111. Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.
  112. Le Roux, N., & Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20, 1631–1649.
  113. Le Roux, N., & Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22, 2192–2207.
  114. Levy, B. C., & Adams, M. B. (1987). Global optimization with stochastic neural networks. In Proceedings of the 1st IEEE Conference on Neural Networks (Vol. 3, pp. 681–689), San Diego, CA.
  115. Li, J., & Tao, D. (2013). Exponential family factors for Bayesian factor analysis. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 964–976.
  116. Liang, F. (2007). Annealing stochastic approximation Monte Carlo algorithm for neural network training. Machine Learning, 68, 201–233.
  117. Liang, F., Liu, C., & Carroll, R. J. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102, 305–320.
  118. Lin, C. T., & Lee, C. S. G. (1995). A multi-valued Boltzmann machine. IEEE Transactions on Systems, Man, and Cybernetics, 25(4), 660–669.
  119. Ling, C. X., & Zhang, H. (2002). The representational power of discrete Bayesian networks. Journal of Machine Learning Research, 3, 709–721.
  120. Lopez-Rubio, E., Ortiz-de-Lazcano-Lobato, J. M., & Lopez-Rodriguez, D. (2009). Probabilistic PCA self-organizing maps. IEEE Transactions on Neural Networks, 20(9), 1474–1489.
  121. Lopez-Rubio, E. (2009). Multivariate Student-t self-organizing maps. Neural Networks, 22, 1432–1447.
  122. Lu, X., Wang, Y., & Yuan, Y. (2013). Sparse coding from a Bayesian perspective. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 929–939.
  123. Luis, R., Sucar, L. E., & Morales, E. F. (2010). Inductive transfer for learning Bayesian networks. Machine Learning, 79, 227–255.
  124. Ma, S., Ji, C., & Farmer, J. (1997). An efficient EM-based training algorithm for feedforward neural networks. Neural Networks, 10, 243–256.
  125. Ma, J., Xu, L., & Jordan, M. I. (2000). Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12, 2881–2907.
  126. MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
  127. Margaritis, D., & Thrun, S. (2000). Bayesian network induction via local neighborhoods. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems (Vol. 12, pp. 505–511). Cambridge, MA: MIT Press.
  128. Mateescu, R., & Dechter, R. (2009). Mixed deterministic and probabilistic networks. Annals of Mathematics and Artificial Intelligence, 54(1–3), 3–51.
  129. Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society Series B, 51(1), 127–138.
  130. Minka, T. (2001). Expectation propagation for approximate Bayesian inference. Doctoral dissertation, MIT Media Lab.
  131. Miskin, J. W., & MacKay, D. J. C. (2001). Ensemble learning for blind source separation. In S. Roberts & R. Everson (Eds.), Independent component analysis: Principles and practice (pp. 209–233). Cambridge, UK: Cambridge University Press.
  132. Mongillo, G., & Deneve, S. (2008). Online learning with hidden Markov models. Neural Computation, 20, 1706–1716.
  133. Moral, S., Rumi, R., & Salmeron, A. (2001). Mixtures of truncated exponentials in hybrid Bayesian networks. In LNAI 2143 (pp. 135–143). Berlin: Springer.
  134. Nasios, N., & Bors, A. (2006). Variational learning for Gaussian mixtures. IEEE Transactions on Systems, Man, and Cybernetics Part B, 36(4), 849–862.
  135. Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical report CRG-TR-93-1. Toronto: Department of Computer Science, University of Toronto.
  136. Ngo, L., & Haddawy, P. (1995). Probabilistic logic programming and Bayesian networks. In Algorithms, concurrency and knowledge (Proceedings of ACSC'95), LNCS 1023 (pp. 286–300). Berlin: Springer-Verlag.
  137. Nielsen, S. H., & Nielsen, T. D. (2008). Adapting Bayes network structures to non-stationary domains. International Journal of Approximate Reasoning, 49, 379–397.
  138. Nodelman, U., Shelton, C. R., & Koller, D. (2002). Continuous time Bayesian networks. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (pp. 378–387).
  139. Noorshams, N., & Wainwright, M. J. (2013). Stochastic belief propagation: A low-complexity alternative to the sum-product algorithm. IEEE Transactions on Information Theory, 59(4), 1981–2000.
  140. Opper, M. (1998). A Bayesian approach to online learning. In D. Saad (Ed.), On-line learning in neural networks (pp. 363–378). Cambridge, UK: Cambridge University Press.
  141. Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655–2684.
  142. Opper, M., & Winther, O. (2001). Tractable approximations for probabilistic models: The adaptive Thouless-Anderson-Palmer mean field approach. Physical Review Letters, 86, 3695–3699.
  143. Opper, M., & Winther, O. (2005). Expectation consistent approximate inference. Journal of Machine Learning Research, 6, 2177–2204.
  144. Osoba, O., & Kosko, B. (2013). Noise-enhanced clustering and competitive learning algorithms. Neural Networks, 37, 132–140.
  145. Ott, G. (1967). Compact encoding of stationary Markov sources. IEEE Transactions on Information Theory, 13(1), 82–86.
  146. Park, H., & Ozeki, T. (2009). Singularity and slow convergence of the EM algorithm for Gaussian mixtures. Neural Processing Letters, 29, 45–59.
  147. Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
  148. Perez, A., Larranaga, P., & Inza, I. (2009). Bayesian classifiers based on kernel density estimation: Flexible classifiers. International Journal of Approximate Reasoning, 50, 341–362.
  149. Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1(5), 995–1019.
  150. Pietra, S. D., Pietra, V. D., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380–393.
  151. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
  152. Raviv, J. (1967). Decision making in Markov chains applied to the problem of pattern recognition. IEEE Transactions on Information Theory, 13(4), 536–551.
  153. Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society Series B, 59(4), 731–792.
  154. Romero, V., Rumi, R., & Salmeron, A. (2006). Learning hybrid Bayesian networks using mixtures of truncated exponentials. International Journal of Approximate Reasoning, 42, 54–68.
  155. Roos, T., Grunwald, P., & Myllymaki, P. (2005). On discriminative Bayesian network classifiers and logistic regression. Machine Learning, 59, 267–296.
  156. Rosipal, R., & Girolami, M. (2001). An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13, 505–510.
  157. Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in neural information processing systems (Vol. 10, pp. 626–632). Cambridge, MA: MIT Press.
  158. Rusakov, D., & Geiger, D. (2005). Asymptotic model selection for naive Bayesian networks. Journal of Machine Learning Research, 6, 1–35.
  159. Sanguinetti, G. (2008). Dimensionality reduction of clustered data sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3), 535–540.
  160. Sarela, J., & Valpola, H. (2005). Denoising source separation. Journal of Machine Learning Research, 6, 233–272.
  161. Sato, M., & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12, 407–432.
  162. Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13, 1649–1681.
  163. Seeger, M. W. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9, 759–813.
  164. Shelton, C. R., Fan, Y., Lam, W., Lee, J., & Xu, J. (2010). Continuous time Bayesian network reasoning and learning engine. Journal of Machine Learning Research, 11, 1137–1140.
  165. Shutin, D., Zechner, C., Kulkarni, S. R., & Poor, H. V. (2012). Regularized variational Bayesian learning of echo state networks with delay and sum readout. Neural Computation, 24, 967–995.
  166. Silander, T., & Myllymaki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (pp. 445–452).
  167. Silander, T., Kontkanen, P., & Myllymaki, P. (2007). On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In R. Parr & L. van der Gaag (Eds.), Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (pp. 360–367).
  168. Silander, T., Roos, T., & Myllymaki, P. (2009). Locally minimax optimal predictive modeling with Bayesian networks. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings (Vol. 5, pp. 504–511), Clearwater Beach, Florida, USA.
  169. Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2), 227–269.
  170. Spiegelhalter, D. J., & Lauritzen, S. L. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20(5), 579–605.
  171. Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, prediction, and search (2nd ed.). Cambridge, MA: MIT Press.
  172. Sutskever, I., & Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation, 20, 2629–2636.
  173. Szu, H. H., & Hartley, R. L. (1987). Nonconvex optimization by fast simulated annealing. Proceedings of the IEEE, 75, 1538–1540.
  174. Takekawa, T., & Fukai, T. (2009). A novel view of the variational Bayesian clustering. Neurocomputing, 72, 3366–3369.
  175. Tamada, Y., Imoto, S., & Miyano, S. (2011). Parallel algorithm for learning optimal Bayesian network structure. Journal of Machine Learning Research, 12, 2437–2459.
  176. Tan, X., & Li, J. (2010). Computationally efficient sparse Bayesian learning via belief propagation. IEEE Transactions on Signal Processing, 58(4), 2010–2021.
  177. Tanner, M., & Wong, W. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540.
  178. Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence.
  179. Thouless, D. J., Anderson, P. W., & Palmer, R. G. (1977). Solution of "solvable model of a spin glass". Philosophical Magazine, 35(3), 593–601.
  180. Ting, J.-A., D'Souza, A., Vijayakumar, S., & Schaal, S. (2010). Efficient learning and feature selection in high-dimensional regression. Neural Computation, 22, 831–886.
  181. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B, 61(3), 611–622.
  182. Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482.
  183. Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.
  184. Ueda, N., & Nakano, R. (1998). Deterministic annealing EM algorithm. Neural Networks, 11(2), 271–282.
  185. Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109–2128.
  186. Valpola, H., & Pajunen, P. (2000). Fast algorithms for Bayesian independent component analysis. In Proceedings of the 2nd International Workshop on ICA (pp. 233–237), Helsinki, Finland.
  187. Valpola, H. (2000). Nonlinear independent component analysis using ensemble learning: Theory. In Proceedings of the 2nd International Workshop on ICA (pp. 251–256), Helsinki, Finland.
  188. Valpola, H., & Karhunen, J. (2002). An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11), 2647–2692.
  189. Verma, T., & Pearl, J. (1990). Equivalence and synthesis of causal models. In Proceedings of the 6th Conference on Uncertainty in Artificial Intelligence (pp. 255–268), Cambridge, MA.
  190. Watanabe, K., & Watanabe, S. (2006). Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7(4), 625–644.
  191. Watanabe, K., & Watanabe, S. (2007). Stochastic complexities of general mixture models in variational Bayesian learning. Neural Networks, 20, 210–219.
  192. Watanabe, K., Akaho, S., Omachi, S., & Okada, M. (2009). VB mixture model on a subspace of exponential family distributions. IEEE Transactions on Neural Networks, 20(11), 1783–1796.
  193. Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689.
  194. Winn, J., & Bishop, C. M. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661–694.
  195. Winther, O., & Petersen, K. B. (2007). Flexible and efficient implementations of Bayesian independent component analysis. Neurocomputing, 71, 221–233.
  196. Wu, J. M. (2004). Annealing by two sets of interactive dynamics. IEEE Transactions on Systems, Man, and Cybernetics Part B, 34(3), 1519–1525.
  197. Xiang, Y. (2000). Belief updating in multiply sectioned Bayesian networks without repeated local propagations. International Journal of Approximate Reasoning, 23, 1–21.
  198. Xie, X., & Geng, Z. (2008). A recursive method for structural learning of directed acyclic graphs. Journal of Machine Learning Research, 9, 459–483.
  199. Xie, X., Yan, S., Kwok, J., & Huang, T. (2008). Matrix-variate factor analysis and its applications. IEEE Transactions on Neural Networks, 19(10), 1821–1826.
  200. Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixtures of experts. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 633–640). Cambridge, MA: MIT Press.
  201. Yamazaki, K., & Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16, 1023–1038.
  202. Yang, Z. R. (2006). A novel radial basis function neural network for discriminant analysis. IEEE Transactions on Neural Networks, 17(3), 604–612.
  203. Yap, G.-E., Tan, A.-H., & Pang, H.-H. (2008). Explaining inferences in Bayesian networks. Applied Intelligence, 29, 263–278.
  204. Yasuda, M., & Tanaka, K. (2009). Approximate learning algorithm in Boltzmann machines. Neural Computation, 21, 3130–3178.
  205. Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001). Generalized belief propagation. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 689–695). Cambridge, MA: MIT Press.
  206. Younes, L. (1996). Synchronous Boltzmann machines can be universal approximators. Applied Mathematics Letters, 9(3), 109–113.
  207. Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.
  208. Zhang, B., Zhang, C., & Yi, X. (2004). Competitive EM algorithm for finite mixture models. Pattern Recognition, 37, 131–144.
  209. Zhang, Z., & Cheung, Y. M. (2006). On weight design of maximum weighted likelihood and an extended EM algorithm. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1429–1434.
  210. Zhao, J., & Jiang, Q. (2006). Probabilistic PCA for t distributions. Neurocomputing, 69, 2217–2226.
  211. Zhao, J., Yu, P. L. H., & Kwok, J. T. (2012). Bilinear probabilistic principal component analysis. IEEE Transactions on Neural Networks and Learning Systems, 23(3), 492–503.
  212. Zhong, M., & Du, J. (2007). A parametric density model for blind source separation. Neural Processing Letters, 25, 199–207.

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  1. Enjoyor Labs, Enjoyor Inc., Hangzhou, China
  2. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
