Machine Learning, Volume 96, Issue 1–2, pp 155–188

A comparison of collapsed Bayesian methods for probabilistic finite automata



Abstract

This paper describes several collapsed Bayesian methods, which work by first marginalizing out the transition probabilities, for inferring several kinds of probabilistic finite automata. The methods include collapsed Gibbs sampling (CGS) and collapsed variational Bayes, as well as two new methods. Their targets range over general probabilistic finite automata, hidden Markov models, probabilistic deterministic finite automata, and variable-length n-grams. We implement and compare these algorithms on the data sets from the Probabilistic Automata Learning Competition (PAutomaC), which were generated by various types of automata. We report that the CGS-based algorithm designed for general probabilistic finite automata performed best on all types of data.
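To illustrate the core idea the abstract refers to, the following is a minimal sketch (not the paper's implementation) of collapsed Gibbs sampling for the hidden states of a discrete HMM: symmetric Dirichlet priors are placed on the transition and emission distributions, those parameters are marginalized out analytically, and each hidden state is resampled from its Dirichlet-multinomial predictive. All names and hyperparameter values (`K`, `V`, `alpha`, `beta`) are illustrative assumptions.

```python
import numpy as np

def collapsed_gibbs_hmm(x, K, V, alpha=1.0, beta=1.0, iters=50, seed=0):
    """Resample hidden states z for observations x (ints in [0, V))."""
    rng = np.random.default_rng(seed)
    T = len(x)
    z = rng.integers(0, K, size=T)          # random initial state sequence
    # Sufficient statistics; row K of `trans` is a virtual start state.
    trans = np.zeros((K + 1, K))            # trans[i, j]: count of i -> j
    emit = np.zeros((K, V))                 # emit[k, v]: state k emits v
    prev = np.concatenate(([K], z[:-1]))
    for t in range(T):
        trans[prev[t], z[t]] += 1
        emit[z[t], x[t]] += 1
    for _ in range(iters):
        for t in range(T):
            k_old = z[t]
            p = K if t == 0 else z[t - 1]   # state preceding position t
            # Remove every count that involves position t.
            trans[p, k_old] -= 1
            emit[k_old, x[t]] -= 1
            if t + 1 < T:
                trans[k_old, z[t + 1]] -= 1
            # Dirichlet-multinomial predictive for each candidate state k,
            # with the usual correction when the two transitions touching
            # position t coincide (cf. collapsed HMM/POS-tagging samplers).
            probs = np.empty(K)
            for k in range(K):
                pr = (trans[p, k] + alpha) / (trans[p].sum() + K * alpha)
                pr *= (emit[k, x[t]] + beta) / (emit[k].sum() + V * beta)
                if t + 1 < T:
                    nxt = z[t + 1]
                    num_bonus = 1.0 if (p == k and nxt == k) else 0.0
                    den_bonus = 1.0 if p == k else 0.0
                    pr *= (trans[k, nxt] + num_bonus + alpha) / (
                        trans[k].sum() + den_bonus + K * alpha)
                probs[k] = pr
            k_new = rng.choice(K, p=probs / probs.sum())
            z[t] = k_new                     # restore counts with new state
            trans[p, k_new] += 1
            emit[k_new, x[t]] += 1
            if t + 1 < T:
                trans[k_new, z[t + 1]] += 1
    return z
```

Because the parameters are integrated out, the sampler mixes over the state sequence alone; the methods compared in the paper build on this same marginalization for the various automaton classes.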


Keywords: Collapsed Gibbs sampling · Variational Bayesian methods · State-merging algorithms



Acknowledgements

We are grateful to the committee of PAutomaC for offering various useful data sets and detailed information on them. We deeply appreciate the valuable comments and suggestions of the anonymous reviewers, which improved the quality of this paper.



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. School of Computer Science, Tokyo University of Technology, Tokyo, Japan
  2. Graduate School of Informatics, Kyoto University, Kyoto, Japan
