Inducing probabilistic grammars by Bayesian model merging

  • Andreas Stolcke
  • Stephen Omohundro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 862)

Abstract

We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models (‘Occam's Razor’). The general scheme is illustrated using three types of probabilistic grammars: Hidden Markov models, class-based n-grams, and stochastic context-free grammars.
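The search procedure summarized above (incorporate samples, then greedily merge model elements while the Bayesian posterior improves) can be illustrated with a minimal sketch. This is not the authors' implementation; the helper names (`incorporate`, `candidate_merges`, `merge`, `log_posterior`) are hypothetical placeholders for the model-specific operations of adding an ad-hoc rule per sample, enumerating possible merges, collapsing two model elements, and evaluating log prior plus log likelihood.

```python
# Minimal sketch of Bayesian model merging (hypothetical helper names).
# Best-first greedy search: start from a model that memorizes the samples,
# then repeatedly apply the merge that most improves log P(model | data),
# stopping when no merge improves the posterior.

def bayesian_model_merging(samples, incorporate, candidate_merges,
                           merge, log_posterior):
    """Greedy merge search guided by the posterior probability of the model.

    incorporate(samples)          -> initial model with one ad-hoc rule per sample
    candidate_merges(model)       -> iterable of merge operations (e.g. state pairs)
    merge(model, op)              -> new model with the two elements collapsed
    log_posterior(model, samples) -> log prior(model) + log likelihood(samples | model)
    """
    model = incorporate(samples)              # fits the data exactly, no generalization
    score = log_posterior(model, samples)

    while True:
        # Evaluate every candidate merge and keep the best-scoring result.
        best = max(
            (merge(model, op) for op in candidate_merges(model)),
            key=lambda m: log_posterior(m, samples),
            default=None,
        )
        if best is None:                      # no merges left to try
            break
        new_score = log_posterior(best, samples)
        if new_score <= score:                # simplicity no longer pays for lost fit
            break
        model, score = best, new_score        # accept the merge and continue

    return model
```

The stopping criterion reflects the Occam's Razor trade-off described in the abstract: a merge is accepted only when the gain in prior probability from a simpler, more compact model outweighs the loss in likelihood from fitting the data less closely.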



Copyright information

© Springer-Verlag 1994

Authors and Affiliations

  • Andreas Stolcke (1)
  • Stephen Omohundro (1)
  1. International Computer Science Institute, Berkeley