The Use of the Maximum Likelihood Criterion in Language Modelling

  • Hermann Ney
Part of the NATO ASI Series (volume 169)

Summary

This paper gives an overview of the use of the maximum likelihood criterion in stochastic language modelling. This criterion and its associated estimation techniques provide a unifying framework for approaches that seem unrelated at first glance, such as smoothing and cross-validation, decision trees (CART), word classes obtained by clustering, word trigger pairs, and maximum entropy models.
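
As a concrete illustration of the criterion (a standard formulation stated here for orientation, not quoted from the paper itself): given a training corpus $w_1 \ldots w_N$ and an $m$-gram model with parameters $\theta$, maximum likelihood estimation chooses

$$
\hat{\theta} \;=\; \operatorname*{argmax}_{\theta} \; \sum_{n=1}^{N} \log p_{\theta}\!\left(w_n \mid w_{n-m+1} \ldots w_{n-1}\right),
$$

which for an unconstrained $m$-gram model is solved in closed form by the relative frequencies $p(w \mid h) = N(h, w) / N(h)$, where $N(\cdot)$ counts occurrences of the history $h$ and of the joint event $(h, w)$ in the corpus. These estimates assign zero probability to unseen events; this is the sparse-data problem that smoothing and leaving-one-out cross-validation address, and the unifying view surveyed here treats those techniques, too, as maximum likelihood estimation under suitable constraints.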

Keywords

Speech Recognition · Sparse Data · Likelihood Criterion · Maximum Entropy Model · Maximum Likelihood Criterion

References

  1. L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer: “A Tree-Based Statistical Language Model for Natural Language Speech Recognition”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, pp. 1001–1008, July 1989.
  2. L. R. Bahl, F. Jelinek, R. L. Mercer: “A Maximum Likelihood Approach to Continuous Speech Recognition”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 5, pp. 179–190, March 1983.
  3. L. R. Bahl, F. Jelinek, R. L. Mercer, A. Nadas: “Next Word Statistical Predictor”, IBM Tech. Disclosure Bulletin, Vol. 27, No. 7A, pp. 3941–3942, Dec. 1984.
  4. Y. M. M. Bishop, S. E. Fienberg, P. W. Holland: ‘Discrete Multivariate Analysis’, MIT Press, Cambridge, MA, 1975.
  5. L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone: ‘Classification and Regression Trees’, Wadsworth, Belmont, CA, 1984.
  6. P. F. Brown, V. Della Pietra, P. de Souza, R. L. Mercer: “Class-Based n-gram Models of Natural Language”, Computational Linguistics, Vol. 18, No. 4, pp. 467–479, 1992.
  7. J. N. Darroch, D. Ratcliff: “Generalized Iterative Scaling for Log-Linear Models”, Annals of Mathematical Statistics, Vol. 43, pp. 1470–1480, 1972.
  8. A. P. Dempster, N. M. Laird, D. B. Rubin: “Maximum Likelihood from Incomplete Data via the EM Algorithm”, J. Royal Statist. Soc. Ser. B (Methodological), Vol. 39, pp. 1–38, 1977.
  9. A. M. Derouault, B. Merialdo: “Natural Language Modeling for Phoneme-to-Text Transcription”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 742–749, Nov. 1986.
  10. R. O. Duda, P. E. Hart: ‘Pattern Classification and Scene Analysis’, John Wiley & Sons, New York, 1973.
  11. B. Efron, R. J. Tibshirani: ‘An Introduction to the Bootstrap’, Chapman & Hall, New York, 1993.
  12. I. J. Good: “The Population Frequencies of Species and the Estimation of Population Parameters”, Biometrika, Vol. 40, pp. 237–264, Dec. 1953.
  13. A. L. Gorin, S. E. Levinson, A. N. Gertner, E. R. Goldman: “Adaptive Acquisition of Language”, Computer Speech and Language, Vol. 5, No. 2, pp. 101–132, April 1991.
  14. F. Jelinek: “Self-Organized Language Modeling for Speech Recognition”, pp. 450–506, in A. Waibel, K.-F. Lee (eds.): ‘Readings in Speech Recognition’, Morgan Kaufmann Publishers, San Mateo, CA, 1991.
  15. F. Jelinek, J. Lafferty, R. L. Mercer: “Basic Methods of Probabilistic Context Free Grammars”, pp. 347–360, in P. Laface, R. de Mori (eds.): ‘Speech Recognition and Understanding’, Springer, Berlin, 1992.
  16. F. Jelinek, R. L. Mercer, S. Roukos: “Classifying Words for Improved Statistical Language Models”, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Albuquerque, NM, pp. 621–624, April 1990.
  17. F. Jelinek, R. L. Mercer, S. Roukos: “Principles of Lexical Language Modeling for Speech Recognition”, pp. 651–699, in S. Furui, M. M. Sondhi (eds.): ‘Advances in Speech Signal Processing’, Marcel Dekker, New York, 1991.
  18. S. M. Katz: “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 35, pp. 400–401, March 1987.
  19. R. Kneser, H. Ney: “Improved Clustering Techniques for Class-Based Statistical Language Modelling”, Third European Conference on Speech Communication and Technology, Berlin, pp. 973–976, Sep. 1993.
  20. R. Kuhn, R. de Mori: “A Cache-Based Natural Language Model for Speech Recognition”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, pp. 570–583, June 1990.
  21. R. Kuhn, R. de Mori: “Recent Results in Automatic Learning Rules for Semantic Interpretation”, Int. Conf. on Spoken Language Processing, Yokohama, Japan, pp. 75–78, Sep. 1994.
  22. R. Lau, R. Rosenfeld, S. Roukos: “Trigger-Based Language Models: A Maximum Entropy Approach”, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Minneapolis, MN, Vol. II, pp. 45–48, April 1993.
  23. E. L. Lehmann: ‘Theory of Point Estimation’, J. Wiley, New York, 1983.
  24. S. Martin, J. Liermann, H. Ney: “Algorithms for Bigram and Trigram Word Clustering”, Fourth European Conference on Speech Communication and Technology, Madrid, pp. 1253–1256, Sep. 1995.
  25. A. Nadas: “On Turing’s Formula for Word Probabilities”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 33, pp. 1414–1416, Dec. 1985.
  26. H. Ney, U. Essen: “Estimating Small Probabilities by Leaving-One-Out”, Third European Conference on Speech Communication and Technology, Berlin, pp. 2239–2242, Sep. 1993.
  27. H. Ney, U. Essen, R. Kneser: “On Structuring Probabilistic Dependencies in Language Modelling”, Computer Speech and Language, Vol. 8, pp. 1–38, 1994.
  28. H. Ney, S. Martin, F. Wessel: “Statistical Language Modelling by Leaving-One-Out”, pp. 174–207, in G. Bloothooft, S. Young (eds.): ‘Corpus-Based Methods in Speech and Language’, Kluwer Academic Publishers, Dordrecht, 1997.
  29. R. Pieraccini, E. Levin, E. Vidal: “Learning how to Understand Language”, Third European Conference on Speech Communication and Technology, Berlin, pp. 1407–1412, Sep. 1993.
  30. S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty, H. Printz, L. Ures: “Inference and Estimation of a Long-Range Trigram Model”, Second Int. Colloquium ‘Grammatical Inference and Applications’, Alicante, Spain, pp. 78–92, Springer-Verlag, Berlin, Sep. 1994.
  31. S. Della Pietra, V. Della Pietra, J. Lafferty: “Inducing Features of Random Fields”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 19, No. 4, pp. 380–393, April 1997.
  32. R. Rosenfeld: “Adaptive Statistical Language Modeling: A Maximum Entropy Approach”, Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, CMU-CS-94-138, 1994.
  33. M. Simons, H. Ney, S. Martin: “Distant Bigram Language Modelling Using Maximum Entropy”, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Munich, Vol. 2, pp. 787–790, April 1997.
  34. C. Tillmann, H. Ney: “Selection Criteria for Word Triggers in Language Modeling”, Fourth Int. Colloquium on Grammatical Inference, Montpellier, Lecture Notes in Artificial Intelligence 1147, Springer, pp. 95–106, Sep. 1996.
  35. C. Tillmann, H. Ney: “Word Triggers and the EM Algorithm”, ACL Special Interest Group Workshop on Computational Natural Language Learning (Assoc. for Comput. Linguistics), Madrid, pp. 117–124, July 1997.
  36. J. Yamron, J. Cant, A. Demedts, T. Dietzel, Y. Ito: “The Automatic Component of the LINGSTAT Machine-Aided Translation System”, ARPA Human Language Technology Workshop, Plainsboro, NJ, Morgan Kaufmann Publishers, San Mateo, CA, pp. 158–163, March 1994.
  37. S. J. Young, J. J. Odell, P. C. Woodland: “Tree-Based State Tying for High Accuracy Acoustic Modelling”, ARPA Human Language Technology Workshop, Plainsboro, NJ, Morgan Kaufmann Publishers, San Mateo, CA, pp. 286–291, March 1994.

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Hermann Ney
  1. Lehrstuhl für Informatik VI, RWTH Aachen University of Technology, Aachen, Germany
