Robustness in Statistical Language Modeling: Review and Perspectives

  • Jerome R. Bellegarda
Part of the Text, Speech and Language Technology book series (TLTB, volume 17)

Abstract

Robustness in statistical language modeling refers to the need to maintain adequate speech recognition accuracy as fewer and fewer constraints are placed on the spoken utterances, or more generally when the lexical, syntactic, or semantic characteristics of the discourse in the training and testing tasks differ. Obstacles to robustness involve the dual issues of model coverage and parameter reliability, which are intricately related to the quality and quantity of training data, as well as the estimation paradigm selected. Domain-to-domain differences impose further variations in vocabulary, context, grammar, and style. This chapter reviews a selected subset of recent approaches proposed to deal with some of these issues, and discusses possible future directions of improvement.

Keywords

Speech Recognition; Language Model; Automatic Speech Recognition; Latent Semantic Analysis; Latent Semantic Indexing

Copyright information

© Springer Science+Business Media Dordrecht 2001

Authors and Affiliations

  • Jerome R. Bellegarda (1)
  1. Two Infinite Loop, Apple Computer, Cupertino, USA
