Robustness in Language and Speech Technology, pp. 101-121

# Robustness in Statistical Language Modeling: Review and Perspectives

## Abstract

Robustness in statistical language modeling refers to the need to maintain adequate speech recognition accuracy as fewer and fewer constraints are placed on the spoken utterances, or more generally when the lexical, syntactic, or semantic characteristics of the discourse in the training and testing tasks differ. Obstacles to robustness involve the dual issues of model coverage and parameter reliability, which are intricately related to the quality and quantity of training data, as well as the estimation paradigm selected. Domain-to-domain differences impose further variations in vocabulary, context, grammar, and style. This chapter reviews a selected subset of recent approaches proposed to deal with some of these issues, and discusses possible future directions of improvement.

## Keywords

Speech Recognition, Language Model, Automatic Speech Recognition, Latent Semantic Analysis, Latent Semantic Indexing
