
Neural Probabilistic Language Models

  • Yoshua Bengio
  • Holger Schwenk
  • Jean-Sébastien Senécal
  • Fréderic Morin
  • Jean-Luc Gauvain
Part of the Studies in Fuzziness and Soft Computing book series (STUDFUZZ, volume 194)

Abstract

A central goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on several methods to speed up both training and probability computation, as well as comparative experiments to evaluate the improvements brought by these techniques. We finally describe the incorporation of this new language model into a state-of-the-art speech recognizer of conversational speech.
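
To make the idea concrete, here is a minimal sketch of the kind of feed-forward model the abstract describes: each word is mapped to a learned feature vector, the vectors of the previous n-1 words are concatenated and passed through a hidden layer, and a softmax assigns a probability to every word in the vocabulary as the next word. All names, dimensions, and the random initialization below are illustrative assumptions for this sketch, not the authors' exact architecture or training setup.

```python
# Minimal sketch (not the authors' code) of a feed-forward neural language
# model: shared word feature vectors, a tanh hidden layer, and a softmax
# over the vocabulary giving P(next word | previous n-1 words).
import numpy as np

rng = np.random.default_rng(0)

V = 10_000   # vocabulary size (assumed for illustration)
m = 60       # word feature (embedding) dimension (assumed)
n = 4        # n-gram order: predict word t from words t-3, t-2, t-1
h = 100      # hidden layer size (assumed)

C = rng.normal(0.0, 0.01, size=(V, m))            # shared word feature matrix
H = rng.normal(0.0, 0.01, size=(h, (n - 1) * m))  # context-to-hidden weights
d = np.zeros(h)                                   # hidden bias
U = rng.normal(0.0, 0.01, size=(V, h))            # hidden-to-output weights
b = np.zeros(V)                                   # output bias

def next_word_distribution(context_ids):
    """Return P(w_t = k | w_{t-n+1}, ..., w_{t-1}) for every word k."""
    x = np.concatenate([C[i] for i in context_ids])  # concatenated features
    a = np.tanh(H @ x + d)                           # hidden representation
    logits = U @ a + b                               # one score per vocabulary word
    logits -= logits.max()                           # numerical stability
    p = np.exp(logits)
    return p / p.sum()                               # softmax over the vocabulary

# Usage: distribution over all 10,000 possible next words given 3 context words.
p = next_word_distribution([12, 473, 9021])
print(p.shape, p.sum())   # (10000,) ~1.0
```

The softmax normalization over the full vocabulary is the expensive step in both training and probability computation; the speed-up methods the abstract refers to (for example, importance sampling with a suitable proposal distribution, as the keywords suggest) aim to avoid this exhaustive sum for every training example.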

Keywords

Speech Recognition · Language Model · Importance Sampling · Proposal Distribution · Training Corpus



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yoshua Bengio (1)
  • Holger Schwenk (2)
  • Jean-Sébastien Senécal (1)
  • Fréderic Morin (1)
  • Jean-Luc Gauvain (2)
  1. Département d’Informatique et Recherche Opérationnelle, Université de Montréal, Montréal, Canada
  2. Groupe Traitement du Langage Parlé, LIMSI-CNRS, Orsay, France
