Chapter

Innovations in Machine Learning

Volume 194 of the series Studies in Fuzziness and Soft Computing pp 137-186

Neural Probabilistic Language Models

  • Yoshua Bengio, Département d’Informatique et Recherche Opérationnelle, Université de Montréal
  • Holger Schwenk, Groupe Traitement du Langage Parlé, LIMSI-CNRS
  • Jean-Sébastien Senécal, Département d’Informatique et Recherche Opérationnelle, Université de Montréal
  • Frédéric Morin, Département d’Informatique et Recherche Opérationnelle, Université de Montréal
  • Jean-Luc Gauvain, Groupe Traitement du Langage Parlé, LIMSI-CNRS


Abstract

A central goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words, which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to the words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on several methods to speed up both training and probability computation, as well as comparative experiments to evaluate the improvements brought by these techniques. We finally describe the incorporation of this new language model into a state-of-the-art speech recognizer of conversational speech.
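The core idea summarized above, predicting the next word from learned distributed word representations fed through a small neural network, can be illustrated with a minimal forward-pass sketch. All sizes and variable names here (V, d, n, h, and so on) are illustrative assumptions for exposition, not the chapter's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10  # vocabulary size (illustrative)
d = 4   # dimension of each word's distributed representation
n = 3   # number of context words used to predict the next word
h = 8   # hidden layer size

# Shared lookup table: one d-dimensional feature vector per word.
# Because words with similar usage get nearby vectors, a never-seen
# word sequence can still receive high probability.
C = rng.normal(scale=0.1, size=(V, d))

# Hidden-layer and output-layer parameters.
H = rng.normal(scale=0.1, size=(h, n * d))
b_h = np.zeros(h)
U = rng.normal(scale=0.1, size=(V, h))
b_o = np.zeros(V)

def next_word_probs(context):
    """Return P(w_t = v | context) for every word v in the vocabulary."""
    x = C[context].reshape(-1)         # concatenate the context word vectors
    a = np.tanh(H @ x + b_h)           # hidden layer
    logits = U @ a + b_o               # one score per vocabulary word
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

p = next_word_probs([1, 5, 2])  # a valid distribution over all V words
```

Note that the softmax normalizes over the entire vocabulary, which is exactly the per-prediction cost that motivates the speed-up methods the abstract mentions.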