A Neural Syntactic Language Model


This paper studies the use of neural probabilistic models in a syntactic-based language model. The neural probabilistic model uses a distributed representation of the items in the conditioning history and is effective at capturing long dependencies. Employing neural-network-based models in the syntactic-based language model lets it make efficient use of the large amount of information available in a syntactic parse when estimating the next word in a string. Several scenarios for integrating neural networks into the syntactic-based language model are presented, together with derivations of the corresponding training procedures. Experiments on the UPenn Treebank and the Wall Street Journal corpus show significant improvements in perplexity and word error rate (WER) over the baseline SLM. Furthermore, comparisons with standard and neural-net-based N-gram models with arbitrarily long contexts show that syntactic information is in fact very helpful in estimating the word string probability. Overall, our neural syntactic-based model achieves the best published results in perplexity and WER for the given data sets.


Author information

Corresponding author

Correspondence to Ahmad Emami.

Additional information

This work was supported by the National Science Foundation under grant No. IIS-0085940.


Editors: Dan Roth and Pascale Fung

About this article

Cite this article

Emami, A., Jelinek, F. A Neural Syntactic Language Model. Mach Learn 60, 195–227 (2005).


Keywords

  • statistical language models
  • neural networks
  • speech recognition
  • parsing