Distances between Distributions: Comparing Language Models

  • Thierry Murgue
  • Colin de la Higuera
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3138)


Language models are used in a variety of fields in order to support other tasks: classification, next-symbol prediction, pattern analysis. In order to compare language models, or to measure the quality of an acquired model with respect to an empirical distribution, or to evaluate the progress of a learning process, we propose to use distances based on the L2 norm, or quadratic distances. We prove that these distances can not only be estimated through sampling, but can be effectively computed when both distributions are represented by stochastic deterministic finite automata. We provide a set of experiments showing a fast convergence of the distance through sampling and a good scalability, enabling us to use this distance to decide if two distributions are equal when only samples are provided, or to classify texts.
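The two ways of obtaining the quadratic distance mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the L2 distance is the square root of the sum over all strings w of (P(w) - P'(w))^2, and it expands this square into three "co-emission" sums of the form sum_w P(w) P'(w). The dictionary layout for automata (`init`, `final`, `trans`) and all function names are hypothetical, and the co-emission sum is approximated by iterating over string lengths until the remaining probability mass is negligible, rather than computed in closed form.

```python
import math
from collections import Counter, defaultdict

# Hypothetical automaton layout (an assumption of this sketch):
#   {"init": state,
#    "final": {state: stopping probability},
#    "trans": {state: {symbol: (next_state, probability)}}}

def coemission(a, b, eps=1e-12, max_len=10_000):
    """Approximate sum over all strings w of P_a(w) * P_b(w).

    Walks both deterministic automata in lockstep, one symbol per round,
    and stops once the mass still in flight drops below eps.
    """
    weights = {(a["init"], b["init"]): 1.0}
    total = 0.0
    for _ in range(max_len):
        nxt = defaultdict(float)
        for (q, r), w in weights.items():
            # Strings ending here contribute w * f_a(q) * f_b(r).
            total += w * a["final"].get(q, 0.0) * b["final"].get(r, 0.0)
            for sym, (q2, p) in a["trans"].get(q, {}).items():
                if sym in b["trans"].get(r, {}):
                    r2, p2 = b["trans"][r][sym]
                    nxt[(q2, r2)] += w * p * p2
        weights = nxt
        if sum(weights.values()) < eps:
            break
    return total

def d2_dfa(a, b):
    """L2 distance between the distributions of two stochastic DFA."""
    sq = coemission(a, a) + coemission(b, b) - 2.0 * coemission(a, b)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding

def d2_empirical(sample_x, sample_y):
    """L2 distance between the empirical distributions of two samples."""
    cx, cy = Counter(sample_x), Counter(sample_y)
    nx, ny = len(sample_x), len(sample_y)
    return math.sqrt(sum((cx[w] / nx - cy[w] / ny) ** 2
                         for w in set(cx) | set(cy)))

# Two one-state automata over {a}: P(a^n) = 0.5^(n+1) and 0.25 * 0.75^n.
A = {"init": 0, "final": {0: 0.5}, "trans": {0: {"a": (0, 0.5)}}}
B = {"init": 0, "final": {0: 0.25}, "trans": {0: {"a": (0, 0.75)}}}
print(d2_dfa(A, B))
print(d2_empirical(["a", "a"], ["a", "b"]))
```

For the two toy automata, the three co-emission sums are geometric series (1/3, 1/7 and 1/5), which gives a quick hand check of the automaton-based value. The sample-based function applies the same formula to relative frequencies, which is the kind of estimate the abstract describes as converging through sampling.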


Keywords: Speech Recognition, Language Model, Automatic Speech Recognition, Finite Automaton, Deterministic Finite Automaton



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Thierry Murgue (1, 2)
  • Colin de la Higuera (2)
  1. RIM, Ecole des Mines de Saint-Etienne, Saint-Etienne cedex 2, France
  2. EURISE, University of Saint-Etienne, Saint-Etienne cedex 2, France
