Machine Learning, Volume 25, Issue 2–3, pp 117–149

The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length

  • Dana Ron
  • Yoram Singer
  • Naftali Tishby


We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis the learning algorithm outputs can be made small with high confidence, in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first, we apply the algorithm to construct a model of the English language and use this model to correct corrupted text. In the second, we construct a simple stochastic model for E. coli DNA.

Keywords: learning distributions, probabilistic automata, Markov models, suffix trees, text correction
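To make the variable-memory idea concrete, the following is a minimal sketch (not the paper's learning algorithm, and with no formal guarantees): next-symbol counts are kept for every context up to a maximum order, and prediction uses the longest suffix of the history seen in training, falling back to shorter suffixes when needed. The function names and the `max_order` parameter are illustrative choices, not the paper's notation.

```python
from collections import defaultdict

def train_suffix_model(text, max_order=3):
    """Count next-symbol frequencies for every suffix context up to max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            context = text[i - k:i]       # the k symbols preceding position i
            counts[context][text[i]] += 1
    return counts

def predict(counts, history, max_order=3):
    """Predict the next symbol from the longest matching suffix of `history`.

    This is the 'variable memory': long contexts are used where data exists,
    shorter ones otherwise (the empty context always matches).
    """
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        if context in counts:
            dist = counts[context]
            total = sum(dist.values())
            sym = max(dist, key=dist.get)   # most likely next symbol
            return sym, dist[sym] / total
    return None, 0.0
```

For example, after training on the string `"abracadabra"`, the history `"ab"` matches the length-2 context `"ab"`, which in the training string is always followed by `"r"`.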



Copyright information

© Kluwer Academic Publishers 1996

Authors and Affiliations

  1. Dana Ron, Laboratory for Computer Science, MIT, Cambridge, MA
  2. Yoram Singer, AT&T Labs, NJ
  3. Naftali Tishby, Institute of Computer Science, Hebrew University, Jerusalem, Israel
