Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues.


Keywords: Hidden Markov Model · Loss Function · Supervised Learning · Feature Subset · Hidden Unit



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Thomas G. Dietterich, Oregon State University, Corvallis, USA
