Machine Learning, Volume 9, Issue 2–3, pp. 165–203

A learning criterion for stochastic rules

  • Kenji Yamanishi

Abstract

This paper proposes a learning criterion for stochastic rules, developed by extending Valiant's PAC (Probably Approximately Correct) learning model, which is a learning criterion for deterministic rules. Stochastic rules here refer to rules that probabilistically assign one of a number of classes, {Y}, to each attribute vector X. The proposed criterion is based on the idea that learning stochastic rules may be regarded as probably approximately correct identification of conditional probability distributions over classes for given input attribute vectors. An algorithm (an MDL algorithm) based on the MDL (Minimum Description Length) principle is used for learning stochastic rules. Specifically, for stochastic rules with finite partitioning (each of which is specified by a finite number of disjoint cells of the domain and a probability parameter vector associated with them), this paper derives target-dependent upper bounds and worst-case upper bounds on the sample size required by the MDL algorithm to learn stochastic rules with given accuracy and confidence. Based on these sample size bounds, this paper proves polynomial-sample-size learnability of stochastic decision lists (newly proposed in this paper as a stochastic analogue of Rivest's decision lists) with at most k literals (k fixed) in each decision, and polynomial-sample-size learnability of stochastic decision trees (a stochastic analogue of decision trees) of depth at most k. Sufficient conditions for polynomial-sample-size learnability and polynomial-time learnability of any class of stochastic rules with finite partitioning are also derived.
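To make the two-part MDL selection concrete, the sketch below illustrates one way to score and compare stochastic rules with finite partitioning, assuming binary classes and a pre-enumerated candidate set of partitionings. The function names (`description_length`, `mdl_select`) and the (1/2) log m per-parameter cost are illustrative assumptions, not the paper's exact notation or code lengths.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]      # (attribute vector X, class y in {0, 1})
Cell = Callable[[Sequence[float]], bool]   # membership test for one cell of the partition


def description_length(cells: List[Cell], model_bits: float,
                       sample: List[Example]) -> float:
    """Two-part code length: bits for the rule plus bits for the data given the rule."""
    m = len(sample)
    # Bits to describe the partition itself (supplied by the caller), plus a
    # standard (1/2) log2(m) bits per probability parameter, one parameter per cell.
    total = model_bits + 0.5 * len(cells) * math.log2(max(m, 2))
    for cell in cells:
        labels = [y for x, y in sample if cell(x)]
        n, n1 = len(labels), sum(labels)
        if n == 0:
            continue
        p = n1 / n                           # ML estimate of P(Y = 1 | cell)
        for prob, count in ((p, n1), (1.0 - p, n - n1)):
            if count > 0:
                total += -count * math.log2(prob)   # code length of the labels in this cell
    return total


def mdl_select(candidates: List[Tuple[List[Cell], float]],
               sample: List[Example]) -> Tuple[List[Cell], float]:
    """Return the candidate (cells, model_bits) with the shortest total description."""
    return min(candidates, key=lambda c: description_length(c[0], c[1], sample))
```

In this reading, the candidate with the shortest total description plays the role of the MDL algorithm's output; the paper's sample size bounds govern when such a choice is probably approximately correct in the sense of the proposed criterion.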

Keywords

Learning from examples, stochastic rules, PAC model, MDL principle, stochastic decision lists, stochastic decision trees, sample complexity

References

  1. Abe, N. & Warmuth, M. (1990). On the computational complexity of approximating distributions by probabilistic automata. Proceedings of the Third Workshop on Computational Learning Theory (pp. 52–66), Rochester, NY: Morgan Kaufmann.
  2. Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Contr., AC-19, 716–723.
  3. Angluin, D. & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2, 343–370.
  4. Barron, A.R. (1985). Logically smooth density estimation. Ph.D. dissertation, Dept. of Electrical Eng., Stanford Univ.
  5. Barron, A.R. & Cover, T.M. (1991). Minimum complexity density estimation. IEEE Trans. on IT, IT-37, 1034–1054.
  6. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1987). Occam's razor. Information Processing Letters, 24, 377–380.
  7. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1989). Learnability and Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.
  8. Cesa-Bianchi, N. (1990). Learning the distribution in the extended PAC model. Proceedings of the First International Workshop on Algorithmic Learning Theory (pp. 236–246), Tokyo, Japan: Japanese Society for Artificial Intelligence.
  9. Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 82, 247–251.
  10. Fisher, R.A. (1956). Statistical Methods and Scientific Inference. Oliver and Boyd.
  11. Gallager, R.G. (1986). Information theory and reliable communication. New York: Wiley.
  12. Haussler, D. (1989). Generalizing the PAC model for neural net and other learning applications. Technical Report UCSC CRL-89-30, Univ. of California at Santa Cruz.
  13. Haussler, D. (1990). Decision theoretic generalizations of the PAC learning model. Proceedings of the First International Workshop on Algorithmic Learning Theory (pp. 21–41), Tokyo, Japan: Japanese Society for Artificial Intelligence.
  14. Haussler, D. & Long, P. (1990). A generalization of Sauer's lemma. Technical Report UCSC CRL-90-15, Univ. of California at Santa Cruz.
  15. Kearns, M. & Li, M. (1988). Learning in the presence of malicious errors. Proceedings of the 20th Annual ACM Symposium on Theory of Computing (pp. 267–279), Chicago, IL.
  16. Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. Proceedings of the 31st Symposium on Foundations of Computer Science (pp. 382–391), St. Louis, Missouri.
  17. Kraft, C. (1949). A device for quantizing, grouping, and coding amplitude modulated pulses. M.S. thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
  18. Kraft, C. (1955). Some conditions for consistency and uniform consistency of statistical procedures. University of California Publications in Statistics, 2, 125–141.
  19. Kullback, S. (1967). A lower bound for discrimination in terms of variation. IEEE Trans. on IT, IT-13, 126–127.
  20. Laird, P.D. (1988). Efficient unsupervised learning. Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
  21. Pednault, E.P.D. (1989). Some experiments in applying inductive inference principles to surface reconstruction. Proceedings of the 11th International Joint Conference on Artificial Intelligence (pp. 1603–1609), Morgan Kaufmann.
  22. Pitman, E.J.G. (1979). Some Basic Theory for Statistical Inference. London: Chapman and Hall.
  23. Pitt, L. & Valiant, L.G. (1988). Computational limitation on learning from examples. Journal of the ACM, 35, 965–984.
  24. Quinlan, J.R. & Rivest, R.L. (1989). Inferring decision trees using the minimum description length criterion. Information and Computation, 80, 227–248.
  25. Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
  26. Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11, 416–431.
  27. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. on IT, IT-30, 629–636.
  28. Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
  29. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. World Scientific Series in Computer Science, 15.
  30. Rivest, R.L. (1987). Learning decision lists. Machine Learning, 2, 229–246.
  31. Schreiber, F. (1985). The Bayes Laplace statistic of the multinomial distributions. AEU, 39, 293–298.
  32. Segen, J. (1989). From features to symbols: Learning relational shape. In J.C. Simon (Ed.), Pixels to features. Elsevier Science Publishers B.V.
  33. Sloan, R. (1988). Types of noise in data for concept learning. Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
  34. Solomonoff, R.J. (1964). A formal theory of inductive inference, Part I. Information and Control, 7, 1–22.
  35. Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
  36. Valiant, L.G. (1985). Learning disjunctions of conjunctions. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 560–566), Los Angeles, CA: Morgan Kaufmann.
  37. Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2), 264–280.
  38. Wallace, C.S. & Boulton, D.M. (1968). An information measure for classification. Computer Journal, 11, 185–194.
  39. Yamanishi, K. (1989). Inductive inference and learning criterion of stochastic classification rules with hierarchical parameter structures. Proceedings of the 12th Symposium of Information Theory and Its Applications, 2 (pp. 707–712) (in Japanese), Inuyama, Japan.
  40. Yamanishi, K. (1990a). Inferring optimal decision lists from stochastic data using the minimum description length criterion. Presented at 1990 IEEE International Symposium on Information Theory, San Diego, CA.
  41. Yamanishi, K. (1990b). A learning criterion for stochastic rules. Proceedings of the Third Annual Workshop on Computational Learning Theory (pp. 67–81), Rochester, NY: Morgan Kaufmann.

Copyright information

© Kluwer Academic Publishers 1992

Authors and Affiliations

  • Kenji Yamanishi
  1. C&C Information Technology Research Labs., NEC Corporation, Kawasaki, Kanagawa, Japan