## Abstract

This paper proposes a learning criterion for stochastic rules. The criterion extends Valiant's PAC (Probably Approximately Correct) learning model, which is a learning criterion for deterministic rules. Stochastic rules here are rules that probabilistically assign a class from a set of classes, {*Y*}, to each attribute vector *X*. The proposed criterion is based on the idea that learning stochastic rules may be regarded as probably approximately correct identification of conditional probability distributions over classes for given input attribute vectors. An algorithm based on the MDL (Minimum Description Length) principle, called an MDL algorithm, is used for learning stochastic rules. Specifically, for stochastic rules with finite partitioning (each of which is specified by a finite number of disjoint cells of the domain and a probability parameter vector associated with them), this paper derives target-dependent upper bounds and worst-case upper bounds on the sample size required by the MDL algorithm to learn stochastic rules with given accuracy and confidence. Based on these sample size bounds, this paper proves polynomial-sample-size learnability of stochastic decision lists (newly proposed in this paper as a stochastic analogue of Rivest's decision lists) with at most *k* literals in each decision (*k* fixed), and polynomial-sample-size learnability of stochastic decision trees (a stochastic analogue of decision trees) of depth at most *k*. Sufficient conditions for polynomial-sample-size learnability and polynomial-time learnability of any class of stochastic rules with finite partitioning are also derived.
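To make the notion of a stochastic rule with finite partitioning concrete, the following is a minimal Python sketch of a stochastic decision list over Boolean attributes with two classes. It is an illustration under stated assumptions, not the paper's construction: the class name `StochasticDecisionList`, the helpers `data_code_length` and `two_part_length`, and the `model_bits` argument are all hypothetical, and the abstract does not specify the encoding the MDL algorithm actually uses. The sketch only shows the generic two-part idea of adding a model cost to the negative log-likelihood of the sample.

```python
import math
from typing import List, Sequence, Tuple

# A literal is (attribute index, required value); a term is a conjunction of literals.
Literal = Tuple[int, bool]
Term = Tuple[Literal, ...]
Example = Tuple[Sequence[bool], int]  # (attribute vector x, class label y in {0, 1})


class StochasticDecisionList:
    """Hypothetical sketch: an ordered list of (term, p) pairs plus a default p.

    The terms induce a finite partition of the domain: cell i holds the inputs
    whose first satisfied term is term i, and the last cell holds the rest.
    Each cell carries one probability parameter p = P(Y = 1 | cell).
    """

    def __init__(self, decisions: Sequence[Tuple[Term, float]], default_p: float):
        self.decisions: List[Tuple[Term, float]] = list(decisions)
        self.default_p = default_p

    def cell(self, x: Sequence[bool]) -> int:
        """Index of the first term satisfied by x; len(decisions) is the default cell."""
        for i, (term, _) in enumerate(self.decisions):
            if all(x[j] == v for j, v in term):
                return i
        return len(self.decisions)

    def prob_one(self, x: Sequence[bool]) -> float:
        """Conditional probability P(Y = 1 | x) assigned by the rule."""
        i = self.cell(x)
        return self.decisions[i][1] if i < len(self.decisions) else self.default_p


def data_code_length(rule: StochasticDecisionList, sample: Sequence[Example]) -> float:
    """Negative log-likelihood of the labels given the inputs, in bits."""
    total = 0.0
    for x, y in sample:
        p = rule.prob_one(x)
        q = p if y == 1 else 1.0 - p
        if q == 0.0:
            return math.inf  # the rule rules out an observed label
        total -= math.log2(q)
    return total


def two_part_length(rule: StochasticDecisionList, sample: Sequence[Example],
                    model_bits: float) -> float:
    """Generic two-part description length: data cost plus an assumed model cost.

    model_bits is a placeholder supplied by the caller; the paper's actual
    model encoding is not given in the abstract.
    """
    return data_code_length(rule, sample) + model_bits
```

Under these assumptions, an MDL-style learner would evaluate `two_part_length` for each candidate rule and return a minimizer. Restricting each term to at most *k* literals keeps the number of candidate terms polynomial in the number of attributes.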

### Keywords

Learning from examples, stochastic rules, PAC model, MDL principle, stochastic decision lists, stochastic decision trees, sample complexity

### References

- Abe, N. & Warmuth, M. (1990). On the computational complexity of approximating distributions by probabilistic automata. *Proceedings of the Third Workshop on Computational Learning Theory* (pp. 52–66), Rochester, NY: Morgan Kaufmann.
- Akaike, H. (1974). A new look at the statistical model identification. *IEEE Trans. Autom. Contr., AC-19*, 716–723.
- Angluin, D. & Laird, P. (1988). Learning from noisy examples. *Machine Learning*, 343–370.
- Barron, A.R. (1985). *Logically smooth density estimation*. Ph.D. dissertation, Dept. of Electrical Eng., Stanford Univ.
- Barron, A.R. & Cover, T.M. (1991). Minimum complexity density estimation. *IEEE Trans. on IT, IT-37*, 1034–1054.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1987). Occam's razor. *Information Processing Letters, 24*, 377–380.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M.K. (1989). Learnability and Vapnik-Chervonenkis dimension. *Journal of ACM, 36*, 929–965.
- Cesa-Bianchi, N. (1990). Learning the distribution in the extended PAC model. *Proceedings of the First International Workshop on Algorithmic Learning Theory* (pp. 236–246). Tokyo, Japan: Japanese Society for Artificial Intelligence.
- Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. *Information and Computation, 82*, 247–251.
- Fisher, R.A. (1956). *Statistical Methods and Scientific Inference*. Oliver and Boyd.
- Gallager, R.G. (1986). *Information theory and reliable communication*. New York: Wiley.
- Haussler, D. (1989). *Generalizing the PAC model for neural net and other learning applications*. Technical Report UCSC CRL-89-30, Univ. of California at Santa Cruz.
- Haussler, D. (1990). Decision theoretic generalizations of the PAC learning model. *Proceedings of the First International Workshop on Algorithmic Learning Theory* (pp. 21–41), Tokyo, Japan: Japanese Society for Artificial Intelligence.
- Haussler, D. & Long, P. (1990). *A generalization of Sauer's lemma*. Technical Report UCSC CRL-90-15, Univ. of California at Santa Cruz.
- Kearns, M. & Li, M. (1988). Learning in the presence of malicious errors. *Proceedings of the 20th Annual ACM Symposium on Theory of Computing* (pp. 267–279), Chicago, IL.
- Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. *Proceedings of the 31st Symposium on Foundations of Computer Science* (pp. 382–391), St. Louis, Missouri.
- Kraft, C. (1949). *A device for quantizing, grouping, and coding amplitude modulated pulses*. M.S. Thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
- Kraft, C. (1955). Some conditions for consistency and uniform consistency of statistical procedures. *University of California Publications in Statistics, 2*, 125–141.
- Kullback, S. (1967). A lower bound for discrimination in terms of variation. *IEEE Trans. on IT, IT-13*, 126–127.
- Laird, P.D. (1988). Efficient unsupervised learning. *Proceedings of the First Annual Workshop on Computational Learning Theory* (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
- Pednault, E.P.D. (1989). Some experiments in applying inductive inference principles to surface reconstruction. *Proceedings of the 11th International Joint Conference on Artificial Intelligence* (pp. 1603–1609), Morgan Kaufmann.
- Pitman, E.J.G. (1979). *Some Basic Theory for Statistical Inference*. London: Chapman and Hall.
- Pitt, L. & Valiant, L.G. (1988). Computational limitation on learning from examples. *Journal of ACM, 35*, 965–984.
- Quinlan, J.R. & Rivest, R.L. (1989). Inferring decision trees using the minimum description length criterion. *Information and Computation, 80*, 227–248.
- Rissanen, J. (1978). Modeling by shortest data description. *Automatica, 14*, 465–471.
- Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. *Annals of Statistics, 11*, 416–431.
- Rissanen, J. (1984). Universal coding, information, prediction, and estimation. *IEEE Trans. on IT, IT-30*, 629–636.
- Rissanen, J. (1986). Stochastic complexity and modeling. *Annals of Statistics, 14*, 1080–1100.
- Rissanen, J. (1989). *Stochastic complexity in statistical inquiry*, World Scientific, Series in Computer Science, 15.
- Rivest, R.L. (1987). Learning decision lists. *Machine Learning, 2*, 229–246.
- Schreiber, F. (1985). The Bayes Laplace statistic of the multinomial distributions. *AEU, 39*, 293–298.
- Segen, J. (1989). *From features to symbols: Learning relational shape*. In J.C. Simon (Ed.), *Pixels to features*. Elsevier Science Publishers B.V.
- Sloan, R. (1988). Types of noise in data for concept learning. In *Proceedings of the First Annual Workshop on Computational Learning Theory* (pp. 91–96), Cambridge, MA: Morgan Kaufmann.
- Solomonoff, R.J. (1964). A formal theory of inductive inference, Part 1. *Information and Control, 7*, 1–22.
- Valiant, L.G. (1984). A theory of the learnable. *Communications of the ACM, 27*, 1134–1142.
- Valiant, L.G. (1985). Learning disjunctions of conjunctions. *Proceedings of the Ninth International Joint Conference on Artificial Intelligence* (pp. 560–566), Los Angeles, CA: Morgan Kaufmann.
- Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. *Theory of Probability and its Applications, XVI*(2), 264–280.
- Wallace, C.S. & Boulton, D.M. (1968). An information measure for classification. *Computer Journal*, 185–194.
- Yamanishi, K. (1989). Inductive inference and learning criterion of stochastic classification rules with hierarchical parameter structures. *Proceedings of the 12th Symposium of Information Theory and Its Applications, 2* (pp. 707–712) (in Japanese), Inuyama, Japan.
- Yamanishi, K. (1990a). Inferring optimal decision lists from stochastic data using the minimum description length criterion. Presented at *1990 IEEE International Symposium on Information Theory*, San Diego, CA.
- Yamanishi, K. (1990b). A learning criterion for stochastic rules. *Proceedings of the Third Annual Workshop on Computational Learning Theory* (pp. 67–81), Rochester, NY: Morgan Kaufmann.