A learning criterion for stochastic rules

DOI: 10.1007/BF00992676

Cite this article as: Yamanishi, K. Mach Learn (1992) 9: 165. doi:10.1007/BF00992676
Abstract

This paper proposes a learning criterion for stochastic rules. The criterion is developed by extending Valiant's PAC (Probably Approximately Correct) learning model, which is a learning criterion for deterministic rules. Stochastic rules here are rules that probabilistically assign one of a number of classes, {Y}, to each attribute vector X. The proposed criterion is based on the idea that learning stochastic rules may be regarded as probably approximately correct identification of conditional probability distributions over classes for given input attribute vectors. An algorithm based on the MDL (Minimum Description Length) principle, called an MDL algorithm, is used for learning stochastic rules. Specifically, for stochastic rules with finite partitioning (each of which is specified by a finite number of disjoint cells of the domain and a probability parameter vector associated with them), this paper derives target-dependent upper bounds and worst-case upper bounds on the sample size required by the MDL algorithm to learn stochastic rules with given accuracy and confidence. Based on these sample size bounds, this paper proves polynomial-sample-size learnability of stochastic decision lists (newly proposed in this paper as a stochastic analogue of Rivest's decision lists) with at most k literals (k is fixed) in each decision, and polynomial-sample-size learnability of stochastic decision trees (a stochastic analogue of decision trees) of depth at most k. Sufficient conditions for polynomial-sample-size learnability and polynomial-time learnability of any class of stochastic rules with finite partitioning are also derived.
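To make the idea of a stochastic rule with finite partitioning concrete, the following is a minimal sketch, not the paper's algorithm. The abstract does not give the exact description-length formula, so this sketch assumes a common two-part code: the negative log-likelihood of the class labels under Laplace-smoothed per-cell estimates, plus (1/2) log2 n bits per free parameter. All names (`fit_cells`, `description_length`, `mdl_select`, and the toy partitions) are hypothetical.

```python
import math
from collections import Counter


def fit_cells(samples, cell_of, num_classes):
    """Estimate a class-probability vector for each cell (Laplace smoothing)."""
    counts = {}
    for x, y in samples:
        counts.setdefault(cell_of(x), Counter())[y] += 1
    probs = {}
    for cell, c in counts.items():
        total = sum(c.values())
        probs[cell] = [(c[y] + 1) / (total + num_classes)
                       for y in range(num_classes)]
    return probs


def description_length(samples, cell_of, num_cells, num_classes):
    """Assumed two-part code length in bits: data cost plus parameter cost."""
    probs = fit_cells(samples, cell_of, num_classes)
    data_bits = -sum(math.log2(probs[cell_of(x)][y]) for x, y in samples)
    free_params = num_cells * (num_classes - 1)
    model_bits = 0.5 * free_params * math.log2(len(samples))
    return data_bits + model_bits


def mdl_select(samples, candidates, num_classes):
    """Return the name of the candidate partition with the shortest code."""
    return min(
        candidates,
        key=lambda name: description_length(
            samples, candidates[name][0], candidates[name][1], num_classes),
    )


# Toy data: the class depends stochastically on x[0] only (9-to-1 odds).
samples = []
for x1 in (0, 1):
    samples += [((0, x1), 1)] * 9 + [((0, x1), 0)] * 1
    samples += [((1, x1), 0)] * 9 + [((1, x1), 1)] * 1

# Each candidate is (function mapping x to a cell id, number of cells).
candidates = {
    "one_cell": (lambda x: 0, 1),
    "split_x0": (lambda x: x[0], 2),
    "split_both": (lambda x: (x[0], x[1]), 4),
}

best = mdl_select(samples, candidates, num_classes=2)
print(best)
```

In this toy setting the two-cell partition should win: the single cell fits the labels poorly, while the four-cell partition pays for parameters that do not improve the fit enough.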

Keywords: Learning from examples, stochastic rules, PAC model, MDL principle, stochastic decision lists, stochastic decision trees, sample complexity

An extended abstract of this paper appeared in Proceedings of the 3rd Annual Workshop on Computational Learning Theory.

© Kluwer Academic Publishers 1992

Authors and Affiliations

1. C&C Information Technology Research Labs., NEC Corporation, Kawasaki, Kanagawa, Japan