# Stochastic Finite Learning

**DOI:** https://doi.org/10.1007/978-1-4899-7687-1_793

## Motivation and Background

Assume that we are given a concept class \(\mathcal{C}\) and should design a learner for it. Next, suppose we already know or could prove \(\mathcal{C}\) not to be learnable in the model of PAC learning. But it can be shown that \(\mathcal{C}\) is learnable within Gold’s (1967) model of inductive inference or learning in the limit. Thus, we can design a learner behaving as follows. When fed any of the data sequences allowed in this model, it converges in the limit to a hypothesis correctly describing the target concept. Nothing more is known. Let *M* be any fixed learner. If (*d*_{ n })_{n ≥ 0} is any data sequence, then the *stage of convergence* is the least integer *m* such that *M*(*d*_{ m }) = *M*(*d*_{ n }) for all *n* ≥ *m*, provided such an *m* exists (and infinite otherwise). In general, it is undecidable whether or not the learner has already reached the stage of convergence, and even if this is decidable for a particular concept class, deciding it may be practically infeasible. This *uncertainty* may not be tolerable in many applications.

When we tried to overcome this uncertainty, the idea of stochastic finite learning emerged. Clearly, in general nothing can be done, since in Gold’s (1967) model the learner has to learn from any data sequence. So for every concept that needs more than one datum to converge, one can easily construct a sequence where the first datum is repeated very often and where therefore the learner does not find the right hypothesis within the given bound. However, such data sequences seem unnatural. Therefore, we looked at data sequences that are generated with respect to some probability distribution taken from a prespecified class of probability distributions and computed the expected *total learning time*, i.e., the expected time until the learner reaches the stage of convergence (cf. Erlebach et al. 2001; Zeugmann 1998). Clearly, one is then also interested in knowing how often the expected total learning time is exceeded. In general, Markov’s inequality can be applied to obtain the relevant tail bounds. However, if the learner is known to be rearrangement-independent and conservative, then we always get *exponentially* shrinking tail bounds (cf. Rossmanith and Zeugmann 2001). A learner is said to be *rearrangement-independent* if its output depends exclusively on the range and length of its input (but not on the order) (cf., e.g., Lange and Zeugmann (1996) and the references therein). Furthermore, a learner is *conservative*, if it exclusively performs mind changes that can be justified by an inconsistency of the abandoned hypothesis with the data received so far (see Angluin (1980b) for a formal definition).

Combining these ideas results in stochastic finite learning. A stochastic finite learner is successively fed data about the target concept. Note that these data are generated randomly with respect to one of the probability distributions from the class of underlying probability distributions. Additionally, the learner takes a confidence parameter *δ* as input. But in contrast to learning in the limit, the learner itself decides how many examples it wants to read. Then it computes a hypothesis, outputs it, and stops. The hypothesis output is correct for the target with probability at least 1 −*δ*.

The description given above explains how it works, but not why it does. Intuitively, the stochastic finite learner simulates the limit learner until an upper bound for twice the expected total number of examples needed until convergence has been met. Assuming this to be true, by Markov’s inequality the limit learner has now converged with probability at least 1∕2. All that is left is to decrease the probability of failure. This can be done by applying Markov’s inequality again, i.e., increasing the sample complexity by a factor of 1∕*δ* results in a confidence of 1 − *δ* for having reached the stage of convergence.
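The Markov-based construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the construction from the cited papers: the names `stochastic_finite_learner`, `limit_learner`, and `expected_convergence_bound` are hypothetical, and the sketch assumes that a valid upper bound on the expected stage of convergence has already been derived from prior knowledge.

```python
import math
from typing import Callable, Iterator, List, TypeVar

X = TypeVar("X")  # type of a single example
H = TypeVar("H")  # type of a hypothesis


def stochastic_finite_learner(
    limit_learner: Callable[[List[X]], H],
    examples: Iterator[X],
    expected_convergence_bound: float,
    delta: float,
) -> H:
    """Read a self-chosen number of examples, output one hypothesis, stop.

    Assumption: expected_convergence_bound is an upper bound on the expected
    stage of convergence of limit_learner under the (unknown) distribution.
    By Markov's inequality, the stage of convergence exceeds bound / delta
    with probability at most delta.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # The learner itself decides how many examples it wants to read.
    budget = math.ceil(expected_convergence_bound / delta)
    seen = [next(examples) for _ in range(budget)]
    # Output the limit learner's current hypothesis and stop.
    return limit_learner(seen)


def max_learner(data: List[int]) -> int:
    """Toy limit learner: guess the largest value seen so far."""
    return max(data)
```

With a bound of 1.0 and *δ* = 0.25, the sketch reads exactly four examples before answering. The budget grows linearly in 1∕*δ* here; for rearrangement-independent and conservative learners, the exponentially shrinking tail bounds mentioned earlier allow a budget that grows only logarithmically in 1∕*δ*.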

Note that the stochastic finite learner has to calculate an upper bound for the stage of convergence. This is precisely the point where we need the parameterization of the class \(\mathcal{D}\) of underlying probability distributions. Then a bit of *prior knowledge* must be provided in the form of suitable upper and/or lower bounds for the parameters involved. A more serious difficulty is to incorporate the unknown target concept into this estimate. This step depends on the concrete learning problem on hand and requires some extra effort.

It should also be noted that our approach may be beneficial even if the considered concept class is PAC learnable.

## Definition

Let \(\mathcal{D}\) be a set of probability distributions on the learning domain, let \(\mathcal{C}\) be a concept class, \(\mathcal{H}\) a hypothesis space for \(\mathcal{C}\), and let *δ* ∈ (0, 1). The pair \((\mathcal{C},\mathcal{D})\) is said to be *stochastically finitely learnable with δ-confidence* with respect to \(\mathcal{H}\) iff there is a learner *M* that for every \(c \in \mathcal{C}\) and every \(D \in \mathcal{D}\) performs as follows. Given any random data sequence *θ* for *c* generated according to *D*, *M* stops after having seen a finite number of examples and outputs a single hypothesis \(h \in \mathcal{H}\). With probability at least 1 −*δ* (with respect to distribution *D*), *h* has to be correct, i.e., *c* = *h*.

If stochastic finite learning can be achieved with *δ*-confidence for every *δ* > 0, then we say that \((\mathcal{C},\mathcal{D})\) is stochastically finitely learnable *with high confidence*.

## Detail

Note that there are subtle differences between our model and PAC learning. By its definition, stochastic finite learning is not completely distribution independent. A bit of *additional knowledge* concerning the underlying probability distributions is required. Thus, from that perspective, stochastic finite learning is weaker than the PAC model. On the other hand, we do *not* measure the quality of the hypothesis with respect to the underlying probability distribution. Instead, we require the hypothesis computed to be exactly correct with high probability. Note that exact identification with high confidence has been considered within the PAC paradigm, too (cf., e.g., Goldman et al. 1993). Conversely, we also can easily relax the requirement to learn *probably exactly correct* but whenever possible we shall not do it.

Furthermore, in the uniform PAC model as introduced in Valiant (1984), the sample complexity depends exclusively on the VC dimension of the target concept class and the error and confidence parameters *ɛ* and *δ*, respectively. This model has been generalized by allowing the sample size to depend on the concept complexity, too (cf., e.g., Blumer et al. 1989; Haussler et al. 1991). Provided no upper bound for the concept complexity of the target concept is given, such PAC learners decide themselves how many examples they wish to read (cf. Haussler et al. 1991). This feature is also adopted in our setting of stochastic finite learning. However, all variants of efficient PAC learning we are aware of require that all hypotheses from the relevant hypothesis space are uniformly polynomially evaluable. Though this requirement may be necessary in some cases to achieve (efficient) stochastic finite learning, it is not necessary in general as we shall see below.

In the following, we provide two sample applications of stochastic finite learning. We always choose as hypothesis space the concept class \(\mathcal{C}\) itself.

## Learning Monomials

Let *X*_{ n } = { 0, 1}^{ n } be the learning domain, let \(\mathcal{L}_{n} =\{ x_{1},\bar{x}_{1},x_{2},\bar{x}_{2},\ldots,x_{n},\bar{x}_{n}\}\) (set of literals) and consider the class \(\mathcal{C}_{n}\) of all concepts describable by a conjunction of literals. As usual, we refer to any conjunction of literals as a *monomial*. A monomial *m* describes a concept *c* ⊆ *X*_{ n } in the obvious way: the concept contains exactly those binary vectors for which the monomial evaluates to 1. For a monomial *m*, let *#*(*m*) denote its length, i.e., the number of literals in it.

The basic ingredient to the stochastic finite learner is Haussler’s (1987) Wholist algorithm, and thus the main emphasis is on the resulting complexity. The Wholist algorithm can also be used to achieve PAC learning of the class \(\mathcal{C}_{n}\), and the resulting sample complexity is *O*(1∕*ɛ* ⋅ (*n* + ln(1∕*δ*))) for all *ɛ*, *δ* ∈ (0, 1]. Since the Wholist algorithm learns from positive examples only, it is meaningful to study the learnability of \(\mathcal{C}_{n}\) from positive examples only. So, the stage of convergence is *not* decidable.

Since the Wholist algorithm immediately converges for the empty concept, we exclude it from our considerations. That is, we consider concepts \(c \in \mathcal{C}_{n}\) described by a monomial \(m =\bigwedge _{ j=1}^{\#(m)}\ell_{i_{j}}\) such that *k* = *k*(*m*) = *n* − *#*(*m*) > 0. A literal not contained in *m* is said to be irrelevant. Bit *i* is said to be irrelevant for monomial *m* if neither *x*_{ i } nor \(\bar{x}_{i}\) appears in *m*. There are 2^{ k } positive examples for *c*. For the sake of presentation, we assume these examples to be *binomially distributed* with parameter *p*. So, in a random positive example, all entries corresponding to irrelevant bits are selected independently of one another. With probability *p*, such an entry is a 1, and with probability 1 − *p*, it is a 0. Only distributions where 0 < *p* < 1 are considered, since otherwise exact identification is impossible. Now, one can show that the expected number of examples needed by the Wholist algorithm until convergence is bounded by \(\lceil \log _{\psi }k(m)\rceil +\tau +2\), where \(\psi :=\min \left \{ \frac{1} {1-p},\; \frac{1} {p}\right \}\) and \(\tau :=\max \left \{ \frac{p} {1-p},\, \frac{1-p} {p} \right \}\).
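To make the setting concrete, the following Python sketch shows the usual textbook form of the Wholist idea (start with all 2*n* literals and delete every literal contradicted by a positive example) together with the bound on the expected number of examples stated above. The representation of literals as pairs and the function names `wholist` and `expected_examples_bound` are assumptions made for illustration; bits are indexed from 0 here.

```python
import math
from typing import FrozenSet, Iterable, Sequence, Tuple

# Hypothetical encoding: (i, True) stands for x_i, (i, False) for its negation.
Literal = Tuple[int, bool]


def wholist(n: int, positives: Iterable[Sequence[int]]) -> FrozenSet[Literal]:
    """Return the literals that survive all positive examples."""
    hypothesis = {(i, sign) for i in range(n) for sign in (True, False)}
    for example in positives:
        for i, bit in enumerate(example):
            # A 0 at position i contradicts x_i; a 1 contradicts its negation.
            hypothesis.discard((i, bit == 0))
    return frozenset(hypothesis)


def expected_examples_bound(k: int, p: float) -> float:
    """Evaluate ceil(log_psi k) + tau + 2 for k irrelevant bits, 0 < p < 1."""
    if k < 1 or not 0.0 < p < 1.0:
        raise ValueError("need k >= 1 and 0 < p < 1")
    psi = min(1.0 / (1.0 - p), 1.0 / p)
    tau = max(p / (1.0 - p), (1.0 - p) / p)
    return math.ceil(math.log2(k) / math.log2(psi)) + tau + 2
```

For instance, with *n* = 4 and the positive examples 1001 and 1100, the surviving literals are the first variable and the negation of the third one. Each mind change of this learner is triggered by an example contradicting a literal of the current hypothesis, which matches the notion of a conservative learner used in this entry.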

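Let *CON* denote a random variable for the stage of convergence. Since the Wholist algorithm is rearrangement-independent and conservative, we can conclude (cf. Rossmanith and Zeugmann 2001) that the tail of *CON* shrinks exponentially, i.e., \(\Pr (\mathit{CON} > 2t \cdot \mbox{ E}[\mathit{CON}]) \leq 2^{-t}\) for all natural numbers *t* ≥ 1.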

Finally, in order to obtain a stochastic finite learner, we reasonably assume that *prior knowledge* is provided by parameters *p*_{low} and *p*_{up} such that *p*_{low} ≤ *p* ≤ *p*_{up} for the true parameter *p*. Binomial distributions fulfilling this requirement are called (*p*_{low}, *p*_{up})-*admissible distributions*. Let \(\mathcal{D}_{n}[p_{\mathrm{low}},p_{\mathrm{up}}]\) denote the set of such distributions on *X*_{ n }. Then one can show the following.

*Let* \(0 < p_{\mathrm{low}} \leq p_{\mathrm{up}} < 1\) *and* \(\psi :=\min \{ \frac{1} {1-p_{\mathrm{low}}},\; \frac{1} {p_{\mathrm{up}}} \}\). *Then* \((\mathcal{C}_{n},\mathcal{D}_{n}[p_{\mathrm{low}},p_{\mathrm{up}}])\) *is stochastically finitely learnable with high confidence from positive examples. To achieve δ-confidence, no more than* \(O(\log _{2}(1/\delta ) \cdot \log _{\psi }n)\) *many examples are necessary.*
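As a rough illustration of how such prior knowledge can be turned into a concrete number of examples, the sketch below evaluates one possible sample size. It is not taken from the cited proofs: the function name `monomial_sample_size` is hypothetical, the number of irrelevant bits is bounded by *n*, the parameter *τ* is bounded via *p*_{low} and *p*_{up}, and the constants assume an exponentially shrinking tail of the form Pr(*CON* > 2*t* ⋅ E[*CON*]) ≤ 2^{−t}. Only the order of growth, log_{2}(1∕*δ*) ⋅ log_{ψ}*n*, is claimed by the result above.

```python
import math


def monomial_sample_size(n: int, p_low: float, p_up: float, delta: float) -> int:
    """Number of positive examples to read before stopping (a sketch)."""
    if not 0.0 < p_low <= p_up < 1.0:
        raise ValueError("need 0 < p_low <= p_up < 1")
    if n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 1 and 0 < delta < 1")
    # Worst-case psi over all admissible p, as in the statement above.
    psi = min(1.0 / (1.0 - p_low), 1.0 / p_up)
    # Assumed worst-case tau: p/(1-p) is largest at p_up, (1-p)/p at p_low.
    tau = max(p_up / (1.0 - p_up), (1.0 - p_low) / p_low)
    # Bound on the expected stage of convergence, using k(m) <= n.
    expected_bound = math.ceil(math.log2(n) / math.log2(psi)) + tau + 2
    # Assumed tail bound: t rounds of twice the expectation fail w.p. <= 2^-t.
    rounds = math.ceil(math.log2(1.0 / delta))
    return math.ceil(2 * expected_bound) * rounds
```

For *n* = 16, *p*_{low} = *p*_{up} = 1∕2, and *δ* = 1∕4, this evaluates to 28 examples. Halving *δ* adds only one further round, which is where the logarithmic dependence on 1∕*δ* comes from.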

Therefore, we have achieved an exponential improvement on the number of examples needed for learning (compared to the PAC bound displayed above), and, in addition, our stochastic finite learner exactly identifies the target. Note that this result is due to Reischuk and Zeugmann; however, we refer the reader to Zeugmann (2006) for the relevant proofs.

The results obtained for learnability from positive examples only can be extended *mutatis mutandis* to the case when the learner is fed positive and negative examples (cf. Zeugmann (2006) for details).

## Learning Pattern Languages

The pattern languages have been introduced by Angluin (1980a) and can be informally defined as follows. Let \(\Sigma =\{ 0,1,\ldots \}\) be any finite alphabet containing at least two elements. Let *X* = { *x*_{0}, *x*_{1}, *…*} be a countably infinite set of variables such that \(\Sigma \cap X =\emptyset\). *Patterns* are nonempty strings over \(\Sigma \cup X\), e.g., 01, 0*x*_{0}111, 1*x*_{0}*x*_{0}0*x*_{1}*x*_{2}*x*_{0} are patterns. The length of a string \(s \in \Sigma ^{{\ast}}\) and of a pattern *π* is denoted by | *s* | and | *π* |, respectively. A pattern *π* is in *canonical form* provided that if *k* is the number of different variables in *π* then the variables occurring in *π* are precisely *x*_{0}, *…*, *x*_{k−1}. Moreover, for every *j* with 0 ≤ *j* < *k* − 1, the leftmost occurrence of *x*_{ j } in *π* is to the left of the leftmost occurrence of *x*_{j+1}. The examples given above are patterns in canonical form.
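The canonical form can be illustrated by a short Python sketch that renames variables in the order of their first occurrence. The encoding is an assumption made for illustration: terminal symbols are one-character strings, variables are integers (so `0` stands for *x*_{0}), and the names `to_canonical` and `is_canonical` are hypothetical.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical encoding: str = terminal symbol, int = variable index.
Symbol = Union[str, int]


def to_canonical(pattern: Sequence[Symbol]) -> List[Symbol]:
    """Rename variables to x_0, x_1, ... in order of first occurrence."""
    renaming: Dict[int, int] = {}
    result: List[Symbol] = []
    for symbol in pattern:
        if isinstance(symbol, int):
            if symbol not in renaming:
                renaming[symbol] = len(renaming)
            result.append(renaming[symbol])
        else:
            result.append(symbol)
    return result


def is_canonical(pattern: Sequence[Symbol]) -> bool:
    """A pattern is canonical iff the renaming leaves it unchanged."""
    return to_canonical(pattern) == list(pattern)
```

Under this encoding, the pattern 1*x*_{0}*x*_{0}0*x*_{1}*x*_{2}*x*_{0} from the text is written `["1", 0, 0, "0", 1, 2, 0]` and is reported as canonical, whereas `[2, "0", 5, 2]` is renamed to `[0, "0", 1, 0]`.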

If *k* is the number of different variables in *π*, then we refer to *π* as a *k*-*variable pattern*. For example, *x*0*xx* is a one-variable pattern, and *x*_{0}10*x*_{1}*x*_{0} is a two-variable pattern. If *π* is a pattern, then the language generated by *π* is the set of all strings that can be obtained from *π* by substituting a *nonnull* element \(s_{i} \in \Sigma ^{{\ast}}\) for each occurrence of the variable symbol *x*_{ i } in *π*, for all *i* ≥ 0. We use *L*(*π*) to denote the language generated by pattern *π*. So, 1011 and 1001010 belong to *L*(*x*0*xx*) (by substituting 1 and 10 for *x*, respectively), and 010110 is an element of *L*(*x*_{0}10*x*_{1}*x*_{0}) (by substituting 0 for *x*_{0} and 11 for *x*_{1}). Note that even the class of all one-variable patterns has infinite VC dimension (cf. Mitchell et al. 1999).
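The definition of *L*(*π*) can be checked on small instances with a naive backtracking membership test. The sketch below is for illustration only and plays no role in the learners discussed in this entry; the function name `generates` and the encoding (terminals as one-character strings, variables as integers) are assumptions. Its running time can be exponential in the worst case, which is in line with the known hardness of the membership problem for pattern languages.

```python
from typing import Dict, Optional, Sequence, Union

# Hypothetical encoding: str = terminal symbol, int = variable index.
Symbol = Union[str, int]


def generates(
    pattern: Sequence[Symbol],
    s: str,
    binding: Optional[Dict[int, str]] = None,
) -> bool:
    """Decide whether s is in L(pattern) using nonnull substitutions only."""
    binding = {} if binding is None else binding
    if not pattern:
        return s == ""
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):
        # Terminal symbols must match the input literally.
        return s.startswith(head) and generates(rest, s[len(head):], binding)
    if head in binding:
        # Every occurrence of a variable receives the same substitution.
        value = binding[head]
        return s.startswith(value) and generates(rest, s[len(value):], binding)
    # Try every nonempty prefix of s as the substitution for a new variable.
    for end in range(1, len(s) + 1):
        if generates(rest, s[end:], {**binding, head: s[:end]}):
            return True
    return False
```

With *x*0*xx* encoded as `[0, "0", 0, 0]`, the calls `generates([0, "0", 0, 0], "1011")` and `generates([0, "0", 0, 0], "1001010")` succeed, matching the examples given in the text, while the string 1010 is rejected.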

Reischuk and Zeugmann (2000) designed a stochastic finite learner for the class of all one-variable pattern languages that runs in time *O*( | *π* | log(1∕*δ*)) for all meaningful distributions and learns from positive data only. That is, all data fed to the learner belong to the target pattern language. By a meaningful distribution, essentially the following is meant. The expected length of an example should be finite, and the distribution should allow the target pattern to be learned. This is then expressed by fixing some suitable parameters. It should be noted that the algorithm is highly practical, and a modification of it also works for the case that *empty* substitutions are allowed. Though this seems to be a minor modification, it is *not*. The learnability results for pattern languages resulting from a definition that also allows for empty substitutions differ considerably from the case where only nonnull substitutions are admitted (cf. Reidenbach 2006, 2008).

For the class of all pattern languages, one can also provide a stochastic finite learner identifying the whole class from positive data. In order to arrive at a suitable class of distributions, essentially three requirements are made. The first one is the same as in the one-variable case, i.e., the expected length \(\mbox{ E}[\Lambda ]\) of a generated string should be finite. Second, the class of distributions is restricted to regular product distributions, i.e., for all variables the substitutions are identically distributed.

Third, two parameters *α* and *β* are introduced. The parameter *α* is the probability that a string of length 1 is substituted, and *β* is the conditional probability that two random strings that get substituted into *π* are identical under the condition that both have length 1. These two parameters ensure that the target pattern language is learnable at all. The stochastic finite learner then uses as *a priori knowledge* a lower bound *α*_{∗} for *α* and an upper bound *β*_{∗} for *β*. The basic ingredient to this stochastic finite learner is Lange and Wiehagen’s (1991) pattern language learning algorithm. Rossmanith and Zeugmann’s (2001) stochastic finite learner for the pattern languages runs in time \(O\left ((1/\alpha _{{\ast}}^{k})\mbox{ E}[\Lambda ]\log _{1/\beta _{{\ast}}}(k)\log _{2}(1/\delta )\right )\), where *k* is the number of different variables in the target pattern. So, with increasing *k* it becomes impractical.

Note that the two stochastic finite learners for the pattern languages can compute the expected stage of convergence, since the first string seen provides an upper bound for the length of the target pattern.

For further information, we refer the reader to Zeugmann (2006) and the references therein. More research is needed to explore the potential of stochastic finite learning. Such investigations should extend the learnable classes, should study the incorporation of noise, and should explore further possible classes of meaningful probability distributions.

## Recommended Reading

- Angluin D (1980a) Finding patterns common to a set of strings. J Comput Syst Sci 21(1):46–62
- Angluin D (1980b) Inductive inference of formal languages from positive data. Inf Control 45(2):117–135
- Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the Vapnik-Chervonenkis dimension. J ACM 36(4):929–965
- Erlebach T, Rossmanith P, Stadtherr H, Steger A, Zeugmann T (2001) Learning one-variable pattern languages very efficiently on average, in parallel, and by asking queries. Theor Comput Sci 261(1):119–156
- Gold EM (1967) Language identification in the limit. Inf Control 10(5):447–474
- Goldman SA, Kearns MJ, Schapire RE (1993) Exact identification of read-once formulas using fixed points of amplification functions. SIAM J Comput 22(4):705–726
- Haussler D (1987) Bias, version spaces and Valiant’s learning framework. In: Langley P (ed) Proceedings of the fourth international workshop on machine learning. Morgan Kaufmann, San Mateo, pp 324–336
- Haussler D, Kearns M, Littlestone N, Warmuth MK (1991) Equivalence of models for polynomial learnability. Inf Comput 95(2):129–161
- Lange S, Wiehagen R (1991) Polynomial-time inference of arbitrary pattern languages. New Gener Comput 8(4):361–370
- Lange S, Zeugmann T (1996) Set-driven and rearrangement-independent learning of recursive languages. Math Syst Theory 29(6):599–634
- Mitchell A, Scheffer T, Sharma A, Stephan F (1999) The VC-dimension of subclasses of pattern languages. In: Watanabe O, Yokomori T (eds) Proceedings of the 10th international conference on algorithmic learning theory, ALT ’99, Tokyo, Dec 1999. Lecture notes in artificial intelligence, vol 1720. Springer, pp 93–105
- Reidenbach D (2006) A non-learnable class of E-pattern languages. Theor Comput Sci 350(1):91–102
- Reidenbach D (2008) Discontinuities in pattern inference. Theor Comput Sci 397(1–3):166–193
- Reischuk R, Zeugmann T (2000) An average-case optimal one-variable pattern language learner. J Comput Syst Sci 60(2):302–335
- Rossmanith P, Zeugmann T (2001) Stochastic finite learning of the pattern languages. Mach Learn 44(1/2):67–91
- Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
- Zeugmann T (1998) Lange and Wiehagen’s pattern language learning algorithm: an average-case analysis with respect to its total learning time. Ann Math Artif Intell 23:117–145
- Zeugmann T (2006) From learning in the limit to stochastic finite learning. Theor Comput Sci 364(1):77–97. Special issue for ALT 2003