Abstract
The theoretical hardness of machine teaching has usually been analyzed for a range of concept languages under several variants of the teaching dimension: the minimum number of examples that a teacher needs to provide so that the learner identifies the concept. However, for languages where concepts have structure (and hence size), such as Turing-complete languages, a low teaching dimension can be achieved at the cost of using very large examples, which are hard for the learner to process. In this paper we introduce the teaching size, a more intuitive way of assessing the theoretical feasibility of teaching concepts for structured languages. In the most general case of universal languages, we show that, by focusing on the total size of a witness set rather than its cardinality, we can teach all total functions that are computable within some fixed time bound. We complement the theoretical results with a range of experimental results on a simple Turing-complete language, showing how teaching dimension and teaching size differ in practice. Quite remarkably, we found that witness sets are usually smaller than the programs they identify, which is an illuminating justification of why machine teaching from examples makes sense at all.
Introduction
Learning from examples when the concept class is rich and infinite is hard. Given a concept, the teacher is faced with the problem of finding a set of examples, called the witness set, that allows the learner to uniquely identify the concept. The teaching dimension of a concept is the minimum cardinality of a witness set for the concept. The teaching dimension of the whole class is the maximum teaching dimension of any concept in the class. When the class is rich and infinite this maximum will usually be unbounded (Freivalds et al. 1989; Shinohara and Miyano 1991; Freivalds et al. 1993; Goldman and Kearns 1995; Zhu et al. 2018). Even if we adopt a preference-based order on concepts (Gao et al. 2016), the maximum is still unbounded (it is only bounded in expectation, under strong sampling priors) and incomputable (Hernández-Orallo and Telle 2018).
In this paper we consider the teaching of universal languages with the goal of finding small teaching witnesses. The main insight comes when we realize that some concepts could be taught with very few examples, having low teaching dimension, but those examples could be enormously large. Not only does this deviate from any intuitive notion of ‘small teaching witness’, but it also ignores the prohibitive time needed for the teacher to find such a witness set and the time needed for the learner to identify the right concept, as this time depends on the size of the witness set. The teacher algorithm may not even terminate. As a result of all this, even when the teaching dimension is low, it can still be incomputable in general, and not very representative of how difficult teaching a concept is. In this paper we address this situation, and ask: can we define a new teaching metric that is more reasonably related to how easy it is to teach a concept, and use this to teach at least a substantial part of a universal language so that both teacher and learner become computable?
We will show a solution that answers this question in the positive, and describe experiments with an implementation for a specific universal language. Our main result is Theorem 2, showing that we can teach all total functions computable within some fixed time bound using the same learning prior. Several ingredients are needed, but all revolve around the size—the encoding length in bits—of the witness set, rather than its cardinality.
Firstly, we consider as learning prior a preference order parameterized by a time complexity function that gives preference to small programs that, within the time limit, have the desired behavior. Because we include the runtime in this order, this preference depends on the witness set. As a result, the teacher can find these small witness sets by using a search guided by a preference metric, derived—in three different ways—from program length and runtime, generating what we call the "teaching book". This setting leads to a teaching procedure where both teacher and learner are computable. The crucial idea is that we enumerate finite witnesses in order of non-decreasing teaching size, as for each size we have a finite number of them. Note that this is not possible for teaching dimension (already for teaching dimension 1 the set of witnesses is infinite). The teaching size, defined as the size of the shortest witness set needed to identify the concept, is still unbounded in general.
Secondly, in order to obtain a bounded teaching size on expectation, we define a sampling prior, which is simply based on the “teaching book”, in such a way that those concepts that appear early in the book have higher probability to be taught. In other words, the sampling prior gives strong preference to those concepts whose simplest compatible program can be taught with a witness set whose total size is small.
Not only does this make things computable and provide ways of having bounded teaching size in expectation, but it is also more meaningful. Furthermore, it has a more profound connotation for the notion of teaching in general. For concept languages where examples are structured and can be of any size (e.g., strings), saying that the teaching dimension is 1 means that there is a successful witness set with one example that will identify the concept, but this example can still be huge in terms of size and may be very difficult to find. For instance, a concept that maps every input to the same very long output could still have a small teaching dimension, but the teaching size would be large. With the notion of teaching size, a concept that requires very large examples to be identified would not have a small teaching size, and would then be a concept that is hard to teach.
The transition from teaching dimension to teaching size follows a long tradition in the context of machine learning and machine teaching (from language identification to compression) that we will properly analyze in Sect. 7. Several key challenges in machine learning and machine teaching (e.g., compositionality and structure) could be re-understood under the notion of teaching size. In light of this historical context and the state of the art, the main contributions of this paper are:

We prove theoretically and show experimentally that there are concepts with small teaching dimension (even 1) and arbitrarily large teaching size.

Under our new setting based on building a teaching book ordered by the size of the witness set, both learner and teacher are shown to be computable for universal languages for any fixed time bound, something proved impossible for other machine teaching settings.

To our knowledge, this paper is the first one where a teaching setting is analyzed experimentally for a universal language, showing that interesting programs (e.g., reverse, mirror, shifts, etc.) are teachable with short witness sets.

More fundamentally, we see experimentally that the size of the teaching set is usually smaller than the shortest expression of the concept to be taught, which is an illuminating justification of why machine teaching from examples makes sense at all.
The rest of the paper is organized as follows. In the next section we formally define the teaching framework using the notion of teaching size in contrast to teaching dimension. We instantiate the framework through three different preference functions: by length (related to a prior based on Kolmogorov complexity K), by length and runtime (related to a prior based on Levin complexity Kt), and finally by length and a time complexity function (related to a prior based on time-bounded Kolmogorov complexity), where we are able to show the most positive results. In Sect. 5 we discuss further aspects of the sampling bias. In Sect. 6 we describe an experimental validation for the universal string manipulation language P3, comparing the use of teaching size (allowing an open number of elements in the witness) to one more related to teaching dimension (where the number of elements in the witness is limited). We put the paper in the context of related work in Sect. 7 and we close the paper with a discussion of the results and their implications.
The general framework
We consider the following setting for machine teaching. We have an infinite example (or instance) space X and an infinite concept class C consisting of concepts, each a subset of X (this subset forms the positive examples for the concept). The goal is that for any concept \(c \in C\) the teacher must find a small witness set \(w \subseteq X\) of positive examples from which the learner is able to uniquely identify the concept. In our most general setting the examples will be pairs of strings and the concepts will be the outcomes of programs of a Turing-complete language L, also called "universal", mapping strings to strings.
Every program written in such a language computes some partial function from inputs to outputs, which is undefined on exactly those inputs on which it does not halt. We call two programs equivalent if they compute the same partial function. We consider the L-concepts defined by the language L to be the set of all partial functions computed by programs written in this language, thus establishing a bijection between L-concepts and equivalence classes of programs. Note that in standard terminology a computable function is a total function computed by some program, which by definition must always halt.
The example space will consist of pairs \(\langle i,o\rangle \) over the binary alphabet, so-called input/output pairs, in addition to the pair \(\langle i,\perp \rangle \). Given a binary string i as input, a program p will either go into an infinite loop, denoted by \(p(i)=\perp \), or compute a binary string o as output, which we denote by \(p(i)=o\). Thus, the program p computes a function \(f_p:\{0,1\}^* \rightarrow \{0,1\}^* \cup \{\perp\} \) and belongs to the equivalence class of programs compatible with the L-concept c defined by the partial function computed by p. A witness w for the concept c should be a finite subset of its positive example set that allows the learner to identify a program p compatible with c. Note that we consider only positive examples when identifying programs, always specifying the desired behavior for a given input. When programs can run without a time bound, we will never use \(\perp \) as an output, because it would be useless for the learner (it cannot in general check that a program goes into an infinite loop). However, we will use pairs of the form \(\langle i,\perp \rangle \) in time-bounded scenarios, where teacher and learner can use this as useful information for determining that a program does not stop within the time budget.
An example set \(S=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\) is just a finite set of binary i/o pairs, used as a witness. The size (number of symbols) of a string s (either input or output) is denoted by \(|s|\). Since we will be focusing on the size of example sets we need to carefully consider how to encode them. To encode an example set we need to distinguish each string in an i/o pair so that, e.g., \(\{\langle 0,1\rangle ,\langle 00,11\rangle \}\) is not confused with, e.g., \(\{\langle 01,0\rangle ,\langle 01,1\rangle \}\). We can do this either by using delimiters between strings (as we have done in the previous sentence with the brackets, or with a special separator symbol, as in \(0{\cdot }1{\cdot }00{\cdot }11\)), or by using a prefix code that explicitly allows one to check when the end of a string has been reached (e.g., 010001010110001111, with Elias coding (Elias 1975)). We choose some such encoding and let \(\delta \) be a function from example sets to natural numbers so that \(\delta (S)\) is the number of bits needed to encode the example set S under the chosen encoding (in the previous case \(\delta (\{\langle 0,1\rangle ,\langle 00,11\rangle \})=18\)).
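As an illustration, the prefix-code option can be sketched as follows. This is only one possible choice of \(\delta \) (an assumption, not the paper's fixed encoding): each string is sent as the Elias gamma code of its length (shifted by one so the empty string is encodable) followed by its raw bits. It reproduces the value 18 for the example set above.

```python
def elias_gamma_bits(n: int) -> int:
    """Number of bits of the Elias gamma code of a positive integer n."""
    assert n >= 1
    return 2 * n.bit_length() - 1

def delta(example_set) -> int:
    """delta(S): bits needed to encode an example set S of (input, output)
    string pairs, sending each string as gamma(len + 1) followed by its bits
    (the +1 offset makes the empty string encodable)."""
    bits = 0
    for i, o in example_set:
        for s in (i, o):
            bits += elias_gamma_bits(len(s) + 1) + len(s)
    return bits
```

For instance, `delta({("0", "1"), ("00", "11")})` gives 18, matching the example in the text.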
We say that a concept c satisfies S, denoted by \(c \models S\), if \(c(i) = o\) for all the pairs \(\langle i,o\rangle \) in S. All concepts satisfy the empty set. Just as for concepts, we say that a program p satisfies the example set S if \(p \models S\), i.e. it has the i/obehavior specified by all i/o pairs in S. The equivalence class of programs compatible with concept c is \(Class_L(c)=\{p: \forall S, p \models S \iff c \models S \}\).
Teaching size
Given this, the teaching size (TS) of a concept c could in a first attempt be defined as follows:

\(TS(c)= \min \{\delta (S) : {\varPhi }(S) = c \}\)
i.e., as the size of the smallest (using the \(\delta \) encoding) example set S that allows a learner to uniquely identify c. This minimal set S is known as a witness set for the concept c and the intention is that a learner should be able to use it to infer a program in \(Class_L(c)\). Note that the TS depends on both a concept and a learner. Sometimes in the machine teaching literature, it is assumed that an optimal learner for a language is used, so the definition depends on the language and the learner is hence omitted. In this paper the definition of TS depends on a learner that is specified to follow a precise procedure \({\varPhi } \) mapping sets to concepts: \({\varPhi } (S)= c\).
However, as we are dealing with a universal language it is clear that witness sets of finite size will not do. This is the case because for any finite witness set there will be an infinite set of distinct concepts satisfying this witness, since on inputs not specified in the witness there is an infinity of different outputs that could be given (e.g., the witness set \(\{\langle 0,1\rangle ,\langle 00,11\rangle \}\) is consistent with a program that negates every bit but also with one that outputs as many 1s as the length of its input, and infinitely many other computable concepts). Thus we need a learning prior, for example a preference-based total ordering of programs, that will allow the learner to prefer some programs, and thus also some concepts, over others.
Since the concepts and programs form infinite discrete sets, a natural idea is to use orderings related to the resources or complexity required by the programs, with the 'easy' programs coming first. Consider a universal language L formed by a finite set of instructions where each instruction \(\upsilon _i\) is defined by a symbol of an alphabet \({\varUpsilon }\). We assume a lexicographic order for the symbols in \({\varUpsilon }\). A program p is defined as a sequence of instructions and its length is denoted by \(\ell (p)\). The number of programs of a given length l is \(|{\varUpsilon }|^l\) and hence the number of programs of length \(\le l\) is \(\frac{|{\varUpsilon }|}{|{\varUpsilon }|-1}(|{\varUpsilon }|^{l}-1)\).
Let \(\prec \) be a total order of programs, where programs are ordered by length, thus shorter programs with fewer instructions preceding longer programs, and lexicographically for programs having the same number of instructions. When we refer to an ordering of programs by length then the total order given by \(\prec \) is always the order we mean, and similarly ‘the smallest program’ satisfying some constraint is the earliest such program in this order.
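The order \(\prec \) can be sketched as an enumeration (hypothetical Python, with programs represented as strings of instruction symbols): shorter programs first, and lexicographic order within each length.

```python
from itertools import count, islice, product

def programs_in_order(alphabet):
    """Yield every program over `alphabet` in the total order:
    by non-decreasing length, lexicographically within each length."""
    for length in count(0):
        for prog in product(sorted(alphabet), repeat=length):
            yield "".join(prog)

# The first programs over a two-symbol instruction set:
first = list(islice(programs_in_order(["b", "a"]), 7))
# ["", "a", "b", "aa", "ab", "ba", "bb"]
```

"The smallest program" satisfying some constraint is then simply the first program yielded by this generator that satisfies it.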
We will in the next sections use the \(\prec \) ordering of programs to define three distinct notions of teaching size of concepts. We will also discuss, and prove results about, the effectiveness of teaching under these scenarios.
Teaching size according to program length preference: \(TS_\ell \)
We can define the first program for a concept according to \(\ell \) and \(\prec \) as follows:

\({\varPhi }_\ell (c)= {\mathrm {argmin}}_p^{\prec } \{\ell (p) : p \in Class_L(c)\}\)
where \({\mathrm {argmin}}_p^{\prec }\) returns the program p that minimises the expression but comes first in the order \(\prec \) in case of ties. Clearly, we can simply define the Kolmogorov complexity of a concept as \(K(c) = \ell ({\varPhi } _\ell (c))\), i.e., the length of the smallest program computing c.
The most obvious way to apply the total order \(\prec \) induced by \(\ell \) to teaching is to require that the learner, given a witness, can distinguish the desired program/concept from all programs/concepts earlier in the order, but not those later. This is done by defining the first program for an example set S according to \(\ell \) and \(\prec \) as follows:

\({\varPhi }_\ell (S)= {\mathrm {argmin}}_p^{\prec } \{\ell (p) : p \models S\}\)
and similarly \(K(S)= \ell ({\varPhi } _\ell (S))\), i.e., the length of the shortest program having the i/o behavior specified by S. Now we can give the definition of the teaching size of a concept c based on \(\ell \):

\(TS_\ell (c)= \min \{\delta (S) : {\varPhi }_\ell (S) \in Class_L(c)\}\)
i.e., the teaching size for a concept c using the order given by \(\ell \) and \(\prec \) is the size of the smallest witness set S, under \(\delta \) encoding, such that the first program in the order that satisfies S is in the equivalence class of c.^{Footnote 1}
If all programs were guaranteed to halt on all inputs then these definitions could be useful. Given S, the learner would try programs in increasing size-lexicographic order, i.e., following \(\prec \), and the first one having the i/o behavior specified by S would be the desired \({\varPhi } (S)\), the program the learner should find.
However, we are dealing with universal languages, and in this case finding the shortest program, i.e., computing the Kolmogorov complexity, is undecidable, so both learner and teacher would in general be uncomputable. A common approach to regain computability is to use an approximation of Kolmogorov complexity. Levin's Kt has some interesting properties here, as the learner would combine program length and runtime. We explore this possibility in "Appendix 1" and find that for any finite example set S the learner is computable and is able to identify a program with this behavior. However, there will be concepts that cannot be taught. We need another approach.
Teaching size according to time-bounded program length preference: \(TS_\ell ^f\)
We will now use a time complexity function \(f:{\mathbb {N}} \rightarrow {\mathbb {N}}\) that will restrict how long the learner, and teacher, will run a program before aborting it. We now allow \(\langle i,\perp \rangle \in S\). We define \(|\perp| =1\) and for ease of encoding we partition the pairs into \(S=S_{\perp } \cup S_{out}\), with the former containing exclusively the pairs of the form \(\langle i, \perp \rangle \) and the latter containing the pairs for which the program must finish within the time budget. The teacher will be filling what we call an f-Teaching Book, a list of entries in the form of program/witness pairs (p, w), with p the smallest program f-compatible with w, and w the smallest witness for which p is f-compatible. The teacher will fill the f-Teaching Book by a nested loop, generating example sets in some order of non-decreasing size, breaking ties in a deterministic manner, and for each example set trying programs of increasing lengths until finding the first one with the correct behavior.
First, for computability of the teacher we must ensure that for every example set \(S=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\) there will exist some program \(p_S\), a lookup table hardcoded for S, that within the allotted time has the behavior specified by S. Note that we only consider example sets that do not contain any contradictory pair (two distinct behaviors on the same input). In most high-level languages \(p_S\) can be implemented as a program that hardcodes a trie data structure for S, with inputs in S as nodes of a binary tree whose edges are labelled 0 or 1, such that all descendants of a node have as common prefix the binary string on the path from the root to that node. The empty output will be given on inputs not in S. On a binary input i this program searches for i in the trie and if found, say \(i=i_j\), then it outputs \(o_j\) unless \(o_j= \perp \), in which case it enters an infinite loop. The program \(p_S\) can implement the trie as a hardcoded conditional flow structure (a tree of if-then-elses) or can be composed of a first part consisting of instructions that, based on the input i, traverses a trie that comes as a hardcoded data structure in a second part of the program, used as working memory with random access. Assuming some hardcoding of this kind, \(p_S\) will have runtime upper bounded by \(c \cdot \min \{|i|, i_{max}\} + o_{max}\), where c is a fixed constant and \(i_{max}, o_{max}\) are the lengths of the longest input string and output string in S. For some settings and some 'low-level' universal languages this time bound will not be met, but nevertheless there will be some hardcoded program for any example set S. In general, for a language L we let \(\lambda _L(n,S)\) be an upper bound on the runtime, on any input i of \(|i|=n\) alphabet symbols, of some fixed, and fast, hardcoded program for S. When L is clear from context we will drop the subscript.
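A minimal sketch of the lookup-table idea, written in Python rather than the universal language L; \(\perp \) is modelled as `None` and the infinite loop as an exception, purely for illustration:

```python
def make_hardcoded_program(S):
    """Return a function simulating the lookup-table program p_S:
    it stores the (input, output) pairs of S in a trie keyed by the
    bits of the input, answers o_j on input i_j, signals non-halting
    (modelling an infinite loop) when o_j is None, and outputs the
    empty string on inputs not in S."""
    trie = {}
    for i, o in S:
        node = trie
        for bit in i:
            node = node.setdefault(bit, {})
        node["$"] = o  # end-of-input marker holds the output

    def p_S(i: str) -> str:
        node = trie
        for bit in i:
            if bit not in node:
                return ""          # input not in S: empty output
            node = node[bit]
        if "$" not in node:
            return ""
        if node["$"] is None:      # the real p_S enters an infinite loop here
            raise RuntimeError("non-halting")
        return node["$"]

    return p_S
```

The runtime of a lookup is proportional to the length of the input prefix traversed plus the length of the output, mirroring the \(c \cdot \min \{|i|, i_{max}\} + o_{max}\) bound above.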
Fixing a universal language L, we say that a program p is f-compatible with example set S, denoted \(p \models _f S\), if for all pairs \(\langle i,o\rangle \in S\) the program p on input i will within \(\max \{f(|i|),\lambda _L(|i|,S)\}\) time steps have the specified behavior: if \(o \ne \perp \) then it halts and outputs o, and if \(o = \perp \) then it does not halt within that time limit. Note that for any L, f and S there is a program f-compatible with S, since the \(\lambda _L(|i|,S)\) time bound allows some hardcoded program for S. We use \(\lambda \) because we want all possible witness sets to identify a program in the teaching book. However, the hardcoded programs will be identified only by witness sets that cannot be compressed.
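The f-compatibility test can be sketched as follows, assuming a hypothetical interpreter `run(p, i, budget)` (not part of the paper) that executes program p on input i for at most `budget` steps and returns a pair `(halted, output)`; \(\perp \) is again modelled as `None`:

```python
def f_compatible(run, p, S, f, lam):
    """p |=_f S: for each (i, o) in S, run p on i for at most
    max(f(|i|), lam(|i|, S)) steps. Requires p to halt with output o
    when o is not None, and to still be running (a timeout) when o
    is None, i.e. the pair is of the form <i, bottom>."""
    for i, o in S:
        budget = max(f(len(i)), lam(len(i), S))
        halted, out = run(p, i, budget)
        if o is None:                 # <i, bottom>: p must not halt in time
            if halted:
                return False
        elif not halted or out != o:  # <i, o>: p must halt and output o
            return False
    return True
```

Note that `lam` guarantees the budget is always large enough for some hardcoded program, so every contradiction-free S has at least one f-compatible program.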
Clearly, using larger time complexity functions we will be able to correctly discriminate between more programs. For a time complexity function f, we define the first program under this new scheme as follows:

\({\varPhi }_\ell ^f(S)= {\mathrm {argmin}}_p^{\prec } \{\ell (p) : p \models _f S\}\)
and the time-bounded Kolmogorov complexity \(K^f(S)= \ell ({\varPhi } _\ell ^f(S))\), i.e., the length of the shortest program f-compatible with S. We then use this to define the \(K^f\)-teaching size of a concept c as

\(TS_\ell ^f(c)= \min \{\delta (S) : {\varPhi }_\ell ^f(S) \in Class_L(c)\}\)
if such an S exists, and otherwise undefined; i.e., the teaching size based on \(\ell \) and a complexity function f for a concept c is the size of the smallest witness set S (using the \(\delta \) encoding) such that the shortest program f-compatible with S is in the equivalence class of c. For some concepts, e.g., those requiring much more time than allowed by f, the teaching size will not be defined.^{Footnote 2}
Note that we may have \(p \not \models _f S\) but \(p\models S\) (or vice versa) when the time bound f does not allow p to run to completion on some input in S. When teaching concept c we want the learner to find a program p such that \(p \models S\) whenever \(c \models S\), so care must be taken. Nevertheless, we will see that we can define a learner and teacher and pick time-bound functions so that concepts can be taught properly.
Given an example set S, the learner wants to find \({\varPhi } _\ell ^f(S)\), the smallest program f-compatible with S. This program has some fixed length, and therefore there is a finite number of programs preceding it in the \(\prec \) order. The learner is computable: try programs in increasing order, i.e., the \(\prec \) order, and test whether each program is f-compatible with S; the first such program is the answer.
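A sketch of the learner, under a toy semantics chosen only for illustration (each program outputs its own text on every input, so f-compatibility degenerates to exact matching); the real learner would run programs of the language L under the time budget above:

```python
from itertools import count, product

def programs(alphabet):
    """All programs over `alphabet`, shortest first, lexicographic
    within each length (the < order)."""
    for n in count(0):
        for t in product(sorted(alphabet), repeat=n):
            yield "".join(t)

def learner(S, alphabet, f_compatible):
    """Phi_l^f(S): return the first program in the < order that is
    f-compatible with S. This terminates because some hardcoded
    program for S is always f-compatible."""
    for p in programs(alphabet):
        if f_compatible(p, S):
            return p

# Toy semantics (an assumption for this sketch only): program p
# ignores its input and outputs the string p itself, so compatibility
# is exact match of every specified output.
def toy_compatible(p, S):
    return all(o == p for _, o in S)
```

For example, `learner([("0", "ab")], "ab", toy_compatible)` scans `""`, `"a"`, `"b"`, `"aa"` and returns `"ab"`, the first program with the specified behavior.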
Let us now turn to the teacher. The teacher is computable, by filling what we call an f-Teaching Book, a list of entries in the form of program/witness pairs (p, w), with p the smallest program f-compatible with w, and w the smallest witness for which p is f-compatible. A brute-force algorithm to fill the f-Teaching Book, initially empty, is as follows: try all example sets S with no contradictory pairs but allowing \(\langle i,\perp \rangle \) pairs, in order of non-decreasing \(\delta \) encoding size (breaking ties in a deterministic manner), and test whether the program p found by the learner on this S (using the strategy described above for the learner) is already in the f-Teaching Book; if it is not, then add the pair (p, S) to the f-Teaching Book.
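The brute-force teacher can be sketched as follows; `example_sets_by_size` is assumed to enumerate contradiction-free example sets in non-decreasing \(\delta \)-size with deterministic tie-breaking, and `learn` is any computable learner mapping an example set to a program. The `max_sets` cap is only a practical addition for this sketch; the teacher in the text runs as an open-ended loop.

```python
def fill_teaching_book(example_sets_by_size, learn, max_sets):
    """Brute-force teacher: scan example sets in non-decreasing size,
    run the learner on each, and record a (program, witness) entry the
    first time each program appears. The witness recorded for a program
    is therefore its smallest identifying witness."""
    book = []      # (program, witness) entries, in discovery order
    seen = set()
    for k, S in enumerate(example_sets_by_size):
        if k >= max_sets:
            break
        p = learn(S)
        if p not in seen:
            seen.add(p)
            book.append((p, S))
    return book
```

With a toy learner that just returns the first specified output, a run over three example sets records each distinct program once, paired with the earliest (hence smallest) witness identifying it.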
Observe that for some L-concept c we cannot rule out the possibility that the f-Teaching Book may contain two programs, with two different witnesses, with both programs equivalent to c. For this to happen, the earlier program must, among other things, time out on one of the i/o pairs of the witness identifying the later program. The teaching size of the concept under \(\ell \) and f is in any case well-defined, as the size of the smallest witness.
We show that, for any language L and any partial or total concept in \(C_L\), i.e., the class of concepts defined by L, there exists a time complexity function f so that using f this concept can be taught. Note that we are not able to construct f in a computable way; we give only an existence proof.
Theorem 1
For any universal language L, and for any concept c in \(C_L\), there is a complexity function f so that the f-Teaching Book will contain some \((p,w_c)\) with \(p \in Class_L(c)\) and \(TS_{\ell }^f(c)=\delta (w_c)\).
Proof
Given the language L and a concept c in \(C_L\), we take p to be the smallest program equivalent to c and define a time complexity function f and a witness w, showing that the teacher will find a pair (p, w) for insertion in the f-Teaching Book.
For partial concepts we allow the witness w to contain i/o pairs of the form \(\langle i,\perp \rangle \), denoting that the equivalent program must go into an infinite loop on input i. Let p be the smallest program equivalent to c. For any program \(p'\) smaller than p, since \(p'\) is not equivalent to c, there will exist a conflicting pair: either an input i for which \(p(i)=o \ne \perp \) but \(p'(i) \ne o\), or one with \(p(i)=\perp \) and \(p'(i) \ne \perp \). Let the finite witness w contain for each \(p'\) smaller than p one such conflicting pair, in the first case using \(\langle i,o\rangle \) and in the second case using \(\langle i,\perp \rangle \). We thus have \(w=w_{\perp } \cup w_{out}\), consisting of q pairs \(w_{\perp }=\{\langle i_{1},\perp \rangle ,\ldots ,\langle i_{q},\perp \rangle \}\) on which p goes into an infinite loop, with q programs \(\{p_{1},\ldots ,p_{q}\}\) for which these are conflicting pairs, thus \(p_{j}(i_{j}) \ne \perp \) for \(1 \le j \le q\). We also have k pairs \(w_{out}=\{\langle i_{q+1},o_{q+1}\rangle ,\ldots ,\langle i_{q+k},o_{q+k}\rangle \}\) where the output of p is specified.
We now choose any time complexity function f that satisfies \(f(|i_j|) \ge \tau (p_j,i_j)\) for each \(1 \le j \le q\), and \(f(|i_j|) \ge \tau (p,i_j)\) for each \(q+1 \le j \le q+k\), where \(\tau (p,i)\) is the number of steps executed by program p on input i. This will ensure that, for each \(\langle i,\perp \rangle \in w_{\perp }\) being a conflicting pair for some \(p'\), f is big enough to allow \(p'(i)\) to run to completion, and that for \(\langle i,o\rangle \in w_{out}\), f is big enough to allow p(i) to run to completion.
As the teacher starts filling the f-Teaching Book, going through witnesses of increasing size, consider what happens when the witness \(w=w_{\perp } \cup w_{out}\) we have just described is reached. For any program \(p'\) smaller than p we have a conflicting pair in w: if it is some \(\langle i,o\rangle \in w_{out}\) then it will be discovered that \(p'\) does not output o within the time bound (either it outputs something else, or it does not halt within the time bound) and thus \(p'\) is discarded. If instead the conflicting pair is \(\langle i,\perp \rangle \in w_{\perp }\), then f allows time for \(p'\) to run to completion on i and the program \(p'\) will again be discarded. On the other hand, the program p, which is the smallest equivalent to c, will satisfy all the pairs in w: for a pair \(\langle i,o\rangle \in w\) it will within time \(f(|i|)\) output the correct thing on input i, while for a pair \(\langle i,\perp \rangle \in w\) it will not halt within time \(f(|i|)\) on input i.
Thus when the teacher tries the witness w the program p will be found, and if it is not already in the f-Teaching Book it will be inserted. Since the teacher algorithm goes through witnesses in order of increasing size, there will be some witness \(w_c\) of smallest size that leads the learner to identify program p. The \((p,w_c)\) pair, with \(TS_{\ell }^f(c) = \delta (w_c)\), will be inserted in the f-Teaching Book, but it is not necessarily equal to the (p, w) pair we have specified in the proof. \(\square \)
Since the concepts in \(C_L\) are all the partial functions computed by programs written in this language, this includes functions that are not total, i.e., functions that are undefined on exactly those inputs on which the program does not halt. Therefore, the existence result we have just shown goes beyond the set of computable functions, i.e., the total functions computed by some program, which by definition must always halt.
In case a concept c in \(C_L\) is total, thus defining a computable function, any equivalent program will halt on every input and the time complexity function f in the above result will depend only on c. This allows us to strengthen the result to teach the whole class of concepts that are computable within the same time bound, and to avoid the non-constructive nature of the previous result. On the other hand, the previous result held for any concept in \(C_L\), whereas the result we show now holds only for a subset of the concepts, which we now define. Fix any \(t:{\mathbb {N}} \rightarrow {\mathbb {N}}\) and define the class \(C_L^t\) of total concepts in \(C_L\) having implementations with runtime bounded by t(n):

\(C_L^t = \{c \in C_L : \exists p \in Class_L(c) \; \forall i \;\; \tau (p,i) \le t(|i|)\}\)
Theorem 2
For any runtime function t and any universal language L, using the time complexity function defined by \(f(n) = t(n)\) for all n, we can teach the concept class \(C_L^t\). For any \(c \in C_L^t\) the f-Teaching Book will contain some (p, w) with \(p \in Class_L(c)\) and \(TS_{\ell }^f(c)=\delta (w)\).
Proof
Consider any concept \(c \in C_L^t\) and an equivalent program \(p_c\) whose runtime on any input of size n is upper-bounded by t(n). We use a proof analogous to that of Theorem 1, to find a witness \(w=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\) for \(p_c\) (rather than for "the smallest program equivalent to c") and note that (i) for a smaller program equivalent to \(p_c\) there is no conflicting pair, and (ii) we will not have conflicting pairs of type \(\langle i,\perp \rangle \); thus the constraints on the time complexity function f will be bounded only by the runtime of the program \(p_c\). We define f so that for each \(1 \le j \le k\) we have \(f(|i_j|) = t(|i_j|) \ge \tau (p_c,i_j)\), which allows \(p_c\) to run to completion on each of the inputs in w. Thus, for each concept in \(C_L^t\) the constraints on f(n) are never higher than t(n). The remainder of the proof proceeds in a manner analogous to the proof of Theorem 1. \(\square \)
Note that there is no non-constructivity here: we fix \(f=t\) and basically show that for any \(c \in C_L^t\), if we let the teacher run long enough then a program equivalent to c will be inserted in the f-Teaching Book. Our setting does not use "coding tricks" (Balbach 2007; Balbach and Zeugmann 2009), such as assuming a coding between instances and concepts, where only the "index" would be necessary to identify the concept. Goldman and Mathias's condition holds: whenever a learner identifies a concept c with an example set S, it must also identify c with any superset of S that is also consistent with c (Goldman and Mathias 1993). This is true for any \(c \in C_L^t\), since if the t-time-bounded program \(p \in Class_L(c)\) was identified by witness w and \(w'\) is a superset of w consistent with c, then, since the time bound holds for any input, \(w'\) will still identify the same program p. The fact that we remove duplicated programs from the teaching book does not affect this condition. For Theorem 1, however, only a softer version of the condition holds: it holds for any superset of the witness containing extra i/o pairs for which the program (or an equivalent one) can run within the time complexity bound.
Sampling prior and expected teaching size
With the learning prior \(\ell \) and the time complexity function f seen in the previous section, the algorithm for filling the f-Teaching Book will indeed terminate for any example set S, for the same reasons that the learner algorithm terminates: the time limit, the fact that we only run on witnesses S with no contradictory pairs, and the guarantee of \(\lambda (n,S)\) time on an input of size n, so there is always some program f-compatible with all pairs in S.
Sometimes, especially when the teaching setting is used to understand the complexity of machine learning, the concepts to be taught (or their distribution) are determined by the task. However, in an actual teaching task (or in an evaluation setting, such as an exam), it is usually the teacher (or the evaluator) who decides the order or distribution of concepts. If a teacher has the task of choosing a convenient sample, some constraints influence this choice. First, neither teacher nor learner wants a very long teaching or evaluating session; consequently, it makes sense to give preference to concepts with short witness sets. Second, by choosing a distribution that decays with the order in the book, we ensure that the teacher does not need to prepare a huge book and only has to enlarge it occasionally. Of course, there are also other cases, such as the teacher following a teaching book sequentially.
We can generalise all these cases by considering that the teacher has a sampling prior v over all concepts, such that \(\sum _{c \in C} v(c) = 1\). With this, we can define the expected teaching size of the whole class C as follows:
$$\begin{aligned} {\mathbb {E}}_v [ TS_{\ell }^f(C) ] = \sum _{c \in C} v(c) \cdot TS_{\ell }^f(c). \end{aligned}$$
Note that those concepts that are not found in the book will have probability 0. Taking this sampling prior into account, the teacher can build a finite f-Teaching Book by setting a limit on the maximum witness size at which the f-Teaching Book stops. The teacher can assign decreasing probabilities to the n elements in this finite book, according to their index j in the book, e.g., \(v_j = 2^{-j}\), and leave part of the mass of the distribution unassigned for the part of the book that is not calculated (\(v_{rest} = 2^{-n+1}\)). Whenever the teacher samples from this distribution, if the chosen index is \(j < n\) then the jth entry (p, w) in the teaching book is chosen and w is sent to the learner. Otherwise, the teacher has to enlarge the book up to the necessary index to generate the example.
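This sampling scheme can be sketched in a few lines. The helper name `sample_from_book` and the injectable `rng` argument are our own choices for illustration; the book is represented simply as a list of witnesses in book order, and the leftover probability mass signals that the book must be extended.

```python
import random

def sample_from_book(book_witnesses, rng=random.random):
    """Sample a witness under the decreasing prior v_j = 2**-j over the
    n entries computed so far; the leftover mass means 'extend the book'."""
    r = rng()
    acc = 0.0
    for j, w in enumerate(book_witnesses, start=1):
        acc += 2.0 ** -j          # v_j = 2^-j, decaying with book order
        if r < acc:
            return w              # the j-th entry of the finite book
    return None                   # mass v_rest: the book must be enlarged
```

Passing a deterministic `rng` makes the behavior easy to check; with the default `random.random` the teacher rarely needs to enlarge the book, since the unassigned mass decays exponentially in the book size.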
Note that in the resulting f-Teaching Book the majority of concepts will usually have teaching size close to the size limit, as there are more large witnesses. Nevertheless, a sufficient condition to achieve any desired finite expected teaching size derives directly from the equation above: given the teaching book, we have the size of the witness set in each entry, and we only have to choose a probability distribution such that the resulting weighted sum stays below the desired bound, with the remainder of the mass expressing the probability that we have to extend the book.
We can also play with various definitions of “interesting” concepts and give a higher prior to these, instead of following the order given by the book. In other words, the teaching book (up to a given size) is an introductory book ordered by teaching complexity, from which the teacher can build the desired teaching prior and sample from it (with or without replacement).
Experimental validation and insights
We describe an experimental validation of our method for teaching a universal language, and derive several insights from the experiments. We work with the universal language P3, a simple language for string manipulation, described below. As learning prior we take the preference order \(\prec \) built on program size, together with a time complexity function f. We will thus be computing the teaching size \(TS_{\ell }^f\) of P3-concepts, i.e., concepts computable by P3 programs. We implement the teacher algorithm described in Sect. 4.
For the sake of simplicity we use a naive encoding \(\delta \) of example sets based on three delimiters, one inserted before every alphabet symbol to signal whether a new i/o string starts or the current string continues. This inefficient encoding uses \(\lceil (2 \log _2 5) n \rceil \) bits, where n is the number of alphabet symbols in the example set. Since this is just a multiplicative factor, in what follows we will use the number of alphabet symbols in the witness as its size. For instance, the example set \(\{\langle 0100, 00\rangle , \langle 001, \rangle , \langle 00, 00\rangle \}\) has \(n=13\).
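Counting alphabet symbols is immediate once an example set is represented as pairs of input/output strings. A sketch (the helper name `naive_size` is ours), reproducing the \(n=13\) example above:

```python
def naive_size(example_set):
    """Number of alphabet symbols in an example set: the size measure
    used with the naive delimiter-based encoding delta."""
    return sum(len(i) + len(o) for i, o in example_set)

# the example set {<0100,00>, <001,>, <00,00>} from the text
n = naive_size([("0100", "00"), ("001", ""), ("00", "00")])  # 4+2+3+0+2+2
```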
The teacher considers example sets by non-decreasing \(\delta \) size, breaking ties in a deterministic way, and for each example set considers programs in \(\prec \) order. For each input/output pair (i, o) of the example set S we run p(i). As soon as we discover that p gives a wrong output (note we do not consider \(o = \perp \)) or hits the time limit, we proceed to the next program. Otherwise the first compatible program has been found and the (p, S) pair is inserted into the f-Teaching Book, unless p has already been matched with a smaller witness. When all example sets up to a fixed size bound have been tried we stop, thus having computed the initial part of the infinite teaching book consisting of the concepts with smallest teaching size. This experiment, denoted by \(D_{\ge 1}\), thus computes the teaching size \(TS_{\ell }^f\) as described above.
For comparison, we also ran an experiment, denoted by \(D_{=1}\), in which the teacher considers only example sets consisting of a single i/o pair, generating these by increasing size, thus filling the \(D_{=1}\)-Teaching Book with programs having teaching dimension 1 under the preference-based order given by \(\prec \) and under the time constraint given by f. In both cases we allowed the same computing time for filling the teaching book, but with different f values, 2000 and 200 respectively, since the lookup-table P3 program for a single example is much faster than the one for several examples. For a fixed number of alphabet symbols there are obviously fewer example sets if we restrict to a single example, so we filled the teaching books up to two different limits (7 and 9 alphabet symbols). Table 1 shows that \(D_{\ge 1}\) nevertheless went through more example sets and ended up with more programs in the book.
We start by describing the language P3, then give more details of our implementation for teaching size and of the alternative experiment, and finish with a comparison of the results.
The P3 language
P′′ is a primitive programming language introduced by Corrado Böhm (1964); it was the first GOTO-less imperative language proved Turing-complete, i.e., universal. One of its better-known^{Footnote 3} variants has 8 instructions in total. We employ another variant, called P3, also universal and having just 7 instructions: \({\texttt {<}}\), \({\texttt {>}}\), \({\texttt {+}}\), \({\texttt {-}}\), \({\texttt {[}}\), \({\texttt {]}}\), \({\texttt {o}}\).
P3 programs operate on an input/memory tape of cells, moving a pointer left (instruction \({\texttt {<}}\)) and right (instruction \({\texttt {>}}\)), similarly to a Turing machine. We consider P3 programs that take binary inputs and generate binary outputs. There is also the special symbol dot ‘.’, so the tape alphabet has 3 symbols \({\varSigma }=\{0,1,.\}\). The tape is padded with ‘.’ on the cells before and after the binary input. The ‘.’ is crucial for loops and for the halting of programs. There is also an output tape onto which the symbol in the pointer cell can be written (instruction \({\texttt {o}}\)) in a write-only sequential manner; if a ‘.’ is output the program halts. The tape alphabet has the cyclic ordering (‘0’ ‘1’ ‘.’) and there are two instructions that overwrite the symbol in the pointer cell (\({\texttt {+}}\) changes it to the next symbol in the ordering, and \({\texttt {-}}\) changes it to the previous one). Thus, applying \({\texttt {+}}\) to ‘.’ gives ‘0’. Finally, the bracket instruction \({\texttt {[}}\) loops to the corresponding bracket \({\texttt {]}}\), i.e., the instructions between the two brackets are executed repeatedly. The condition of the loop is evaluated on both brackets, and the loop exits to the instruction following \({\texttt {]}}\) when the content of the pointer cell is ‘.’.
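These semantics fit in a short interpreter. The following Python sketch is our own reading of the rules above (the function name `run_p3`, the step budget, and the dictionary-based unbounded tape are implementation choices, not the paper's code); exceeding the budget is reported as non-termination (\(\perp\)).

```python
def run_p3(prog: str, tape_input: str, max_steps: int = 2000):
    """Minimal P3 interpreter sketch. Returns the output string, or
    None if the step budget is exhausted (treated as non-halting)."""
    cyc = "01."                              # cyclic tape alphabet
    tape = {i: c for i, c in enumerate(tape_input)}
    ptr = 0                                  # pointer on first input cell
    match, stack = {}, []                    # precompute bracket matching
    for i, c in enumerate(prog):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            match[i], match[j] = j, i
    out, pc, steps = [], 0, 0
    while pc < len(prog):
        steps += 1
        if steps > max_steps:
            return None
        c = prog[pc]
        cell = tape.get(ptr, '.')            # tape is padded with '.'
        if c == '<':
            ptr -= 1
        elif c == '>':
            ptr += 1
        elif c == '+':
            tape[ptr] = cyc[(cyc.index(cell) + 1) % 3]
        elif c == '-':
            tape[ptr] = cyc[(cyc.index(cell) - 1) % 3]
        elif c == 'o':
            if cell == '.':
                break                        # outputting '.' halts
            out.append(cell)
        elif c == '[':
            if cell == '.':
                pc = match[pc]               # skip the loop body
        elif c == ']':
            if cell != '.':
                pc = match[pc]               # jump back to re-enter the loop
        pc += 1
    return ''.join(out)
```

For instance, the left-shift program described in the text maps 10010 to 00101 under this interpreter, and a looping program such as `[]` on a non-dot cell is reported as non-halting.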
The number of programs up to size n is exactly \(\sum _{i=0}^{n} 7^i\). However, several optimizations are implemented. For instance, programs without the output instruction \({\texttt {o}}\) and programs that do not contain a balanced number of brackets are not executed. We use the lexicographic order on instructions \({\texttt {<}}\), \({\texttt {>}}\), \({\texttt {+}}\), \({\texttt {-}}\), \({\texttt {[}}\), \({\texttt {]}}\), \({\texttt {o}}\) to define the \(\prec \) length-lexicographic ordering of P3 programs. Let us consider as an example a P3 program for left shift (e.g., on input 10010 we want output 00101): \(> {\texttt {[o}} > {\texttt {]}} < {\texttt {[}} < {\texttt {]}} > {\texttt {o}}\).
The program starts with the pointer at the first position of the input string and moves the pointer to the second position with the \({\texttt {>}}\) instruction. Then, the program prints the rest of the characters of the input string using the loop \({\texttt {[o}} > {\texttt {]}}\). After that, the pointer is on the ‘.’ following the last character of the input string, so it is moved one position to the left (\({\texttt {<}}\)), and then to the first ‘.’ on the left of the input string using the loop \({\texttt {[}} < {\texttt {]}}\). Finally, the pointer is moved forward to the first character of the string, which is printed: \(> {\texttt {o}}\).
In the experiments below we can use the example set \(\{\langle 0100, 1000\rangle , \langle 10010, 00101\rangle \}\), which has two i/o pairs, to find this program, as there is no shorter program with this particular i/o behavior. However, if we wish to find this program using a single i/o pair we will not succeed, as the programs \(> {\texttt {[o}} > {\texttt {]+o}}\) and \(> {\texttt {[o}} > {\texttt {]-o}}\) are shorter and one of them will match the behavior of the single i/o pair, by skipping the first input character (a 0 or 1), printing the rest, and then applying \({\texttt {+}}\) or \({\texttt {-}}\) to the ‘.’ in the pointer cell to print a 0 or a 1. Thus, under preference order \(\prec \) the left-shift P3 program has teaching dimension two.
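The length-lexicographic order \(\prec\) is easy to realise by enumeration, and the count \(\sum_{i=0}^{n} 7^i\) can be checked directly. A sketch (the generator name `programs_up_to` is ours; in practice the paper's optimizations would discard programs without `o` or with unbalanced brackets before execution):

```python
from itertools import product

INSTRUCTIONS = "<>+-[]o"   # the lexicographic tie-breaking order on instructions

def programs_up_to(n):
    """Yield all P3 programs of size <= n in length-lexicographic order."""
    for length in range(n + 1):
        for tup in product(INSTRUCTIONS, repeat=length):
            yield ''.join(tup)

# sum_{i=0}^{3} 7^i = 1 + 7 + 49 + 343 = 400 programs of size at most 3
count = sum(1 for _ in programs_up_to(3))
```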
\(D_{\ge 1}\): Computing teaching size of P3 programs
In this section we report on an implementation of the teacher and learner algorithms computing teaching size according to the time-bounded program length preference: \(TS_{\ell }^f\). We thus compute the \(D_{\ge 1}\)-Teaching Book of P3 programs ordered by teaching size. We choose a constant time limit function \(f(n)=2000\) for all n. This is large enough, given the maximum size of our i/o pairs, for the lookup-table programs to run to completion, and over the course of the experiment we believe the time limit is only reached by non-halting programs. Table 2 gives some (p, w) pairs of programs and example sets that the teacher will find.
In our experiments, we fill a teaching book up to a size of at most 7 alphabet characters in the example set, giving a total of 111,569 example sets. From this set we removed the example sets that are contradictory, those with repeated pairs, and those where all the outputs are empty (since these identify the empty program). This way 17,252 example sets remain.
For each of the 17,252 example sets, a P3 program was found, with some programs identified by many example sets. The final \(D_{\ge 1}\)-Teaching Book contains 5062 distinct programs and their witnesses, i.e., the smallest example set identifying each. In Table 3, we show some witnesses and their corresponding P3 programs in the teaching book.
Of the 5062 programs in the \(D_{\ge 1}\)-Teaching Book, 663 have a witness with a single i/o pair, 2563 with two i/o pairs, 1689 with three i/o pairs and 147 with four i/o pairs. Table 4 shows the frequency of programs in the \(D_{\ge 1}\)-Teaching Book according to the number of alphabet symbols in the witness set.
Teaching size compared to teaching dimension 1
Let us now describe a second experiment, denoted by \(D_{=1}\), in which the teacher considers only example sets consisting of a single i/o pair, generated by increasing size, thus filling a \(D_{=1}\)-Teaching Book with programs having teaching dimension 1 under the preference-based order given by \(\prec \) and time complexity function f. Let us first show that if we do not halt this process it will keep filling the teaching book indefinitely, since there are infinitely many distinct concepts with teaching dimension one.
Lemma 1
For any integer n, the \(D_{=1}\)-Teaching Book will eventually include a program of teaching size at least n.
Proof
By contradiction, suppose not; then the \(D_{=1}\)-Teaching Book would not grow beyond some k programs. But then, say on the empty input \(\epsilon \), these k programs can output at most k distinct outputs. If we take some output x not among these k, then there is a smallest program p that on empty input hardcodes the output x. This program would in fact be added to the \(D_{=1}\)-Teaching Book when running on the example set \(\langle \epsilon ,x\rangle \), since the hardcoding program is allowed to run to completion regardless of the choice of f, contradicting that the \(D_{=1}\)-Teaching Book has only k programs. \(\square \)
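The hardcoding program in the proof can be constructed explicitly: for each output bit, at most one \({\texttt {+}}\)/\({\texttt {-}}\) adjusts the pointer cell and \({\texttt {o}}\) prints it, reusing the same cell. The following generator sketch (the helper name `hardcode_program` is ours) also illustrates the "at worst \(2 \times k\) instructions for an output of size k" bound stated later in this section.

```python
def hardcode_program(output_bits: str) -> str:
    """P3 program that, on empty input, prints `output_bits`.
    At most 2 instructions per bit: at most one +/- and then o."""
    cyc = ['0', '1', '.']
    cell = '.'                  # the tape starts as all '.'
    prog = []
    for b in output_bits:
        if cell != b:
            # one step in the cyclic order (0 1 .) always suffices:
            # if the successor is not b, the predecessor must be
            if cyc[(cyc.index(cell) + 1) % 3] == b:
                prog.append('+')
            else:
                prog.append('-')
        prog.append('o')
        cell = b                # the printed symbol stays in the cell
    return ''.join(prog)
```

For example, the output "01" is hardcoded as `+o+o`, and a repeated bit needs only the `o` instruction, so the length never exceeds twice the output size.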
For this experiment we choose a constant time limit function \(f(n)=200\) for all n, since the lookup-table program of a single fixed i/o pair, for the sizes we consider, will not need more than this number of steps. We limit the size of witnesses to at most 9 alphabet symbols. Each binary string of size k can be broken up to form a single i/o pair in \(k+1\) distinct ways. Thus the number of example sets up to size n is in this case \(\sum _{0 \le k \le n} (k+1)2^k\), and for our limit of \(n=9\) this gives 9217 example sets. For all these example sets a program was found, and the final \(D_{=1}\)-Teaching Book contained 3227 distinct programs and their witnesses. Table 5 shows some programs and witnesses in the \(D_{=1}\)-Teaching Book.
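The count of candidate single-pair example sets can be checked directly (the helper name is ours): a binary string of length k can be split into an input/output pair in \(k+1\) ways, and there are \(2^k\) such strings.

```python
def num_single_pair_sets(n: int) -> int:
    """Number of single i/o-pair example sets of total size at most n:
    sum over k of (k+1) * 2^k splits of the 2^k binary strings of length k."""
    return sum((k + 1) * 2**k for k in range(n + 1))

# the D_{=1} experiment limit of n = 9 yields the 9217 sets reported
count = num_single_pair_sets(9)
```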
For this experiment we have some observations on the correlation between the size of the input and output of the witness set and the length of the program. First, the length of the programs is highly correlated with the size of the outputs and inversely correlated with the size of the inputs. When the inputs are short, the identified program usually hardcodes the output for that particular example. Given an output of size k, we need at worst \(2 \times k\) instructions to obtain the desired output: one instruction to get the correct value in the pointer cell (\({\texttt {+}}\) or \({\texttt {-}}\)) and the \({\texttt {o}}\) instruction. The length of the programs seems to follow a normal distribution, with a median of 9 instructions. If we compare the frequency of example sets identifying a program with respect to its size, we see that short programs are identified by more example sets, which is also expressed by the negative correlation of these two variables.
Let us compare witness size and program length for the two teaching books. In Fig. 1 we see two scatter plots giving this information. It is clear that the programs in the \(D_{\ge 1}\)-Teaching Book, even though there are more of them (for visibility the scales are different), are generally smaller. The \(D_{=1}\)-Teaching Book contains many (long) hardcoding programs, since hardcoding is more prevalent when only a single i/o pair has to be satisfied. Also, we notice in both cases a correlation between program length and witness size (which may become tighter if we compute the books for larger witness sets), and that the bulk of the witnesses are towards the high end, unsurprisingly, as there are more larger example sets.
There were 5062 programs in \(D_{\ge 1}\) and 3227 in \(D_{=1}\), with 543 programs common to both teaching books. For each of these programs, we want to compare the size of its witness set in the two teaching books. In Fig. 2 we include a scatter plot where each circle represents a group of these 543 programs sharing the same pair of witness encoding sizes; the size of the circle represents the number of programs in the group. The x-axis represents the size of the witness set in the \(D_{= 1}\)-Teaching Book, while the y-axis represents the size of the witness set in the \(D_{\ge 1}\)-Teaching Book. As expected, no programs are above the diagonal, as the \(D_{\ge 1}\)-Teaching Book is built to minimize size.
For programs on the diagonal the two sizes are identical, but there are some exceptions (43 programs) where the size in the \(D_{\ge 1}\)-Teaching Book is smaller; see Table 6. For example, the program \({\texttt {+[o}} > {\texttt {]-o}}\), which basically means “increase the cell, loop outputting and moving right while the cell is 0 or 1, decrease the final cell (which must be a ‘.’) and output”, has the \(D_{= 1}\) witness \(\{\langle 0000,10001\rangle \}\) and the \(D_{\ge 1}\) witness \(\{\langle ,01\rangle ,\langle 1,1\rangle \}\). The size goes from 9 down to 4.
Further experimental results and discussion
Note that in the previous experiments we used a very inefficient encoding of example sets. One of the interesting things about the teaching size is that it quantifies how many bits are needed to send a witness set that suffices for the learner to identify the concept. This means that the efficiency of the encoding is key. Consequently, as an alternative to the simple encoding used in the previous sections, we ran an experiment using an asymptotically optimal universal code, Elias delta coding (Elias 1975), which encodes a number x using \({\displaystyle \lfloor \log _{2}(x)\rfloor +2\lfloor \log _{2}(\lfloor \log _{2}(x)\rfloor +1)\rfloor +1}\) bits. The number of bits depends only on the length of the binary encoding of x, and we specify this as a function \(ED: {\mathbb {N}} \rightarrow {\mathbb {N}}\) such that the Elias delta code for a binary string of n bits uses ED(n) bits. For an example set \(S=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\), its Elias delta encoding size is \(\delta '(S)= \sum _{1 \le j \le k} ED(|i_j|) + ED(|o_j|)\), basically concatenating all the encodings. Since these are prefix codes, their concatenation preserves the separation into the original binary strings.
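The code-length formula and the size \(\delta'\) can be sketched directly. The mapping from a binary string to an integer (prefixing a ‘1’ bit, so that a string of length n maps to an integer x with \(\lfloor \log_2 x\rfloor = n\)) is our assumption: it is one way to realise a function ED that depends only on the string length; the paper does not spell out the mapping.

```python
import math

def elias_delta_bits(x: int) -> int:
    """Bit length of the Elias delta code of a positive integer x."""
    n = int(math.log2(x))
    return n + 2 * int(math.log2(n + 1)) + 1

def ED(n: int) -> int:
    # bits to code a binary string of n bits, under our assumed mapping
    # string -> integer '1'+string (so floor(log2 x) = n)
    return elias_delta_bits(2**n)

def delta_prime(example_set):
    """Elias-delta encoding size of an example set of (input, output) strings."""
    return sum(ED(len(i)) + ED(len(o)) for i, o in example_set)
```

Under this sketch the single-pair set \(\{\langle \epsilon, 0\rangle\}\) costs \(ED(0)+ED(1) = 1+4 = 5\) bits, which is consistent with the 5-bit minimum witness size mentioned below.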
We constructed the teaching book enumerating example sets up to a size of 15 bits under this \(\delta '\). Figure 3 compares the witness size under this new encoding with the program length, also expressed in bits (as there are 7 instructions, we multiply the number of instructions of a program by \(\log _2 7\) to obtain its length in bits too).
We see a few small programs that require large witness sets (below the unit diagonal in Fig. 3). But the striking finding is that in most other cases—almost always—the witness size (in bits) is smaller than the program size (in bits). This extraordinary observation means that if a teacher wants to send (teach) a concept, it is frequently more efficient (in terms of transmission) to send its optimal witness set under our schema than to send the program itself. This is significant experimental support for the phenomenon (usually observed in humans, Khan et al. 2011) that sometimes it is more information-efficient to teach some ideas using examples than to explain the idea itself. The very problem of teaching by example—“if we already know the true model, why bother training a learner?” in Zhu (2015)’s words—is ultimately meaningful under these findings, and goes beyond the traditional applications to education and adversarial attacks. It is not that the teacher “cannot telepathize [the model]” (Zhu et al. 2018); it is that showing a few examples instead can be more efficient. It is the notion of teaching size, rather than teaching dimension, that lets us compare the magnitudes of programs and witnesses in the same unit (bits).
Now, for this new encoding, we calculate the expected teaching size using several different sampling priors. For instance, if we use the sampling prior as defined in Sect. 5 (call it \(v_1\)), we get an expected teaching size of 5.28. This is clearly a very extreme prior, so we get very close to 5 bits, which is the minimum witness size under Elias coding. A less extreme sampling prior \(v_2\) assigns a uniform distribution over each size bin up to the maximum size in the book (15). That gives a probability mass of \(1/(15-5)\) for each size (note that one size bin has no entries). With this prior, we get an expected teaching size of 10.2 bits. With a uniform distribution \(v_3\) over all the entries in the book, we get an expected teaching size between 14 and 15.
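The expected teaching size under any of these priors is just a weighted sum over the book. A sketch with a toy book (the four witness sizes below are made up for illustration, not the paper's data):

```python
def expected_teaching_size(sizes, prior):
    """E_v[TS] = sum over book entries of v(c) * TS(c), in bits."""
    assert abs(sum(prior) - 1.0) < 1e-9   # prior must sum to 1
    return sum(v * s for v, s in zip(prior, sizes))

# toy book of witness sizes in bits, ordered by teaching size
sizes = [5, 6, 6, 8]
uniform = [1 / len(sizes)] * len(sizes)          # v_3-style prior
skewed = [0.5, 0.25, 0.125, 0.125]               # v_1-style 2^-j decay,
                                                 # remainder on the last entry
```

As in the text, the skewed prior pulls the expectation towards the minimum witness size, while the uniform prior sits near the middle of the book.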
In general, we do not know whether a concept appears twice in the book, but if the complexity function f is large enough relative to the limit up to which the book is constructed (as is the case here), redundant programs are quite unlikely. So Fig. 3 can be interpreted as a comparison between \(TS_{\ell }^f(c)\) and \(K^f(c)\), where we see that \(TS_{\ell }^f(c) < K^f(c)\) in most cases.^{Footnote 4} An interesting property to analyze for a pair of language and sampling prior v is the transmission efficiency of teaching, defined as \( {\mathbb {E}} _v [ K^f(C)] - {\mathbb {E}} _v [ TS_{\ell }^f(C) ]\), where \({\mathbb {E}} _v [ K^f(C)]\) is \(\sum _{c \in C} v(c) \cdot K^f(c)\), also measured in bits. For P3, up to the sizes we have explored, this efficiency is positive for the three priors used above (\(v_1\), \(v_2\) and \(v_3\)).
From the comparison of experiments we have the following observations:

- The computability of the theoretical setting is translated into a universal language, P3.
- Teaching books have a high number and high diversity of programs. Hence we do not observe any collapsing phenomenon of a few programs capturing almost all witness sets.
- We can find interesting programs, such as those in Table 2.
- The experiment with witness cardinality limited to 1 shows more overfitting than the general setting.
- With a strong sampling bias we get a very small expected teaching size, but other, softer sampling biases can be used.
- Remarkably, it is rather common to find witness sets that are smaller than the programs they identify. This gets more pronounced for efficient encodings such as Elias delta.
All the code for the P3 interpreter, the teacher and the learner, and the several settings and encodings used in this paper, as well as results and plots, can be found at this repository: https://github.com/ceferra/TheTeachingSizewithP3.
Related work
In this section we put the notion of teaching size in the historical context of machine learning and machine teaching. We start with the early days of machine learning, when several theoretical learning models were introduced. Gold’s language identification in the limit (Gold 1967) can be understood as a first model of teaching, where positive (or negative) examples are given sequentially. The emphasis is on whether the concept that is generating the sequence of examples will be identified. When sequences are replaced by sets we have another model, usually known as concept learning (or teaching). The link between the number of examples needed and the complexity of the concept class has been investigated extensively, most especially under the notions of VC-dimension (Vapnik and Chervonenkis 1971) and PAC-learnability (Valiant 1984). In all these models and variants, the magnitude or complexity of the evidence is usually measured in terms of cardinality, i.e., how many examples are necessary.
For many concept languages, examples do not have structure, so all examples are equally complex. This view is reinforced by the customary setting in machine learning where examples are just points in a multidimensional space, i.e., vectors of real numbers. On top of this, the dominant paradigm in machine learning today is function approximation rather than function identification. However, when examples have structure, we need concepts that can compose elements, and then the notion of approximation makes little sense (e.g., if a hypothesis outputs ‘11110000’ rather than ‘11110010’, the output may look close, but the true concept may still be quite far off). As a result, there is increasing interest in understanding learning where programs or other structured concepts are built or induced from non-scalar data (Lake et al. 2015; Gulwani et al. 2015; Lake and Baroni 2018; Lázaro-Gredilla et al. 2019).
Alongside the theoretical and practical models of machine learning, there has been significant progress on the notion of machine teaching, where the teacher can choose the examples at will, the optimal “witness set” (Freivalds et al. 1989; Shinohara and Miyano 1991; Freivalds et al. 1993; Goldman and Kearns 1995), in order to make learning more efficient. While this gives extra power (as do other variants of learning, such as active learning), the theoretical models have not yet shown a setting where teaching becomes efficient for rich concept classes, let alone universal languages. A remarkable exception is the model of teaching in Angluin and Kriķis (2003), where, given the right assumptions, the learner can learn any partial recursive function. The focus there is on the ability to learn, not on the minimum cardinality (or size) of the examples.
The most common theoretical model since then is based on the notion of teaching dimension, which appeared together with the early theoretical models of teaching. One way of tightening the choice when many concepts are compatible with a given witness set is the use of costs or preferences, following the tradition of the K-dimension (Balbach 2007, 2008) and consolidated under the notion of the preference-based teaching dimension (PBTD) (Gao et al. 2016). Given an order on concepts, we just choose the first compatible concept in this preference order. With the use of preferences we get finite (worst-case) teaching dimensions for some restricted languages (Gao et al. 2016).
Even these refinements of the teaching dimension fall short of fully underpinning machine teaching, which is moving towards more general settings (Khan et al. 2011; Zhu 2013, 2015; Simard et al. 2017) and related areas such as curriculum learning (Bengio et al. 2009), understood as teaching with sequences of examples of increasing complexity, or building on what has been learned before. This divergence (Khan et al. 2011) also appears in the context of new insights about teaching in cognitive science and psychology (Shafto et al. 2014), where composition and structure require rich concept classes.
Some recent variants of the teaching dimension have paid attention to languages that are structurally rich and universal. Hernandez-Orallo and Telle (2018) studied the teaching dimension when using a learning prior derived from a complexity measure. They showed that even rich infinite languages, such as the regular languages, can be taught with an upper bound on the expected number of examples in a witness. Two priors are needed for this: the learning prior—basically a variation of the preference-based order (Gao et al. 2016)—used to inform the learner of how to search for concepts given a witness set, and the sampling prior used by the teacher to test the performance of the learner. However, not only are there several incomputability problems, but the teaching dimension does not really capture how complex it is to learn with rich languages producing arbitrarily large examples.
Replacing the teaching dimension with the teaching size in the setting we present in this paper solves these issues and links machine teaching to fundamental concepts in machine learning such as compression (through Kolmogorov complexity) and Solomonoff’s theory of induction (1964). The size of the inputs (and outputs) is now placed at the forefront, and how compressible they are plays a key role in understanding machine teaching, in light of the theoretical and experimental results shown in this paper.
While Theorem 1 applies to any universal language, it gives a result where a different complexity function can be chosen for each concept. In contrast, the time bound is used to define the relevant concept class for Theorem 2. While this restricts the concept class, it also allows analyzing the teaching size for a hierarchy of concept classes according to their time complexity functions. For instance, interesting connections could be established by choosing functions in P and comparing the resulting teaching sizes (and books) across various concepts and classes.
Finally, the shift from the teaching dimension to the teaching size, and the use of the sampling prior to calculate the expected teaching size, links us to fundamental notions in communication and information theory, and reinforces the view of machine teaching from a coding perspective (Zhu 2015). In our setting, the learner is actually decoding the examples to find the message (the concept). In the experiments, witness sets can be seen as a compressed version of a concept, assuming that the learner uses a decoding algorithm that searches programs by increasing length. Under this view, what really matters is the comparison of sizes, not cardinalities, between witness sets and programs.
Conclusions
In this paper, we have argued that the notion of teaching dimension is counterintuitive for languages where examples are structured and can be arbitrarily long: a small teaching dimension could be obtained with a very long example. Also, the teaching dimension leads to incomputability problems for universal languages. Instead, we have introduced the teaching size, and we have analyzed the computability of the teaching procedure (for learner and teacher) according to length, interwoven length and computational steps, and length with a time complexity function. In the last case, we have been able to show positive computability results while preserving the power of teaching the whole class in general.
The use of size instead of cardinality also makes the results in expectation for a whole class of concepts more meaningful. For instance, having a finite dimension for a concept class may still mean that the examples have unbounded size, and hence the process is not bounded at all. With the use of the teaching size this cannot happen. Actually, instead of the expected teaching size (the mean over the sampling prior), we could also calculate the total teaching size for teaching n concepts (the sum over the sampling prior). Also, the coupling with the complexity function f paves the way for bounding both teaching and learning time in expectation (or in aggregate).
Many of the traditional and new applications of machine teaching can be strengthened under the view of the teaching size paradigm, including human-computer interaction, education, trustworthy AI and coding (Zhu et al. 2018). For instance, interaction with robots and personal assistants can benefit from a better understanding of how humans teach (and communicate) concepts, and bring the old ideas of programming by example (Lieberman 2001) closer to reality. A user can tell an assistant that she likes peas with roasted chicken and potatoes, but dislikes peas with fish, and the system can infer from this the kind of recipes the user likes. A model of teaching where these short examples are put at the same level as examples with millions of ingredients does not make any sense.
The notion of teaching size can have novel applicability for explaining ML models (Biran and Cotton 2017), where key examples have to be given to the user. How to find the key short examples that convey the meaning of the model is precisely what we have analyzed here. Finally, while we tend to associate structured attributes with lists, graphs and other compositional objects, natural numbers and real numbers can also be considered structured, with an encoding defining their size (using, e.g., rationals to encode real numbers). In this light, it is not the same to teach a concept with a few numbers of small size as with a single number of huge size. This would explain why humans do not use the number that is “closest to the decision boundary” (Khan et al. 2011) at the cost of using infinite precision.
The experimental section supports the intuition that allowing more examples usually paves the way for witness sets of smaller total size. Moreover, a focus on minimizing the number of examples leads to specialization (programs hard-coded for one example), while a focus on minimizing size usually finds more general programs, with large numbers of examples being used to capture diversity.
In terms of tractability, the runtime of the learning algorithm grows exponentially, but the algorithm is fully parallelizable. The teaching algorithm invokes the learning algorithm for each witness set it iterates over. Even though we have to check whether the concept is already in the book, this can be done as a post-process, so the teaching algorithm is also parallelizable at the upper level. Also note there are very efficient specialised computers (Jun 2016) for the language BF—very similar to P3—with which we could produce much larger teaching books.
The most significant insight from the experiments was that the size of a program is almost always larger than the size of the optimal witness set used to identify it. This suggests that sending the examples rather than the definition of the concept pays off, a phenomenon we also observe in real teaching scenarios with humans. This is possible whenever the learner has a powerful learning mechanism and an aligned learning prior.
The teaching size opens a new avenue for a more realistic and powerful analysis of machine teaching, and suggests closer connections with learnability theory and communication, especially when results are expressed in terms of the size of the inputs, and not only the number of examples. As future work we want to explore several non-universal concept classes and sampling priors to estimate their expected teaching size.
Notes

1. We can also define this teaching size using \(\prec \) directly, to highlight the similarity with the preference-based teaching dimension (which would minimize over the number of i/o-pairs in S rather than \(\delta (S)\)):
$$\begin{aligned} TS_\ell (c) = \min \limits _{S} \{\delta (S) : p \models S \wedge (p' \prec p \Rightarrow p' \not \models S) \wedge p \in Class_L(c)\}.\end{aligned}$$

2. This teaching size variant could alternatively be defined using \(\prec \) as a preference-based order directly:
$$\begin{aligned} TS_{\ell }^f(c) = \min \limits _{S} \{\delta (S) : p \models _f S \wedge (p' \prec p \Rightarrow p' \not \models _f S) \wedge p \in Class_L(c)\}. \end{aligned}$$

3.

4. Note that it is also the case that \(TS_{\ell }^f(c) > K^f(c) - k\), where k is a constant given by the length of expressing the program “Run the learner algorithm on w” in the language.
References
Angluin, D., & Kriķis, M. (2003). Learning from different teachers. Machine Learning, 51(2), 137–163.
Balbach, F. J. (2007). Models for algorithmic teaching. Ph.D. thesis, University of Lübeck.
Balbach, F. J. (2008). Measuring teachability using variants of the teaching dimension. Theoretical Computer Science, 397(1–3), 94–113.
Balbach, F. J., & Zeugmann, T. (2009). Recent developments in algorithmic teaching. In Intl conf on language and automata theory and applications (pp. 1–18). Springer.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th annual international conference on machine learning (pp. 41–48). ACM.
Biran, O., & Cotton, C. (2017). Explanation and justification in machine learning: A survey. In IJCAI-17 workshop on explainable AI (XAI) (p. 8).
Böhm, C. (1964). On a family of Turing machines and the related programming language. ICC Bulletin, 3(3), 187–194.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2), 194–203.
Freivalds, R., Kinber, E. B., & Wiehagen, R. (1989). Inductive inference from good examples. In International workshop on analogical and inductive inference (pp. 1–17). Springer.
Freivalds, R., Kinber, E. B., & Wiehagen, R. (1993). On the power of inductive inference from good examples. Theoretical Computer Science, 110(1), 131–144.
Gao, Z., Ries, C., Simon, H. U., & Zilles, S. (2016). Preference-based teaching. In Conf. on learning theory (pp. 971–997).
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5), 447–474.
Goldman, S. A., & Kearns, M. J. (1995). On the complexity of teaching. Journal of Computer and System Sciences, 50(1), 20–31.
Goldman, S. A., & Mathias, H. D. (1993). Teaching a smart learner. In Conf. on computational learning theory (pp. 67–76).
Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. (2015). Inductive programming meets the real world. Communications of the ACM, 58(11).
Hernández-Orallo, J., & Telle, J. A. (2018). Finite biased teaching with infinite concept classes. arXiv preprint arXiv:1804.07121.
Jun, S. W. (2016). 50,000,000,000 instructions per second: Design and implementation of a 256-core brainfuck computer. Computer Science and AI Laboratory, MIT.
Khan, F., Mutlu, B., & Zhu, X. (2011). How do humans teach: On curriculum learning and teaching dimension. In Advances in neural information processing systems (pp. 1449–1457).
Lake, B., & Baroni, M. (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML (pp. 2879–2888).
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
Lázaro-Gredilla, M., Lin, D., Guntupalli, J. S., & George, D. (2019). Beyond imitation: Zero-shot task transfer on robots by learning concepts as cognitive programs. Science Robotics, 4.
Levin, L. A. (1973). Universal Search Problems. Problems of Information Transmission, 9, 265–266.
Li, M., & Vitányi, P. (2008). An introduction to Kolmogorov complexity and its applications (3rd ed.). New York, NY: Springer.
Lieberman, H. (2001). Your wish is my command: Programming by example. San Francisco, CA: Morgan Kaufmann.
Shafto, P., Goodman, N. D., & Griffiths, T. L. (2014). A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71, 55–89.
Shinohara, A., & Miyano, S. (1991). Teachability in computational learning. New Generation Computing, 8(4), 337–347.
Simard, P. Y., Amershi, S., Chickering, D. M., Pelton, A. E., Ghorashi, S., Meek, C., Ramos, G., Suh, J., Verwey, J., & Wang, M., et al. (2017). Machine teaching: A new paradigm for building machine learning systems. arXiv preprint arXiv:1707.06742.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and Control, 7(1), 1–22.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
Zhu, X. (2013). Machine teaching for Bayesian learners in the exponential family. In Advances in neural information processing systems 26 (pp. 1905–1913). Curran.
Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI (pp. 4083–4087).
Zhu, X., Singla, A., Zilles, S., & Rafferty, A. N. (2018). An overview of machine teaching. arXiv preprint arXiv:1801.05927.
Acknowledgements
We would like to thank the anonymous referees for their helpful comments. This work was supported by the EU (FEDER) and the Spanish MINECO under grant RTI2018-094403-B-C32, and the Generalitat Valenciana PROMETEO/2019/098. This work was done while the first author visited Universitat Politècnica de València and also while the third author visited University of Bergen (covered by Generalitat Valenciana BEST/2018/027 and University of Bergen). J. Hernández-Orallo is also funded by an FLI grant RFP2-152.
Editors: Karsten Borgwardt, Po-Ling Loh, Evimaria Terzi, Antti Ukkonen.
Appendix A: Teaching size according to Levin’s search: \(TS_{\ell \tau }\)
To get a finite procedure we consider the introduction of computational steps in the complexity function, inspired by Levin’s Kt (Levin 1973; Li and Vitányi 2008). We define the cost given by Levin’s search for an example set \(S=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\) as follows:
$$\begin{aligned} \ell \tau (p) = \ell (p) + \log \sum _{j=1}^{k} \tau (p,i_j), \end{aligned}$$
where \(\tau (p,i)\) represents the runtime of executing program p on the input i. And for a given S, the first program found under Levin’s search is given by:
$$\begin{aligned} {\varPhi } _{\ell \tau }(S) = \mathop {\arg \min } \limits _{p \,\models \, S} \, \ell \tau (p), \end{aligned}$$
with ties broken by \(\prec \).
\({\varPhi } _{\ell \tau }(S)\) is the Levin-simplest program satisfying S, and Levin’s Kt is simply \(Kt(S) = \ell \tau ({\varPhi } _{\ell \tau }(S))\). Note that Kt(S) is related to the logarithm of the time it takes to generate the behavior S when we execute all possible programs in dovetail fashion, by increasing \(\ell \tau \) (following the program size \(\ell \), with \(\prec \) for ties, and runtime \(\tau \)). We define the teaching size of a concept c under \(\ell \tau \) as follows:
$$\begin{aligned} TS_{\ell \tau }(c) = \min \limits _{S} \{\delta (S) : {\varPhi } _{\ell \tau }(S) \in Class_L(c)\}, \end{aligned}$$
i.e. the size of the smallest witness set S, using the \(\delta \) encoding, such that the Levin-simplest program satisfying S is in the equivalence class of c.
When the learner is given a set of instances \(S=\{\langle i_1,o_1\rangle ,\ldots ,\langle i_k,o_k\rangle \}\)—note that in this case we do not allow a pair \(\langle i,\perp \rangle \)—we want to find \({\varPhi } _{\ell \tau }(S)\), the Levin-simplest program satisfying S.
The learner is computable: given a witness set we can perform the dovetail procedure of Levin’s universal search on an increasing 2-dimensional budget, over programs of increasing length \(\ell \) following \(\prec \) and over increasing runtimes \(\tau \), and in this way we find the Levin-simplest program satisfying the witness set.
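This dovetailing learner can be sketched on a hypothetical two-instruction toy language standing in for P3; the instruction set, cost model (one step per instruction) and the name `levin_learner` are our own illustrative assumptions, not the implementation used in the experiments.

```python
# Sketch of the dovetailing learner on a toy language (an illustrative
# stand-in for P3): programs are strings over {'*', '+'} executed on a
# natural number, at a cost of one time step per instruction.
from itertools import product

INSTRUCTIONS = "*+"  # '*' doubles the value, '+' adds one

def run(program, x, budget):
    """Run program on input x; return None if the step budget is exceeded."""
    steps = 0
    for op in program:
        steps += 1
        if steps > budget:
            return None
        x = x * 2 if op == "*" else x + 1
    return x

def programs_of_length(n):
    for ops in product(INSTRUCTIONS, repeat=n):
        yield "".join(ops)

def levin_learner(S, max_budget=20):
    """Dovetail over program length and runtime: for increasing budget b,
    try every program p with len(p) <= b, giving it 2**(b - len(p)) steps
    per input, and return the first program satisfying all of S."""
    for b in range(max_budget + 1):
        for n in range(b + 1):
            for p in programs_of_length(n):
                if all(run(p, i, 2 ** (b - n)) == o for i, o in S):
                    return p
    return None
```

For instance, on the witness set \(S=\{(1,4),(2,6)\}\) this learner returns the program "+*" (add one, then double), since no shorter program is consistent with both pairs.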
Let us now turn to the teacher. The teacher will be able to distinguish between concepts defined by any finite behavior, by filling what we call a Levin Teaching Book of program/witness pairs (p, w), with p the Levin-simplest program satisfying w, and w the smallest witness set for p.
The Levin Teaching Book can be computed by a brute-force algorithm. Starting with an initially empty book, try all witness sets w with no contradictory pairs (i.e., no two pairs with distinct outputs for the same input) and no \(\langle i,\perp \rangle \) pairs, in order of increasing \(\delta \) encoding size (lexicographically for those of the same size), and run the learner algorithm (i.e., Levin search) on w to find a program p. Check whether p is already in the Levin Teaching Book, and if it is not then add the pair (p, w) at the end of the Levin Teaching Book. In this way, the Levin Teaching Book is formed by pairs (p, w), where w is a witness set and p the Levin-simplest program satisfying it. Observe that for some L-concept c we cannot rule out the possibility that the Levin Teaching Book contains two programs, with two different witness sets, with both programs equivalent to c. This would happen if, for example, one of the programs takes an extraordinarily long time on one of the i/o-pairs of the witness set of the other program, and vice versa. The teaching size of the concept under \(\ell \tau \) is in any case well-defined, as the size of the smallest witness set.
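A minimal sketch of this brute-force book construction, again on a hypothetical two-instruction toy language, restricted to single-pair witness sets for brevity; the total number of digits of the i/o values stands in for the \(\delta \) encoding size, a plain shortest-program search stands in for Levin search, and the names `delta`, `simplest_program` and `build_book` are our own.

```python
from itertools import product

INSTRUCTIONS = "*+"  # toy language: '*' doubles the value, '+' adds one

def run(program, x):
    for op in program:
        x = x * 2 if op == "*" else x + 1
    return x

def delta(w):
    """Toy stand-in for the delta encoding size: total digits of all values."""
    return sum(len(str(i)) + len(str(o)) for i, o in w)

def simplest_program(w, max_len=6):
    """Shortest program (lexicographically first among equals) satisfying w;
    a simplified learner standing in for Levin search."""
    for n in range(max_len + 1):
        for ops in product(INSTRUCTIONS, repeat=n):
            p = "".join(ops)
            if all(run(p, i) == o for i, o in w):
                return p
    return None

def build_book(max_value=9):
    """Fill the book with (program, witness) pairs: witness sets are tried in
    order of increasing delta size, and a program is only paired with the
    first (hence smallest) witness set that identifies it."""
    book = {}
    witnesses = [((i, o),) for i in range(max_value + 1)
                           for o in range(max_value + 1)]
    for w in sorted(witnesses, key=lambda w: (delta(w), w)):
        p = simplest_program(w)
        if p is not None and p not in book:
            book[p] = w
    return book
```

In the resulting book the empty program is paired with the witness ((0, 0),) and the program "+" with ((0, 1),), the smallest witness sets that single them out.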
We have the following positive result.
Claim
For any universal language L and finite example set S, there will exist some (p, w) with \(p= {\varPhi } _{\ell \tau }(S)\) in the Levin Teaching Book. If \(p \in Class_L(c)\) then \(TS_{\ell \tau }(c) \le \delta (w) \le \delta (S)\).
Proof
When the teacher tries S, the Levin-simplest program p satisfying S will be found. Either (p, S) will be inserted in the Levin Teaching Book or p has already been found paired with a smaller witness set w. If \(p \in Class_L(c)\) we have \(TS_{\ell \tau }(c) = \delta (w) \le \delta (S)\), unless there is a program \(p'\) equivalent to p that is Levin-simplest for some example set smaller than w, in which case \(TS_{\ell \tau }(c) < \delta (w)\). \(\square \)
This means that, if we want to teach not a specific concept, but require only that we teach one out of a number of concepts having a given behavior S on a finite input set, then we can do so. However, we would also like, for every universal language, to teach total concepts with a specific infinite behavior. For a simple example, consider teaching the identity concept, whose equivalence class contains exactly the id-programs, i.e., the programs that output exactly their input. Without putting some constraint on the universal language we cannot do that, as seen by the following result, formulated using Universal Turing Machines.
Claim
For any Universal Turing Machine (universal language) U there exists another UTM U\(^\prime \) s.t. no id-program will be in the Levin Teaching Book for U\(^\prime \).
Proof
We give a sketch of the proof. We construct U\(^\prime \) so that it alters the programs of U by ensuring (1) the existence of fast non-id programs such that for any finite witness set there is one that behaves like an id-program, and (2) any real id-program is slowed down so that it is not the Levin-simplest for any finite witness set. We ensure these two conditions as follows: (1) make a computable set \(\{p_i: i \text { a binary string}\}\) of programs such that \(p_i(x)=x\) except for \(p_i(i) \ne i\), and let these programs be fast. (2) For any other program p on input x, add a conditional such that before halting we check whether the output is \(p(x)=x\), and if so we add some code slowing it down. Then, for any proposed finite witness set w for the id-program there is some smallest input string \(i \not \in w\), and thus the program \(p_i\) will be Levin-simpler on w than any of the real id-programs. \(\square \)
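The core of part (1) can be illustrated concretely: any finite set of witness inputs leaves out some binary string i, so the corresponding non-id program \(p_i\) agrees with the identity on the whole witness set. A small sketch under these assumptions (the helper names are ours, and string functions stand in for programs):

```python
from itertools import count, product

def binary_strings():
    """Enumerate binary strings in length-lexicographic order: '', '0', '1', '00', ..."""
    yield ""
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def make_p_i(i):
    """p_i behaves like the identity everywhere except on the single input i."""
    def p(x):
        return x if x != i else "0" + x  # any output different from i will do
    return p

# Any finite witness set leaves out some binary string...
witness_inputs = {"0", "1", "01"}
i = next(s for s in binary_strings() if s not in witness_inputs)

# ...so p_i is indistinguishable from the identity on the witness set,
# although it is not an id-program.
p_i = make_p_i(i)
assert all(p_i(x) == x for x in witness_inputs)
assert p_i(i) != i
```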
Thus, even for quite simple concepts we cannot avoid that there will be some contrived universal languages that ‘fool’ the teacher using \(\ell \tau \) so that a witness set for this concept cannot be found.
Cite this article
Telle, J. A., Hernández-Orallo, J. & Ferri, C. The teaching size: computable teachers and learners for universal languages. Mach Learn 108, 1653–1675 (2019). https://doi.org/10.1007/s10994-019-05821-2
Keywords
 Machine teaching
 Teaching dimension
 Teaching size
 Compression
 Universal languages
 P′′ programming language
 Levin’s search