1 Introduction

Humans are able to learn a new concept from very little information. However, the large majority of machine learning algorithms need a substantial amount of data to perform similarly; in particular, successful models such as deep learning are among the most data-hungry (Marcus, 2018). Additionally, humans can identify more complex object categories than machines by composing simpler objects, whereas the best deep learning techniques usually struggle in compositional settings (Shrestha & Mahmood, 2019). How is it possible that humans learn so much from such scarce information? (Lake et al., 2015)

Human education is, in essence, organised through curricula, starting with simple concepts and gradually increasing the complexity. Bengio et al. (2009) initiated the formalisation of curriculum learning (CL) in the context of machine learning, which inspired applications to an extensive variety of tasks (Tang et al., 2018; Wang et al., 2018, 2019). The empirical results showed advantages of CL over random training. Nevertheless, CL is still limited by two issues: (1) finding a way to order the examples according to their difficulty and (2) determining the correct order of the concepts involved (Soviany et al., 2022).

Machine teaching (MT) is the area of AI that looks for the optimal examples a teacher should use to make a student identify a concept (Zhu et al., 2018). It is especially important in applications such as digital assistants, where we would like to teach them new concepts or procedures with limited data, in the same way humans teach or communicate with other humans (Degen et al., 2020).

MT is often understood as an inverse problem to machine learning (Zhu, 2015). For a single concept, machine teaching works as follows: the teacher generates a training—or witness—set of examples, either positive or negative, from which the learner identifies the concept. For instance, consider a teacher who wants a student to acquire the concept “red ball”; the teacher knows the student’s learning algorithm and may provide it with the following witness set: a picture of a red tennis ball labelled as positive and a picture of a blue one labelled as negative.

For humans and many other animals, it is assumed that once a concept has been acquired it can be reused to learn another one. Previous work (Zhou & Bilmes, 2018) has tried to optimise the teaching of one concept, e.g., through a partial minimax approach. However, we want to consider not only the sequence of training sets for a concept, but also the best order in which to teach a given set of concepts. MT for several concepts has so far been restricted to sequences where every acquired concept becomes background knowledge (Clayton & Abbass, 2019; Wang et al., 2021). Pentina et al. (2015), who studied CL experimentally, proposed that future work should identify a theoretical framework allowing for a more generic distribution of tasks and for realising the advantages of forgetting or of using independent sessions. This is in line with neurological studies such as Richards and Frankland (2017) and recent animal cognition research (Dong et al., 2016; Epp et al., 2016; Shuai et al., 2010), showing “evidence that forgetting is necessary for flexible behavior in dynamic environments”.

The first full theoretical framework of CL in MT was introduced in Garcia-Piqueras and Hernández-Orallo (2021). Using simplicity priors, it identifies the optimal tree-distribution of concepts by means of the \(\mathbb {I}\)-search algorithm (Garcia-Piqueras & Hernández-Orallo, 2021). The framework defines an instructional curriculum as a set of alternative partial sequences, such as the upper and lower branches of Fig. 1.

Fig. 1

Instructional curriculum for a set of concepts, where concept − is taught before \(\angle \) and these two before \(\bigtriangleup \), in one session, and concept \(\wr \) is taught in an independent session or lesson

The order between the branches is irrelevant, but the order of the concepts within each branch is crucial. The MT framework implementing CL proposed in Garcia-Piqueras and Hernández-Orallo (2021) not only meets the specifications in Pentina et al. (2015), but is also consistent with Richards and Frankland (2017) by handling a new phenomenon called interposition: previous knowledge is not always useful. This phenomenon increases the difficulty of finding optimal curricula.

Under certain general conditions, \(\mathbb {I}\)-search is able to overcome interposition, but it is computationally intractable in general. This computational cost is one of the reasons why CL is not sufficiently used in AI (Forestier et al., 2022), despite existing solutions with search algorithms like \(A^*\) (Pearl, 1984). Heuristic estimates enhance those procedures by requiring fewer node expansions in the graph search, which drastically reduces computational costs (Rios & Chaimowicz, 2010). To our knowledge, there are no such estimators for CL.

Here we introduce new theoretical results, such as Inequality (2): a relation between the sizes of the example sets and of the concept descriptions with and without background knowledge. This inequality is key to defining a new family of heuristics that effectively identify minimal curricula. The heuristics are elegantly defined using a “ratio of similarity” between the example sets with and without previous knowledge: the quotient of the lengths of the descriptions of a concept with and without background knowledge.

The theoretical results are illustrated on a drawing domain, where curricula can exploit the fact that some drawings build on substructures previously learnt (e.g., a flag is depicted as a rectangle and a straight pole). Experiments show the effect of interposition in CL and how it is overcome through our novel approach. This contribution, based on compositional simplicity priors and exemplified using a drawing domain, follows the direction of other important efforts in compositional AI (Lake et al., 2015; Wu et al., 2016; Tenenbaum et al., 2000; Wong et al., 2021).

2 The machine teaching framework

In this section we give more details about the interaction between the teacher and the learner as distinct entities. The teacher-learner protocol (Telle et al., 2019) is based on the following statements. Let \(\Sigma ^*\) be the set of all possible strings formed from symbols of an alphabet \(\Sigma \), and let us label such strings as positive or negative. An example is an input–output pair where the input is a string of \(\Sigma ^*\) and the output is \(+\) or − (outputs might be strings in the most general case of the framework).

Definition 1

The example (or instance) space is defined as the infinite set

$$\begin{aligned} X = \bigl \{ \langle \texttt{i}, \texttt{o}\rangle : \langle \texttt{i}, \texttt{o}\rangle \in \Sigma ^* \times \{+,-\} \bigr \} \end{aligned}$$

There is a total order \(\lessdot \) in X and there is a metric \(\delta \) that gives the size of any set of examples.

Definition 2

We define the infinite concept class \(C=2^X\), consisting of the concepts that are subsets of X.

The objective is that, for any concept \(c\in C\), the teacher finds a small witness set of examples from which the learner is able to uniquely identify the concept. The learner tries to describe concepts through a language L.

Definition 3

A program p in language L satisfies the example \(\langle \texttt{i}, \texttt{o}\rangle \), denoted by \(p(\texttt{i})=\texttt{o}\), when p outputs \(\texttt{o}\) on input \(\texttt{i}\).Footnote 1 We say that a program p is compatible with a set of examples \(S \subset X\) if p satisfies every example of S, and we denote this by \(p \vDash S\).

Two programs are equivalent if they compute the same function mapping strings of \(\Sigma ^*\) to \(\{+,-\}\). There is a total order \(\prec \) defined over the programs in L. For any program p in L there is a metric \(\ell \) that calculates its length.

We say that c is an L-concept if it is a total or partial function \(c:\Sigma ^* \!\rightarrow \! \{+,-\}\) computed by at least one program in L. Let \(C_L\) be the set of concepts that are described by L. Given \(c \in C_L\), we denote by \([c]_L\) the equivalence class of programs in L that compute the function defined by c.

Definition 4

We define the first program, in order \(\prec \), returned by the learner \(\Phi \) for the example set S as

$$\begin{aligned} {\Phi _{\ell }} (S) = {\arg \min \limits _{p}}^{\prec } \left\{ \ell (p): p {{\vDash }} S \right\} \end{aligned}$$

We say that w is a witness set of concept \(c \in C_L\) for learner \(\Phi \), if w is a finite example set such that \(p = {\Phi _{\ell }} (w)\) and \(p \in [c]_L\).
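Read procedurally, \(\Phi _{\ell }\) is a brute-force search over programs. The following Python sketch illustrates this reading; the helpers `programs_in_order` (an enumeration following the total order \(\prec \)) and `satisfies` (the check of Definition 3) are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the learner Phi: return the first program, in the
# total order on programs, that is compatible with the example set S.
# `programs_in_order` and `satisfies` are assumed helpers.

def learner(S, programs_in_order, satisfies):
    """Phi(S): first program p, in the order on programs, with p |= S."""
    for p in programs_in_order():            # enumeration follows the order on L-programs
        if all(satisfies(p, example) for example in S):
            return p
```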

The teacher \(\Omega \) has a concept \(c \in C_L\) in mind and knows how the learner works. With this information, the teacher provides the learner with the witness set w.

Definition 5

We define the simplest witness set that allows the learner to identify a concept c as

$$\begin{aligned} {\Omega _{\ell }} (c) = {\arg \min \limits _{S}}^{\lessdot } \left\{ \delta (S): {\Phi _{\ell }} (S) \in [c]_L \right\} \end{aligned}$$

We define the teaching size of a concept c as \({TS_{\ell }} (c) = \delta ({\Omega _{\ell }} (c))\).
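The teacher can be sketched analogously: enumerate finite example sets by increasing size \(\delta \) and stop at the first one that the learner maps into \([c]_L\); \(TS(c)\) is then the \(\delta \)-size of that set. Here `witness_sets_in_order`, `same_concept` and `learner` (a one-argument version of the sketch above, with its helpers fixed) are assumed helpers rather than the paper's implementation.

```python
# Minimal sketch of the teacher Omega; TS(c) is then simply delta(teacher(c, ...)).

def teacher(c, witness_sets_in_order, learner, same_concept):
    """Omega(c): the first witness set w, in increasing delta, whose learner
    output identifies concept c (i.e., lies in [c]_L)."""
    for w in witness_sets_in_order():        # candidate example sets, smallest first
        if same_concept(learner(w), c):
            return w
```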

We exemplify our discourse with a drawing domain: polygonal strokes on a grid. The details of the definition of the language L and the example space X are given in Section A. In short, we deal with an example space generated by the commands North, South, East and West, for the four possible axis-parallel directions in the plane. In Fig. 2, starting at the black dot, the frieze is described by \(\mathsf{ENESENES\ldots }\) Examples are labelled positive or negative, e.g., \(\textsf{EN}^+\) or \(\textsf{E}^-\).

Fig. 2

Polygonal stroke on a grid

The learner is able to capture concepts using a language L with the following instructions: \(\textsf{U}\)p, \(\textsf{D}\)own, \(\textsf{R}\)ight, \(\textsf{L}\)eft, \(\varvec{)}\) and \(\varvec{@}\). The symbol \(\varvec{)}\) denotes a non-deterministic choice between going back to the beginning of the program or continuing with the next instruction. For instance, the concept a defined by the frieze in Fig. 2 could be expressed in L as \(\mathsf{RURD}\varvec{)}\). The instruction \(\varvec{@}\) stands for library calls that implement background knowledge; for instance, \(\textsf{RU}\varvec{@}\) is equivalent to \(\mathsf{RURD}\varvec{)}\) when the library points to subroutine \(\mathsf{RD}\varvec{)}\).
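For illustration, the following sketch checks whether a program of L satisfies an example, under two simplifying assumptions that are ours rather than Section A's: a string is positively satisfied when some non-deterministic run of the program traces it, and \(\varvec{@}\) simply inlines the last library primitive (indexed calls such as \(\varvec{@}0\) are not handled).

```python
# Sketch of the satisfaction check for the drawing language, under the
# simplified semantics stated above (see Section A for the exact definitions).

CMD = {'U': 'N', 'D': 'S', 'R': 'E', 'L': 'W'}    # L-instructions -> Sigma commands

def satisfies(program, example, library=()):
    """True iff the program's verdict on the example's input matches its label."""
    target, label = example[:-1], example[-1]       # e.g. ('ENESEN', '+')
    prog = program.replace('@', library[-1] if library else '')
    stack, seen, accepted = [(0, 0)], set(), False  # states: (instruction, matched prefix)
    while stack:
        pc, matched = stack.pop()
        if matched == len(target):
            accepted = True
            break
        if (pc, matched) in seen or pc >= len(prog):
            continue
        seen.add((pc, matched))
        if prog[pc] == ')':                         # loop back to the start or continue
            stack += [(0, matched), (pc + 1, matched)]
        elif CMD[prog[pc]] == target[matched]:      # drawing command must match the input
            stack.append((pc + 1, matched + 1))
    return accepted if label == '+' else not accepted

# Consistent with the examples discussed in the text:
assert satisfies('RURD)', 'ENESEN+')
assert satisfies('RURD)', 'EE-')
assert satisfies('RU@', 'ENESEN+', library=('RD)',))   # RU@ behaves like RURD)
```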

For example, suppose the teacher has such a concept a in mind. The teacher selects the witness set \(w=\{\textsf{ENESEN}^+\}\) and provides the learner with it. At that point, the learner outputs the first program that satisfies w, i.e., \({\Phi _{\ell }} (w)=\mathsf{RURD}\varvec{)} \in [a]_L\) (see Fig. 3). We usually drop the index \(\ell \) if it is clear from the context. The teaching size of concept a is the size of the witness set, i.e., \(TS(a)=\delta (\{\textsf{ENESEN}^+\})=21\) (3 bits per symbol, as stated in Section A).

Fig. 3

The teacher-learner protocol for a given concept a

As we have seen, the teacher-learner protocol makes it possible to define the teaching size of a concept. The MT framework was adapted to implement background knowledge through the notion of conditional teaching size (Garcia-Piqueras & Hernández-Orallo, 2021). We discuss this approach in the following section, along with a phenomenon called interposition.

3 Conditional teaching size and interposition

Let a library be a possibly empty ordered set of programs. We denote \(B=\langle p_1, \ldots , p_k\rangle \), where each \(p_i\) identifies concept \(c_i\) and \(|B |\) denotes the number of primitives k. We use \(|B |=0\) to indicate that B is empty. A program p, identified by the learner, might be included in a library B as a primitive for later use.

In order to avoid old references when the library is expanded, we replace every instruction \(\varvec{@}\) of a program identified by the learner by the corresponding primitive. For example, if \(B=\langle \mathsf {D\varvec{)}} \rangle \) and the learner identifies \(\textsf{RU}\varvec{@}\), then the library is extended as \(B=\langle \mathsf {D\varvec{)}}, \mathsf {RUD\varvec{)}} \rangle \).
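A minimal sketch of this flattening step; resolving \(\varvec{@}\) to the last primitive in the library is a simplification that matches the single-primitive example above.

```python
# Sketch of library growth: the newly identified program is flattened
# (its '@' call replaced by the primitive it refers to) before being stored.

def extend_library(B, p):
    """Return library B extended with program p, with '@' resolved."""
    return B + (p.replace('@', B[-1] if B else ''),)

assert extend_library(('D)',), 'RU@') == ('D)', 'RUD)')
```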

We define the conditional teaching size of concept c, using library B (background knowledge), as the size in bits of the first witness set w such that \({\Phi _{\ell }} (w \vert B)=p \in [c]_L\). Let us see an example.

Example 1

Let us consider \(Q=\{a, b, c\}\), where programs \(\textsf{RURD}\varvec{)} \in [a]_L\), \(\textsf{RRURD}\varvec{)} \in [b]_L\) and \(\textsf{R}\varvec{)}\textsf{URD}\varvec{)} \in [c]_L\) (polygonal chains as shown in the 2nd, 3rd and 4th rows of Fig. 6 of Section A, respectively). These programs are the first in their equivalence classes with respect to the total order \(\prec \).

Graphical instances of a and b are, for example, the polygonal chains of the 2nd and 3rd rows of Fig. 6, respectively. We get \(TS(b \vert a)=12\) as the teaching size, in bits, of concept b using a as prior knowledge, i.e., the learner employs the library \(B=\langle \textsf{RURD}\varvec{)} \rangle \) as background knowledge. In this case, the learner outputs \(\Phi (\{\textsf{EEN}^+\} \vert B)=\textsf{R}\varvec{@}\) when using library \(B=\langle \textsf{RURD}\varvec{)} \rangle \).

In this case, there is a reduction of the teaching size of concept b when using a as prior knowledge: \(TS(b \vert a)=12<24=TS(b)\) (see Table 1). However, it may happen that background knowledge increases the teaching size of a concept. This phenomenon is what we call interposition: some prior knowledge interposes with a given concept, increasing its teaching size.

Table 1 Conditional teaching sizes for concepts of Example 1

For instance, let us consider \(\textsf{R}\varvec{)}\textsf{URD}\varvec{)} \in [c]_L\), with graphical instances such as the chain of the fourth row of Fig. 6. Using library \(B=\langle \textsf{R}\varvec{)}\textsf{URD}\varvec{)}\rangle \), we get that \(\Phi (\{ \mathsf {ENESEN^+}, \mathsf {EE^-}\} \vert B)=\textsf{RURD}\varvec{)}\) (see Table 1), so that:

$$\begin{aligned} TS(a \vert c)=30>21=TS(a) \end{aligned}$$

In other words, concept c causes interposition to concept a. This phenomenon increases the difficulty of finding the curriculum with minimum overall teaching size for a given set of concepts. We must take this issue into account in the following sections.

4 An optimisation problem: minimal curricula

In this section, we will deal with the following problem: if we want to teach a given set of concepts, which curriculum minimises the overall teaching size?

In our approach, given a set of concepts, a curriculum is a set of disjoint sequences covering all the concepts. This view of curricula is more general than a single sequence. The choice is motivated by the interposition phenomenon seen in the previous section, since some concepts may increase the teaching size of concepts taught after them. If some branches are disconnected, a curriculum should not specify which branch comes first, since they are considered independent lessons. We will show below how the algorithms that find optimal curricula manage this flexibility.

For instance, Fig. 4 shows how a set of concepts \(\{x, y, r, s, t, w, z\}\) is partitioned into three branches: \(\{x \!\rightarrow \! y \!\rightarrow \! r \!\rightarrow \! s,\, t \!\rightarrow \! w,\, z\}\), where \(x \!\rightarrow \! y\) means that y must come after x in the curriculum. For each branch, there is no background knowledge or library at the beginning. The library grows as the teacher-learner protocol progresses in each branch.

Fig. 4

Curriculum \(\{x \!\rightarrow \! y \!\rightarrow \! r \!\rightarrow \! s, t \!\rightarrow \! w, z\}\) for a set of concepts \(\{x, y, r, s, t, w, z\}\)

Let \(Q=\{c_i\}_{i=1}^n\) be a set of labelled concepts. A curriculum \(\pi = \{ \sigma _1, \ldots ,\sigma _m \}\) is a full partition of Q, where each of the m subsets \(\sigma _j \subseteq Q\) has a total order, i.e., each one is a sequence.

Definition 6

Let Q be a set of concepts and let \(\pi = \{ \sigma _1, \sigma _2, \ldots , \sigma _m \}\) be a curriculum in Q. We define the teaching size of each sequence \(\sigma = \{ c_1, c_2, \ldots , c_k \}\) as \({TS_{\ell }} (\sigma ) = {TS_{\ell }} (c_1) + \sum _{j=2}^{k} {TS_{\ell }} (c_j \vert c_1, \ldots , c_{j-1})\). The overall teaching size of \(\pi \) is simply \({TS_{\ell }} (\pi ) = \sum _{i=1}^{m} {TS_{\ell }} (\sigma _i)\).
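To make Definition 6 concrete, here is a minimal sketch assuming a conditional teaching-size oracle `cts(c, prior)`, where `prior` is the tuple of concepts already taught in the same branch (the empty tuple meaning no background knowledge); the oracle values in the closing example are those quoted in the text for Example 1 (\(TS(a)=21\), \(TS(b \vert a)=12\), \(TS(c)=24\)).

```python
# Sketch of Definition 6: a curriculum is a collection of branches, each
# branch a tuple of concepts in teaching order; branches are independent
# lessons, so their teaching sizes add up.

def branch_ts(branch, cts):
    """TS(sigma) = TS(c1) + sum_j TS(c_j | c_1, ..., c_{j-1})."""
    return sum(cts(c, branch[:j]) for j, c in enumerate(branch))

def curriculum_ts(curriculum, cts):
    """Overall TS(pi): the sum of the teaching sizes of its branches."""
    return sum(branch_ts(branch, cts) for branch in curriculum)

# Curriculum {a -> b, c} under the values quoted for Example 1:
table1 = {('a', ()): 21, ('b', ('a',)): 12, ('c', ()): 24}
cts = lambda concept, prior: table1[(concept, tuple(prior))]
assert curriculum_ts([('a', 'b'), ('c',)], cts) == 21 + 12 + 24
```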

We denote by \(\overline{Q}\) the set of all possible curricula over Q. The order in which the subsets are chosen does not matter, but the order within each subset does. For example, the curriculum \(\pi = \{x \!\rightarrow \! y \!\rightarrow \! r \!\rightarrow \! s, t \!\rightarrow \! w, z\}\) admits many paths, such as xyrstwz or zxyrstw. But note that \(\pi \) is different from \(\pi '=\{y \!\rightarrow \! x \!\rightarrow \! r \!\rightarrow \! s, w \!\rightarrow \! t, z \}\).

The number of possible curricula grows quickly with the number of concepts, which motivates the heuristic we introduce later on. In particular, the number of distinct curricula is given by the following result:

Proposition 1

For any Q with n concepts, the number of different curricula is

$$\begin{aligned} |\overline{Q} |= n! \cdot \Biggl ( \sum _{k=0}^{n-1} \left( {\begin{array}{c}n-1\\ k\end{array}}\right) \cdot {{1}\over {(k+1)!}} \Biggr ) \end{aligned}$$
(1)

Proof

For any set \(Q=\{c_1, \dots , c_n\}\) of n concepts, there are n! permutations of its n labelled elements. For each permutation, there are \(n-1\) positions at which a new branch can start, so we can choose k of these \(n-1\) positions. This yields \(k+1\) subsets whose relative order is irrelevant, i.e., \((k+1)!\) different permutations of the subsets express the same curriculum.

Therefore, there are \(n! \cdot \left( {\begin{array}{c}n-1\\ k\end{array}}\right) \cdot {{1}\over {(k+1)!}}\) curricula with \(k+1\) branches, for each \(k \in \{0, 1, \ldots , n-1\}\). Summing over k gives Eq. (1), the total number of distinct curricula.

\(\square \)
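As a sanity check, the following standalone sketch compares the closed form (1) with a direct enumeration that mirrors the proof (cut each permutation into branches and discard duplicates); for \(n=3\) both yield the thirteen curricula used for Example 1 below. The function names are ours, for illustration only.

```python
# Verify Eq. (1) against a brute-force enumeration of curricula.
from itertools import permutations
from math import comb, factorial

def count_formula(n):
    """Closed form of Proposition 1: n! * sum_{k=0}^{n-1} C(n-1, k) / (k+1)!."""
    return factorial(n) * sum(comb(n - 1, k) / factorial(k + 1) for k in range(n))

def count_bruteforce(n):
    """Cut every permutation of n concepts into branches; count distinct sets of branches."""
    curricula = set()
    for perm in permutations(range(n)):
        for cuts in range(2 ** (n - 1)):      # bit i-1 set: a new branch starts at position i
            branches, current = [], [perm[0]]
            for i in range(1, n):
                if cuts >> (i - 1) & 1:
                    branches.append(tuple(current))
                    current = [perm[i]]
                else:
                    current.append(perm[i])
            branches.append(tuple(current))
            curricula.add(frozenset(branches))
    return len(curricula)

assert count_formula(3) == count_bruteforce(3) == 13
```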

Once we know how many different curricula there are, we can try to identify those with the lowest overall teaching size, denoted by \(TS^*_Q\). A curriculum \(\pi \) is hence minimal if \(TS(\pi ) = TS^*_Q \le TS(\pi ')\), \(\forall \pi ' \in \overline{Q}\).

Regarding Example 1, there are thirteen distinct curricula in \(\overline{Q}\) according to Proposition 1. Using Table 1, we can build Table 2, showing the overall teaching size for each curriculum in \(\overline{Q}\).

Table 2 Curricula for Example 1
Fig. 5

Comparison of some relevant curricula of Example 1 (Table 1)

Fig. 5 compares some relevant curricula from Table 2, taking all the teaching sizes as a whole and measuring the descriptions in bits.

In terms of overall teaching size, there is a tie between \(\pi _1\) and \(\pi _7\), both being minimal (Fig. 5). However, there are big differences in overall teaching size for other curricula. For example, if we choose option \(\pi _4\) instead of \(\pi _1\) (or \(\pi _7\)), we incur \(\approx 63.3\%\) extra cost, even though there are just three concepts.

Further examples with more concepts and slightly different situations can be found in Section B.

These examples raise the question of whether it is necessary to calculate all the teaching sizes, as we did in Table 1 (or in Tables 9 and 11 of Section B), in order to find minimal curricula. The answer is negative, as we will see, but how can such a reduction in calculations be achieved? Moreover, is there a less costly procedure that provides a close-to-optimal solution? We deal with these issues in the following section.

5 Sufficient conditions to infer conditional teaching size

We first ask whether it is possible to approximate the conditional teaching size of a concept c when the learner is given a new primitive for the library. Namely, given that \(\Phi (w_c \vert B)=p_c \in [c]_L\), could we approximate \(TS(c \vert \langle B, p\rangle )\), where p is a new primitive? It is important not only to find a good approximation, but also not to overestimate the real conditional teaching size. We will employ this conservative property (in Sect. 6) to define an algorithm that outputs minimal curricula.

Firstly, we give a sufficient condition, valid for any kind of language, universal or not, that provides an underestimate of the conditional teaching size. Secondly, we identify when this sufficient condition applies to our particular drawing domain.

The following corollary provides that sufficient condition. It is proved in Section D as a consequence of Lemma 4.

Corollary 2

Let \(w_c\), \(w'_c \subset X\) be such that \(p_c={\Phi _{\ell }} (w_c \vert B)\) and \(p'_c={\Phi _{\ell }} (w'_c \vert B')\), where \(B'\) is a library that extends B with a new primitive. Suppose that the following conditions are met:

I. \({\textstyle \exists k_1, k_2 \in \mathbb {N}}\) such that \(\delta (w_c)=\ell (p_c)+k_1\) and \(\delta (w'_c)=\ell (p'_c)+k_2\)

II. \(\delta (w_c)-\delta (w'_c) \le \ell (p_c) - \ell (p'_c)\)

III. \(\ell (p'_c) \le \ell (p_c)\)

Then,

$$\begin{aligned} \frac{\ell (p'_c)}{\ell (p_c)} \cdot \delta (w_c) \le \delta (w'_c) \end{aligned}$$
(2)

It is important to note that Inequality (2) does not overestimate \(\delta (w'_c)\); this fact will be key in Sect. 6 to define an algorithm that outputs optimal curricula.

With respect to our drawing domain, Condition I (Corollary 2) always holds, since the language L has equivalent instructions for every command in \(\Sigma \). The concepts of the drawing domain also meet Condition II (Corollary 2), as a result of Theorem 3 (Section C), when \(w_c\) has no negative examples. Otherwise, suppose that \(w_c=\{ e^+, (e_i^-)\}\), such that \(\exists p_i\) in \(L_B\) with \(p_i {\vDash } e_i^+\) and \(p_i \notin [c]_L\). In general, we cannot ensure that adding a new primitive to \(B'\) would make \(\Phi (w'_c \vert B') \prec p_i\), \(\forall i\); if it did, it would be unnecessary to include negative examples in \(w'_c\), and Condition II would not be met. However, if concepts can be taught initially in language L (\(|B |=0\)) without negative examples, then we can always successively estimate the teaching size of such concepts. In other words, we will consider \(\frac{\ell (p'_c)}{\ell (p_c)} \cdot TS(c) \le TS(c \vert B')\) (instead of \(TS(c \vert B)\) on the left side of the inequality).

Finally, in Condition III (Corollary 2) we assumed that the length in bits of a library call, \(\varvec{@}\textsf{i}\), is the same as that of the new call \(\varvec{@}\mathsf{i'}\). But there exist cases such as \(\Phi (\{\mathsf{ENESEE^+}\} \vert \langle \textsf{RD}\varvec{)}\rangle )=\textsf{R}\varvec{)}\textsf{U}\varvec{@} \in [c]_L\) and \(\Phi (\{\mathsf{ENESEE^+}\} \vert \langle \textsf{RD}\varvec{)}, \textsf{RURD}\varvec{)}\rangle )=\textsf{R}\varvec{)}\textsf{U}\varvec{@}0 \in [c]_L\). In such cases, \(\frac{\ell (p^*_{c \vert B'})}{\ell (p^*_{c \vert B})}>1\) and Inequality (2) is not ensured, where \(p^*_{c \vert B}\) is the first expression, in order \(\prec \), in language L enhanced with library B that identifies concept c (we denote \(\ell ({p^*_{c \vert B}}) = \ell (p^*_c)\) when \( |B |=0\)).Footnote 2

However, in those situations (\(\ell (p^*_{c \vert B'}) > \ell (p^*_{c \vert B})\)), we consider \(\delta (w_c) = \delta (w'_c)\), since \(\delta (w'_c)\) cannot reduce \(\delta (w_c)\).

These considerations lead us to the definition of a valid family of heuristics that always output optimal curricula without calculating all the teaching sizes.

6 Heuristic search for optimal curricula

In our case, the search space is given by all the curricula, \(\overline{Q}\), and we need to find at least one \(\pi \) such that \(TS(\pi ) = TS^*_Q\). Each internal node in the search graph is a partial curriculum (not covering all concepts), each leaf (a node with no children, at the bottom level) is a full curriculum belonging to \(\overline{Q}\), and an edge means adding a new concept to a branch of the curriculum (remember that a curriculum is actually a tree). For instance, for two concepts a and b, the root \(n_1\) would be the empty curriculum. The children at level 2 would be \(n_{1,1}=\{a\}\) and \(n_{1,2}=\{b\}\). Finally, at level 3, the children of \(n_{1,1}\) would be \(n_{1,1,1}=\{a \!\rightarrow \! b\}\) and \(n_{1,1,2}=\{a,b\}\), and the children of \(n_{1,2}\) would be \(n_{1,2,1}=\{b \!\rightarrow \! a\}\) and \(n_{1,2,2}=\{b,a\}\). Two of these nodes are equal (\(\{a,b\}=\{b,a\}\)), so the search space is a directed acyclic graph.
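Under an assumed representation of nodes as frozensets of branch tuples, node expansion can be sketched as follows: each child adds one uncovered concept, either at the end of an existing branch or as a new independent branch (the representation and names are ours, for illustration).

```python
# Sketch of node expansion in the curriculum search graph.

def successors(node, Q):
    """Yield the children of a partial curriculum `node` over the concept set Q."""
    covered = {c for branch in node for c in branch}
    for c in Q - covered:
        yield frozenset(node | {(c,)})                             # start a new branch with c
        for branch in node:
            yield frozenset((node - {branch}) | {branch + (c,)})   # append c to a branch

# For two concepts: the root expands to {a} and {b}; {a} then expands to
# {a -> b} and {a, b}, and symmetrically for {b}, so the same leaf can be
# reached along different paths (a DAG rather than a tree).
root = frozenset()
assert set(successors(root, {'a', 'b'})) == {frozenset({('a',)}), frozenset({('b',)})}
```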

Let us start with a simple graph traversal algorithm and then evolve it into more sophisticated procedures. The standard \(A^*\) search is a baseline graph traversal algorithm (Russell & Norvig, 2020). \(A^*\) is based on a node evaluation function

$$\begin{aligned} f(n)=g(n)+h(n), \end{aligned}$$
(3)

where g(n) is the overall cost of getting to node n from the start and h(n) is the heuristic function, i.e., the estimated cost of getting from node n to a target (leaf) node. \(A^*\) search guarantees an optimal solution when the heuristic function meets the following conditions:

1. \(h(n) \ge 0\), for every node n.

2. \(h(n)=0\), if n is a target node.

3. h(n) is admissible, i.e., h(n) never overestimates.

A popular variant of \(A^*\) is weighted \(A^*\) (\(\textrm{WA}^*\), denoted \(A_{\alpha }^*\)), another graph traversal algorithm that guides the search with \(h'(n)=\alpha \cdot h(n)\), where \(\alpha \) is a parameter. If h(n) is admissible, then \(\textrm{WA}^*\) guarantees the optimal solution when \(0 < \alpha \le 1\). It is sometimes useful to employ \(\textrm{WA}^*\) because it often gives a close-to-optimal solution with fewer node expansions (Hansen & Zhou, 2007).

We look for admissible heuristics not only because they guarantee optimality for \(A^*\), but also because they help to assess other close-to-optimal solutions (Hansen & Zhou, 2007).
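A compact sketch of the resulting search over curricula follows. It assumes the `successors` generator sketched above, an `edge_cost(parent, child)` oracle returning the (possibly conditional) teaching size paid for the newly added concept, and a heuristic `h(node)` such as the \(\mathcal {H}\) of Definition 10; \(\alpha =1\) gives plain \(A^*\), and any \(0 < \alpha \le 1\) preserves the optimality guarantee when h is admissible.

```python
# Sketch of (weighted) A* over curricula; names and signatures are illustrative.
import heapq
from itertools import count

def a_star_curriculum(Q, successors, edge_cost, h, alpha=1.0):
    """Return (TS(pi), pi) for a minimal curriculum pi covering the concept set Q."""
    root = frozenset()
    tie = count()                                    # tie-breaker so nodes are never compared
    frontier = [(alpha * h(root), next(tie), 0, root)]
    best_g = {root: 0}
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if sum(len(branch) for branch in node) == len(Q):   # leaf: every concept covered
            return g, node
        for child in successors(node, Q):
            g_child = g + edge_cost(node, child)
            if g_child < best_g.get(child, float('inf')):
                best_g[child] = g_child
                heapq.heappush(frontier, (g_child + alpha * h(child),
                                          next(tie), g_child, child))
```

Because the weight only scales the h term, \(\alpha =0\) degenerates into Dijkstra's algorithm, which is one of the baselines compared in Sect. 7.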

Definition 7

(Node cost) Let Q be a set of concepts and n be a node of the search space \(\overline{Q}\). We define the teaching size cost of node n, which we denote by g(n), as the overall teaching size of the curriculum, either partial or complete, that represents node n.

Note that any path of the search graph uniquely generates a library B as background knowledge. For instance, the path \((a \!\rightarrow \! b \!\rightarrow \! c)\) generates library \(B=\langle p^*_a, p^*_{b \vert \langle p_a \rangle } \rangle \) when reaching concept c, while the path \((a \!\rightarrow \! b, c)\) generates no prior knowledge, i.e., \(|B |= 0\), when getting to concept c.

We now proceed with the theoretical results that will allow us to define an admissible heuristic. But, firstly, we need to define the cost of crossing an edge of the search graph.

Definition 8

(Estimated edge cost) Let Q be a set of concepts and n a node of the search space \(\overline{Q}\) whose last edge is \(\textsf{e}\). Let concepts a and b be the vertices of edge \(\textsf{e}\), either \(\textsf{e}=[a \!\rightarrow \! b]\) or \(\textsf{e} = [a, b]\). Let B be the library employed to reach concept b through the edge \(\textsf{e}\), with \(|B |\ge 0\). We define the estimated cost of crossing \(\textsf{e}\) as

$$\begin{aligned} h_B(\textsf{e}) = \frac{\ell ({p^*_{b \vert B}})}{\ell (p^*_b)} \cdot TS(b) \end{aligned}$$

If \(|B |=0\) then \(\ell ({p^*_{b \vert B}})=\ell (p^*_b)\) and \(h_B(\textsf{e})=TS(b)\).

We note that it would also be possible to define the estimated edge cost using the new primitive employed between one node and its child. That is to say, if B is the library employed to reach a, and \(B'\) is the library used to get to b, then we could define

$$\begin{aligned} h_{B' \setminus B}(\textsf{e}) = \left\{ \begin{array}{ll} \frac{\ell ({p^*_{b \vert B'}})}{\ell (p^*_{b \vert B})} \cdot TS(b), &{}\quad \text {if } \ell (p^*_{b \vert B'}) \le \ell ({p^*_{b \vert B}})\\ TS(b), &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$

Note that \(h_B(\textsf{e}) \le h_{B' {\setminus } B}(\textsf{e})\) for all \(\textsf{e}\). Consequently, \(h_B(\textsf{e})\) is less dominant than \(h_{B' \setminus B}(\textsf{e})\). Accordingly, if the former reduces the computational cost of identifying optimal curricula, the latter will perform even better.
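Both estimators can be written down directly; the sketch below assumes oracles `len_with(b, B)` returning \(\ell (p^*_{b \vert B})\) and `ts(b)` returning TS(b), with names chosen by us for illustration.

```python
# Sketch of the estimated edge costs: h_B of Definition 8 and the variant
# that compares the libraries before and after the new primitive.

def h_edge(b, B, len_with, ts):
    """h_B(e): description-length ratio with/without library, times TS(b)."""
    if not B:
        return ts(b)                                  # |B| = 0: plain teaching size
    return len_with(b, B) / len_with(b, ()) * ts(b)

def h_edge_incremental(b, B_old, B_new, len_with, ts):
    """Variant using the libraries before (B_old) and after (B_new) the new
    primitive; falls back to TS(b) when the new library does not shorten p*_b."""
    if len_with(b, B_new) <= len_with(b, B_old):
        return len_with(b, B_new) / len_with(b, B_old) * ts(b)
    return ts(b)
```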

We now extend Definition 8 to a path of the search graph.

Definition 9

Let n, m be nodes of the search space \(\overline{Q}\), where m is a descendant of n, possibly several levels below. Let \(\{\textsf{e}_i\}\) be a path from node n to node m; as a special notational case we use \(\{\textsf{e}_i\}=\{\emptyset \}\) when there is no path between n and m in the search space. Let \(B_i\), with \(|B_i |\ge 0, \forall i\), be the library employed to cross the edge \(\textsf{e}_i\). We define the estimated cost of getting from node n to node m as

$$\begin{aligned} H_n(m) = \left\{ \begin{array}{ll} \infty , &{} \quad \text { if } \{\textsf{e}_i\} = \{\emptyset \}\\ \sum _i h_{B_i}(\textsf{e}_i), &{} \quad \text { otherwise } \end{array} \right. \end{aligned}$$

We now employ Definition 9 to define the estimated cost of a node.

Definition 10

(Estimated cost of node n) Let n be a node of \(\overline{Q}\), we define the estimated cost of node n as

$$\begin{aligned} \mathcal {H}(n) = \min \{ H_n(m): m \text { is a leaf node of the search graph}\} \end{aligned}$$

The leaf nodes considered in Definition 10 are all the leaf nodes of the search graph, not only those that are descendants of the given node n. Examples 2 and 3 below illustrate how to calculate estimated costs in particular situations.

Example 2

Let a and b be the concepts of Example 1, where \(p^*_a=\textsf{RURD}\varvec{)}\) and \(p^*_b=\textsf{RRURD}\varvec{)}\). Consider the node \(\{a\}\) in the search space, whose leaf nodes are \(\{a, b\}\), \(\{b \!\rightarrow \! a\}\) and \(\{a \!\rightarrow \! b\}\). We want to calculate \(\mathcal {H}(\{a\})\); since we are already in node \(\{a\}\), the calculations are

$$\begin{aligned} H_{\{a\}}(\{a, b\})&= TS(b) = 24,\\ H_{\{a\}}(\{b \!\rightarrow \! a\})&= \infty \text { and}\\ H_{\{a\}}(\{a \!\rightarrow \! b\})&= h_{\langle p^*_a \rangle }([a \!\rightarrow \! b]) = \frac{\ell (p^*_{b \vert \langle p^*_a \rangle })}{\ell (p^*_b)} \cdot TS(b) = \frac{\ell (\textsf{R}\varvec{@})}{\ell (\mathsf{RRURD}\varvec{)})} \cdot 24 = 8. \end{aligned}$$

Therefore, \(\mathcal {H}(\{a\})= min \{ 24, \infty , 8\} = 8\).

Example 3

Let a, b and c be the concepts of Example 1, where \(p^*_c= \textsf{R}\varvec{)}\textsf{URD}\varvec{)}\). The leaf nodes with bounded estimated cost from node \(\{a\}\) are \(\{a, b, c\}\), \(\{a \!\rightarrow \! b, c\}\), \(\{a \!\rightarrow \! c, b\}\), \(\{a \!\rightarrow \! b \!\rightarrow \! c\}\) and \(\{a \!\rightarrow \! c \!\rightarrow \! b\}\). For instance, the estimated cost of node \(\{ a\!\rightarrow \! b, c\}\) from node \(\{a\}\) is

$$\begin{aligned} H_{\{a\}}(\{a \!\rightarrow \! b, c\}) = h_{\langle p^*_a \rangle }([a \!\rightarrow \! b]) + h_{\emptyset }([b,c]) = \frac{\ell (p^*_{b \vert \langle p^*_a\rangle })}{\ell (p^*_b)} \cdot TS(b) + TS(c) = 8 + 24 = 32. \end{aligned}$$

Similarly, \(H_{\{a\}}(\{a \!\rightarrow \! c, b\})= h_{\langle p^*_a \rangle }([a \!\rightarrow \! c]) + h_{\emptyset }([c,b])= \frac{\ell (p^*_{c \vert \langle p^*_a\rangle })}{\ell (p^*_c)} \cdot TS(c)+TS(b) = 1 \cdot 24 + 24 = 48 \),

$$\begin{aligned} H_{\{a\}}(\{a \!\rightarrow \! b \!\rightarrow \! c\})&= h_{\langle p^*_a \rangle }([a \!\rightarrow \! b]) + h_{\langle p^*_a, p^*_b \rangle }([b \!\rightarrow \! c]) \\&= \frac{\ell (p^*_{b\vert \langle p^*_a \rangle })}{\ell (p^*_b)} \cdot TS(b) + \frac{\ell (p^*_{c\vert \langle p^*_a, p^*_b \rangle })}{\ell (p^*_c)} \cdot TS(c) = \frac{\ell (\textsf{R}\varvec{@})}{\ell (\textsf{RRURD}\varvec{)})} \cdot 24 + 1 \cdot 24 = 32,\\ H_{\{a\}}(\{a \!\rightarrow \! c \!\rightarrow \! b\})&= h_{\langle p^*_a \rangle }([a \!\rightarrow \! c]) + h_{\langle p^*_a, p^*_c \rangle }([c \!\rightarrow \! b]) \\&= \frac{\ell (p^*_{c\vert \langle p^*_a \rangle })}{\ell (p^*_c)} \cdot TS(c) + \frac{\ell (p^*_{b\vert \langle p^*_a, p^*_c \rangle })}{\ell (p^*_b)} \cdot TS(b) \\&= 1 \cdot 24 + \frac{\ell (\textsf{R}\varvec{@}0)}{\ell (\textsf{R}\varvec{)}\textsf{URD}\varvec{)})} \cdot 24 = 33.\dot{3}, \text { and } \\ H_{\{a\}}(\{a, b, c\})&= 69. \end{aligned}$$

Therefore, \(\mathcal {H}(\{a\})=min\{\infty , 69, 48, 32, 33.\dot{3}\}=32\).
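These calculations can be read as a recursive minimisation over all completions of a partial curriculum. The following brute-force sketch (exponential, for illustration only) assumes the `successors` generator sketched earlier and an `estimated_edge_cost(parent, child)` that applies Definition 8 with the library accumulated along the branch receiving the new concept; unreachable leaves would contribute \(\infty \) and therefore never affect the minimum.

```python
# Brute-force sketch of the node estimate of Definition 10.

def node_estimate(node, Q, successors, estimated_edge_cost):
    """H(n): minimum estimated cost of completing the partial curriculum `node`."""
    if sum(len(branch) for branch in node) == len(Q):       # already a leaf
        return 0
    return min(estimated_edge_cost(node, child)
               + node_estimate(child, Q, successors, estimated_edge_cost)
               for child in successors(node, Q))
```

In practice this recursion could be memoised; note that, by Definition 8, it only needs description lengths and unconditional teaching sizes, not new conditional teaching size calculations.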

The heuristic of Definition 10 never overestimates the teaching size of a node if the conditions of Theorem 3 apply. As a result, \(A^*\) will output minimal curricula if we define the heuristic as \(h(n)=\mathcal {H}(n)\). Let us show some experiments.

7 Empirical results

We experimented with three sets of concepts: \(Q=\{a,b,c\}\) (Example 1 in Sect. 3), \(Q'=\{a',b',c\}\) and \(Q''=\{a,b,c,d\}\) (Examples 4 and 5, respectively, in Section B). We studied all the possible curricula for each set of concepts; thus, we had to calculate all the teaching sizes of Tables 1, 9 and 11. The calculations took a long time even though we utilised an HPCx cluster.Footnote 3 The server was shared with other users, but three cores were reserved for every teaching size calculation. For instance, it took approximately nine days to compute \(TS(c \vert a)\) (Table 1), and the calculations grew explosively as the library gained more primitives: it took more than two months to calculate \(TS(c \vert a, b, d)\) (Table 11).

As mentioned before, the admissible heuristic \(h(n)=\mathcal {H}(n)\) is valid for every set of concepts Q in our drawing domain, provided that each concept \(c \in Q\) can be taught in L (\(|B |=0\)) without negative examples. We implemented the \(A^*\) search using this heuristic and applied it to the examples.

In Example 1, the algorithm finds a minimal curriculum, \(\pi _7\), in 4 steps, through 8 effective teaching size calculations out of the overall 15 edges. Regarding Example 4, \(A^*\) finds the minimal curriculum in 4 steps, effectively calculating 7 teaching sizes out of 15 overall.

Considering Example 5 (Section B), there are two curricula that maximise the teaching size with, approximately, 66% more effort than the optimum. The \(A^*\) search identified \(\pi ^*=\{d \!\rightarrow \! a \!\rightarrow \! b \!\rightarrow \! c\}\) as the minimal curriculum in 6 steps, through 13 teaching size calculations out of 64 (about 20% of the teaching size calculations). The experiments show that there is only one optimal curriculum (\(TS(\pi ^*)=63\)), and it coincides with the one identified by the \(A^*\) search.

Tables 3, 4 and 5 summarise the experimental results obtained for Examples 1, 4 and 5, respectively, using the following distinct algorithms:

  • Dijkstra’s algorithm (D) (Dijkstra, 1959). It always identifies an optimal curriculum, since teaching sizes are positive (Barbehenn, 1998).

  • Dijkstra’s modified algorithm (\(D'\)): it stops as soon as it has gone through all the concepts. It does not necessarily return an optimal curriculum, but its computational cost, in terms of teaching size calculations, is lower than Dijkstra’s algorithm.

  • Greedy algorithm (G): from any given state, it always chooses the bifurcation that involves the least teaching size. In general, it does not identify an optimal curriculum.

  • \(A^*\) search algorithm (Hart et al., 1968). The heuristic given in Definition 10, \(\mathcal {H}\), guarantees an optimal curriculum.

  • \(A^*_{\alpha }\) search, i.e., the \(\textrm{WA}^*\) algorithm with parameter \(\alpha \): since the heuristic \(\mathcal {H}(n)\) is admissible, it identifies optimal solutions when \(0 \le \alpha \le 1\).Footnote 4

Table 3 Curricula obtained for Example 1 through different algorithms and whether they are optimal (in boldface when it has to be optimal), the % of TS operations that need to be calculated (the lower the better) and the % of extra cost when it is not optimal
Table 4 Curricula obtained for Example 4 through different algorithms and whether they are optimal (in boldface when it has to be optimal), and the % of TS operations that need to be calculated (the lower the better)
Table 5 Curricula obtained for Example 5 through different algorithms and whether they are optimal (in boldface when it has to be optimal), the % of TS operations that need to be calculated (the lower the better) and the % of extra cost when it is not optimal

As we can see in Tables 3, 4 and 5, there are big differences in computational cost among procedures that ensure optimal teaching curricula as output. For instance, in Example 1, D incurs an extra \(40\%\) computational cost with respect to \(A^*_{\alpha =1}\) (\(33.3\%\) when \(\alpha =0.8\)); in Example 4, the comparison between D and \(A^*_{\alpha =1}\) shows a \(46.7\%\) extra cost for D (\(40\%\) when \(\alpha =0.8\)). Furthermore, when we increase the number of concepts involved, as in Example 5, \(A^*_{\alpha =1}\) performs \(61\%\) less computation than D. It is reasonable to think that the computational reduction of \(A^*\) with respect to D might roughly double each time a new concept is added.

It is true that the greedy algorithm G identifies an optimal curriculum in all the examples (1, 4 and 5), with even less computational effort. However, this procedure is risky, since optimality is not guaranteed, and its advantage in TS calculations is only \(13.3\%\) in Example 1, \(6.6\%\) in Example 4 and \(7.8\%\) in Example 5, which is not so large with respect to \(A^*_{\alpha =1}\), the option that ensures optimal teaching curricula.

8 Discussion

In this paper, we provided sufficient conditions to define a procedure that effectively identifies optimal curricula for a given set of concepts. This is used in an \(A^*\) search enhanced by the heuristic given in Definition 10.

Inequality (2) can be applied to similar machine-teaching settings using different languages L, universal or not. It suffices to consider a similar machine-teaching scenario and a set of concepts that meet the conditions.

Thus, we may introduce new instructions, such as ‘\(\varvec{\vert }\)’ (a logical OR operator), and even discard instructions like ‘\(\varvec{)}\)’. In particular, the previous results, dedicated to obtaining a heuristic, can be applied to any language \(L'\) that satisfies the following conditions:

(i) \(L'\) should contain equivalent instructions for every command of \(\Sigma \).

(ii) The size in bits of the equivalent instructions in \(L'\) should be less than or equal to that of their counterpart commands in \(\Sigma \).

(iii) There is an instruction, similar to \(\varvec{@}\), that is able to retrieve primitives in language \(L'\) and is the last instruction in the lexicographical order.

We have shown that it is possible to apply a heuristic search to MT scenarios that satisfy conditions (i), (ii) and (iii) above, with the following properties:

1. For all the case studies shown, it provides a reduction of the computational effort involved.

2. The identification of optimal curricula is guaranteed.

Our approach is defined for compositional languages of a discrete character and, correspondingly, a discrete optimisation process over a discrete space. As a result, it is not straightforward to adapt the setting to continuous representations, including neural networks. However, the general framework may adapt to a language L based on continuous principles: instructions in L might be given different weights, so that the total order \(\prec \) could be defined through a probability distribution. Still, our MT framework is better suited to situations where we want to understand, and explain, how each concept builds on previous concepts compositionally; for instance, given an observation, what is the best explanation for that information? In fact, programs in the language L of our particular 2D drawing domain (Section A) can be described in terms of automata, which have also been used recently in AI to spot patterns describing 2D data sets (Das et al., 2023). The discrete character of our approach and of the language we have used can be seen as a limitation, but also as an opportunity for other researchers to investigate the performance of our heuristic search in other MT settings (even continuous ones) and domains.

Overall, our starting point was the realisation that curriculum learning for a given set of compositional concepts is underdeveloped mainly because of computational costs. Thanks to the new heuristics introduced in this paper, we are now able to effectively implement CL in MT, not only at the level of improving a sequence of training examples for a given task, but also by tackling the combinatorial explosion of interlocking concepts.