Machine Learning, Volume 90, Issue 1, pp 91–126

Learning figures with the Hausdorff metric by fractals—towards computable binary classification

  • Mahito Sugiyama
  • Eiju Hirowatari
  • Hideki Tsuiki
  • Akihiro Yamamoto

DOI: 10.1007/s10994-012-5301-z

Cite this article as:
Sugiyama, M., Hirowatari, E., Tsuiki, H. et al. Mach Learn (2013) 90: 91. doi:10.1007/s10994-012-5301-z


Abstract

We present learning of figures, nonempty compact sets in Euclidean space, based on Gold’s learning model aiming at a computable foundation for binary classification of multivariate data. Encoding real vectors with no numerical error requires infinite sequences, resulting in a gap between each real vector and its discretized representation used for the actual machine learning process. Our motivation is to provide an analysis of machine learning problems that explicitly tackles this aspect which has been glossed over in the literature on binary classification as well as in other machine learning tasks such as regression and clustering. In this paper, we amalgamate two processes: discretization and binary classification. Each learning target, the set of real vectors classified as positive, is treated as a figure. A learning machine receives discretized vectors as input data and outputs a sequence of discrete representations of the target figure in the form of self-similar sets, known as fractals. The generalization error of each output is measured by the Hausdorff metric. Using this learning framework, we reveal a hierarchy of learnable classes under various learning criteria in the track of traditional analysis based on Gold’s learning model, and show a mathematical connection between machine learning and fractal geometry by measuring the complexity of learning using the Hausdorff dimension and the VC dimension. Moreover, we analyze computability aspects of learning of figures using the framework of Type-2 Theory of Effectivity (TTE).


Keywords: Binary classification · Discretization · Self-similar set · Gold’s learning model · Hausdorff metric · Type-2 theory of effectivity

1 Introduction

Discretization is a fundamental process in machine learning from analog data. For example, Fourier analysis is one of the most essential signal processing methods, and its discrete version, discrete Fourier analysis, is used for learning or recognition on a computer from continuous signals. However, in that method only the time axis is discretized, so each data point is not fully discretized. That is to say, continuous (electrical) waves are essentially treated as finite/infinite sequences of real numbers, hence each value is still continuous (analog). The gap between analog and digital data therefore remains.

This problem appears throughout machine learning from observed multivariate data. The reason is that an infinite sequence is needed to encode a real vector exactly without any numerical error, since the cardinality of the set of real numbers, which is the same as that of infinite sequences, is much larger than that of the set of finite sequences. Thus to treat each data point on a computer, it has to be discretized and considered as an approximate value with some numerical error. However, to date, most machine learning algorithms ignore the gap between the original value and its discretized representation. This gap could result in unexpected numerical errors.1 Since machine learning algorithms are now applied to massive datasets, it is urgent to give a theoretical foundation for learning, such as classification, regression, and clustering, from multivariate data in a fully computational manner to guarantee the soundness of the results of learning.

In the field of computational learning theory, Valiant’s learning model (also called the PAC, Probably Approximately Correct, learning model), proposed by Valiant (1984), is used for theoretical analysis of machine learning algorithms. In this model, we can analyze the robustness of a learning algorithm in the face of noise or inaccurate data, and the complexity of learning with respect to the rate of convergence or the size of the input, using the concept of probability. Blumer et al. (1989) and Ehrenfeucht et al. (1989) provided the crucial conditions for learnability, that is, the lower and upper bounds for the sample size, using the VC (Vapnik-Chervonenkis) dimension (Vapnik and Chervonenkis 1971). These results can be applied to various concept representations that handle real-valued inputs and use real-valued parameters, for example, to analyze learning of neural networks (Baum and Haussler 1989). However, this learning model is not in line with discrete and computational analysis of machine learning. We cannot know which class of continuous objects is exactly learnable, or what kind of data are needed to learn, from a finite expression of discretized multivariate data. Although PAC learning of axis-parallel rectangles, which can be viewed as a variant of learning from multivariate data with numerical error, has already been investigated (Blumer et al. 1989; Kearns and Vazirani 1994; Long and Tan 1998), it is not applicable to our study. Our goal is to investigate computational learning, focusing on a common ground between “learning” and “computation” of real numbers based on the behavior of Turing machines, without any reference to probability distributions. For this investigation, we need to distinguish abstract mathematical objects such as real numbers from their concrete representations, or codes, on a computer.

Instead, in this paper we use Gold’s learning model (also called identification in the limit), which was originally designed for learning of recursive functions (Gold 1965) and languages (Gold 1967). In the model, a learning machine is assumed to be a procedure, i.e., a Turing machine (Turing 1937) which never halts, that receives training data from time to time, and outputs representations (hypotheses) of the target from time to time. All data are usually assumed to be given at some point in the future. Starting from this learning model, learnability of classes of discrete objects, such as languages and recursive functions, has been analyzed in detail under various learning criteria (Jain et al. 1999). However, analysis of learning for continuous objects, such as classification, regression, and clustering for multivariate data, with Gold’s model is still under development, despite such settings being typical in modern machine learning. To the best of our knowledge, the only line of studies devoted to learning of real-valued functions is by Hirowatari and Arikawa (1997, 2001), Apsītis et al. (1999), and Hirowatari et al. (2003, 2005, 2006), where they addressed the analysis of learnable classes of real-valued functions using computable representations of real numbers.2 We therefore need a new theoretical and computational framework for modern machine learning based on Gold’s learning model with discretization of numerical data.

In this paper we consider the problem of binary classification for multivariate data, which is one of the most fundamental problems in machine learning and pattern recognition. In this task, a training dataset consists of a set of pairs {(x1,y1),(x2,y2),…,(xn,yn)}, where xi∈ℝd is a feature vector, yi∈{0,1} is a label, and the d-dimensional Euclidean space ℝd is a feature space. The goal is to learn a classifier from the given training dataset, that is, to find a mapping h:ℝd→{0,1} such that, for all x∈ℝd, h(x) is expected to be the same as the true label of x. In other words, such a classifier h is the characteristic function of a subset L={x∈ℝd ∣ h(x)=1} of ℝd, which should be as similar as possible to the true set K={x∈ℝd ∣ the true label of x is 1}. Throughout the paper, we assume for simplicity that each feature is normalized by some data preprocessing such as min-max normalization, that is, the feature space is the unit interval (cube) \(\mathcal {I}^{d} = [0, 1] \times \dots\times[0, 1]\) in the d-dimensional Euclidean space ℝd. In many realistic scenarios, each target K is a closed and bounded subset of \(\mathcal {I}^{d}\), i.e., a nonempty compact subset of \(\mathcal {I}^{d}\), called a figure. Thus here we address the problem of binary classification by treating it as “learning of figures”.

In this machine learning process, we implicitly treat any feature vector through its representation, or code on a computer, that is, each feature vector \(x \in \mathcal {I}^{d}\) is represented by a sequence p over some alphabet Σ using an encoding scheme ρ. Here such a surjective mapping ρ is called a representation, and it should map the set of “infinite” sequences Σω to \(\mathcal {I}^{d}\) since there is no one-to-one correspondence between finite sequences and real numbers (or real vectors). In this paper, we use the binary representation ρ:Σω→[0,1] with Σ={0,1}, which is defined by ρ(p):=∑pi⋅2−(i+1) for an infinite sequence p=p0p1p2…. For example, ρ(0100…)=0.25, ρ(1000…)=0.5, and ρ(0111…)=0.5. However, we cannot treat infinite sequences on a computer in finite time and, instead, we have to use discretized values, i.e., truncated finite sequences, in any actual machine learning process. Thus in learning of a classifier h for the target figure K, we cannot use an exact data point x∈K but have to use a discretized finite sequence w∈Σ* which tells us that x takes one of the values in the set {ρ(p) ∣ w⊑p} (w⊑p means that w is a prefix of p). For instance, if w=01, then x should be in the interval [0.25,0.5]. For a finite sequence w∈Σ*, we define ρ(w):={ρ(p) ∣ w⊑p with p∈Σω} using the same symbol ρ. From a geometric point of view, ρ(w) is a hyper-rectangle whose sides are parallel to the axes in the space \(\mathcal {I}^{d}\). For example, for the binary representation ρ, we have ρ(0)=[0,0.5], ρ(1)=[0.5,1], ρ(01)=[0.25,0.5], and so on. Therefore in the actual learning process, while a target set K and each point x∈K exist mathematically, a learning machine can only treat finite sequences as training data.
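
To make the geometric reading of ρ(w) concrete, the closed interval of a finite binary sequence can be computed directly from the definition of ρ. The following Python sketch (the function name `rho` is ours, introduced only for illustration) follows the fact that the infimum of ρ(w) is attained by w000… and the supremum by w111…:

```python
def rho(w: str) -> tuple[float, float]:
    """Closed interval rho(w) = {rho(p) | w is a prefix of p} for a finite binary w."""
    # Infimum: continue w with 000...; supremum: continue with 111...,
    # which adds exactly sum_{i >= len(w)} 2^-(i+1) = 2^-len(w).
    low = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(w))
    return (low, low + 2 ** -len(w))

print(rho("0"))   # (0.0, 0.5)
print(rho("01"))  # (0.25, 0.5)
```

Since all values are dyadic rationals, the floating-point arithmetic here is exact.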

Here the problem of binary classification is stated in a computational manner as follows: Given a training dataset {(w1,y1),(w2,y2),…,(wn,yn)} (wi∈Σ* for each i∈{1,2,…,n}), where yi=1 if \(\rho(w_{i}) \cap K \not= \emptyset\) for a target figure \(K \subseteq \mathcal {I}^{d}\) and yi=0 otherwise, learn a classifier h:Σ*→{0,1} for which h(w) should be the same as the true label of w for all w∈Σ*. Each training datum (wi,yi) is called a positive example if yi=1 and a negative example if yi=0.

Assume that a figure K is represented by a set P of infinite sequences, i.e., {ρ(p) ∣ p∈P}=K, using the binary representation ρ. Then learning the figure is different from learning the well-known prefix closed set Pref(P), defined as Pref(P):={w∈Σ* ∣ w⊑p for some p∈P}, since generally \(\mathrm {Pref}(P) \not= \{w \in\varSigma^{*} \mid\rho(w) \cap K \not= \emptyset\}\) holds. For example, if P={p∈Σω ∣ 1⊑p}, the corresponding figure K is the interval [0.5,1]. Then, every prefix of the infinite sequence 0111… is a positive example since ρ(0111…)=0.5 and \(\rho(\mathtt {0}\mathtt {1}\mathtt {1}\mathtt {1}\dots) \cap K \not= \emptyset\), but no such prefix other than the empty string is contained in Pref(P). This problem is fundamentally due to rational numbers having two representations; for example, both 0111… and 1000… represent 0.5. Solving this mismatch between objects of learning and their representations is one of the challenging problems of learning continuous objects based on their representation in a computational manner.
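
The mismatch can be checked mechanically. In this small sketch (the helper names `is_positive` and `in_pref_P` are ours), K=[0.5,1] is the figure of P={p ∣ 1⊑p}; the finite sequence 01 is a positive example, because the closed intervals ρ(01)=[0.25,0.5] and K share the point 0.5, yet 01 is not a prefix of any member of P:

```python
def rho(w):
    """Closed interval of a finite binary sequence under the binary representation."""
    low = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(w))
    return (low, low + 2 ** -len(w))

K = (0.5, 1.0)  # figure corresponding to P = {p in Sigma^omega | p starts with 1}

def is_positive(w):
    """True iff the closed intervals rho(w) and K intersect."""
    lo, hi = rho(w)
    return max(lo, K[0]) <= min(hi, K[1])

def in_pref_P(w):
    """True iff w is a prefix of some p in P: w is empty or starts with 1."""
    return w == "" or w.startswith("1")

print(is_positive("01"), in_pref_P("01"))  # True False: the mismatch
```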

For finite expression of classifiers, we use self-similar sets known as fractals (Mandelbrot 1982) to exploit their simplicity and the power of expression theoretically provided by the field of fractal geometry. Specifically, we can approximate any figure by some self-similar set arbitrarily closely (derived from the Collage Theorem given by Falconer 2003) and can compute it by a simple recursive algorithm, called an IFS (Iterated Function System) (Barnsley 1993; Falconer 2003). This approach can be viewed as the analog of the discrete Fourier analysis, where FFT (Fast Fourier Transformation) is used as the fundamental recursive algorithm. Moreover, in the process of sampling from analog data in discrete Fourier analysis, scalability is a desirable property. It requires that when the sample resolution increases, the accuracy of the result is monotonically refined. We formalize this property as effective learning of figures, which is inspired by effective computing in the framework of Type-2 Theory of Effectivity (TTE) studied in computable analysis (Schröder 2002b; Weihrauch 2000). This model guarantees that as a computer reads more and more precise information of the input, it produces more and more accurate approximations of the result. Here we adapt this model from computation to learning, where if a learner (learning machine) receives more and more accurate training data, it learns better and better classifiers (self-similar sets) approximating the target figure.

To summarize, our framework of learning figures (shown in Fig. 1) is as follows: Positive examples are axis-parallel rectangles intersecting the target figure, and negative examples are those disjoint with the target. A learner reads a presentation (infinite sequence of examples), and generates hypotheses. Hypotheses are finite sequences (codes) that are discrete expressions of self-similar sets. To evaluate “goodness” of each classifier, we use the concept of generalization error and measure the error by the Hausdorff metric since it induces the standard topology on the set of figures (Beer 1993).
Fig. 1

Our framework of learning figures

The main contributions of this paper are as follows:
  1.

    We formalize the learning of figures using self-similar sets based on Gold’s learning model towards realizing fully computable binary classification (Sect. 3). We construct a representational system for learning using self-similar sets based on the binary representation of real numbers, and show desirable properties of it (Lemmas 3.2, 3.3, and 3.4).

  2.

    We construct a learnability hierarchy under various learning criteria, summarized in Fig. 3 (Sects. 4 and 5). We consider five criteria for learning: explanatory learning (Sect. 4.1), consistent learning (Sect. 4.2), reliable and refutable learning (Sect. 4.3), and effective learning (Sect. 5).

  3.

    We show a mathematical connection between learning and fractal geometry by measuring the complexity of learning using the Hausdorff dimension and the VC dimension (Sect. 6). Specifically, we give a lower bound on the number of positive examples using the dimensions.

  4.

    We also show a connection between computability of figures studied in computable analysis and learnability of figures discussed in this paper using TTE (Sect. 7). Learning can be viewed as computable realization of the identity from the set of figures to the same set equipped with a finer topology.


The rest of the paper is organized as follows: We review related work in comparison to the present work in Sect. 2. We formalize computable binary classification as learning of figures in Sect. 3 and analyze the learnability hierarchy induced by variants of our model in Sects. 4 and 5. The mathematical connection between fractal geometry and Gold’s model with the Hausdorff and the VC dimensions is presented in Sect. 6 and between computability and learnability of figures in Sect. 7. Section 8 gives the conclusion.

A preliminary version of this paper was presented at the 21st International Conference on Algorithmic Learning Theory (Sugiyama et al. 2010). In this paper, formalization of learning in Sect. 3 is completely updated for clarity and simplicity, and all theorems and lemmas have formal proofs (they were omitted in the conference paper). Furthermore, discussion about related work in Sect. 2 and TTE analysis in Sect. 7 are new contributions. In addition, several examples and figures are added for readability.

2 Related work

Statistical approaches to machine learning are now achieving great success since they were originally designed for analyzing observed multivariate data and, to date, many statistical methods have been proposed to treat continuous objects such as real-valued functions (Bishop 2007). However, most methods pay no attention to discretization and the finite representation of analog data on a computer. For example, multi-layer perceptrons are used to learn real-valued functions, since they can approximate every continuous function arbitrarily accurately. However, a perceptron is based on the idea of regulating analog wiring (Rosenblatt 1958), hence such learning is not purely computable, i.e., it ignores the gap between analog raw data and digital discretized data. Furthermore, although several discretization techniques have been proposed by Elomaa and Rousu (2003), Fayyad and Irani (1993), Gama and Pinto (2006), Kontkanen et al. (1997), Li et al. (2003), Lin et al. (2003), Liu et al. (2002), and Skubacz and Hollmén (2000), they treat discretization as data preprocessing for improving the accuracy or efficiency of machine learning algorithms. The process of discretization is therefore not considered from a computational point of view, and “computability” of machine learning algorithms is not discussed in sufficient depth.

There are several related articles considering learning under various restrictions in Gold’s model (Goldman et al. 2003), Valiant’s model (Ben-David and Dichterman 1998; Decatur and Gennaro 1995), and other learning contexts (Khardon and Roth 1999). Moreover, learning from partial examples, or examples with missing information, has recently attracted much attention in Valiant’s learning model (Michael 2010, 2011). In this paper we also consider learning from examples with missing information, namely truncated finite sequences. However, our model is different from the cited work, since the “missing information” in this paper corresponds to measurement error of real-valued data. Our motivation comes from actual measurement/observation of a physical object, where every datum obtained by an experimental instrument must have some numerical error in principle (Baird 1994). For example, if we measure the size of a cell with a microscope equipped with micrometers, we cannot know the true value of the size but only an approximate value with numerical error, which depends on the degree of magnification of the micrometers. In this paper we treat this process as learning from multivariate data, where an approximate value corresponds to a truncated finite sequence and the error becomes small as the length of the sequence increases. The model of computation for real numbers within the framework of TTE, as mentioned in the introduction, fits our motivation, and this approach is unique in computational learning theory.

Self-similar sets can be viewed as a geometric interpretation of languages recognized by ω-automata (Perrin and Pin 2004), first introduced by Büchi (1960), and learning of such languages has been investigated by De La Higuera and Janodet (2001), Jain et al. (2011). Both works focus on learning ω-languages from their prefixes, i.e. texts (positive data), and show several learnable classes. This approach is different from ours since our motivation is to address computability issues in the field of machine learning from numerical data, and hence there is a gap between prefixes of ω-languages and positive data for learning in our setting as mentioned in the introduction. Moreover, we consider learning from both positive and negative data, which is a new approach in the context of learning of infinite words.

Recently, two of the authors, Sugiyama and Yamamoto (2010), have addressed discretization of real vectors in a computational approach and proposed a new similarity measure, called coding divergence. It evaluates the similarity between two sets of real vectors and can be applied to many machine learning tasks such as classification and clustering. However, it does not address the issue of the learnability or complexity of learning of continuous objects.

3 Formalization of learning

To analyze binary classification in a computable approach, we first formalize learning of figures based on Gold’s model. Specifically, we define targets of learning, representations of classifiers produced by a learning machine, and a protocol for learning. In the following, let ℕ be the set of natural numbers including 0, ℚ the set of rational numbers, and ℝ the set of real numbers. The set ℕ+ (resp. ℝ+) is the set of positive natural (resp. real) numbers. The d-fold product of ℝ is denoted by ℝd and the set of nonempty compact subsets of ℝd is denoted by \(\mathcal {K}^{*}\). Notations used in this paper are summarized in Table 1.
Table 1

Notation used in this paper

ℕ : The set of natural numbers including 0
ℕ+ : The set of positive natural numbers, i.e., ℕ+=ℕ∖{0}
ℚ : The set of rational numbers
ℝ : The set of real numbers
ℝ+ : The set of positive real numbers
d : The number of dimensions (d∈ℕ+)
ℝd : d-dimensional Euclidean space
\(\mathcal {K}^{*}\) : The set of figures (nonempty compact subsets of ℝd)
\(\mathcal {I}^{d}\) : The unit interval [0,1]×…×[0,1]
K, L : Figures (nonempty compact sets)
|X| : The number of elements in X
\(\mathcal {F}\) : Set of figures
φ : Contraction for real numbers
C : Finite set of contractions
Φ : Contraction for figures
Σ : Alphabet (in this paper, Σ={0,1})
Σd : The set of finite sequences whose length is d, i.e., Σd={a1a2…ad ∣ ai∈Σ}
Σ* : The set of finite sequences
Σ+ : The set of finite sequences without the empty string λ
Σω : The set of infinite sequences
λ : The empty string
u, v, w : Finite sequences
w⊑p : w is a prefix of p (w⊏p means w⊑p and w≠p)
↑w : The set {p∈Σω ∣ w⊑p}
〈⋅〉 : The tupling function, i.e., \(\langle p^{1}, p^{2}, \dots, p^{d}\rangle :=p_{0}^{1}p_{0}^{2}\dots p_{0}^{d} p_{1}^{1}p_{1}^{2}\dots p_{1}^{d} p_{2}^{1}p_{2}^{2}\dots p_{2}^{d}\dots\)
|w| : The length of w. If w=〈w1,…,wd〉∈(Σd)*, |w|=|w1|=…=|wd|
\(\operatorname {diam}(k)\) : The diameter of the set ρ(w) with |w|=k, i.e., \(\operatorname {diam}(k) = \sqrt{d} \cdot2^{-k}\)
p, q : Infinite sequences
V, W : Sets of finite or infinite sequences
ρ : Binary representation
ξ, ζ : Representations, i.e., mappings from finite or infinite sequences to some objects
ξ≤ζ : ξ is reducible to ζ
ξ≡ζ : ξ is equivalent to ζ
\(\nu_{\mathbb {Q}^{d}}\) : Representation for rational numbers
\(\nu_{\mathcal {Q}}\) : Representation for finite sets of rational numbers
\(\mathcal {H}\) : The hypothesis space (the set of finite sets of finite sequences)
h : Classifier of hypothesis H
κ : The mapping from hypotheses to figures
σ : Presentation (informant or text)
Pos(K) : The set of finite sequences of positive examples of K, i.e., \(\{w \mid\rho(w) \cap K \not= \emptyset\}\)
Posk(K) : The set {w∈Pos(K) ∣ |w|=k}
Neg(K) : The set of finite sequences of negative examples of K, i.e., {w ∣ ρ(w)∩K=∅}
dE : The Euclidean distance
dH : The Hausdorff distance
Hs : The Hausdorff measure
dimH : The Hausdorff dimension
dimB : The box-counting dimension
dimS : The similarity dimension
dimVC : The VC dimension

Throughout this paper, we use the binary representation \(\rho^{d} : (\varSigma^{d})^{\omega} \to \mathcal {I}^{d}\) as the canonical representation for real numbers. If d=1, this is defined as follows: Σ={0,1} and
$$ \rho^1(p) := \sum_{i = 0}^{\infty} p_i \cdot2^{-(i + 1)} $$
for an infinite sequence p=p0p1p2…. Note that Σd denotes the set {a1a2…ad ∣ ai∈Σ} and Σ1=Σ. For example, ρ1(0100…)=0.25, ρ1(1000…)=0.5, and so on. Moreover, by using the same symbol ρ, we introduce a representation \(\rho^{1} :\varSigma^{*} \to \mathcal {K}^{*}\) for finite sequences defined as follows:
$$ \rho^1(w) := \rho^1({\uparrow}w) = \big\{\rho^1(p) \mid w \sqsubseteq p\big\}, $$
where ↑w={p∈Σω ∣ w⊑p}. For instance, ρ1(01)=[0.25,0.5] and ρ1(10)=[0.5,0.75].
In a d-dimensional space with d>1, we use the d-dimensional binary representation \(\rho^{d} : (\varSigma^{d})^{\omega } \to \mathcal {I}^{d}\) defined in the following manner.
$$ \rho^d \bigl(\bigl\langle p^1, p^2, \dots, p^d\bigr\rangle\bigr) := \bigl( \rho^1 \bigl(p^1 \bigr), \rho^1 \bigl(p^2 \bigr), \dots, \rho^1 \bigl(p^d \bigr) \bigr), $$
where d infinite sequences p1, p2, …, and pd are concatenated using the tupling function 〈⋅〉 such that
$$ \bigl\langle p^1, p^2, \dots, p^d\bigr \rangle :=p_0^1p_0^2\dots p_0^d p_1^1p_1^2 \dots p_1^d p_2^1p_2^2 \dots p_2^d\dots. $$
Similarly, we define a representation \(\rho^{d} : (\varSigma^{d})^{*} \to \mathcal {K}^{*}\) by
$$ \rho^d \bigl(\bigl\langle w^1, w^2, \dots, w^d\bigr\rangle\bigr) := \rho^d \bigl({\uparrow}\bigl \langle w^1, w^2, \dots, w^d\bigr\rangle \bigr), $$
$$ \bigl\langle w^1, w^2, \dots, w^d\bigr \rangle :=w_0^1w_0^2\dots w_0^d w_1^1w_1^2 \dots w_1^d \dots w_{n}^1w_n^2 \dots w_n^d $$
with |w1|=|w2|=…=|wd|=n. Note that, for any w=〈w1,…,wd〉∈(Σd)*, |w1|=|w2|=…=|wd| always holds, and we denote the length by |w| in this paper. For a set of finite sequences, i.e., a language L⊆(Σd)*, we define
$$ \rho^d(L) :=\big \{\rho^d(w) | w \in L\big \}. $$
We omit the superscript d of ρd if it is understood from the context.
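
Under the definitions above, the tupling function and the hyper-rectangle ρd(w) can be sketched as follows (the helper names are ours); the i-th component of an interleaved sequence w is recovered as w[i::d]:

```python
def tupling(ws):
    """Interleave d equal-length sequences into <w^1, ..., w^d>."""
    assert len({len(w) for w in ws}) == 1, "components must have equal length"
    return "".join("".join(block) for block in zip(*ws))

def rho1(w):
    """Closed interval rho^1(w) of a finite binary sequence."""
    low = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(w))
    return (low, low + 2 ** -len(w))

def rho_d(w, d):
    """rho^d(w): an axis-parallel hyper-rectangle, one interval per axis."""
    return [rho1(w[i::d]) for i in range(d)]  # w[i::d] undoes the tupling

print(tupling(["01", "11"]))  # '0111'
print(rho_d("0111", 2))       # [(0.25, 0.5), (0.75, 1.0)]
```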
A target set of learning is a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) fixed a priori, and one of them is chosen as a target in each learning phase. A learning machine uses self-similar sets, known as fractals and defined by finite sets of contractions. This approach is one of the key ideas in this paper. Here, a contraction is a mapping φ:ℝd→ℝd such that d(φ(x),φ(y))≤c⋅d(x,y) for all x,y∈ℝd and some real number c with 0<c<1. For a finite set of contractions C, a nonempty compact set F satisfying
$$ F = \bigcup_{\varphi \in C}\varphi (F) $$
is determined uniquely (see Falconer 2003 for a formal proof). The set F is called the self-similar set of C. Moreover, if we define a mapping \(\varPhi:\mathcal {K}^{*} \to \mathcal {K}^{*}\) by
$$ \varPhi(K) := \bigcup_{\varphi \in C}\varphi (K) $$
and define
$$ \varPhi^0(K) := K\quad \text{and}\quad \varPhi^{k + 1}(K) := \varPhi\bigl(\varPhi^k(K) \bigr) $$
for each k∈ℕ recursively, then
$$ F = \bigcap_{k = 0}^{\infty}\varPhi^k(K) $$
for every \(K \in \mathcal {K}^{*}\) such that φ(K)⊂K for every φ∈C. This means that we have a level-wise construction algorithm with Φ to obtain the self-similar set F.
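
The level-wise construction can be sketched with a concrete IFS. Assuming the three ratio-1/2 contractions that generate the Sierpiński triangle (chosen here only as an example; it reappears in Example 3.1), each application of Φ replaces every box by its three contracted images:

```python
# Translation parts of three contractions x -> x/2 + t on R^2 (ratio 1/2);
# these generate the Sierpinski triangle.
TRANSLATIONS = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5)]

def Phi(boxes):
    """One application of Phi(K) = union of phi(K); a box is (x, y, side)."""
    return [(tx + 0.5 * x, ty + 0.5 * y, 0.5 * s)
            for (tx, ty) in TRANSLATIONS
            for (x, y, s) in boxes]

boxes = [(0.0, 0.0, 1.0)]  # Phi^0(I^2): the unit square
for _ in range(3):
    boxes = Phi(boxes)     # Phi^1, Phi^2, Phi^3
print(len(boxes), boxes[0][2])  # 27 squares of side 0.125
```

As k grows, Φk(I²) shrinks down toward the self-similar set F, matching the intersection formula above.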
A learning machine produces hypotheses, each of which is a finite language and becomes a finite expression of a self-similar set that works as a classifier. Formally, for a finite language H⊆(Σd)*, we consider H0,H1,H2,… such that Hk is recursively defined as follows:
$$ \begin{cases} H^0 := \left \{\lambda\right \},\\ H^k := \{\langle w^{1}u^{1}, w^{2}u^{2}, \dots, w^{d}u^{d} \rangle | \langle w^{1}, w^{2}, \dots, w^{d}\rangle\in H^{k - 1}\ \text{and}\ \langle u^{1}, u^{2}, \dots, u^{d}\rangle\in H\}. \end{cases} $$
We can easily construct a fixed program P(⋅) which generates H0,H1,H2,… when receiving a hypothesis H. We give the semantics of a hypothesis H by the following equation:
$$ \kappa(H) := \bigcap_{k = 0}^{\infty} \bigcup\rho\bigl(H^k \bigr). $$
Since ⋃ρ(Hk)⊃⋃ρ(Hk+1) holds for all k∈ℕ, κ(H)=limk→∞⋃ρ(Hk). We denote the set of hypotheses {H⊆(Σd)* ∣ H is finite} by \(\mathcal {H}\) and call it the hypothesis space. We use this hypothesis space throughout the paper. Note that, for a pair of hypotheses H and L, H=L implies κ(H)=κ(L), but the converse may not hold.

Example 3.1

Assume d=2 and let a hypothesis H be the set {〈0,0〉,〈0,1〉,〈1,1〉}={00,01,11}. Generating H0,H1,H2,… by the recursion above, the figure κ(H) defined in (6) is the Sierpiński triangle (Fig. 2). If we consider the following three mappings:
$$ \varphi _1 \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} + \begin{bmatrix} 0\\ 0 \end{bmatrix} ,\qquad \varphi _2 \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} + \begin{bmatrix} 0\\ 1/2 \end{bmatrix} ,\qquad \varphi _3 \begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} x_1\\ x_2 \end{bmatrix} + \begin{bmatrix} 1/2\\ 1/2 \end{bmatrix} , $$
the three squares \(\varphi _{1}(\mathcal {I}^{d})\), \(\varphi _{2}(\mathcal {I}^{d})\), and \(\varphi _{3}(\mathcal {I}^{d})\) are exactly the same as ρ(00), ρ(01), and ρ(11), respectively. Thus each sequence in a hypothesis can be viewed as a representation of one of these squares, which are called generators for a self-similar set, since if we have the initial set \(\mathcal {I}^{d}\) and generators \(\varphi _{1}(\mathcal {I}^{d})\), \(\varphi _{2}(\mathcal {I}^{d})\), and \(\varphi _{3}(\mathcal {I}^{d})\), we can reproduce the three mappings φ1, φ2, and φ3 and construct the self-similar set from them. Note that there exist infinitely many hypotheses L such that κ(H)=κ(L) and H≠L. For example, L={〈0,0〉, 〈1,1〉, 〈00,10〉, 〈00,11〉, 〈01,11〉}.
Fig. 2

Generation of the Sierpiński triangle from the hypothesis H={〈0,0〉,〈0,1〉,〈1,1〉} (Example 3.1)
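
In tupled form the hypothesis of Example 3.1 is the string set {00, 01, 11}, and since interleaving blockwise makes the tupling of componentwise concatenations equal the concatenation of tuplings, Hk is simply the set of k-fold concatenations of elements of H. A sketch (the helper name `expand` is ours):

```python
from itertools import product

H = ["00", "01", "11"]  # the hypothesis of Example 3.1 in tupled form

def expand(H, k):
    """H^k of the recursion: all concatenations of k elements of H (H^0 = {''})."""
    return {"".join(parts) for parts in product(H, repeat=k)}

print(sorted(expand(H, 1)))  # ['00', '01', '11']
print(len(expand(H, 2)))     # 9: squares of side 1/4 approximating the triangle
```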

Lemma 3.2

(Soundness of hypotheses)

For every hypothesis \(H \in \mathcal {H}\), the set κ(H) defined by (6) is a self-similar set.


Proof Let H={w1,w2,…,wn}. We can easily check that the set of rectangles ρ(w1),ρ(w2),…,ρ(wn) is a generator defined by the mappings φ1,φ2,…,φn, where each φi maps the unit interval \(\mathcal {I}^{d}\) to the figure ρ(wi). Define Φ and Φk in the same way as (4) and (5). For each k∈ℕ,
$$ \bigcup\rho\bigl(H^k \bigr) = \varPhi^k \bigl( \mathcal {I}^d \bigr) $$
holds. It therefore follows that the set κ(H) is exactly the same as the self-similar set defined by the mappings φ1,φ2,…,φn, that is, κ(H)=⋃φi(κ(H)) holds. □
To evaluate the “goodness” of each hypothesis, we use the concept of generalization error, which is usually used to score the quality of hypotheses in a machine learning context. The generalization error of a hypothesis H for a target figure K, written by \(\mbox {$\mathrm {GE}$}(K, H)\), is defined by the Hausdorff metricdH on the space of figures, i.e.,
$$ \mbox {$\mathrm {GE}$}(K, H) :=d_{\mathrm {H}}\bigl(K, \kappa(H) \bigr) = \inf \big \{\delta| K \subseteq\kappa(H)_{\delta}\ \text{and}\ \kappa(H) \subseteq K_{\delta}\big \}, $$
where Kδ is the δ-neighborhood of K defined by
$$ K_{\delta} :=\big \{x \in \mathbb {R}^d | d_{\mathrm {E}}(x, a) \le\delta\ \text{for some}\ a \in K\big \}. $$
The metric dE is the Euclidean metric such that
$$ d_{\mathrm {E}}(x, a) = \sqrt{\sum_{i = 1}^d \bigl(x^i - a^i \bigr)^2} $$
for x=(x1,…,xd),a=(a1,…,ad)∈ℝd. The Hausdorff metric is one of the standard metrics on the space since the metric space \((\mathcal {K}^{*}, d_{\mathrm {H}})\) is complete (in the sense of topology) and \(\mbox {$\mathrm {GE}$}(K, H) = 0\) if and only if K=κ(H) (Beer 1993; Kechris 1995). The topology on \(\mathcal {K}^{*}\) induced by the Hausdorff metric is called the Vietoris topology. Since the cardinality of the set of hypotheses \(\mathcal {H}\) is smaller than that of the set of figures \(\mathcal {K}^{*}\), we often cannot find the exact hypothesis H for a figure K such that \(\mbox {$\mathrm {GE}$}(K, H) = 0\). However, following the Collage Theorem given by Falconer (2003), we show that the power of representation of hypotheses is still sufficient, that is, we can always approximate a given figure arbitrarily closely by some hypothesis.
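
For finite sets of points the Hausdorff metric reduces to the two max-min Euclidean distances in its definition, which gives a small illustrative implementation (this sketch, with names of our choosing, only illustrates the definition; for actual figures one would work with coverings):

```python
def d_E(x, a):
    """Euclidean distance between two points of R^d given as tuples."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a)) ** 0.5

def d_H(A, B):
    """Hausdorff distance between two finite point sets A and B."""
    sup_A = max(min(d_E(a, b) for b in B) for a in A)  # how far A sticks out of B
    sup_B = max(min(d_E(b, a) for a in A) for b in B)  # and vice versa
    return max(sup_A, sup_B)

print(d_H([(0.0,), (1.0,)], [(0.5,)]))  # 0.5
```

Here every point of {0, 1} lies within 0.5 of {0.5} and conversely, so the smallest admissible δ-neighborhood has δ=0.5.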

Lemma 3.3

(Representational power of hypotheses)

For any δ∈ℝ+ and for every figure \(K \in \mathcal {K}^{*}\), there exists a hypothesis H such that \(\mbox {$\mathrm {GE}$}(K, H) < \delta\).


Proof Fix a figure K and the parameter δ. Here we denote the diameter of the set ρ(w) with |w|=k by \(\operatorname {diam}(k)\). Then we have
$$ \operatorname {diam}(k) = \sqrt{d} \cdot2^{-k}. $$
For example, \(\operatorname {diam}(1) = 1/2\) and \(\operatorname {diam}(2) = 1/4\) if d=1, and \(\operatorname {diam}(1) = 1/\sqrt{2}\) and \(\operatorname {diam}(2) = 1/\sqrt{8}\) if d=2. For k with \(\operatorname {diam}(k) < \delta\), let
$$ H = \bigl\{\,w \in\bigl(\varSigma^d \bigr)^{*} \mid|w| = k \ \text{and}\ \rho(w) \cap K \neq\emptyset\, \bigr\}. $$
We can easily check that the \(\operatorname {diam}(k)\)-neighborhood of K contains κ(H) and the \(\operatorname {diam}(k)\)-neighborhood of κ(H) contains K. Therefore we have \(\mbox {$\mathrm {GE}$}(K, H) < \delta\). □
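
The proof is constructive: given an oracle deciding whether a level-k box intersects K, the hypothesis H consists of all level-k sequences whose boxes meet K. A one-dimensional sketch (the oracle and all names are ours, chosen for illustration):

```python
from itertools import product

def covering_hypothesis(intersects, k):
    """All length-k binary sequences w with rho(w) meeting the target (d = 1)."""
    H = []
    for bits in product("01", repeat=k):
        w = "".join(bits)
        low = sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(w))
        if intersects((low, low + 2 ** -k)):  # the closed interval rho(w)
            H.append(w)
    return H

# Hypothetical target K = [0.3, 0.6]; two closed intervals meet iff the
# larger of the lower ends is <= the smaller of the upper ends.
K = (0.3, 0.6)
H = covering_hypothesis(lambda box: max(box[0], K[0]) <= min(box[1], K[1]), k=2)
print(H)  # ['01', '10'], since [0.25,0.5] and [0.5,0.75] meet K
```

With k=2 and d=1 the lemma's bound is diam(2)=1/4, and indeed κ(H) and K lie within 1/4 of each other.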
Moreover, to work as a classifier, every hypothesis H has to be computable, that is, the function h : (Σ^d)^* → {0, 1} such that, for all w ∈ (Σ^d)^*,
$$ h(w) = \begin{cases} 1 & \text{if}\ \rho(w) \cap\kappa(H) \neq\emptyset,\\[3pt] 0 & \text{otherwise} \end{cases} $$
should be computable. We say that such an h is the classifier of H. The computability of h is not trivial, since for a finite sequence w, the two conditions h(w) = 1 and w ∈ H^k for some k ∈ ℕ are not equivalent. Intuitively, this is because each interval represented by a finite sequence is closed. For example, in the case of Example 3.1, h(10) = 1 because ρ(10) = [0.5, 1] × [0, 0.5] and ρ(10) ∩ κ(H) = {(0.5, 0.5)} ≠ ∅, whereas 10 ∉ H^k for any k ∈ ℕ. Here we guarantee this property of computability.
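The square ρ(10) in this example can be computed with a small Python sketch (assuming, for illustration, that a sequence over Σ² is given as a list of bit pairs; `rho2` is our name):

```python
def rho2(w):
    """Dyadic square rho(w) for w a list of pairs over {0, 1} (d = 2).

    Each pair halves the current square and selects the subsquare whose
    lower-left corner is shifted by the pair's bits."""
    x0, y0, size = 0.0, 0.0, 1.0
    for a, b in w:
        size /= 2
        x0 += a * size
        y0 += b * size
    return (x0, x0 + size), (y0, y0 + size)
```

With w = [(1, 0)] this yields the square [0.5, 1] × [0, 0.5], whose closed boundary contains the corner point (0.5, 0.5) from the example.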

Lemma 3.4

(Computability of classifiers)

For every hypothesis \(H \in \mathcal {H}\), the classifier h of H defined by (7) is computable.


Proof

First we consider whether or not the boundary of an interval is contained in κ(H). Suppose d = 1, let C be a finite set of contractions, and let F be the self-similar set of C. We have the following property: let \([x, y] = \varphi _{1} \circ \varphi _{2} \circ\dots\circ \varphi _{n} (\mathcal {I}^{1})\) for some φ₁, φ₂, …, φ_n ∈ C and let \(I = \varphi '_{1} \circ \varphi '_{2} \circ\dots\circ \varphi '_{n'} (\mathcal {I}^{1})\) for \(\varphi '_{1}, \varphi '_{2}, \dots, \varphi '_{n'} \in C\). Assume that, if n′ is large enough, there is no such I satisfying x ∈ I and min I < x (resp. max I > y). Then we have x ∈ F (resp. y ∈ F) if and only if \(0 \in \varphi (\mathcal {I}^{1})\) (resp. \(1 \in \varphi (\mathcal {I}^{1})\)) for some φ ∈ C. This means that if [x, y] = ρ(v) with a sequence v ∈ H^k (k ∈ ℕ) for a hypothesis H, where there is no sequence v′ ∈ H^{k′} with x ∈ ρ(v′) and min ρ(v′) < x (resp. max ρ(v′) > y) when k′ is large enough, then we have x ∈ κ(H) (resp. y ∈ κ(H)) if and only if u ∈ {0}+ (resp. u ∈ {1}+) for some u ∈ H.

We show pseudo-code of the classifier h in Algorithm 1 and prove that the output of the algorithm is 1 if and only if h(w) = 1, i.e., ρ(w) ∩ κ(H) ≠ ∅. In the algorithm, \(\underline{v^{s}}\) and \(\overline{v^{s}}\) denote the preceding and succeeding binary sequences of v^s in the lexicographic order with \(|v^{s}| = |\underline{v^{s}}| = |\overline{v^{s}}|\), respectively. For example, if v^s = 001, then \(\underline{v^{s}} = \mathtt {0}\mathtt {0}\mathtt {0}\) and \(\overline{v^{s}} = \mathtt {0}\mathtt {1}\mathtt {0}\). Moreover, we use the special symbol ⊥ meaning undefinedness, that is, v = w if and only if v_i = w_i for all i ∈ {0, 1, …, |v| − 1} with v_i ≠ ⊥ and w_i ≠ ⊥.
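These neighbouring sequences can be computed from the binary value of a sequence; a minimal Python sketch (function names are ours):

```python
def prev_seq(v):
    """Preceding binary sequence of the same length in lexicographic order,
    or None if v is the smallest (all zeros)."""
    i = int(v, 2)
    return None if i == 0 else format(i - 1, "0%db" % len(v))

def next_seq(v):
    """Succeeding binary sequence of the same length in lexicographic order,
    or None if v is the largest (all ones)."""
    i = int(v, 2)
    return None if i == 2 ** len(v) - 1 else format(i + 1, "0%db" % len(v))
```

For instance, prev_seq("001") gives "000" and next_seq("001") gives "010", matching the example above.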
Algorithm 1

Classifier h of hypothesis H

The “if” part: For an input finite sequence w and a hypothesis H, if h(w) = 1, there are two possibilities:
  1. For some k ∈ ℕ, there exists v ∈ H^k such that w ⊑ v. This is because ρ(w) ⊇ ρ(v) and ρ(v) ∩ κ(H) ≠ ∅.

  2. The above condition does not hold, but ρ(w) ∩ κ(H) ≠ ∅.

In the first case, the algorithm goes to line 7 and stops, outputting 1. In the second case, the algorithm uses the function CheckBoundary. Since h(w) = 1, there should exist a sequence u ∈ H such that u = aa…a, where a is obtained in lines 1–10. CheckBoundary therefore returns 1.

The “only if” part: In Algorithm 1, if v ∈ H^k satisfies the conditions in line 6 or line 8, then ρ(w) ∩ κ(H) ≠ ∅. Thus h(w) = 1 holds. □

The set {κ(H) ∣ H ⊂ (Σ^d)^* and the classifier h of H is computable} exactly corresponds to an indexed family of recursive concepts/languages discussed in computational learning theory (Angluin 1980), which is a common assumption for learning of languages. On the other hand, there exists a class of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) that is not an indexed family of recursive concepts. This means that, for some figure K, there is no computable classifier that classifies all data correctly. Therefore we address the problems of both exact and approximate learning of figures to obtain a computable classifier for any target figure.

To analyze learning based on Gold’s learning model, we consider two types of input data streams: one containing both positive and negative data, and the other containing only positive data. Formally, each training datum is called an example and is defined as a pair (w, l) of a finite sequence w ∈ (Σ^d)^* and a label l ∈ {0, 1}. For a target figure K,
$$ l = \begin{cases} \,1 & \text{if}\ \rho(w) \cap K \not= \emptyset\ \ \text{(\emph{positive example})},\\ \,0 & \text{otherwise}\ \ \text{(\emph{negative example})}. \end{cases} $$
In the following, for a target figure K, we denote the set of finite sequences of positive examples {w ∈ (Σ^d)^* ∣ ρ(w) ∩ K ≠ ∅} by Pos(K), and that of negative examples by Neg(K). Moreover, we denote Pos_k(K) = {w ∈ Pos(K) ∣ |w| = k}. From the geometric nature of figures, we obtain the following monotonicity of examples:
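For d = 1, generating an example can be sketched in Python as follows (assuming, for illustration, that the target figure K is a finite union of closed intervals in [0, 1]; `make_example` is our name):

```python
def make_example(w, figure):
    """Return the example (w, l): l = 1 iff the cell rho(w) meets the figure (d = 1).

    `figure` is a list of closed intervals (a, b) whose union plays the
    role of the target figure K."""
    k = len(w)
    lo = int(w, 2) / 2 ** k
    hi = lo + 2 ** -k
    # two closed intervals [lo, hi] and [a, b] intersect iff a <= hi and lo <= b
    l = 1 if any(a <= hi and lo <= b for (a, b) in figure) else 0
    return (w, l)
```

For K = [0.3, 0.4], the cell of "01" (i.e. [0.25, 0.5]) yields a positive example while the cell of "10" yields a negative one, and every prefix of "01" is positive as well, illustrating the monotonicity stated next.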

Lemma 3.5

(Monotonicity of examples)

If (v, 1) is an example of K, then (w, 1) is an example of K for every prefix w ⊑ v, and (va, 1) is an example of K for some a ∈ Σ^d. If (w, 0) is an example of K, then (wv, 0) is an example of K for all v ∈ (Σ^d)^*.


Proof

From the definition of the representation ρ in (1) and (3), if w ⊑ v, we have ρ(w) ⊇ ρ(v), hence (w, 1) is an example of K. Moreover,
$$ \bigcup_{a \in\varSigma^{d}} \rho(va) = \rho(v) $$
holds, thus there should exist an example (va, 1) for some a ∈ Σ^d. Furthermore, for all v ∈ (Σ^d)^*, ρ(wv) ⊆ ρ(w). Therefore if K ∩ ρ(w) = ∅, then K ∩ ρ(wv) = ∅ for all v ∈ (Σ^d)^*, and (wv, 0) is an example of K. □
We say that an infinite sequence σ of examples of a figure K is a presentation of K. The ith example is denoted by σ(i−1), and the set of all examples occurring in σ is denoted by \(\operatorname {range}(\sigma )\).3 The initial segment of σ of length n, i.e., the sequence σ(0),σ(1),…,σ(n−1), is denoted by σ[n−1]. A text of a figure K is a presentation σ such that
$$ \big \{w | (w, 1) \in \operatorname {range}(\sigma )\big \} = \mathrm {Pos}(K)\ \bigl( = \big \{w | \rho(w) \cap K \neq\emptyset\big \} \bigr), $$
and an informant is a presentation σ such that
$$ \big \{w | (w, 1) \in \operatorname {range}(\sigma )\big \} = \mathrm {Pos}(K)\quad \text{and}\quad \big \{w | (w, 0) \in \operatorname {range}(\sigma )\big \} = \mathrm {Neg}(K). $$
Table 2 shows the relationship between the standard terminology in classification and our definitions. For a target figure K and the classifier h of a hypothesis H, the set {w ∈ Pos(K) ∣ h(w) = 1} corresponds to true positives, {w ∈ Neg(K) ∣ h(w) = 1} to false positives (type I error), {w ∈ Pos(K) ∣ h(w) = 0} to false negatives (type II error), and {w ∈ Neg(K) ∣ h(w) = 0} to true negatives.
Table 2

Relationship between the conditions for each finite sequence w ∈ (Σ^d)^* and the standard terminology of binary classification

                             Target figure K
                             w ∈ Pos(K)                          w ∈ Neg(K)

Hypothesis H    h(w) = 1     True positive                       False positive (Type I error)
                h(w) = 0     False negative (Type II error)      True negative
Let h be the classifier of a hypothesis H. We say that the hypothesis H is consistent with an example (w,l) if l=1 implies h(w)=1 and l=0 implies h(w)=0, and consistent with a set of examples E if H is consistent with all examples in E.

A learning machine, called a learner, is a procedure (i.e., a Turing machine that never halts) that reads a presentation of a target figure from time to time, and outputs hypotheses from time to time. In the following, we denote a learner by M and the infinite sequence of hypotheses produced by M on the input σ by Mσ, where Mσ(i−1) denotes the ith hypothesis produced by M. Assume that M has received j examples σ(0), σ(1), …, σ(j−1) when it outputs the ith hypothesis Mσ(i−1). We do not require the condition i = j; the inequality i ≤ j usually holds, since M can “wait” until it receives enough examples. We say that an infinite sequence of hypotheses Mσ converges to a hypothesis H if there exists n ∈ ℕ such that Mσ(i) = H for all i ≥ n.

4 Exact learning of figures

We analyze “exact” learning of figures. This means that, for any target figure K, there should be a hypothesis H such that the generalization error is zero (i.e., K=κ(H)), hence the classifier h of H can classify all data correctly with no error, that is, h satisfies (7). The goal is to find such a hypothesis H from examples (training data) of K.

In the following two sections (Sects. 4 and 5), we follow the standard path of studies in computational learning theory (Jain et al. 1999; Jain 2011; Zeugmann and Zilles 2008), that is, we define learning criteria to understand various learning situations and construct a learnability hierarchy under the criteria. We summarize our results in Fig. 3.
Fig. 3

Learnability hierarchy. For each line, the lower set is a proper subset of the upper set

4.1 Explanatory learning

The most basic learning criterion in Gold’s model is EX-learning (EX means EXplain), i.e., learning in the limit, proposed by Gold (1967). We introduce this criterion into the learning of figures as FIGEX-INF-learning (INF means an informant) and FIGEX-TXT-learning (TXT means a text), for EX-learning from informants and texts, respectively, and analyze the learnability of figures.

Definition 4.1

(Explanatory learning)

A learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if, for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) σ of K, the outputs Mσ converge to a hypothesis H such that \(\mbox {$\mathrm {GE}$}(K, H) = 0\).

For every learning criterion CR introduced in the following, we say that a set of figures \(\mathcal {F}\) is CR-learnable if there exists a learner that CR-learns \(\mathcal {F}\), and denote by CR the collection of CR-learnable sets of figures following the standard notation of this field (Jain et al. 1999).

First, we consider FIGEX-INF-learning. Informally, a learner can FIGEX-INF-learn a set of figures if it has the ability to enumerate all hypotheses and to judge whether or not each hypothesis is consistent with the received examples (Gold 1967). Here we introduce a convenient enumeration of hypotheses. An infinite sequence of hypotheses H0,H1,… is called a normal enumeration if \(\left \{H_{i} | i \in \mathbb {N}\right \} = \mathcal {H}\) and, for all i,j∈ℕ, i<j implies
$$ \max_{v \in H_i}|v| \le\max_{w \in H_j}|w|. $$
We can easily implement a procedure that enumerates \(\mathcal {H}\) through a normal enumeration.
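For d = 1, such an enumeration can be sketched as follows (a toy Python generator of our own construction, truncated at a maximal sequence length):

```python
from itertools import combinations

def normal_enumeration(max_level):
    """Yield finite hypotheses H (nonempty sets of binary strings, d = 1)
    so that the maximal sequence length in H never decreases along the
    enumeration (truncated at max_level)."""
    pool = []
    for k in range(1, max_level + 1):
        pool += [format(i, "0%db" % k) for i in range(2 ** k)]
        for r in range(1, len(pool) + 1):
            for H in combinations(pool, r):
                if max(len(w) for w in H) == k:  # maximal length exactly k
                    yield frozenset(H)
```

Grouping hypotheses by their maximal sequence length is exactly what makes the enumeration "normal": the bound max |w| over w ∈ H is non-decreasing along the sequence.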

Theorem 4.2

The set of figures \(\kappa (\mathcal {H}) = \left \{\kappa(H) | H \in \mathcal {H}\right \}\) is FIGEX-INF-learnable.


Proof

This learning can be done by the well-known strategy of identification by enumeration. We show pseudo-code of a learner M that FIGEX-INF-learns \(\kappa (\mathcal {H})\) in Procedure 1. The learner M generates hypotheses through a normal enumeration. If M outputs a wrong hypothesis H, there must exist a positive or negative example that is not consistent with the hypothesis since, for a target figure K,
$$ \mathrm {Pos}(K) \ominus \mathrm {Pos}\bigl(\kappa(H) \bigr) \neq\emptyset $$
for every hypothesis H with κ(H) ≠ K, where X ⊖ Y denotes the symmetric difference, i.e., X ⊖ Y = (X ∪ Y) ∖ (X ∩ Y). Thus the learner M changes the wrong hypothesis and reaches a correct hypothesis H with κ(H) = K in finite time. Once M produces a correct hypothesis, it never changes it, since every example is consistent with it. Therefore the learner M FIGEX-INF-learns \(\kappa (\mathcal {H})\). □
Procedure 1

Learning procedure that FIGEX-INF-learns \(\kappa (\mathcal {H})\)
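The strategy can be sketched abstractly in Python (a toy version of identification by enumeration, not the paper's Procedure 1 itself; `classify(H, w)` stands in for the classifier h of H, and `hypotheses` for a finite prefix of a normal enumeration assumed to contain a correct hypothesis):

```python
def learn_by_enumeration(stream, hypotheses, classify):
    """Identification by enumeration: after each received example, output the
    first hypothesis in the enumeration consistent with all examples so far."""
    examples = []
    idx = 0
    for w, l in stream:
        examples.append((w, l))
        # mind change: advance past every hypothesis inconsistent with the data
        # (a once-inconsistent hypothesis can never become consistent again)
        while any(classify(hypotheses[idx], v) != m for v, m in examples):
            idx += 1
        yield hypotheses[idx]
```

Once the index reaches a correct hypothesis, no further example can contradict it, so the output sequence converges, mirroring the argument in the proof above.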


Next, we consider FIGEX-TXT-learning. In learning of languages from texts, the necessary and sufficient conditions for learning have been studied in detail by Angluin (1980, 1982), Kobayashi (1996), Lange et al. (2008), Motoki et al. (1991), Wright (1989), and characterization of learnability using finite tell-tale sets is one of the crucial results. We adapt these results into the learning of figures and show the FIGEX-TXT-learnability.

Definition 4.3

(Finite tell-tale set, cf. Angluin 1980)

Let \(\mathcal {F}\) be a set of figures. For a figure \(K \in \mathcal {F}\), a finite subset \(\mathcal {T}\) of the set of positive examples Pos(K) is a finite tell-tale set of K with respect to \(\mathcal {F}\) if, for all figures \(L \in \mathcal {F}\), \(\mathcal {T}\subset \mathrm {Pos}(L)\) implies \(\mathrm {Pos}(L) \not \subset \mathrm {Pos}(K)\) (i.e., \(L \not\subset K\)). If every figure \(K \in \mathcal {F}\) has a finite tell-tale set with respect to \(\mathcal {F}\), we say that \(\mathcal {F}\) has finite tell-tale sets.
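With truncated, finite Pos(·) sets, the defining condition can be checked directly; a toy Python sketch (the name `is_tell_tale` and the `pos` map are ours):

```python
def is_tell_tale(T, K, family, pos):
    """True iff the finite set T ⊆ pos(K) is a tell-tale of K w.r.t. `family`:
    no member L with T ⊆ pos(L) may have pos(L) a proper subset of pos(K).

    `pos` maps each figure to its (truncated, finite) set of positive examples."""
    assert T <= pos(K), "a tell-tale set must consist of positive examples of K"
    return all(not (T <= pos(L) and pos(L) < pos(K)) for L in family)
```

In the toy test below, figures are identified with their Pos-sets directly; a set T that is also contained in the positive examples of a strictly smaller figure fails the condition.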

Theorem 4.4

Let \(\mathcal {F}\) be a subset of \(\kappa (\mathcal {H})\). Then \(\mathcal {F}\) is FIGEX-TXT-learnable if and only if there is a procedure that, for every figure \(K \in \mathcal {F}\), enumerates a finite tell-tale set W of K with respect to \(\mathcal {F}\).

This theorem can be proved in exactly the same way as that for learning of languages given by Angluin (1980). Note that such a procedure does not need to halt. Using this theorem, we show that the set \(\kappa (\mathcal {H})\) is not FIGEX-TXT-learnable.

Theorem 4.5

The set\(\kappa (\mathcal {H})\)does not have finite tell-tale sets.


Proof

Fix a figure \(K = \kappa(H) \in \kappa (\mathcal {H})\) such that there exists a pair v, w ∈ H with \(\rho(vvv\dots) \not= \rho(www\dots)\), and fix a finite set \(T = \left \{w_{1}, w_{2}, \dots, w_{n}\right \}\) contained in Pos(K). Suppose that #Pos_m(K) > n holds for a natural number m. For each finite sequence w_i, there exists u_i ∈ Pos(K) such that |u_i| > m, w_i ⊑ u_i, and u_i ∈ H^k for some k. For the figure L = κ(U) with U = {u_1, u_2, …, u_n}, both T ⊂ Pos(L) and Pos(L) ⊂ Pos(K) hold. Therefore K has no finite tell-tale set with respect to \(\kappa (\mathcal {H})\). □

Corollary 4.6

The set of figures\(\kappa (\mathcal {H})\)is notFIGEX-TXT-learnable.

In realistic scenarios of machine learning, however, the set \(\kappa (\mathcal {H})\) is too large to search for the best hypothesis, since we usually want to obtain a “compact” representation of a target figure. Thus we (implicitly) assume an upper bound on the number of elements in a hypothesis. Here we give a positive result for this situation: if we fix the number of elements #H of each hypothesis H a priori, the resulting set of figures becomes FIGEX-TXT-learnable. Intuitively, this is because if we take k large enough, the set {w ∈ Pos(K) ∣ |w| ≤ k} becomes a finite tell-tale set of K. Here we denote by Red(H) the hypothesis obtained from H by removing, for every pair v, w ∈ H with |v| ≤ |w|, the sequence w whenever ρ(vvv…) = ρ(www…). For a finite subset of natural numbers N ⊂ ℕ, we define the set of hypotheses \(\mathcal {H}_{N} := \{H \in \mathcal {H}\mid\#\mathrm {Red}(H) \in N\}\).

Theorem 4.7

There exists a procedure that, for all finite subsets N ⊂ ℕ and all figures \(K \in \kappa (\mathcal {H}_{N})\), enumerates a finite tell-tale set of K with respect to \(\kappa (\mathcal {H}_{N})\).


Proof

First, assume that N = {1}. It is trivial that there exists a procedure that, for an arbitrary figure \(K \in \kappa (\mathcal {H}_{N})\), enumerates a finite tell-tale set of K with respect to \(\kappa (\mathcal {H}_{N})\), since we always have \(L \not\subset K\) for all pairs of figures \(K, L \in \kappa (\mathcal {H}_{N})\).

Next, fix N⊂ℕ with N≠{1}. Let us consider the procedure that enumerates elements of the sets
$$ \mathrm {Pos}_1(K), \mathrm {Pos}_2(K), \mathrm {Pos}_3(K), \dots. $$
We show that this procedure enumerates a finite tell-tale set of K with respect to \(\kappa (\mathcal {H}_{N})\). It is enough to show that there exists a natural number m such that there is no hypothesis H with κ(H) ⊂ K, #H ≤ max N, and Pos(κ(H)) ⊃ Pos_m(K).

We construct a tree as follows (a similar technique, called a d-explorer, was used by Jain and Sharma (1997)). Each node has a pair (H, w) as its label, where κ(H) ⊂ K and w ∈ Pos(K) ∖ Pos(κ(H)). The root node is labeled (∅, v) with a finite sequence v ∈ Pos(K). The tree is constructed iteratively by adding children to each node whose depth (the length of the path to the root) is at most max N − 1. Let the label of such a node be (H, w). For every finite sequence w′ with |w′| ≤ |w|, if there exists a finite sequence w″ satisfying |w″| > |w| and w″ ∈ Pos(K) ∖ Pos(κ(H ∪ {w′})), add a child labeled (H ∪ {w′}, w″) to the node.

The above tree is bounded in depth by max N, and the number of children of any node is always finite; hence the number of nodes of the tree is finite. Let m be the length of the longest w such that (H, w) is the label of a node of the tree. Then we can easily check that there is no hypothesis H′ such that κ(H′) ⊂ K, #H′ ≤ max N, and Pos(κ(H′)) ⊃ Pos_m(K). □

Corollary 4.8

For all finite subsets of natural numbers N ⊂ ℕ, the set of figures \(\kappa(\mathcal {H}_{N})\) is FIGEX-TXT-learnable.

4.2 Consistent learning

In a learning process, it is natural to require that every hypothesis generated by a learner be consistent with the examples received so far. Here we introduce FIGCONS-INF- and FIGCONS-TXT-learning (CONS means CONSistent). These criteria correspond to CONS-learning, first introduced by Blum and Blum (1975).4 This model was also used (implicitly) in the Model Inference System (MIS) proposed by Shapiro (1981, 1983), and has been studied in computational learning of formal languages and recursive functions (Jain et al. 1999).

Definition 4.9

(Consistent learning)

A learner M FIGCONS-INF-learns (resp. FIGCONS-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\) and, for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) σ of K, each hypothesis Mσ(i) is consistent with the set E_i of examples received by M until just before it generates the hypothesis Mσ(i).

Assume that a learner M achieves FIGEX-INF-learning of \(\kappa (\mathcal {H})\) using Procedure 1. We can easily check that M always generates a hypothesis that is consistent with the received examples.

Corollary 4.10

The set of figures \(\kappa (\mathcal {H})\) is FIGCONS-INF-learnable.


Suppose that \(\mathcal {F}\subset \kappa (\mathcal {H})\) is FIGEX-TXT-learnable. We can construct a learner M in the same way as in the case of EX-learning of languages from texts (Angluin 1980), where M always outputs a hypothesis that is consistent with received examples.

Corollary 4.11

FIGCONS-TXT = FIGEX-TXT.


4.3 Reliable and refutable learning

In this subsection, we consider target figures that might not be representable exactly by any hypothesis: there are infinitely many such figures, and if we have no background knowledge, there is no guarantee that an exact hypothesis exists. In practice, therefore, this setting is more realistic than the explanatory or consistent learning considered in the previous two subsections.

To treat this case, we use two concepts, reliability and refutability, whose aim is to handle targets that cannot be exactly represented by any hypothesis. Reliable learning was introduced by Blum and Blum (1975) and Minicozzi (1976), and refutable learning by Mukouchi and Arikawa (1995) and Sakurai (1991), in computational learning of languages and recursive functions, and developed by Jain et al. (2001), Merkle and Stephan (2003), and Mukouchi and Sato (2003). Here we introduce these concepts into the learning of figures and analyze learnability.

First, we treat reliable learning of figures. Intuitively, reliability requires that an infinite sequence of hypotheses only converges to a correct hypothesis.

Definition 4.12

(Reliable learning)

A learner M FIGRELEX-INF-learns (resp. FIGRELEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions:
  1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).

  2. For any target figure \(K \in \mathcal {K}^{*}\) and any of its informants (resp. texts) σ, the infinite sequence of hypotheses Mσ does not converge to a wrong hypothesis H such that \(\mbox {$\mathrm {GE}$}(K, H) \not= 0\).


We analyze reliable learning of figures from informants. Intuitively, if, for any target figure \(K \in \mathcal {F}\), a learner can judge in finite time whether or not the current hypothesis H is correct, i.e., whether κ(H) = K, then the set \(\mathcal {F}\) is reliably learnable.

Theorem 4.13

FIGRELEX-INF = FIGEX-INF.



Proof

The statement FIGRELEX-INF ⊆ FIGEX-INF is trivial, thus we prove FIGEX-INF ⊆ FIGRELEX-INF. Fix a set of figures \(\mathcal {F}\subseteq \kappa (\mathcal {H})\) with \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), and suppose that a learner M FIGEX-INF-learns \(\mathcal {F}\) using Procedure 1. The goal is to show that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). Assume that a target figure K belongs to \(\mathcal {K}^{*} \setminus \mathcal {F}\). Then we have the following property: for every figure \(L \in \mathcal {F}\), there must exist a finite sequence w ∈ (Σ^d)^* such that
$$ w \in \mathrm {Pos}(K) \ominus \mathrm {Pos}(L), $$
hence, for M's current hypothesis H, M changes H whenever it receives a positive or negative example (w, l) such that w ∈ Pos(K) ⊖ Pos(κ(H)). This means that the infinite sequence of hypotheses does not converge to any hypothesis. Thus we have \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). □

In contrast, we have an interesting result on reliable learning from texts. We show in the following that FIGEX-TXT ⊋ FIGRELEX-TXT holds, and that a set of figures \(\mathcal {F}\) is reliably learnable from positive data only if every figure \(K \in \mathcal {F}\) is a singleton. Remember that \(\mathcal {H}_{N}\) denotes the set of hypotheses \(\{H \in \mathcal {H}\mid\#\mathrm {Red}(H) \in N\}\) for a subset N ⊂ ℕ and, for simplicity, we denote \(\mathcal {H}_{\{n\}}\) by \(\mathcal {H}_{n}\) for a natural number n ∈ ℕ.

Theorem 4.14

The set of figures \(\kappa(\mathcal {H}_{N})\) is FIGRELEX-TXT-learnable if and only if N = {1}.


Proof

First we show that the set of figures \(\kappa (\mathcal {H}_{1})\) is FIGRELEX-TXT-learnable. From the properties of self-similar sets, we have the following: a figure \(K \in \kappa (\mathcal {H})\) is a singleton if and only if \(K \in \kappa (\mathcal {H}_{1})\). Let \(K \in \mathcal {K}^{*} \setminus\kappa(\mathcal {H}_{1})\), and assume that a learner M FIGEX-TXT-learns \(\kappa(\mathcal {H}_{1})\). Without loss of generality, we can suppose that M changes the current hypothesis H whenever it receives a positive example (w, 1) such that w ∉ Pos(κ(H)). For any hypothesis \(H \in \mathcal {H}_{1}\), there exists w ∈ (Σ^d)^* such that
$$ w \in \mathrm {Pos}(K) \setminus \mathrm {Pos}\bigl(\kappa(H) \bigr). $$
Thus if the learner M receives such a positive example (w, 1), it changes the hypothesis H. This means that the infinite sequence of hypotheses does not converge to any hypothesis. Therefore \(\kappa (\mathcal {H}_{1})\) is FIGRELEX-TXT-learnable.

Next, we prove that \(\kappa (\mathcal {H}_{n})\) is not FIGRELEX-TXT-learnable for any n > 1. Fix n ∈ ℕ with n > 1. We can easily check that, for a figure \(K \in \kappa (\mathcal {H}_{n})\) and any of its finite tell-tale sets \(\mathcal {T}\) with respect to \(\kappa(\mathcal {H}_{n})\), there exists a figure \(L \in \mathcal {K}^{*} \setminus \kappa (\mathcal {H}_{n})\) such that L ⊂ K and \(\mathcal {T}\subset \mathrm {Pos}(L)\). This means that
$$ \mathrm {Pos}(L) \subseteq \mathrm {Pos}(K)\quad \text{and}\quad \mathcal {T}\subseteq \mathrm {Pos}(L) $$
hold. Thus if a learner M FIGEX-TXT-learns \(\kappa (\mathcal {H}_{n})\), then Mσ for some presentation σ of such an L must converge to some hypothesis in \(\mathcal {H}_{n}\), which is a wrong hypothesis for L. Consequently, we have \(\kappa (\mathcal {H}_{n}) \notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). □

Corollary 4.15

FIGRELEX-TXT ⊊ FIGEX-TXT.


Sakurai (1991) proved that a set of concepts \(\mathcal{C}\) is reliably EX-learnable from texts if and only if \(\mathcal{C}\) contains no infinite concept (p. 182, Theorem 3.1).5 However, we have shown that the set \(\kappa (\mathcal {H}_{1})\) is FIGRELEX-TXT-learnable, though all figures \(K \in \kappa (\mathcal {H}_{1})\) correspond to infinite concepts since Pos(K) is infinite for all \(K \in \kappa (\mathcal {H}_{1})\). The monotonicity of the set Pos(K) (Lemma 3.5), which is a constraint naturally derived from the geometric property of examples, causes this difference.

Next, we extend FIGEX-INF- and FIGEX-TXT-learning by turning our attention to refutability. In refutable learning, a learner tries to learn figures in the limit, but if the target figure is not in the considered space, it recognizes that it cannot find a correct hypothesis in finite time, outputs the refutation symbol △, and stops.

Definition 4.16

(Refutable learning)

A learner M FIGREFEX-INF-learns (resp. FIGREFEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions, where △ is the refutation symbol:
  1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).

  2. If \(K \in \mathcal {F}\), then for all informants (resp. texts) σ of K, Mσ(i) ≠ △ for all i ∈ ℕ.

  3. If \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\), then for all informants (resp. texts) σ of K, there exists m ∈ ℕ such that Mσ(i) ≠ △ for all i < m, and Mσ(i) = △ for all i ≥ m.


Conditions 2 and 3 in the above definition mean that a learner M refutes the set \(\mathcal {F}\) in finite time if and only if a target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). We compare FIGREFEX-INF-learnability with other learning criteria.

Theorem 4.17

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\)and\(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\).


Proof

First we consider \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). We give an example of a set of figures \(\mathcal {F}\) with \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) and \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) in the case d = 2. Let K₀ = κ({⟨0, 0⟩, ⟨1, 1⟩}), K_i = κ({⟨w, w⟩ ∣ w ∈ Σ^i ∖ {1^i}}) for every i ≥ 1, and \(\mathcal {F}= \{K_{i} \mid i \in \mathbb {N}\}\). Note that K₀ is the line y = x and K_i ⊂ K₀ for all i ≥ 1.

We prove that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). It is trivial that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), so assume that a target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). If K ⊄ K₀, it is trivial that, for any informant σ of K, the set of examples \(\operatorname {range}(\sigma [n])\) for some n ∈ ℕ is not consistent with any \(K_{i} \in \mathcal {F}\) (consider a positive example for a point x ∈ K ∖ K₀). Otherwise, if K ⊂ K₀, there should exist a negative example (⟨v, v⟩, 0) with ⟨v, v⟩ ∈ Neg(K). Then we have \(K \not= K_{i}\) for all i > |v|. Thus a learner can refute the candidates {K₁, K₂, …, K_{|v|}} in finite time. Therefore \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) holds.

Next we show that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). Let K₀ be the target figure. For any finite set of positive examples \(\mathcal {T}\subset \mathrm {Pos}(K_{0})\), there exists a figure \(K_{i} \in \mathcal {F}\) such that K_i ⊂ K₀ and \(\mathcal {T}\) is consistent with K_i. Therefore K₀ has no finite tell-tale set with respect to \(\mathcal {F}\), and hence \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) by Theorem 4.4.

Second, we check \(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). Assume that \(\mathcal {F}= \kappa (\mathcal {H}_{\{1\}})\) and that a target figure K is a singleton {x} with \(K \notin \mathcal {F}\). It is clear that, for any informant σ of K and any n ∈ ℕ, \(\operatorname {range}(\sigma [n])\) is consistent with some figure \(L \in \mathcal {F}\), so no learner can refute \(\mathcal {F}\) in finite time. Thus \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) whereas \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). □

Corollary 4.18

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\)and\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\).

Note that it is trivial that \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) since we have \(\kappa (\mathcal {H}_{\{1\}}) \notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) in the above proof and \(\kappa (\mathcal {H}_{\{1\}}) \in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) from Theorem 4.14. Moreover, the condition \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) holds since \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) and FIGRELEX-TXTFIGEX-TXT. These results mean that both FIGREFEX-INF- and FIGRELEX-TXT-learning are difficult, but they are incomparable in terms of learnability. Furthermore, we have the following hierarchy.

Theorem 4.19

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not= \emptyset\) and FIGREFEX-TXT ⊊ FIGREFEX-INF.


Proof

Let a set of figures \(\mathcal {F}\) be a singleton {K} such that K = κ({w}) for some w ∈ (Σ^d)^*. Then there exists a learner M that FIGREFEX-TXT-learns \(\mathcal {F}\), i.e., \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\), since all M has to do is check whether or not, for a given positive example (v, 1), v ∈ Pos(K) = {x ∣ x ⊑ www…}.

Next, let \(\mathcal {F}= \{K\}\) with K = κ(H) and #Red(H) ≥ 2. We can easily check that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\), because if a target figure L is a proper subset of K, no learner can refute \(\mathcal {F}\) in finite time from positive data alone. Conversely, \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), since for every L with L ≠ K there exists an example with which the hypothesis H is not consistent. □

Corollary 4.20


5 Effective learning of figures

In learning under the criteria proposed so far, i.e., explanatory, consistent, reliable, and refutable learning, each hypothesis is judged only as exactly “correct” or not: for a target figure K and a hypothesis H, H is correct if \(\mbox {$\mathrm {GE}$}(K, H) = 0\) and incorrect if \(\mbox {$\mathrm {GE}$}(K, H) \neq0\). Thus we can know neither the rate of convergence to the target figure nor how far the current hypothesis is from the target. It is therefore more useful to take various generalization errors into account in the learning process and to consider approximate hypotheses.

We define novel learning criteria, FIGEFEX-INF- and FIGEFEX-TXT-learning (EF means EFfective), to introduce into learning the concept of effectivity, which has been analyzed in the computation of real numbers in the area of computable analysis (Weihrauch 2000). Intuitively, these criteria guarantee that for any target figure, the generalization error becomes monotonically smaller and converges to zero. Thus we can know when the learner has learned the target figure “well enough”. Furthermore, if a target figure is learnable in the limit, then the generalization error goes to zero in finite time.

Definition 5.1

(Effective learning)

A learner M FIGEFEX-INF-learns (resp. FIGEFEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions:
  1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).
  2. For an arbitrary target figure \(K \in \mathcal {K}^{*}\) and all informants (resp. texts) σ of K, for all i∈ℕ,
     $$ \mbox {$\mathrm {GE}$}\bigl(K, \text {\textbf {M}}_{\sigma }(i) \bigr) \le2^{-i}. $$

This definition is inspired by the Cauchy representation of real numbers (Weihrauch 2000, Definition 4.1.5).

Effective learning is related to monotonic learning (Lange and Zeugmann 1993, 1994; Kinber 1994; Zeugmann et al. 1995), originally introduced by Jantke (1991) and Wiehagen (1991), since both learning models consider monotonic convergence of hypotheses. In contrast to their approach, in which various notions of monotonicity over languages were considered, we measure the generalization error of a hypothesis geometrically by the Hausdorff metric. On the other hand, effective learning differs from BC-learning, developed in the learning of languages and recursive functions (Jain et al. 1999), since BC-learning only guarantees that the generalization error goes to zero in finite time; that is, BC-learning is not effective.

First we show that we can bound the generalization error of the hypothesis H using the diameter \(\operatorname {diam}(k)\) of the set ρ(w) with |w|=k. Recall that we have
$$ \operatorname {diam}(k) = \sqrt{d} \cdot2^{-k} $$
(see Proof of Lemma 3.3). In the following, we denote by Ek the set of examples {(w,l)∣|w|=k} in σ and call each example in it a level-k example.

Lemma 5.2

Let σ be an informant of a figure K, and let H be a hypothesis that is consistent with the set of examples Ek={(w,l)∣|w|=k}. Then we have the inequality
$$ \mbox {$\mathrm {GE}$}(K, H) \le \operatorname {diam}(k). $$


Since H is consistent with Ek,
$$ \kappa (H) \cap\rho(w)\ \begin{cases} \neq\emptyset& \text{if}\ (w, 1) \in E^k,\\ = \emptyset& \text{if}\ (w, 0) \in E^k. \end{cases} $$
Thus for \(\delta= \operatorname {diam}(k)\), the δ-neighborhood of κ(H) contains K and the δ-neighborhood of K contains κ(H). It therefore follows that \(\mbox {$\mathrm {GE}$}(K, H) = d_{\mathrm {H}}(K, \kappa (H)) \le \operatorname {diam}(k)\). □
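The bound in Lemma 5.2 can be checked numerically. The following Python sketch is our own illustration, not part of the paper: the sampled point sets, sampling density, and helper names are assumptions. It computes the Hausdorff distance between two finite point sets by brute force and verifies GE(K, H) ≤ diam(k) for a sampled diagonal segment and the centres of the level-k cells it meets.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets in R^d (brute force)."""
    forward = max(min(math.dist(a, b) for b in B) for a in A)
    backward = max(min(math.dist(b, a) for a in A) for b in B)
    return max(forward, backward)

# Target figure: a dense sample of the diagonal segment K = {(t, t) : t in [0, 1]}.
K = [(t / 1000, t / 1000) for t in range(1001)]

# Stand-in "hypothesis" at level k: centres of the 2^k x 2^k grid cells that K
# intersects, mimicking a hypothesis consistent with every level-k example.
k = 4
cells = {(math.floor(x * 2**k), math.floor(y * 2**k)) for x, y in K if x < 1}
H = [((i + 0.5) / 2**k, (j + 0.5) / 2**k) for i, j in cells]

# Lemma 5.2: GE(K, H) = d_H(K, kappa(H)) <= diam(k) = sqrt(d) * 2^{-k} with d = 2.
diam_k = math.sqrt(2) * 2**-k
assert hausdorff(K, H) <= diam_k
```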

Theorem 5.3

The set of figures \(\kappa (\mathcal {H})\) is FIGEFEX-INF-learnable.


We give in Procedure 2 a learner M that FIGEFEX-INF-learns \(\kappa (\mathcal {H})\). We use the function
$$ g(k) = \lceil k + \log_2 \sqrt{d} \rceil. $$
Then for all k∈ℕ, we have
$$ \operatorname {diam}\bigl(g(k) \bigr) = \sqrt{d} \cdot2^{-g(k)} \le2^{-k}. $$
The learner M stores received examples, and once it has received all examples at level g(k), it outputs a hypothesis. The kth hypothesis Mσ(k) is consistent with the set of examples Eg(k). Thus we have
$$ \mbox {$\mathrm {GE}$}\bigl(K, \text {\textbf {M}}_{\sigma }(k) \bigr) \le \operatorname {diam}\bigl(g(k) \bigr) \le 2^{-k} $$
for all k∈ℕ from Lemma 5.2.
Procedure 2

Learning procedure that FIGEFEX-INF-learns \(\kappa (\mathcal {H})\)

Assume that \(K \in \kappa (\mathcal {H})\). If M outputs a wrong hypothesis, there must be a positive or negative example that is inconsistent with it, and on receiving such an example M changes the hypothesis. Once M produces a correct hypothesis, it never changes it, since every example is consistent with a correct hypothesis. Thus there exists n∈ℕ with \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) = 0\) for all i≥n. Therefore M FIGEFEX-INF-learns \(\kappa (\mathcal {H})\). □
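The only numerical ingredient of Procedure 2 is the level function g. A minimal Python sketch (our own illustration; the function names `g` and `diam` are assumptions) checks that g(k) = ⌈k + log2 √d⌉ indeed forces diam(g(k)) ≤ 2^{−k} in every dimension d.

```python
import math

def g(k, d=2):
    """Level at which the mesh diameter sqrt(d) * 2^{-g(k)} drops below 2^{-k}."""
    return math.ceil(k + math.log2(math.sqrt(d)))

def diam(level, d=2):
    """Diameter of a level cell rho(w) with |w| = level in dimension d."""
    return math.sqrt(d) * 2**-level

# The learner outputs its k-th hypothesis only after seeing all level-g(k)
# examples; by Lemma 5.2 its generalization error is then at most diam(g(k)).
for d in (1, 2, 3, 10):
    for k in range(20):
        assert diam(g(k, d), d) <= 2**-k
```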

Corollary 5.4

For every figure \(K \in \mathcal {K}^{*}\) and every informant σ of K, the learner M of Procedure 2 satisfies \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(k)) \le 2^{-k}\) for all k∈ℕ.


Thus the learner with Procedure 2 can treat the set of all figures \(\mathcal {K}^{*}\) as learning targets, since for any figure \(K \in \mathcal {K}^{*}\), it can approximate the figure arbitrarily closely using only the figures represented by hypotheses in the hypothesis space \(\mathcal {H}\).

In contrast to FIGEX-TXT-learning, there is no set of figures that is FIGEFEX-TXT-learnable.

Theorem 5.5

FIGEFEX-TXT = ∅; that is, no set of figures is FIGEFEX-TXT-learnable.



We give a counterexample: a pair of target figures that no learner M can approximate effectively from texts. Assume that d=2 and that a learner M FIGEFEX-TXT-learns a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\). Consider the two target figures K={(0,0),(1,1)} and L={(0,0)}. For a text σ of L, every example \((w, 1) \in \operatorname {range}(\sigma )\) satisfies w∈{00}∗. Since M FIGEFEX-TXT-learns \(\mathcal {F}\), it must output a hypothesis H as Mσ(2) such that \(\mbox {$\mathrm {GE}$}(L, H) \le 1/4\). Suppose that M receives n examples before outputting the hypothesis H. Then there exists a presentation τ of the figure K such that τ[n−1]=σ[n−1], and M outputs the same hypothesis H after receiving τ[n−1]. However, \(\mbox {$\mathrm {GE}$}(K, H) \ge\sqrt{2} - 1/4\) holds by the triangle inequality, contradicting the assumption that M FIGEFEX-TXT-learns \(\mathcal {F}\). This argument applies to any \(\mathcal {F}\subseteq \mathcal {K}^{*}\); thus FIGEFEX-TXT=∅. □

Since FIGREFEX-TXT≠∅, we have the relation
$$ \textbf{FIGEFEX-TXT} \subset \textbf{FIGREFEX-TXT}. $$
This result means that we cannot learn any figures “effectively” by using only positive examples.

6 Evaluation of learning using dimensions

Here we show a novel mathematical connection between fractal geometry and Gold’s learning under the learning model proposed in Sect. 3. More precisely, we bound the number of positive examples, one measure of the complexity of learning, using the Hausdorff dimension and the VC dimension. The Hausdorff dimension is the central concept of fractal geometry and measures the density of figures, while the VC dimension is the central concept of Valiant’s model (the PAC learning model) (Kearns and Vazirani 1994) and measures the complexity of classes of hypotheses.

6.1 Preliminaries for dimensions

First we introduce the Hausdorff dimension and two related dimensions, the box-counting dimension and the similarity dimension, and we also introduce the VC dimension.

For X⊆ℝn and s∈ℝ with s>0, define
$$ \mathfrak {H}_{\delta}^{s}(X) := \inf \bigg \{\sum _{U \in \mathcal {U}} \vert U\vert ^s \,\bigg| \,\mathcal {U}\ \text{is a}\ \delta\text{-cover of}\ X\bigg \}. $$
The s-dimensional Hausdorff measure of X is limδ→0 \(\mathfrak {H}_{\delta}^{s}(X)\), denoted by ℌs(X). Here \(\mathcal {U}\) is a δ-cover of X if \(\mathcal {U}\) is countable, \(X \subseteq\bigcup_{U \in\, \mathcal {U}} U\), and |U|≤δ for all \(U \in \mathcal {U}\). When we fix a set X and view ℌs(X) as a function of s, there is at most one value of s at which ℌs(X) changes from ∞ to 0 (Federer 1996). This value is called the Hausdorff dimension of X. Formally, the Hausdorff dimension of a set X, written dimHX, is defined by
$$ \mathrm {dim}_{\mathrm {H}}\,{X} := \sup \big \{s | \mathfrak {H}^s(X) = \infty\big \} = \inf \big \{s \ge0 | \mathfrak {H}^s(X) = 0\big \}. $$
The box-counting dimension, also known as the Minkowski-Bouligand dimension, is one of the most widely used dimensions, since its mathematical calculation and empirical estimation are relatively easy compared to the Hausdorff dimension. Moreover, when we calculate the box-counting dimension, given as the limit in equation (8) below for decreasing δ, the values obtained often converge to the Hausdorff dimension at the same time. Thus we can obtain an approximate value of the Hausdorff dimension by an empirical method. Let X be a nonempty bounded subset of ℝn and Nδ(X) be the smallest cardinality of a δ-cover of X. The box-counting dimension dimBX of X is defined by
$$ \mathrm {dim}_{\mathrm {B}}\,{X} := \lim_{\delta\to0}\frac{\log N_{\delta} (X)}{-\log\delta} $$
if this limit exists. Falconer (2003, Equivalent definitions 3.1, p. 43) also shows that we obtain the same box-counting dimension dimBX if Nδ(X) is the smallest number of cubes of side δ that cover X, or the number of δ-mesh cubes that intersect X. We have
$$ \mathrm {dim}_{\mathrm {H}}\,{X} \le \mathrm {dim}_{\mathrm {B}}\,{X} $$
for all X⊆ℝn.
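The empirical estimation mentioned above can be sketched concretely. The Python code below is our own illustration, not part of the paper; the chaos-game sampling, the seed, and all parameter choices are assumptions. It samples the Sierpiński triangle, counts occupied 2^{−k}-mesh cells, and fits the slope of log Nδ(X) against −log δ, recovering dimB X ≈ log 3/log 2.

```python
import math, random

# Sample the Sierpinski triangle with the chaos game.
random.seed(0)
V = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x, y = 0.0, 0.0
pts = []
for i in range(200_000):
    vx, vy = random.choice(V)
    x, y = (x + vx) / 2, (y + vy) / 2
    if i >= 100:                       # discard a short burn-in
        pts.append((x, y))

def n_boxes(k):
    """N_delta(X) for delta = 2^{-k}: occupied cells of the 2^k x 2^k mesh."""
    return len({(int(px * 2**k), int(py * 2**k)) for px, py in pts})

# Least-squares slope of log N_delta(X) against -log delta = k log 2.
ks = range(1, 8)
xs = [k * math.log(2) for k in ks]
ys = [math.log(n_boxes(k)) for k in ks]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) \
        / sum((a - mx) ** 2 for a in xs)
# The true value is log 3 / log 2 = 1.584...
assert abs(slope - math.log(3) / math.log(2)) < 0.1
```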
It is usually difficult to find the Hausdorff dimension of a given set. However, we can obtain the dimension of a certain class of self-similar sets in the following manner. Let C be a finite set of contractions, and F be the self-similar set of C. The similarity dimension of F, denoted by dimSF, is defined by the equation
$$ \sum_{\varphi \in C}L(\varphi )^{\mathrm {dim}_{\mathrm {S}}\,{F}} = 1, $$
where L(φ) is the contractivity factor of φ, defined as the infimum of all real numbers c with 0<c<1 such that d(φ(x),φ(y))≤c·d(x,y) for all x,y. We have
$$ \mathrm {dim}_{\mathrm {H}}\,{F} \le \mathrm {dim}_{\mathrm {S}}\,{F} $$
and if C satisfies the open set condition,
$$ \mathrm {dim}_{\mathrm {H}}\,{F} = \mathrm {dim}_{\mathrm {B}}\,{F} = \mathrm {dim}_{\mathrm {S}}\,{F} $$
(Falconer 2003). Here, a finite set of contractions C satisfies the open set condition if there exists a nonempty bounded open set O⊂ℝn such that φ(O)⊂O for all φC and φ(O)∩φ′(O)=∅ for all φ,φ′∈C with \(\varphi \not= \varphi '\).
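Since the left-hand side of the defining equation of the similarity dimension is strictly decreasing in the exponent, the equation can be solved numerically. A minimal Python sketch (our own illustration; the function name and tolerance are assumptions) finds dimS F by bisection:

```python
import math

def similarity_dimension(ratios, tol=1e-12):
    """Solve sum(c**s for c in ratios) = 1 for s by bisection.

    `ratios` are the contractivity factors L(phi), each in (0, 1); the left
    side is strictly decreasing in s, so the root is unique."""
    f = lambda s: sum(c**s for c in ratios) - 1.0
    lo, hi = 0.0, 1.0
    while f(hi) > 0:                 # expand the bracket until it holds the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

# Sierpinski triangle: three maps of ratio 1/2, so dim_S = log 3 / log 2.
assert abs(similarity_dimension([0.5, 0.5, 0.5]) - math.log(3) / math.log(2)) < 1e-9
```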
Intuitively, the Vapnik-Chervonenkis (VC) dimension (Blumer et al. 1989; Vapnik and Chervonenkis 1971; Valiant 1984) is a parameter of separability, and it gives lower and upper bounds on the sample size in Valiant’s (PAC) learning model (Kearns and Vazirani 1994). For all \(\mathcal {R}\subseteq \mathcal {H}\) and W⊆(Σd)∗, define
$$ \varPi_{\mathcal {R}}(W) := \big \{\mathrm {Pos}\bigl(\kappa (H) \bigr) \cap W | H \in \mathcal {R}\big \}. $$
If \(\varPi_{\mathcal {R}}(W) = 2^{W}\), we say that W is shattered by \(\mathcal {R}\). Here the VC dimension of \(\mathcal {R}\), denoted by \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {R}}\), is the cardinality of the largest set W shattered by \(\mathcal {R}\).

6.2 Measuring the complexity of learning with dimensions

We show that the Hausdorff dimension of a target figure gives a lower bound to the number of positive examples. Remember that Posk(K)={w∈Pos(K)∣|w|=k} and that the diameter \(\operatorname {diam}(k)\) of the set ρ(w) with |w|=k is \(\sqrt{d} \cdot 2^{-k}\). Moreover, #{w∈(Σd)∗∣|w|=k}=\(2^{kd}\) for all k∈ℕ.

Theorem 6.1

For every figure \(K \in \mathcal {K}^{*}\) and any s<dimHK, if we take k large enough,
$$ \# \mathrm {Pos}_k(K) \ge2^{ks}. $$


Fix s<dimHK. From the definition of the Hausdorff measure,
$$ \mathfrak {H}_{\operatorname {diam}(k)}^s (K) \le\# \mathrm {Pos}_k(K) \cdot \bigl(\sqrt{d} 2^{-k} \bigr)^s $$
since \(\operatorname {diam}(k) = \sqrt{d} 2^{-k}\). If we take k large enough,
$$ \mathfrak {H}_{\operatorname {diam}(k)}^s(K) \ge(\sqrt{d})^s $$
because \(\mathfrak {H}_{\delta}^{s}(K)\) increases monotonically as δ decreases and, since s<dimHK, tends to ∞. Thus
$$ \# \mathrm {Pos}_k(K) \ge \mathfrak {H}_{\operatorname {diam}(k)}^s (K) \bigl( \sqrt{d} 2^{-k} \bigr)^{-s} \ge(\sqrt{d})^s \bigl( \sqrt{d} 2^{-k} \bigr)^{-s} = 2^{ks}. \hfill\text{ } $$

Moreover, if a target figure K can be represented by some hypothesis, that is, \(K \in \kappa (\mathcal {H})\), we can use the exact dimension dimHK as a bound for the number of positive examples #Posk(K).

Theorem 6.2

For every figure \(K \in \kappa (\mathcal {H})\), if we take k large enough,
$$ \# \mathrm {Pos}_k(K) \ge2^{k\,\mathrm {dim}_{\mathrm {H}}\,{K}}. $$


Since the set of contractions encoded by a hypothesis H meets the open set condition, dimHκ(H)=dimBκ(H)=dimSκ(H) holds. Thus we have
$$ \mathrm {dim}_{\mathrm {H}}\,{K} = \mathrm {dim}_{\mathrm {B}}\,{K} = \lim_{\delta\to0}\frac{\log N_{\delta}(K)}{-\log \delta} \le \lim_{k \to\infty} \frac{\log\#\mathrm {Pos}_k(K)}{-\log2^{-k}}, $$
where 2−k is the length of one side of an interval ρ(w) with |w|=k. The above inequality is immediate from the definition of the box-counting dimension, since Nδ(K)≤#Posk(K). Therefore, if we take k large enough, \(\# \mathrm {Pos}_k(K) \ge2^{k\,\mathrm {dim}_{\mathrm {H}}\,{K}}\). □

Example 6.3

Let us consider the figure K in Example 3.1. It is known that dimHK=log3/log2=1.584…. From Theorem 6.2,
$$ \# \mathrm {Pos}_1(K) \ge2^{\mathrm {dim}_{\mathrm {H}}\,{K}} = 3 $$
holds at level 1 and
$$ \# \mathrm {Pos}_2(K) \ge4^{\mathrm {dim}_{\mathrm {H}}\,{K}} = 9 $$
holds at level 2. Actually, #Pos1(K)=4 and #Pos2(K)=13. Note that K is already covered by 3 and 9 intervals at levels 1 and 2, respectively (Fig. 4).
Fig. 4

Positive and negative examples for the Sierpiński triangle at level 1 and 2. White (resp. gray) squares mean positive (resp. negative) examples
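The arithmetic of this example is easy to verify mechanically. The short Python check below (our own illustration) uses the counts 4 and 13 reported above together with the bound of Theorem 6.2:

```python
import math

dim_H = math.log(3) / math.log(2)   # Hausdorff dimension of the Sierpinski triangle
pos_counts = {1: 4, 2: 13}          # #Pos_1(K) and #Pos_2(K) from Example 6.3
for k, n in pos_counts.items():
    assert n >= 2 ** (k * dim_H)    # Theorem 6.2: #Pos_k(K) >= 2^{k dim_H K}
assert abs(2 ** dim_H - 3) < 1e-9   # the level-1 bound evaluates to 3
assert abs(4 ** dim_H - 9) < 1e-9   # the level-2 bound evaluates to 9
```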

The VC dimension can also be used to characterize the number of positive examples. Define
$$ \mathcal {H}^k := \big \{H \in \mathcal {H}\mid |w| = k\ \text{for all}\ w \in H\big \}, $$
and call each hypothesis in this set a level-k hypothesis. We show that the VC dimension of the set \(\mathcal {H}^{k}\) of level-k hypotheses is equal to #{w∈(Σd)∗∣|w|=k}=\(2^{kd}\).

Lemma 6.4

At each level k, we have \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}} = 2^{kd}\).


First of all,
$$ \mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^k} \le2^{kd} $$
is trivial since \(\#\mathcal {H}^{k} = 2^{2^{kd}}\). Let \(\mathcal {H}_{n}^{k}\) denote the set \(\{\,H \in \mathcal {H}^{k} \mid\#\mathrm {Red}(H) = n\,\}\). For all \(H \in \mathcal {H}_{1}^{k}\), there exists w∈Pos(κ(H)) such that w∉Pos(κ(G)) for all \(G \in \mathcal {H}_{1}^{k}\) with \(H \not= G\). Thus if we write \(\mathcal {H}_{1}^{k} = \{H_{1}, \dots, H_{2^{kd}}\}\), there exists a set of finite sequences \(W = \{w_{1}, \dots, w_{2^{kd}}\} \) such that for all i∈{1,…,\(2^{kd}\)}, wi∈Pos(κ(Hi)) and wi∉Pos(κ(Hj)) for all j∈{1,…,\(2^{kd}\)} with i≠j. Moreover, for every pair V,W⊂(Σd)∗, V⊂W implies κ(V)⊂κ(W). Therefore the set W is shattered by \(\mathcal {H}^{k}\), meaning that we have \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}} = 2^{kd}\). □

Therefore we can rewrite Theorems 6.1 and 6.2 as follows.

Theorem 6.5

For every figure \(K \in \mathcal {K}^{*}\) and any s<dimHK, if we take k large enough,
$$ \#\mathrm {Pos}_k(K) \ge\bigl(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^k} \bigr)^{s/d}. $$
Moreover, when \(K \in \kappa (\mathcal {H})\), if we take k large enough,
$$ \#\mathrm {Pos}_k(K) \ge\bigl(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^k} \bigr)^{\mathrm {dim}_{\mathrm {H}}\,{K}/d}. $$

These results demonstrate a relationship among the complexities of learning figures (numbers of positive examples), classes of hypotheses (VC dimension), and target figures (Hausdorff dimension).

6.3 Learning the box-counting dimension through effective learning

One may think that FIGEFEX-INF-learning can be achieved without the proposed hypothesis space; for instance, if a learner simply outputs the figure represented by the set of received positive examples, the generalization error becomes smaller and smaller. Here we show that one “quality” of a target figure, its box-counting dimension, is also learned through FIGEFEX-INF-learning, whereas for a learner that outputs the figures represented by the sets of received positive examples, the box-counting dimension (and also the Hausdorff dimension) of every output figure is always d.

Recall that for all hypotheses \(H \in \mathcal {H}\), dimHκ(H)=dimBκ(H)=dimSκ(H), since the set of contractions encoded by the hypothesis H meets the open set condition.

Theorem 6.6

Assume that a learner M FIGEFEX-INF-learns \(\kappa (\mathcal {H})\). For every target figure \(K \in \mathcal {K}^{*}\) and every informant σ of K,
$$ \lim_{k \to\infty} \mathrm {dim}_{\mathrm {B}}\,{\kappa \bigl(\text {\textbf {M}}_{\sigma }(k) \bigr)} = \mathrm {dim}_{\mathrm {B}}\,{K}. $$


First, we assume that a target figure \(K \in \kappa (\mathcal {H})\). For every informant σ of K, Mσ converges to a hypothesis H with κ(H)=K. Thus
$$ \lim_{k \to\infty} \mathrm {dim}_{\mathrm {B}}\,{\kappa \bigl(\text {\textbf {M}}_{\sigma }(k) \bigr)} = \mathrm {dim}_{\mathrm {B}}\,{K} = \mathrm {dim}_{\mathrm {H}}\,{K}. $$
Next, we assume \(K \in \mathcal {K}^{*} \setminus \kappa (\mathcal {H})\). Since \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) \le2^{-i}\) holds for every i∈ℕ, for each k∈ℕ we have some \(i_k\) such that the hypothesis \(\text {\textbf {M}}_{\sigma }(i_k)\) is consistent with the set of level-k examples \(E^{k} = \{(w, l) \in \operatorname {range}(\sigma ) \mid|w| = k\}\). Thus
$$ \mathrm {dim}_{\mathrm {B}}\,{\kappa \bigl(\text {\textbf {M}}_{\sigma }(i) \bigr)} = \lim_{k \to\infty} \frac{\log\#\mathrm {Pos}_k(K)}{- \log2^{-k}}. $$
Falconer (2003, Equivalent definitions 3.1, p. 43) shows that the box-counting dimension dimBK is defined equivalently by
$$ \mathrm {dim}_{\mathrm {B}}\,{K} = \lim_{k \to\infty} \frac{\log\#\mathrm {Pos}_k(K)}{-\log2^{-k}}. $$
Therefore from the definition of the box-counting dimension, we have
$$ \lim_{i \to\infty} \mathrm {dim}_{\mathrm {B}}\,{\kappa \bigl(\text {\textbf {M}}_{\sigma }(i) \bigr)} = \lim_{k \to\infty} \frac{\log\#\mathrm {Pos}_k(K)}{- \log2^{-k}} = \mathrm {dim}_{\mathrm {B}}\,{K}. $$

7 Computational interpretation of learning

Recently, the concept of “computability” for continuous objects has been introduced in the framework of Type-2 Theory of Effectivity (TTE) (Schröder 2002b; Weihrauch 2000, 2008; Weihrauch and Grubba 2009; Tavana and Weihrauch 2011), in which an uncountable set X is treated as a set of objects for computation through infinite sequences over a given alphabet Σ. Using this framework, we analyze our learning model from the computational point of view. Studies by de Brecht and Yamamoto (2009) and de Brecht (2010) have already demonstrated a close connection between TTE and Gold’s model, and our analysis becomes an instance and extension of their analysis.

7.1 Preliminaries for Type-2 theory of effectivity

We prepare mathematical notation for TTE. In the rest of this section, we assume Σ={0,1,[,],∥,♢}. A partial (resp. total) function g from a set A to a set B is denoted by g:⊆A→B (resp. g:A→B). A representation of a set X is a surjection ξ:⊆C→X, where C is Σ∗ or Σω. We regard \(p \in \operatorname {dom}(\xi)\) as a name of the encoded element ξ(p).

Computability of string functions f:⊆X→Y, where X and Y are Σ∗ or Σω, is defined via a Type-2 machine, which is a usual Turing machine with one-way input tapes, some work tapes, and a one-way output tape (Weihrauch 2000). The function fM:⊆X→Y computed by a Type-2 machine M is defined as follows: when Y is Σ∗, fM(p):=q if M with input p halts with q on the output tape, and when Y is Σω, fM(p):=q if M with input p writes q step by step onto the output tape. We say that a function f:⊆C→D is computable if there is a Type-2 machine that computes it, and a finite or infinite sequence p is computable if the constant function that outputs p is computable. A Type-2 machine never changes symbols that have already been written onto the output tape; thus each prefix of the output depends only on a prefix of the input.
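This prefix monotonicity can be illustrated with a toy “machine”. The following Python sketch is our own illustration, not the formal Type-2 machine: modelling the one-way tapes by a generator is an assumption. It reads the binary digits of a real x∈[0,1] and emits, one item per input digit, rational approximations accurate to 2^{−i}, never revising earlier output, in the spirit of the Cauchy-style representations used in this paper.

```python
from fractions import Fraction
from itertools import islice

def cauchy_name(bits):
    """Type-2-style translator: from the binary digits of x in [0, 1], emit
    rationals q_0, q_1, ... with |x - q_i| <= 2^{-i}.  Each emitted item
    depends only on a finite prefix of the input and is never revised,
    mimicking the one-way output tape."""
    q = Fraction(0)
    for i, b in enumerate(bits):
        q += Fraction(b, 2 ** (i + 1))
        yield q            # q is the truncation of x to i+1 binary digits

def bits_of_third():       # 1/3 = 0.010101... in binary
    while True:
        yield 0
        yield 1

approx = list(islice(cauchy_name(bits_of_third()), 10))
for i, q in enumerate(approx):
    assert abs(Fraction(1, 3) - q) <= Fraction(1, 2 ** i)
```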

By treating a Type-2 machine as a translator between names of some objects, a hierarchy of representations is introduced. A representation ξ is reducible to ζ, denoted by ξζ, if there exists a computable function f such that ξ(p)=ζ(f(p)) for all \(p \in \operatorname {dom}(\xi)\). Two representations ξ and ζ are equivalent, denoted by ξζ, if both ξζ and ζξ hold. As usual, ξ<ζ means ξζ and not ζξ.

Computability for functions is defined through representations and computability of string functions.

Definition 7.1

Let ξ and ζ be representations of X and Y, respectively. An element xX is ξ-computable if there is some computable p such that ξ(p)=x. A function f:⊆XY is (ξ,ζ)-computable if there is some computable function g such that
$$ f \circ\xi(p) = \zeta\circ g(p) $$
for all \(p \in \operatorname {dom}(\xi)\). This g is called a (ξ,ζ)-realization of f.

Thus the abstract function f is “realized” by the concrete function (Type-2 machine) g through the two representations ξ and ζ.

Various representations of the set of nonempty compact sets \(\mathcal {K}^{*}\) are well studied by Brattka and Weihrauch (1999), Brattka and Presser (2003). Let
$$ \mathcal {Q}= \big \{A \subset \mathbb {Q}^d | A \ \text{is finite and nonempty}\big \} $$
and define a representation \(\nu_{\mathcal {Q}} :\subseteq \varSigma ^{\ast }\to \mathcal {Q}\) by
$$ \nu_{\mathcal {Q}} \bigl([w_0\Vert w_1\Vert\dots\Vert w_n] \bigr) := \left \{\nu_{\mathbb {Q}^d}(w_0), \dots, \nu_{\mathbb {Q}^d}(w_n)\right \}, $$
where \(\nu_{\mathbb {Q}^{d}} :\subseteq(\varSigma^{d})^{*} \to \mathbb {Q}^{d}\) is the standard binary notation of rational numbers defined by
$$ \nu_{\mathbb {Q}^d} \bigl(\bigl\langle w^1, w^2, \dots, w^d\bigr\rangle\bigr) := \Biggl( \sum_{i = 0}^{|w^1| - 1} w_i^1\cdot2^{-(i + 1)}, \sum _{i = 0}^{|w^2| - 1} w_i^2\cdot 2^{-(i + 1)}, \dots, \sum_{i = 0}^{|w^d| - 1} w_i^d\cdot2^{-(i + 1)} \Biggr) $$
and “[”, “]”, and “∥” are special symbols used to separate two finite sequences. For a finite set of finite sequences {w0,…,wm}, for convenience we introduce the mapping ι which translates the set into a finite sequence defined by ι(w0,…,wm):=[w0∥…∥wm]. Note that \(\nu_{\mathbb {Q}^{d}}(\langle w^{1}, \dots, w^{d}\rangle) = (\min\rho (w^{1}), \ldots, \min\rho(w^{d}))\) for our representation ρ introduced in (2). The standard representation of the topological space \((\mathcal {K}^{*}, d_{\mathrm {H}})\), given by Brattka and Weihrauch (1999, Definition 4.8), is defined in the following manner.
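The decoding map ν above can be implemented directly from its defining formula. A short Python sketch (our own illustration; the function names are assumptions, and exact rationals are used via `fractions`) evaluates each component word w as Σ_i w_i·2^{−(i+1)}, which yields min ρ(w^i) for each coordinate:

```python
from fractions import Fraction

def nu_Q1(w):
    """Binary notation of one coordinate: nu(w) = sum_i w_i * 2^{-(i+1)}."""
    return sum(Fraction(int(c), 2 ** (i + 1)) for i, c in enumerate(w))

def nu_Qd(words):
    """nu_{Q^d}(<w^1, ..., w^d>): decode each component word separately."""
    return tuple(nu_Q1(w) for w in words)

# The decoded tuple is the lower-left corner of the cell rho(w):
assert nu_Qd(("01", "11")) == (Fraction(1, 4), Fraction(3, 4))
```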

Definition 7.2

(Standard representation of figures)

Define the representation \(\kappa_{\mathrm {H}} : \subseteq \varSigma ^{\omega }\to \mathcal {K}^{*}\) of figures by κH(p)=K if p=w0w1w2♢…,
$$ d_{\mathrm {H}} \bigl(K, \nu_{\mathcal {Q}}(w_i) \bigr) < 2^{-i} $$
for each i∈ℕ, and \(\lim_{i \to\infty}\nu_{\mathcal {Q}}(w_{i}) = K\), where ♢ denotes a separator of two finite sequences.

This representation κH is known to be an admissible representation of the space \((\mathcal {K}^{*}, d_{\mathrm {H}})\), which is the key concept in TTE (Schröder 2002b; Weihrauch 2000), and is also known as the \(\boldsymbol {\varSigma }_{1}^{0}\)-admissible representation proposed by de Brecht and Yamamoto (2009).

7.2 Computability and learnability of figures

First, we show computability of figures in \(\kappa (\mathcal {H})\).

Theorem 7.3

For every figure \(K \in \kappa (\mathcal {H})\), K is κH-computable.


It is enough to prove that there exists a computable function f such that κ(H)=κH(f(H)) for all \(H \in \mathcal {H}\). Fix a hypothesis \(H \in \mathcal {H}\) such that κ(H)=K. For all k∈ℕ and for Hk defined by
$$ H_k := \bigl\{w \in\bigl(\varSigma^{d} \bigr)^{*} \mid w \sqsubseteq v\ \text{with}\ v \in H^m\ \text{for some}\ m,\ \text{and}\ |w| = k \bigr\}, $$
we can easily check that
$$ d_{\mathrm {H}} \bigl(K, \nu_{\mathcal {Q}} \bigl(\iota(H_k) \bigr) \bigr) < \operatorname {diam}(k) = \sqrt{d} \cdot2^{-k}. $$
Moreover, for each k, \(\sqrt{d} \cdot2^{-g(k)} < 2^{-k}\), where
$$ g(k) = \lceil k + \log_2 \sqrt{d} \rceil. $$
Therefore there exists a computable function f which translates H into a representation of K given as follows: f(H)=p with p=w0w1♢… such that ι(Hg(k))=wk for all k∈ℕ. □

Thus a hypothesis H can be viewed as a “program” of a Type-2 machine that produces a κH-representation of the figure κ(H).

Both informants and texts are also representations (in the sense of TTE) of compact sets. Define the mapping ηINF by ηINF(σ):=K for every \(K \in \mathcal {K}^{*}\) and informant σ of K, and the mapping ηTXT by ηTXT(σ):=K for every \(K \in \mathcal {K}^{*}\) and text σ of K. Trivially ηINF<ηTXT holds, that is, some Type-2 machine can translate ηINF to ηTXT, but no machine can translate ηTXT to ηINF. Moreover, we have the following hierarchy of representations.

Lemma 7.4

ηINF<κH, \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\), and\(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\).


First we prove ηINFκH, that is, there is some computable function f such that ηINF(σ)=κH(f(σ)). Fix a figure K and its informant \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). For all k∈ℕ, we have
$$ d_{\mathrm {H}} \bigl(K, \mathrm {Pos}_k(K) \bigr) \le \operatorname {diam}(k) = \sqrt{d} \cdot2^{-k} $$
and Posk(K) can be obtained from σ. Moreover, for each k, \(\sqrt{d} \cdot2^{-g(k)} < 2^{-k}\), where
$$ g(k) = \lceil\, k + \log_2 \sqrt{d} \,\rceil. $$
Therefore there exists a computable function f that translates σ into a representation of K as follows: f(σ)=p, where p=w0w1♢… such that wk=ι(Posg(k)(K)) for all k∈ℕ.

Second, we prove \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\). Assume the opposite, that ηTXT≤κH holds. Then there exists a computable function f such that ηTXT(σ)=κH(f(σ)) for every figure \(K \in \mathcal {K}^{*}\) and every text σ of K. Fix a figure K and a text \(\sigma \in \operatorname {dom}(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}})\) of K. This means that for any small ε∈ℝ, f can pick finite sequences w1,w2,…,wn from Pos(K) such that \(d_{\mathrm {H}}(K, \nu_{\mathcal {Q}}(\iota(w_{1}, w_{2}, \dots, w_{n}))) \le \varepsilon \). However, if such an f existed, we could easily check that {K}∈FIGEFEX-TXT, contradicting our result FIGEFEX-TXT=∅ (Theorem 5.5). It follows that \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\).

Third, we prove \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). There is a figure K such that K∩ρ(w)={x} for some w∈Σ∗, i.e., K and ρ(w) intersect in only one point x. Such a w must appear in every informant or text σ of K as a positive example, that is, w∈Pos(K). However, a κH-name of K can be constructed without w: there exists an infinite sequence \(p \in \operatorname {dom}(\kappa_{\mathrm {H}})\) with p=w0♢w1♢… such that \(x \notin\nu_{\mathcal {Q}}(w_{k})\) for all k∈ℕ. Thus, if there existed a computable f that outputs the example (w,1) from such a sequence after seeing only w0♢w1♢…♢wn, one could extend this prefix to a name of some figure L with w∉Pos(L), in contradiction to the reduction. Therefore there is no computable function that outputs the example (w,1) from p, meaning that \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not \le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). □

Here we interpret learning of figures as computation based on TTE. If we see the output of a learner, i.e., an infinite sequence of hypotheses, as an infinite sequence encoding a figure, the learner can be viewed as a translator of codes of figures. Naturally, we can assume that the hypothesis space \(\mathcal {H}\) is a discrete topological space, that is, every hypothesis \(H \in \mathcal {H}\) is isolated and is an open set itself. Define the mapping \(\lim_{\mathcal {H}}: \mathcal {H}^{\omega}\to \mathcal {H}\), where \(\mathcal {H}^{\omega}\) is the set of infinite sequences of hypotheses in \(\mathcal {H}\), by \(\lim_{\mathcal {H}}(\tau) := H\) if τ is an infinite sequence of hypotheses that converges to H, i.e., there exists n∈ℕ such that τ(i)=τ(n) for all in. This coincides with the naïve Cauchy representation given by Weihrauch (2000) and \(\boldsymbol {\varSigma }_{2}^{0}\)-admissible representation of hypotheses introduced by de Brecht and Yamamoto (2009). For any set \(\mathcal {F}\subseteq \mathcal {K}^{*}\), let \(\mathcal {F}_{\mathrm {D}}\) denote the space \(\mathcal {F}\) equipped with the discrete topology, that is, every subset of \(\mathcal {F}\) is open, and the mapping \(\mathrm {id}_{\mathcal {F}} : \mathcal {F}\to \mathcal {F}_{\mathrm {D}}\) be the identity on \(\mathcal {F}\). The computability of this identity is not trivial, since the topology of \(\mathcal {F}_{\mathrm {D}}\) is finer than that of \(\mathcal {F}\). Intuitively, this means that \(\mathcal {F}_{\mathrm {D}}\) is more informative than \(\mathcal {F}\). We can interpret learnability of \(\mathcal {F}\) as computability of the identity \(\mathrm {id}_{\mathcal {F}}\). The results in the following are summarized in Fig. 5.
Fig. 5

The commutative diagram representing FIGEX-INF-learning of \(\mathcal {F}\) (left), and FIGEFEX-INF-learning of \(\mathcal {F}\) (both left and right). In this diagram, \(\mathrm {I}{\scriptsize \mathrm {NF}}(\mathcal {F})\) denotes the set of informants of \(K \in \mathcal {F}\).

Theorem 7.5

A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGEX-INF-learnable (resp. FIGEX-TXT-learnable) if and only if the identity \(\mathrm {id}_{\mathcal {F}}\) is \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable (resp. \((\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable).


We only prove the case of FIGEX-INF-learning, since we can prove the case of FIGEX-TXT-learning in exactly the same way.

The “only if” part: There is a learner M that FIGEX-INF-learns \(\mathcal {F}\), hence for all \(K \in \mathcal {F}\) and all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\), Mσ converges to a hypothesis \(H \in \mathcal {H}\) such that κ(H)=K. Thus
$$ \mathrm {id}_{\mathcal {F}}\circ\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}} (\sigma ) = \kappa \circ\mathrm{lim}_{\mathcal {H}} (\text {\textbf {M}}_{\sigma }), $$
and this means that \(\mathrm {id}_{\mathcal {F}}\) is \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable.

The “if” part: For some M, the above equation (9) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). This means that M is a learner that FIGEX-INF-learns \(\mathcal {F}\). □

Here we consider two more learning criteria, FIGFIN-INF- and FIGFIN-TXT-learning, where the learner generates only one correct hypothesis and halts. This learning corresponds to finite learning or one shot learning introduced by Gold (1967), Trakhtenbrot and Barzdin (1970) and it is a special case of learning with a bound of mind change complexity, the number of changes of hypothesis, introduced by Freivalds and Smith (1993) and used to measure the complexity of learning classes (Jain et al. 1999). We obtain the following theorem.

Theorem 7.6

A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGFIN-INF-learnable (resp. FIGFIN-TXT-learnable) if and only if the identity \(\mathrm {id}_{\mathcal {F}}\) is (ηINF,κ)-computable (resp. (ηTXT,κ)-computable).


We only prove the case of FIGFIN-INF-learning, since we can prove the case of FIGFIN-TXT-learning in exactly the same way.

The “only if” part: There is a learner M that FIGFIN-INF-learns \(\mathcal {F}\), hence for all \(K \in \mathcal {F}\) and all informants \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\) of K, M outputs a single hypothesis Mσ=H such that κ(H)=K. Thus we have
$$ \mathrm {id}_{\mathcal {F}}\circ\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}} (\sigma ) = \kappa (\text {\textbf {M}}_{\sigma }). $$
This means that \(\mathrm {id}_{\mathcal {F}}\) is (ηINF,κ)-computable.

The “if” part: For some M, the above equation (10) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). This means that M is a learner that FIGFIN-INF-learns \(\mathcal {F}\). □

Finally, we show a connection between effective learning of figures and the computability of figures. Since FIGEFEX-TXT=∅ (Theorem 5.5), we treat only effective learning from informants. We define the representation \(\gamma:\subseteq \mathcal {H}^{\omega} \to \mathcal {K}^{*}\) by γ(p):=K if p=H0,H1,… such that \(H_{i} \in \mathcal {H}\) and dH(K,κ(Hi))≤\(2^{-i}\) for all i∈ℕ.

Lemma 7.7

\(\gamma \equiv \kappa_{\mathrm {H}}\), that is, the representations γ and κH are equivalent.

First we prove \(\gamma \le \kappa_{\mathrm {H}}\). For the function g:ℕ→ℕ such that
$$ g(i) = \lceil\, i + \log_2 \sqrt{d}\, \rceil, $$
we have \(\operatorname {diam}(g(i)) = \sqrt{d} \cdot2^{-g(i)} \le2^{-i}\) for all i∈ℕ. Thus there exists a computable function f such that, for all \(p \in \operatorname {dom}(\gamma)\), f(p) is a κH-name of γ(p), since, for an infinite sequence of hypotheses p=H0,H1,…, all f has to do is to generate an infinite sequence q=w0♢w1♢w2♢⋯ such that \(w_{i} = \iota(H_{g(i)}^{g(i)})\) for all i∈ℕ, which results in
$$ d_{\mathrm {H}} \bigl(K, \nu_{\mathcal {Q}}(w_i) \bigr) \le \operatorname {diam}\bigl(g(i) \bigr) = \sqrt{d} \cdot2^{-g(i)} \le2^{-i} $$
for all i∈ℕ.
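As a quick numerical sanity check (our own sketch, not from the paper), the bound √d · 2^{−g(i)} ≤ 2^{−i} for g(i) = ⌈i + log₂ √d⌉ can be verified directly:

```python
from math import ceil, log2, sqrt

def g(i, d):
    # smallest integer depth making the cell diameter sqrt(d) * 2**(-g) at most 2**(-i)
    return ceil(i + log2(sqrt(d)))

# the diameter bound from the proof holds for every dimension d and precision i tried
for d in (1, 2, 3, 4, 10):
    for i in range(30):
        assert sqrt(d) * 2 ** (-g(i, d)) <= 2 ** (-i)
```

The point of the ceiling is only to make g integer-valued: any integer depth at least i + log₂ √d suffices.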
Next, we prove \(\kappa_{\mathrm {H}} \le \gamma\). Fix \(q \in \operatorname {dom}(\kappa_{\mathrm {H}})\) with q=w0♢w1♢⋯. For each i∈ℕ, let wi=ι(wi,0,wi,1,…,wi,n). Then the set {wi,0,…,wi,n}, which we denote Hi, becomes a hypothesis. From the definition of κH,
$$ d_{\mathrm {H}} \bigl(K, \kappa (H_i) \bigr) \le2^{-i} $$
holds for all i∈ℕ. This means that, for the sequence p=H0,H1,…, γ(p)=K. We therefore have \(\kappa_{\mathrm {H}} \le \gamma\). □
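This second direction amounts to a mechanical regrouping: each block wi of a κH-name already lists the words of a hypothesis Hi, so a γ-name is obtained by collecting them. A toy sketch of this translation (the string encodings below are invented purely for illustration):

```python
def kappaH_to_gamma(blocks):
    """Translate a kappa_H-name, given as an iterator of word tuples, into a
    gamma-name: the i-th hypothesis H_i is the set of words in the i-th block,
    and the guarantee d_H(K, kappa(H_i)) <= 2**(-i) carries over unchanged."""
    for block in blocks:
        yield frozenset(block)

# toy stream standing in for the blocks w_0, w_1, w_2 of a kappa_H-name
name = [("0", "1"), ("00", "11"), ("000", "111")]
hypotheses = list(kappaH_to_gamma(name))
```

No recomputation is needed, which is why κH ≤ γ holds without any appeal to the function g used in the other direction.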

By using this lemma, we interpret effective learning of figures as the computability of two identities (Fig. 5).

Theorem 7.8

A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGEFEX-INF-learnable if and only if there exists a computable function f such that f is a \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-realization of the identity \(\mathrm {id}_{\mathcal {F}}\), and f is also a (ηINF,γ)-realization of the identity \(\mathrm {id}: \mathcal {K}^{*} \to \mathcal {K}^{*}\).


We prove only the latter condition; the former can be proved in exactly the same way as Theorem 7.5.

The “only if” part: We assume that a learner M FIGEFEX-INF-learns \(\mathcal {F}\). Then, for all \(K \in \mathcal {K}^{*}\) and all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\),
$$ \mathrm {id}\circ\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}} (\sigma ) = \gamma(\text {\textbf {M}}_{\sigma }) $$
holds. This means that the identity id is (ηINF,γ)-computable.

The “if” part: For some M, id∘ηINF(σ)=γ(Mσ) for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). It follows that M is a learner that FIGEFEX-INF-learns \(\mathcal {F}\). □

Thus in FIGEFEX-INF-learning of a set of figures \(\mathcal {F}\), a learner M outputs a hypothesis H with κ(H)=K in finite time if \(K \in \mathcal {F}\), and M outputs the “standard” representation of K if \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\), since \(\gamma \le \kappa_{\mathrm {H}}\) by Lemma 7.7. Informally, this means that not too much information about a figure is lost even if it is not explanatorily learnable.

8 Conclusion

We have proposed the learning of figures using self-similar sets based on Gold’s learning model towards a new theoretical framework of binary classification focusing on computability, and demonstrated a learnability hierarchy under various learning criteria (Fig. 3). The key to the computable approach is the amalgamation of discretization of data and the learning process. We showed a novel mathematical connection between fractal geometry and Gold’s model by measuring the lower bound of the size of training data with the Hausdorff dimension and the VC dimension. Furthermore, we analyzed our learning model using TTE (Type-2 Theory of Effectivity) and presented several mathematical connections between computability and learnability.

Many recent methods in machine learning are based on a statistical approach (Bishop 2007), since much real-world data is analog (real-valued) and the statistical approach can, in theory, treat such analog data directly. However, all learning methods are performed on computers, which means that every machine learning algorithm actually treats discretized digital data; most research pays no attention to the gap between analog and digital data. In this paper we have proposed a novel and completely computable learning method for analog data, and have analyzed the method precisely. This work provides a theoretical foundation for computable learning from analog data in tasks such as classification, regression, and clustering.


Müller (2001) and Schröder (2002a) give some interesting examples in the study of computation for real numbers.


Sugiyama et al. (2006, 2009) have also contributed to the area, but their work was only presented at closed workshops.


The reason for this notation is that σ can be viewed as a mapping from ℕ (including 0) to the set of examples.


Consistency was also studied in the same form by Barzdin (1974).


The article (Sakurai 1991) is written in Japanese. The same theorem is mentioned by Mukouchi and Arikawa (1995, p. 60, Theorem 3).



The authors sincerely thank the editor and the anonymous reviewers for their many useful comments and suggestions, which have led to invaluable improvements of this paper. This work was partly supported by Grant-in-Aid for Scientific Research (A) 22240010 and for JSPS Fellows 22⋅5714.

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Mahito Sugiyama
    • 1
  • Eiju Hirowatari
    • 2
  • Hideki Tsuiki
    • 3
  • Akihiro Yamamoto
    • 4
  1. The Institute of Scientific and Industrial Research (ISIR), Osaka University, Osaka, Japan
  2. Center for Fundamental Education, The University of Kitakyushu, Kitakyushu, Japan
  3. Graduate School of Human and Environmental Studies, Kyoto University, Kyoto, Japan
  4. Graduate School of Informatics, Kyoto University, Kyoto, Japan
