# Learning figures with the Hausdorff metric by fractals—towards computable binary classification

## Abstract

We present learning of *figures*, nonempty compact sets in Euclidean space, based on *Gold’s learning model* aiming at a *computable* foundation for binary classification of multivariate data. Encoding real vectors with no numerical error requires *infinite* sequences, resulting in a gap between each real vector and its *discretized* representation used for the actual machine learning process. Our motivation is to provide an analysis of machine learning problems that explicitly tackles this aspect which has been glossed over in the literature on binary classification as well as in other machine learning tasks such as regression and clustering. In this paper, we amalgamate two processes: discretization and binary classification. Each learning target, the set of real vectors classified as positive, is treated as a figure. A learning machine receives discretized vectors as input data and outputs a sequence of discrete representations of the target figure in the form of *self-similar sets*, known as *fractals*. The generalization error of each output is measured by the *Hausdorff metric*. Using this learning framework, we reveal a hierarchy of learnable classes under various learning criteria in the track of traditional analysis based on Gold’s learning model, and show a mathematical connection between machine learning and fractal geometry by measuring the complexity of learning using the *Hausdorff dimension* and the *VC dimension*. Moreover, we analyze computability aspects of learning of figures using the framework of Type-2 Theory of Effectivity (TTE).

## Keywords

Binary classification · Discretization · Self-similar set · Gold’s learning model · Hausdorff metric · Type-2 theory of effectivity

## 1 Introduction

*Discretization* is a fundamental process in machine learning from analog data. For example, Fourier analysis is one of the most essential signal processing methods and its discrete version, *discrete Fourier analysis*, is used for learning or recognition on a computer from continuous signals. However, in this method only the time axis is discretized, so each data point is not fully discretized. That is to say, continuous (electrical) waves are essentially treated as finite/infinite sequences of *real numbers*, hence each value is still continuous (analog). The gap between analog and digital data therefore remains.

This problem appears all over machine learning from observed multivariate data. The reason is that an infinite sequence is needed to encode a real vector exactly without any numerical error, since the cardinality of the set of real numbers, which is the same as that of infinite sequences, is much larger than that of the set of finite sequences. Thus to treat each data point on a computer, it has to be *discretized* and considered as an approximate value with some numerical error. However, to date, most machine learning algorithms ignore the gap between the original value and its discretized representation. This gap could result in unexpected numerical errors.^{1} Now that machine learning algorithms are applied to massive datasets, it is urgent to give a theoretical foundation for learning, such as classification, regression, and clustering, from multivariate data in a fully computational manner that guarantees the soundness of the results of learning.

In the field of computational learning theory, *Valiant’s learning model* (also called the *PAC*, *Probably Approximately Correct*, *learning model*), proposed by Valiant (1984), is used for theoretical analysis of machine learning algorithms. In this model, we can analyze the robustness of a learning algorithm in the face of noise or inaccurate data and the complexity of learning with respect to the rate of convergence or the size of the input using the concept of probability. Blumer et al. (1989) and Ehrenfeucht et al. (1989) provided the crucial conditions for learnability, that is, the lower and upper bounds for the sample size, using the *VC* (*Vapnik-Chervonenkis*) *dimension* (Vapnik and Chervonenkis 1971). These results can be applied to various concept representations that handle real-valued inputs and use real-valued parameters, for example, to analyze learning of neural networks (Baum and Haussler 1989). However, this learning model is not in line with discrete and computational analysis of machine learning. We cannot know which class of continuous objects is exactly learnable and what kind of data are needed to learn from a finite expression of discretized multivariate data. Although PAC learning of axis-parallel rectangles has already been investigated (Blumer et al. 1989; Kearns and Vazirani 1994; Long and Tan 1998), which can be viewed as a variant of learning from multivariate data with numerical error, it is not applicable to our study. Our goal is to investigate computational learning, focusing on a common ground between “learning” and “computation” of real numbers based on the behavior of Turing machines, without any reference to probability distributions. For the purpose of this investigation, we need to distinguish abstract mathematical objects such as real numbers from their concrete representations, or codes, on a computer.

Instead, in this paper we use *Gold’s learning model* (also called *identification in the limit*), which was originally designed for learning of recursive functions (Gold 1965) and languages (Gold 1967). In this model, a learning machine is assumed to be a procedure, i.e., a Turing machine (Turing 1937) which never halts, that receives training data from time to time and outputs representations (hypotheses) of the target from time to time. All data are usually assumed to be given at some point in the future. Starting from this learning model, learnability of classes of discrete objects, such as languages and recursive functions, has been analyzed in detail under various learning criteria (Jain et al. 1999). However, analysis of learning for continuous objects, such as classification, regression, and clustering for multivariate data, with Gold’s model is still under development, despite such settings being typical in modern machine learning. To the best of our knowledge, the only line of studies devoted to learning of real-valued functions is that of Hirowatari and Arikawa (1997, 2001), Apsītis et al. (1999), and Hirowatari et al. (2003, 2005, 2006), where they addressed the analysis of learnable classes of real-valued functions using computable representations of real numbers.^{2} We therefore need a new theoretical and computational framework for modern machine learning based on Gold’s learning model with discretization of numerical data.

In this paper we consider the problem of *binary classification* for multivariate data, which is one of the most fundamental problems in machine learning and pattern recognition. In this task, a training dataset consists of a set of pairs {(*x* _{1},*y* _{1}),(*x* _{2},*y* _{2}),…,(*x* _{ n },*y* _{ n })}, where *x* _{ i }∈ℝ^{ d } is a *feature vector*, *y* _{ i }∈{0,1} is a *label*, and the *d*-dimensional Euclidean space ℝ^{ d } is a *feature space*. The goal is to learn a *classifier* from the given training dataset, that is, to find a mapping *h*:ℝ^{ d }→{0,1} such that, for all *x*∈ℝ^{ d }, *h*(*x*) is expected to be the same as the true label of *x*. In other words, such a classifier *h* is the *characteristic function* of a subset *L*={*x*∈ℝ^{ d }∣*h*(*x*)=1} of ℝ^{ d }, which should be as similar as possible to the true set *K*={*x*∈ℝ^{ d }∣the true label of *x* is 1}. Throughout the paper, we assume for simplicity that each feature is normalized by some data preprocessing such as min-max normalization, that is, the feature space is the unit interval (cube) \(\mathcal {I}^{d} = [0, 1] \times \dots\times[0, 1]\) in the *d*-dimensional Euclidean space ℝ^{ d }. In many realistic scenarios, each target *K* is a closed and bounded subset of \(\mathcal {I}^{d}\), i.e., a nonempty compact subset of \(\mathcal {I}^{d}\), called a *figure*. Thus here we address the problem of binary classification by treating it as “learning of figures”.

In this machine learning process, we implicitly treat any feature vector through its *representation*, or *code* on a computer, that is, each feature vector \(x \in \mathcal {I}^{d}\) is represented by a sequence *p* over some alphabet *Σ* using an encoding scheme *ρ*. Here such a surjective mapping *ρ* is called a *representation* and should map the set of “infinite” sequences *Σ* ^{ ω } to \(\mathcal {I}^{d}\) since there is no one-to-one correspondence between finite sequences and real numbers (or real vectors). In this paper, we use the *binary representation* *ρ*:*Σ* ^{ ω }→[0,1] with *Σ*={0,1}, which is defined by *ρ*(*p*):=∑*p* _{ i }⋅2^{−(i+1)} for an infinite sequence *p*=*p* _{0} *p* _{1} *p* _{2}…. For example, *ρ*(0100…)=0.25, *ρ*(1000…)=0.5, and *ρ*(0111…)=0.5. However, we cannot treat infinite sequences on a computer in finite time and, instead, we have to use *discretized* values, i.e., *truncated finite sequences* in any actual machine learning process. Thus in learning of a classifier *h* for the target figure *K*, we cannot use an exact data point *x*∈*K* but have to use a discretized finite sequence *w*∈*Σ* ^{∗} which tells us that *x* takes one of the values in the set {*ρ*(*p*)∣*w*⊏*p*} (*w*⊏*p* means that *w* is a *prefix* of *p*). For instance, if *w*=01, then *x* should be in the interval [0.25,0.5]. For a finite sequence *w*∈*Σ* ^{∗}, we define *ρ*(*w*):={*ρ*(*p*)∣*w*⊏*p* with *p*∈*Σ* ^{ ω }} using the same symbol *ρ*. From a geometric point of view, *ρ*(*w*) means a hyper-rectangle whose sides are parallel to the axes in the space \(\mathcal {I}^{d}\). For example, for the binary representation *ρ*, we have *ρ*(0)=[0,0.5], *ρ*(1)=[0.5,1], *ρ*(01)=[0.25,0.5], and so on. Therefore in the actual learning process, while a target set *K* and each point *x*∈*K* exist mathematically, a learning machine can only treat finite sequences as training data.
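As a concrete illustration (not part of the original paper), the interval *ρ*(*w*) of a finite binary sequence can be computed exactly with rational arithmetic. The helper name `rho_interval` is ours; this is a minimal sketch of the one-dimensional case:

```python
from fractions import Fraction

def rho_interval(w: str):
    """Closed dyadic interval rho(w) = [lo, lo + 2^-|w|] of all reals
    whose binary expansion 0.p0 p1 p2 ... starts with the finite word w."""
    lo = sum(Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(w))
    return lo, lo + Fraction(1, 2 ** len(w))

# rho(01) = [0.25, 0.5] and rho(1) = [0.5, 1], matching the text
assert rho_interval("01") == (Fraction(1, 4), Fraction(1, 2))
assert rho_interval("1") == (Fraction(1, 2), Fraction(1))
```

Exact rationals avoid the floating-point error the paper is careful about; the empty word yields the whole interval [0,1].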

Here the problem of binary classification is stated in a computational manner as follows: Given a training dataset {(*w* _{1},*y* _{1}),(*w* _{2},*y* _{2}),…,(*w* _{ n },*y* _{ n })} (*w* _{ i }∈*Σ* ^{∗} for each *i*∈{1,2,…,*n*}), where *y* _{ i }=1 if \(\rho(w_{i}) \cap K \not= \emptyset\) for a target figure \(K \subseteq \mathcal {I}^{d}\) and *y* _{ i }=0 otherwise, learn a classifier *h*:*Σ* ^{∗}→{0,1} for which *h*(*w*) should be the same as the true label of *w* for all *w*∈*Σ* ^{∗}. Each training datum (*w* _{ i },*y* _{ i }) is called a *positive example* if *y* _{ i }=1 and a *negative example* if *y* _{ i }=0.

Assume that a figure *K* is represented by a set *P* of infinite sequences, i.e., {*ρ*(*p*)∣*p*∈*P*}=*K*, using the binary representation *ρ*. Then learning the figure is different from learning the well-known *prefix closed set* Pref(*P*), defined as Pref(*P*):={*w*∈*Σ* ^{∗}∣*w*⊏*p* for some *p*∈*P*}, since generally \(\mathrm {Pref}(P) \not= \{w \in\varSigma^{*} \mid\rho(w) \cap K \not= \emptyset\}\) holds. For example, if *P*={*p*∈*Σ* ^{ ω }∣1⊏*p*}, the corresponding figure *K* is the interval [0.5,1]. Then every nonempty prefix *w* of the infinite sequence 0111… (for instance, *w*=01) is a positive example, since *ρ*(0111…)=0.5 implies \(\rho(w) \cap K \not= \emptyset\), but no such prefix is contained in Pref(*P*). This problem is fundamentally due to rational numbers having two representations; for example, both 0111… and 1000… represent 0.5. Solving this mismatch between objects of learning and their representations is one of the challenging problems of learning continuous objects based on their representations in a computational manner.

For finite expression of classifiers, we use *self-similar sets* known as *fractals* (Mandelbrot 1982) to exploit their simplicity and the power of expression theoretically provided by the field of fractal geometry. Specifically, we can approximate any figure by some self-similar set arbitrarily closely (derived from the Collage Theorem given by Falconer 2003) and can compute it by a simple recursive algorithm, called an *IFS* (*Iterated Function System*) (Barnsley 1993; Falconer 2003). This approach can be viewed as an analog of discrete Fourier analysis, where the *FFT* (*Fast Fourier Transform*) is used as the fundamental recursive algorithm. Moreover, in the process of sampling from analog data in discrete Fourier analysis, *scalability* is a desirable property. It requires that as the sample resolution increases, the accuracy of the result is monotonically refined. We formalize this property as *effective learning* of figures, which is inspired by *effective computing* in the framework of Type-2 Theory of Effectivity (TTE) studied in computable analysis (Schröder 2002b; Weihrauch 2000). This model guarantees that as a computer reads more and more precise information of the input, it produces more and more accurate approximations of the result. Here we adapt this model from computation to learning, where if a learner (learning machine) receives more and more accurate training data, it learns better and better classifiers (self-similar sets) approximating the target figure.

In learning, we evaluate each output hypothesis by its *generalization error* and measure the error by the *Hausdorff metric*, since it induces the standard topology on the set of figures (Beer 1993). The main contributions of this paper are as follows:

- 1.
We formalize the learning of figures using self-similar sets based on Gold’s learning model towards realizing fully computable binary classification (Sect. 3). We construct a representational system for learning using self-similar sets based on the binary representation of real numbers, and show desirable properties of it (Lemmas 3.2, 3.3, and 3.4).

- 2.
We construct a learnability hierarchy under various learning criteria, summarized in Fig. 3 (Sects. 4 and 5). We consider five criteria for learning: explanatory learning (Sect. 4.1), consistent learning (Sect. 4.2), reliable and refutable learning (Sect. 4.3), and effective learning (Sect. 5).

- 3.
We show a mathematical connection between learning and fractal geometry by measuring the complexity of learning using the Hausdorff dimension and the VC dimension (Sect. 6). Specifically, we give a lower bound on the number of positive examples using the dimensions.

- 4.
We also show a connection between computability of figures studied in computable analysis and learnability of figures discussed in this paper using TTE (Sect. 7). Learning can be viewed as computable realization of the identity from the set of figures to the same set equipped with a finer topology.

The rest of the paper is organized as follows: We review related work in comparison to the present work in Sect. 2. We formalize computable binary classification as learning of figures in Sect. 3 and analyze the learnability hierarchy induced by variants of our model in Sects. 4 and 5. The mathematical connection between fractal geometry and Gold’s model with the Hausdorff and the VC dimensions is presented in Sect. 6 and between computability and learnability of figures in Sect. 7. Section 8 gives the conclusion.

A preliminary version of this paper was presented at the 21st International Conference on Algorithmic Learning Theory (Sugiyama et al. 2010). In this paper, formalization of learning in Sect. 3 is completely updated for clarity and simplicity, and all theorems and lemmas have formal proofs (they were omitted in the conference paper). Furthermore, discussion about related work in Sect. 2 and TTE analysis in Sect. 7 are new contributions. In addition, several examples and figures are added for readability.

## 2 Related work

Statistical approaches to machine learning are now achieving great success, since they were originally designed for analyzing observed multivariate data, and, to date, many statistical methods have been proposed to treat continuous objects such as real-valued functions (Bishop 2007). However, most methods pay no attention to discretization and the finite representation of analog data on a computer. For example, multi-layer perceptrons are used to learn real-valued functions, since they can approximate every continuous function arbitrarily accurately. However, a perceptron is based on the idea of regulating analog wiring (Rosenblatt 1958), hence such learning is not purely computable, i.e., it ignores the gap between analog raw data and digital discretized data. Furthermore, although several discretization techniques have been proposed by Elomaa and Rousu (2003), Fayyad and Irani (1993), Gama and Pinto (2006), Kontkanen et al. (1997), Li et al. (2003), Lin et al. (2003), Liu et al. (2002), and Skubacz and Hollmén (2000), they treat discretization as data preprocessing for improving the accuracy or efficiency of machine learning algorithms. The process of discretization is therefore not considered from a computational point of view, and the “computability” of machine learning algorithms is not discussed in sufficient depth.

There are several related articles considering learning under various restrictions in Gold’s model (Goldman et al. 2003), Valiant’s model (Ben-David and Dichterman 1998; Decatur and Gennaro 1995), and other learning context (Khardon and Roth 1999). Moreover, recently learning from partial examples, or examples with missing information, has attracted much attention in Valiant’s learning model (Michael 2010, 2011). In this paper we also consider learning from examples with missing information, which are truncated finite sequences. However, our model is different from the cited work, since the “missing information” in this paper corresponds to *measurement error* of real-valued data. Our motivation comes from actual measurement/observation of a physical object, where every datum obtained by an experimental instrument must have some numerical error in principle (Baird 1994). For example, if we measure the size of a cell by a microscope equipped with micrometers, we cannot know the true value of the size but an approximate value with numerical error, which depends on the degree of magnification by the micrometers. In this paper we try to treat this process as learning from multivariate data, where an approximate value corresponds to a truncated finite sequence and error becomes small as the length of the sequence increases. The model of computation for real numbers within the framework of TTE, as mentioned in the introduction, fits our motivation, and this approach is unique in computational learning theory.

Self-similar sets can be viewed as a geometric interpretation of languages recognized by *ω*-*automata* (Perrin and Pin 2004), first introduced by Büchi (1960), and learning of such languages has been investigated by De La Higuera and Janodet (2001), Jain et al. (2011). Both works focus on learning *ω*-languages from their prefixes, i.e. texts (positive data), and show several learnable classes. This approach is different from ours since our motivation is to address computability issues in the field of machine learning from numerical data, and hence there is a gap between prefixes of *ω*-languages and positive data for learning in our setting as mentioned in the introduction. Moreover, we consider learning from both positive and negative data, which is a new approach in the context of learning of infinite words.

Recently, two of the authors, Sugiyama and Yamamoto (2010), have addressed discretization of real vectors in a computational approach and proposed a new similarity measure, called *coding divergence*. It evaluates the similarity between two sets of real vectors and can be applied to many machine learning tasks such as classification and clustering. However, it does not address the issue of the learnability or complexity of learning of continuous objects.

## 3 Formalization of learning

Throughout this paper, ℕ is the set of natural numbers including 0, and ℕ^{+} (resp. ℝ^{+}) is the set of positive natural (resp. real) numbers. The *d*-fold product of ℝ is denoted by ℝ^{ d } and the set of nonempty compact subsets of ℝ^{ d } is denoted by \(\mathcal {K}^{*}\). Notations used in this paper are summarized in Table 1.

| Notation | Definition |
| --- | --- |
| ℕ | The set of natural numbers including 0 |
| ℕ^{+} | The set of positive natural numbers, i.e., ℕ∖{0} |
| ℚ | The set of rational numbers |
| ℝ | The set of real numbers |
| ℝ^{+} | The set of positive real numbers |
| *d* | The number of dimensions (*d*∈ℕ^{+}) |
| ℝ^{ d } | The *d*-dimensional Euclidean space |
| \(\mathcal {K}^{*}\) | The set of figures (nonempty compact subsets of ℝ^{ d }) |
| \(\mathcal {I}^{d}\) | The unit interval [0,1]×…×[0,1] |
| *K*, *L* | Figures (nonempty compact sets) |
| #*S* | The number of elements in the set *S* |
| \(\mathcal {F}\) | Set of figures |
| *φ* | Contraction for real numbers |
| *C* | Finite set of contractions |
| *Φ* | Contraction for figures |
| *Σ* | Alphabet |
| *Σ* ^{ n } | The set of finite sequences whose length is *n* |
| *Σ* ^{∗} | The set of finite sequences |
| *Σ* ^{+} | The set of finite sequences without the empty string |
| *Σ* ^{ ω } | The set of infinite sequences |
| *λ* | The empty string |
| *w*, *v* | Finite sequences |
| ↑*w* | The set {*p*∈*Σ* ^{ ω }∣*w*⊏*p*} |
| 〈⋅〉 | The tupling function, i.e., \(\langle p^{1}, p^{2}, \dots, p^{d}\rangle :=p_{0}^{1}p_{0}^{2}\dots p_{0}^{d} p_{1}^{1}p_{1}^{2}\dots p_{1}^{d} p_{2}^{1}p_{2}^{2}\dots p_{2}^{d}\dots\) |
| ∣*w*∣ | The length of *w* |
| \(\operatorname {diam}(k)\) | The diameter of the set *ρ*(*w*) with ∣*w*∣=*k* |
| *p*, *q* | Infinite sequences |
| *P* | Set of finite or infinite sequences |
| *ρ* | Binary representation |
| *ν* | Representation, i.e., a mapping from finite or infinite sequences to some objects |
| \(\nu_{\mathbb {Q}^{d}}\) | Representation for rational numbers |
| \(\nu_{\mathcal {Q}}\) | Representation for finite sets of rational numbers |
| \(\mathcal {H}\) | The hypothesis space (the set of finite sets of finite sequences) |
| *H* | Hypothesis |
| *h* | Classifier of hypothesis *H* |
| *κ* | The mapping from hypotheses to figures |
| *M* | Learner |
| *σ* | Presentation (informant or text) |
| Pos(*K*) | The set of finite sequences of positive examples of *K* |
| Pos _{ k }(*K*) | The set {*w*∈Pos(*K*)∣∣*w*∣=*k*} |
| Neg(*K*) | The set of finite sequences of negative examples of *K* |
| *d* _{E} | The Euclidean distance |
| *d* _{H} | The Hausdorff distance |
| ℌ | The Hausdorff measure |
| dim _{H} | The Hausdorff dimension |
| dim _{B} | The box-counting dimension |
| dim _{S} | The similarity dimension |
| dim _{VC} | The VC dimension |

We use the *binary representation* \(\rho^{d} : (\varSigma^{d})^{\omega} \to \mathcal {I}^{d}\) as the canonical representation for real numbers. If *d*=1, this is defined as follows:

\(\rho^{1}(p) :=\sum_{i \in \mathbb {N}} p_{i} \cdot2^{-(i + 1)},\)  (1)

where *Σ*={0,1} and *p*=*p* _{0} *p* _{1} *p* _{2}…. Note that *Σ* ^{ d } denotes the set {*a* _{1} *a* _{2}…*a* _{ d }∣*a* _{ i }∈*Σ*} and *Σ* ^{1}=*Σ*. For example, *ρ* ^{1}(0100…)=0.25, *ρ* ^{1}(1000…)=0.5, and so on. Moreover, by using the same symbol *ρ*, we introduce a representation \(\rho^{1} :\varSigma^{*} \to \mathcal {K}^{*}\) for finite sequences defined as follows:

\(\rho^{1}(w) :=\{\rho^{1}(p) \mid p \in \mathord {\uparrow }w\},\)  (2)

where ↑*w*={*p*∈*Σ* ^{ ω }∣*w*⊏*p*}. For instance, *ρ* ^{1}(01)=[0.25,0.5] and *ρ* ^{1}(10)=[0.5,0.75].

To represent real vectors in the *d*-dimensional space with *d*>1, we use the *d-dimensional binary representation* \(\rho^{d} : (\varSigma^{d})^{\omega } \to \mathcal {I}^{d}\) defined in the following manner. First, *d* infinite sequences *p* ^{1}, *p* ^{2}, …, and *p* ^{ d } are concatenated using the *tupling function* 〈⋅〉 such that

\(\langle p^{1}, p^{2}, \dots, p^{d}\rangle :=p_{0}^{1}p_{0}^{2}\dots p_{0}^{d}\,p_{1}^{1}p_{1}^{2}\dots p_{1}^{d}\,p_{2}^{1}p_{2}^{2}\dots p_{2}^{d}\dots,\)

and we define \(\rho^{d}(\langle p^{1}, \dots, p^{d}\rangle) :=(\rho^{1}(p^{1}), \dots, \rho^{1}(p^{d}))\). Similarly, for a finite sequence *w*=〈*w* ^{1},…,*w* ^{ d }〉 with |*w* ^{1}|=|*w* ^{2}|=…=|*w* ^{ d }|=*n*, we define

\(\rho^{d}(w) :=\{\rho^{d}(p) \mid p \in \mathord {\uparrow }w\} = \rho^{1}(w^{1}) \times\dots\times\rho^{1}(w^{d}).\)  (3)

Note that, for any *w*=〈*w* ^{1},…,*w* ^{ d }〉∈(*Σ* ^{ d })^{∗}, |*w* ^{1}|=|*w* ^{2}|=…=|*w* ^{ d }| always holds, and we denote the length by |*w*| in this paper. For a set of finite sequences, i.e., a *language* *L*⊂(*Σ* ^{ d })^{∗}, we define \(\rho^{d}(L) :=\bigcup_{w \in L} \rho^{d}(w)\). We omit the superscript *d* of *ρ* ^{ d } if it is understood from the context.
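To make the tupling concrete, here is a small Python sketch (ours, not the paper's), under the assumption that a symbol of *Σ* ^{ d } is stored as a length-*d* bit string: it interleaves *d* equal-length words into one word over *Σ* ^{ d } and recovers the hyper-rectangle *ρ* ^{ d }(*w*) coordinate-wise.

```python
from fractions import Fraction

def tuple_encode(words):
    """Tupling <w^1, ..., w^d>: interleave d equal-length binary words;
    symbol i of the result packs the i-th bit of every coordinate."""
    n = len(words[0])
    assert all(len(w) == n for w in words), "coordinates must share one length"
    return [''.join(w[i] for w in words) for i in range(n)]

def rho_d(tupled, d):
    """Hyper-rectangle rho^d(w) as a list of d closed dyadic intervals,
    one per coordinate, recovered from the interleaved word."""
    rects = []
    for j in range(d):
        lo = sum(Fraction(int(sym[j]), 2 ** (i + 1)) for i, sym in enumerate(tupled))
        rects.append((lo, lo + Fraction(1, 2 ** len(tupled))))
    return rects

# <01, 10> denotes the rectangle [0.25, 0.5] x [0.5, 0.75]
w = tuple_encode(["01", "10"])
assert rho_d(w, 2) == [(Fraction(1, 4), Fraction(1, 2)),
                       (Fraction(1, 2), Fraction(3, 4))]
```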

We assume that a set of figures is fixed *a priori*, and one of them is chosen as a target in each learning phase. A learning machine uses *self-similar sets*, known as fractals and defined by finite sets of contractions, to express the target. This approach is one of the key ideas in this paper. Here, a *contraction* is a mapping *φ*:ℝ^{ d }→ℝ^{ d } such that, for all *x*,*y*∈ℝ^{ d }, *d* _{E}(*φ*(*x*),*φ*(*y*))≤*c*⋅*d* _{E}(*x*,*y*) for some real number *c* with 0<*c*<1. For a finite set of contractions *C*, a nonempty compact set *F* satisfying

\(F = \bigcup_{\varphi \in C} \varphi (F)\)

is called the *self-similar set* of *C*. Moreover, if we define a mapping \(\varPhi:\mathcal {K}^{*} \to \mathcal {K}^{*}\) by

\(\varPhi(K) :=\bigcup_{\varphi \in C} \varphi (K)\)  (4)

and define

\(\varPhi^{0}(K) :=K,\qquad \varPhi^{k + 1}(K) :=\varPhi\bigl(\varPhi^{k}(K)\bigr)\)  (5)

for each *k*∈ℕ recursively, then \(F = \bigcap_{k \in \mathbb {N}} \varPhi^{k}(K)\) for every \(K \in \mathcal {K}^{*}\) such that *φ*(*K*)⊂*K* for every *φ*∈*C*. This means that we have a level-wise construction algorithm with *Φ* to obtain the self-similar set *F*.

A learning machine produces *hypotheses*, each of which is a finite language and becomes a finite expression of a self-similar set that works as a classifier. Formally, for a finite language *H*⊂(*Σ* ^{ d })^{∗}, we consider the sequence of languages *H* ^{0}, *H* ^{1}, *H* ^{2},… such that *H* ^{ k } is recursively defined as follows:

\(H^{0} :=\{\lambda\},\qquad H^{k + 1} :=\{vw \mid v \in H^{k} \text{ and } w \in H\}.\)

Thus we have a simple procedure *P*(⋅) which generates *H* ^{0}, *H* ^{1}, *H* ^{2},… when receiving a hypothesis *H*. We give the semantics of a hypothesis *H* by the following equation:

\(\kappa(H) :=\bigcap_{k \in \mathbb {N}}\,\bigcup_{w \in H^{k}} \rho(w).\)  (6)

Since \(\bigcup_{w \in H^{k}} \rho(w) \supseteq\bigcup_{w \in H^{k+1}} \rho(w)\) holds for all *k*∈ℕ, we also have \(\kappa(H) = \lim_{k \to\infty} \bigcup_{w \in H^{k}} \rho(w)\). We denote the set of hypotheses {*H*⊂(*Σ* ^{ d })^{∗}∣*H* is finite} by \(\mathcal {H}\) and call it the *hypothesis space*. We use this hypothesis space throughout the paper. Note that, for a pair of hypotheses *H* and *L*, *H*=*L* implies *κ*(*H*)=*κ*(*L*), but the converse may not hold.

## Example 3.1

Let *d*=2 and let a hypothesis *H* be the set {〈0,0〉,〈0,1〉,〈1,1〉}={00,01,11}. We have

\(\bigcup_{w \in H^{1}} \rho(w) = \bigl([0, 0.5] \times[0, 0.5]\bigr) \cup\bigl([0, 0.5] \times[0.5, 1]\bigr) \cup\bigl([0.5, 1] \times[0.5, 1]\bigr),\)

and the figure *κ*(*H*) defined in (6) is the *Sierpiński triangle* (Fig. 2). If we consider the following three mappings

\(\varphi _{1}(x) = \tfrac{1}{2}x,\qquad \varphi _{2}(x) = \tfrac{1}{2}x + \bigl(0, \tfrac{1}{2}\bigr),\qquad \varphi _{3}(x) = \tfrac{1}{2}x + \bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr),\)

the unit interval \(\mathcal {I}^{2}\) is mapped by them to *ρ*(00), *ρ*(01), and *ρ*(11), respectively. Thus each sequence in a hypothesis can be viewed as a representation of one of these squares, which are called *generators* for a self-similar set, since if we have the initial set \(\mathcal {I}^{d}\) and generators \(\varphi _{1}(\mathcal {I}^{d})\), \(\varphi _{2}(\mathcal {I}^{d})\), and \(\varphi _{3}(\mathcal {I}^{d})\), we can reproduce the three mappings *φ* _{1}, *φ* _{2}, and *φ* _{3} and construct the self-similar set from them. Note that there exist infinitely many hypotheses *L* such that *κ*(*H*)=*κ*(*L*) and *H*≠*L*. For example, *L*={〈0,0〉, 〈1,1〉, 〈00,10〉, 〈00,11〉, 〈01,11〉}.
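The level sets *H* ^{ k } of Example 3.1 are easy to enumerate by brute force. The following sketch is our illustration (the names `level` and `rect` are hypothetical); it generates *H* ^{ k } and the dyadic squares of the level-*k* collage converging to the Sierpiński triangle:

```python
def level(H, k):
    """H^0 = {()} (the empty word); H^{k+1} = {uv | u in H^k, v in H}."""
    Hk = {()}
    for _ in range(k):
        Hk = {u + v for u in Hk for v in H}
    return Hk

def rect(word, axis):
    """Closed dyadic interval of one coordinate for a word over Sigma^2,
    where each symbol packs one x-bit and one y-bit."""
    lo = sum(int(sym[axis]) / 2 ** (i + 1) for i, sym in enumerate(word))
    return lo, lo + 1 / 2 ** len(word)

H = {('00',), ('01',), ('11',)}          # the hypothesis of Example 3.1
squares = [(rect(w, 0), rect(w, 1)) for w in level(H, 2)]
assert len(squares) == 9                 # 3^2 squares at level 2
```

Intersecting the unions of these squares over increasing *k* approaches *κ*(*H*) in the Hausdorff metric.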

## Lemma 3.2

(Soundness of hypotheses)

*For every hypothesis* \(H \in \mathcal {H}\), *the set* *κ*(*H*) *defined by *(6) *is a self*-*similar set*.

## Proof

Let *H*={*w* _{1},*w* _{2},…,*w* _{ n }}. We can easily check that the set of rectangles *ρ*(*w* _{1}),*ρ*(*w* _{2}),…,*ρ*(*w* _{ n }) is a generator defined by the mappings *φ* _{1},*φ* _{2},…,*φ* _{ n }, where each *φ* _{ i } maps the unit interval \(\mathcal {I}^{d}\) to the figure *ρ*(*w* _{ i }). Define *Φ* and *Φ* ^{ k } in the same way as (4) and (5). For each *k*∈ℕ, \(\varPhi^{k}(\mathcal {I}^{d}) = \bigcup_{w \in H^{k}} \rho(w)\). Thus *κ*(*H*) is exactly the same as the self-similar set defined by the mappings *φ* _{1},*φ* _{2},…,*φ* _{ n }, that is, \(\kappa(H) = \bigcup_{i = 1}^{n} \varphi _{i}(\kappa(H))\) holds. □

In learning, we evaluate each hypothesis by the *generalization error*, which is usually used to score the quality of hypotheses in a machine learning context. The generalization error of a hypothesis *H* for a target figure *K*, written \(\mbox {$\mathrm {GE}$}(K, H)\), is defined by the *Hausdorff metric* *d* _{H} on the space of figures, i.e.,

\(\mbox {$\mathrm {GE}$}(K, H) :=d_{\mathrm {H}}\bigl(K, \kappa(H)\bigr) = \inf\bigl\{\delta \in \mathbb {R}^{+} \bigm| K \subseteq\kappa(H)_{\delta} \text{ and } \kappa(H) \subseteq K_{\delta}\bigr\},\)

where *K* _{ δ } is the *δ*-*neighborhood* of *K* defined by

\(K_{\delta} :=\bigl\{x \in \mathbb {R}^{d} \bigm| d_{\mathrm {E}}(x, a) \le\delta \text{ for some } a \in K\bigr\},\)

and *d* _{E} is the Euclidean metric such that

\(d_{\mathrm {E}}(x, a) = \Bigl(\sum_{i = 1}^{d} \bigl(x^{i} - a^{i}\bigr)^{2}\Bigr)^{1/2}\)

for *x*=(*x* ^{1},…,*x* ^{ d }), *a*=(*a* ^{1},…,*a* ^{ d })∈ℝ^{ d }. The Hausdorff metric is one of the standard metrics on the space since the metric space \((\mathcal {K}^{*}, d_{\mathrm {H}})\) is complete (in the sense of topology) and \(\mbox {$\mathrm {GE}$}(K, H) = 0\) if and only if *K*=*κ*(*H*) (Beer 1993; Kechris 1995). The topology on \(\mathcal {K}^{*}\) induced by the Hausdorff metric is called the *Vietoris topology*. Since the cardinality of the set of hypotheses \(\mathcal {H}\) is smaller than that of the set of figures \(\mathcal {K}^{*}\), we often cannot find the exact hypothesis *H* for a figure *K* such that \(\mbox {$\mathrm {GE}$}(K, H) = 0\). However, following the Collage Theorem given by Falconer (2003), we show that the representational power of hypotheses is still sufficient, that is, we can always approximate a given figure arbitrarily closely by some hypothesis.
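For intuition, the Hausdorff distance between finite point sets (which stand in for compact figures here) can be computed directly from its definition; this is our sketch, not from the paper:

```python
import math

def hausdorff(A, B):
    """d_H(A, B) = max of the two directed distances, where the directed
    distance from X to Y is max over x in X of min over y in Y of d_E(x, y)."""
    def directed(X, Y):
        return max(min(math.dist(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))

K = [(0.0, 0.0), (1.0, 0.0)]
L = [(0.0, 0.0)]
assert hausdorff(K, L) == 1.0  # the point (1, 0) is at distance 1 from L
```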

## Lemma 3.3

(Representational power of hypotheses)

*For any* *δ*∈ℝ^{+} *and for every figure* \(K \in \mathcal {K}^{*}\), *there exists a hypothesis* *H* *such that* \(\mbox {$\mathrm {GE}$}(K, H) < \delta\).

## Proof

Fix a target figure *K* and the parameter *δ*. Here we denote the diameter of the set *ρ*(*w*) with |*w*|=*k* by \(\operatorname {diam}(k)\). Then we have \(\operatorname {diam}(k) = \sqrt{d} \cdot2^{-k}\); for example, \(\operatorname {diam}(1) = 1/2\) and \(\operatorname {diam}(2) = 1/4\) if *d*=1, and \(\operatorname {diam}(1) = 1/\sqrt{2}\) and \(\operatorname {diam}(2) = 1/\sqrt{8}\) if *d*=2. For *k* with \(\operatorname {diam}(k) < \delta\), let

\(H :=\bigl\{w \in(\varSigma^{d})^{*} \bigm| |w| = k \text{ and } \rho(w) \cap K \not= \emptyset\bigr\}.\)

Then the \(\operatorname {diam}(k)\)-neighborhood of *K* contains *κ*(*H*) and the \(\operatorname {diam}(k)\)-neighborhood of *κ*(*H*) contains *K*. Therefore we have \(\mbox {$\mathrm {GE}$}(K, H) \le \operatorname {diam}(k) < \delta\). □
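The construction in the proof can be run on a toy target. The sketch below is ours (the names `pos_k` and `disk_meets` and the disk target are assumptions, not from the paper); it collects all level-*k* cells whose closed rectangle meets the figure, i.e., the hypothesis *H* of the proof, with \(\operatorname {diam}(k) = \sqrt{d}\cdot2^{-k} < \delta\):

```python
import math
from itertools import product

def pos_k(meets_target, k, d=2):
    """All level-k grid cells (integer multi-indices, standing for the
    length-k words) whose closed dyadic cell meets the target figure."""
    H = []
    for cell in product(range(2 ** k), repeat=d):
        lo = [c / 2 ** k for c in cell]
        hi = [(c + 1) / 2 ** k for c in cell]
        if meets_target(lo, hi):
            H.append(cell)
    return H

def disk_meets(lo, hi, centre=(0.5, 0.5), r=0.3):
    """Does the box [lo, hi] intersect the closed disk? Clamping the
    centre into the box gives the nearest box point to the centre."""
    nearest = [min(max(c, l), h) for c, l, h in zip(centre, lo, hi)]
    return math.dist(nearest, centre) <= r

k = 4                      # diam(4) = sqrt(2) / 16 < 0.1, so GE < 0.1
H = pos_k(disk_meets, k)
assert (8, 8) in H and (0, 0) not in H
```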

To work as a classifier, a hypothesis *H* has to be *computable*, that is, the function *h*:(*Σ* ^{ d })^{∗}→{0,1} such that, for all *w*∈(*Σ* ^{ d })^{∗},

\(h(w) :=1 \text{ if } \rho(w) \cap\kappa(H) \not= \emptyset,\quad\text{and}\quad h(w) :=0 \text{ otherwise}\)  (7)

should be computable. We call *h* the *classifier* of *H*. The computability of *h* is not trivial, since for a finite sequence *w*, the two conditions *h*(*w*)=1 and *w*∈*H* ^{ k } for some *k* are not equivalent. Intuitively, this is because each interval represented by a finite sequence is *closed*. For example, in the case of Example 3.1, *h*(10)=1 because *ρ*(10)=[0.5,1]×[0,0.5] and *ρ*(10)∩*κ*(*H*)={(0.5,0.5)}≠∅, whereas 10∉*H* ^{ k } for any *k*∈ℕ. Here we guarantee this property of computability.
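A naive way to approximate the classifier *h* is the level-wise test suggested by the text (this is our finite-depth sketch, not the paper's Algorithm 1): the sets \(\bigcup_{w \in H^{k}} \rho(w)\) shrink to *κ*(*H*), so if *ρ*(*w*) misses some level, then surely *h*(*w*)=0; if it still meets the deepest level checked, we report 1. At finite depth this is only an over-approximation; resolving it exactly at the boundaries is precisely what Algorithm 1 below does.

```python
def interval(word, axis):
    """Closed dyadic interval of one coordinate; the empty word gives [0, 1].
    Each symbol of the word is a length-d bit string."""
    lo = sum(int(sym[axis]) / 2 ** (i + 1) for i, sym in enumerate(word))
    return lo, lo + 1 / 2 ** len(word)

def meets(w, v):
    """Do the closed rectangles rho(w) and rho(v) intersect?"""
    if not (w or v):
        return True
    d = len((w or v)[0])
    for axis in range(d):
        lw, hw = interval(w, axis)
        lv, hv = interval(v, axis)
        if hw < lv or hv < lw:
            return False
    return True

def h_upto(w, H, depth):
    """Return 0 as soon as rho(w) misses some level H^k (then h(w) = 0
    for sure); otherwise report 1 after `depth` levels (over-approximation)."""
    Hk = {()}
    for _ in range(depth):
        if not any(meets(w, v) for v in Hk):
            return 0
        Hk = {u + v for u in Hk for v in H}
    return 1 if any(meets(w, v) for v in Hk) else 0

H = {('00',), ('01',), ('11',)}         # Sierpinski triangle of Example 3.1
assert h_upto(('10',), H, 6) == 1       # touches kappa(H) at (0.5, 0.5)
assert h_upto(('10', '00'), H, 6) == 0  # separates from the triangle early
```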

## Lemma 3.4

(Computability of classifiers)

*For every hypothesis* \(H \in \mathcal {H}\), *the classifier* *h* *of H* *defined by *(7) *is computable*.

## Proof

First we consider whether or not the boundary of an interval is contained in *κ*(*H*). Suppose *d*=1 and let *C* be a finite set of contractions and *F* be the self-similar set of *C*. We have the following property: Let \([x, y] = \varphi _{1} \circ \varphi _{2} \circ\dots\circ \varphi _{n} (\mathcal {I}^{1})\) for some *φ* _{1},*φ* _{2},…,*φ* _{ n }∈*C* and let \(I = \varphi '_{1} \circ \varphi '_{2} \circ\dots\circ \varphi '_{n'} (\mathcal {I}^{1})\) for \(\varphi '_{1}, \varphi '_{2}, \dots, \varphi '_{n'} \in C\). Assume that, if *n*′ is large enough, there is no such *I* satisfying *x*∈*I* and min*I*<*x* (resp. max*I*>*y*). Then, we have *x*∈*F* (resp. *y*∈*F*) if and only if \(0 \in \varphi (\mathcal {I}^{1})\) (resp. \(1 \in \varphi (\mathcal {I}^{1})\)) for some *φ*∈*C*. This means that if [*x*,*y*]=*ρ*(*v*) with a sequence *v*∈*H* ^{ k } (*k*∈ℕ) for a hypothesis *H*, where there is no sequence *v*′∈*H* ^{ k′} with *x*∈*ρ*(*v*′) and min*ρ*(*v*′)<*x* (resp. max*ρ*(*v*′)>*y*) when *k*′ is large enough, we have *x*∈*κ*(*H*) (resp. *y*∈*κ*(*H*)) if and only if *u*∈{0}^{+} (resp. *u*∈{1}^{+}) for some *u*∈*H*.

We give the procedure of *h* in Algorithm 1 and prove that the output of the algorithm is 1 if and only if *h*(*w*)=1, i.e., *ρ*(*w*)∩*κ*(*H*)≠∅. In the algorithm, \(\underline{v^{s}}\) and \(\overline{v^{s}}\) denote the previous and subsequent binary sequences of *v* ^{ s } with \(|v^{s}| = |\underline{v^{s}}| = |\overline{v^{s}}|\) in the lexicographic order, respectively. For example, if *v* ^{ s }=001, then \(\underline{v^{s}} = \mathtt {0}\mathtt {0}\mathtt {0}\) and \(\overline{v^{s}} = \mathtt {0}\mathtt {1}\mathtt {0}\). Moreover, we use the special symbol ⊥ meaning undefinedness, that is, *v*=*w* if and only if *v* _{ i }=*w* _{ i } for all *i*∈{0,1,…,|*v*|−1} with *v* _{ i }≠⊥ and *w* _{ i }≠⊥.

The “if” part: for a finite sequence *w* and a hypothesis *H*, if *h*(*w*)=1, there are two possibilities:

- 1.
  For some *k*∈ℕ, there exists *v*∈*H* ^{ k } such that *w*⊑*v*. This is because *ρ*(*w*)⊇*ρ*(*v*) and *ρ*(*v*)∩*κ*(*H*)≠∅.
- 2.
  The above condition does not hold, but *ρ*(*w*)∩*κ*(*H*)≠∅.

In the first case, Algorithm 1 finds such a *v* directly and returns 1. In the second case, *ρ*(*w*) meets *κ*(*H*) only on the boundary of *ρ*(*w*); then, for *h*(*w*)=1, there should exist a sequence *v*∈*H* ^{ k } adjacent to *ρ*(*w*) on that boundary together with *u*=*aaa*…*a* for some *u*∈*H*, where *a* is obtained in lines 1–10. CheckBoundary therefore returns 1.

The “only if” part: In Algorithm 1, if *v*∈*H* ^{ k } satisfies the conditions in line 6 or line 8, then *ρ*(*w*)∩*κ*(*H*)≠∅. Thus *h*(*w*)=1 holds. □

The set {*κ*(*H*)∣ *H*⊂(*Σ* ^{ d })^{∗} and the classifier *h* of *H* is computable} exactly corresponds to an *indexed family of recursive concepts/languages* discussed in computational learning theory (Angluin 1980), which is a common assumption for learning of languages. On the other hand, there exists some class of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) that is not an indexed family of recursive concepts. This means that, for some figure *K*, there is no *computable* classifier which classifies all data correctly. Therefore we address the problems of both exact and approximate learning of figures to obtain a computable classifier for any target figure.

Each datum given to a learner is called an *example* and is defined as a pair (*w*, *l*) of a finite sequence *w*∈(*Σ* ^{ d })^{∗} and a label *l*∈{0,1}. For a target figure *K*, we denote the set of finite sequences of positive examples {*w*∈(*Σ* ^{ d })^{∗}∣*ρ*(*w*)∩*K*≠∅} by Pos(*K*) and that of negative examples by Neg(*K*). Moreover, we denote Pos_{ k }(*K*)={*w*∈Pos(*K*)∣|*w*|=*k*}. From the geometric nature of figures, we obtain the following *monotonicity* of examples:

## Lemma 3.5

(Monotonicity of examples)

*If* (*v*,1) *is an example of* *K*, *then* (*w*,1) *is an example of* *K* *for all prefixes* *w*⊑*v*, *and* (*va*,1) *is an example of* *K* *for some* *a*∈*Σ* ^{ d }. *If* (*w*,0) *is an example of* *K*, *then* (*wv*,0) *is an example of* *K* *for all* *v*∈(*Σ* ^{ d })^{∗}.

## Proof

From the definition of the representation *ρ* in (1) and (3), if *w*⊑*v*, we have *ρ*(*w*)⊇*ρ*(*v*), hence (*w*,1) is an example of *K*. Moreover, since *ρ*(*v*) is covered by the sets *ρ*(*va*) with *a*∈*Σ* ^{ d }, the figure *K* intersects *ρ*(*va*) for some *a*∈*Σ* ^{ d }, which yields the example (*va*,1). Furthermore, for all *v*∈(*Σ* ^{ d })^{∗}, *ρ*(*wv*)⊂*ρ*(*w*). Therefore if *K*∩*ρ*(*w*)=∅, then *K*∩*ρ*(*wv*)=∅ for all *v*∈(*Σ* ^{ d })^{∗}, and (*wv*,0) is an example of *K*. □

An infinite sequence *σ* of examples of a figure *K* is a *presentation* of *K*. The *i*th example is denoted by *σ*(*i*−1), and the set of all examples occurring in *σ* is denoted by \(\operatorname {range}(\sigma )\).^{3} The initial segment of *σ* of length *n*, i.e., the sequence *σ*(0),*σ*(1),…,*σ*(*n*−1), is denoted by *σ*[*n*−1]. A *text* of a figure *K* is a presentation *σ* such that {*w*∣(*w*,1)∈\(\operatorname {range}(\sigma )\)}=Pos(*K*). An *informant* is a presentation *σ* such that {*w*∣(*w*,1)∈\(\operatorname {range}(\sigma )\)}=Pos(*K*) and {*w*∣(*w*,0)∈\(\operatorname {range}(\sigma )\)}=Neg(*K*).

Table 2 shows the relationship between the standard terminology in classification and our definitions. For a target figure *K* and the classifier *h* of a hypothesis *H*, the set {*w*∈Pos(*K*)∣*h*(*w*)=1} corresponds to *true positives*, {*w*∈Neg(*K*)∣*h*(*w*)=1} to *false positives* (type I errors), {*w*∈Pos(*K*)∣*h*(*w*)=0} to *false negatives* (type II errors), and {*w*∈Neg(*K*)∣*h*(*w*)=0} to *true negatives*.

Relationship between the conditions for each finite sequence *w*∈*Σ* ^{∗} and the standard notation of binary classification

| | Target figure: *w*∈Pos(*K*) | Target figure: *w*∈Neg(*K*) |
|---|---|---|
| Hypothesis: *h*(*w*)=1 | True positive | False positive (type I error) |
| Hypothesis: *h*(*w*)=0 | False negative (type II error) | True negative |

Let *h* be the classifier of a hypothesis *H*. We say that the hypothesis *H* is *consistent* with an example (*w*,*l*) if *l*=1 implies *h*(*w*)=1 and *l*=0 implies *h*(*w*)=0, and consistent with a set of examples *E* if *H* is consistent with all examples in *E*.

A learning machine, called a *learner*, is a procedure (i.e., a Turing machine that never halts) that reads a presentation of a target figure from time to time, and outputs hypotheses from time to time. In the following, we denote a learner by **M** and an infinite sequence of hypotheses produced by **M** on the input *σ* by **M** _{ σ }, where **M** _{ σ }(*i*−1) denotes the *i*th hypothesis produced by **M**. Assume that **M** has received *j* examples *σ*(0),*σ*(1),…,*σ*(*j*−1) when it outputs the *i*th hypothesis **M** _{ σ }(*i*−1). We do not require the condition *i*=*j*; indeed, the inequality *i*≤*j* usually holds, since **M** can "wait" until it receives enough examples. We say that an infinite sequence of hypotheses **M** _{ σ } *converges* to a hypothesis *H* if there exists *n*∈ℕ such that **M** _{ σ }(*i*)=*H* for all *i*≥*n*.
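Convergence of **M** _{ σ } is a property of the infinite sequence; on a finite prefix one can only observe a proxy, e.g. that the most recent outputs agree. A small sketch (names and the cutoff are ours, purely illustrative):

```python
def converged_prefix(hyps, tail=3):
    """True iff the last `tail` outputs are identical.  This is only a
    finite-prefix proxy: true convergence is a limit property and cannot
    be decided from any finite prefix of the hypothesis stream."""
    return len(hyps) >= tail and len(set(map(str, hyps[-tail:]))) == 1

assert converged_prefix(["H0", "H2", "H1", "H1", "H1"])
assert not converged_prefix(["H0", "H1", "H0"])
```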

## 4 Exact learning of figures

We analyze “exact” learning of figures. This means that, for any target figure *K*, there should be a hypothesis *H* such that the generalization error is zero (i.e., *K*=*κ*(*H*)), hence the classifier *h* of *H* can classify all data correctly with no error, that is, *h* satisfies (7). The goal is to find such a hypothesis *H* from examples (training data) of *K*.

### 4.1 Explanatory learning

The most basic learning criterion in Gold's model is **EX**-learning (EX means EXplain), i.e., learning in the limit, proposed by Gold (1967). We introduce this criterion into the learning of figures, calling the resulting criteria **FIGEX**-**INF**-learning (INF means an informant) and **FIGEX**-**TXT**-learning (TXT means a text) for **EX**-learning from informants and texts, respectively, and analyze the learnability of figures.

## Definition 4.1

(Explanatory learning)

A learner **M** **FIGEX**-**INF** *-learns* (resp. **FIGEX**-**TXT** *-learns*) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) *σ* of *K*, the outputs **M** _{ σ } converge to a hypothesis *H* such that \(\mbox {$\mathrm {GE}$}(K, H) = 0\).

For every learning criterion **CR** introduced in the following, we say that a set of figures \(\mathcal {F}\) is **CR**-*learnable* if there exists a learner that **CR**-learns \(\mathcal {F}\), and denote by **CR** the collection of **CR**-learnable sets of figures following the standard notation of this field (Jain et al. 1999).

First, we consider **FIGEX**-**INF**-learning. Informally, a learner can **FIGEX**-**INF**-learn a set of figures if it has the ability to enumerate all hypotheses and to judge whether or not each hypothesis is consistent with the received examples (Gold 1967). Here we introduce a convenient enumeration of hypotheses. An infinite sequence of hypotheses *H* _{0},*H* _{1},… is called a *normal enumeration* if \(\left \{H_{i} | i \in \mathbb {N}\right \} = \mathcal {H}\) and, for all *i*,*j*∈ℕ, *i*<*j* implies \(\sum_{w \in H_{i}} |w| \le \sum_{w \in H_{j}} |w|\).

## Theorem 4.2

*The set of figures* \(\kappa (\mathcal {H}) = \left \{\kappa(H) | H \in \mathcal {H}\right \}\) *is* **FIGEX**-**INF**-*learnable*.

## Proof

We show a learner **M** that **FIGEX**-**INF**-learns \(\kappa (\mathcal {H})\) in Procedure 1. The learner **M** generates hypotheses through normal enumeration. If **M** outputs a wrong hypothesis *H*, there must exist a positive or negative example that is not consistent with the hypothesis, since for a target figure *K* _{∗} we have \(\mathrm {Pos}(K_{*}) \ominus \mathrm {Pos}(\kappa(H)) \not= \emptyset\) for every hypothesis *H* with *κ*(*H*)≠*K* _{∗}, where *X*⊖*Y* denotes the *symmetric difference*, i.e., *X*⊖*Y*=(*X*∪*Y*)∖(*X*∩*Y*). Thus the learner **M** changes the wrong hypothesis and reaches a correct hypothesis *H* _{∗} such that *κ*(*H* _{∗})=*K* _{∗} in finite time. If **M** produces a correct hypothesis, it never changes the hypothesis, since every example is consistent with it. Therefore the learner **M** **FIGEX**-**INF**-learns \(\kappa (\mathcal {H})\). □
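Since Procedure 1 itself is not reproduced in this excerpt, the identification-by-enumeration idea behind it can be sketched generically (the toy hypotheses and consistency test below are our illustration, not the paper's construction):

```python
def learn(hypotheses, consistent, examples):
    """Identification by enumeration: after each example, output the first
    enumerated hypothesis consistent with all examples received so far."""
    E = []
    for ex in examples:
        E.append(ex)
        yield next(H for H in hypotheses if consistent(H, E))

# toy instance: hypotheses are sets of naturals, examples are (x, label) pairs
hyps = [{0}, {0, 1}, {0, 1, 2}]
cons = lambda H, E: all((x in H) == l for x, l in E)
outputs = list(learn(hyps, cons, [(0, 1), (2, 0), (1, 1)]))
assert outputs[0] == {0}        # first consistent guess
assert outputs[-1] == {0, 1}    # converges once the informant separates it
```

Once a correct hypothesis is reached, no later example contradicts it, so the output stream stabilizes: this is exactly the convergence required in Definition 4.1.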

Next, we consider **FIGEX**-**TXT**-learning. In learning of languages from texts, the necessary and sufficient conditions for learning have been studied in detail by Angluin (1980, 1982), Kobayashi (1996), Lange et al. (2008), Motoki et al. (1991), Wright (1989), and characterization of learnability using finite tell-tale sets is one of the crucial results. We adapt these results into the learning of figures and show the **FIGEX**-**TXT**-learnability.

## Definition 4.3

(Finite tell-tale set, cf. Angluin 1980)

Let \(\mathcal {F}\) be a set of figures. For a figure \(K \in \mathcal {F}\), a finite subset \(\mathcal {T}\) of the set of positive examples Pos(*K*) is a *finite tell-tale set of* *K* *with respect to* \(\mathcal {F}\) if for all figures \(L \in \mathcal {F}\), \(\mathcal {T}\subset \mathrm {Pos}(L)\) implies \(\mathrm {Pos}(L) \not \subset \mathrm {Pos}(K)\) (i.e., \(L \not\subset K\)). If every figure \(K \in \mathcal {F}\) has finite tell-tale sets with respect to \(\mathcal {F}\), we say that \(\mathcal {F}\) has finite tell-tale sets.
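On a finite toy class, the tell-tale condition can be stated directly as executable code (sets of naturals stand in for the sets Pos(·); this is our illustration, not the paper's construction):

```python
def is_telltale(T, K, F):
    """T is a finite tell-tale of K w.r.t. the class F (sets standing in
    for Pos(K)): T must lie inside K, and no L in F that contains T may be
    a proper subset of K."""
    return T <= K and all(not (T <= L and L < K) for L in F)

F = [{1}, {1, 2}, {1, 2, 3}]
assert is_telltale({1}, {1}, F)            # nothing in F sits strictly below {1}
assert not is_telltale({1}, {1, 2, 3}, F)  # {1, 2} witnesses the failure
assert is_telltale({1, 2, 3}, {1, 2, 3}, F)
```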

## Theorem 4.4

*Let* \(\mathcal {F}\) *be a subset of* \(\kappa (\mathcal {H})\). *Then* \(\mathcal {F}\) *is* **FIGEX**-**TXT**-*learnable if and only if there is a procedure that*, *for every figure* \(K \in \mathcal {F}\), *enumerates a finite tell*-*tale set* *W* *of* *K* *with respect to* \(\mathcal {F}\).

This theorem can be proved in exactly the same way as that for learning of languages given by Angluin (1980). Note that such a procedure does not need to halt. Using this theorem, we show that the set \(\kappa (\mathcal {H})\) is not **FIGEX**-**TXT**-learnable.

## Theorem 4.5

*The set* \(\kappa (\mathcal {H})\) *does not have finite tell*-*tale sets*.

## Proof

Fix a figure \(K = \kappa(H) \in \kappa (\mathcal {H})\), where there exists a pair *v*,*w*∈*H* such that \(\rho(vvv\dots) \not= \rho(www\dots)\), and fix a finite set \(T = \left \{w_{1}, w_{2}, \dots, w_{n}\right \}\) contained in Pos(*K*). Suppose that #Pos_{ m }(*K*)>*n* holds for a natural number *m*. For each finite sequence *w* _{ i }, there exists *u* _{ i }∈Pos(*K*) such that |*u* _{ i }|>*m*, *w* _{ i }⊏*u* _{ i }, and *u* _{ i }∈*H* ^{ k } for some *k*. For the figure *L*=*κ*(*U*) with *U*={*u* _{1},*u* _{2},…,*u* _{ n }}, *T*⊂Pos(*L*) and Pos(*L*)⊂Pos(*K*) hold. Therefore *K* has no finite tell-tale set with respect to \(\kappa (\mathcal {H})\). □

## Corollary 4.6

*The set of figures* \(\kappa (\mathcal {H})\) *is not* **FIGEX**-**TXT**-*learnable*.

In any realistic scenario of machine learning, however, this set \(\kappa (\mathcal {H})\) is too large to search for the best hypothesis, since we usually want to obtain a "compact" representation of a target figure. Thus we (implicitly) have an upper bound on the number of elements in a hypothesis. Here we give a positive result for the above situation, that is, if we fix the number of elements #*H* in each hypothesis *H* *a priori*, the resulting set of figures becomes **FIGEX**-**TXT**-learnable. Intuitively, this is because if we take *k* large enough, the set {*w*∈Pos(*K*)∣|*w*|≤*k*} becomes a finite tell-tale set of *K*. Here we denote by Red(*H*) the hypothesis in which for every pair *v*,*w*∈*H* with |*v*|≤|*w*|, *w* is removed if *ρ*(*vvv*…)=*ρ*(*www*…). For a finite subset of natural numbers *N*⊂ℕ, we define the set of hypotheses \(\mathcal {H}_{N} := \{H \in \mathcal {H}\mid\#\mathrm {Red}(H) \in N\}\).

## Theorem 4.7

*There exists a procedure that*, *for all finite subsets* *N*⊂ℕ *and all figures* \(K \in \kappa (\mathcal {H}_{N})\), *enumerates a finite tell*-*tale set of* *K* *with respect to* \(\kappa (\mathcal {H}_{N})\).

## Proof

First, we assume that *N*={1}. It is trivial that there exists a procedure that, for an arbitrary figure \(K \in \kappa (\mathcal {H}_{N})\), enumerates a finite tell-tale set of *K* with respect to \(\kappa (\mathcal {H}_{N})\), since we always have \(L \not\subset K\) for all pairs of figures \(K, L \in \kappa (\mathcal {H}_{N})\).

Next, we consider the case where *N*⊂ℕ with *N*≠{1}. Let us consider the procedure that enumerates the elements of the sets Pos_{1}(*K*), Pos_{2}(*K*), … in increasing order of the level; we show that it enumerates a finite tell-tale set of *K* with respect to \(\kappa (\mathcal {H}_{N})\). It is enough to show that there exists a natural number *m* such that there is no hypothesis *H* with *κ*(*H*)⊂*K*, #*H*≤max*N*, and Pos(*κ*(*H*))⊃Pos_{ m }(*K*).

We construct a tree as follows (a similar technique, called *d*-explorer, was used by Jain and Sharma (1997)). Each node has a pair (*H*,*w*) as its label, where *κ*(*H*)⊂*K* and *w*∈Pos(*K*)∖Pos(*κ*(*H*)). The root node is labeled (∅,*v*) with a finite sequence *v*∈Pos(*K*). The tree is constructed iteratively by adding children for each node of the tree whose depth (the distance to the root) is at most max*N*−1. Let the label of such a node be (*H*,*w*). For every finite sequence *w*′ with |*w*′|≤|*w*|, if there exists a finite sequence *w*″ satisfying |*w*″|>|*w*| and *w*″∈Pos(*K*)∖*κ*(*H*∪{*w*′}), add a child labeled (*H*∪{*w*′},*w*″) to the node.

The above tree is bounded in depth by max*N* and the number of children of any node is always finite, hence the number of nodes of the tree is finite. Let *m* be the length of the longest *w* such that (*H*,*w*) is the label of a node of the tree. Then, we can easily check that there is no hypothesis *H*′ such that *κ*(*H*′)⊂*K*, #*H*′≤max*N*, and Pos(*κ*(*H*′))⊃Pos_{ m }(*K*). □

## Corollary 4.8

*For all finite subsets of natural numbers* *N*⊂ℕ, *the set of figures* \(\kappa(\mathcal {H}_{N})\) *is* **FIGEX**-**TXT**-*learnable*.

### 4.2 Consistent learning

In a learning process, it is natural that every hypothesis generated by a learner is consistent with the examples received by it so far. Here we introduce **FIGCONS**-**INF**- and **FIGCONS**-**TXT**-learning (CONS means CONSistent). These criteria correspond to **CONS**-learning that was first introduced by Blum and Blum (1975).^{4} This model was also used (but implicitly) in the Model Inference System (MIS) proposed by Shapiro (1981), Shapiro (1983), and studied in the computational learning of formal languages and recursive functions (Jain et al. 1999).

## Definition 4.9

(Consistent learning)

A learner **M** **FIGCONS**-**INF**-*learns* (resp. **FIGCONS**-**TXT**-*learns*) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if **M** **FIGEX**-**INF**-learns (resp. **FIGEX**-**TXT**-learns) \(\mathcal {F}\) and for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) *σ* of *K*, each hypothesis **M** _{ σ }(*i*) is consistent with *E* _{ i } that is the set of examples received by **M** until just before it generates the hypothesis **M** _{ σ }(*i*).

Assume that a learner **M** achieves **FIGEX**-**INF**-learning of \(\kappa (\mathcal {H})\) using Procedure 1. We can easily check that **M** always generates a hypothesis that is consistent with the received examples.

## Corollary 4.10

**FIGEX**-**INF**=**FIGCONS**-**INF**.

Suppose that \(\mathcal {F}\subset \kappa (\mathcal {H})\) is **FIGEX**-**TXT**-learnable. We can construct a learner **M** in the same way as in the case of **EX**-learning of languages from texts (Angluin 1980), where **M** always outputs a hypothesis that is consistent with received examples.

## Corollary 4.11

**FIGEX**-**TXT**=**FIGCONS**-**TXT**.

### 4.3 Reliable and refutable learning

In this subsection, we consider target figures that might not be represented exactly by any hypothesis since there are infinitely many such figures, and if we have no background knowledge, there is no guarantee of the existence of an exact hypothesis. Thus in practice this approach is more convenient than the explanatory or consistent learning considered in the previous two subsections.

To realize the above case, we use two concepts, *reliability* and *refutability*. The aim of the concepts is to introduce targets which cannot be exactly represented by any hypotheses. Reliable learning was introduced by Blum and Blum (1975), Minicozzi (1976) and refutable learning by Mukouchi and Arikawa (1995), Sakurai (1991) in computational learning of languages and recursive functions, and developed by Jain et al. (2001), Merkle and Stephan (2003), Mukouchi and Sato (2003). Here we introduce these concepts into the learning of figures and analyze learnability.

First, we treat reliable learning of figures. Intuitively, reliability requires that an infinite sequence of hypotheses only converges to a correct hypothesis.

## Definition 4.12

(Reliable learning)

A learner **M** **FIGRELEX**-**INF**-*learns* (resp. **FIGRELEX**-**TXT**-*learns*) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if **M** satisfies the following conditions:

- 1.
  The learner **M** **FIGEX**-**INF**-learns (resp. **FIGEX**-**TXT**-learns) \(\mathcal {F}\).
- 2.
  For any target figure \(K \in \mathcal {K}^{*}\) and its informants (resp. texts) *σ*, the infinite sequence of hypotheses **M** _{ σ } does not converge to a wrong hypothesis *H* such that \(\mbox {$\mathrm {GE}$}(K, H) \not= 0\).

We analyze reliable learning of figures from informants. Intuitively, for any target figure \(K \in \mathcal {F}\), if a learner can judge in finite time whether or not the current hypothesis *H* is correct for the target, i.e., whether *κ*(*H*)=*K*, then the set \(\mathcal {F}\) is reliably learnable.

## Theorem 4.13

**FIGEX**-**INF**=**FIGRELEX**-**INF**.

## Proof

**FIGRELEX**-

**INF**⊆

**FIGEX**-

**INF**is trivial, thus we prove

**FIGEX**-

**INF**⊆

**FIGRELEX**-

**INF**. Fix a set of figures \(\mathcal {F}\subseteq \kappa (\mathcal {H})\) with \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), and suppose that a learner

**M**

**FIGEX**-

**INF**-learns \(\mathcal {F}\) using Procedure 1. The goal is to show that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). Assume that a target figure

*K*belongs to \(\mathcal {K}^{*} \setminus \mathcal {F}\). Here we have the following property: for all figures \(L \in \mathcal {F}\), there must exist a finite sequences

*w*∈(

*Σ*

^{ d })

^{∗}such that

**M**’s current hypothesis

*H*,

**M**changes

*H*if it receives a positive or negative example (

*w*,

*l*) such that

*w*∈Pos(

*K*)⊖Pos(

*κ*(

*H*)). This means that an infinite sequence of hypotheses does not converge to any hypothesis. Thus we have \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). □

In contrast, we have an interesting result on reliable learning from texts. We show in the following that **FIGEX**-**TXT**≠**FIGRELEX**-**TXT** holds and that a set of figures \(\mathcal {F}\) is reliably learnable from positive data only if every figure \(K \in \mathcal {F}\) is a singleton. Remember that \(\mathcal {H}_{N}\) denotes the set of hypotheses \(\{H \in \mathcal {H}\mid\# H \in N\}\) for a subset *N*⊂ℕ and, for simplicity, we denote \(\mathcal {H}_{\{n\}}\) by \(\mathcal {H}_{n}\) for a natural number *n*∈ℕ.

## Theorem 4.14

*The set of figures* \(\kappa(\mathcal {H}_{N})\) *is* **FIGRELEX**-**TXT**-*learnable if and only if* *N*={1}.

## Proof

The "if" part: We show that \(\kappa (\mathcal {H}_{1})\) is **FIGRELEX**-**TXT**-learnable. From the self-similar-set property of hypotheses, we have the following: a figure \(K \in \kappa (\mathcal {H})\) is a singleton if and only if \(K \in \kappa (\mathcal {H}_{1})\). Let \(K \in \mathcal {K}^{*} \setminus\kappa(\mathcal {H}_{1})\), and assume that a learner **M** **FIGEX**-**TXT**-learns \(\kappa(\mathcal {H}_{1})\). We can naturally suppose, without loss of generality, that **M** changes the current hypothesis *H* whenever it receives a positive example (*w*,1) such that *w*∉Pos(*κ*(*H*)). For any hypothesis \(H \in \mathcal {H}_{1}\), there exists *w*∈(*Σ* ^{ d })^{∗} such that *w*∈Pos(*K*)∖Pos(*κ*(*H*)). Whenever **M** receives such a positive example (*w*,1), it changes the hypothesis *H*. This means that the infinite sequence of hypotheses does not converge to any hypothesis. Therefore \(\kappa (\mathcal {H}_{1})\) is **FIGRELEX**-**TXT**-learnable.

The "only if" part: We show that \(\kappa (\mathcal {H}_{n})\) is not **FIGRELEX**-**TXT**-learnable for any *n*>1. Fix such an *n*∈ℕ with *n*>1. We can easily check that, for a figure \(K \in \kappa (\mathcal {H}_{n})\) and any of its finite tell-tale sets \(\mathcal {T}\) with respect to \(\kappa(\mathcal {H}_{n})\), there exists a figure \(L \in \mathcal {K}^{*} \setminus \kappa (\mathcal {H}_{n})\) such that *L*⊂*K* and \(\mathcal {T}\subset \mathrm {Pos}(L)\). This means that, if **M** **FIGEX**-**TXT**-learns \(\kappa (\mathcal {H}_{n})\), the sequence **M** _{ σ } for some presentation *σ* of such an *L* must converge to some hypothesis in \(\mathcal {H}_{n}\). Consequently, we have \(\kappa (\mathcal {H}_{n}) \notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). □

## Corollary 4.15

**FIGRELEX**-**TXT**⊂**FIGEX**-**TXT**.

Sakurai (1991) proved that a set of concepts \(\mathcal{C}\) is reliably **EX**-learnable from texts if and only if \(\mathcal{C}\) contains no infinite concept (p. 182, Theorem 3.1).^{5} However, we have shown that the set \(\kappa (\mathcal {H}_{1})\) is **FIGRELEX**-**TXT**-learnable, though all figures \(K \in \kappa (\mathcal {H}_{1})\) correspond to infinite concepts since Pos(*K*) is infinite for all \(K \in \kappa (\mathcal {H}_{1})\). The monotonicity of the set Pos(*K*) (Lemma 3.5), which is a constraint naturally derived from the geometric property of examples, causes this difference.

Next, we extend **FIGEX**-**INF**- and **FIGEX**-**TXT**-learning by paying our attention to *refutability*. In refutable learning, a learner tries to learn figures in the limit, but it understands that it cannot find a correct hypothesis in finite time, that is, outputs the refutation symbol △ and stops if the target figure is not in the considered space.

## Definition 4.16

(Refutable learning)

A learner **M** **FIGREFEX**-**INF**-*learns* (resp. **FIGREFEX**-**TXT**-*learns*) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if **M** satisfies the following conditions. Here, △ is the *refutation symbol*.

- 1.
  The learner **M** **FIGEX**-**INF**-learns (resp. **FIGEX**-**TXT**-learns) \(\mathcal {F}\).
- 2.
  If \(K \in \mathcal {F}\), then for all informants (resp. texts) *σ* of *K*, **M** _{ σ }(*i*)≠△ for all *i*∈ℕ.
- 3.
  If \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\), then for all informants (resp. texts) *σ* of *K*, there exists *m*∈ℕ such that **M** _{ σ }(*i*)≠△ for all *i*<*m*, and **M** _{ σ }(*i*)=△ for all *i*≥*m*.

Conditions 2 and 3 in the above definition mean that a learner **M** refutes the set \(\mathcal {F}\) in finite time if and only if a target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). We compare **FIGREFEX**-**INF**-learnability with other learning criteria.

## Theorem 4.17

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) *and* \(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\).

## Proof

First we consider \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). We show an example of a set of figures \(\mathcal {F}\) with \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) and \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) in the case of *d*=2. Let *K* _{0}=*κ*({〈0,0〉,〈1,1〉}), *K* _{ i }=*κ*({〈*w*,*w*〉∣*w*∈*Σ* ^{ i }∖{1}^{ i }}) for every *i*≥1, and \(\mathcal {F}= \{K_{i} \mid i \in \mathbb {N}\}\). Note that *K* _{0} is the line *y*=*x* and *K* _{ i }⊂*K* _{0} for all *i*≥1.

We prove that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). It is trivial that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), thus assume that the target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). If \(K \not\subseteq K_{0}\), it is trivial that, for any informant *σ* of *K*, the set of examples \(\operatorname {range}(\sigma [n])\) for some *n*∈ℕ is not consistent with any \(K_{i} \in \mathcal {F}\) (consider a positive example for a point *x*∈*K*∖*K* _{0}). Otherwise *K*⊂*K* _{0}, and there should exist a negative example 〈*v*,*v*〉∈Neg(*K*). Then we have \(K \not= K_{i}\) for all *i*>|*v*|. Thus a learner can refute the candidates {*K* _{1},*K* _{2},…,*K* _{|v|}} in finite time. Therefore \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) holds.

Next we show that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). Let *K* _{0} be the target figure. For any finite set of positive examples \(\mathcal {T}\subset \mathrm {Pos}(K_{0})\), there exists a figure \(K_{i} \in \mathcal {F}\) such that *K* _{ i }⊂*K* _{0} and \(\mathcal {T}\) is consistent with *K* _{ i }. Therefore *K* _{0} has no finite tell-tale set with respect to \(\mathcal {F}\), and hence \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) from Theorem 4.4.

Second we check \(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). Assume that \(\mathcal {F}= \kappa (\mathcal {H}_{\{1\}})\) and a target figure *K* is a singleton {*x*} with \(K \notin \mathcal {F}\). It is clear that, for any informant *σ* of *K* and *n*∈ℕ, \(\operatorname {range}(\sigma [n])\) is consistent with some figure \(L \in \mathcal {F}\). Thus \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) whereas \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). □

## Corollary 4.18

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) *and* \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\).

Note that it is trivial that \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) since we have \(\kappa (\mathcal {H}_{\{1\}}) \notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) in the above proof and \(\kappa (\mathcal {H}_{\{1\}}) \in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) from Theorem 4.14. Moreover, the condition \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) holds since \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) and **FIGRELEX**-**TXT**⊂**FIGEX**-**TXT**. These results mean that both **FIGREFEX**-**INF**- and **FIGRELEX**-**TXT**-learning are difficult, but they are incomparable in terms of learnability. Furthermore, we have the following hierarchy.

## Theorem 4.19

\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not= \emptyset\) *and* **FIGREFEX**-**TXT**⊂**FIGREFEX**-**INF**.

## Proof

Let a set of figures \(\mathcal {F}\) be a singleton {*K*} such that *K*=*κ*(*w*) for some *w*∈(*Σ* ^{ d })^{∗}. Then there exists a learner **M** that **FIGREFEX**-**TXT**-learns \(\mathcal {F}\), i.e., \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\), since all **M** has to do is to check whether or not, for a given positive example (*v*,1), *v*⊑*u* for some *u*∈Pos(*K*)={*x*∣*x*⊑*www*…}.

Next, let \(\mathcal {F}= \{K\}\) such that *K*=*κ*(*H*) with #Red(*H*)≥2. We can easily check that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) because if a target figure *L* is a proper subset of *K*, no learner can refute \(\mathcal {F}\) in finite time. Conversely, \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) since for all *L* with *L*≠*K*, there exists an example with which the hypothesis *H* is not consistent. □

## Corollary 4.20

**FIGREFEX**-**TXT**⊂**FIGRELEX**-**TXT**.

## 5 Effective learning of figures

In learning under the proposed criteria, i.e., explanatory, consistent, reliable, and refutable learning, each hypothesis is judged only as exactly "correct" or not; that is, for a target figure *K* and a hypothesis *H*, *H* is correct if \(\mbox {$\mathrm {GE}$}(K, H) = 0\) and incorrect if \(\mbox {$\mathrm {GE}$}(K, H) \neq0\). Thus we cannot know the rate of convergence to the target figure or how far the current hypothesis is from the target. It is therefore more useful to consider *approximate* hypotheses by taking various *generalization errors* into account in the learning process.

We define novel learning criteria, **FIGEFEX**-**INF**- and **FIGEFEX**-**TXT**-learning (EF means EFfective), to introduce into learning the concept of *effectivity*, which has been analyzed in computation of real numbers in the area of computable analysis (Weihrauch 2000). Intuitively, these criteria guarantee that for any target figure, a generalization error becomes smaller and smaller monotonically and converges to zero. Thus we can know when the learner learns the target figure “well enough”. Furthermore, if a target figure is learnable in the limit, then the generalization error goes to zero in finite time.

## Definition 5.1

(Effective learning)

A learner **M** **FIGEFEX**-**INF**-*learns* (resp. **FIGEFEX**-**TXT**-*learns*) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if **M** satisfies the following conditions:

- 1.
  The learner **M** **FIGEX**-**INF**-learns (resp. **FIGEX**-**TXT**-learns) \(\mathcal {F}\).
- 2.
  For an arbitrary target figure \(K \in \mathcal {K}^{*}\) and all informants (resp. texts) *σ* of *K*, for all *i*∈ℕ,$$ \mbox {$\mathrm {GE}$}\bigl(K, \text {\textbf {M}}_{\sigma }(i) \bigr) \le2^{-i}. $$

This definition is inspired by the *Cauchy representation* of real numbers (Weihrauch 2000, Definition 4.1.5).

Effective learning is related to *monotonic* learning (Lange and Zeugmann 1993, 1994; Kinber 1994; Zeugmann et al. 1995), originally introduced by Jantke (1991), Wiehagen (1991), since both learning models consider monotonic convergence of hypotheses. In contrast to their approach, where various notions of monotonicity over languages were considered, we geometrically measure the generalization error of a hypothesis by the Hausdorff metric. On the other hand, effective learning is different from **BC**-learning developed in the learning of languages and recursive functions (Jain et al. 1999), since **BC**-learning only guarantees that generalization errors go to zero in finite time. This means that **BC**-learning is *not* effective.

Here we bound the generalization error of a hypothesis *H* using the diameter \(\operatorname {diam}(k)\) of the set *ρ*(*w*) with |*w*|=*k*. Recall that we have \(\operatorname {diam}(k) \to0\) as *k*→∞. We denote the set of examples {(*w*,*l*)∣|*w*|=*k*} in *σ* by *E* ^{ k } and call each example in it a *level*-*k* *example*.

## Lemma 5.2

*Let* *σ* *be an informant of a figure* *K* *and* *H* *be a hypothesis that is consistent with the set of examples* *E* ^{ k }={(*w*,*l*)∈\(\operatorname {range}(\sigma )\)∣|*w*|=*k*}. *We have the inequality*
$$\mbox {$\mathrm {GE}$}(K, H) \le \operatorname {diam}(k) = \sqrt{d}\,2^{-k}.$$

## Proof

Since *H* is consistent with *E* ^{ k }, for *δ*=\(\operatorname {diam}(k)\), the *δ*-neighborhood of *κ*(*H*) contains *K* and the *δ*-neighborhood of *K* contains *κ*(*H*). It therefore follows that \(\mbox {$\mathrm {GE}$}(K, H) = d_{\mathrm {H}}(K, \kappa (H)) \le \operatorname {diam}(k)\). □

## Theorem 5.3

*The set of figures* \(\kappa (\mathcal {H})\) *is* **FIGEFEX**-**INF**-*learnable*.

## Proof

We give a learner **M** that **FIGEFEX**-**INF**-learns \(\kappa (\mathcal {H})\) in Procedure 2. We use a function *g* such that for all *k*∈ℕ, \(\sqrt{d} \cdot 2^{-g(k)} < 2^{-k}\). The learner **M** stores examples, and when it has received all examples at level *g*(*k*), it outputs a hypothesis. Every *k*th hypothesis **M** _{ σ }(*k*) is consistent with the set of examples *E* ^{ g(k)}. Thus we have \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(k)) \le \operatorname {diam}(g(k)) < 2^{-k}\) for all *k*∈ℕ from Lemma 5.2.

Assume that \(K \in \kappa (\mathcal {H})\). If **M** outputs a wrong hypothesis, there must be a positive or negative example that is not consistent with the hypothesis, and it changes the wrong hypothesis. If it produces a correct hypothesis, then it never changes the correct hypothesis, since every example is consistent with the hypothesis. Thus there exists *n*∈ℕ with \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) = 0\) for all *i*≥*n*. Therefore **M** **FIGEFEX**-**INF**-learns \(\kappa (\mathcal {H})\). □

## Corollary 5.4

**FIGEFEX**-**INF**=**FIGRELEX**-**INF**=**FIGEX**-**INF**.

Thus the learner with Procedure 2 can treat the set of *all* figures \(\mathcal {K}^{*}\) as learning targets, since for any figure \(K \in \mathcal {K}^{*}\), it can approximate the figure arbitrarily closely using only the figures represented by hypotheses in the hypothesis space \(\mathcal {H}\).
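A learner in the style of Procedure 2 can be sketched concretely for *d*=1 and the target figure *K*=[0,1/2]. Here a hypothesis is simply the set of positive words at level *g*(*k*) (a stand-in for the paper's self-similar-set hypotheses), and the names `g`, `Learner`, and `label` are ours, not the paper's notation:

```python
import math

def g(k, d=1):
    """One possible depth schedule with sqrt(d) * 2**-g(k) < 2**-k."""
    return k + max(0, math.ceil(math.log2(math.sqrt(d)))) + 1

class Learner:
    """Store examples; once every level-g(k) example has arrived, output the
    positive level-g(k) words as the k-th hypothesis. By Lemma 5.2 this keeps
    the generalization error of the k-th hypothesis below 2**-k."""
    def __init__(self, d=1):
        self.d, self.k = d, 0
        self.examples = {}                      # level -> {word: label}

    def receive(self, word, label):
        self.examples.setdefault(len(word), {})[word] = label
        emitted = []
        while len(self.examples.get(g(self.k, self.d), {})) == 2 ** (g(self.k, self.d) * self.d):
            lvl = g(self.k, self.d)
            emitted.append(frozenset(w for w, l in self.examples[lvl].items() if l == 1))
            self.k += 1
        return emitted                          # hypotheses emitted (possibly none)

def label(w):
    """Informant label for K = [0, 1/2]: the cell rho(w) meets K iff its
    left endpoint int(w, 2) / 2**len(w) is at most 1/2."""
    return 1 if int(w, 2) <= 2 ** (len(w) - 1) else 0

M, hyps = Learner(), []
for lvl in (1, 2, 3):                           # feed a finite informant prefix
    for i in range(2 ** lvl):
        w = format(i, f'0{lvl}b')
        hyps += M.receive(w, label(w))
print([sorted(H) for H in hyps])                # three hypotheses, levels 1-3
```

Each emitted hypothesis is consistent with all examples of its level, so the error bound of the effective criterion holds by construction.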

In contrast to **FIGEX**-**TXT**-learning, there is no set of figures that is **FIGEFEX**-**TXT**-learnable.

## Theorem 5.5

**FIGEFEX**-**TXT**=∅.

## Proof

We show a counterexample of a target figure which no learner **M** can approximate effectively. Assume that *d*=2 and a learner **M** **FIGEFEX**-**TXT**-learns a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\). Let us consider two target figures *K*={(0,0),(1,1)} and *L*={(0,0)}. For a text *σ* of *L*, for all examples \((w, 1) \in \operatorname {range}(\sigma )\), *w*∈{00}^{∗}. Since **M** **FIGEFEX**-**TXT**-learns \(\mathcal {F}\), it should output the hypothesis *H* as **M** _{ σ }(2) such that \(\mbox {$\mathrm {GE}$}(L, H) < 1/4\). Suppose that **M** receives *n* examples before outputting the hypothesis *H*. Then there exists a presentation *τ* of the figure *K* such that *τ*[*n*−1]=*σ*[*n*−1], and **M** outputs the hypothesis *H* with receiving *τ*[*n*−1]. However, \(\mbox {$\mathrm {GE}$}(K, H) \ge\sqrt{2} - 1/4\) holds from the triangle inequality, contradicting our assumption that **M** **FIGEFEX**-**TXT**-learns \(\mathcal {F}\). This proof can be applied for any \(\mathcal {F}\subseteq \mathcal {K}^{*}\), thereby we have **FIGEFEX**-**TXT**=∅. □

Since **FIGREFEX**-**TXT**≠∅, we have the relation **FIGEFEX**-**TXT** ⊊ **FIGREFEX**-**TXT**.

## 6 Evaluation of learning using dimensions

Here we show a novel mathematical connection between fractal geometry and Gold's learning model under the proposed learning framework described in Sect. 3. More precisely, we bound the number of positive examples, one of the complexities of learning, using the Hausdorff dimension and the VC dimension. The Hausdorff dimension is the central concept of fractal geometry and measures the density of figures, while the VC dimension is the central concept of Valiant's model (the PAC learning model) (Kearns and Vazirani 1994) and measures the complexity of classes of hypotheses.

### 6.1 Preliminaries for dimensions

First we introduce the Hausdorff dimension and related dimensions: the box-counting dimension, the similarity dimension, and also introduce the VC dimension.

For a set *X*⊆ℝ^{ n } and *s*∈ℝ with *s*>0, define
$$\mathfrak {H}_{\delta}^{s}(X) := \inf \biggl\{\, \sum_{U \in \mathcal {U}} |U|^{s} \biggm| \mathcal {U} \text{ is a } \delta\text{-cover of } X \,\biggr\}.$$
The *s*-*dimensional Hausdorff measure* of *X* is lim_{ δ→0}\(\mathfrak {H}_{\delta}^{s}(X)\), denoted by ℌ^{ s }(*X*). We say that \(\mathcal {U}\) is a *δ*-cover of *X* if \(\mathcal {U}\) is countable, \(X \subseteq\bigcup_{U \in\, \mathcal {U}} U\), and |*U*|≤*δ* for all \(U \in \mathcal {U}\). When we fix a set *X* and view ℌ^{ s }(*X*) as a function with respect to *s*, there is at most one value of *s* at which ℌ^{ s }(*X*) changes from ∞ to 0 (Federer 1996). This value is called the *Hausdorff dimension* of *X*. Formally, the Hausdorff dimension of a set *X*, written as dim_{H} *X*, is defined by
$$\dim_{\mathrm{H}} X := \sup \bigl\{ s \bigm| \mathfrak {H}^{s}(X) = \infty \bigr\} = \inf \bigl\{ s \bigm| \mathfrak {H}^{s}(X) = 0 \bigr\}.$$

If we count covers of decreasing size *δ*, the values obtained often converge to the Hausdorff dimension. Thus we can obtain an approximate value of the Hausdorff dimension by an empirical method. Let *X* be a nonempty bounded subset of ℝ^{ n } and *N* _{ δ }(*X*) be the smallest cardinality of a *δ*-cover of *X*. The *box-counting dimension* dim_{B} *X* of *X* is defined by
$$\dim_{\mathrm{B}} X := \lim_{\delta \to 0} \frac{\log N_{\delta}(X)}{-\log \delta},$$
provided the limit exists. We obtain the same value dim_{B} *X* if *N* _{ δ }(*K*) is the smallest number of cubes of side *δ* that cover *K*, or the number of *δ*-mesh cubes that intersect *K*. We have dim_{H} *X* ≤ dim_{B} *X* for all *X*⊆ℝ^{ n }.
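The empirical method can be sketched numerically. Below we estimate the box-counting dimension of the middle-thirds Cantor set from mesh counts at two scales (an illustration of the definition, not the paper's procedure; for this set dim_B = dim_H = log 2/log 3 ≈ 0.6309, and all function names are ours):

```python
import math

def cantor_midpoints(depth):
    """Midpoints of the level-`depth` intervals of the middle-thirds Cantor set.

    Midpoints are used (rather than endpoints) so that no sample point falls
    exactly on a mesh boundary, keeping the box counts unambiguous.
    """
    intervals = [(0.0, 1.0)]
    for _ in range(depth):
        intervals = [iv for a, b in intervals
                     for iv in ((a, a + (b - a) / 3), (b - (b - a) / 3, b))]
    return [(a + b) / 2 for a, b in intervals]

def mesh_count(points, k):
    """Number of 3**-k-mesh intervals hit by the sample (N_delta for delta = 3**-k)."""
    return len({int(x * 3 ** k) for x in points})

pts = cantor_midpoints(10)
# Slope of log N_delta versus -log delta between two mesh sizes.
est = (math.log(mesh_count(pts, 8)) - math.log(mesh_count(pts, 2))) / ((8 - 2) * math.log(3))
print(est)  # ≈ 0.6309 = log 2 / log 3
```

Taking the slope between two scales cancels the constant factor in *N* _{ δ }, which is why the estimate is already exact here.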

Let *C* be a finite set of contractions, and *F* be the self-similar set of *C*. The *similarity dimension* of *F*, denoted by dim_{S} *F*, is defined by the equation
$$\sum_{\varphi \in C} L(\varphi)^{\dim_{\mathrm{S}} F} = 1,$$
where *L*(*φ*) is the *contractivity factor* of *φ*, which is defined as the infimum of all real numbers *c* with 0<*c*<1 such that *d*(*φ*(*x*),*φ*(*y*))≤*cd*(*x*,*y*) for all *x*,*y*∈*X*. We have dim_{H} *F* = dim_{B} *F* = dim_{S} *F* if *C* satisfies the open set condition, where *C* satisfies the *open set condition* if there exists a nonempty bounded open set *O*⊂ℝ^{ n } such that *φ*(*O*)⊂*O* for all *φ*∈*C* and *φ*(*O*)∩*φ*′(*O*)=∅ for all *φ*,*φ*′∈*C* with \(\varphi \not= \varphi '\).
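The defining equation ∑_{φ∈C} *L*(*φ*)^{ s } = 1 (the Moran equation) can be solved numerically by bisection, since its left-hand side is strictly decreasing in *s* whenever every contractivity factor lies in (0,1). A small sketch (the function name is ours):

```python
import math

def similarity_dimension(ratios, tol=1e-12):
    """Solve sum(c**s for c in ratios) == 1 for s by bisection.

    Assumes every contractivity factor c satisfies 0 < c < 1, so the
    left-hand side is strictly decreasing in s.
    """
    f = lambda s: sum(c ** s for c in ratios) - 1.0
    lo, hi = 0.0, 1.0
    while f(hi) > 0:            # grow the bracket until f(hi) <= 0
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Three contractions of ratio 1/2: the Sierpinski triangle case.
print(similarity_dimension([0.5, 0.5, 0.5]))  # ≈ 1.585 = log 3 / log 2
```

For equal ratios the equation has the closed form *s* = log *m* / log(1/*c*) for *m* maps of ratio *c*; the bisection is only needed for unequal ratios.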

For a class \(\mathcal {R}\) of subsets of *Σ* ^{∗} and a set of finite sequences *W*⊆*Σ* ^{∗}, define \(\mathcal {R} \cap W := \{\,R \cap W \mid R \in \mathcal {R}\,\}\). If \(\mathcal {R} \cap W = 2^{W}\), we say that *W* is *shattered* by \(\mathcal {R}\). Here the *VC dimension* of \(\mathcal {R}\), denoted by \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {R}}\), is the cardinality of the largest set *W* shattered by \(\mathcal {R}\).
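Shattering can be checked by brute force for small finite set systems. The following sketch (names ours) computes the VC dimension of the class of integer intervals over a four-point universe, which is known to be 2:

```python
from itertools import combinations

def vc_dimension(sets, universe):
    """Largest |W|, W a subset of universe, with {R & W : R in sets} = 2^W."""
    def shattered(W):
        return len({frozenset(R & W) for R in sets}) == 2 ** len(W)
    for r in range(len(universe), -1, -1):      # try largest W first
        if any(shattered(frozenset(c)) for c in combinations(universe, r)):
            return r

universe = [0, 1, 2, 3]
# All intervals {i, ..., j-1} over the universe, including the empty one.
intervals = [frozenset(range(i, j)) for i in range(5) for j in range(i, 5)]
print(vc_dimension(intervals, universe))  # 2
```

No three points can be shattered by intervals (the pattern "endpoints without the middle" is unobtainable), which is exactly what the brute-force search confirms.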

### 6.2 Measuring the complexity of learning with dimensions

We show that the Hausdorff dimension of a target figure gives a lower bound to the number of positive examples. Remember that Pos_{ k }(*K*)={*w*∈Pos(*K*)∣|*w*|=*k*} and the diameter \(\operatorname {diam}(k)\) of the set *ρ*(*w*) with |*w*|=*k* is \(\sqrt{d}2^{-k}\). Moreover, the size #{*w*∈(*Σ* ^{ d })^{∗}∣|*w*|=*k*}=2^{ kd } for all *k*∈ℕ.

## Theorem 6.1

*For every figure* \(K \in \mathcal {K}^{*}\) *and for any* *s*<dim_{H} *K*, *if we take* *k* *large enough*,
$$\#\mathrm{Pos}_{k}(K) \ge \bigl(\sqrt{d}\bigr)^{-s}\, 2^{ks}.$$

## Proof

Fix *s*<dim_{H} *K*. From the definition of the Hausdorff measure, ℌ^{ s }(*K*)=∞. The set \(\{\rho(w) \mid w \in \mathrm{Pos}_{k}(K)\}\) is a \(\operatorname {diam}(k)\)-cover of *K*, hence
$$\mathfrak {H}_{\operatorname {diam}(k)}^{s}(K) \le \#\mathrm{Pos}_{k}(K) \cdot \operatorname {diam}(k)^{s}.$$
The left-hand side is non-decreasing with decreasing *δ*=\(\operatorname {diam}(k)\), and goes to ∞. Thus, for *k* large enough, \(\#\mathrm{Pos}_{k}(K) \ge \operatorname {diam}(k)^{-s} = (\sqrt{d})^{-s} 2^{ks}\). □

Moreover, if a target figure *K* can be represented by some hypothesis, that is, \(K \in \kappa (\mathcal {H})\), we can use the exact dimension dim_{H}
*K* as a bound for the number of positive examples #Pos_{ k }(*K*).

## Theorem 6.2

*For every figure* \(K \in \kappa (\mathcal {H})\), *if we take* *k* *large enough*,
$$\#\mathrm{Pos}_{k}(K) \ge 2^{k \dim_{\mathrm{H}} K}.$$

## Proof

Since the set of contractions encoded by *H* meets the open set condition, dim_{H} *κ*(*H*)=dim_{B} *κ*(*H*)=dim_{S} *κ*(*H*) holds. Thus we have
$$\dim_{\mathrm{H}} K = \dim_{\mathrm{B}} K = \lim_{k \to \infty} \frac{\log N_{2^{-k}}(K)}{k \log 2},$$
where 2^{−k } is the length of one side of an interval *ρ*(*w*) with |*w*|=*k*. The above inequality is trivial from the definition of the box-counting dimension since \(N_{2^{-k}}(X) \le \#\mathrm{Pos}_{k}(K)\). Therefore if we take *k* large enough, \(\#\mathrm{Pos}_{k}(K) \ge N_{2^{-k}}(K) \ge 2^{k \dim_{\mathrm{H}} K}\). □

## Example 6.3

Consider the figure *K* in Example 3.1. It is known that dim_{H} *K*=log3/log2=1.584…. From Theorem 6.2, for *k* large enough, \(\#\mathrm{Pos}_{k}(K) \ge 2^{k \log 3 / \log 2} = 3^{k}\). Actually, #Pos_{1}(*K*)=4 and #Pos_{2}(*K*)=13. Note that *K* is already covered by 3 and 9 intervals at levels 1 and 2, respectively (Fig. 4).
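The counts in this example can be checked mechanically. Below we sample points of the Sierpiński (right-triangle) self-similar set with exact rational arithmetic and count the level-*k* mesh cells whose closure contains a sample point; the particular IFS orientation is our assumption (the counts are the same for any orientation by symmetry), and all names are ours:

```python
from fractions import Fraction as F

HALF = F(1, 2)
# IFS whose attractor is the Sierpinski right triangle with legs on the axes.
MAPS = [lambda x, y: (x * HALF, y * HALF),
        lambda x, y: (x * HALF + HALF, y * HALF),
        lambda x, y: (x * HALF, y * HALF + HALF)]

def sample(depth):
    """Images of the three corner vertices under all depth-fold compositions."""
    pts = {(F(0), F(0)), (F(1), F(0)), (F(0), F(1))}
    for _ in range(depth):
        pts = {m(x, y) for (x, y) in pts for m in MAPS}
    return pts

def cell_indices(t, k):
    """Indices i with t inside the closed interval [i/2**k, (i+1)/2**k]."""
    n = 1 << k
    s = t * n
    i = int(s)
    idx = {i} if i < n else set()
    if s == i and i > 0:          # t lies on a grid line: the left cell too
        idx.add(i - 1)
    return idx

def pos_count(pts, k):
    """Number of level-k mesh cells whose closure meets the sample."""
    return len({(i, j) for x, y in pts
                for i in cell_indices(x, k) for j in cell_indices(y, k)})

pts = sample(5)
print(pos_count(pts, 1), pos_count(pts, 2))  # 4 13
```

The extra cells beyond the 3^k covering cells come from boundary points such as (1/2, 1/2), which lie on the corner of several mesh cells at once.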

We call a hypothesis in which every finite sequence has length *k* a *level*-*k* *hypothesis*. We show that the VC dimension of the set of level-*k* hypotheses \(\mathcal {H}^{k}\) is equal to #{*w*∈(*Σ* ^{ d })^{∗}∣|*w*|=*k*}=2^{ kd }.

## Lemma 6.4

*At each level* *k*, *we have* \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}} = 2^{kd}\).

## Proof

Let \(\mathcal {H}_{1}^{k}\) denote the set of level-*k* hypotheses consisting of exactly one finite sequence, so that \(\#\mathcal {H}_{1}^{k} = 2^{kd}\). For each \(H \in \mathcal {H}_{1}^{k}\) there exists *w*∈Pos(*κ*(*H*)) such that *w*∉Pos(*κ*(*G*)) for all \(G \in \mathcal {H}_{1}^{k}\) with \(H \not= G\). Thus if we assume \(\mathcal {H}_{1}^{k} = \{H_{1}, \dots, H_{2^{kd}}\}\), there exists a set of finite sequences \(W = \{w_{1}, \dots, w_{2^{kd}}\}\) such that for all *i*∈{1,…,2^{ kd }}, *w* _{ i }∈Pos(*κ*(*H* _{ i })) and *w* _{ i }∉Pos(*κ*(*H* _{ j })) for all *j*∈{1,…,2^{ kd }} with *i*≠*j*. For every pair *V*,*W*⊂(*Σ* ^{ d })^{∗}, *V*⊂*W* implies *κ*(*V*)⊂*κ*(*W*). Therefore the set *W* is shattered by \(\mathcal {H}^{k}\), meaning that we have \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}} = 2^{kd}\). □

Therefore we can rewrite Theorems 6.1 and 6.2 as follows.

## Theorem 6.5

*For every figure* \(K \in \mathcal {K}^{*}\) *and for any* *s*<dim_{H} *K*, *if we take* *k* *large enough*,
$$\#\mathrm{Pos}_{k}(K) \ge \bigl(\sqrt{d}\bigr)^{-s} \bigl(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}}\bigr)^{s/d}.$$
*Moreover*, *when* \(K \in \kappa (\mathcal {H})\), *if we take* *k* *large enough*,
$$\#\mathrm{Pos}_{k}(K) \ge \bigl(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}}\bigr)^{\dim_{\mathrm{H}} K / d}.$$

These results demonstrate a relationship among the complexities of learning figures (numbers of positive examples), classes of hypotheses (VC dimension), and target figures (Hausdorff dimension).

### 6.3 Learning the box-counting dimension through effective learning

One may think that **FIGEFEX**-**INF**-learning can be achieved without the proposed hypothesis space. For instance, if a learner just outputs figures represented by a set of received positive examples, the generalization error becomes smaller and smaller. Here we show that one “quality” of a target figure, the box-counting dimension, is also learned in **FIGEFEX**-**INF**-learning, whereas if a learner outputs figures represented by a set of received positive examples, the box-counting dimension (and also the Hausdorff dimension) of any figure represented by a hypothesis is always *d*.

Recall that for all hypotheses \(H \in \mathcal {H}\), dim_{H} *κ*(*H*)=dim_{B} *κ*(*H*)=dim_{S} *κ*(*H*), since the set of contractions encoded by the hypothesis *H* meets the open set condition.

## Theorem 6.6

*Assume that a learner* **M** **FIGEFEX**-**INF**-*learns* \(\kappa (\mathcal {H})\). *For every target figure* \(K \in \mathcal {K}^{*}\) *and every informant* *σ* *of* *K*,
$$\lim_{i \to \infty} \dim_{\mathrm{B}} \kappa \bigl(\text {\textbf {M}}_{\sigma }(i)\bigr) = \dim_{\mathrm{B}} K.$$

## Proof

If \(K \in \kappa (\mathcal {H})\), for every informant *σ* of *K*, **M** _{ σ } converges to a hypothesis *H* with *κ*(*H*)=*K*, and the claim holds trivially. Otherwise, since \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) \le 2^{-i}\) for all *i*∈ℕ, for each *k*∈ℕ we have some *i*≥*k* such that the hypothesis **M** _{ σ }(*i*) is consistent with the set of level-*k* examples \(E^{k} = \{(w, l) \in \operatorname {range}(\sigma ) \mid|w| = k\}\). Thus the 2^{−k }-mesh cubes intersecting *κ*(**M** _{ σ }(*i*)) coincide with those intersecting *K*. Since dim_{B} *K* is defined equivalently by
$$\dim_{\mathrm{B}} K = \lim_{k \to \infty} \frac{\log N_{2^{-k}}(K)}{k \log 2},$$
the box-counting dimensions of the figures output by **M** converge to dim_{B} *K*. □

## 7 Computational interpretation of learning

Recently, the concept of "computability" for continuous objects has been introduced in the framework of Type-2 Theory of Effectivity (TTE) (Schröder 2002b; Weihrauch 2000, 2008; Weihrauch and Grubba 2009; Tavana and Weihrauch 2011), where we treat an uncountable set *X* as a set of objects for computation by encoding its elements as infinite sequences over a given alphabet *Σ*. Using this framework, we analyze our learning model from the computational point of view. Studies by de Brecht and Yamamoto (2009) and de Brecht (2010) have already demonstrated a close connection between TTE and Gold's model, and our analysis becomes an instance and extension of their analysis.

### 7.1 Preliminaries for Type-2 theory of effectivity

We prepare mathematical notations for TTE. In the following in this section, we assume *Σ*={0,1,[,],∥,♢}. A partial (resp. total) function *g* from a set *A* to a set *B* is denoted by *g*:⊆*A*→*B* (resp. *g*:*A*→*B*). A *representation* of a set *X* is a surjection *ξ*:⊆*C*→*X*, where *C* is *Σ* ^{∗} or *Σ* ^{ ω }. We see \(p \in \operatorname {dom}(\xi)\) as a name of the encoded element *ξ*(*p*).

Computability of string functions *f*:⊆*X*→*Y*, where *X* and *Y* are *Σ* ^{∗} or *Σ* ^{ ω }, is defined via a *Type-2 machine*, which is a usual Turing machine with one-way input tapes, some work tapes, and a one-way output tape (Weihrauch 2000). The function *f* _{ M }:⊆*X*→*Y* computed by a Type-2 machine *M* is defined as follows: When *Y* is *Σ* ^{∗}, *f* _{ M }(*p*):=*q* if *M* with input *p* halts with *q* on the output tape, and when *Y* is *Σ* ^{ ω }, *f* _{ M }(*p*):=*q* if *M* with input *p* writes *q* step by step onto the output tape. We say that a function *f*:⊆*C*→*D* is *computable* if there is a Type-2 machine that computes *f*, and a finite or infinite sequence *p* is computable if the constant function *f* which outputs *p* is computable. A Type-2 machine never changes symbols that have already been written onto the output tape, thus each prefix of the output depends only on a prefix of the input.
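The key property of a Type-2 machine (each finite output prefix depends only on a finite input prefix, and output is never revised) is naturally modeled by a stream transformer. A toy sketch in Python generators (our illustration, not the TTE formalism itself):

```python
from itertools import count, islice

def complement(p):
    """A toy 'Type-2' computation on infinite binary sequences: emit each
    output symbol after reading one input symbol, never revising earlier
    output (here: the bitwise complement of the input stream)."""
    for bit in p:
        yield 1 - bit

p = (i % 2 for i in count())                  # infinite input 0, 1, 0, 1, ...
print(list(islice(p := complement(p), 0, 0)) or list(islice(p, 6)))  # [1, 0, 1, 0, 1, 0]
```

Any finite prefix of the output is available after consuming a finite prefix of the input, which is exactly the monotonicity requirement on the output tape.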

By treating a Type-2 machine as a translator between names of some objects, a hierarchy of representations is introduced. A representation *ξ* is *reducible* to *ζ*, denoted by *ξ*≤*ζ*, if there exists a computable function *f* such that *ξ*(*p*)=*ζ*(*f*(*p*)) for all \(p \in \operatorname {dom}(\xi)\). Two representations *ξ* and *ζ* are *equivalent*, denoted by *ξ*≡*ζ*, if both *ξ*≤*ζ* and *ζ*≤*ξ* hold. As usual, *ξ*<*ζ* means *ξ*≤*ζ* and not *ζ*≤*ξ*.

Computability for functions is defined through representations and computability of string functions.

## Definition 7.1

Let *ξ* and *ζ* be representations of *X* and *Y*, respectively. An element *x*∈*X* is *ξ*-*computable* if there is some computable *p* such that *ξ*(*p*)=*x*. A function *f*:⊆*X*→*Y* is (*ξ*,*ζ*)-*computable* if there is some computable function *g* such that
$$f\bigl(\xi(p)\bigr) = \zeta\bigl(g(p)\bigr) \quad \text{for all } p \in \operatorname {dom}(f \circ \xi).$$
The function *g* is called a (*ξ*,*ζ*)-*realization* of *f*.

Thus the abstract function *f* is “realized” by the concrete function (Type-2 machine) *g* through the two representations *ξ* and *ζ*.

For a set of finite sequences {*w* _{0},…,*w* _{ m }}, for convenience we introduce the mapping *ι* which translates the set into a finite sequence, defined by *ι*(*w* _{0},…,*w* _{ m }):=[*w* _{0}∥…∥*w* _{ m }]. Note that \(\nu_{\mathbb {Q}^{d}}(\langle w^{1}, \dots, w^{d}\rangle) = (\min\rho (w^{1}), \ldots, \min\rho(w^{d}))\) for our representation *ρ* introduced in (2). The standard representation of the topological space \((\mathcal {K}^{*}, d_{\mathrm {H}})\), given by Brattka and Weihrauch (1999, Definition 4.8), is defined in the following manner.

## Definition 7.2

(Standard representation of figures)

Define *κ* _{H}(*p*)=*K* if *p*=*w* _{0}♢*w* _{1}♢*w* _{2}♢…, where each *w* _{ i } encodes a finite set of finite sequences, \(d_{\mathrm {H}}(\nu_{\mathcal {Q}}(w_{i}), K) \le 2^{-i}\) for all *i*∈ℕ, and \(\lim_{i \to\infty}\nu_{\mathcal {Q}}(w_{i}) = K\), where ♢ denotes a separator of two finite sequences.

This representation *κ* _{H} is known to be an *admissible representation* of the space \((\mathcal {K}^{*}, d_{\mathrm {H}})\), which is the key concept in TTE (Schröder 2002b; Weihrauch 2000), and is also known as the \(\boldsymbol {\varSigma }_{1}^{0}\)-admissible representation proposed by de Brecht and Yamamoto (2009).

### 7.2 Computability and learnability of figures

First, we show computability of figures in \(\kappa (\mathcal {H})\).

## Theorem 7.3

*For every figure* \(K \in \kappa (\mathcal {H})\), *K* *is* *κ* _{H}-*computable*.

## Proof

We construct a computable function *f* such that *κ*(*H*)=*κ* _{H}(*f*(*H*)) for all \(H \in \mathcal {H}\). Fix a hypothesis \(H \in \mathcal {H}\) such that *κ*(*H*)=*K*. For each *k*∈ℕ, a finite set *H* _{ k } of finite sequences with \(d_{\mathrm {H}}(\nu_{\mathcal {Q}}(\iota(H_{k})), K) \le \sqrt{d} \cdot 2^{-k}\) can be computed from *H* by expanding the contractions encoded by *H* to depth *k*. Take a function *g* such that for each *k*, \(\sqrt{d} \cdot2^{-g(k)} < 2^{-k}\). We can then construct a computable *f* which translates *H* into a representation of *K* given as follows: *f*(*H*)=*p* with *p*=*w* _{0}♢*w* _{1}♢… such that *ι*(*H* _{ g(k)})=*w* _{ k } for all *k*∈ℕ. □

Thus a hypothesis *H* can be viewed as a “program” of a Type-2 machine that produces a *κ* _{H}-representation of the figure *κ*(*H*).

Both informants and texts are also representations (in the sense of TTE) of compact sets. Define the mapping *η* _{INF} by *η* _{INF}(*σ*):=*K* for every \(K \in \mathcal {K}^{*}\) and informant *σ* of *K*, and the mapping *η* _{TXT} by *η* _{TXT}(*σ*):=*K* for every \(K \in \mathcal {K}^{*}\) and text *σ* of *K*. Trivially *η* _{INF}<*η* _{TXT} holds, that is, some Type-2 machine can translate *η* _{INF} to *η* _{TXT}, but no machine can translate *η* _{TXT} to *η* _{INF}. Moreover, we have the following hierarchy of representations.

## Lemma 7.4

*η* _{INF}<*κ* _{H}, \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\), *and* \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\).

## Proof

First, we prove *η* _{INF}≤*κ* _{H}, that is, there is some computable function *f* such that *η* _{INF}(*σ*)=*κ* _{H}(*f*(*σ*)). Fix a figure *K* and its informant \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). For all *k*∈ℕ, the set Pos_{ k }(*K*) can be obtained from *σ*. Moreover, take a function *g* such that for each *k*, \(\sqrt{d} \cdot2^{-g(k)} < 2^{-k}\). We can construct a computable *f* that translates *σ* into a representation of *K* as follows: *f*(*σ*)=*p*, where *p*=*w* _{0}♢*w* _{1}♢… such that *w* _{ k }=*ι*(Pos_{ g(k)}(*K*)) for all *k*∈ℕ.

Second, we prove \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\). Assume that the opposite, *η* _{TXT}≤*κ* _{H} holds. Then there exists a computable function *f* such that *η* _{TXT}(*σ*)=*κ* _{H}(*f*(*σ*)) for every figure \(K \in \mathcal {K}^{*}\). Fix a figure *K* and its text \(\sigma \in \operatorname {dom}(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}})\). This means that for any small *ε*∈ℝ, *f* can pick up finite sequences *w* _{1},*w* _{2},…,*w* _{ n } from Pos(*K*) such that \(d_{\mathrm {H}}(K, \nu_{\mathcal {Q}}(\iota(w_{1}, w_{2}, \dots, w_{n}))) \le \varepsilon \). However, if such *f* exists, we can easily check that {*K*}∈**FIGEFEX**-**TXT**, contradicting to our result (Theorem 5.5). It follows that \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\).

Third, we prove \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). There is a figure *K* such that *K*∩*ρ*(*w*)={*x*} for some *w*∈*Σ* ^{∗}, i.e., *K* and *ρ*(*w*) intersect in only one point *x*. Such a *w* must appear in every presentation *σ* of *K* as a positive example, that is, *w*∈Pos(*K*). However, a *κ* _{H}-representation of *K* can be constructed without *w*: there exists an infinite sequence \(p \in \operatorname {dom}(\kappa_{\mathrm {H}})\) with *p*=*w* _{0}♢*w* _{1}♢… such that \(x \notin\nu_{\mathcal {Q}}(w_{k})\) for all *k*∈ℕ. Thus, if there were a computable *f* which outputs the example (*w*,1) from such a sequence after seeing only *w* _{0}♢*w* _{1}♢…♢*w* _{ n }, one could extend that prefix to a name of some figure *L* with *w*∉Pos(*L*), in contradiction to the reduction. Therefore there is no computable function that outputs the example (*w*,1) from *p*, meaning that \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not \le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). □

We can interpret *learning* of figures as *computation* based on TTE. If we see the output of a learner, i.e., an infinite sequence of hypotheses, as an infinite sequence encoding a figure, the learner can be viewed as a translator of codes of figures. Naturally, we can assume that the hypothesis space \(\mathcal {H}\) is a discrete topological space, that is, every hypothesis \(H \in \mathcal {H}\) is isolated and is an open set itself. Define the mapping \(\lim_{\mathcal {H}}: \mathcal {H}^{\omega}\to \mathcal {H}\), where \(\mathcal {H}^{\omega}\) is the set of infinite sequences of hypotheses in \(\mathcal {H}\), by \(\lim_{\mathcal {H}}(\tau) := H\) if *τ* is an infinite sequence of hypotheses that converges to *H*, i.e., there exists *n*∈ℕ such that *τ*(*i*)=*τ*(*n*) for all *i*≥*n*. This coincides with the *naïve Cauchy representation* given by Weihrauch (2000) and the \(\boldsymbol {\varSigma }_{2}^{0}\)-*admissible representation* of hypotheses introduced by de Brecht and Yamamoto (2009). For any set \(\mathcal {F}\subseteq \mathcal {K}^{*}\), let \(\mathcal {F}_{\mathrm {D}}\) denote the space \(\mathcal {F}\) equipped with the discrete topology, that is, every subset of \(\mathcal {F}\) is open, and let \(\mathrm {id}_{\mathcal {F}} : \mathcal {F}\to \mathcal {F}_{\mathrm {D}}\) be the identity on \(\mathcal {F}\). The computability of this identity is not trivial, since the topology of \(\mathcal {F}_{\mathrm {D}}\) is finer than that of \(\mathcal {F}\). Intuitively, this means that \(\mathcal {F}_{\mathrm {D}}\) is more informative than \(\mathcal {F}\). We can interpret learnability of \(\mathcal {F}\) as computability of the identity \(\mathrm {id}_{\mathcal {F}}\). The results in the following are summarized in Fig. 5.

## Theorem 7.5

*A set* \(\mathcal {F}\subseteq \mathcal {K}^{*}\) *is* **FIGEX**-**INF**-*learnable* (*resp*. **FIGEX**-**TXT**-*learnable*) *if and only if the identity* \(\mathrm {id}_{\mathcal {F}}\) *is* \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-*computable* (*resp*. \((\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}, \kappa \circ\lim_{\mathcal {H}})\)-*computable*).

## Proof

We only prove the case of **FIGEX**-**INF**-learning, since we can prove the case of **FIGEX**-**TXT**-learning in exactly the same way.

The "only if" part: Assume that there is a learner **M** that **FIGEX**-**INF**-learns \(\mathcal {F}\), hence for all \(K \in \mathcal {F}\) and all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\), **M** _{ σ } converges to a hypothesis \(H \in \mathcal {H}\) such that *κ*(*H*)=*K*. Thus
$$\mathrm {id}_{\mathcal {F}} \circ \eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}(\sigma ) = \kappa \circ \lim_{\mathcal {H}}(\text {\textbf {M}}_{\sigma }), \qquad (9)$$
meaning that \(\mathrm {id}_{\mathcal {F}}\) is \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable.

The "if" part: Suppose that for some **M** the above equation (9) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). Then **M** is a learner that **FIGEX**-**INF**-learns \(\mathcal {F}\). □

Here we consider two more learning criteria, **FIGFIN**-**INF**- and **FIGFIN**-**TXT**-learning, where the learner generates only one correct hypothesis and halts. This learning corresponds to *finite learning* or *one-shot learning*, introduced by Gold (1967) and Trakhtenbrot and Barzdin (1970), and is a special case of learning with a bound on *mind change complexity*, the number of changes of the hypothesis, introduced by Freivalds and Smith (1993) and used to measure the complexity of learning classes (Jain et al. 1999). We obtain the following theorem.

## Theorem 7.6

*A set* \(\mathcal {F}\subseteq \mathcal {K}^{*}\) *is* **FIGFIN**-**INF**-*learnable* (*resp*. **FIGFIN**-**TXT**-*learnable*) *if and only if the identity* \(\mathrm {id}_{\mathcal {F}}\) *is* (*η* _{INF},*κ*)-*computable* (*resp*. (*η* _{TXT},*κ*)-*computable*).

## Proof

We only prove the case of **FIGFIN**-**INF**-learning, since we can prove the case of **FIGFIN**-**TXT**-learning in exactly the same way.

The "only if" part: Assume that there is a learner **M** that **FIGFIN**-**INF**-learns \(\mathcal {F}\), hence for all \(K \in \mathcal {F}\) and all informants \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\) of *K*, the learner outputs a single hypothesis **M** _{ σ }=*H* such that *κ*(*H*)=*K* and halts. Thus we have
$$\mathrm {id}_{\mathcal {F}} \circ \eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}(\sigma ) = \kappa (\text {\textbf {M}}_{\sigma }), \qquad (10)$$
meaning that \(\mathrm {id}_{\mathcal {F}}\) is (*η* _{INF},*κ*)-computable.

The "if" part: Suppose that for some **M** the above equation (10) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). Then **M** is a learner that **FIGFIN**-**INF**-learns \(\mathcal {F}\). □

Finally, we show a connection between effective learning of figures and the computability of figures. Since **FIGEFEX**-**TXT**=∅ (Theorem 5.5), we only treat effective learning from informants. We define the representation \(\gamma:\subseteq \mathcal {H}^{\omega} \to \mathcal {K}^{*}\) by *γ*(*p*):=*K* if *p*=*H* _{0},*H* _{1},… such that \(H_{i} \in \mathcal {H}\) and *d* _{H}(*K*,*κ*(*H* _{ i }))≤2^{−i } for all *i*∈ℕ.
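For a concrete feel for *γ*, the following sketch produces a finite prefix of a *γ*-style name for the running target *K*=[0,1/2] with *d*=1; here we read *κ*(*H* _{ i }) simply as the union of the cells of the words in *H* _{ i }, and all names (`pos_words`, `gamma_name`, the target itself) are our illustration:

```python
from fractions import Fraction as F

LO, HI = F(0), F(1, 2)   # target figure K = [0, 1/2]

def pos_words(level):
    """Binary words w of the given length whose cell rho(w) meets K."""
    n = 2 ** level
    return frozenset(format(i, f'0{level}b') for i in range(n)
                     if F(i, n) <= HI and F(i + 1, n) >= LO)

def gamma_name(steps):
    """Finite prefix of a gamma-style name of K: the i-th hypothesis is the
    positive word set at level i+1, so the Hausdorff distance between K and
    the union of its cells is at most 2**-(i+1) <= 2**-i."""
    return [pos_words(i + 1) for i in range(steps)]

for i, H in enumerate(gamma_name(3)):
    print(i, sorted(H))
```

Each successive hypothesis halves the cell size, so the error bound *d* _{H}(*K*,*κ*(*H* _{ i }))≤2^{−i } required by *γ* holds by construction.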

## Lemma 7.7

*γ*≡*κ* _{H}.

## Proof

First we prove *γ*≤*κ* _{H}. Take a function *g*:ℕ→ℕ such that \(\sqrt{d} \cdot 2^{-g(i)} < 2^{-i}\) for all *i*∈ℕ. Then there exists a computable function *f* such that, for all \(p \in \operatorname {dom}(\gamma)\), *γ*(*p*)=*κ* _{H}(*f*(*p*)). The sequence *f*(*p*) is a *κ* _{H}-representation since, for an infinite sequence of hypotheses *p*=*H* _{0},*H* _{1},…, all *f* has to do is to generate an infinite sequence *q*=*w* _{0}♢*w* _{1}♢*w* _{2}♢⋯ such that \(w_{i} = \iota(H_{g(i)}^{g(i)})\) for all *i*∈ℕ, which results in \(d_{\mathrm {H}}(\nu_{\mathcal {Q}}(w_{i}), K) \le 2^{-i}\) for all *i*∈ℕ.

Next we prove *κ* _{H}≤*γ*. Fix \(q \in \operatorname {dom}(\kappa_{\mathrm {H}})\) with *q*=*w* _{0}♢*w* _{1}♢⋯. For each *i*∈ℕ, let *w* _{ i }=*ι*(*w* _{ i,0},*w* _{ i,1},…,*w* _{ i,n }). Then the set {*w* _{ i,0},…,*w* _{ i,n }}, which we denote by *H* _{ i }, becomes a hypothesis. From the definition of *κ* _{H}, \(d_{\mathrm {H}}(K, \kappa (H_{i})) \le 2^{-i}\) for all *i*∈ℕ. This means that, for the sequence *p*=*H* _{0},*H* _{1},…, *γ*(*p*)=*K*. We therefore have *γ*≡*κ* _{H}. □

By using this lemma, we interpret effective learning of figures as the computability of two identities (Fig. 5).

## Theorem 7.8

*A set* \(\mathcal {F}\subseteq \mathcal {K}^{*}\) *is* **FIGEFEX**-**INF**-*learnable if and only if there exists a computable function* *f* *such that* *f* *is a* \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-*realization of the identity* \(\mathrm {id}_{\mathcal {F}}\), *and* *f* *is also a* (*η* _{INF},*γ*)-*realization of the identity* \(\mathrm {id}: \mathcal {K}^{*} \to \mathcal {K}^{*}\).

## Proof

We prove the latter half of the theorem, since the former part can be proved exactly as for Theorem 7.5.

The "only if" part: Assume that a learner **M** **FIGEFEX**-**INF**-learns \(\mathcal {F}\). For all \(K \in \mathcal {K}^{*}\) and all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\), \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) \le 2^{-i}\) for all *i*∈ℕ, hence *γ*(**M** _{ σ })=*K*. Thus \(\mathrm {id} \circ \eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}(\sigma ) = \gamma (\text {\textbf {M}}_{\sigma })\), meaning that id is (*η* _{INF},*γ*)-computable.

The “if” part: For some **M**, id∘*η* _{INF}(*σ*)=*γ*(**M** _{ σ }) for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). It follows that **M** is a learner that **FIGEFEX**-**INF**-learns \(\mathcal {F}\). □

Thus in **FIGEFEX**-**INF**- and **FIGEFEX**-**TXT**-learning of a set of figures \(\mathcal {F}\), a learner **M** outputs a hypothesis *H* with *κ*(*H*)=*K* in finite time if \(K \in \mathcal {F}\), and outputs a "standard" representation of *K* (since *γ*≡*κ* _{H} by Lemma 7.7) if \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). Informally, this means that not much information about a figure is lost even if the figure is not explanatorily learnable.

## 8 Conclusion

We have proposed learning of figures using self-similar sets based on Gold's learning model, towards a new theoretical framework for binary classification that focuses on computability, and demonstrated a learnability hierarchy under various learning criteria (Fig. 3). The key to the computable approach is the amalgamation of the discretization of data and the learning process. We showed a novel mathematical connection between fractal geometry and Gold's model by bounding the size of training data from below using the Hausdorff dimension and the VC dimension. Furthermore, we analyzed our learning model using TTE (Type-2 Theory of Effectivity) and presented several mathematical connections between computability and learnability.

Many recent methods in machine learning are based on a statistical approach (Bishop 2007). The reason is that much real-world data comes in analog (real-valued) form, and the statistical approach can, in theory, treat such analog data directly. However, all learning methods are performed on computers. This means that all machine learning algorithms actually treat discretized digital data, yet most research pays no attention to the gap between analog and digital data. In this paper we have proposed a novel and completely computable learning method for analog data, and have analyzed the method precisely. This work provides a theoretical foundation for computable learning from analog data in tasks such as classification, regression, and clustering.


### Acknowledgements

The authors sincerely thank the editor and the anonymous reviewers for their many useful comments and suggestions, which have led to invaluable improvements of this paper. This work was partly supported by Grant-in-Aid for Scientific Research (A) 22240010 and for JSPS Fellows 22⋅5714.

## References

- Angluin, D. (1980). Inductive inference of formal languages from positive data. *Information and Control*, *45*(2), 117–135.
- Angluin, D. (1982). Inference of reversible languages. *Journal of the ACM*, *29*(3), 741–765.
- Apsītis, K., Arikawa, S., Freivalds, R., Hirowatari, E., & Smith, C. H. (1999). On the inductive inference of recursive real-valued functions. *Theoretical Computer Science*, *219*(1–2), 3–12.
- Baird, D. C. (1994). *Experimentation: an introduction to measurement theory and experiment design* (3rd ed.). Redwood City: Benjamin Cummings.
- Barnsley, M. F. (1993). *Fractals everywhere* (2nd ed.). San Mateo: Morgan Kaufmann.
- Barzdin, Y. M. (1974). Inductive inference of automata, languages and programs. In *Proceedings of the international congress of mathematicians* (Vol. 2, pp. 455–460) (in Russian).
- Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? *Neural Computation*, *1*(1), 151–160.
- Beer, G. A. (1993). *Mathematics and its applications: Vol. 268. Topologies on closed and closed convex sets*. Dordrecht: Kluwer Academic.
- Ben-David, S., & Dichterman, E. (1998). Learning with restricted focus of attention. *Journal of Computer and System Sciences*, *56*(3), 277–298.
- Bishop, C. (2007). *Pattern recognition and machine learning (information science and statistics)*. Berlin: Springer.
- Blum, L., & Blum, M. (1975). Toward a mathematical theory of inductive inference. *Information and Control*, *28*(2), 125–155.
- Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. *Journal of the ACM*, *36*(4), 929–965.
- Brattka, V., & Presser, G. (2003). Computability on subsets of metric spaces. *Theoretical Computer Science*, *305*(1–3), 43–76.
- Brattka, V., & Weihrauch, K. (1999). Computability on subsets of Euclidean space I: closed and compact subsets. *Theoretical Computer Science*, *219*(1–2), 65–93.
- Büchi, J. R. (1960). On a decision method in restricted second order arithmetic. In *Proceedings of international congress on logic, methodology and philosophy of science* (pp. 1–12).
- de Brecht, M. (2010). *Topological and algebraic aspects of algorithmic learning theory*. PhD thesis, Graduate School of Informatics, Kyoto University.
- de Brecht, M., & Yamamoto, A. (2009). \(\varSigma^{0}_{\alpha}\)-admissible representations. In *Proceedings of the 6th international conference on computability and complexity in analysis*.
- De La Higuera, C., & Janodet, J. C. (2001). Inference of *ω*-languages from prefixes. In N. Abe, R. Khardon, & T. Zeugmann (Eds.), *Lecture notes in computer science: Vol. 2225. Algorithmic learning theory* (pp. 364–377). Berlin: Springer.
- Decatur, S. E., & Gennaro, R. (1995). On learning from noisy and incomplete examples. In *Proceedings of the 8th annual conference on computational learning theory* (pp. 353–360).
- Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. *Information and Computation*, *82*(3), 247–261.
- Elomaa, T., & Rousu, J. (2003). Necessary and sufficient pre-processing in numerical range discretization. *Knowledge and Information Systems*, *5*(2), 162–182.
- Falconer, K. (2003). *Fractal geometry: mathematical foundations and applications*. New York: Wiley.
- Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In *Proceedings of the 13th international joint conference on artificial intelligence* (pp. 1022–1029).
- Federer, H. (1996). *Geometric measure theory*. Berlin: Springer.
- Freivalds, R., & Smith, C. H. (1993). On the role of procrastination in machine learning. *Information and Computation*, *107*(2), 237–271.
- Gama, J., & Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In *Proceedings of the 21st annual ACM symposium on applied computing* (pp. 23–27).
- Gold, E. M. (1965). Limiting recursion. *The Journal of Symbolic Logic*, *30*(1), 28–48.
- Gold, E. M. (1967). Language identification in the limit. *Information and Control*, *10*(5), 447–474.
- Goldman, S. A., Kwek, S. S., & Scott, S. D. (2003). Learning from examples with unspecified attribute values. *Information and Computation*, *180*(2), 82–100.
- Hirowatari, E., & Arikawa, S. (1997). Inferability of recursive real-valued functions. In M. Li & A. Maruoka (Eds.), *Lecture notes in computer science: Vol. 1316. Algorithmic learning theory* (pp. 18–31). Berlin: Springer.
- Hirowatari, E., & Arikawa, S. (2001). A comparison of identification criteria for inductive inference of recursive real-valued functions. *Theoretical Computer Science*, *268*(2), 351–366.
- Hirowatari, E., Hirata, K., Miyahara, T., & Arikawa, S. (2003). Criteria for inductive inference with mind changes and anomalies of recursive real-valued functions. *IEICE Transactions on Information and Systems*, *86*(2), 219–227.
- Hirowatari, E., Hirata, K., Miyahara, T., & Arikawa, S. (2005). Refutability and reliability for inductive inference of recursive real-valued functions. *IPSJ Digital Courier*, *1*, 141–152.
- Hirowatari, E., Hirata, K., & Miyahara, T. (2006). Prediction of recursive real-valued functions from finite examples. In T. Washio, A. Sakurai, K. Nakajima, H. Takeda, S. Tojo, & M. Yokoo (Eds.),
*Lecture notes in computer science: Vol.**4012*.*New frontiers in artificial intelligence*(pp. 224–234). Berlin: Springer. CrossRefGoogle Scholar - Jain, S. (2011). Hypothesis spaces for learning.
*Information and Computation*,*209*(3), 513–527. MathSciNetMATHCrossRefGoogle Scholar - Jain, S., & Sharma, A. (1997). Elementary formal systems, intrinsic complexity, and procrastination.
*Information and Computation*,*132*(1), 65–84. MathSciNetMATHCrossRefGoogle Scholar - Jain, S., Osherson, D., Royer, S., & Sharma, A. (1999).
*Systems that learn*(2nd ed.). Cambridge: MIT Press. Google Scholar - Jain, S., Kinber, E., Wiehagen, R., & Zeugmann, T. (2001). Learning recursive functions refutably. In N. Abe, R. Khardon, & T. Zeugmann (Eds.),
*Lecture notes in computer science: Vol.**2225*.*Algorithmic learning theory*(pp. 283–298). CrossRefGoogle Scholar - Jain, S., Luo, Q., Semukhin, P., & Stephan, F. (2011). Uncountable automatic classes and learning.
*Theoretical Computer Science*,*412*(19), 1805–1820. MathSciNetMATHCrossRefGoogle Scholar - Jantke, K. P. (1991). Monotonic and non-monotonic inductive inference.
*New Generation Computing*,*8*(4), 349–360. MATHCrossRefGoogle Scholar - Kearns, M. J., & Vazirani, U. V. (1994).
*An introduction to computational learning theory*. Cambridge: MIT Press. Google Scholar - Kechris, A. S. (1995).
*Classical descriptive set theory*. Berlin: Springer. MATHCrossRefGoogle Scholar - Khardon, R., & Roth, D. (1999). Learning to reason with a restricted view.
*Machine Learning*,*35*(2), 95–116. CrossRefGoogle Scholar - Kinber, E. (1994). Monotonicity versus efficiency for learning languages from texts. In
*Lecture notes in computer science: Vol.**872*.*Algorithmic learning theory*(pp. 395–406). Berlin: Springer. CrossRefGoogle Scholar - Kobayashi, S. (1996).
*Approximate identification, finite elasticity and lattice structure of hypothesis space*(Tech. Rep. CSIM 96-04). Department of Computer Science and Information Mathematics, The University of Electro-Communications. Google Scholar - Kontkanen, P., Myllymäki, P., Silander, T., & Tirri, H. (1997). A Bayesian approach to discretization. In
*Proceedings of the European symposium on intelligent techniques*(pp. 265–268). Google Scholar - Lange, S., & Zeugmann, T. (1993). Monotonic versus non-monotonic language learning. In
*Lecture notes in computer science: Vol.**659*.*Nonmonotonic and inductive logic*(pp. 254–269). Berlin: Springer. CrossRefGoogle Scholar - Lange, S., & Zeugmann, T. (1994). Characterization of language learning front informant under various monotonicity constraints.
*Journal of Experimental and Theoretical Artificial Intelligence*,*6*(1), 73–94. MATHCrossRefGoogle Scholar - Lange, S., Zeugmann, T., & Zilles, S. (2008). Learning indexed families of recursive languages from positive data: a survey.
*Theoretical Computer Science*,*397*(1–3), 194–232. MathSciNetMATHCrossRefGoogle Scholar - Li, M., Chen, X., Li, X., Ma, B., & Vitányi, P. (2003). The similarity metric. In
*Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms*(pp. 863–872). Google Scholar - Lin, J., Keogh, E., Lonardi, S., & Chiu, B. (2003). A symbolic representation of time series, with implications for streaming algorithms. In
*Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery*(pp. 1–11). Google Scholar - Liu, H., Hussain, F., Tan, L., & Dash, M. (2002). Discretization: an enabling technique.
*Data Mining and Knowledge Discovery*,*6*(4), 393–423. MathSciNetCrossRefGoogle Scholar - Long, P. M., & Tan, L. (1998). PAC learning axis-aligned rectangles with respect to product distributions from multiple-instance examples.
*Machine Learning*,*30*(1), 7–21. MATHCrossRefGoogle Scholar - Mandelbrot, B. B. (1982).
*The fractal geometry of nature*. San Francisco: W.H. Freeman. MATHGoogle Scholar - Merkle, W., & Stephan, F. (2003). Refuting learning revisited.
*Theoretical Computer Science*,*298*(1), 145–177. MathSciNetMATHCrossRefGoogle Scholar - Michael, L. (2010). Partial observability and learnability.
*Artificial Intelligence*,*174*(11), 639–669. MathSciNetMATHCrossRefGoogle Scholar - Michael, L. (2011). Missing information impediments to learnability. In
*24th annual conference on learning theory*(pp. 1–2). Google Scholar - Minicozzi, E. (1976). Some natural properties of strong-identification in inductive inference.
*Theoretical Computer Science*,*2*(3), 345–360. MathSciNetMATHCrossRefGoogle Scholar - Motoki, T., Shinohara, T., & Wright, K. (1991). The correct definition of finite elasticity: corrigendum to identification of unions. In
*Proceedings of the 4th annual workshop on computational learning theory*(p. 375). Google Scholar - Mukouchi, Y., & Arikawa, S. (1995). Towards a mathematical theory of machine discovery from facts.
*Theoretical Computer Science*,*137*(1), 53–84. MathSciNetMATHCrossRefGoogle Scholar - Mukouchi, Y., & Sato, M. (2003). Refutable language learning with a neighbor system.
*Theoretical Computer Science*,*298*(1), 89–110. MathSciNetMATHCrossRefGoogle Scholar - Müller, N. (2001). The iRRAM: exact arithmetic in C++. In J. Blanck, V. Brattka, & P. Hertling (Eds.),
*Lecture notes in computer science:*Vol.*2064*.*Computability and complexity in analysis*(pp. 222–252). Berlin: Springer. CrossRefGoogle Scholar - Perrin, D., & Pin, J.E. (2004).
*Infinite words*. Amsterdam: Elsevier. MATHGoogle Scholar - Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.
*Psychological Review*,*65*(6), 386–408. MathSciNetCrossRefGoogle Scholar - Sakurai, A. (1991). Inductive inference of formal languages from positive data enumerated primitive-recursively. In
*Algorithmic learning theory, JSAI*(pp. 73–83). Google Scholar - Schröder, M. (2002a).
*Admissible representations for continuous computations*. PhD thesis. dem Fachbereich Informatik, der FernUniversität – Gesamthochschule in Hagen. Google Scholar - Schröder, M. (2002b). Extended admissibility.
*Theoretical Computer Science*,*284*(2), 519–538. MathSciNetMATHCrossRefGoogle Scholar - Shapiro, E. Y. (1981).
*Inductive inference of theories from facts*(Tech. rep). Department of Computer Science, Yale University. Google Scholar - Shapiro, E. Y. (1983).
*Algorithmic program debugging*. Cambridge: MIT Press. Google Scholar - Skubacz, M., & Hollmén, J. (2000). Quantization of continuous input variables for binary classification. In
*Lecture notes in computer science: Vol.**1983*.*Intelligent data engineering and automated learning—IDEAL 2000. Data mining, financial engineering, and intelligent agents*(pp. 42–47). Berlin: Springer. CrossRefGoogle Scholar - Sugiyama, M., & Yamamoto, A. (2010). The coding divergence for measuring the complexity of separating two sets. In
*JMLR workshop and conference proceedings: Vol.**13*.*Proceedings of 2nd Asian conference on machine learning*(pp. 127–143). Google Scholar - Sugiyama, M., Hirowatari, E., Tsuiki, H., & Yamamoto, A. (2006). Learning from real-valued data with the model inference mechanism through the Gray-code embedding. In
*Proceedings of 4th workshop on learning with logics and logics for learning (LLLL2006)*(pp. 31–37). Google Scholar - Sugiyama, M., Hirowatari, E., Tsuiki, H., & Yamamoto, A. (2009). Learning figures with the Hausdorff metric by self-similar sets. In
*Proceedings of 6th workshop on learning with logics and logics for learning (LLLL2009)*(pp. 27–34). Google Scholar - Sugiyama, M., Hirowatari, E., Tsuiki, H., & Yamamoto, A. (2010). Learning figures with the Hausdorff metric by fractals. In M. Hutter, F. Stephan, V. Vovk, & T. Zeugmann (Eds.),
*Lecture notes in computer science: Vol.**6331*.*Algorithmic learning theory*(pp. 315–329). Canberra: Springer. CrossRefGoogle Scholar - Tavana, N. R., & Weihrauch, K. (2011). Turing machines on represented sets, a model of computation for analysis.
*Logical Methods in Computer Science*,*7*(2), 1–21. MathSciNetCrossRefGoogle Scholar - Trakhtenbrot, B., & Barzdin, Y. M. (1970). Konetschnyje awtomaty (powedenie i sintez). English translation: Finite automata-behavior and synthesis.
*Fundamental Studies in Computer Science*,*1*, 1975. Google Scholar - Turing, A. M. (1937). On computable numbers, with the application to the entscheidungsproblem.
*Proceedings of the London Mathematical Society*,*1*(42), 230–265. MathSciNetCrossRefGoogle Scholar - Valiant, L. G. (1984). A theory of the learnable.
*Communications of the ACM*,*27*(11), 1134–1142. MATHCrossRefGoogle Scholar - Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities.
*Theory of Probability and Its Applications*,*16*(2), 264–280. MathSciNetMATHCrossRefGoogle Scholar - Weihrauch, K. (2000).
*Computable analysis: an introduction*. Berlin: Springer. MATHGoogle Scholar - Weihrauch, K. (2008). The computable multi-functions on multi-represented sets are closed under programming.
*Journal of Universal Computer Science*,*14*(6), 801–844. MathSciNetMATHGoogle Scholar - Weihrauch, K., & Grubba, T. (2009). Elementary computable topology.
*Journal of Universal Computer Science*,*15*(6), 1381–1422. MathSciNetMATHGoogle Scholar - Wiehagen, R. (1991). A thesis in inductive inference. In J. Dix, K. P. Jantke, & P. H. Schmitt (Eds.),
*Lecture notes in computer science: Vol.**543*.*Nonmonotonic and inductive logic*(pp. 184–207). Berlin: Springer. CrossRefGoogle Scholar - Wright, K. (1989). Identification of unions of languages drawn from an identifiable class. In
*Proceedings of the 2nd annual workshop on computational learning theory*(pp. 328–333). Google Scholar - Zeugmann, T., & Zilles, S. (2008). Learning recursive functions: a survey.
*Theoretical Computer Science*,*397*(1–3), 4–56. MathSciNetMATHCrossRefGoogle Scholar - Zeugmann, T., Lange, S., & Kapur, S. (1995). Characterizations of monotonic and dual monotonic language learning.
*Information and Computation*,*120*(2), 155–173. MathSciNetMATHCrossRefGoogle Scholar