Learning figures with the Hausdorff metric by fractals—towards computable binary classification
DOI: 10.1007/s10994-012-5301-z
- Cite this article as:
- Sugiyama, M., Hirowatari, E., Tsuiki, H. et al. Mach Learn (2013) 90: 91. doi:10.1007/s10994-012-5301-z
Abstract
We present learning of figures, nonempty compact sets in Euclidean space, based on Gold’s learning model aiming at a computable foundation for binary classification of multivariate data. Encoding real vectors with no numerical error requires infinite sequences, resulting in a gap between each real vector and its discretized representation used for the actual machine learning process. Our motivation is to provide an analysis of machine learning problems that explicitly tackles this aspect which has been glossed over in the literature on binary classification as well as in other machine learning tasks such as regression and clustering. In this paper, we amalgamate two processes: discretization and binary classification. Each learning target, the set of real vectors classified as positive, is treated as a figure. A learning machine receives discretized vectors as input data and outputs a sequence of discrete representations of the target figure in the form of self-similar sets, known as fractals. The generalization error of each output is measured by the Hausdorff metric. Using this learning framework, we reveal a hierarchy of learnable classes under various learning criteria in the track of traditional analysis based on Gold’s learning model, and show a mathematical connection between machine learning and fractal geometry by measuring the complexity of learning using the Hausdorff dimension and the VC dimension. Moreover, we analyze computability aspects of learning of figures using the framework of Type-2 Theory of Effectivity (TTE).
Keywords
Binary classification · Discretization · Self-similar set · Gold’s learning model · Hausdorff metric · Type-2 theory of effectivity
1 Introduction
Discretization is a fundamental process in machine learning from analog data. For example, Fourier analysis is one of the most essential signal processing methods and its discrete version, discrete Fourier analysis, is used for learning or recognition on a computer from continuous signals. However, in the method, only the direction of the time axis is discretized, so each data point is not purely discretized. That is to say, continuous (electrical) waves are essentially treated as finite/infinite sequences of real numbers, hence each value is still continuous (analog). The gap between analog and digital data therefore remains.
This problem appears all over machine learning from observed multivariate data. The reason is that an infinite sequence is needed to encode a real vector exactly without any numerical error, since the cardinality of the set of real numbers, which is the same as that of infinite sequences, is much larger than that of the set of finite sequences. Thus to treat each data point on a computer, it has to be discretized and considered as an approximate value with some numerical error. However, to date, most machine learning algorithms ignore the gap between the original value and its discretized representation. This gap could result in some unexpected numerical errors.^{1} Since now machine learning algorithms can be applied to massive datasets, it is urgent to give a theoretical foundation for learning, such as classification, regression, and clustering, from multivariate data, in a fully computational manner to guarantee the soundness of the results of learning.
In the field of computational learning theory, Valiant’s learning model (also called PAC, Probably Approximately Correct, learning model), proposed by Valiant (1984), is used for theoretical analysis of machine learning algorithms. In this model, we can analyze the robustness of a learning algorithm in the face of noise or inaccurate data and the complexity of learning with respect to the rate of convergence or the size of the input using the concept of probability. Blumer et al. (1989) and Ehrenfeucht et al. (1989) provided the crucial conditions for learnability, that is, the lower and upper bounds for the sample size, using the VC (Vapnik-Chervonenkis) dimension (Vapnik and Chervonenkis 1971). These results can be applied to various concept representations that handle real-valued inputs and use real-valued parameters, for example, to analyze learning of neural networks (Baum and Haussler 1989). However, this learning model is not in line with discrete and computational analysis of machine learning. We cannot know which class of continuous objects is exactly learnable and what kind of data are needed to learn from a finite expression of discretized multivariate data. Although PAC learning from axis-parallel rectangles has already been investigated (Blumer et al. 1989; Kearns and Vazirani 1994; Long and Tan 1998), which can be viewed as a variant of learning from multivariate data with numerical error, it is not applicable in the study. Our goal is to investigate computational learning, focusing on a common ground between “learning” and “computation” of real numbers based on the behavior of Turing machines, without any reference to probability distributions. For the purpose of the investigation, we need to distinguish abstract mathematical objects such as real numbers and their concrete representations, or codes, on a computer.
Instead, in this paper we use Gold’s learning model (also called identification in the limit), which was originally designed for learning of recursive functions (Gold 1965) and languages (Gold 1967). In the model, a learning machine is assumed to be a procedure, i.e., a Turing machine (Turing 1937) which never halts, that receives training data from time to time, and outputs representations (hypotheses) of the target from time to time. All data are usually assumed to be given at some point in the future. Starting from this learning model, learnability of classes of discrete objects, such as languages and recursive functions, has been analyzed in detail under various learning criteria (Jain et al. 1999). However, analysis of learning for continuous objects, such as classification, regression, and clustering for multivariate data, with Gold’s model is still under development, despite such settings being typical in modern machine learning. To the best of our knowledge, the only line of studies devoted to learning of real-valued functions was by Hirowatari and Arikawa (1997, 2001), Apsītis et al. (1999), and Hirowatari et al. (2003, 2005, 2006), who addressed the analysis of learnable classes of real-valued functions using computable representations of real numbers.^{2} We therefore need a new theoretical and computational framework for modern machine learning based on Gold’s learning model with discretization of numerical data.
In this paper we consider the problem of binary classification for multivariate data, which is one of the most fundamental problems in machine learning and pattern recognition. In this task, a training dataset consists of a set of pairs {(x_{1},y_{1}),(x_{2},y_{2}),…,(x_{n},y_{n})}, where x_{i}∈ℝ^{d} is a feature vector, y_{i}∈{0,1} is a label, and the d-dimensional Euclidean space ℝ^{d} is a feature space. The goal is to learn a classifier from the given training dataset, that is, to find a mapping h:ℝ^{d}→{0,1} such that, for all x∈ℝ^{d}, h(x) is expected to be the same as the true label of x. In other words, such a classifier h is the characteristic function of a subset L={x∈ℝ^{d}∣h(x)=1} of ℝ^{d}, which has to be similar to the true set K={x∈ℝ^{d}∣the true label of x is 1} as far as possible. Throughout the paper, we assume that each feature is normalized by some data preprocessing such as min-max normalization for simplicity, that is, the feature space is the unit interval (cube) \(\mathcal {I}^{d} = [0, 1] \times \dots\times[0, 1]\) in the d-dimensional Euclidean space ℝ^{d}. In many realistic scenarios, each target K is a closed and bounded subset of \(\mathcal {I}^{d}\), i.e., a nonempty compact subset of \(\mathcal {I}^{d}\), called a figure. Thus here we address the problem of binary classification by treating it as “learning of figures”.
In this machine learning process, we implicitly treat any feature vector through its representation, or code on a computer, that is, each feature vector \(x \in \mathcal {I}^{d}\) is represented by a sequence p over some alphabet Σ using an encoding scheme ρ. Here such a surjective mapping ρ is called a representation and should map the set of “infinite” sequences Σ^{ω} to \(\mathcal {I}^{d}\) since there is no one-to-one correspondence between finite sequences and real numbers (or real vectors). In this paper, we use the binary representation ρ:Σ^{ω}→[0,1] with Σ={0,1}, which is defined by ρ(p):=∑_{i=0}^{∞}p_{i}⋅2^{−(i+1)} for an infinite sequence p=p_{0}p_{1}p_{2}…. For example, ρ(0100…)=0.25, ρ(1000…)=0.5, and ρ(0111…)=0.5. However, we cannot treat infinite sequences on a computer in finite time and, instead, we have to use discretized values, i.e., truncated finite sequences in any actual machine learning process. Thus in learning of a classifier h for the target figure K, we cannot use an exact data point x∈K but have to use a discretized finite sequence w∈Σ^{∗} which tells us that x takes one of the values in the set {ρ(p)∣w⊏p} (w⊏p means that w is a prefix of p). For instance, if w=01, then x should be in the interval [0.25,0.5]. For a finite sequence w∈Σ^{∗}, we define ρ(w):={ρ(p)∣w⊏p with p∈Σ^{ω}} using the same symbol ρ. From a geometric point of view, ρ(w) means a hyper-rectangle whose sides are parallel to the axes in the space \(\mathcal {I}^{d}\). For example, for the binary representation ρ, we have ρ(0)=[0,0.5], ρ(1)=[0.5,1], ρ(01)=[0.25,0.5], and so on. Therefore in the actual learning process, while a target set K and each point x∈K exist mathematically, a learning machine can only treat finite sequences as training data.
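The interval semantics of finite sequences can be made concrete with a small sketch for the one-dimensional case. The function names `rho_prefix` and `rho_eventually_constant` are hypothetical and chosen only for this illustration; exact rational arithmetic is used so that no additional rounding error is introduced.

```python
from fractions import Fraction

def rho_prefix(w: str):
    """Closed interval rho(w) = {rho(p) | w is a prefix of p} for a finite binary string w."""
    low = sum((Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(w)), Fraction(0))
    # Every infinite extension of w adds at most 2^(-|w|) to the value of w.
    return low, low + Fraction(1, 2 ** len(w))

def rho_eventually_constant(w: str, tail: str) -> Fraction:
    """Value rho(w ttt...) where the single symbol t = tail ('0' or '1') is repeated forever."""
    low, high = rho_prefix(w)
    return low if tail == "0" else high

# The examples from the text: rho(0100...) = 0.25 and rho(1000...) = rho(0111...) = 0.5.
examples = [rho_eventually_constant("01", "0"),
            rho_eventually_constant("1", "0"),
            rho_eventually_constant("0", "1")]
```

The last two values coincide, which is exactly the double representation of 0.5 that the text refers to.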
Here the problem of binary classification is stated in a computational manner as follows: Given a training dataset {(w_{1},y_{1}),(w_{2},y_{2}),…,(w_{n},y_{n})} (w_{i}∈Σ^{∗} for each i∈{1,2,…,n}), where y_{i}=1 if \(\rho(w_{i}) \cap K \not= \emptyset\) for a target figure \(K \subseteq \mathcal {I}^{d}\) and y_{i}=0 otherwise, learn a classifier h:Σ^{∗}→{0,1} for which h(w) should be the same as the true label of w for all w∈Σ^{∗}. Each training datum (w_{i},y_{i}) is called a positive example if y_{i}=1 and a negative example if y_{i}=0.
Assume that a figure K is represented by a set P of infinite sequences, i.e., {ρ(p)∣p∈P}=K, using the binary representation ρ. Then learning the figure is different from learning the well-known prefix closed set Pref(P), defined as Pref(P):={w∈Σ^{∗}∣w⊏p for some p∈P}, since generally \(\mathrm {Pref}(P) \not= \{w \in\varSigma^{*} \mid\rho(w) \cap K \not= \emptyset\}\) holds. For example, if P={p∈Σ^{ω}∣1⊏p}, the corresponding figure K is the interval [0.5,1]. Then, the infinite sequence 0111… is a positive example since ρ(0111…)=0.5 and \(\rho(\mathtt {0}\mathtt {1}\mathtt {1}\mathtt {1}\dots) \cap K \not= \emptyset\), but it is not contained in Pref(P). This problem is fundamentally due to rational numbers having two representations, for example, both 0111… and 1000… represent 0.5. Solving this mismatch between objects of learning and their representations is one of the challenging problems of learning continuous objects based on their representation in a computational manner.
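The mismatch between Pref(P) and the set of positive examples can be checked by brute force for the example K=[0.5,1]. The sketch below is only an illustration for d=1; the helper names (`is_positive`, `in_pref`, `mismatches`) are hypothetical, and finite sequences are enumerated only up to a small length.

```python
from fractions import Fraction
from itertools import product

def rho_prefix(w: str):
    """Closed interval rho(w) for a finite binary string w."""
    low = sum((Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(w)), Fraction(0))
    return low, low + Fraction(1, 2 ** len(w))

def is_positive(w: str, k_low: Fraction, k_high: Fraction) -> bool:
    """True iff rho(w) intersects the closed interval K = [k_low, k_high]."""
    low, high = rho_prefix(w)
    return low <= k_high and k_low <= high

def in_pref(w: str) -> bool:
    """Membership in Pref(P) for P = {p | p starts with 1}."""
    return w == "" or w.startswith("1")

def mismatches(max_len: int):
    """Finite sequences up to max_len on which Pos(K) and Pref(P) disagree, for K = [1/2, 1]."""
    k = (Fraction(1, 2), Fraction(1))
    words = ["".join(bits) for n in range(max_len + 1) for bits in product("01", repeat=n)]
    return [w for w in words if is_positive(w, *k) != in_pref(w)]
```

For `max_len = 3` the disagreeing sequences are 0, 01, and 011, i.e., the finite prefixes of 0111… that touch K only at the boundary point 0.5.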
For finite expression of classifiers, we use self-similar sets known as fractals (Mandelbrot 1982) to exploit their simplicity and the power of expression theoretically provided by the field of fractal geometry. Specifically, we can approximate any figure by some self-similar set arbitrarily closely (derived from the Collage Theorem given by Falconer 2003) and can compute it by a simple recursive algorithm, called an IFS (Iterated Function System) (Barnsley 1993; Falconer 2003). This approach can be viewed as the analog of the discrete Fourier analysis, where FFT (Fast Fourier Transformation) is used as the fundamental recursive algorithm. Moreover, in the process of sampling from analog data in discrete Fourier analysis, scalability is a desirable property. It requires that when the sample resolution increases, the accuracy of the result is monotonically refined. We formalize this property as effective learning of figures, which is inspired by effective computing in the framework of Type-2 Theory of Effectivity (TTE) studied in computable analysis (Schröder 2002b; Weihrauch 2000). This model guarantees that as a computer reads more and more precise information of the input, it produces more and more accurate approximations of the result. Here we adapt this model from computation to learning, where if a learner (learning machine) receives more and more accurate training data, it learns better and better classifiers (self-similar sets) approximating the target figure.
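As a generic illustration of how an IFS computes a self-similar set recursively, the following sketch iterates a finite set of affine contractions on the unit interval. It is not the construction used in this paper: the encoding of a contraction as a pair (scale, shift) and the names `apply_ifs` and `approximate` are assumptions made for the example, which uses the middle-third Cantor set.

```python
from fractions import Fraction

def apply_ifs(contractions, intervals):
    """One IFS step: the images of every interval under every contraction x -> scale * x + shift."""
    return sorted({(s * lo + t, s * hi + t) for (s, t) in contractions for (lo, hi) in intervals})

def approximate(contractions, depth: int):
    """Level-`depth` approximation of the self-similar set, starting from the unit interval."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(depth):
        intervals = apply_ifs(contractions, intervals)
    return intervals

# Middle-third Cantor set: x -> x/3 and x -> x/3 + 2/3 (positive scales assumed).
cantor = [(Fraction(1, 3), Fraction(0)), (Fraction(1, 3), Fraction(2, 3))]
level2 = approximate(cantor, 2)
```

Each additional level refines the previous one, so the approximations form a decreasing sequence of covers of the limit set.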
The main contributions of this paper are as follows:
1. We formalize the learning of figures using self-similar sets based on Gold’s learning model towards realizing fully computable binary classification (Sect. 3). We construct a representational system for learning using self-similar sets based on the binary representation of real numbers, and show desirable properties of it (Lemmas 3.2, 3.3, and 3.4).
2. We construct a learnability hierarchy under various learning criteria, summarized in Fig. 3 (Sects. 4 and 5). We consider five criteria for learning: explanatory learning (Sect. 4.1), consistent learning (Sect. 4.2), reliable and refutable learning (Sect. 4.3), and effective learning (Sect. 5).
3. We show a mathematical connection between learning and fractal geometry by measuring the complexity of learning using the Hausdorff dimension and the VC dimension (Sect. 6). Specifically, we give a lower bound on the number of positive examples using the dimensions.
4. We also show a connection between computability of figures studied in computable analysis and learnability of figures discussed in this paper using TTE (Sect. 7). Learning can be viewed as computable realization of the identity from the set of figures to the same set equipped with a finer topology.
The rest of the paper is organized as follows: We review related work in comparison to the present work in Sect. 2. We formalize computable binary classification as learning of figures in Sect. 3 and analyze the learnability hierarchy induced by variants of our model in Sects. 4 and 5. The mathematical connection between fractal geometry and Gold’s model with the Hausdorff and the VC dimensions is presented in Sect. 6 and between computability and learnability of figures in Sect. 7. Section 8 gives the conclusion.
A preliminary version of this paper was presented at the 21st International Conference on Algorithmic Learning Theory (Sugiyama et al. 2010). In this paper, formalization of learning in Sect. 3 is completely updated for clarity and simplicity, and all theorems and lemmas have formal proofs (they were omitted in the conference paper). Furthermore, discussion about related work in Sect. 2 and TTE analysis in Sect. 7 are new contributions. In addition, several examples and figures are added for readability.
2 Related work
Statistical approaches to machine learning are now achieving great success since they are originally designed for analyzing observed multivariate data and, to date, many statistical methods have been proposed to treat continuous objects such as real-valued functions (Bishop 2007). However, most methods pay no attention to discretization and the finite representation of analog data on a computer. For example, multi-layer perceptrons are used to learn real-valued functions, since they can approximate every continuous function arbitrarily and accurately. However, a perceptron is based on the idea of regulating analog wiring (Rosenblatt 1958), hence such learning is not purely computable, i.e., it ignores the gap between analog raw data and digital discretized data. Furthermore, although several discretization techniques have been proposed by Elomaa and Rousu (2003), Fayyad and Irani (1993), Gama and Pinto (2006), Kontkanen et al. (1997), Li et al. (2003), Lin et al. (2003), Liu et al. (2002), Skubacz and Hollmén (2000), they treat discretization as data preprocessing for improving the accuracy or efficiency of machine learning algorithms. The process of discretization is therefore not considered from a computational point of view, and “computability” of machine learning algorithms is not discussed at sufficient depth.
There are several related articles considering learning under various restrictions in Gold’s model (Goldman et al. 2003), Valiant’s model (Ben-David and Dichterman 1998; Decatur and Gennaro 1995), and other learning context (Khardon and Roth 1999). Moreover, recently learning from partial examples, or examples with missing information, has attracted much attention in Valiant’s learning model (Michael 2010, 2011). In this paper we also consider learning from examples with missing information, which are truncated finite sequences. However, our model is different from the cited work, since the “missing information” in this paper corresponds to measurement error of real-valued data. Our motivation comes from actual measurement/observation of a physical object, where every datum obtained by an experimental instrument must have some numerical error in principle (Baird 1994). For example, if we measure the size of a cell by a microscope equipped with micrometers, we cannot know the true value of the size but an approximate value with numerical error, which depends on the degree of magnification by the micrometers. In this paper we try to treat this process as learning from multivariate data, where an approximate value corresponds to a truncated finite sequence and error becomes small as the length of the sequence increases. The model of computation for real numbers within the framework of TTE, as mentioned in the introduction, fits our motivation, and this approach is unique in computational learning theory.
Self-similar sets can be viewed as a geometric interpretation of languages recognized by ω-automata (Perrin and Pin 2004), first introduced by Büchi (1960), and learning of such languages has been investigated by De La Higuera and Janodet (2001), Jain et al. (2011). Both works focus on learning ω-languages from their prefixes, i.e. texts (positive data), and show several learnable classes. This approach is different from ours since our motivation is to address computability issues in the field of machine learning from numerical data, and hence there is a gap between prefixes of ω-languages and positive data for learning in our setting as mentioned in the introduction. Moreover, we consider learning from both positive and negative data, which is a new approach in the context of learning of infinite words.
Recently, two of the authors, Sugiyama and Yamamoto (2010), have addressed discretization of real vectors in a computational approach and proposed a new similarity measure, called coding divergence. It evaluates the similarity between two sets of real vectors and can be applied to many machine learning tasks such as classification and clustering. However, it does not address the issue of the learnability or complexity of learning of continuous objects.
3 Formalization of learning
Notation
ℕ | The set of natural numbers including 0 |
ℕ^{+} | The set of positive natural numbers, i.e., ℕ^{+}=ℕ∖{0} |
ℚ | The set of rational numbers |
ℝ | The set of real numbers |
ℝ^{+} | The set of positive real numbers |
d | The number of dimensions (d∈ℕ^{+}) |
ℝ^{d} | d-dimensional Euclidean space |
\(\mathcal {K}^{*}\) | The set of figures (nonempty compact subsets of ℝ^{d}) |
\(\mathcal {I}^{d}\) | The unit interval [0,1]×…×[0,1] |
K, L | Figures (nonempty compact sets) |
#X | The number of elements in X |
\(\mathcal {F}\) | Set of figures |
φ | Contraction for real numbers |
C | Finite set of contractions |
Φ | Contraction for figures |
Σ | Alphabet |
Σ^{d} | The set of finite sequences of length d, i.e., Σ^{d}={a_{1}a_{2}…a_{d}∣a_{i}∈Σ} |
Σ^{∗} | The set of finite sequences |
Σ^{+} | The set of finite sequences without the empty string λ |
Σ^{ω} | The set of infinite sequences |
λ | The empty string |
u, v, w | Finite sequences |
w⊑p | w is a prefix of p (w⊏p means w⊑p and w≠p) |
↑w | The set {p∈Σ^{ω}∣w⊏p} |
〈⋅〉 | The tupling function, i.e., \(\langle p^{1}, p^{2}, \dots, p^{d}\rangle :=p_{0}^{1}p_{0}^{2}\dots p_{0}^{d} p_{1}^{1}p_{1}^{2}\dots p_{1}^{d} p_{2}^{1}p_{2}^{2}\dots p_{2}^{d}\dots\) |
|w| | The length of w. If w=〈w^{1},…,w^{d}〉∈(Σ^{d})^{∗}, |w|=|w^{1}|=…=|w^{d}| |
\(\operatorname {diam}(k)\) | The diameter of the set ρ(w) with |w|=k, i.e., \(\operatorname {diam}(k) = \sqrt{d} \cdot2^{-k}\) |
p, q | Infinite sequences |
V, W | Set of finite or infinite sequences |
ρ | Binary representation |
ξ, ζ | Representation, i.e., a mapping from finite or infinite sequences to some objects |
ξ≤ζ | ξ is reducible to ζ |
ξ≡ζ | ξ is equivalent to ζ |
\(\nu_{\mathbb {Q}^{d}}\) | Representation for rational numbers |
\(\nu_{\mathcal {Q}}\) | Representation for finite sets of rational numbers |
\(\mathcal {H}\) | The hypothesis space (The set of finite sets of finite sequences) |
H | Hypothesis |
h | Classifier of hypothesis H |
κ | The mapping from hypotheses to figures |
M | Learner |
σ | Presentation (informant or text) |
Pos(K) | The set of finite sequences of positive examples of K, i.e., \(\{w \mid\rho(w) \cap K \not= \emptyset\}\) |
Pos_{k}(K) | The set {w∈Pos(K)∣|w|=k} |
Neg(K) | The set of finite sequences of negative examples of K, i.e., {w∣ρ(w)∩K=∅} |
d_{E} | The Euclidean distance |
d_{H} | The Hausdorff distance |
ℌ | The Hausdorff measure |
dim_{H} | The Hausdorff dimension |
dim_{B} | The box-counting dimension |
dim_{S} | The similarity dimension |
dim_{VC} | The VC dimension |
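Two entries of the notation table, the tupling function 〈⋅〉 and diam(k), can be illustrated with a short sketch. The table defines tupling for infinite sequences; the code below applies the same interleaving to finite sequences of equal length, as in the definition of |w|. The function names `tuple_words` and `diam` are hypothetical.

```python
import math

def tuple_words(*words: str) -> str:
    """Interleave d finite sequences of equal length: <w^1,...,w^d> = w^1_0 w^2_0 ... w^d_0 w^1_1 ..."""
    if len({len(w) for w in words}) > 1:
        raise ValueError("all components must have the same length")
    # zip(*words) yields one column (one symbol per dimension) per position.
    return "".join("".join(column) for column in zip(*words))

def diam(k: int, d: int) -> float:
    """Diameter sqrt(d) * 2^(-k) of the hypercube rho(w) with |w| = k in d dimensions."""
    return math.sqrt(d) * 2.0 ** (-k)

# Two coordinates, each given to 2 bits of precision.
code = tuple_words("01", "11")
```

Here `code` interleaves the first bits of both coordinates before the second bits, so a longer prefix of the tupled sequence always refines every coordinate evenly.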
Example 3.1
Lemma 3.2
(Soundness of hypotheses)
For every hypothesis \(H \in \mathcal {H}\), the set κ(H) defined by (6) is a self-similar set.
Proof
Lemma 3.3
(Representational power of hypotheses)
For any δ∈ℝ^{+} and for every figure \(K \in \mathcal {K}^{*}\), there exists a hypothesis H such that \(\mathrm {GE}(K, H) < \delta\).
Proof
Lemma 3.4
(Computability of classifiers)
For every hypothesis \(H \in \mathcal {H}\), the classifier h of H defined by (7) is computable.
Proof
First we consider whether or not the boundary of an interval is contained in κ(H). Suppose d=1 and let C be a finite set of contractions and F be the self-similar set of C. We have the following property: Let \([x, y] = \varphi _{1} \circ \varphi _{2} \circ\dots\circ \varphi _{n} (\mathcal {I}^{1})\) for some φ_{1},φ_{2},…,φ_{n}∈C and let \(I = \varphi '_{1} \circ \varphi '_{2} \circ\dots\circ \varphi '_{n'} (\mathcal {I}^{1})\) for \(\varphi '_{1}, \varphi '_{2}, \dots, \varphi '_{n'} \in C\). Assume that, if n′ is large enough, there is no such I satisfying x∈I and minI<x (resp. maxI>y). Then, we have x∈F (resp. y∈F) if and only if \(0 \in \varphi (\mathcal {I}^{1})\) (resp. \(1 \in \varphi (\mathcal {I}^{1})\)) for some φ∈C. This means that if [x,y]=ρ(v) with a sequence v∈H^{k} (k∈ℕ) for a hypothesis H, where there is no sequence v′∈H^{k′} with x∈ρ(v′) and minρ(v′)<x (resp. maxρ(v′)>y) when k′ is large enough, we have x∈κ(H) (resp. y∈κ(H)) if and only if u∈{0}^{+} (resp. u∈{1}^{+}) for some u∈H.
1. For some k∈ℕ, there exists v∈H^{k} such that w⊑v. This is because ρ(w)⊇ρ(v) and ρ(v)∩κ(H)≠∅.
2. The above condition does not hold, but ρ(w)∩κ(H)≠∅.
The “only if” part: In Algorithm 1, if v∈H^{k} satisfies the conditions in line 6 or line 8, then ρ(w)∩κ(H)≠∅. Thus h(w)=1 holds. □
The set {κ(H)∣ H⊂(Σ^{d})^{∗} and the classifier h of H is computable} exactly corresponds to an indexed family of recursive concepts/languages discussed in computational learning theory (Angluin 1980), which is a common assumption for learning of languages. On the other hand, there exists some class of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) that is not an indexed family of recursive concepts. This means that, for some figure K, there is no computable classifier which classifies all data correctly. Therefore we address the problems of both exact and approximate learning of figures to obtain a computable classifier for any target figure.
Lemma 3.5
(Monotonicity of examples)
If (v,1) is an example of K, then (w,1) is an example of K for all prefixes w⊑v, and (va,1) is an example of K for some a∈Σ^{d}. If (w,0) is an example of K, then (wv,0) is an example of K for all v∈(Σ^{d})^{∗}.
Proof
Relationship between the conditions for each finite sequence w∈Σ^{∗} and the standard notation of binary classification
 | Target figure K: w∈Pos(K) (ρ(w)∩K≠∅) | Target figure K: w∈Neg(K) (ρ(w)∩K=∅) |
---|---|---|
Hypothesis H: h(w)=1 (ρ(w)∩κ(H)≠∅) | True positive | False positive (Type I error) |
Hypothesis H: h(w)=0 (ρ(w)∩κ(H)=∅) | False negative (Type II error) | True negative |
Let h be the classifier of a hypothesis H. We say that the hypothesis H is consistent with an example (w,l) if l=1 implies h(w)=1 and l=0 implies h(w)=0, and consistent with a set of examples E if H is consistent with all examples in E.
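The consistency condition and the four outcomes of the table can be sketched as follows. Since computing the classifier of a general hypothesis requires Algorithm 1, which is not reproduced here, the sketch uses a stand-in: a figure that is a single closed interval in d=1, so that h(w)=1 iff ρ(w) meets the interval. All names (`interval_classifier`, `is_consistent`, `confusion`) are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def rho_prefix(w: str):
    """Closed interval rho(w) for a finite binary string w."""
    low = sum((Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(w)), Fraction(0))
    return low, low + Fraction(1, 2 ** len(w))

def interval_classifier(lo: Fraction, hi: Fraction):
    """Classifier of the stand-in figure [lo, hi]: h(w) = 1 iff rho(w) intersects [lo, hi]."""
    def h(w: str) -> int:
        a, b = rho_prefix(w)
        return 1 if a <= hi and lo <= b else 0
    return h

def is_consistent(h, examples) -> bool:
    """H is consistent with E iff h(w) equals the label l for every example (w, l) in E."""
    return all(h(w) == label for w, label in examples)

def confusion(h, examples) -> Counter:
    """Tally the four outcomes of the table for a classifier h on labelled examples."""
    names = {(1, 1): "true positive", (1, 0): "false positive",
             (0, 1): "false negative", (0, 0): "true negative"}
    return Counter(names[(h(w), label)] for w, label in examples)

# Examples of the target K = [1/2, 1], judged by the hypothesis figure [3/4, 1].
examples = [("1", 1), ("01", 1), ("00", 0)]
h = interval_classifier(Fraction(3, 4), Fraction(1))
```

With these inputs the hypothesis misclassifies the example (01,1), which touches K only at 0.5, so it is not consistent with the example set.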
A learning machine, called a learner, is a procedure (i.e., a Turing machine that never halts) that reads a presentation of a target figure from time to time and outputs hypotheses from time to time. In the following, we denote a learner by M and the infinite sequence of hypotheses produced by M on the input σ by M_{σ}; M_{σ}(i−1) denotes the ith hypothesis produced by M. Assume that M has received j examples σ(0),σ(1),…,σ(j−1) by the time it outputs the ith hypothesis M_{σ}(i−1). We do not require i=j; usually i≤j holds, since M can “wait” until it has received enough examples. We say that an infinite sequence of hypotheses M_{σ} converges to a hypothesis H if there exists n∈ℕ such that M_{σ}(i)=H for all i≥n.
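The notion of a learner whose outputs converge can be illustrated with the classical identification-by-enumeration strategy: after each example, output the first hypothesis in a fixed enumeration that is consistent with everything seen so far. This is a generic sketch and not necessarily the procedure used in this paper. It assumes a finite list of hypotheses and a finite prefix of the presentation, and it uses closed intervals in d=1 as stand-in hypotheses; `learn_by_enumeration` and `classify` are hypothetical names.

```python
from fractions import Fraction

def rho_prefix(w: str):
    """Closed interval rho(w) for a finite binary string w."""
    low = sum((Fraction(int(bit), 2 ** (i + 1)) for i, bit in enumerate(w)), Fraction(0))
    return low, low + Fraction(1, 2 ** len(w))

def classify(hypothesis, w: str) -> int:
    """Stand-in classifier: the hypothesis is a closed interval (lo, hi); 1 iff rho(w) meets it."""
    lo, hi = hypothesis
    a, b = rho_prefix(w)
    return 1 if a <= hi and lo <= b else 0

def learn_by_enumeration(presentation, hypotheses):
    """After each example, yield the first listed hypothesis consistent with all examples so far."""
    seen = []
    for example in presentation:
        seen.append(example)
        for hypothesis in hypotheses:
            if all(classify(hypothesis, w) == label for w, label in seen):
                yield hypothesis
                break
        else:
            yield None  # nothing in the finite list fits the data

half, one, zero = Fraction(1, 2), Fraction(1), Fraction(0)
hypotheses = [(zero, one), (zero, half), (half, one)]
informant = [("1", 1), ("00", 0), ("01", 1)]   # examples of the target [1/2, 1]
outputs = list(learn_by_enumeration(informant, hypotheses))
```

On this input the learner first guesses the whole unit interval, is refuted by the negative example (00,0), and from then on keeps outputting [1/2,1], which mirrors the definition of convergence given above.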
4 Exact learning of figures
We analyze “exact” learning of figures. This means that, for any target figure K, there should be a hypothesis H such that the generalization error is zero (i.e., K=κ(H)), hence the classifier h of H can classify all data correctly with no error, that is, h satisfies (7). The goal is to find such a hypothesis H from examples (training data) of K.
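For intuition about the generalization error, the Hausdorff distance can be computed directly when both sets are finite. The sketch below assumes, following the abstract, that the generalization error is the Hausdorff distance between K and κ(H); figures are in general infinite compact sets, so finite point sets serve only as stand-ins here. The names `directed` and `hausdorff` are hypothetical.

```python
import math

def directed(a_points, b_points) -> float:
    """sup over a in A of the Euclidean distance from a to its nearest point in B."""
    return max(min(math.dist(a, b) for b in b_points) for a in a_points)

def hausdorff(a_points, b_points) -> float:
    """Hausdorff distance d_H(A, B) between two nonempty finite point sets."""
    return max(directed(a_points, b_points), directed(b_points, a_points))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0)]
distance = hausdorff(A, B)
```

The distance is zero exactly when the two sets coincide, which is the sense in which exact learning requires K=κ(H).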
4.1 Explanatory learning
The most basic learning criterion in Gold’s model is EX-learning (EX means EXplain), i.e., learning in the limit proposed by Gold (1967). We call these criteria FIGEX-INF- (INF means an informant) and FIGEX-TXT-learning (TXT means a text) for EX-learning from informants and texts, respectively. We introduce these criteria into the learning of figures, and analyze the learnability of figures.
Definition 4.1
(Explanatory learning)
A learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) σ of K, the outputs M_{σ} converge to a hypothesis H such that \(\mathrm {GE}(K, H) = 0\).
For every learning criterion CR introduced in the following, we say that a set of figures \(\mathcal {F}\) is CR-learnable if there exists a learner that CR-learns \(\mathcal {F}\), and denote by CR the collection of CR-learnable sets of figures following the standard notation of this field (Jain et al. 1999).
Theorem 4.2
The set of figures \(\kappa (\mathcal {H}) = \left \{\kappa(H) \mid H \in \mathcal {H}\right \}\) is FIGEX-INF-learnable.
Proof
Next, we consider FIGEX-TXT-learning. In learning of languages from texts, the necessary and sufficient conditions for learning have been studied in detail by Angluin (1980, 1982), Kobayashi (1996), Lange et al. (2008), Motoki et al. (1991), Wright (1989), and characterization of learnability using finite tell-tale sets is one of the crucial results. We adapt these results into the learning of figures and show the FIGEX-TXT-learnability.
Definition 4.3
(Finite tell-tale set, cf. Angluin 1980)
Let \(\mathcal {F}\) be a set of figures. For a figure \(K \in \mathcal {F}\), a finite subset \(\mathcal {T}\) of the set of positive examples Pos(K) is a finite tell-tale set of K with respect to \(\mathcal {F}\) if for all figures \(L \in \mathcal {F}\), \(\mathcal {T}\subset \mathrm {Pos}(L)\) implies \(\mathrm {Pos}(L) \not \subset \mathrm {Pos}(K)\) (i.e., \(L \not\subset K\)). If every figure \(K \in \mathcal {F}\) has finite tell-tale sets with respect to \(\mathcal {F}\), we say that \(\mathcal {F}\) has finite tell-tale sets.
Theorem 4.4
Let \(\mathcal {F}\) be a subset of \(\kappa (\mathcal {H})\). Then \(\mathcal {F}\) is FIGEX-TXT-learnable if and only if there is a procedure that, for every figure \(K \in \mathcal {F}\), enumerates a finite tell-tale set W of K with respect to \(\mathcal {F}\).
This theorem can be proved in exactly the same way as that for learning of languages given by Angluin (1980). Note that such a procedure does not need to stop. Using this theorem, we show that the set \(\kappa (\mathcal {H})\) is not FIGEX-TXT-learnable.
Theorem 4.5
The set \(\kappa (\mathcal {H})\) does not have finite tell-tale sets.
Proof
Fix a figure \(K = \kappa(H) \in \kappa (\mathcal {H})\), where there exists a pair v,w∈H such that \(\rho(vvv\dots) \not= \rho(www\dots)\), and fix a finite set \(T = \left \{w_{1}, w_{2}, \dots, w_{n}\right \}\) contained in Pos(K). Suppose that #Pos_{m}(K)>n holds for a natural number m. For each finite sequence w_{i}, there exists u_{i}∈Pos(K) such that |u_{i}|>m, w_{i}⊏u_{i}, and u_{i}∈H^{k} for some k. For the figure L=κ(U) with U={u_{1},u_{2},…,u_{n}}, T⊂Pos(L) and Pos(L)⊂Pos(K) hold. Therefore K has no finite tell-tale set with respect to \(\kappa (\mathcal {H})\). □
Corollary 4.6
The set of figures \(\kappa (\mathcal {H})\) is not FIGEX-TXT-learnable.
In any realistic scenario of machine learning, however, this set \(\kappa (\mathcal {H})\) is too large to search for the best hypothesis since we usually want to obtain a “compact” representation of a target figure. Thus we (implicitly) have an upper bound on the number of elements in a hypothesis. Here we give a positive result for the above situation, that is, if we fix the number of elements #H in each hypothesis H a priori, the resulting set of figures becomes FIGEX-TXT-learnable. Intuitively, this is because if we take k large enough, the set {w∈Pos(K)∣|w|≤k} becomes a finite tell-tale set of K. Here we denote by Red(H) the hypothesis in which for every pair v,w∈H with |v|≤|w|, w is removed if ρ(vvv…)=ρ(www…). For a finite subset of natural numbers N⊂ℕ, we define the set of hypotheses \(\mathcal {H}_{N} := \{H \in \mathcal {H}\mid\#\mathrm {Red}(H) \in N\}\).
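The reduction Red(H) can be sketched for d=1 using the closed form of a purely periodic binary expansion: for a nonempty string v of length n read as a binary integer, ρ(vvv…) = v/(2^{n}−1), obtained by summing the geometric series. The restriction to d=1 and to nonempty strings, and the names `rho_periodic` and `red`, are assumptions made for this illustration.

```python
from fractions import Fraction

def rho_periodic(v: str) -> Fraction:
    """rho(vvv...) for a nonempty binary string v, i.e. int(v) / (2^|v| - 1)."""
    if not v:
        raise ValueError("v must be nonempty")
    return Fraction(int(v, 2), 2 ** len(v) - 1)

def red(hypothesis: set) -> set:
    """Keep, for each value rho(www...), only the shortest sequence of H attaining it."""
    kept = {}
    # Visit shorter sequences first so that a longer duplicate is the one dropped.
    for w in sorted(hypothesis, key=lambda s: (len(s), s)):
        kept.setdefault(rho_periodic(w), w)
    return set(kept.values())

reduced = red({"01", "0101", "1", "11", "10"})
```

Here 0101 is dropped because ρ(0101 0101…) = ρ(01 01…) = 1/3, and 11 is dropped because ρ(111…) = 1 is already attained by 1, so #Red(H) = 3 for this H.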
Theorem 4.7
There exists a procedure that, for all finite subsets N⊂ℕ and all figures \(K \in \kappa (\mathcal {H}_{N})\), enumerates a finite tell-tale set of K with respect to \(\kappa (\mathcal {H}_{N})\).
Proof
First, we assume that N={1}. It is trivial that there exists a procedure that, for an arbitrary figure \(K \in \kappa (\mathcal {H}_{N})\), enumerates a finite tell-tale set of K with respect to \(\kappa (\mathcal {H}_{N})\), since we always have \(L \not\subset K\) for all pairs of figures \(K, L \in \kappa (\mathcal {H}_{N})\).
We construct a tree as follows (a similar technique, called d-explorer, was used by Jain and Sharma (1997)). Each node has a pair (H,w) as its label, where κ(H)⊂K and w∈Pos(K)∖Pos(κ(H)). The root node is labeled (∅,v) with a finite sequence v∈Pos(K). The tree is constructed iteratively by adding children for each node of the tree, whose depth (the length to the root) is at most maxN−1. Let the label of such a node be (H,w). For every finite sequence w′ with |w′|≤|w|, if there exists a finite sequence w″ satisfying |w″|>|w| and w″∈Pos(K)∖κ(H∪{w′}), add a child labeled (H∪{w′},w″) to the node.
The above tree is bounded in depth maxN and the number of children for any node is always finite, hence the number of nodes of the tree is finite. Let m be the length of the longest w such that (H,w) is the label of a node of the tree. Then, we can easily check that there is no hypothesis H′ such that κ(H′)⊂K, #H′≤maxN, and Pos(κ(H′))⊃Pos_{m}(K). □
Corollary 4.8
For all finite subsets of natural numbers N⊂ℕ, the set of figures \(\kappa(\mathcal {H}_{N})\) is FIGEX-TXT-learnable.
4.2 Consistent learning
In a learning process, it is natural to require that every hypothesis generated by a learner be consistent with the examples it has received so far. Here we introduce FIGCONS-INF- and FIGCONS-TXT-learning (CONS means CONSistent). These criteria correspond to CONS-learning, which was first introduced by Blum and Blum (1975).^{4} This model was also used (but implicitly) in the Model Inference System (MIS) proposed by Shapiro (1981, 1983), and has been studied in the computational learning of formal languages and recursive functions (Jain et al. 1999).
Definition 4.9
(Consistent learning)
A learner M FIGCONS-INF-learns (resp. FIGCONS-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\) and for all figures \(K \in \mathcal {F}\) and all informants (resp. texts) σ of K, each hypothesis M_{σ}(i) is consistent with E_{i}, the set of examples received by M until just before it generates the hypothesis M_{σ}(i).
Assume that a learner M achieves FIGEX-INF-learning of \(\kappa (\mathcal {H})\) using Procedure 1. We can easily check that M always generates a hypothesis that is consistent with the received examples.
Corollary 4.10
FIGEX-INF=FIGCONS-INF.
Suppose that \(\mathcal {F}\subset \kappa (\mathcal {H})\) is FIGEX-TXT-learnable. We can construct a learner M in the same way as in the case of EX-learning of languages from texts (Angluin 1980), where M always outputs a hypothesis that is consistent with received examples.
Corollary 4.11
FIGEX-TXT=FIGCONS-TXT.
4.3 Reliable and refutable learning
In this subsection, we consider target figures that might not be represented exactly by any hypothesis since there are infinitely many such figures, and if we have no background knowledge, there is no guarantee of the existence of an exact hypothesis. Thus in practice this approach is more convenient than the explanatory or consistent learning considered in the previous two subsections.
To realize the above case, we use two concepts, reliability and refutability. The aim of these concepts is to admit targets that cannot be exactly represented by any hypothesis. Reliable learning was introduced by Blum and Blum (1975) and Minicozzi (1976), and refutable learning by Mukouchi and Arikawa (1995) and Sakurai (1991), in computational learning of languages and recursive functions; both were developed further by Jain et al. (2001), Merkle and Stephan (2003), and Mukouchi and Sato (2003). Here we introduce these concepts into the learning of figures and analyze learnability.
First, we treat reliable learning of figures. Intuitively, reliability requires that an infinite sequence of hypotheses converges only to a correct hypothesis.
Definition 4.12
(Reliable learning)
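A learner M FIGRELEX-INF-learns (resp. FIGRELEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions:
- 1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).
- 2. For any target figure \(K \in \mathcal {K}^{*}\) and its informants (resp. texts) σ, the infinite sequence of hypotheses M_{σ} does not converge to a wrong hypothesis H such that \(\mbox {$\mathrm {GE}$}(K, \kappa (H)) \not= 0\).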
We analyze reliable learning of figures from informants. Intuitively, for any target figure \(K \in \mathcal {F}\), if a learner can judge whether or not the current hypothesis H is consistent with the target, i.e., κ(H)=K or not in finite time, then the set \(\mathcal {F}\) is reliably learnable.
Theorem 4.13
FIGEX-INF=FIGRELEX-INF.
Proof
In contrast, we have an interesting result on reliable learning from texts. We show in the following that FIGEX-TXT≠FIGRELEX-TXT holds and that a set of figures \(\mathcal {F}\) is reliably learnable from positive data only if any figure \(K \in \mathcal {F}\) is a singleton. Remember that \(\mathcal {H}_{N}\) denotes the set of hypotheses \(\{H \in \mathcal {H}\mid\# H \in N\}\) for a subset N⊂ℕ and, for simplicity, we denote \(\mathcal {H}_{\{n\}}\) by \(\mathcal {H}_{n}\) for a natural number n∈ℕ.
Theorem 4.14
The set of figures \(\kappa(\mathcal {H}_{N})\) is FIGRELEX-TXT-learnable if and only if N={1}.
Proof
Corollary 4.15
FIGRELEX-TXT⊂FIGEX-TXT.
Sakurai (1991) proved that a set of concepts \(\mathcal{C}\) is reliably EX-learnable from texts if and only if \(\mathcal{C}\) contains no infinite concept (p. 182, Theorem 3.1).^{5} However, we have shown that the set \(\kappa (\mathcal {H}_{1})\) is FIGRELEX-TXT-learnable, though all figures \(K \in \kappa (\mathcal {H}_{1})\) correspond to infinite concepts since Pos(K) is infinite for all \(K \in \kappa (\mathcal {H}_{1})\). The monotonicity of the set Pos(K) (Lemma 3.5), which is a constraint naturally derived from the geometric property of examples, causes this difference.
Next, we extend FIGEX-INF- and FIGEX-TXT-learning by paying attention to refutability. In refutable learning, a learner tries to learn figures in the limit, but if the target figure is not in the considered space, it recognizes in finite time that it cannot find a correct hypothesis, outputs the refutation symbol △, and stops.
Definition 4.16
(Refutable learning)
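A learner M FIGREFEX-INF-learns (resp. FIGREFEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions, where △ denotes the refutation symbol:
- 1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).
- 2. If \(K \in \mathcal {F}\), then for all informants (resp. texts) σ of K, M_{σ}(i)≠△ for all i∈ℕ.
- 3. If \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\), then for all informants (resp. texts) σ of K, there exists m∈ℕ such that M_{σ}(i)≠△ for all i<m, and M_{σ}(i)=△ for all i≥m.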
Conditions 2 and 3 in the above definition mean that a learner M refutes the set \(\mathcal {F}\) in finite time if and only if a target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). We compare FIGREFEX-INF-learnability with other learning criteria.
Theorem 4.17
\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) and \(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\).
Proof
First we consider \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). We show an example of a set of figures \(\mathcal {F}\) with \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) and \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) in the case of d=2. Let K_{0}=κ({〈0,0〉,〈1,1〉}), K_{i}=κ({〈w,w〉∣w∈Σ^{i}∖{1}^{i}}) for every i≥1, and \(\mathcal {F}= \{K_{i} \mid i \in \mathbb {N}\}\). Note that K_{0} is the line y=x and K_{i}⊂K_{0} for all i≥1.
We prove that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). It is trivial that \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\), so we assume that the target figure \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\). If a target figure K⊃K_{0}, it is trivial that, for any informant σ of K, the set of examples \(\operatorname {range}(\sigma [n])\) for some n∈ℕ is not consistent with any \(K_{i} \in \mathcal {F}\) (consider a positive example for a point x∈K∖K_{0}). Otherwise, if K⊂K_{0}, there should exist a negative example 〈v,v〉∈Neg(K). Then we have \(K \not= K_{i}\) for all i>|v|. Thus a learner can refute the candidates {K_{1},K_{2},…,K_{|v|}} in finite time. Therefore \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) holds.
Next we show that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). Let K_{0} be the target figure. For any finite set of positive examples \(\mathcal {T}\subset \mathrm {Pos}(K_{0})\), there exists a figure \(K_{i} \in \mathcal {F}\) such that K_{i}⊂K_{0} and \(\mathcal {T}\) is consistent with K_{i}. Therefore it has no finite tell-tale set with respect to \(\mathcal {F}\) and hence \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) from Theorem 4.4.
Second we check \(\mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\). Assume that \(\mathcal {F}= \kappa (\mathcal {H}_{\{1\}})\) and a target figure K is a singleton {x} with \(K \notin \mathcal {F}\). It is clear that, for any informant σ of K and n∈ℕ, \(\operatorname {range}(\sigma [n])\) is consistent with some figure \(L \in \mathcal {F}\). Thus \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) whereas \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\). □
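The family \(\mathcal {F}=\{K_{i}\mid i\in \mathbb {N}\}\) used in the first half of this proof can be written out explicitly. The sketch below only builds the hypotheses whose figures are K_{0} and K_{i}; it assumes that a sequence 〈w,w〉 is encoded as a tuple of symbol pairs, one pair per position, and the function name `diagonal_hypothesis` is hypothetical.

```python
from itertools import product


def diagonal_hypothesis(i):
    """Hypothesis H with kappa(H) = K_i for the family in the proof (d = 2).

    Assumption: <w, w> is encoded as a tuple of pairs ((w_1, w_1), ..., (w_i, w_i)).
    """
    if i == 0:
        # K_0 = kappa({<0,0>, <1,1>}), the diagonal y = x.
        return {(("0", "0"),), (("1", "1"),)}
    # K_i = kappa({<w,w> | w in Sigma^i, w != 1^i}) for i >= 1.
    return {
        tuple((b, b) for b in w)
        for w in product("01", repeat=i)
        if set(w) != {"1"}
    }
```

Each hypothesis for K_{i} with i≥1 contains 2^{i}−1 sequences, one for every word of length i except 1^{i}.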
Corollary 4.18
\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) and \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\).
Note that it is trivial that \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) since we have \(\kappa (\mathcal {H}_{\{1\}}) \notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) in the above proof and \(\kappa (\mathcal {H}_{\{1\}}) \in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) from Theorem 4.14. Moreover, the condition \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EL}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) holds since \(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\not\subseteq \mbox {\textup {\textbf {F{\scriptsize IG}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) and FIGRELEX-TXT⊂FIGEX-TXT. These results mean that both FIGREFEX-INF- and FIGRELEX-TXT-learning are difficult, but they are incomparable in terms of learnability. Furthermore, we have the following hierarchy.
Theorem 4.19
\(\mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\not= \emptyset\) and FIGREFEX-TXT⊂FIGREFEX-INF.
Proof
Let a set of figures \(\mathcal {F}\) be a singleton {K} such that K=κ(w) for some w∈(Σ^{d})^{∗}. Then there exists a learner M that FIGREFEX-TXT-learns \(\mathcal {F}\), i.e., \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\), since all M has to do is to check whether or not, for a given positive example (v,1), v⊑u for some u∈Pos(K)={x∣x⊑www…}.
Next, let \(\mathcal {F}= \{K\}\) such that K=κ(H) with #Red(H)≥2. We can easily check that \(\mathcal {F}\notin \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {T{\scriptsize XT}}}}\) because if a target figure L is a proper subset of K, no learner can refute \(\mathcal {F}\) in finite time. Conversely, \(\mathcal {F}\in \mbox {\textup {\textbf {F{\scriptsize IG}R{\scriptsize EF}E{\scriptsize X}}-\textbf {I{\scriptsize NF}}}}\) since for all L with L≠K, there exists an example with which the hypothesis H is not consistent. □
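The check used in the first half of this proof, namely whether a positive example v is a prefix of www…, is simple enough to sketch. The following Python generator is an illustration only: it assumes sequences are strings (or tuples of symbols), that a text is an iterable of pairs (v, 1), and it uses the string `"refute"` as a stand-in for the refutation symbol △. The names `is_prefix_of_power` and `refuting_learner` are not from the paper.

```python
REFUTE = "refute"  # stand-in for the refutation symbol


def is_prefix_of_power(v, w):
    """True iff v is a prefix of the infinite repetition www... (w nonempty)."""
    return all(sym == w[i % len(w)] for i, sym in enumerate(v))


def refuting_learner(w, text):
    """Sketch of a learner for the singleton class {kappa(w)} from positive data.

    Yields the hypothesis {w} until some positive example is not a prefix
    of www..., and the refutation symbol from that point on.
    """
    refuted = False
    for v, label in text:
        if label == 1 and not is_prefix_of_power(v, w):
            refuted = True
        yield REFUTE if refuted else frozenset({w})
```

On the examples `("0", 1), ("010", 1), ("011", 1), ("0", 1)` with w = `"01"`, the learner keeps the hypothesis {01} for the first two examples and outputs the refutation symbol from the third example on, since 011 is not a prefix of 0101….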
Corollary 4.20
FIGREFEX-TXT⊂FIGRELEX-TXT.
5 Effective learning of figures
In learning under the proposed criteria, i.e. explanatory, consistent, reliable, and refutable learning, each hypothesis is just considered as exactly “correct” or not, that is, for a target figure K and for a hypothesis H, H is correct if \(\mbox {$\mathrm {GE}$}(K, H) = 0\) and is not correct if \(\mbox {$\mathrm {GE}$}(K, H) \neq0\). Thus we cannot know the rate of convergence to the target figure or how far the current hypothesis is from the target. It is therefore more useful to consider approximate hypotheses by taking various generalization errors into account in the learning process.
We define novel learning criteria, FIGEFEX-INF- and FIGEFEX-TXT-learning (EF means EFfective), to introduce into learning the concept of effectivity, which has been analyzed in computation of real numbers in the area of computable analysis (Weihrauch 2000). Intuitively, these criteria guarantee that for any target figure, a generalization error becomes smaller and smaller monotonically and converges to zero. Thus we can know when the learner learns the target figure “well enough”. Furthermore, if a target figure is learnable in the limit, then the generalization error goes to zero in finite time.
Definition 5.1
(Effective learning)
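A learner M FIGEFEX-INF-learns (resp. FIGEFEX-TXT-learns) a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\) if M satisfies the following conditions:
- 1. The learner M FIGEX-INF-learns (resp. FIGEX-TXT-learns) \(\mathcal {F}\).
- 2. For an arbitrary target figure \(K \in \mathcal {K}^{*}\) and all informants (resp. texts) σ of K, for all i∈ℕ,
$$ \mbox {$\mathrm {GE}$}\bigl(K, \text {\textbf {M}}_{\sigma }(i) \bigr) \le2^{-i}. $$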
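To make the bound in condition 2 concrete, the following sketch computes the Hausdorff distance between finite point sets and checks the requirement GE ≤ 2^{−i} on a finite prefix of hypotheses. It is an illustration under simplifying assumptions: figures are replaced by finite sets of points in Euclidean space, and the function names `hausdorff` and `is_effective_prefix` are hypothetical.

```python
from math import dist


def hausdorff(A, B):
    """Hausdorff distance between two nonempty finite point sets (Euclidean metric)."""
    def directed(X, Y):
        # Largest distance from a point of X to its nearest point of Y.
        return max(min(dist(x, y) for y in Y) for x in X)
    return max(directed(A, B), directed(B, A))


def is_effective_prefix(target, hypotheses):
    """Check the bound d_H(target, hypotheses[i]) <= 2**-i for every listed i.

    Assumption: each hypothesis is given directly as the finite point set
    standing in for the figure it represents; indices start at i = 0.
    """
    return all(
        hausdorff(target, figure) <= 2.0 ** -i
        for i, figure in enumerate(hypotheses)
    )
```

For instance, the Hausdorff distance between {(0,0)} and {(0,0),(1,1)} computed this way is √2, so a hypothesis that is within 1/4 of one of these figures cannot be within 1/4 of the other.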
This definition is inspired by the Cauchy representation of real numbers (Weihrauch 2000, Definition 4.1.5).
Effective learning is related to monotonic learning (Lange and Zeugmann 1993, 1994; Kinber 1994; Zeugmann et al. 1995), originally introduced by Jantke (1991) and Wiehagen (1991), since both learning models consider monotonic convergence of hypotheses. In contrast to their approach, where various notions of monotonicity over languages were considered, we geometrically measure the generalization error of a hypothesis by the Hausdorff metric. On the other hand, effective learning is different from BC-learning developed in the learning of languages and recursive functions (Jain et al. 1999), since BC-learning only guarantees that generalization errors go to zero in finite time. This means that BC-learning is not effective.
Lemma 5.2
Proof
Theorem 5.3
The set of figures \(\kappa (\mathcal {H})\) is FIGEFEX-INF-learnable.
Proof
Assume that \(K \in \kappa (\mathcal {H})\). If M outputs a wrong hypothesis, there must be a positive or negative example that is not consistent with the hypothesis, and it changes the wrong hypothesis. If it produces a correct hypothesis, then it never changes the correct hypothesis, since every example is consistent with the hypothesis. Thus there exists n∈ℕ with \(\mbox {$\mathrm {GE}$}(K, \text {\textbf {M}}_{\sigma }(i)) = 0\) for all i≥n. Therefore M FIGEFEX-INF-learns \(\kappa (\mathcal {H})\). □
Corollary 5.4
FIGEFEX-INF=FIGRELEX-INF=FIGEX-INF.
Thus the learner with Procedure 2 can treat the set of all figures \(\mathcal {K}^{*}\) as learning targets, since for any figure \(K \in \mathcal {K}^{*}\), it can approximate the figure arbitrarily closely using only the figures represented by hypotheses in the hypothesis space \(\mathcal {H}\).
In contrast to FIGEX-TXT-learning, there is no set of figures that is FIGEFEX-TXT-learnable.
Theorem 5.5
FIGEFEX-TXT=∅.
Proof
We show a counterexample of a target figure which no learner M can approximate effectively. Assume that d=2 and a learner M FIGEFEX-TXT-learns a set of figures \(\mathcal {F}\subseteq \mathcal {K}^{*}\). Let us consider two target figures K={(0,0),(1,1)} and L={(0,0)}. For a text σ of L, for all examples \((w, 1) \in \operatorname {range}(\sigma )\), w∈{00}^{∗}. Since M FIGEFEX-TXT-learns \(\mathcal {F}\), it should output the hypothesis H as M_{σ}(2) such that \(\mbox {$\mathrm {GE}$}(L, H) \le 1/4\). Suppose that M receives n examples before outputting the hypothesis H. Then there exists a presentation τ of the figure K such that τ[n−1]=σ[n−1], and M outputs the same hypothesis H on receiving τ[n−1]. However, since the Hausdorff distance between K and L is \(\sqrt{2}\), \(\mbox {$\mathrm {GE}$}(K, H) \ge\sqrt{2} - 1/4\) holds from the triangle inequality, contradicting our assumption that M FIGEFEX-TXT-learns \(\mathcal {F}\). This proof can be applied to any \(\mathcal {F}\subseteq \mathcal {K}^{*}\), and thus we have FIGEFEX-TXT=∅. □
6 Evaluation of learning using dimensions
Here we show a novel mathematical connection between fractal geometry and Gold’s learning under the proposed learning model described in Sect. 3. More precisely, we bound the number of positive examples, one of the complexities of learning, using the Hausdorff dimension and the VC dimension. The Hausdorff dimension is known as the central concept of fractal geometry, which measures the density of figures, and VC dimension is the central concept of Valiant’s model (PAC learning model) (Kearns and Vazirani 1994), which measures the complexity of classes of hypotheses.
6.1 Preliminaries for dimensions
First we introduce the Hausdorff dimension and two related dimensions, the box-counting dimension and the similarity dimension, and then the VC dimension.
6.2 Measuring the complexity of learning with dimensions
We show that the Hausdorff dimension of a target figure gives a lower bound to the number of positive examples. Remember that Pos_{k}(K)={w∈Pos(K)∣|w|=k} and the diameter \(\operatorname {diam}(k)\) of the set ρ(w) with |w|=k is \(\sqrt{d}2^{-k}\). Moreover, the size #{w∈(Σ^{d})^{∗}∣|w|=k}=2^{kd} for all k∈ℕ.
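The quantities recalled above can be computed directly for small cases. The sketch below counts the level-k positive examples of a finite set of points, assuming that the set ρ(w) with |w|=k is the closed cube of side 2^{−k} indexed by d integer coordinates and that the points lie in [0,1]^d with rational coordinates. The function names are illustrative, and a finite point set is only a stand-in for a general figure.

```python
from fractions import Fraction
from itertools import product
from math import sqrt


def num_cells(k, d):
    """Number of sequences of length k over Sigma^d, i.e. 2^(k*d)."""
    return 2 ** (k * d)


def cell_diameter(k, d):
    """Diameter sqrt(d) * 2^-k of a level-k cube."""
    return sqrt(d) * 2.0 ** -k


def cells_containing(point, k):
    """Integer indices of the closed level-k cubes that contain the point."""
    per_axis = []
    for x in point:
        scaled = Fraction(x) * 2 ** k
        floor = scaled.numerator // scaled.denominator
        indices = {min(floor, 2 ** k - 1)}
        if scaled == floor and floor > 0:
            indices.add(floor - 1)  # a point on a cell boundary touches both cells
        per_axis.append(sorted(indices))
    return set(product(*per_axis))


def pos_k(points, k):
    """Level-k positive cells of a finite point set (sketch of Pos_k)."""
    cells = set()
    for p in points:
        cells |= cells_containing(p, k)
    return cells
```

For the two-point figure {(0,0),(1,1)} with d=2, `pos_k` returns two of the 16 cells at level k=2, whereas the single point 1/2 in d=1 touches both level-1 cells because it lies on their common boundary.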
Theorem 6.1
Proof
Moreover, if a target figure K can be represented by some hypothesis, that is, \(K \in \kappa (\mathcal {H})\), we can use the exact dimension dim_{H} K as a bound for the number of positive examples #Pos_{k}(K).
Theorem 6.2
Proof
Example 6.3
Lemma 6.4
At each level k, we have \(\mathrm {dim}_{\mathrm {VC}}\,{\mathcal {H}^{k}} = 2^{kd}\).
Proof
Therefore we can rewrite Theorems 6.1 and 6.2 as follows.
Theorem 6.5
These results demonstrate a relationship among the complexities of learning figures (numbers of positive examples), classes of hypotheses (VC dimension), and target figures (Hausdorff dimension).
6.3 Learning the box-counting dimension through effective learning
One may think that FIGEFEX-INF-learning can be achieved without the proposed hypothesis space. For instance, if a learner just outputs figures represented by a set of received positive examples, the generalization error becomes smaller and smaller. Here we show that one “quality” of a target figure, the box-counting dimension, is also learned in FIGEFEX-INF-learning, whereas if a learner outputs figures represented by a set of received positive examples, the box-counting dimension (and also the Hausdorff dimension) of any figure represented by a hypothesis is always d.
Recall that for all hypotheses \(H \in \mathcal {H}\), dim_{H} κ(H)=dim_{B} κ(H)=dim_{S} κ(H), since the set of contractions encoded by the hypothesis H meets the open set condition.
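Under this equality, the dimension of κ(H) can be estimated from the lengths of the sequences in H alone. The sketch below rests on two assumptions that are not restated in this section: that the similarity dimension is the usual solution s of \(\sum_{w \in H} c_{w}^{s} = 1\), and that the contraction encoded by w has ratio \(c_{w} = 2^{-|w|}\) (in line with the cell diameter \(\sqrt{d}2^{-k}\)). It also assumes that H is already reduced and contains no empty sequence; the function name `similarity_dimension` is hypothetical.

```python
def similarity_dimension(lengths, tol=1e-12):
    """Solve sum(2 ** (-l * s) for l in lengths) == 1 for s by bisection.

    `lengths` lists |w| for every w in the hypothesis (all assumed >= 1).
    """
    if not lengths or any(l <= 0 for l in lengths):
        raise ValueError("expected a nonempty list of positive lengths")

    def f(s):
        # Strictly decreasing in s because every length is positive.
        return sum(2.0 ** (-l * s) for l in lengths) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) > 0:      # enlarge the bracket until the sign changes
        hi *= 2
    while hi - lo > tol:  # standard bisection on [lo, hi]
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Under these assumptions, three sequences of length 1 (as for a Sierpiński-triangle-like figure in d=2) give log_2 3 ≈ 1.585, two sequences of length 2 give 1/2, and a single sequence gives 0, which matches a one-point figure.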
Theorem 6.6
Proof
7 Computational interpretation of learning
Recently, the concept of “computability” for continuous objects has been introduced in the framework of Type-2 Theory of Effectivity (TTE) (Schröder 2002b; Weihrauch 2000, 2008; Weihrauch and Grubba 2009; Tavana and Weihrauch 2011), where we treat an uncountable set X as objects for computing through infinite sequences over a given alphabet Σ. Using the framework, we analyze our learning model from the computational point of view. Some studies by de Brecht and Yamamoto (2009), de Brecht (2010) have already demonstrated a close connection between TTE and Gold’s model, and our analysis becomes an instance and extension of their analysis.
7.1 Preliminaries for Type-2 theory of effectivity
We prepare mathematical notations for TTE. In the remainder of this section, we assume Σ={0,1,[,],∥,♢}. A partial (resp. total) function g from a set A to a set B is denoted by g:⊆A→B (resp. g:A→B). A representation of a set X is a surjection ξ:⊆C→X, where C is Σ^{∗} or Σ^{ω}. We see \(p \in \operatorname {dom}(\xi)\) as a name of the encoded element ξ(p).
Computability of string functions f:⊆X→Y, where X and Y are Σ^{∗} or Σ^{ω}, is defined via a Type-2 machine, which is a usual Turing machine with one-way input tapes, some work tapes, and a one-way output tape (Weihrauch 2000). The function f_{M}:⊆X→Y computed by a Type-2 machine M is defined as follows: When Y is Σ^{∗}, f_{M}(p):=q if M with input p halts with q on the output tape, and when Y is Σ^{ω}, f_{M}(p):=q if M with input p writes step by step q onto the output tape. We say that a function f:⊆C→D is computable if there is a Type-2 machine that computes f, and a finite or infinite sequence p is computable if the constant function f which outputs p is computable. A Type-2 machine never changes symbols that have already been written onto the output tape, thus each prefix of the output depends only on a prefix of the input.
By treating a Type-2 machine as a translator between names of some objects, a hierarchy of representations is introduced. A representation ξ is reducible to ζ, denoted by ξ≤ζ, if there exists a computable function f such that ξ(p)=ζ(f(p)) for all \(p \in \operatorname {dom}(\xi)\). Two representations ξ and ζ are equivalent, denoted by ξ≡ζ, if both ξ≤ζ and ζ≤ξ hold. As usual, ξ<ζ means ξ≤ζ and not ζ≤ξ.
Computability for functions is defined through representations and computability of string functions.
Definition 7.1
Thus the abstract function f is “realized” by the concrete function (Type-2 machine) g through the two representations ξ and ζ.
Definition 7.2
(Standard representation of figures)
This representation κ_{H} is known to be an admissible representation of the space \((\mathcal {K}^{*}, d_{\mathrm {H}})\), which is the key concept in TTE (Schröder 2002b; Weihrauch 2000), and is also known as the \(\boldsymbol {\varSigma }_{1}^{0}\)-admissible representation proposed by de Brecht and Yamamoto (2009).
7.2 Computability and learnability of figures
First, we show computability of figures in \(\kappa (\mathcal {H})\).
Theorem 7.3
For every figure \(K \in \kappa (\mathcal {H})\), K is κ_{H}-computable.
Proof
Thus a hypothesis H can be viewed as a “program” of a Type-2 machine that produces a κ_{H}-representation of the figure κ(H).
Both informants and texts are also representations (in the sense of TTE) of compact sets. Define the mapping η_{INF} by η_{INF}(σ):=K for every \(K \in \mathcal {K}^{*}\) and informant σ of K, and the mapping η_{TXT} by η_{TXT}(σ):=K for every \(K \in \mathcal {K}^{*}\) and text σ of K. Trivially η_{INF}<η_{TXT} holds, that is, some Type-2 machine can translate η_{INF} to η_{TXT}, but no machine can translate η_{TXT} to η_{INF}. Moreover, we have the following hierarchy of representations.
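The easy direction, translating an informant into a text, amounts to a one-way stream filter. The following Python generators are a minimal sketch of this idea, not a Type-2 machine: they assume an informant is an (infinite) iterable of labeled examples (w, l) with l∈{0,1}, and the example informant for the one-point figure {0} in d=1 uses the fact that its positive examples are exactly the prefixes of 000…. Both function names are hypothetical.

```python
from itertools import count, islice, product


def informant_to_text(informant):
    """Read labeled examples one by one and pass on only the positive ones.

    Each output item depends only on a finite prefix of the input stream.
    """
    for w, label in informant:
        if label == 1:
            yield (w, 1)


def informant_of_origin():
    """Illustrative informant of the figure {0} for d = 1.

    Enumerates all nonempty binary strings by length; w is positive
    iff it is a prefix of 000..., i.e. contains no symbol 1.
    """
    for k in count(1):
        for bits in product("01", repeat=k):
            w = "".join(bits)
            yield (w, 1 if "1" not in w else 0)
```

Taking the first three items of `informant_to_text(informant_of_origin())` with `islice` yields the positive examples 0, 00, and 000. The filter never needs to revise what it has already written, which is the behaviour required of a one-way output tape.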
Lemma 7.4
η_{INF}<κ_{H}, \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\), and \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\).
Proof
Second, we prove \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\). Assume the opposite, that η_{TXT}≤κ_{H} holds. Then there exists a computable function f such that η_{TXT}(σ)=κ_{H}(f(σ)) for every figure \(K \in \mathcal {K}^{*}\) and every text σ of K. Fix a figure K and its text \(\sigma \in \operatorname {dom}(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}})\). This means that for any small ε>0, f can pick finite sequences w_{1},w_{2},…,w_{n} from Pos(K) such that \(d_{\mathrm {H}}(K, \nu_{\mathcal {Q}}(\iota(w_{1}, w_{2}, \dots, w_{n}))) \le \varepsilon \). However, if such an f exists, we can easily check that {K}∈FIGEFEX-TXT, contradicting our result (Theorem 5.5). It follows that \(\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}} \not\le\kappa_{\mathrm {H}}\).
Third, we prove \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). There is a figure K such that K∩ρ(w)={x} for some w∈Σ^{∗}, i.e., K and ρ(w) intersect in only one point x. Such a w must be in σ as a positive example, that is, w∈Pos(K). However, a representation of K can be constructed without w. There exists an infinite sequence p∈κ_{H} with p=w_{0}♢w_{1}♢… such that \(x \notin\nu_{\mathcal {Q}}(w_{k})\) for all k∈ℕ. Thus, if there exists a computable f which outputs an example (w,1) from such a sequence after only seeing w_{0}♢w_{1}♢…♢w_{n}, one can extend the sequence in such a way for some figure L with w∉Pos(L), in contradiction to the reduction. Therefore there is no computable function that outputs an example (w,1) from p, meaning that \(\kappa_{\mathrm {H}} \not\le\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}\) and \(\kappa_{\mathrm {H}} \not \le\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}\). □
Theorem 7.5
A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGEX-INF-learnable (resp. FIGEX-TXT-learnable) if and only if the identity \(\mathrm {id}_{\mathcal {F}}\) is \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable (resp. \((\eta_{\mathrm {T}{\scriptsize \mathrm {XT}}}, \kappa \circ\lim_{\mathcal {H}})\)-computable).
Proof
We only prove the case of FIGEX-INF-learning, since we can prove the case of FIGEX-TXT-learning in exactly the same way.
The “if” part: For some M, the above equation (9) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). This means that M is a learner that FIGEX-INF-learns \(\mathcal {F}\). □
Here we consider two more learning criteria, FIGFIN-INF- and FIGFIN-TXT-learning, where the learner generates only one correct hypothesis and halts. This learning corresponds to finite learning, or one-shot learning, introduced by Gold (1967) and Trakhtenbrot and Barzdin (1970). It is a special case of learning with a bound on the mind change complexity, that is, the number of changes of hypothesis, which was introduced by Freivalds and Smith (1993) and is used to measure the complexity of learning classes (Jain et al. 1999). We obtain the following theorem.
Theorem 7.6
A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGFIN-INF-learnable (resp. FIGFIN-TXT-learnable) if and only if the identity \(\mathrm {id}_{\mathcal {F}}\) is (η_{INF},κ)-computable (resp. (η_{TXT},κ)-computable).
Proof
We only prove the case of FIGFIN-INF-learning, since we can prove the case of FIGFIN-TXT-learning in exactly the same way.
The “if” part: For some M, the above equation (10) holds for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). This means that M is a learner that FIGFIN-INF-learns \(\mathcal {F}\). □
Finally, we show a connection between effective learning of figures and the computability of figures. Since FIGEFEX-TXT=∅ (Theorem 5.5), we only treat effective learning from informants. We define the representation \(\gamma:\subseteq \mathcal {H}^{\omega} \to \mathcal {K}^{*}\) by γ(p):=K if p=H_{0},H_{1},… such that \(H_{i} \in \mathcal {H}\) and d_{H}(K,κ(H_{i}))≤2^{−i} for all i∈ℕ.
Lemma 7.7
γ≡κ_{H}.
Proof
By using this lemma, we interpret effective learning of figures as the computability of two identities (Fig. 5).
Theorem 7.8
A set \(\mathcal {F}\subseteq \mathcal {K}^{*}\) is FIGEFEX-INF-learnable if and only if there exists a computable function f such that f is a \((\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}}, \kappa \circ\lim_{\mathcal {H}})\)-realization of the identity \(\mathrm {id}_{\mathcal {F}}\), and f is also a (η_{INF},γ)-realization of the identity \(\mathrm {id}: \mathcal {K}^{*} \to \mathcal {K}^{*}\).
Proof
We prove the latter half of the theorem, since the former part can be proved exactly as for Theorem 7.5.
The “if” part: For some M, id∘η_{INF}(σ)=γ(M_{σ}) for all \(\sigma \in \operatorname {dom}(\eta_{\mathrm {I}{\scriptsize \mathrm {NF}}})\). It follows that M is a learner that FIGEFEX-INF-learns \(\mathcal {F}\). □
Thus in FIGEFEX-INF- and FIGEFEX-TXT-learning of a set of figures \(\mathcal {F}\), a learner M outputs a hypothesis H with κ(H)=K in finite time if \(K \in \mathcal {F}\), and M outputs the “standard” representation of K if \(K \in \mathcal {K}^{*} \setminus \mathcal {F}\), since γ≡κ_{H} by Lemma 7.7. Informally, this means that little information about figures is lost even if they are not explanatorily learnable.
8 Conclusion
We have proposed the learning of figures using self-similar sets based on Gold’s learning model towards a new theoretical framework of binary classification focusing on computability, and demonstrated a learnability hierarchy under various learning criteria (Fig. 3). The key to the computable approach is the amalgamation of discretization of data and the learning process. We showed a novel mathematical connection between fractal geometry and Gold’s model by measuring the lower bound of the size of training data with the Hausdorff dimension and the VC dimension. Furthermore, we analyzed our learning model using TTE (Type-2 Theory of Effectivity) and presented several mathematical connections between computability and learnability.
Many recent methods in machine learning are based on a statistical approach (Bishop 2007). The reason is that many data in the real world are in analog (real-valued) form, and the statistical approach can treat such analog data directly in theory. However, all learning methods are performed on computers. This means that all machine learning algorithms actually treat discretized digital data and, now, most research pays no attention to the gap between analog and digital data. In this paper we have proposed a novel and completely computable learning method for analog data, and have analyzed the method precisely. This work provides a theoretical foundation for computable learning from analog data, such as classification, regression, and clustering.
Müller (2001) and Schröder (2002a) give some interesting examples in the study of computation for real numbers.
Sugiyama et al. (2006, 2009) have also contributed to the area, but their work was only presented at closed workshops.
The reason for this notation is that σ can be viewed as a mapping from ℕ (including 0) to the set of examples.
Acknowledgements
The authors sincerely thank the editor and the anonymous reviewers for their many useful comments and suggestions, which have led to invaluable improvements of this paper. This work was partly supported by Grant-in-Aid for Scientific Research (A) 22240010 and for JSPS Fellows 22⋅5714.