Rethinking statistical learning theory: learning using statistical invariants
Abstract
This paper introduces a new learning paradigm, called Learning Using Statistical Invariants (LUSI), which is different from the classical one. In the classical paradigm, the learning machine constructs a classification rule that minimizes the probability of expected error; it is a data-driven model of learning. In the LUSI paradigm, in order to construct the desired classification function, a learning machine computes statistical invariants that are specific for the problem, and then minimizes the expected error in a way that preserves these invariants; it is thus both data- and invariant-driven learning. From a mathematical point of view, methods of the classical paradigm employ mechanisms of strong convergence of approximations to the desired function, whereas methods of the new paradigm employ both strong and weak convergence mechanisms. This can significantly increase the rate of convergence.
Keywords
Intelligent teacher · Privileged information · Support vector machine · Neural network · Classification · Learning theory · Regression · Conditional probability · Kernel function · Ill-posed problem · Reproducing Kernel Hilbert space · Weak convergence

Mathematics Subject Classification
68Q32 · 68T05 · 68T30 · 83C32

1 Introduction
It is known that Teacher–Student interactions play an important role in human learning. An old Japanese proverb says: “Better than a thousand days of diligent study is one day with a great teacher.” What is it exactly that great Teachers do? This question remains unanswered.
At first glance, it seems that the information that a Student obtains from his interaction with a Teacher does not add too much to the standard textbook knowledge. Nevertheless, it can significantly accelerate the learning process.
This paper is devoted to mechanisms of machine learning which include elements of Teacher–Student (or Intelligent Agent–Learning Machine) interactions. The paper demonstrates that remarks of a Teacher, which can sometimes seem trivial (e.g., in the context of digit recognition, “in digit ‘zero’, the center of the image is empty” or “in digit ‘two’, there is a tail in the lower right part of the image”), can actually add a lot of information that turns out to be essential even for a large training data set.
The mechanism of Teacher–Student interaction presented in this paper is not based on heuristics. Instead, it is based on a rigorous mathematical analysis of the machine learning problem.
In 1960, Eugene Wigner published the paper “Unreasonable Effectiveness of Mathematics in Natural Sciences” Wigner (1960), in which he argued that mathematical structures “know” something about physical reality. Our paper might as well have the subtitle “Unreasonable Effectiveness of Mathematics in Machine Learning” since the idea of the proposed new mechanism originated in rigorous mathematical treatment of the problem, which only afterwards was interpreted as an interaction between an Intelligent Teacher and a Smart Student.^{1}
While analyzing the setting of the learning problem, we take into account some details that were, for simplicity, previously omitted in the classical approach. Here we consider the machine learning problem as a problem of estimating the conditional probability function rather than the problem of finding the function that minimizes a given loss functional. Using mathematical properties of conditional probability functions, we were able to make several steps towards the reinforcement of existing learning methods.
1.1 Content and organization of paper
Our reasoning consists of the following steps:
1. We define the pattern recognition problem as the problem of estimating the conditional probabilities \(P(y=k\mid x),\; k=1,\ldots ,n\) (the probability of class \(y=k\) given observation x): example \(x_*\) is classified as \(y=s\) if \(P(y=s\mid x_*)\) is maximal. In order to estimate \(\max _{k}P(y=k\mid x),\; k=1,\ldots ,n\), we consider n two-class classification problems of finding \(P_k(y^*=1\mid x),\ k=1,\ldots ,n\), where \(y^*=1\) if \(y=k\) and \(y^*=0\) otherwise.
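The reduction described above can be sketched concretely. This is a minimal sketch, not the paper's estimator: the probability columns below are illustrative numbers standing in for the n two-class estimates \(P_k(y^*=1\mid x)\).

```python
import numpy as np

def one_vs_rest_predict(prob_estimates):
    """Classify each example as the class k whose two-class estimate
    P_k(y* = 1 | x) is the largest, as described in the text.

    prob_estimates: (n_examples, n_classes) array whose column k holds
    the estimate of P(y = k | x) from the k-th two-class problem.
    """
    return np.argmax(prob_estimates, axis=1)

# Three two-class estimates for four examples (illustrative numbers).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])
print(one_vs_rest_predict(probs))  # -> [0 1 2 0]
```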
The advantage of definition (1) is that it is based on the cumulative distribution function, the fundamental concept of probability theory. This definition does not directly require (as in the classical case) the existence of density functions and their ratio.
4. The estimation of the conditional probability function by solving the Fredholm integral equation \(Af(x')=F(x)\) is an ill-posed problem. For solving ill-posed problems, Tikhonov and Arsenin (1977) proposed the regularization method, which, under some general conditions, guarantees convergence (in the given metric) of the solutions to the desired function.
In our setting, we face a more difficult problem: we have to solve an ill-posed equation (1) where both the operator and the right-hand side of the equation are defined approximately: \(A_\ell f(x')\approx F_\ell (x)\). In 1978, Vapnik and Stefanyuk (1978) proved that, under some conditions on the operator A of the equation, the regularization method also converges to the desired solution in this case. Section 4.3 outlines the corresponding results.
 1.
the distance \(\rho (A_\ell f(x'),F_\ell (x))\) between the function \(A_\ell f(x')\) on the left-hand side and the function \(F_\ell (x)\) on the right-hand side of the equation;
 2.
the set of functions \(\{f(x')\}\) in which one is looking for the solution of the equation;
 3.
the regularization functional \(W(f(x'))\) and regularization parameter \(\gamma >0\).
 1.
we use the \(L_2\) metric for the distance: \(\rho ^2(A_\ell f(x'),F_\ell (x))=\int (A_\ell f(x')-F_\ell (x))^2d\mu (x)\);
 2.
we solve the equations in the sets of functions \(f(x')\) that belong to the Reproducing Kernel Hilbert Space (RKHS) of the kernel \(K(x,x')\) (see Sect. 5.2);
 3.
we use the square of a function’s norm \(\Vert f(x')\Vert ^2\) as the regularization functional.

The V-matrix in Eq. (3) has the following interpretation: when we are looking for the desired function, we take into account not only the residuals \(\varDelta _i=y_i-f(x_i),\;i=1,\ldots ,\ell \) at the observation points \(x_i\), but also the mutual positions V(i, j) of the observation points \(x_i\) and \(x_j\). The classical solution of the problem (i.e., the least squares method (4)) uses only information about the residuals \(\varDelta _i\).

\(\ell \)-dimensional vector \(Y=(y_1,\ldots ,y_\ell )^T\),

\(\ell \)-dimensional vector \(A=(a_1,\ldots ,a_\ell )^T\) of parameters \(a_i\),

\(\ell \)-dimensional vector-function \(\mathcal{K}(x)=(K(x,x_1),\ldots ,K(x,x_\ell ))^T\),

\((\ell \times \ell )\)-dimensional matrix V of elements \(V(x_i,x_j),~i,j=1,\ldots ,\ell \),

\((\ell \times \ell )\)-dimensional matrix K of elements \(K(x_i,x_j),~i,j=1,\ldots ,\ell \).
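Using this notation, a minimal numerical sketch of a V-matrix type estimate can be given. Everything below is an assumption made for illustration, not the paper's Eq. (3): we take the loss \((Y-KA)^T V (Y-KA)+\gamma A^TKA\) and the formula \(V(i,j)=1-\max (x_i,x_j)\) for the uniform measure on [0, 1].

```python
import numpy as np

# Assumed illustrative loss: (Y - K A)^T V (Y - K A) + gamma * A^T K A.
# Setting the gradient to zero gives K V Y = K (V K + gamma I) A, so for
# a nonsingular K the coefficients solve (V K + gamma I) A = V Y.

rng = np.random.default_rng(0)
ell = 20
x = np.sort(rng.uniform(0.05, 0.95, ell))
y = (x > 0.5).astype(float)

K = np.minimum.outer(x, x)          # INK-spline kernel of order 0
V = 1.0 - np.maximum.outer(x, x)    # assumed V(i,j) for uniform measure on [0,1]
gamma = 0.1

A = np.linalg.solve(V @ K + gamma * np.eye(ell), V @ y)

def f(x_new):
    """Evaluate f(x) = sum_i a_i K(x_i, x) at a new point."""
    return A @ np.minimum(x, x_new)

print(f(0.1), f(0.9))
```

The point of the sketch is the structural difference from least squares: the matrix V couples residuals at different observation points through their mutual positions.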
9. Starting from Sect. 6, we consider the problem of the Teacher–Student interaction. We introduce the following mathematical model of interaction which we call Learning Using Statistical Invariants (LUSI).
The idea of the LUSI model is to minimize the loss functional (8) in the subset of functions (7) satisfying (12). In order to find an accurate approximation of the desired function, we employ mechanisms of convergence that are based on both strong and weak modes. (The classical method employs only the strong mode mechanism.)
An important role in LUSI mechanisms belongs to the Teacher. According to the definition, weak mode convergence requires convergence for all functions \(\psi (x)\in L_2\). Instead, the Teacher replaces the infinite set of functions \(\psi (x)\in L_2\) with a finite set \({{{\mathcal {F}}}}=\{\psi _s(x),~s=1,\ldots ,m\}\).
Let the Teacher define functions \(\psi _s(x),\;s=1,\ldots ,m\) in (11), which we call predicates. Suppose that the values \(C_{\psi _s}\), which we call expressions of predicates, are known. This fact has the following interpretation: equations (11) describe m (integral) properties of the desired conditional probability function. Our goal is thus to find the approximation that has these properties.
The idea is to identify the subset of functions \(P(y=1\mid x)\) for which the expressions of predicates \(\psi _s(x)\) are equal to \(C_{\psi _s}\), and then to select the desired approximation from this subset.
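The invariant condition can be checked numerically. The sketch below uses one plausible empirical reading of such invariants (our assumption, not a quotation of Eq. (102)): for each predicate \(\psi \), the data-side average of \(\psi (x_i)y_i\) should match the average of \(\psi (x_i)P(y=1\mid x_i)\) for an estimate that preserves the invariant.

```python
import numpy as np

def invariant_gap(psi_values, p_hat, y):
    """Discrepancy between the estimate-side and data-side values of one
    empirical invariant (assumed form):
    (1/l) sum_i psi(x_i) P_hat(y=1|x_i)  vs  (1/l) sum_i psi(x_i) y_i.
    """
    return abs(np.mean(psi_values * p_hat) - np.mean(psi_values * y))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
p_true = x                                     # toy model: P(y=1|x) = x
y = (rng.uniform(0, 1, 500) < p_true).astype(float)

# The true conditional probability nearly preserves both invariants
# (the gaps are of order 1/sqrt(l)).
for name, psi in [("psi_0(x)=1", np.ones_like(x)), ("psi_1(x)=x", x)]:
    print(name, invariant_gap(psi, p_true, y))
```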
 (1)
the estimate \(A_V\), which is obtained from the standard learning scenario by the vSVM method (the data-driven part of the estimate), and
 (2)
the term shown in parentheses, which is the correction term based on invariants (13) (the intelligence-driven part of the estimate). We call the sum A the vSVM with m invariants estimate, denoted as vSVM&\(\hbox {I}_m\).
There is an important difference between invariants and features in classical learning models. With an increasing number of invariants, the capacity of the set of functions from which the Student has to choose the desired one decreases (and, as a result, according to VC bounds, this leads to a more accurate estimate). In contrast, with an increasing number of features, the capacity of the set of admissible functions increases (and thus, according to VC bounds, this requires more training examples for an accurate estimate^{5}).
12. Section 6.4 contains examples that illustrate the effectiveness of the ideas of LUSI and remarks on implementation of SVM& \(\hbox {I}_d\) algorithms.
1.2 Phenomenological model of learning
In some environment X, there exists a generator G of random events \(x_i\in X\). This generator G generates events x randomly and independently, according to some unknown probability measure P(x). In this environment, a classifier A operates, which labels the events x; in other words, on any event \(x_i\) produced by generator G, classifier A reacts with a binary signal \(y_i\in \{0,1\}\). The classification \(y_i\) of events \(x_i\) is produced by classifier A according to some unknown conditional probability function \(P(y=1\mid x)\). The problem of learning is formulated as follows: given \(\ell \) pairs$$\begin{aligned} (x_1,y_1),\ldots ,(x_\ell ,y_\ell ) \end{aligned}$$(14)containing events \(x_i\) produced by generator G and classifications \(y_i\) produced by classifier A, find, in a given set of indicator functions, the one that minimizes the probability of discrepancy between classifications of this function and classifier A.
1.3 Risk minimization framework
The first attempt to construct a general learning theory was undertaken in the late 1960s and early 1970s. At that time, learning was mathematically formulated as the problem of expected risk minimization (Vapnik and Chervonenkis 1974; Wapnik and Tscherwonenkis 1979).
1.3.1 Expected risk minimization problem
1.3.2 Empirical loss minimization solution
1.3.3 Problems of theory
 1.
When is the method of empirical loss minimization consistent?
 2.
How close is the value of expected loss to the minimum across the given set of functions?
 3.
Is it possible to formulate a general principle that is better than the empirical loss minimization?
 4.
How to construct algorithms for estimation of the desired function?
1.3.4 Main results of VC theory
 1.
The VC theory defines the necessary and sufficient conditions of consistency for both (i) the case when the probability measure P(x) of the generator G in the phenomenological model is unknown (in this case, the necessary and sufficient conditions are valid for any probability measure P(x)), and (ii) the case when the probability measure P(x) is known (in this case, the necessary and sufficient conditions are valid for the given probability measure P(x)). In both of these cases, the conditions for consistency are described in terms of the capacity of the set of functions. In the first case (consistency for any probability measure), the VC dimension of the set of indicator functions is defined (it has to be finite). In the second case (consistency for the given probability measure), the VC entropy of the set of functions for the given probability measure is defined (the ratio of the VC entropy to the number of observations has to converge to zero).
2. When the VC dimension of a set of indicator functions is finite, the VC theory provides bounds on the difference between the real risk of the function \(f_\ell \) that minimizes the empirical risk functional (15) (the value \(R(f_\ell )\)) and its empirical estimate (17) (i.e., the value \(R_\ell (f_\ell )\)). That difference depends on the ratio of the VC dimension h to the number of observations \(\ell \). Specifically, with probability \(1-\eta \), the bound$$\begin{aligned} R(f_\ell )-R_\ell (f_\ell )\le T\left( \frac{h-\ln \eta }{\ell }\right) \end{aligned}$$(21)holds, which implies the bound$$\begin{aligned} R(f_\ell )-\inf _{f} R(f)\le T_*\left( \frac{h-\ln \eta }{\ell }\right) . \end{aligned}$$(22)In (21) and (22), \(T(h/\ell )\) and \(T_*(h/\ell )\) are known monotonically decreasing functions.
3. VC theory introduces a generalization of the empirical risk minimization method, the so-called method of Structural Risk Minimization (SRM), which is the basic instrument in statistical inference methods. In the SRM method, a nested set of functions$$\begin{aligned} S_1\subset S_2\subset \ldots \subset S_n\subset \ldots \end{aligned}$$(23)is constructed on the set of given functions \(\{f\}\); here \(S_k\) is the subset of all functions \(\{f\}\) that have VC dimension \(h^{(k)}\). For subset \(S_k\), the bound (21) has the form$$\begin{aligned} R(f_\ell ^{(S_k)})\le R_\ell (f^{(S_k)}_\ell )+ T\left( \frac{h^{(k)}-\ln \eta }{\ell }\right) . \end{aligned}$$(24)In order to find the function that provides the smallest guaranteed risk for the given observations, one has to find both the subset \(S_k\) of the structure (23) and the function f in this subset that provide the smallest (over k and \(f\in S_k\)) value of the right-hand side of inequality (24). The Structural Risk Minimization (SRM) method is a strongly universally consistent risk minimization method. One can consider the SRM method as a mathematical realization of the following idea: Choose the simplest^{6} function (one from the subset with the smallest VC dimension) that classifies the training data well.
 4.
Based on VC theory, an algorithm for solving pattern recognition problems was developed: the Support Vector Machine (SVM) algorithm, which realized the SRM method. In this paper, we consider an SVM-type algorithm with the square loss function (16) as the baseline algorithm for comparisons. We introduce two new ideas which form the basis for a new approach. One idea (presented in Part One of the paper) is technical, and the other (presented in Part Two) is conceptual.
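The SRM selection rule of item 3 above can be sketched numerically. This is only an illustration under assumptions: the structure is a nested family of polynomial models, the capacity proxy is the parameter count, and the penalty \(\sqrt{h/\ell }\) is a stand-in for the (unspecified here) function T in bound (24), applied to a regression toy problem rather than indicator functions.

```python
import numpy as np

# Illustrative SRM sketch: over a nested structure S_1 ⊂ S_2 ⊂ ... of
# polynomial models of growing degree, choose the element minimizing
# empirical loss plus a capacity penalty.

rng = np.random.default_rng(2)
ell = 40
x = np.linspace(0, 1, ell)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(ell)

best = None
for degree in range(1, 9):
    coeffs = np.polyfit(x, y, degree)
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    h = degree + 1                      # capacity grows along the structure
    guaranteed = emp_risk + np.sqrt(h / ell)   # stand-in for bound (24)
    if best is None or guaranteed < best[0]:
        best = (guaranteed, degree)
print("chosen degree:", best[1])
```

The empirical risk alone decreases monotonically with degree; the penalized criterion trades it off against capacity, which is the essence of SRM.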
2 Part One: The V-matrix estimate
In this first part, we describe our technical innovations that we use in the second part for constructing algorithms of inference.
3 The observation that defines the V-matrix method
Our first idea is related to the mathematical understanding of a learning model that is different from the phenomenological scheme described in Sect. 1.2.
In the classical approach to learning (analyzed by VC theory), we ignored the fact that the desired decision rule is related to the conditional probability function used by Classifier A. We just introduced some set of indicator functions and defined the goal of learning as finding, in this set, the function that guarantees the smallest expected risk of loss (15).
We consider the problem of estimating the conditional probability function \(P(y=s\mid x)\) using data (14) as the main problem of learning. The construction of the classification rule (25) based on the obtained conditional probability function is then a trivial corollary of this solution.
3.1 Standard definitions of conditional probability and regression functions
In this paper, we estimate the conditional probability function. We assume that \(x\in {{{\mathcal {R}}}}^d\), where \(\mathcal{R}^d=[a_1,c_1]\times \cdots \times [a_d,c_d]\). To simplify the notations, we assume, without loss of generality, that \(a_1=\ldots =a_d=0\), so that \({{{\mathcal {R}}}}^d=\{x:\,0\le x\le \mathbf{c}\}\), where \(\mathbf {c}=(c_1,\ldots ,c_d)^T\).
3.2 Direct definitions of conditional probability function
In order to estimate the conditional probability function or the regression function, we need to find the solution of Eqs. (30) or (31). Note that the cumulative distribution functions that define these equations are unknown, but the corresponding iid data \((x_1,y_1),\ldots ,(x_\ell ,y_\ell )\) are given.
3.3 Estimation of conditional probability function for classification rule
 1.
The conditional probability function directly controls many different statistical invariants that manifest themselves in the training data. Preserving these invariants in the rule is equivalent to incorporating some prior knowledge about the solution. Technically, this allows the learning machine to extract additional information from the data—the information that cannot be extracted directly by existing classical methods. Section 6 is devoted to the realization of this idea.
 2.
Since the goal of pattern recognition is to estimate the classification rule (32) (not the conditional probability function), one can relax the requirements on the accuracy of the conditional probability estimation. We need an accurate estimate of the function in the area \(x\in X\) where the values \(P(y=1\mid x)\) are close to 0.5; conversely, we can afford less accurate estimates in the area where \(|P(y=1\mid x)-0.5|\) is large. This means that the cost of deviation of the estimate \(P_\ell (y=1\mid x)\) from the actual function \(P(y=1\mid x)\) can be monotonically connected to the variance \(\phi (x)=P(y=1\mid x)(1-P(y=1\mid x))\) (the larger the variance, the bigger the cost of error). This fact can be taken into account when one estimates conditional probability functions.
3.4 Problem of inference from data
Note that the solution of operator equations in a given set of functions is, generally speaking, an ill-posed problem. The problem of statistical inference is to find the solution defined by the corresponding equation [(30) or (31)] when both the operator and the right-hand side of the equation are unknown and have to be approximated from the given data (14).

In the remainder of the paper, we use equation (38) only for estimating the conditional probability function^{7} (where \(y\in \{0,1\}\)).

Find the approximation to the desired function by solving Eq. (38).
 1.
Why is the replacement of Cumulative Distribution Functions with their approximation (30) a good idea?
 2.
How to solve the ill-posed problem of statistical inference when both the operator and the right-hand side of the equation are defined approximately?
 3.
How to incorporate the existing statistical invariants into solutions?
 4.
What are constructive algorithms for inference?
4 Main claim of statistics: Glivenko–Cantelli theorem
Theorem
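The Glivenko–Cantelli property (the uniform distance between the empirical and the true cumulative distribution functions vanishes as \(\ell \rightarrow \infty \)) can be illustrated numerically. The sketch below computes the Kolmogorov-type statistic for uniform samples; the specific sample sizes are illustrative.

```python
import numpy as np

def sup_deviation(sample):
    """Kolmogorov-type statistic sup_x |F_l(x) - F(x)| for the uniform
    distribution on [0,1], whose true CDF is F(x) = x."""
    s = np.sort(sample)
    n = len(s)
    # The supremum is attained at a jump of the empirical CDF.
    upper = np.max(np.arange(1, n + 1) / n - s)
    lower = np.max(s - np.arange(0, n) / n)
    return max(upper, lower)

rng = np.random.default_rng(3)
devs = {}
for ell in (100, 10_000):
    devs[ell] = sup_deviation(rng.uniform(0, 1, ell))
    print(ell, devs[ell])
```

The deviation shrinks roughly like \(1/\sqrt{\ell }\), which is the empirical face of the theorem stated above.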
5 Solution of ill-posed problems
In this section, we outline regularization principles that we apply to the solution of the described problems.
5.1 Well-posed and ill-posed problems
5.2 Regularization of ill-posed problems
The solution of ill-posed problems is based on the following lemma.
Lemma
(Lemma about inverse operator) If A is a continuous one-to-one operator defined on a compact set \({{{\mathcal {M}}}}\) of functions \(\{f\}\), then the inverse operator \(A^{-1}\) is continuous on the set \({{{\mathcal {N}}}}=A{{{\mathcal {M}}}}\).
In 1963, Tikhonov proved the following theorem:
Theorem
5.3 Generalization for approximately defined operator
Theorem
Corollary
6 Solution of stochastic ill-posed problems
 1.
The distance \(\rho ^2_{E_2}(A_\ell f,F_\ell )\) in \(E_2\) space: We select the \(L_2(\phi )\) metric.
 2.
The set of functions \(\{f(x)\},~x\in [0,1]^n\) containing the solution \(f_\ell \): We select the Reproducing Kernel Hilbert Space (RKHS) of the kernel \(K(x,x_*)\).
 3.
The regularization functional \( W_{E_1}(f)\) in space \(E_1\): We select the square of the norm of function in the RKHS.
6.1 Distance in space \(L_2\)
Matrix V is a symmetric nonnegative matrix; the maximum value of any column (or any row) of V is attained at the intersection of that column (row) with the diagonal of V.
Algorithmic Remark. As dimensionality increases, it becomes more difficult to realize the advantages of the V-matrix method. This is due to the fact that mutual positions of the vectors in high-dimensional spaces are not expressed as well as in low-dimensional spaces.^{9} The mathematical manifestation of this fact is that, in a high-dimensional space, the V-matrix can be ill-conditioned and, therefore, require regularization. We used the following regularization method: (1) transform the V-matrix to its diagonal form in the basis of its eigenvectors using the appropriate orthonormal mapping T; (2) add a small regularizing value to its diagonal elements (in our experiments, 0.001 was usually sufficient); and (3) use the inverse mapping \(T^{-1}\) to transform the regularized V-matrix back to its original basis.
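The three-step regularization just described can be sketched directly (a minimal sketch; the 2×2 matrix is a toy stand-in for an ill-conditioned V-matrix):

```python
import numpy as np

def regularize_v(V, eps=0.001):
    """Regularize an ill-conditioned symmetric V-matrix as in the remark:
    (1) diagonalize in the eigenbasis, (2) add a small value eps to the
    diagonal (the eigenvalues), (3) map back to the original basis."""
    eigvals, T = np.linalg.eigh(V)           # V = T diag(eigvals) T^T
    return T @ np.diag(eigvals + eps) @ T.T  # inverse mapping T^{-1} = T^T

V = np.array([[1.0, 1.0], [1.0, 1.0]])       # rank-deficient, singular
V_reg = regularize_v(V)
print(np.linalg.cond(V_reg) < np.linalg.cond(V))
```

Since T is orthonormal, the procedure is equivalent to adding \(\epsilon I\) to V, but the eigenbasis form makes explicit that only the (near-zero) eigenvalues are being lifted.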
6.2 Reproducing Kernel Hilbert space
6.2.1 Properties of RKHS
1. Functions from RKHS with bounded square of norm$$\begin{aligned} \sum _{i=1}^\infty \frac{b_i^2}{\lambda _i}\le C \end{aligned}$$(64)belong to a compact set, and therefore the square of the norm of a function can be used as the regularization functional (see Lemma in Sect. 4.2).
2. The function that minimizes the empirical loss functional (51) in RKHS, along with its parametric representation (61) in the infinite-dimensional space of parameters c, has another parametric representation in the \(\ell \)-dimensional space \(\alpha =(\alpha _1,\ldots ,\alpha _\ell )\in R^\ell \), where \(\ell \) is the number of observations:$$\begin{aligned} f(x,\alpha )=\sum _{i=1}^\ell \alpha _i K(x_i,x). \end{aligned}$$(65)(This fact constitutes the content of the so-called Representer Theorem.)
3. The square of the norm of the chosen function, along with representation (63), has the representation$$\begin{aligned} \Vert f(x,\alpha )\Vert ^2=\langle f(x,\alpha ),f(x,\alpha )\rangle =\sum _{i,j=1}^\ell \alpha _i\alpha _jK(x_i,x_j). \end{aligned}$$(66)This representation of the function in RKHS is used to solve the inference problem in high-dimensional space.
6.2.2 Properties of Kernels
 (1)Linear combination of kernels \(K_1(x,x')\) and \(K_2(x,x')\) with nonnegative weights is the kernel$$\begin{aligned} K(x,x')=\alpha _1K_1(x,x')+\alpha _2K_2(x,x'),\;\; \alpha _1\ge 0, \alpha _2\ge 0. \end{aligned}$$(67)
(2) The product of the kernels \(K_1(x,x')\) and \(K_2(x,x')\) is the kernel$$\begin{aligned} K(x,x')=K_1(x,x')K_2(x,x'). \end{aligned}$$(68)In particular, the product of kernels \(K_s(x^s,x'^{s})\) defined on coordinates \(x^s\) of vectors \(x=(x^1,\ldots ,x^m)\) is a multiplicative kernel in the m-dimensional vector space \(x\in R^m\):$$\begin{aligned} K(x,x')=\prod _{s=1}^mK_s(x^s,x'^{s}). \end{aligned}$$(69)
 (3)Normalized kernel is the kernel$$\begin{aligned} K_*(x,x')=\frac{K(x,x')}{{\sqrt{K(x,x)K(x',x')}}}. \end{aligned}$$(70)
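The three closure properties (67), (68), (70) can be checked numerically on kernel matrices: each construction should again produce a positive semidefinite matrix. The particular kernels and points below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 30)

def is_psd(M, tol=1e-7):
    """A symmetric matrix is PSD iff its smallest eigenvalue is >= 0
    (up to numerical tolerance)."""
    return np.min(np.linalg.eigvalsh(M)) > -tol

K1 = np.exp(-5.0 * np.subtract.outer(x, x) ** 2)   # Gaussian kernel matrix
K2 = np.minimum.outer(x, x)                        # INK-spline of order 0

K_sum = 0.5 * K1 + 2.0 * K2                        # property (67)
K_prod = K1 * K2                                   # property (68), Schur product
d = np.sqrt(np.diag(K_prod))
K_norm = K_prod / np.outer(d, d)                   # property (70)

print(all(is_psd(K) for K in (K_sum, K_prod, K_norm)))  # -> True
```

The product case is the Schur product theorem; normalization is a congruence with a diagonal matrix, which also preserves positive semidefiniteness.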
6.2.3 Examples of Mercer Kernels
(1) The Gaussian kernel in \(x\in R^1\) has the form$$\begin{aligned} K(x,x')=\exp \{-\delta (x-x')^2\}, \quad x,x'\in R^1, \end{aligned}$$(71)where \(\delta >0\) is a free parameter of the kernel. In m-dimensional space \(x\in R^m\), the Gaussian kernel has the form (69)$$\begin{aligned} K(x,x')=\prod _{k=1}^m\exp \{-\delta (x^k-x'^{k})^2\}= \exp \{-\delta \Vert x-x'\Vert ^2\},\quad x,x'\in R^m. \end{aligned}$$(72)
(2) The INK-spline kernel (spline with an infinite number of knots). The INK-spline kernel of order d was introduced in Vapnik (1995). For \(x\in [0,c]\), it has the form$$\begin{aligned} K(x,x')= \int _0^c(x-t)^d_+(x'-t)^d_+dt= \sum _{r=0}^d\frac{C^r_d}{2d-r+1}[\min (x,x')]^{2d-r+1}|x-x'|^r,\nonumber \\ \end{aligned}$$(73)where we have denoted \((z)_+=\max (z,0)\). In particular, the INK-spline kernel of order 0 has the form$$\begin{aligned} K_0(x,x')=\min (x,x'). \end{aligned}$$(74)This INK-spline is used to approximate piecewise continuous functions. The INK-spline kernel of order 1 has the form$$\begin{aligned} K_1(x,x')= \frac{1}{2}|x-x'|\min (x,x')^2+\frac{\min (x,x')^3}{3}. \end{aligned}$$(75)This INK-spline is used to approximate smooth functions. Its properties are similar to those of cubic splines used in the classical theory of approximation.
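The kernels (71), (74), (75) are one-liners; a quick numerical check confirms that the closed form (75) agrees with its defining integral (73) for d = 1 (evaluation points 0.7 and 0.4 are arbitrary):

```python
import numpy as np

def gaussian_kernel(x, xp, delta=1.0):
    """Gaussian kernel (71): K(x, x') = exp(-delta (x - x')^2)."""
    return np.exp(-delta * (x - xp) ** 2)

def ink_spline_0(x, xp):
    """INK-spline kernel of order 0, Eq. (74): K_0(x, x') = min(x, x')."""
    return np.minimum(x, xp)

def ink_spline_1(x, xp):
    """INK-spline kernel of order 1, Eq. (75)."""
    m = np.minimum(x, xp)
    return 0.5 * np.abs(x - xp) * m ** 2 + m ** 3 / 3.0

# Midpoint-rule evaluation of the defining integral (73) with d=1, c=1.
t = (np.arange(200_000) + 0.5) / 200_000
numeric = np.mean(np.maximum(0.7 - t, 0) * np.maximum(0.4 - t, 0))
print(abs(numeric - ink_spline_1(0.7, 0.4)) < 1e-6)  # -> True
```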
6.3 Basic solution of inference problems
1. \(\ell \)-dimensional vector A of elements \(\alpha _i\):$$\begin{aligned} A=(\alpha _1,\ldots ,\alpha _\ell )^T; \end{aligned}$$
2. \(\ell \)-dimensional vector-function \({{{\mathcal {K}}}}(x)\):$$\begin{aligned} {{{\mathcal {K}}}}(x)=(K(x_1,x),\ldots ,K(x_\ell ,x))^T; \end{aligned}$$
3. \((\ell \times \ell )\)-dimensional matrix K with elements \(K(x_i,x_j)\):$$\begin{aligned} K=K(x_i,x_j),\quad i,j=1,\ldots ,\ell ; \end{aligned}$$
4. \((\ell \times \ell )\)-dimensional matrix V of elements V(i, j):$$\begin{aligned} V=V(i,j),\quad i,j=1,\ldots ,\ell ; \end{aligned}$$
5. \(\ell \)-dimensional vector of elements \(y_i\) of the training set:$$\begin{aligned} Y=(y_1,\ldots ,y_\ell )^T; \end{aligned}$$
6. \(\ell \)-dimensional vector of ones:$$\begin{aligned} 1_\ell =(1,\ldots ,1)^T; \end{aligned}$$
6.4 Closed-form solution of the minimization problem
Remark
6.5 Dual estimate of conditional probability
When one estimates the conditional probability function, this equality forms a prior knowledge.
Note that the dual estimate takes into account how well the matrix K is conditioned: the solution of the optimization problem uses the expressions \(K^+K\) and \(K^+1_\ell \) [see formulas (96), (97)].
Remark
7 Part Two: Intelligence-driven learning: learning using statistical invariants
In this section, we introduce a new paradigm of learning called Learning Using Statistical Invariants (LUSI), which is different from the classical data-driven paradigm.
The new learning paradigm considers a model of Teacher–Student interaction. In this model, the Teacher helps the Student to construct statistical invariants that exist in the problem. While selecting the approximation of conditional probability, the Student chooses a function that preserves these invariants.
8 Strong and weak modes of convergence
1. The distance between functions$$\begin{aligned} \rho (f_1,f_2)= \Vert f_1(x)-f_2(x)\Vert \end{aligned}$$that is defined by the metric of the space \(L_2\), and
2. The inner product between functions$$\begin{aligned} R(f_1,f_2)=(f_1(x),f_2(x)) \end{aligned}$$that has to satisfy the corresponding requirements.
1. The strong mode of convergence (convergence in metrics)$$\begin{aligned} \lim _{\ell \rightarrow \infty }\Vert f_\ell (x)-f_0(x)\Vert =0 \end{aligned}$$
2. The weak mode of convergence (convergence in inner products)$$\begin{aligned} \lim _{\ell \rightarrow \infty }(f_\ell (x)-f_0(x),\psi (x))=0, \quad \forall \psi (x)\in L_2 \end{aligned}$$(note that convergence has to take place for all functions \(\psi (x)\in L_2\)).
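The distinction between the two modes can be seen numerically with the textbook example \(f_\ell (x)=\sin (2\pi \ell x)\) on [0, 1] (our illustrative choice): its inner product with any fixed \(\psi \) vanishes as \(\ell \) grows, yet its \(L_2\) norm stays at \(1/\sqrt{2}\), so it converges weakly but not strongly to zero.

```python
import numpy as np

# Numerical illustration of weak vs strong convergence on a midpoint grid.
x = (np.arange(1_000_000) + 0.5) / 1_000_000   # midpoints on [0, 1]
psi = x ** 2                                   # an arbitrary test function

inners, norms = {}, {}
for ell in (1, 10, 100):
    f = np.sin(2 * np.pi * ell * x)
    inners[ell] = np.mean(f * psi)             # ~ (f_l, psi), shrinks with l
    norms[ell] = np.sqrt(np.mean(f ** 2))      # ~ ||f_l||, stays near 0.707
    print(ell, round(inners[ell], 4), round(norms[ell], 4))
```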
In the first part of the paper, we showed that, in the classical (data-driven) paradigm, we can estimate the conditional probability function by solving the ill-posed problem of the approximately defined Fredholm equation (30), obtaining the V-matrix estimate (see Sect. 5.4).
In the second part of the paper, we consider new learning opportunities based on interaction with the Teacher. The goal of that interaction is to include mechanisms of both weak and strong convergence in learning algorithms.
Here is the essence of the new mechanism. According to the definition, weak convergence has to take into account all functions \(\psi (x)\in L_2\). The role of the Teacher in our model is to replace this infinite set of functions with a specially selected finite set of predicate-functions \(\mathcal{{P}}=\{\psi _1(x),\ldots ,\psi _m(x)\}\) that describe properties of the desired conditional probability function, thus restricting the scope of weak convergence to the set of predicate functions \({{{\mathcal {P}}}}\).
8.1 Method of learning using statistical invariants
Let us describe a method of extracting specific intelligent information from data by preserving statistical invariants.
 1.
Given pairs \((\psi _k(x),a_k),~k=1,\ldots ,m\) (predicates and their expectations for the desired conditional probability function), find the set of conditional probability functions \({{{\mathcal {F}}}}=\{P(y=1\mid x)\}\) satisfying equalities (102) (preserving the invariants).
 2.
Select, in the set of functions \({{{\mathcal {F}}}}\) satisfying invariants (102), the function that is the solution of our estimation problem (the function of form (88) with parameters that minimize (80) (or (92))).
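The two steps above can be sketched as an equality-constrained optimization. This is a hedged reading, not the paper's exact formulation: we take the expansion \(f(x)=\sum _i a_iK(x_i,x)\), a regularized squared loss as a stand-in for (80), and linear invariant constraints \(\varPhi ^TKA=\varPhi ^TY\) as one empirical reading of (102), solved via the KKT system.

```python
import numpy as np

rng = np.random.default_rng(5)
ell = 50
x = np.sort(rng.uniform(0, 1, ell))
y = (rng.uniform(0, 1, ell) < x).astype(float)   # toy model: P(y=1|x) = x

K = np.minimum.outer(x, x)                       # INK-spline kernel of order 0
Phi = np.column_stack([np.ones(ell), x])         # predicates psi_0=1, psi_1=x
C, b = Phi.T @ K, Phi.T @ y                      # invariant constraints C A = b
gamma = 0.1

# KKT system for: min ||K A - y||^2 + gamma A^T K A  subject to  C A = b.
m = C.shape[0]
top = np.hstack([2 * (K.T @ K + gamma * K), C.T])
bot = np.hstack([C, np.zeros((m, m))])
sol = np.linalg.solve(np.vstack([top, bot]),
                      np.concatenate([2 * K.T @ y, b]))
A = sol[:ell]
print(np.allclose(C @ A, b))                     # invariants are preserved
```

Step 1 (the feasible set) appears here as the constraint rows; step 2 (selecting the best function in that set) is the quadratic objective minimized over it.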
Remark
8.2 Closed-form solution of intelligence-driven learning
When estimating the conditional probability function, one can also take into account additional prior knowledge (87). To find the parameters \(A_V\) and c of the approximation that takes this information into account, one has to solve the following quadratic optimization problem: minimize the functional (80) subject to m equality constraints (107) and \(\ell \) inequality constraints (87).
8.3 Dual intelligence-driven estimate of conditional probability
The dual intelligence-driven estimate of conditional probability requires minimization of the functional (93) subject to equality-type constraints (105).
8.4 LUSI methods: illustrations and remarks
 1.
SVM estimate
 2.
vSVM estimate.^{11}
 3.SVM&\(\hbox {I}_{n}\) estimate (SVM with n invariants). In this section, invariants are defined by the simple predicate functions^{12}$$\begin{aligned} \psi _0(x)=1\quad \text{ and }\quad \psi _1(x)=x. \end{aligned}$$
4.
vSVM& \(\hbox {I}_{n}\) estimate (vSVM with n invariants).
The first row of Fig. 1 shows results of SVM estimates (square-loss SVM); the second row shows results of vSVM estimates; the third row shows results of SVM&\(\hbox {I}_2\) estimates; the fourth row shows results of vSVM&\(\hbox {I}_2\) estimates. The last two rows of Fig. 1 show results obtained by modified SVM&\(\hbox {I}_2\) and modified vSVM&\(\hbox {I}_2\), which are described in subdivision 4 after Table 1.
From Fig. 1, one can see that vSVM estimates are consistently better than SVM estimates and that adding the invariants consistently improves the quality of approximations. One can also see that approximations obtained using vSVM provide smoother functions.
1. The distance between the estimate and the true function in \(L_2\) metric$$\begin{aligned} \rho _{L_2}(P,P_\ell )=\left( \int (P(y=1\mid x)-P_\ell (y=1\mid x))^2d\mu (x)\right) ^{1/2}. \end{aligned}$$
 2.
The error rate \(P_{err}(r_\ell )\) of the rule \(r(x)=\theta (P(y=1\mid x)-0.5)\) in percent, and
3. The values of relative (with respect to the Bayesian rule) losses$$\begin{aligned} \kappa _{err}(r_\ell )=\frac{P_{err}(r_\ell )-P_{err}(r_B)}{P_{err}(r_B)}. \end{aligned}$$
Experiments with the one-dimensional case

Training  SVM  vSVM  SVM&\(\hbox {I}_2\)  vSVM&\(\hbox {I}_2\)  mSVM&\(\hbox {I}_2\)  mvSVM&\(\hbox {I}_2\)
Distance to the true conditional probability in \(L_2\) metric
48  0.3756  0.2166  0.1432  0.1070  0.1064  0.0940
96  0.3212  0.1808  0.1207  0.0778  0.0950  0.0863
192  0.2273  0.1072  0.0689  0.0609  0.0461  0.0557
Error rate in %
48  59.79  11.00  11.38  11.02  11.00  11.14
96  23.41  13.28  12.23  11.44  11.39  11.29
192  16.57  11.68  11.54  11.33  11.11  11.11
Relative error
48  4.50  0.02  0.05  0.01  0.01  0.02
96  1.15  0.22  0.12  0.05  0.05  0.04
192  0.52  0.07  0.06  0.04  0.02  0.02
4. From Fig. 1 and Table 1, one can see that, with increasing sophistication of conditional probability estimation methods, the obtained results are getting monotonically closer (in \(L_2\) metric) to the true conditional probability function. However, this monotonic convergence does not always hold for the error rates provided by the constructed conditional probability functions. This is because our mathematical goal was to approximate the conditional probability function in \(L_2\) metric (not the decision rule). A good estimate of conditional probability for a classification rule is the function that crosses the line \(y=0.5\) close to the point where this line is crossed by the true conditional probability; it does not have to be close to the true function in \(L_2\) metric (see Fig. 1).
In reality, we do not know the function \(P(y=1|x)\). However, we can use its estimate \(P_\ell (y=1|x)\) to compute \(\sigma (x_i)\). Therefore, in order to construct a special conditional probability function for subsequent creation of a decision rule, we can use the following two-stage procedure: in the first stage, we estimate (SVM&\(\hbox {I}_n\) or vSVM&\(\hbox {I}_n\)) approximations of \(P_\ell (y=1|x)\) and \(\sigma ^2(x)\), and, in the second stage, we obtain an estimate of the specialized conditional probability function (mSVM&\(\hbox {I}_n\) or mvSVM&\(\hbox {I}_n\)) for the decision rules.
The last two rows of Fig. 1 and the last two columns of Table 1 compare the rules obtained using SVM&\(\hbox {I}_2\) and vSVM&\(\hbox {I}_2\) estimates with the approximations obtained using mSVM&\(\hbox {I}_2\) and mvSVM&\(\hbox {I}_2\) estimates (in both cases, we used the function \(\phi (x)=\sigma ^2(x)\)). It is interesting to note that estimates based on the modified weight function \(\phi (x)\) not only improve the error rates of the corresponding classification rules but also provide better approximations of the conditional probability functions in the \(L_2\) metric.
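A minimal sketch of this two-stage procedure, under simplifying assumptions of ours (plain weighted kernel ridge estimates stand in for the SVM&\(\hbox {I}_n\)/mSVM&\(\hbox {I}_n\) algorithms, an RBF kernel is used, and \(\sigma ^2(x)\) is taken as \(P_\ell (y=1|x)(1-P_\ell (y=1|x))\), the variance of a Bernoulli label); all function names are ours, not the paper's:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_estimate(X, y, weights, reg=1e-3, gamma=1.0):
    """Weighted regularized RKHS estimate: minimize
       sum_i w_i (f(x_i) - y_i)^2 + reg * ||f||_K^2,   f = K A.
    The minimizer satisfies A = (W K + reg I)^{-1} W y."""
    K = rbf_kernel(X, X, gamma)
    W = np.diag(weights)
    A = np.linalg.solve(W @ K + reg * np.eye(len(y)), W @ y)
    return K @ A   # fitted values: estimates of P(y=1|x_i)

def two_stage_estimate(X, y, reg=1e-3, gamma=1.0):
    """Stage 1: unit-weight estimate of P(y=1|x).
    Stage 2: re-estimate with the weight function phi(x) = sigma^2(x)."""
    p1 = np.clip(kernel_estimate(X, y, np.ones(len(y)), reg, gamma),
                 1e-3, 1.0 - 1e-3)
    phi = p1 * (1.0 - p1)   # Bernoulli-variance weights (our assumption)
    return kernel_estimate(X, y, phi, reg, gamma)
```

The weight function emphasizes points near the decision boundary, where the Bernoulli variance is largest, which is consistent with the goal of placing the crossing of \(y=0.5\) accurately.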
 1.
vSVM results are more accurate than SVM results.
 2.
Learning using statistical invariants can significantly improve performance. Both vSVM&\(\hbox {I}_2\) and SVM&\(\hbox {I}_2\) algorithms using 48 training examples achieve much better performance (very close to the Bayesian one) than plain SVM or vSVM algorithms using 192 examples. The results obtained with the mSVM&\(\hbox {I}_2\) method are consistently better than those obtained with the SVM&\(\hbox {I}_2\) method.
 3.
The effect of adding invariants appears to be more significant than the effect of upgrading SVM to vSVM.
Table 2  Experiments with multidimensional data

| Data set | Training | Test | Features | SVM (%) | SVM&\(\hbox {I}_{(n+1)}\) (%) |
|---|---|---|---|---|---|
| Diabetes | 562 | 206 | 8 | 30.94 | 22.73 |
| Bank marketing | 445 | 4076 | 16 | 12.06 | 10.58 |
| MAGIC | 1005 | 18,015 | 10 | 19.03 | 15.10 |
| Parkinsons | 135 | 60 | 22 | 7.26 | 6.67 |
| Sonar | 160 | 48 | 60 | 12.48 | 12.40 |
| Ionosphere | 271 | 80 | 33 | 5.66 | 5.55 |
| WPBC | 134 | 60 | 33 | 25.48 | 23.02 |
| WDBC | 419 | 150 | 30 | 2.64 | 2.50 |
Table 3  Experiments with different sizes of training data

Diabetes:

| Training | SVM (%) | SVM&\(\hbox {I}_9\) (%) |
|---|---|---|
| 71 | 32.42 | 27.52 |
| 151 | 29.97 | 24.56 |
| 304 | 31.35 | 23.78 |
| 612 | 30.43 | 23.30 |

MAGIC:

| Training | SVM (%) | SVM&\(\hbox {I}_{11}\) (%) |
|---|---|---|
| 242 | 20.51 | 17.35 |
| 491 | 20.93 | 15.91 |
| 955 | 18.89 | 15.19 |
| 1903 | 18.03 | 14.25 |
Remark
When analyzing the experiments, it is important to remember that the Intelligent Teacher suggests to the Student “meaningful” functions for invariants instead of the simple ones that we used here for illustrative purposes. In many of our experiments, the invariants were almost satisfied: the correcting parameters \(\mu _s\) in (119) were close to zero. Some ideas about the form these “meaningful” invariants can take are presented in the next sections.
8.5 Examples of invariants with nonlinear predicate functions
Example
Table 2 shows that, using nine simple invariants (moments of zeroth and first order), SVM&\(\hbox {I}_9\) achieved an error rate of 22.73% on “Diabetes”. In order to obtain a better performance, we introduce an additional invariant constructed in some two-dimensional subspace of the eight-dimensional space X of the problem (Fig. 2). We choose (using training data) the square box \({{{\mathcal {B}}}}\) (shown in Fig. 2) and consider the following predicate: \(\psi _3(x)=1\) if \(x\in {{{\mathcal {B}}}}\) and \(\psi _3(x)=0\) otherwise. Our additional invariant is defined by this predicate \(\psi _3(x)\). As a result, when using all 10 invariants, we decrease the error rate to 22.07%.
Note that the function \(\psi _3(x)\) was obtained due to some intelligent input: we selected the square box intuitively, just by looking at the positions of the training points.^{13} Perhaps we can continue to improve performance (if we are still far from the Bayesian rate) by adding more invariants (choosing different subspaces, different boxes, and different functions^{14}).
As we have already noted, there exists an important difference between the idea of features used in classical algorithms and the idea of predicates used for constructing invariants.
In order to construct an accurate rule, the classical paradigm recommends introducing special functions called features and constructing the rule based on these features. With an increasing number of features, one increases the capacity (the VC dimension) of the set of admissible functions from which one chooses the solution. According to the VC bounds, the larger the capacity of that set, the more training examples are required for an accurate estimate.
With an increase in the number of invariants, one decreases the capacity of the admissible set of functions, since the functions from which one chooses the solution must preserve all the invariants. According to the same VC bounds, decreasing the capacity of the set of admissible functions improves the performance achievable with the same number of training examples.
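The invariant mechanism can be made concrete with a small sketch. Below, a Tikhonov-regularized least-squares estimate \(f(x)=\sum _j A_j K(x,x_j)\) is computed subject to the equality constraints \(\varPhi _s^T(KA)=\varPhi _s^TY\) (the invariants of (107), with \(c=0\) as in the introduction) via a KKT linear system. This is our illustrative simplification, not the paper's exact SVM&\(\hbox {I}_n\) algorithm:

```python
import numpy as np

def solve_with_invariants(K, y, Phi, reg=1e-3):
    """Minimize ||K A - y||^2 + reg ||A||^2 over coefficients A,
    subject to the invariant constraints Phi_s^T (K A) = Phi_s^T y,
    s = 1..m, via the KKT (Lagrange multiplier) linear system."""
    n, m = K.shape[0], Phi.shape[1]
    H = K.T @ K + reg * np.eye(n)      # Hessian of the objective in A
    C = K @ Phi                        # constraints read C^T A = Phi^T y
    kkt = np.block([[H, C], [C.T, np.zeros((m, m))]])
    rhs = np.concatenate([K.T @ y, Phi.T @ y])
    return np.linalg.solve(kkt, rhs)[:n]
```

Every invariant added this way removes one degree of freedom from the admissible set: the solution is forced onto the affine subspace where all invariants hold exactly.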
8.6 More examples of intelligencedriven invariants
The role of the Intelligent Teacher in the training processes described in this paper is to introduce functions \(\psi _1(x),\ldots , \psi _n(x)\) for the problem of interest. The Student has to understand which functions the Teacher suggested and use them to create invariants. Below are some examples of functions that the Teacher can suggest for invariants.
Example 1
Suppose that the Teacher teaches the Student to recognize digits by providing a number of examples and also by suggesting the following heuristics: “In order to recognize the digit zero, look at the center of the picture—it is usually light; in order to recognize the digit 2, look at the bottom of the picture—it usually has a dark tail” and so on.
According to the theory above, the Teacher wants the Student to construct specific predicates \(\psi (x)\) and use them for invariants. However, the Student does not necessarily construct exactly the same predicate that the Teacher had in mind (the Student’s understanding of the concepts “center of the picture” or “bottom of the picture” can differ). Instead of \(\psi (x)\), the Student constructs a function \(\widehat{\psi }(x)\). However, this is acceptable, since any function from \( L_2\) can serve as a predicate for an invariant.
Generally speaking, the vector z could be defined by any real-valued function that is linear in the pixel space.
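As an illustration of how a Student might encode such a hint, here is a hypothetical predicate for “the center of the image is light”. All conventions are our assumptions: grayscale images with ink intensity in [0, 1] (0 = blank, 1 = dark), and the “center” taken as a window covering the middle quarter of each side, the Student's own reading of the hint:

```python
import numpy as np

def center_is_light(image, frac=0.25):
    """Student's approximate predicate psi-hat for the Teacher's hint
    'in digit zero, the center of the image is empty': one minus the
    average ink intensity of a central window of the image."""
    h, w = image.shape
    dh, dw = int(h * frac), int(w * frac)
    center = image[h // 2 - dh // 2: h // 2 + dh // 2 + 1,
                   w // 2 - dw // 2: w // 2 + dw // 2 + 1]
    return 1.0 - center.mean()   # high value = light (empty) center
```

Evaluated on every training image \(x_i\), this function yields a vector \((\widehat{\psi }(x_1),\ldots ,\widehat{\psi }(x_\ell ))^T\) that can serve as \(\varPhi \) in an invariant.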
Example 2
The next invariants are inspired by elements of machine learning techniques. In order to use these techniques, we do not have to introduce an explicit form of the predicate function. Instead, we introduce an algorithm for computing (using training data) the predicate function at any point of interest.
Example 3
For any vector \(x_i\) of the training set, let us compute the following characteristic: the number of examples of the first class that belong to the sphere of radius \(\rho \) with center \(x_i\). Let this value be \(\psi _\rho (x_i)\). Consider the vector \(\varPhi _\rho =(\psi _\rho (x_1),\ldots ,\psi _\rho (x_\ell ))^T\) to define the invariant. Choosing different values of the radius \(\rho \), one constructs different local characteristics of the desired function.
Invariants with vectors \(\varPhi _{\rho _k},~k=1,\ldots ,n\) provide a description of the structure of the conditional probability function \(P(y=1|x)\). They can be taken into account when one estimates \(P(y=1|x)\).
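A sketch of computing the vectors \(\varPhi _\rho \) (our own illustrative code; whether a first-class center point counts itself is a convention we chose, since a point lies at distance zero from itself):

```python
import numpy as np

def sphere_counts(X, y, rho):
    """psi_rho(x_i): number of first-class (y == 1) training examples
    inside the closed sphere of radius rho centered at x_i.  Returns
    the vector Phi_rho = (psi_rho(x_1), ..., psi_rho(x_l))^T."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    first_class = (y == 1)
    return (dists <= rho)[:, first_class].sum(axis=1).astype(float)
```

Computing `sphere_counts` for several radii \(\rho _1,\ldots ,\rho _n\) produces the family of vectors \(\varPhi _{\rho _k}\) mentioned above.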
Example 3a
8.7 Example of a simple LUSI algorithm
 Step 0.
Construct a vSVM (or SVM) estimate of the conditional probability function (see Sects. 5.5 and 6.3).
 Step 1.
Find the maximal disagreement value \({{{\mathcal {T}}}}_s\) (108) over the vectors$$\begin{aligned} \varPhi _k=(\psi _k(x_1),\ldots ,\psi _k(x_\ell ))^T, \quad k=1,\ldots ,s,\ldots . \end{aligned}$$
 Step 2.
If the value \({{{\mathcal {T}}}}_s\) is large, i.e., if$$\begin{aligned} {{{\mathcal {T}}}}_s=\max ({{{\mathcal {T}}}}_1,\ldots ,{{{\mathcal {T}}}}_s,\ldots )>\delta , \end{aligned}$$add the invariant (107) with \(\varPhi _s\); otherwise stop.
 Step 3.
Find the new approximation of the conditional probability function and go to Step 1.
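The loop of Steps 0–3 can be sketched as follows, with two loudly flagged assumptions of ours: the estimator is a simple Tikhonov-regularized kernel least squares rather than vSVM, and the disagreement value \({{{\mathcal {T}}}}_s\) is replaced by a normalized invariant violation standing in for the quantity (108):

```python
import numpy as np

def fit(K, y, Phi_list, reg=1e-3):
    """Regularized least squares min ||K A - y||^2 + reg ||A||^2,
    subject to exact satisfaction of the chosen invariants (KKT system)."""
    n = K.shape[0]
    H = K.T @ K + reg * np.eye(n)
    if not Phi_list:
        return np.linalg.solve(H, K.T @ y)
    P = np.stack(Phi_list, axis=1)        # n x m matrix of predicate vectors
    C = K @ P                             # constraints: C^T A = P^T y
    m = P.shape[1]
    kkt = np.block([[H, C], [C.T, np.zeros((m, m))]])
    rhs = np.concatenate([K.T @ y, P.T @ y])
    return np.linalg.solve(kkt, rhs)[:n]

def lusi_loop(K, y, candidates, delta=0.05, reg=1e-3):
    """Greedy LUSI sketch: start from the unconstrained estimate,
    repeatedly add the candidate invariant with the largest
    disagreement, stop when all disagreements fall below delta."""
    chosen = []                           # Step 0: no invariants yet
    A = fit(K, y, chosen, reg)
    while candidates:
        f = K @ A
        # Step 1: disagreement of each remaining candidate (a
        # hypothetical normalized stand-in for (108))
        scores = [abs(Phi @ f - Phi @ y) / (abs(Phi @ y) + 1e-12)
                  for Phi in candidates]
        s = int(np.argmax(scores))
        if scores[s] <= delta:            # Step 2: all nearly satisfied
            break
        chosen.append(candidates.pop(s))  # Step 2: add the invariant
        A = fit(K, y, chosen, reg)        # Step 3: new approximation
    return A, chosen
```

Each pass through the loop preserves every previously added invariant exactly, so the admissible set shrinks monotonically, mirroring the capacity argument of the previous subsection.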
9 Conclusion
In this paper, we introduced the LUSI paradigm of learning which, in addition to the standard data-driven mechanism of risk minimization, leverages an intelligence-driven mechanism of preserving statistical invariants (constructed using training data and given predicates). In this new paradigm, one first selects (using invariants) an admissible subset of functions which contains the desired solution, and then chooses the solution from this subset using standard training procedures.
The important properties of LUSI are as follows: if the number \(\ell \) of observations is sufficiently large, then (1) the admissible subset of functions always contains a good approximation to the desired solution regardless of the number of invariants used, and (2) the approximation to the desired solution is chosen by methods that provide global minima of the guaranteed risk^{16}.
The LUSI method can be used to increase the accuracy of the obtained solution and to decrease the number of necessary training examples.
Footnotes
 1.The idea of the new approach is based on the analysis of two mathematical facts:
 (1)
Direct definition of conditional probability and regression functions (Sect. 2.2).
 (2)
Existence of both strong and weak modes of convergence in Hilbert space, which became the foundation for two different mechanisms of generalization: the classical data-driven mechanism and the new intelligence-driven mechanism (Sect. 6).
 2.The standard definition of the conditional probability function for continuous x is as follows. Let the probability distribution be defined on pairs (x, y). If y takes discrete values from \(\{0,1,\ldots ,k\}\), the conditional probability \(P(y=t|x)\) of \(y=t\) given the vector x is defined as the ratio of two density functions \(p(y=t,x)\) and p(x):$$\begin{aligned} P(y=t|x)=\frac{p(y=t,x)}{p(x)},\quad t\in \{0,1,\ldots ,k\}. \end{aligned}$$
 3.
In Sect. 5.3, we consider a more general set of functions \(f(x)=A^T{{{\mathcal {K}}}}(x)+c\), where c is a constant. In this introduction, in order to simplify the notations, we set \(c=0\).
 4.
In other words, if one wants to find a rule for identification of ducks, the first thing one has to do is to find a set of rules that do not contradict the basic “duck test” of identification, i.e., birds that look like a duck, swim like a duck, and quack like a duck. Then one has to select the rule for identification of ducks within this set of rules.
 5.
In Vapnik and Izmailov (2017), we showed that, in datadriven estimates, \(\ell \) examples \((x_i,y_i), i=1,\ldots ,\ell \) can provide no more than \(\ell \) bits of information. However, using one invariant in intelligencedriven estimates, \(\ell \) examples can provide more than \(\ell \) bits of information (Sect. 6.3).
 6.
The definition of function simplicity is not trivial. In order to formalize the concept of simplicity for a set of functions, K. Popper introduced the concept of falsifiability of the set of functions (Popper 1934). However, the mathematical formalization of his idea contained an error. The corrected formalization of the falsifiability concept leads to the concept of VCdimension—see Corfield et al. (2005, 2009) for details.
 7.
 8.
Note that this idea of solving ill-posed problems is the same as in structural risk minimization in VC theory. In both cases, a structure is defined on the set of functions. When solving well-posed problems, elements of the structure should have finite VC-dimension. When solving ill-posed problems, elements of the structure should be compact sets.
 9.
Recall that almost all the points in a high-dimensional ball lie close to the surface of the ball.
 10.Instead of (107), it is more accurate to consider the inequality constraints$$\begin{aligned} -\delta {{{\mathcal {T}}}}\le \varPhi _s^TKA+c\varPhi ^T_s1_\ell -\varPhi _s^TY\le \delta {{{\mathcal {T}}}}, \quad s=1,\ldots ,m. \end{aligned}$$In this case, one has to solve the following quadratic optimization problem: minimize (80) (or (92)) subject to these inequality constraints.
 11.
 12.
These functions define the values of the zeroth-order and first-order moments of the conditional probability function \(P(y=1|x)\).
 13.
This is the same methodology that physicists use: “Find a situation (the box \({{{\mathcal {B}}}}\) in Fig. 2) where the existing model of Nature (the approximation \(P_n(y=1|x)\)) contradicts reality (contradicts the data \((x_i,y_i)\) from the box \({{{\mathcal {B}}}}\)), and then fix the model (obtain a new approximation \(P_{n+1}(y=1|x)\)) that does not have these contradictions.” Note that the most difficult part of model refinement is finding a contradicting situation.
 14.
The choice of the best position of the box can be made algorithmically.
 15.
The Student can choose any appropriate weights.
 16.
The idea of two-stage learning is also realized in deep neural networks (DNN), where, at the first stage (using a “deep architecture”), an appropriate network is constructed, and then, at the second stage, the solution is obtained using standard NN training procedures. DNN, however, can guarantee neither that the constructed network contains a good approximation to the desired function nor that it can find the best solution for the given network.
Acknowledgements
This material is based upon work partially supported by AFRL under contract FA95501510502. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. We thank the anonymous reviewers, S. Sugrim, and P. Toccaceli for their careful reading of our paper and their remarks, which helped to improve the paper’s readability.
References
 Corfield, D., Schölkopf, B., & Vapnik, V. (2005). Popper, falsification and the VC-dimension. Technical Report 145, Max Planck Institute for Biological Cybernetics.
 Corfield, D., Schölkopf, B., & Vapnik, V. (2009). Falsificationism and statistical learning theory: Comparing the Popper and Vapnik–Chervonenkis dimensions. Journal for General Philosophy of Science, 40(1), 51–58.
 Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Feb 2018.
 Popper, K. (1934). The logic of scientific discovery. London: Hutchinson.
 Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. Washington: W.H. Winston.
 Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
 Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka (in Russian).
 Vapnik, V., & Izmailov, R. (2017). Knowledge transfer in SVM and neural networks. Annals of Mathematics and Artificial Intelligence, 81(1–2), 3–19.
 Vapnik, V., & Stefanyuk, A. (1978). Nonparametric methods for estimating probability densities. Automation and Remote Control, 8, 38–52.
 Wapnik, W., & Tscherwonenkis, A. (1979). Theorie der Zeichenerkennung. Berlin: Akademie-Verlag.
 Wigner, E. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications in Pure and Applied Mathematics, 13(1), 1–14.