Abstract
This paper introduces a new learning paradigm, called Learning Using Statistical Invariants (LUSI), which is different from the classical one. In the classical paradigm, the learning machine constructs a classification rule that minimizes the expected probability of error; it is a data-driven model of learning. In the LUSI paradigm, in order to construct the desired classification function, a learning machine computes statistical invariants that are specific for the problem, and then minimizes the expected error in a way that preserves these invariants; it is thus both data- and invariant-driven learning. From a mathematical point of view, methods of the classical paradigm employ mechanisms of strong convergence of approximations to the desired function, whereas methods of the new paradigm employ both strong and weak convergence mechanisms. This can significantly increase the rate of convergence.
Introduction
It is known that Teacher–Student interactions play an important role in human learning. An old Japanese proverb says “Better than a thousand days of diligent study is one day with a great teacher.” What is it exactly that great Teachers do? This question remains unanswered.
At first glance, it seems that the information that a Student obtains from his interaction with a Teacher does not add too much to the standard textbook knowledge. Nevertheless, it can significantly accelerate the learning process.
This paper is devoted to mechanisms of machine learning, which include elements of Teacher–Student (or Intelligent Agent—Learning Machine) interactions. The paper demonstrates that remarks of a Teacher, which can sometimes seem trivial (e.g., in the context of digit recognition, “in digit ‘zero’, the center of the image is empty” or “in digit ‘two’, there is a tail in the lower right part of the image”), can actually add a lot of information that turns out to be essential even for a large training data set.
The mechanism of Teacher–Student interaction presented in this paper is not based on some heuristic. Instead, it is based on rigorous mathematical analysis of the machine learning problem.
In 1960, Eugene Wigner published the paper “Unreasonable Effectiveness of Mathematics in Natural Sciences” Wigner (1960), in which he argued that mathematical structures “know” something about physical reality. Our paper might as well have the subtitle “Unreasonable Effectiveness of Mathematics in Machine Learning” since the idea of the proposed new mechanism originated in rigorous mathematical treatment of the problem, which only afterwards was interpreted as an interaction between an Intelligent Teacher and a Smart Student.^{Footnote 1}
While analyzing the setting of the learning problem, we take into account some details that were, for simplicity, previously omitted in the classical approach. Here we consider the machine learning problem as a problem of estimating the conditional probability function rather than the problem of finding the function that minimizes a given loss functional. Using mathematical properties of conditional probability functions, we were able to make several steps towards the reinforcement of existing learning methods.
Content and organization of the paper
Our reasoning consists of the following steps:
1. We define the pattern recognition problem as the problem of estimation of conditional probabilities \(P(y=k|x),\; k=1,\ldots ,n\) (probability of class \(y=k\) given observation x): example \(x_*\) is classified as \(y=s\) if \(P(y=s|x_*)\) is maximum. In order to estimate \(\max _{k}P(y=k|x),\; k=1,\ldots ,n\), we consider n two-class classification problems of finding \(P_k(y^*=1|x), k=1,\ldots ,n\), where \(y^*=1\) if \(y=k\) and \(y^*=0\) otherwise.
We start by introducing direct definitions of the conditional probability function, which differ from the standard definition.^{Footnote 2} Let \(x\in R^n\). In Sect. 2.2, we define the conditional probability as the solution \(f(x)=P(y=1|x)\) of the Fredholm integral equation of the first kind:
where kernel \(\theta (z)\) of the integral operator is defined as the step-function
In order to estimate the conditional probability from data, we solve equation (1) in the situation where the cumulative distribution functions P(x) and \(P(y=1,x)\) are unknown but iid data
generated according to P(y, x) are given.
The advantage of definition (1) is that it is based on the cumulative distribution function, the fundamental concept of probability theory. This definition does not directly require (as in the classical case) the existence of density functions and their ratio.
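The decomposition of the n-class problem into n two-class problems described in step 1 can be sketched as follows. This is an illustrative sketch (the function names are ours); `classify` assumes the n conditional probability estimates \(P_k(y^*=1|x)\) have already been obtained by some estimation method:

```python
import numpy as np

def one_vs_rest_labels(y, n_classes):
    """Relabel an n-class problem as n two-class problems:
    for problem k, y* = 1 if y == k and y* = 0 otherwise."""
    y = np.asarray(y)
    return [(y == k).astype(int) for k in range(1, n_classes + 1)]

def classify(prob_estimates):
    """Given estimates P_k(y*=1|x) for each of the n two-class problems
    (shape: n_classes x n_samples), assign each example to the class
    with the largest estimated conditional probability."""
    P = np.asarray(prob_estimates)
    return P.argmax(axis=0) + 1  # classes are numbered 1..n
```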
2. In order to estimate the desired solution from data, we use the following standard inductive step (heuristics): we replace the unknown cumulative distribution functions P(x) and \(P(y=k,x)\) with their empirical estimates
thus obtaining an empirical equation \(A_\ell f=F_\ell \) in the form
3. The inductive step of replacing the cumulative distribution function with the corresponding empirical distribution function is the main instrument of statistical methods. Justification of this step and analysis of its accuracy are the main topics of classical statistics, described by the Glivenko–Cantelli theorem (1933) and the Kolmogorov–Dvoretzky–Kiefer–Wolfowitz–Massart bounds (1933–1990). The generalization of Glivenko–Cantelli theory, the Vapnik–Chervonenkis theory (VC theory, 1968), plays an important part in the justification of learning methods. Results of these theories are outlined in Sect. 1.3.
4. The estimation of the conditional probability function by solving the Fredholm integral equation \(Af(x')=F(x)\) is an ill-posed problem. For solving ill-posed problems, Tikhonov and Arsenin (1977) proposed the regularization method, which, under some general conditions, guarantees convergence (in the given metric) of the solutions to the desired function.
In our setting, we face a more difficult problem: we have to solve an ill-posed equation (1) where both the operator and the right-hand side of the equation are defined approximately: \(A_\ell f(x')\approx F_\ell (x)\). In 1978, Vapnik and Stefanyuk (1978) proved that, under some conditions on the operator A of the equation, the regularization method also converges to the desired solution in this case. Section 4.3 outlines the corresponding results.
5. According to Tikhonov’s regularization method (Sect. 4.2), in order to solve the operator equation \(A_\ell f(x')=F_\ell (x)\), one has to minimize the functional
where one has to define the following:

1.
the distance \(\rho (A_\ell f(x'),F_\ell (x))\) between the function \(A_\ell f(x')\) on the left-hand side and the function \(F_\ell (x)\) on the right-hand side of the equation;

2.
the set of functions \(\{f(x')\}\) in which one is looking for the solution of the equation;

3.
the regularization functional \(W(f(x'))\) and regularization parameter \(\gamma >0\).
In this paper, we define the following elements:

1.
we use the \(L_2\)-metric for distance: \(\rho ^2(A_\ell f(x'),F_\ell (x))=\int (A_\ell f(x')-F_\ell (x))^2d\mu (x)\);

2.
we solve the equations in the sets of functions \(f(x')\) that belong to the Reproducing Kernel Hilbert Space (RKHS) of the kernel \(K(x,x')\) (see Sect. 5.2);

3.
we use the square of a function’s norm \(\|f(x')\|^2\) as the regularization functional.
6. In Sect. 5.1, using the \(L_2\)-distance for estimating the approximation of the solution of the Fredholm equation based on the regularization method, we obtain the non-regularized (with \(\gamma =0\)) empirical loss functional
which has to be minimized. Here V(i, j) are the elements of the so-called V-matrix that are computed from the data. The classical (non-regularized) empirical loss functional has the form
which defines the least squares method. The least squares method is a special case of (3), where the V-matrix is replaced with the identity matrix I.

The V-matrix in Eq. (3) has the following interpretation: when we are looking for the desired function, we take into account not only the residuals \(\varDelta _i=y_i-f(x_i),\;i=1,\ldots ,\ell \) at the observation points \(x_i\), but also the mutual positions V(i, j) of the observation points \(x_i\) and \(x_j\). The classical solution of the problem (i.e., the least squares method (4)) uses only information about the residuals \(\varDelta _i\).
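As an illustration of the role of the V-matrix in (3) versus the identity matrix in (4), here is a minimal numeric sketch (the function names are ours); the V-weighted loss is the quadratic form \(\varDelta ^T V \varDelta \) in the residuals:

```python
import numpy as np

def v_loss(y, f_x, V):
    """Empirical loss (3): residuals weighted by the V-matrix,
    Delta^T V Delta with Delta_i = y_i - f(x_i)."""
    delta = np.asarray(y, float) - np.asarray(f_x, float)
    return float(delta @ np.asarray(V, float) @ delta)

def ls_loss(y, f_x):
    """Classical least-squares loss (4): the special case V = I."""
    delta = np.asarray(y, float) - np.asarray(f_x, float)
    return float(delta @ delta)
```

With V set to the identity matrix, `v_loss` reduces exactly to `ls_loss`, mirroring the relation between (3) and (4).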
7. Section 5.2.1 shows that, for the Reproducing Kernel Hilbert Space of kernel \(K(x,x')\), the function f(x) has the representation (Representer theorem)
The square of norm of this function (which we use as a regularization functional) has the form
In this paper, we use the following vectormatrix notations:

\(\ell \)-dimensional vector \(Y=(y_1,\ldots ,y_\ell )^T\),

\(\ell \)-dimensional vector \(A=(a_1,\ldots ,a_\ell )^T\) of parameters \(a_i\),

\(\ell \)-dimensional vector-function \(\mathcal{K}(x)=(K(x,x_1),\ldots ,K(x,x_\ell ))^T\),

\((\ell \times \ell )\)-dimensional matrix V of elements \(V(x_i,x_j),~i,j=1,\ldots ,\ell \),

\((\ell \times \ell )\)-dimensional matrix K of elements \(K(x_i,x_j),~i,j=1,\ldots ,\ell \).
In these notations, (5) and (6) can be written as
8. Section 5.3 shows that, in order to estimate the conditional probability function^{Footnote 3}
using the regularization method, one has to find the vector A that minimizes the functional
where the coordinates of vector Y are binary (one or zero).
The minimum of this functional has the form
where I is the identity matrix. We call this estimate the vSVM method. The classical (square loss) SVM estimate has the form
The difference between the classical SVM approximation (10) and the new vSVM approximation (9) is that the former uses the identity matrix I instead of the V-matrix. When using the V-matrix, one takes into account the mutual positions of the observed vectors \(x\in X\).
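Both closed-form estimates can be computed with one linear solve. The sketch below assumes the forms \(A_V=(VK+\gamma I)^{-1}VY\) for (9) and \(A=(K+\gamma I)^{-1}Y\) for (10); these are assumptions consistent with the \((VK+\gamma I)^{-1}\) expressions used later in the paper, not a reproduction of the displayed equations:

```python
import numpy as np

def vsvm_coeffs(K, V, Y, gamma):
    """vSVM estimate, assuming (9) reads A_V = (VK + gamma*I)^{-1} V Y."""
    ell = K.shape[0]
    return np.linalg.solve(V @ K + gamma * np.eye(ell), V @ Y)

def svm_coeffs(K, Y, gamma):
    """Classical square-loss estimate, assuming (10) reads
    A = (K + gamma*I)^{-1} Y -- the special case V = I."""
    ell = K.shape[0]
    return np.linalg.solve(K + gamma * np.eye(ell), Y)
```

Using `np.linalg.solve` instead of forming the explicit inverse is the standard, numerically preferable way to evaluate such expressions.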
9. Starting from Sect. 6, we consider the problem of the Teacher–Student interaction. We introduce the following mathematical model of interaction which we call Learning Using Statistical Invariants (LUSI).
Let \(P(y=1|x),\;x\in R^n,\;y\in \{0,1\}\) be the desired conditional probability function. Consider m functions \(\psi _s(x),\;s=1,\ldots ,m\). There exist constants \(C_{\psi _s}\) such that the equalities
hold true. We consider as estimates (approximations) of \(P(y=1|x)\) the functions satisfying the equations
We say that a sequence of approximations \(P_\ell (y=1|x)\) converges to \(P(y=1|x)\) in strong mode if
in the metric of functions \(P(y=1|x)\). We also say that the sequence of approximations \(P_\ell (y=1|x)\) converges to \(P(y=1|x)\) in weak mode if the sequence of values \(C^\ell _\psi \) converges to the value \(C_\psi \) as \(\ell \rightarrow \infty \) for any \(\psi (x)\in L_2\):
It is known that strong convergence implies weak convergence.
The idea of the LUSI model is to minimize the loss functional (8) in the subset of functions (7) satisfying (12). In order to find an accurate approximation of the desired function, we employ mechanisms of convergence that are based on both strong and weak modes. (The classical method employs only the strong mode mechanism.)
An important role in LUSI mechanisms belongs to the Teacher. According to the definition, weak mode convergence requires convergence for all functions \(\psi (x)\in L_2\). Instead, the Teacher replaces the infinite set of functions \(\psi (x)\in L_2\) with a finite set \({{{\mathcal {F}}}}=\{\psi _s(x),~s=1,\ldots ,m\}\).
Let the Teacher define functions \(\psi _s(x),\;s=1,\ldots ,m\) in (11), which we call predicates. Suppose that the values \(C_{\psi _s}\), which we call expressions of predicates, are known. This fact has the following interpretation: equations (11) describe m (integral) properties of the desired conditional probability function. Our goal is thus to find the approximation that has these properties.
The idea is to identify the subset of functions \(P(y=1|x)\) for which the expressions of predicates \(\psi _s(x)\) are equal to \(C_{\psi _s}\), and then to select the desired approximation from this subset.
In reality, the values \(C_{\psi _s}\) are not known. However, these values can be estimated from the data \((x_1,y_1),\ldots ,(x_\ell ,y_\ell )\). According to the Law of Large Numbers, the values \(C_{\psi _s}\) for the corresponding functions \(\psi _s(x)\) can be estimated as
where we have denoted \({ \varPhi }_s=({\psi }_s(x_1),\ldots , {\psi _s}(x_\ell ))^T\).
Since the structure of the conditional probability has the form of (7), we can also estimate the left-hand side of (12). That estimate is \(\ell ^{-1}A^TK\varPhi _s\). From the estimates of the left- and right-hand sides of (12), one obtains equations (we call them invariants)
The method of LUSI is to find the vector A that defines the approximation (7) by minimizing functional (8) in the set of vectors A satisfying^{Footnote 4} (13).
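The law-of-large-numbers estimation of the predicate expressions can be sketched as follows, assuming the estimate \(C_{\psi _s}\approx \ell ^{-1}\sum _i \psi _s(x_i)y_i\) (the function name is ours):

```python
import numpy as np

def invariant_estimates(Psi, Y):
    """Estimate the predicate expressions C_s from data:
    C_s ~ (1/ell) * sum_i psi_s(x_i) * y_i, one value per predicate.
    Psi holds one column per predicate psi_s evaluated at x_1..x_ell."""
    Psi = np.asarray(Psi, float)
    Y = np.asarray(Y, float)
    return Psi.T @ Y / len(Y)
```

For the constant predicate \(\psi (x)\equiv 1\), this estimate is simply the empirical frequency of the class \(y=1\).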
10. The problem of minimizing the functional (8) subject to the constraints (13) has a closed-form solution. In order to obtain it, one has to do the following:

(1)
Compute vector \(A_V\) as in (9).

(2)
Compute m vectors
$$\begin{aligned} A_s=(VK+\gamma I)^{-1}{ \varPhi }_s,~s=1,\ldots ,m. \end{aligned}$$
(3)
Compute the vector-parameter A of estimate (7)
$$\begin{aligned} A=A_V-\left( \sum _{s=1}^m \mu _s A_s\right) , \end{aligned}$$
where the parameters \(\mu _s\) are the solution of a linear system of equations determined by the data (Sect. 6.2). Vector A is the combination of two estimates:

(1)
the estimate \(A_V\), which is obtained from the standard learning scenario by the vSVM method (data-driven part of estimate), and

(2)
the term shown in parentheses, which is the correction term based on invariants (13) (intelligence-driven part of estimate). We call the sum A the vSVM with m invariants estimate, denoted as vSVM & \(\hbox {I}_m\).
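The three-step procedure above can be sketched in a few lines. The linear system that determines \(\mu _s\) (Sect. 6.2) is not reproduced here, so \(\mu \) is taken as given; the \((VK+\gamma I)^{-1}\) forms for \(A_V\) and \(A_s\) follow the expressions in steps (1) and (2):

```python
import numpy as np

def lusi_estimate(K, V, Y, Phi, mu, gamma):
    """Sketch of the three-step closed form.  Phi holds one column per
    predicate (Phi_s evaluated at the data); mu holds the parameters
    mu_s, which Sect. 6.2 obtains from a data-dependent linear system."""
    ell = K.shape[0]
    M = V @ K + gamma * np.eye(ell)
    A_V = np.linalg.solve(M, V @ Y)   # step (1): data-driven part
    A_s = np.linalg.solve(M, Phi)     # step (2): one vector per invariant
    return A_V - A_s @ mu             # step (3): A = A_V - sum_s mu_s A_s
```

With \(\mu =0\) (no invariants imposed), the estimate reduces to the plain vSVM vector \(A_V\).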
11. The introduction of predicate-functions \(\psi _s(x),~s=1,\ldots ,m\) for constructing invariants reflects the Teacher’s intellectual contribution to the Teacher–Student interaction; some examples of possible functions for invariants are provided in Sect. 6.5. The first task of the Student in this interaction is to understand which invariants are suggested by the Teacher, and the second task is to choose the best one from the admissible set of functions that preserve these invariants.
There is an important difference between invariants and features in classical learning models. With an increasing number of invariants, the capacity of the set of functions from which the Student has to choose the desired one decreases (and, as a result, according to VC bounds, this leads to a more accurate estimate). In contrast, with an increasing number of features, the capacity of the set of admissible functions increases (and thus, according to VC bounds, this requires more training examples for an accurate estimate^{Footnote 5}).
12. Section 6.4 contains examples that illustrate the effectiveness of the ideas of LUSI and remarks on the implementation of SVM & \(\hbox {I}_m\) algorithms.
Phenomenological model of learning
The general mathematical theory of learning was first introduced about fifty years ago. That theory was created for the following phenomenological model:

In some environment X, there exists a generator G of random events \(x_i\in X\). This generator G generates events x randomly and independently, according to some unknown probability measure P(x). In this environment, a classifier A operates, which labels the events x; in other words, on any event \(x_i\) produced by generator G, classifier A reacts with a binary signal \(y_i\in \{0,1\}\). The classification \(y_i\) of events \(x_i\) is produced by classifier A according to some unknown conditional probability function \(P(y=1|x)\). The problem of learning is formulated as follows: given \(\ell \) pairs
$$\begin{aligned} (x_1,y_1),\ldots ,(x_\ell ,y_\ell ) \end{aligned}$$(14) containing events \(x_i\) produced by generator G and classifications \(y_i\) produced by classifier A, find, in a given set of indicator functions, the one that minimizes the probability of discrepancy between classifications of this function and classifier A.
This phenomenological model describes the basic learning problem, namely, the so-called two-class pattern recognition problem. Its generalizations to an arbitrary finite number \(n>2\) of classes \(y\in \{a_1,\ldots ,a_n\}\) are just technical developments of this simplified model; they are addressed later in the paper.
Risk minimization framework
The first attempt to construct a general learning theory was undertaken in the late 1960s and early 1970s. At that time, learning was mathematically formulated as the problem of expected risk minimization (Vapnik and Chervonenkis 1974; Wapnik and Tscherwonenkis 1979).
Expected risk minimization problem
In a given set of functions \(\{f(x)\}\), find the one that minimizes the expected risk functional
(here L(y, f(x)) is a given Loss Function), under the assumption that the probability measure P(x, y) is unknown; however, iid data (14) generated according to the phenomenological model \(P(x,y)=P(y|x)P(x)\) described above are available. In this paper, we focus on the special case of \(\{f(x)\}\) being a set of indicator functions (this case is called the two-class classification rule) and then generalize the obtained results for estimating conditional probability functions (real-valued functions \(0\le f\le 1\)). Throughout, we use the loss function
For the pattern recognition case, the function f(x) that minimizes functional (15) belongs to a set of indicator functions. The function \(f \in \{f(x)\}\) that minimizes functional (15) with loss (16) will have the smallest probability of error among all functions in \(\{f(x)\}\).
Empirical loss minimization solution
In order to find such functions, the method of minimizing the empirical risk functional
was introduced. The function \(f_\ell (x)\in \{f(x)\}\) which minimizes (17) is considered as the function f for which the value R(f) of functional (14) (i.e., the value of the expected error) is close to the minimum: \(R(f_\ell )\approx \min _{f}R(f)\).
The choice of the empirical risk minimization method is based on the following consideration: since the cumulative distribution function P(y, x) is unknown but data (14) are given, one approximates the unknown cumulative distribution function by its empirical estimate, the joint empirical cumulative distribution function
where the one-dimensional step function is defined as
and the multidimensional step function (\(x=(x^1,\ldots ,x^n)\)) is defined as
Using approximation (18) in the expected loss functional (15), we obtain the empirical loss functional (17). In Sect. 3, we argue that the empirical cumulative distribution function is a good approximation of the cumulative distribution function. The minimization of functional (17) is the main inductive instrument in estimation theory.
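The empirical cumulative distribution function built from the multidimensional step function can be sketched as follows (the function name is ours):

```python
import numpy as np

def empirical_cdf(data, x):
    """Empirical cumulative distribution function at point x:
    the fraction of observations z_i with z_i <= x coordinate-wise,
    i.e. the average of the multidimensional step values theta(x - z_i)."""
    data = np.atleast_2d(np.asarray(data, float))
    x = np.asarray(x, float)
    return float(np.mean(np.all(data <= x, axis=1)))
```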
Problems of theory
The theory of empirical risk minimization was developed to address the following questions:

1.
When is the method of empirical loss minimization consistent?

2.
How close is the value of expected loss to the minimum across the given set of functions?

3.
Is it possible to formulate a general principle that is better than the empirical loss minimization?

4.
How to construct algorithms for estimation of the desired function?
Main results of VC theory
Forty years ago, a theory that addresses all four questions above was developed; it was the so-called Statistical Learning Theory or VC Theory.

1.
The VC theory defines the necessary and sufficient conditions of consistency for both (i) the case when the probability measure P(x) of the generator G in the phenomenological model is unknown (in this case, the necessary and sufficient conditions are valid for any probability measure P(x)), and (ii) the case when the probability measure P(x) is known (in this case, the necessary and sufficient conditions are valid for the given probability measure P(x)). In both of these cases, the conditions for consistency are described in terms of the capacity of the set of functions. In the first case (consistency for any probability measure), the VC dimension of the set of indicator functions is defined (it has to be finite). In the second case (consistency for the given probability measure), the VC entropy of the set of functions for the given probability measure is defined (the VC entropy divided by the number of observations has to converge to zero).

2.
When the VC dimension of a set of indicator functions is finite, the VC theory provides bounds on the difference between the real risk \(R(f_\ell )\) of the function \(f_\ell \) that minimizes the empirical risk functional (17) and its empirical estimate \(R_\ell (f_\ell )\). That difference depends on the ratio of the VC dimension h to the number of observations \(\ell \). Specifically, with probability \(1-\eta \), the bound
$$\begin{aligned} R(f_\ell )-R_\ell (f_\ell )\le T\left( \frac{h-\ln \eta }{\ell }\right) \end{aligned}$$(21) holds, which implies the bound
$$\begin{aligned} R(f_\ell )-\inf _{f} R(f)\le T_*\left( \frac{h-\ln \eta }{\ell }\right) . \end{aligned}$$(22) In (21) and (22), \(T(h/\ell )\) and \(T_*(h/\ell )\) are known monotonically decreasing functions.

3.
VC theory introduces a generalization of the empirical risk minimization method, the so-called method of Structural Risk Minimization (SRM), which is the basic instrument in statistical inference methods. In the SRM method, a nested set of functions
$$\begin{aligned} S_1\subset S_2\subset \ldots \subset S_n\subset \ldots \end{aligned}$$(23) is constructed on the given set of functions \(\{f\}\); here \(S_k\) is the subset of all functions \(\{f\}\) that have VC dimension \(h^{(k)}\). For subset \(S_k\), the bound (21) has the form
$$\begin{aligned} R(f_\ell ^{(S_k)})\le R_\ell (f^{(S_k)}_\ell )+ T\left( \frac{h^{(k)}-\ln \eta }{\ell }\right) . \end{aligned}$$(24) In order to find the function that provides the smallest guaranteed risk for the given observations, one has to find both the subset \(S_k\) of the structure (23) and the function f in this subset that provide the smallest (over k and \(f\in S_k\)) value of the right-hand side of inequality (24). The Structural Risk Minimization (SRM) method is a strongly universally consistent risk minimization method. One can consider the SRM method as a mathematical realization of the following idea: Choose the simplest^{Footnote 6} function (one from the subset with the smallest VC dimension) that classifies the training data well.

4.
Based on VC theory, an algorithm for solving pattern recognition problems was developed: the Support Vector Machine (SVM) algorithm, which realized the SRM method. In this paper, we consider an SVM type of algorithm for the square loss function (16) as the baseline algorithm for comparisons. We introduce two new ideas which form the basis for a new approach. One idea (presented in Part One of the paper) is technical, and the other one (presented in Part Two) is conceptual.
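The SRM selection rule of item 3 above can be sketched numerically: for each element of the nested structure, add a capacity term to the empirical risk and pick the element minimizing the resulting guaranteed-risk bound. The capacity term `T` below is our stand-in with the correct qualitative behavior (growing in h, shrinking in \(\ell \)), not the exact VC constant:

```python
import numpy as np

def srm_select(emp_risks, vc_dims, ell, eta=0.05):
    """Sketch of Structural Risk Minimization over S_1 c S_2 c ...:
    for each subset k we are given the empirical risk of its best
    function and its VC dimension h^(k); return the index k that
    minimizes the bound of type (24): R_emp + T((h - ln eta)/ell)."""
    def T(h):
        # illustrative capacity term, not the exact VC bound
        return np.sqrt((h * (np.log(2.0 * ell / h) + 1.0) - np.log(eta / 4.0)) / ell)
    bounds = [r + T(h) for r, h in zip(emp_risks, vc_dims)]
    return int(np.argmin(bounds))
```

The selected subset is typically neither the one with the smallest empirical risk nor the one with the smallest capacity, but the best trade-off between the two.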
Part One: The V-matrix estimate
In this first part, we describe our technical innovations that we use in the second part for constructing algorithms of inference.
The observation that defines the V-matrix method
Our first idea is related to the mathematical understanding of a learning model that is different from the phenomenological scheme described in Sect. 1.2.
In the classical approach to the learning methods (analyzed by the VC theory), we ignored the fact that the desired decision rule is related to the conditional probability function used by Classifier A. We just introduced some set of indicator functions and defined the goal of learning as finding the one that guarantees the smallest expected risk of loss (15) in this set of functions.
Now we take into account the fact that the best rule for classification within n classes has the form
where \(P(y=s|x)\) is the probability of class \(y=s\) given observation x. For a two-class pattern recognition problem (where \(y\in \{0,1\}\)), this rule has the structure
where \(\theta (z)\) is the step function. Thus, the rule r(x) classifies vector x as belonging to the first class (\(y=1\)) if the conditional probability of the first class exceeds 0.5.
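The two-class rule above can be sketched as (the function name is ours; ties at exactly 0.5 are assigned to class 0 here, a convention the text does not fix):

```python
import numpy as np

def classification_rule(p1):
    """Two-class rule: classify x into class y=1 exactly when the
    estimated conditional probability P(y=1|x) exceeds 0.5."""
    return (np.asarray(p1, float) > 0.5).astype(int)
```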
We consider the problem of estimating the conditional probability function \(P(y=s|x)\) using data (14) as the main problem of learning. The construction of the classification rule (25) based on the obtained conditional probability function is then a trivial corollary of this solution.
Standard definitions of conditional probability and regression functions
In statistical theory, the basic properties of random events \(x\in R^n\) are described by their cumulative distribution function. The cumulative distribution function of variable \(x=(x^1,\ldots ,x^n)\) is defined as the probability that a random vector \(X=(X^1,\ldots ,X^n)\) does not exceed the vector x (coordinatewise):
The probability density function (if it exists) is defined as derivative of P(x):
Consider a pair (x, y), where \(x\in R^n\) and \(y\in R^1\) are continuous variables. The ratio of two density functions
is called the conditional density function. The expectation function
over y for any fixed x is called the regression function.
If y is discrete, say \(y\in \{0,1\}\), the ratio of two densities p(x, y) and p(x)
is called the conditional probability function. Function \(p(y=1|x)\) defines the probability of classification \(y=1\) given vector x. In this definition, the factor \(p(y=1)\) is the probability of the event \(y=1\) in trials with (x, y).
The basic problem of statistics is as follows: given the set of iid pairs
generated according to P(x, y), estimate the conditional probability function.
In this paper, we estimate the conditional probability function. We assume that \(x\in {{{\mathcal {R}}}}^d\), where \(\mathcal{R}^d=[a_1,c_1]\times \cdot \cdot \cdot \times [a_d,c_d]\). To simplify the notations, we assume, without loss of generality, that \(a_1=\ldots =a_s=\ldots =a_d=0\), so that \({{{\mathcal {R}}}}^d=\{x\in [0,\mathbf{c}]^d\}\), where \(\mathbf {c}=(c_1,\ldots ,c_d)^T\).
Direct definitions of conditional probability function
In this section, we consider another definition of the conditional probability function. We define the conditional probability function f(x) as the solution of the Fredholm equation
Formally, this definition of the conditional probability function does not require the existence of density functions. If the derivatives of P(x) and \(P(y=1,x)\) exist, taking the derivative over x on the left- and right-hand sides of Eq. (30), we obtain
which is the definition of conditional probability (29).
Similarly, we define the regression function f(x) as the solution of Fredholm equation
Indeed, taking derivatives of (31) over x, we obtain
which is the definition of regression (28). In this paper, however, we consider only the conditional probability estimation problem.
In order to estimate the conditional probability function or the regression function, we need to find the solution of Eq. (30) or (31). Note that the cumulative distribution functions that define these equations are unknown, but corresponding iid data \((x_1,y_1),\ldots ,(x_\ell ,y_\ell )\) are given.
Estimation of conditional probability function for classification rule
This paper is devoted to the estimation of the solutions of Eq. (30). To find the classification rule r(x) in the two-class classification problem, we use the estimated conditional probability function
At first glance, such an approach to the pattern recognition problem seems to be overkill, since we connect the solution of a relatively simple problem of finding an indicator function to a much more difficult (ill-posed) problem of estimating the conditional probability functions. The explanation of why it is nevertheless a good idea is the following:

1.
The conditional probability function directly controls many different statistical invariants that manifest themselves in the training data. Preserving these invariants in the rule is equivalent to incorporating some prior knowledge about the solution. Technically, this allows the learning machine to extract additional information from the data—the information that cannot be extracted directly by existing classical methods. Section 6 is devoted to the realization of this idea.

2.
Since the goal of pattern recognition is to estimate the classification rule (32) (not the conditional probability function), one can relax the requirements on the accuracy of the conditional probability estimation. We really need an accurate estimate of the function in the area \(x\in X\) where the values \(P(y=1|x)\) are close to 0.5; conversely, we can afford to have less accurate estimates in the area where \(|P(y=1|x)-0.5|\) is large. This means that the cost of error of deviation of the estimate \(P_\ell (y=1|x)\) from the actual function \(P(y=1|x)\) can be monotonically connected to the variance \(\phi (x)=P(y=1|x)(1-P(y=1|x))\) (the larger the variance, the higher the cost of error). This fact can be taken into account when one estimates conditional probability functions.
As we will see in Sect. 6, the model of estimating the conditional probability function controls more important factors of learning than the classical expected risk minimization model considered in the VC theory.
Problem of inference from data
Statistical inference problem defined in Sect. 2.2 requires solving the Fredholm integral equation
where operator A has the form
This operator maps functions \(f\in E_1\) to functions \(Af\in E_2\). In the pattern recognition problem, the operator A maps the set \({{{\mathcal {P}}}}\) of nonnegative bounded functions \(0\le f(x)\le 1\) into the set of nonnegative bounded functions \(0\le F(x)=P(y=1,x)\le P(y=1)\), where \(P(y=1)\) is the probability of class \(y=1\). The problem is, for any function \(F(x)=P(y = 1, x)\) on the right-hand side of (30), to find the solution of this equation in the set \({{{\mathcal {P}}}}\).
Note that the solution of operator equations in a given set of functions is, generally speaking, an ill-posed problem. The problem of statistical inference is to find the solution defined by the corresponding equation [(30) or (31)] when both the operator and the right-hand side of the equation are unknown and have to be approximated from the given data (14).
In all cases, we approximate the unknown cumulative distribution functions P(x), \(P(y=1,x)=P(x|y=1)P(y=1)\), and P(x, y) using Empirical Cumulative Distribution Functions
Putting (36) and (37) into Eq. (30) instead of the unknown elements P(x) and \(P(y=1,x)\), and putting (35) and (37) into Eq. (31) instead of the unknown elements P(x) and P(y, x), we obtain (taking into account that the derivative of \(\theta (x-x_i)\) is the \(\delta \)-function \(\delta (x-x_i)\)) the following approximation to Eq. (30) and the same approximation to Eq. (31):

In the remainder of the paper, we use equation (38) only for estimating the conditional probability function^{Footnote 7} (where \(y\in \{0,1\}\)).
Similarly, in order to estimate the conditional probability \(P(y=0|x)\), one has to solve the equation
The problem of the new approach to learning is:

Find the approximation to the desired function by solving Eq. (38).
Theory of this learning paradigm has to answer the following four questions:

1.
Why is the replacement of the cumulative distribution functions in (30) with their empirical approximations a good idea?

2.
How to solve the ill-posed problem of statistical inference when both the operator and the right-hand side of the equation are defined approximately?

3.
How to incorporate the existing statistical invariants into solutions?

4.
What are constructive algorithms for inference?
The next sections are devoted to the answers to these questions.
Main claim of statistics: Glivenko–Cantelli theorem
Consider the following approximation of a cumulative distribution function P(z) obtained for iid observations \(z_1,\ldots ,z_\ell \) generated according to P(z):
In 1933, Glivenko and Cantelli proved the main theorem of statistics:
Theorem
Empirical Distribution Functions \(P_\ell (z)\) converge to the actual cumulative distribution function P(z) uniformly with probability 1:
In the same year, 1933, Kolmogorov found the asymptotically exact rate of convergence for the case \(z\in R^1\). He showed that the equality
holds true. Later (in 1956–1990), Dvoretzky, Kiefer, Wolfowitz, and Massart found the sharp non-asymptotic bound whose right-hand side coincides with the first term of the Kolmogorov equality:
The generalization of the bound for \(z\in R^n\) was obtained in 1970 using the VC theory:
In order to obtain constructive equations for the pattern recognition problem, we replaced the cumulative distribution functions in (30), (31) with the corresponding empirical distribution functions.
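As a small numerical illustration (our own sketch, with an assumed uniform distribution, not the paper's experiment), the empirical CDF and the DKW–Massart tail bound can be computed directly:

```python
import numpy as np

def empirical_cdf(sample):
    """Return the empirical distribution function P_l(z) of an i.i.d. sample."""
    z = np.sort(np.asarray(sample))
    ell = len(z)
    def P_ell(t):
        # fraction of observations not exceeding t
        return np.searchsorted(z, t, side="right") / ell
    return P_ell

# Illustration with an assumed uniform distribution on [0, 1], whose true CDF
# is P(z) = z; the DKW-Massart bound states
# P{ sup_z |P_l(z) - P(z)| > eps } <= 2 exp(-2 l eps^2).
rng = np.random.default_rng(0)
ell, eps = 2000, 0.05
P_ell = empirical_cdf(rng.uniform(0.0, 1.0, size=ell))

grid = np.linspace(0.0, 1.0, 10001)
sup_dev = float(np.max(np.abs(P_ell(grid) - grid)))  # Kolmogorov-Smirnov statistic
dkw_tail = 2.0 * np.exp(-2.0 * ell * eps ** 2)       # DKW tail bound for this eps
print(sup_dev, dkw_tail)
```

For this sample size the bound makes a deviation above 0.05 extremely unlikely, which the observed supremum deviation reflects.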
Solution of ill-posed problems
In this section, we outline regularization principles that we apply to the solution of the described problems.
Well-posed and ill-posed problems
Let A be a linear operator which maps the elements f of a metric space \(E_1\) into the elements F of a metric space \(E_2\). We say that the problem of solving the operator equation
in the set \(\{f\}\) is well-posed if the solution exists, is unique, and depends continuously on the right-hand side. That is, if the functions \(F_1\) and \(F_2\) of the right-hand side of Eq. (43) are close in the metric of space \(E_2\) (i.e., \(\rho _{E_2}(F_1,F_2)\le \varepsilon \)), they correspond to solutions \(f_1\) and \(f_2\) that are close in the metric of space \(E_1\) (i.e., \(\rho _{E_1}(f_1,f_2)\le \delta \)). The problem is called ill-posed if at least one of the three above conditions is violated. Below we consider ill-posed problems where unique solutions exist, but the inverse operator
could be discontinuous. Solving inference problems defined by the Fredholm equation
is ill-posed. The problem of statistical inference requires solving ill-posed Fredholm equations when both the right-hand side F and the operator A of Eq. (43) are defined approximately.
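To see the instability concretely, here is a numerical sketch (our illustration, not from the paper): discretizing a Fredholm equation of the first kind typically yields a badly conditioned linear system, and a tiny perturbation of the right-hand side destroys the naive solution.

```python
import numpy as np

# The Hilbert matrix H_ij = 1/(i + j + 1), which arises from the moment
# problem Int_0^1 x^(i+j) f(x) dx = F_i, is a classic ill-conditioned example.
n = 10
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

f_true = np.ones(n)                      # "exact" solution
F = H @ f_true                           # exact right-hand side

rng = np.random.default_rng(1)
F_noisy = F + 1e-8 * rng.standard_normal(n)   # tiny perturbation of F

f_naive = np.linalg.solve(H, F_noisy)    # naive inversion amplifies the noise
print(np.linalg.cond(H))                 # huge condition number (~1e13)
print(np.max(np.abs(f_naive - f_true)))  # error far larger than the 1e-8 noise
```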
Regularization of illposed problems
The solution of illposed problems is based on the following lemma.
Lemma
(Lemma about the inverse operator) If A is a continuous one-to-one operator defined on a compact set \({{{\mathcal {M}}}}\) of functions \(\{f\}\), then the inverse operator \(A^{-1}\) is continuous on the set \({{{\mathcal {N}}}}=A{{{\mathcal {M}}}}\).
Consider a continuous nonnegative functional W(f) and the set of functions defined by the constant \(C>0\) as
Let the set of functions \({{{\mathcal {M}}}}_C\) be convex and compact for any C. Suppose that the solution of operator equation belongs to compact sets \({{{\mathcal {M}}}}_C\) with \(C\ge C_0\).
The idea of solving ill-posed problems is to choose an appropriate compact set (i.e., to choose a constant \(C^*\)) and then solve Eq. (43) on the compact set of functions^{Footnote 8} defined by \(C^*\). In other words, to minimize the square of the distance in \(E_2\) space
over functions f subject to the constraint
The equivalent form of this optimization problem is Tikhonov’s regularization method. In this regularization method, the functional
is minimized, where \(\gamma >0\) is the regularization constant.
The expression (47) is the Lagrangian functional for the optimization problem of minimizing (45) subject to (46), where the parameter \(\gamma \) is defined by the parameter C that defines the chosen compact set (44). The parameter \(\gamma =\gamma ^*\) in (47) should be chosen such that the equality
holds true for the solution \(f^*\) of the minimization problem.
In 1963, Tikhonov proved the following theorem:
Theorem
Let \(E_1\) and \(E_2\) be metric spaces and suppose that, for \(F\in E_2\), there exists a solution of the equation
that belongs to the set \(f\in \{W_{E_1}(f)\le C\}\) for \(C>C_0\). Let the righthand side F of this equation be approximated with \(F_\delta \) such that \(\rho (F,F_\delta )\le \delta \). Suppose that the values of (regularization) parameters \(\gamma (\delta )\) are chosen such that
Then the elements \(f_{\gamma (\delta )}\) minimizing the functional
converge to the exact solution as \(\delta \longrightarrow 0\).
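Tikhonov's scheme can be sketched numerically. With \(W(f)=\Vert f\Vert ^2\), minimizing \(\Vert Af-F_\delta \Vert ^2+\gamma \Vert f\Vert ^2\) over a discretized f has the closed form \(f_\gamma =(A^TA+\gamma I)^{-1}A^TF_\delta \). The example below (our own illustration with an assumed Hilbert-matrix system, not the paper's algorithm) shows the regularized solution beating naive inversion:

```python
import numpy as np

def tikhonov_solve(A, F, gamma):
    """Minimizer of ||A f - F||^2 + gamma * W(f) with W(f) = ||f||^2:
    f_gamma = (A^T A + gamma I)^{-1} A^T F."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ F)

# Ill-conditioned Hilbert system with a noisy right-hand side (delta = 1e-8).
n = 10
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
f_true = np.ones(n)
rng = np.random.default_rng(1)
F_noisy = A @ f_true + 1e-8 * rng.standard_normal(n)

f_naive = np.linalg.solve(A, F_noisy)
f_reg = tikhonov_solve(A, F_noisy, gamma=1e-6)  # gamma with delta^2/gamma small

err_naive = np.linalg.norm(f_naive - f_true)
err_reg = np.linalg.norm(f_reg - f_true)
print(err_naive, err_reg)   # regularization reduces the error by many orders
```

The choice \(\gamma =10^{-6}\) satisfies the theorem's condition qualitatively: \(\delta ^2/\gamma \) is small while \(\gamma \) itself is small.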
Generalization for approximately defined operator
Let our goal be to solve the operator equation
when we are given a sequence of random approximations \(F_\ell \) and random operators \(A_\ell ,~\ell =1,2,\ldots \).
Consider, as solutions of the problem, the sequence of functions \(f_\ell ,~\ell =1,2,\ldots \) minimizing the regularization functional
We define the distance between operators as
Theorem
(Vapnik and Stefanyuk 1978). For any \(\varepsilon >0\) and any \(C_1,C_2>0\) there exists a value \(\gamma _0>0\) such that the inequality
holds true for any \(\gamma _\ell \le \gamma _0\).
Corollary
Let the righthand side \(F_\ell \) of the equation converge to F with rate \(r_F(\ell )\) and let operator \(A_\ell \) converge to A with rate \(r_A(\ell )\). Then there exists a function
such that the sequence of solutions converges if
and \(\gamma _\ell \longrightarrow 0\) as \(\ell \longrightarrow \infty \).
Solution of stochastic ill-posed problems
In order to estimate the conditional probability function, we find the solution of approximation (38) of Fredholm integral equation (30) using a regularization method (Sect. 4.2). In order to do this, we have to specify three elements of the minimization functional
We specify them in the following manner.

1.
The distance \(\rho ^2_{E_2}(A_\ell f,F_\ell )\) in \(E_2\) space: We select the \(L_2(\phi )\) metric.

2.
The set of functions \(\{f(x)\},~x\in [0,1]^n\) containing the solution \(f_\ell \): We select the Reproducing Kernel Hilbert Space (RKHS) of kernel \(K(x,x_*)\).

3.
The regularization functional \( W_{E_1}(f)\) in space \(E_1\): We select the square of the norm of the function in the RKHS.
Distance in space \(L_2\)
We define the square of the distance between two functions \(F_1\) and \(F_2\) that denote the left- and right-hand sides of Eq. (38) in the following way:
where \(\phi (x)\ge 0\) is a given weight function and \(\mu (x)\) is a probability measure defined on the d-dimensional domain \({{{\mathcal {C}}}}\) consisting of vectors \((x_1,\ldots ,x_d)\), where \(0\le x_i\le c_i\) for each \(i=1,\ldots ,d\) and \(c_1,\ldots ,c_d\) are nonnegative constants. We select both the weight function and the probability measure later. Using this metric, we compute the square of the distance between the left- and right-hand sides of our approximations (38) of the Fredholm equations:
where \(f\in \{f\}\), \(y\in \{0,1\}\). For estimation of the conditional probability function, this can be rewritten as follows:
where the constant V(i, j) denotes
In the d-dimensional case \(x=(x^1,\ldots ,x^d)\), we have
For the special case where \(x\in [(0,\ldots ,0),{{{\mathcal {C}}}}]\), \(\mu (x)=x\), and \(\phi (x)=1\), this expression has the form
Expression (54) defines the V-matrix in multiplicative form. For high-dimensional problems, computation of (54) may be difficult; therefore, we also define an additive form of the V-matrix along with (54). Consider, instead of distance (50), the following expression
where \(x^k\) is the kth coordinate of vector x and \(f(x_i)\) is the value of the function at the point \(x_i=(x^1_i,\ldots ,x^d_i)\). For this distance, elements V(i, j) of the V-matrix have the form
In our computations, we use the multiplicative form of the V-matrix.
Matrix V is a symmetric nonnegative matrix; the maximum value of any column (or any row) of V is the value at the intersection of that column (row) with the diagonal of V.
Algorithmic Remark. As dimensionality increases, it becomes more difficult to realize the advantages of the V-matrix method. This is due to the fact that mutual positions of vectors in high-dimensional spaces are not expressed as well as in low-dimensional spaces.^{Footnote 9} The mathematical manifestation of this fact is that, in a high-dimensional space, the V-matrix can be ill-conditioned and, therefore, require regularization. We used the following regularization method: (1) transform the V-matrix to its diagonal form in the basis of its eigenvectors using the appropriate orthonormal mapping T; (2) add a small regularizing value to its diagonal elements (in our experiments, 0.001 was usually sufficient); and (3) use the inverse mapping \(T^{-1}\) to transform the regularized V-matrix back to its original basis.
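A minimal sketch of both steps (our own code; it assumes the standard closed form of the multiplicative V-matrix for the special case \(\mu (x)=x\), \(\phi (x)=1\) on a box \([0,c]\), namely \(V(i,j)=\prod _k(c_k-\max (x_i^k,x_j^k))\), rather than the general formula (53)):

```python
import numpy as np

def v_matrix(X, c):
    """Multiplicative V-matrix under the assumptions mu(x) = x, phi(x) = 1 on
    the box [0, c]: V(i, j) = prod_k (c_k - max(x_i^k, x_j^k)).
    A sketch of the special case, not the general formula (53)."""
    X = np.asarray(X, dtype=float)
    M = np.maximum(X[:, None, :], X[None, :, :])  # pairwise coordinate maxima
    return np.prod(c[None, None, :] - M, axis=2)

def regularize_spd(V, tau=0.001):
    """Eigen-basis regularization described in the remark: diagonalize
    V = T D T^T, add tau to the diagonal, and map back with T^{-1} = T^T.
    Because T is orthonormal, the result equals V + tau * I."""
    w, T = np.linalg.eigh(V)
    return T @ np.diag(w + tau) @ T.T

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 3))
V = v_matrix(X, c=np.ones(3))
V_reg = regularize_spd(V)
print(np.allclose(V, V.T))   # True: V is symmetric
```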
Reproducing Kernel Hilbert space
We are looking for solutions of our inference problems in the set of functions \(f(x,\alpha ),\alpha \in \Lambda \) that belong to the Reproducing Kernel Hilbert Space associated with kernel \(K(x,x')\), where \(K(x,x')\) is a continuous positive semidefinite function of variables \(x,x'\in X\subset R^n\):
for any \(\{x_1,\ldots ,x_n\}\) and \(\{c_1,\ldots ,c_n\}\). Consider the linear operator
mapping elements f(s) into elements Af(x) in space H.
In 1909, Mercer showed that, for any continuous positive semidefinite kernel, there exists an orthonormal basis \(\{e_i(x)\}\) consisting of eigenfunctions of operator (58) with kernel \(K(x,x')\), and a corresponding sequence of nonnegative eigenvalues \(\lambda _i\), such that the kernel \(K(x,x')\) has the representation
where the convergence of the sequence is absolute and uniform.
We say that the set \(\varPhi \) of functions f(x) belongs to the Reproducing Kernel Hilbert Space (RKHS) associated with kernel \(K(x,x')\) if the inner product \(<f_1,f_2>\) between functions \(f_1\) and \(f_2\) of this set is such that for any function \(f(x)\in \varPhi \) the equality
holds true. That is, the inner product of functions from \(\varPhi \) with kernel \(K(x,x')\) (where variable \(x'\) is fixed) has the reproducing property.
Consider the parametric set \(\varPhi \) of functions
According to representation (61), kernel \(K(x,x')\) as a function of variable x belongs to the set \(\varPhi \) (the values \(\lambda _i\phi _i (x')\) can be considered as parameters \(c_i\) of the expansion).
In order to define Reproducing Kernel Hilbert Space for set (61), we introduce the following inner product between two functions \(f_b(x),~f_d(x)\), defined by parameters \(b=(b_1,b_2,\ldots )\) and \(d=(d_1,d_2,\ldots )\) in (61):
It is easy to check that, for such inner product, reproducing property (60) of functions from \(\varPhi \) holds true and the square of the norm of function \(f_b(x)\in \varPhi \) is equal to
Properties of RKHS
The following three properties of functions from RKHS make them useful for function estimation problems in highdimensional spaces:

1.
Functions from RKHS with bounded square of norms
$$\begin{aligned} \sum _{i=1}^\infty \frac{b_i^2}{\lambda _i}\le C \end{aligned}$$(64)belong to a compact set and therefore the square of the norm of function can be used as regularization functional (see Lemma in Sect. 4.2).

2.
The function that minimizes empirical loss functional (51) in the RKHS, along with its parametric representation (61) in the infinite-dimensional space of parameters c, has another parametric representation in the \(\ell \)-dimensional space \(\alpha =(\alpha _1,\ldots ,\alpha _\ell )\in R^\ell \), where \(\ell \) is the number of observations:
$$\begin{aligned} f(x,\alpha )=\sum _{i=1}^\ell \alpha _i K(x_i,x). \end{aligned}$$(65)(This fact constitutes the content of the so-called Representer Theorem).

3.
The square of the norm of the chosen function, along with representation (63), has the representation of the form
$$\begin{aligned} \Vert f(x,\alpha )\Vert ^2=<f(x,\alpha ),f(x,\alpha )>=\sum _{i,j=1}^\ell \alpha _i\alpha _jK(x_i,x_j). \end{aligned}$$(66)This representation of the function in RKHS is used to solve the inference problem in high-dimensional space.
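The representer form (65) and the norm (66) can be sketched as follows (our own illustration with an assumed Gaussian kernel):

```python
import numpy as np

def gaussian_kernel(x, xp, delta=1.0):
    """Gaussian Mercer kernel exp(-delta * ||x - x'||^2)."""
    return np.exp(-delta * np.sum((x - xp) ** 2))

# Representer form (65): f(x, alpha) = sum_i alpha_i K(x_i, x),
# with RKHS norm (66): ||f||^2 = alpha^T K alpha.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(6, 2))        # "training" points x_1..x_l
alpha = rng.standard_normal(6)

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

def f(x):
    return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, X))

sq_norm = alpha @ K @ alpha                   # square of the RKHS norm of f

# Reproducing property at training points: in this finite representation,
# <f, K(x_i, .)> = f(x_i) reads (K alpha)_i = f(x_i).
vals = np.array([f(xi) for xi in X])
print(np.allclose(K @ alpha, vals))           # True
```

Since the Gaussian kernel is positive semidefinite, the quadratic form `alpha @ K @ alpha` is nonnegative, matching property (64).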
Properties of Kernels
Kernels \(K(x,x')\) (also called Mercer kernels) have the following properties:

(1)
Linear combination of kernels \(K_1(x,x')\) and \(K_2(x,x')\) with nonnegative weights is the kernel
$$\begin{aligned} K(x,x')=\alpha _1K_1(x,x')+\alpha _2K_2(x,x'),\;\; \alpha _1\ge 0, \alpha _2\ge 0. \end{aligned}$$(67) 
(2)
Product of the kernels \(K_1(x,x')\) and \(K_2(x,x')\) is the kernel
$$\begin{aligned} K(x,x')=K_1(x,x')K_2(x,x'). \end{aligned}$$(68)In particular, the product of kernels \(K(x^k,x'^{k})\) defined on coordinates \(x^k\) of vectors \(x=(x^1,\ldots ,x^m)\) is a multiplicative kernel in the m-dimensional vector space \(x\in R^m\):
$$\begin{aligned} K(x,x')=\prod _{s=1}^mK_s(x^s,x'^{s}). \end{aligned}$$(69) 
(3)
Normalized kernel is the kernel
$$\begin{aligned} K_*(x,x')=\frac{K(x,x')}{{\sqrt{K(x,x)K(x',x')}}}. \end{aligned}$$(70)
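The three properties above can be sketched directly (our own illustration; the base kernel \(1+xx'\) is an assumed example):

```python
import numpy as np

def combine(K1, K2, a1=1.0, a2=1.0):
    """Property (1): a nonnegative linear combination of kernels is a kernel."""
    return lambda x, xp: a1 * K1(x, xp) + a2 * K2(x, xp)

def product(K1, K2):
    """Property (2): the product of two kernels is a kernel."""
    return lambda x, xp: K1(x, xp) * K2(x, xp)

def normalize(K):
    """Property (3): normalized kernel K(x,x') / sqrt(K(x,x) K(x',x'))."""
    return lambda x, xp: K(x, xp) / np.sqrt(K(x, x) * K(xp, xp))

# Example with a simple (assumed) base kernel 1 + x x' on R^1:
lin = lambda x, xp: 1.0 + x * xp
lin_n = normalize(lin)
print(lin_n(2.0, 2.0))   # normalized kernels satisfy K*(x, x) = 1
```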
Examples of Mercer Kernels

(1)
Gaussian kernel in \(x\in R^1\) has the form
$$\begin{aligned} K(x,x')=\exp \{-\delta (x-x')^2\}, \quad x,x'\in R^1, \end{aligned}$$(71)where \(\delta >0\) is a free parameter of the kernel. In the m-dimensional space \(x\in R^m\), the Gaussian kernel has the form (69)
$$\begin{aligned} K(x,x')=\prod _{k=1}^m\exp \{-\delta (x^k-x'^{k})^2\}= \exp \{-\delta \Vert x-x'\Vert ^2\},\quad x,x'\in R^m. \end{aligned}$$(72)
(2)
INK-spline kernel (spline with an infinite number of knots). The INK-spline kernel of order d was introduced in Vapnik (1995). For \(x\in [0,c]\), it has the form
$$\begin{aligned} K(x,x')= \int _0^c(x-t)^d_+(x'-t)^d_+dt= \sum _{r=0}^d\frac{C^r_d}{2d-r+1}[\min (x,x')]^{2d-r+1}|x-x'|^r,\nonumber \\ \end{aligned}$$(73)where we have denoted \((z)_+=\max (z,0)\). In particular, the INK-spline kernel of order 0 has the form
$$\begin{aligned} K_0(x,x')=\min (x,x'). \end{aligned}$$(74)This INK-spline is used to approximate piecewise continuous functions. The INK-spline kernel of order 1 has the form
$$\begin{aligned} K_1(x,x')= \frac{1}{2}|x-x'|\min (x,x')^2+\frac{\min (x,x')^3}{3}. \end{aligned}$$(75)This INK-spline is used to approximate smooth functions. Its properties are similar to those of cubic splines that are used in the classical theory of approximations.
The described two types of kernels reflect different ideas of function approximation: local approximation of the desired function (Gaussian kernel) and global approximation of the desired function (INKspline kernels).
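As a sketch (not part of the paper's code), the two INK-spline kernels can be implemented and the order-1 closed form (75) checked numerically against the defining integral (73) with d = 1:

```python
import numpy as np

def ink_spline_0(x, xp):
    """INK-spline kernel of order 0, K_0(x, x') = min(x, x')."""
    return min(x, xp)

def ink_spline_1(x, xp):
    """INK-spline kernel of order 1 in its closed form (75)."""
    m = min(x, xp)
    return 0.5 * abs(x - xp) * m ** 2 + m ** 3 / 3.0

# Sanity check of the closed form against the defining integral with d = 1:
# K_1(x, x') = Int_0^c (x - t)_+ (x' - t)_+ dt, here via a trapezoidal sum.
x, xp, c = 0.7, 0.4, 1.0
t = np.linspace(0.0, c, 200001)
integrand = np.maximum(x - t, 0.0) * np.maximum(xp - t, 0.0)
numeric = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))
print(abs(numeric - ink_spline_1(x, xp)))   # negligible: the closed form matches
```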
Basic solution of inference problems
We solve our inference problem using functions
where \(\psi (x)\) belongs to the RKHS of kernel \(K(x,x')\) and \(c\in R^1\) is the bias.
In order to find the solution of our integral equation, we minimize the regularization functional (48). For the solution in RKHS with bias, we use the representation (see Sect. 5.2.1)
where \(x_i,~i=1,\ldots ,\ell \) are vectors from the training set. For the regularization term, we use the square of the norm of the function \(\psi (x)\) in RKHS described in representation (66) as
In order to solve our ill-posed problem of inference, we minimize the functional (48). Taking into account the explicit expression of distance (51) and regularizer (66), we obtain
in the set of functions (77).
Matrix-vector notations. We use the following notations:

1.
\((\ell \times 1)\)-dimensional matrix A of elements \(\alpha _i\):
$$\begin{aligned} A=(\alpha _1,\ldots ,\alpha _\ell )^T; \end{aligned}$$ 
2.
\(\ell \)-dimensional vector function \({{{\mathcal {K}}}}(x)\):
$$\begin{aligned} {{{\mathcal {K}}}}(x)=(K(x_1,x),\ldots ,K(x_\ell ,x))^T; \end{aligned}$$ 
3.
\((\ell \times \ell )\)-dimensional matrix K with elements \(K(x_i,x_j)\):
$$\begin{aligned} K=K(x_i,x_j),\quad i,j=1,\ldots ,\ell ; \end{aligned}$$ 
4.
\((\ell \times \ell )\)-dimensional matrix V of elements V(i, j):
$$\begin{aligned} V=V(i,j),\quad i,j=1,\ldots ,\ell ; \end{aligned}$$ 
5.
\(\ell \)-dimensional vector of elements \(y_i\) of the training set:
$$\begin{aligned} Y=(y_1,\ldots ,y_\ell )^T; \end{aligned}$$ 
6.
\(\ell \)-dimensional vector of ones:
$$\begin{aligned} 1_\ell =(1,\ldots ,1)^T; \end{aligned}$$
Using these notations, we rewrite the functional (79) in the form
Closed-form solution of the minimization problem
In order to find the solution
of our inference problem in the set of functions that belong to RKHS of kernel \(K(x,x')\), we have to minimize functional (80) over parameters A and c (since the term \(Y^TVY\) does not depend on the parameters A, c, we drop it). The necessary conditions of minimum are
From the first equation of (82) we obtain
so
We then compute vectors
According to (83), the desired vector A has the form
From the second equation of (82) and (85), we obtain the equation defining c:
where \(A_b\) and \(A_c\) are defined by (84). We have
and the value of bias c is thus
Putting c in (85), we obtain the desired parameters A. The desired function f has the form (81).
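The chain (82)–(86) can be sketched in code (a sketch consistent with the stationarity conditions, under the assumptions that K is nonsingular so that \((VK+\gamma I)A=V(Y-c1_\ell )\); the helper name `vmatrix_solve` is ours):

```python
import numpy as np

def vmatrix_solve(K, V, Y, gamma):
    """Closed-form minimizer of functional (80),
        (K A + c 1 - Y)^T V (K A + c 1 - Y) + gamma A^T K A,
    following the chain (84)-(86): a sketch assuming K is nonsingular, so
    stationarity reduces to (V K + gamma I) A = V (Y - c 1)."""
    ell = len(Y)
    ones = np.ones(ell)
    M = V @ K + gamma * np.eye(ell)
    A_b = np.linalg.solve(M, V @ Y)        # the two solves share the matrix M
    A_c = np.linalg.solve(M, V @ ones)
    c = (ones @ V @ (Y - K @ A_b)) / (ones @ V @ (ones - K @ A_c))
    return A_b - c * A_c, c

# Toy check with a Gaussian kernel; V = I reduces the estimate to ordinary
# ridge regression in the RKHS.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 1))
Y = (X[:, 0] > 0.5).astype(float)
K = np.exp(-5.0 * (X[:, 0, None] - X[None, :, 0]) ** 2)
A, c = vmatrix_solve(K, np.eye(30), Y, gamma=0.1)
f_train = K @ A + c                        # fitted conditional probabilities
print(f_train.min(), f_train.max())
```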
Remark
There exists one more piece of prior knowledge about the conditional probability function: this function is nonnegative and bounded by 1. We can easily construct a function that is nonnegative and bounded at all the points of observation. In order to enforce such a solution, it is sufficient to minimize the functional (80) subject to the additional constraints
This inequality is defined coordinate-wise, where \(0_\ell \) is the \(\ell \)-dimensional vector of zeros and \(1_\ell \) is the \(\ell \)-dimensional vector of ones. Unfortunately, one cannot obtain a closed-form solution of the optimization problem with these constraints.
Dual estimate of conditional probability
Let \(y\in \{0,1\}\). The estimate of the conditional probability \(P(y=1|x)\) has the form
Similarly, the conditional probability for class \(y=0\) is
where parameters \(A^*\) and \(c^*\) are obtained using formulas (85), (86), with binary vector \(Y_0=1_\ell -Y\) (the vector-indicator of class \(y=0\)).
According to the definition, for these conditional probability functions, the equality
holds for all \(x\in X\).
When one estimates the conditional probability function, this equality constitutes prior knowledge.
Below we estimate two conditional probabilities for which equality (90) holds true only for vectors x of the training set. We replace equality (90) with the following equality:
Taking into account (91), we estimate parameters \(A,A^*,c,c^*\) of Eqs. (88), (89). In order to do this, we minimize the functional which is the sum of two functionals of type (80) for estimation of parameters of conditional probabilities (88), (89):
where \(\gamma ,\gamma _*>0\) and \(Y^*=1_\ell -Y\). Using the property \(KK^+K=K\) of the pseudoinverse \(K^+\) of matrix K and the expression [obtained from (91)]
we can rewrite the last term of functional (92) as
and the functional (92) itself as
In order to find the solution of this optimization problem, we minimize functional (93) with respect to A, c, and \(c^*\). The necessary conditions of minimum are
In order to simplify the expressions, we introduce the following notations:
The equivalent expression is
We then compute vectors
(these vectors can be computed in one run) and define A as
Using expression (98) for A in (97) and using the notations (95), we obtain
From equations (99), we find parameters \(c,c^*\) that define vector A in (98) and the desired conditional probability (88).
Note that the dual estimate takes into account how well the matrix K is conditioned: the solution of the optimization problem uses expressions \(K^+K\) and \(K^+1_\ell \) [see formulas (96), (97)].
Remark
Since condition (91) holds true for the dual estimate of conditional probability, in order to satisfy inequalities (87), it is sufficient to satisfy the inequality
Therefore, in order to obtain the dual estimate of conditional probability using available general prior knowledge, one has to minimize the functional (93) subject to constraint (100).
Part Two: Intelligence-driven learning: learning using statistical invariants
In this section, we introduce a new paradigm of learning called Learning Using Statistical Invariants (LUSI), which is different from the classical data-driven paradigm.
The new learning paradigm considers a model of Teacher–Student interaction. In this model, the Teacher helps the Student to construct statistical invariants that exist in the problem. While selecting the approximation of conditional probability, the Student chooses a function that preserves these invariants.
Strong and weak modes of convergence
The idea of including invariants into a learning scheme is based on the mathematical structure of Hilbert space. In Hilbert space, relations between two functions \(f_1(x)\) and \(f_2(x)\) have two numerical characteristics:

1.
The distance between functions
$$\begin{aligned} \rho (f_1,f_2)= \Vert f_1(x)-f_2(x)\Vert \end{aligned}$$that is defined by the metric of the space \(L_2\) and

2.
The inner product between functions
$$\begin{aligned} R(f_1,f_2)=(f_1(x),f_2(x)) \end{aligned}$$that has to satisfy the corresponding requirements.
The existence of two different numerical characteristics implies two different modes of convergence of the sequence of functions from \(f_\ell (x)\in L_2\) to the desired function \(f_0(x)\):

1.
The strong mode of convergence (convergence in metric)
$$\begin{aligned} \lim _{\ell \rightarrow \infty }\Vert f_\ell (x)-f_0(x)\Vert =0 \end{aligned}$$
2.
The weak mode of convergence (convergence in inner products)
$$\begin{aligned} \lim _{\ell \rightarrow \infty }(f_\ell (x)-f_0(x),\psi (x))=0, \quad \forall \psi (x)\in L_2 \end{aligned}$$(note that convergence has to take place for all functions \(\psi (x)\in L_2\)).
It is known that the strong mode of convergence implies the weak one. Generally speaking, the reverse is not true.
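The gap between the two modes can be seen on a classical example (our illustration, not the paper's): \(f_\ell (x)=\sin (2\pi \ell x)\) on [0, 1] converges weakly to \(f_0=0\) in \(L_2\) by the Riemann–Lebesgue lemma, yet \(\Vert f_\ell \Vert =1/\sqrt{2}\) for every \(\ell \), so there is no strong convergence.

```python
import numpy as np

# f_l(x) = sin(2*pi*l*x) converges weakly to 0 in L2 but not strongly.
x = np.linspace(0.0, 1.0, 100001)
psi = x ** 2                                   # an arbitrary fixed test function

def inner(f, g):
    """Trapezoidal approximation of the L2 inner product on [0, 1]."""
    h = f * g
    return float(np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(x)))

for ell in (1, 10, 100):
    f_ell = np.sin(2.0 * np.pi * ell * x)
    weak = inner(f_ell, psi)                   # -> 0 as ell grows
    strong = np.sqrt(inner(f_ell, f_ell))      # stays near 1/sqrt(2)
    print(ell, weak, strong)
```

For \(\psi (x)=x^2\) the inner product equals \(-1/(2\pi \ell )\), shrinking with \(\ell \), while the norm does not shrink at all.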
In the first part of the paper, we showed that, in the classical (data-driven) paradigm, we can estimate the conditional probability function by solving the ill-posed problem of the approximately defined Fredholm equation (30), obtaining the V-matrix estimate (see Sect. 5.4).
In the second part of the paper, we consider new learning opportunities arising from interaction with the Teacher. The goal of that interaction is to include mechanisms of both weak and strong convergence in learning algorithms.
Here is the essence of the new mechanism. According to the definition, weak convergence has to take into account all functions \(\psi (x)\in L_2\). The role of the Teacher in our model is to replace this infinite set of functions with a specially selected finite set of predicate functions \(\mathcal{{P}}=\{\psi _1(x),\ldots ,\psi _m(x)\}\) that describe properties of the desired conditional probability function, and to restrict the scope of weak convergence to the set of predicate functions \({{{\mathcal {P}}}}\).
Method of learning using statistical invariants
Let us describe a method of extracting specific intelligent information from data by preserving statistical invariants.
Estimation of the conditional probability function. Suppose that the Teacher has chosen m predicate functions
The Teacher believes that these functions describe important (integral) properties of the desired conditional probability function defined by the equalities (invariants)
where \(a_s\) is the expected value of \(\psi _s(x)\) with respect to measure \(P(y=1,x)\).
Suppose now that the values \(a_1,\ldots ,a_m\) are known. Then we can formulate the following two-stage procedure for estimating the desired function [solving Fredholm equation (30)]:

1.
Given pairs \((\psi _k(x),a_k),~k=1,\ldots ,m\) (predicates and their expectations for the desired conditional probability function), find the set of conditional probability functions \({{{\mathcal {F}}}}=\{P(y=1|x)\}\) satisfying equalities (102) (preserving the invariants).

2.
Select, in the set of functions \({{{\mathcal {F}}}}\) satisfying invariants (102), the function that is the solution of our estimation problem (the function of form (88) with parameters that minimize (80) (or (92))).
In other words, the idea is to use data to look for an approximation \(P_\ell (y=1|x)\) in the subset of functions that preserve the m invariants (102) formulated with the Teacher’s predicates.
In our model, the Teacher suggests only functions (101) and does not provide values \(a_s,~s=1,\ldots ,m\). However, for any function \(\psi _s(x)\), using training data \((x_i,y_i),~i=1,\ldots ,\ell \), we can estimate the corresponding value \(a_s\), obtaining the following approximation for invariants (102):
Remark
Expression (103) defines the equality only approximately. When one adds invariants sequentially, one can take this fact into account. Suppose that we have constructed the following approximation using m predicates:
In order to choose the next predicate \(\psi _{m+1}(x)\), we consider the value
If \({{{\mathcal {T}}}}\ge \delta \) (where \(\delta \) is a small positive threshold), we consider the new invariant defined by predicate \(\psi _{m+1}(x)\). Otherwise, we treat this approximation as an equality and do not add this invariant.
To use vector-matrix notations, we define the set of \(\ell \)-dimensional vectors
In these notations, expression (103) with the m predicates chosen for invariants can be written as
Consider a new predicate \(\psi _{m+1}(x)\), the corresponding vector \(\varPhi _{m+1}\), and rewrite expression (105) as
We treat expression (107) as an equality^{Footnote 10} if \({{{\mathcal {T}}}}\le \delta \).

The LUSI method uses an approximation function of the form (88), where parameters A and c are the solution of the following quadratic optimization problem: minimize functional (80) (or (92) for the dual estimate) subject to constraints (107).
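Minimizing functional (80) subject to invariant constraints is an equality-constrained quadratic program. The paper gives a closed form later via its Lagrangian expansion; the sketch below instead solves the generic KKT system directly (our own formulation and helper name `lusi_solve`, with V = I and the simple predicates \(\psi _0(x)=1\), \(\psi _1(x)=x\) as assumed examples):

```python
import numpy as np

def lusi_solve(K, V, Y, Phi, gamma):
    """Minimize functional (80) subject to the m invariant constraints
    Phi_s^T (K A + c 1) = Phi_s^T Y.  A generic KKT sketch, not the paper's
    closed-form expansion; Phi has one column per predicate, Phi[i, s] = psi_s(x_i)."""
    ell, m = len(Y), Phi.shape[1]
    ones = np.ones((ell, 1))
    B = np.hstack([K, ones])                 # maps z = [A; c] to K A + c 1
    H = 2.0 * B.T @ V @ B
    H[:ell, :ell] += 2.0 * gamma * K         # regularizer gamma A^T K A
    b = 2.0 * B.T @ V @ Y
    G = Phi.T @ B                            # constraint matrix (m x (ell+1))
    d = Phi.T @ Y
    # KKT system [[H, G^T], [G, 0]] [z; mu] = [b; d]
    KKT = np.block([[H, G.T], [G, np.zeros((m, m))]])
    sol = np.linalg.lstsq(KKT, np.concatenate([b, d]), rcond=None)[0]
    return sol[:ell], sol[ell]               # A and the bias c

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 1))
Y = (X[:, 0] > 0.5).astype(float)
K = np.exp(-5.0 * (X[:, 0, None] - X[None, :, 0]) ** 2)
Phi = np.hstack([np.ones((20, 1)), X])       # psi_0(x) = 1, psi_1(x) = x
A, c = lusi_solve(K, np.eye(20), Y, Phi, gamma=0.1)
est = K @ A + c
# The invariants are preserved exactly on the training set:
print(np.abs(Phi.T @ est - Phi.T @ Y))
```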
Rules for the t-class pattern recognition problem. Consider the t-class pattern recognition problem. Our goal is to estimate t conditional probability functions \(P(y=s|x),~s=1,\ldots ,t\) in order to combine them into the rule
In order to define appropriate invariants for the conditional probability function \(P(y=s|x)\), consider the following two-class classification problem: we define the elements of class \(y=s\) as class \(y^*=1\) and the elements of all other classes \(y\ne s\) as class \(y^*=0\). One obtains parameters \(A_V^s\) and \(c^s\) for estimating the corresponding conditional probability by using the formulas described above with the corresponding vector \(Y_s=(y_1^*,\ldots ,y^*_\ell )^T\).
Closed-form solution of intelligence-driven learning
The optimization problem of minimizing functional (80) subject to the constraints (105) has a closed-form solution. Consider the Lagrangian for this problem:
As before, we obtain the following Lagrangian conditions:
From the first line of (110), we obtain the expression
and the expression
We compute \((m+2)\) vectors (this can be done simultaneously in one run)
The desired vector A has the expression
Putting expression (116) back into the last two lines of (110), we note that, in order to find the coefficient c and the m coefficients \(\mu _s\) of expansion (116), we have to solve the following system of \(m+1\) linear equations:
Using estimated vector A and bias c, we obtain the desired function (88).
Summary. In this section, we have obtained a method for estimating conditional probability functions using invariants. The estimate has the form
where the vector of coefficients of expansion A has the structure
Vectors \(A_V,A_c\) and \(A_{s}, s=1,\ldots ,m\) are obtained using formulas (113), (114), (115), and coefficients c and \(\mu _s\) of composition (119) are the solutions of linear equations (117), (118). We call this algorithm the vSVM algorithm with m invariants, vSVM&\(\hbox {I}_m\).
When estimating the conditional probability function, one can also take into account additional prior knowledge (87). To find parameters \(A_V\) and c of an approximation that takes this information into account, one has to solve the following quadratic optimization problem: minimize the functional (80) subject to m equality constraints (107) and \(\ell \) inequality constraints (87).
Dual intelligence-driven estimate of conditional probability
The dual intelligence-driven estimate of conditional probability requires minimization of functional (93) subject to equality-type constraints (105).
In order to solve this optimization problem, we define the Lagrangian
where \(\mu _s\) are Lagrange multipliers. In order to find the solution of this optimization problem, we minimize this functional with respect to A, c, and \(c^*\). The necessary conditions of minimum using notations (95) are
From the first equality of (121) we obtain
We then compute vectors
(these vectors can be computed in one run) and define A as
Putting the expression for A into the last three lines of (121), we obtain the following linear system of equations for computing the unknown parameters c, \(c^*\), and \(\mu _s\).
LUSI methods: illustrations and remarks
In this section, we compare different methods of conditional probability estimation for two-class pattern recognition problems. First, we consider four estimation methods on one-dimensional examples; then we move to multidimensional examples. We apply the following four methods:

1.
SVM estimate

2.
vSVM estimate.^{Footnote 11}

3.
SVM&\(\hbox {I}_{n}\) estimate (SVM with n invariants). In this section, invariants are defined by the simple predicate functions^{Footnote 12}
$$\begin{aligned} \psi _0(x)=1\quad \text{ and }\quad \psi _1(x)=x. \end{aligned}$$ 
4.
vSVM& \(\hbox {I}_{n}\) estimate (vSVM with n invariants).
Invariants (107) with predicate \(\psi _0(x)\) define the set of conditional probability estimates for which the frequency of elements of class \(y=1\) is the same as observed on the training data. Invariants (107) with predicate \(\psi _1(x)\) define the set of conditional probability estimates for which the center of mass of elements x belonging to class \(y=1\) is the same as observed on the training data. In this section, we consider experiments with simple invariants, i.e., with invariants obtained from functions \(\psi _0(x)\) and \(\psi _1(x)\) (preserving frequencies and means, respectively).
1. The experiments in one-dimensional space are presented in Fig. 1 (vertical axes denote conditional probability, while horizontal axes denote values of x). The first four rows of that figure show the results of 12 experiments, where each row corresponds to one of the four methods described above, and each column corresponds to one of three sizes of training data (48, 96, 192). In all the images, the actual conditional probability function \(P(y=1|x)\) is shown in blue, while the obtained estimate \(P_\ell (y=1|x)\) is shown in black. The images also show training data: representatives of class \(y=1\) are shown in red, while representatives of class \(y=0\) are shown in green. In order to demonstrate the robustness of the proposed methods, in all the experiments we use twice as many representatives of class \(y=0\) as representatives of class \(y=1\) (\(48=16+32; 96=32+64; 192=64+128\)). All the estimates were obtained using the INK-spline kernel (75).
The first row of Fig. 1 shows results of SVM estimates (square-loss SVM); the second row shows results of vSVM estimates; the third row shows results of SVM&\(\hbox {I}_2\) estimates; the fourth row shows results of vSVM&\(\hbox {I}_2\) estimates. The last two rows of Fig. 1 show results obtained by modified SVM&\(\hbox {I}_2\) and modified vSVM&\(\hbox {I}_2\), which are described in subdivision 4 after Table 1.
From Fig. 1, one can see that vSVM estimates are consistently better than SVM estimates and that adding the invariants consistently improves the quality of approximations. One can also see that approximations obtained using vSVM provide smoother functions.
2. Using the approximation \(P_\ell (y=1|x)\) of the conditional probability function, we obtain the classification rule
(Fig. 1 shows the horizontal line \(y=1/2\). Geometrically, the rule classifies x as class \(y=1\) where the curve \(P_\ell (y=1|x)\) is above this line; otherwise, it classifies x as class \(y=0\)).
The quality of rule \(r_\ell (x)\) is measured by the value of probability of error
When \(P_\ell (y=1|x)\) coincides with the actual conditional probability function \(P(y=1|x)\), the obtained rule \(r_B(x)\) is called Bayesian. It gives the smallest value for the problem’s error rate, \(P_{err}(r_B)\). For the problem shown in Fig. 1, the Bayesian loss is \(P_{err}(r_B)= 10.87\% \). Losses of rules \(r_\ell (x)\) based on estimates of conditional probability are larger than that.
3. Numerical characteristics of accuracy of approximations are given in Table 1. It shows, for different sizes of training data, the following:

1.
The distance between the estimate and the true function in \(L_2\) metric
$$\begin{aligned} \rho _{L_2}(P,P_\ell )=\left( \int (P(y=1|x)-P_\ell (y=1|x))^2d\mu (x)\right) ^{1/2}. \end{aligned}$$
2.
The error rate \(P_{err}(r_\ell )\) of rule \(r_\ell (x)=\theta (P_\ell (y=1|x)-0.5)\), in percent, and

3.
The values of relative (with respect to Bayesian rule) losses
$$\begin{aligned} \kappa _{err}(r_\ell )=\frac{P_{err}(r_\ell )-P_{err}(r_B)}{P_{err}(r_B)}. \end{aligned}$$
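The three characteristics above can be sketched numerically. In the sketch below, both the true and the estimated conditional probabilities are hypothetical stand-ins, and the measure \(\mu \) is taken uniform on [0, 1].

```python
import numpy as np

# Grid over [0, 1] with mu uniform, so integrals reduce to means.
x = np.linspace(0.0, 1.0, 100_001)

p_true = x                                             # true P(y=1|x) (a stand-in)
p_est = np.clip(x + 0.05 * np.sin(10 * x), 0.0, 1.0)   # estimate P_ell(y=1|x)

# 1. L2 distance between the estimate and the true function.
rho_L2 = float(np.sqrt(np.mean((p_true - p_est) ** 2)))

# 2. Error rate of the thresholding rule r(x) = theta(P(y=1|x) - 0.5):
#    P(error|x) is P(y=1|x) where the rule says 0, and 1 - P(y=1|x) where it says 1.
def err_rate(p_rule):
    r = p_rule > 0.5
    return float(np.mean(np.where(r, 1.0 - p_true, p_true)))

p_err_rule, p_err_bayes = err_rate(p_est), err_rate(p_true)

# 3. Relative loss with respect to the Bayesian rule.
kappa = (p_err_rule - p_err_bayes) / p_err_bayes
print(round(rho_L2, 4), round(p_err_rule, 4), round(p_err_bayes, 4), round(kappa, 4))
```

Since the Bayesian rule minimizes the conditional error pointwise, \(\kappa _{err}\) is always non-negative.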
4. From Fig. 1 and Table 1, one can see that, with increasing sophistication of the conditional probability estimation methods, the obtained estimates get monotonically closer (in \(L_2\) metric) to the true conditional probability function. However, this monotonic convergence does not always hold for the error rates provided by the constructed conditional probability functions. This is because our mathematical goal was to approximate the conditional probability function in \(L_2\) metric (not the decision rule). A good estimate of conditional probability for a classification rule is a function that crosses the line \(y=0.5\) close to the point where this line is crossed by the true conditional probability; it does not have to be close to the true function in \(L_2\) metric (see Fig. 1).
Estimation of conditional probability for constructing classification rule. We can take into account our ultimate goal of constructing the classification rule if we estimate the conditional probability more accurately in the area of x where \(|P(y=1|x)-0.5| < \epsilon \), at the expense of the accuracy of this estimate in the area of x where \(|P(y=1|x)-0.5|\gg \epsilon \). In order to do this, we note that the variance of y for a fixed x is
It achieves the largest value 1/4 when \(P(y=1|x)=1/2\) and monotonically decreases (down to zero) as the deviation \(|P(y=1|x)-0.5|\) increases. We can use this fact when estimating distance (50). Suppose we know the function
(or some other monotonically increasing function of \(\sigma ^2(x)\), for instance, \(\phi (\sigma ^2(x)+\nu )\)). Consider (50) in the equivalent form
and let us minimize the functional with the square of distance defined as
Using techniques of Sect. 5.1, we obtain the modified \(V_M\)-matrix with the elements
instead of elements (54) (or (56)).
We can also construct the modified \(I_M\)-matrix, which is a diagonal matrix with elements
and \(I(i,j)=0\) for \(i\ne j\).
In reality, we do not know the function \(P(y=1|x)\). However, we can use its estimate \(P_\ell (y=1|x)\) to compute \(\sigma (x_i)\). Therefore, in order to construct a special conditional probability function for subsequent creation of a decision rule, we can use the following two-stage procedure: in the first stage, we estimate (SVM&\(\hbox {I}_n\) or vSVM&\(\hbox {I}_n\)) approximations of \(P_\ell (y=1|x)\) and \(\sigma ^2(x)\), and, in the second stage, we obtain an estimate of the specialized conditional probability function (mSVM&\(\hbox {I}_n\) or mvSVM&\(\hbox {I}_n\)) for the decision rules.
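A small sketch of the variance-based weighting: the stage-one values of \(P_\ell (y=1|x_i)\) below are hypothetical placeholders (in the paper they come from the SVM&\(\hbox {I}_n\) or vSVM&\(\hbox {I}_n\) estimate), but the structure — weights peaking at 1/4 near \(P=1/2\) — is as described.

```python
import numpy as np

# Stage one produced some estimate P_ell(y=1|x_i) at the training points
# (hypothetical placeholder values).
p_stage1 = np.array([0.05, 0.30, 0.48, 0.52, 0.70, 0.95])

# Variance sigma^2(x) = P(y=1|x)(1 - P(y=1|x)): largest (1/4) at P = 1/2,
# decreasing to zero as |P(y=1|x) - 0.5| grows.
sigma2 = p_stage1 * (1.0 - p_stage1)

# Weight function phi(x) = sigma^2(x): points near the decision boundary
# P = 1/2 get the largest weight; its diagonal matrix sketches the
# structure of the modified diagonal I_M-matrix.
I_M = np.diag(sigma2)
print(sigma2.round(4))
```

In the second stage these weights concentrate the accuracy of the new estimate near the decision boundary, which is exactly where the classification rule is decided.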
The last two rows of Fig. 1 and the last two columns of Table 1 compare the rules obtained using SVM&\(\hbox {I}_2\) estimates and vSVM&\(\hbox {I}_2\) estimates with approximations obtained using mSVM&\(\hbox {I}_2\) and mvSVM&\(\hbox {I}_2\) estimates (in both cases, we used weight function \(\phi (x)=\sigma ^2(x)\)). It is interesting to note that estimates based on the modified weight function \(\phi (x)\) not only improve the error rates of the corresponding classification rules but also provide better approximations of the conditional probability functions in \(L_2\) metric.
5. Our one-dimensional experiments demonstrate the following:

1.
vSVM results are more accurate than SVM results.

2.
Learning using statistical invariants can significantly improve performance. Both vSVM&\(\hbox {I}_2\) and SVM&\(\hbox {I}_2\) algorithms using 48 training examples achieve much better performance (very close to Bayesian) than SVM or vSVM algorithms using 192 examples. The results obtained with the mSVM&\(\hbox {I}_2\) method are consistently better than those obtained with the SVM&\(\hbox {I}_2\) method.

3.
The effect of adding invariants appears to be more significant than the effect of upgrading SVM to vSVM.
The same conclusion can be derived from analysis of experiments for high-dimensional problems, where we use \((n+1)\) invariants: one zeroth-order moment and n first-order moments, where n is the dimensionality of the problem (see Table 2). In this table, for eight calibration datasets from Lichman (2013) (listed in the first column), we conducted 20 experiments. In each, we created a random partition of the given dataset into training and test sets (their sizes are listed in the second and third columns, respectively), and then applied two algorithms (SVM and SVM&\(\hbox {I}_{(n+1)}\)) for estimating the conditional probability function. We then used these conditional probability functions to construct the classification rules. For all experiments, we used the RBF kernel (72). As the results show, SVM&\(\hbox {I}_{(n+1)}\) is consistently better than SVM.
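The \((n+1)\) moment invariants used in these experiments can be sketched as predicate vectors; the training data below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: ell points in n dimensions with labels in {0, 1}.
ell, n = 50, 8
X = rng.uniform(0.0, 1.0, (ell, n))
Y = rng.integers(0, 2, ell).astype(float)

# Zeroth-order moment predicate: psi_0(x) = 1 for all x.
Phi = [np.ones(ell)]
# First-order moment predicates: psi_j(x) = x^(j), the j-th coordinate.
Phi += [X[:, j] for j in range(n)]          # n + 1 = 9 invariant vectors in all

# Each invariant requires the estimate P_ell(y=1|x_i) to satisfy
# (1/ell) Phi_s^T P_ell = (1/ell) Phi_s^T Y,  s = 0, ..., n.
rhs = np.array([p @ Y / ell for p in Phi])
print(len(Phi), round(float(rhs[0]), 4))
```

The zeroth-order invariant simply forces the average of the estimated conditional probability over the training points to match the empirical class frequency.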
6. Table 3 compares results of SVM&\(\hbox {I}_n\) with SVM for two datasets, Diabetes and MAGIC, using training data of different sizes. Experiments show that incorporation of invariants significantly improves SVM for all sizes of training sets. In order to achieve accurate results in these problems, SVM&\(\hbox {I}_{(n+1)}\) requires significantly fewer training examples than SVM.
Remark
When analyzing the experiments, it is important to remember that the Intelligent Teacher suggests to the Student “meaningful” functions for invariants instead of simple ones that we used here for illustrative purposes. In many of our experiments, the invariants were almost satisfied: the correcting parameters \(\mu _s\) in (119) were close to zero. Some ideas about what form these “meaningful” invariants can take are presented in the next Sections.
Examples of invariants with nonlinear predicate-functions
Example
Table 2 shows that, using nine simple invariants (moments of zeroth and first order), SVM&\(\hbox {I}_9\) achieved error rate 22.73% on “Diabetes”. In order to obtain a better performance, we introduce an additional invariant constructed in some two-dimensional subspace of the eight-dimensional space X of the problem (Fig. 2). We choose (using training data) the square box \({{{\mathcal {B}}}}\) (it is shown in Fig. 2) and consider the following predicate: \(\psi _3(x)=1\) if \(x\in {{{\mathcal {B}}}}\) and \(\psi _3(x)=0\) otherwise. Our additional invariant is defined by this predicate \(\psi _3(x)\). As a result, when using all 10 invariants, we decrease the error rate to 22.07%.
Note that function \(\psi _3(x)\) was obtained due to some intelligent input: we selected the square box intuitively, just looking at the positions of training points.^{Footnote 13} Perhaps we can continue to improve performance (if we are still far from the Bayesian rate) by adding more invariants (choosing different subspaces, different boxes, and different functions^{Footnote 14}).
As we have already noted, there exists an important difference between the idea of features used in classical algorithms and the idea of predicates used for constructing invariants.
In order to construct an accurate rule, the classical paradigm recommends introducing special functions called features and constructing the rule based on these features. With an increasing number of features, one increases the capacity (the VC dimension) of the set of admissible functions from which one chooses the solution. According to the VC bounds, the larger the capacity of that set, the more training examples are required for an accurate estimate.
With the increase of the number of invariants, one decreases the capacity of admissible set of functions, since the functions from which one chooses the solution preserve all the invariants. According to the same VC bounds, the decrease of the capacity of admissible functions improves the performance while using the same number of training examples.
More examples of intelligence-driven invariants
The role of the Intelligent Teacher in the training processes described in this paper is to introduce functions \(\psi _1(x),\ldots , \psi _n(x)\) for the problem of interest. The Student has to understand which functions the Teacher suggested and use them to create invariants. Below are some examples of functions that the Teacher can suggest for invariants.
Example 1
Suppose that the Teacher teaches Student to recognize digits by providing a number of examples and also suggesting the following heuristics: “In order to recognize the digit zero, look at the center of picture—it is usually light; in order to recognize the digit 2, look at the bottom of the picture—it usually has a dark tail” and so on.
From the theory above, the Teacher wants the Student to construct specific predicates \(\psi (x)\) and use them for invariants. However, the Student does not necessarily construct exactly the same predicate that the Teacher had in mind (the Student’s understanding of the concepts “center of the picture” or “bottom of the picture” can be different). Instead of \(\psi (x)\), the Student constructs a function \({{{\widehat{\psi }}}}(x)\). However, this is acceptable, since any function from \( L_2\) can serve as a predicate for an invariant.
The Teacher explains which group of pixels constitutes the center of the picture, and the Student chooses “approximately” these pixels. Suppose that the Student chose pixels \(p_{t_1},\ldots ,p_{t_p}\), and the lightness of pixel \( p_{t_1}\) is \(x(p_{t_1})\). The lightness of the center of picture \(x_i\) is measured by the sum of lightness values of the chosen pixels, defined by the inner product
where \(z_0\) is a binary vector which has coordinates that are equal to one for the chosen group of pixels that define the center of picture, while all other coordinates are equal to zero.^{Footnote 15} (Similarly, binary vector \(z_2\) defines the concept of darkness of tail in the bottom of picture.) Using these functions and a number of examples, Student defines the invariants
where \({\widehat{\varPhi }}=({{{\widehat{\psi }}}}(x_1),\ldots ,{{{\widehat{\psi }}}}(x_\ell ))^T\). This equation describes the idea of lightness of the center of digit 0 (or darkness of the tail in digit 2). Using these invariants, the Student constructs a rule that takes the Teacher’s heuristics into account.
Generally speaking, the vector z could be any real-valued vector, defining an arbitrary function that is linear in the pixel space.
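Example 1's construction can be sketched on toy data; the 4x4 "images" and the central-pixel mask \(z_0\) below are hypothetical illustrations of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": ell flattened 4x4 grayscale pictures with labels (hypothetical data).
ell = 8
X = rng.uniform(0.0, 1.0, (ell, 16))
Y = rng.integers(0, 2, ell).astype(float)   # labels y in {0, 1}

# Binary mask z0: ones on the 2x2 central pixels, zeros elsewhere.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
z0 = mask.ravel()

# Predicate value psi(x_i) = (z0, x_i): lightness of the center of picture i.
Phi = X @ z0

# The invariant constrains the estimate P_ell(y=1|x_i) to satisfy
# (1/ell) sum_i psi(x_i) P_ell(y=1|x_i) = (1/ell) sum_i psi(x_i) y_i.
rhs = Phi @ Y / ell
print(round(float(rhs), 4))
```

A different binary vector (e.g., over the bottom rows) would play the role of \(z_2\) for the "dark tail" heuristic.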
Example 2
Suppose that a medical doctor teaches a medical student to classify some disease (\(y=1\)). Suppose that the doctor believes that this disease is related to blood circulation. He suggests that some nonlinear predicate function called “Blood Circulation Index”
(the difference between systolic and diastolic blood pressures times heartbeat frequency) somehow expresses this disease. While demonstrating patients with disease \(y=1\) to the student, the doctor recommends that the student pay attention to this index. The student observes \(\ell \) patients (described by vectors \(x_i,~i=1,\ldots ,\ell \)) and constructs vector \(\varPhi \) with coordinates \(\varPhi (x_i)=\psi (x_i),~i=1,\ldots ,\ell \). The expression for the corresponding invariant has the form
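A minimal sketch of Example 2; the patient vectors and the ordering of their coordinates are hypothetical.

```python
import numpy as np

# Hypothetical patient vectors x = (systolic, diastolic, heart_rate).
X = np.array([[130.0, 85.0, 72.0],
              [150.0, 95.0, 88.0],
              [120.0, 80.0, 60.0]])
Y = np.array([1.0, 1.0, 0.0])

# "Blood Circulation Index": (systolic - diastolic) * heartbeat frequency.
def psi(x):
    return (x[0] - x[1]) * x[2]

Phi = np.array([psi(x) for x in X])

# Right-hand side of the invariant (1/ell) Phi^T P_ell = (1/ell) Phi^T Y.
ell = len(Y)
rhs = Phi @ Y / ell
print(Phi, round(float(rhs), 2))
```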
The next invariants are inspired by machine learning techniques. In order to use these techniques, we do not have to introduce an explicit form of the predicate function. Instead, we introduce an algorithm for computing (using training data) the value of the predicate function at any point of interest.
Example 3
(local structure predicate) Consider the following predicate-function. For a point \(x'\), function \(\psi _\rho (x')\) computes the expectation of the vectors of the first class in the sphere S with center \(x'\) and given radius \(\rho \)
Then the value
defines the local characteristics of the conditional probability function \(P(y=1|x)\).
Let us compute, for any vector \(x_i\) of the training set, the following characteristic: the number of examples of the first class that belong to the sphere of radius \(\rho \) with center \(x_i\). Let this value be \(\psi _\rho (x_i)\). Consider vector \(\varPhi _\rho =(\psi _\rho (x_1),\ldots ,\psi _\rho (x_\ell ))^T\) to define the invariant. Choosing different values of radius \(\rho \), one constructs different local characteristics of the desired function.
Invariants with vectors \(\varPhi _{\rho _k},~k=1,\ldots ,n\) provide a description of the structure of the conditional probability function \(P(y=1|x)\). They can be taken into account when one estimates \(P(y=1|x)\).
Example 3a
Consider the following values \(\psi _k(x_i)\). For any vector \(x_i\) of the training set, we compute the number of vectors of the first class among the k nearest neighbors of vector \(x_i\). In order to construct invariants, we use the vectors
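Example 3a's predicate can be sketched as follows; the two-cluster data and the choice k = 5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (hypothetical): two Gaussian clusters in the plane.
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)),
               rng.normal(3.0, 1.0, (20, 2))])
Y = np.array([1] * 20 + [0] * 20)

def psi_k(X, Y, k):
    """psi_k(x_i): number of first-class points among the k nearest
    neighbors of x_i (the point itself excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]            # indices of k nearest neighbors
    return Y[nn].sum(axis=1)

Phi_5 = psi_k(X, Y, k=5)   # one invariant vector; varying k gives others
print(Phi_5[:5], Phi_5[-5:])
```

Each choice of k yields a different invariant vector \(\varPhi _k\), analogous to the different radii \(\rho \) in Example 3.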
Example of a simple LUSI algorithm
Given data
consider the following learning method based on predicates \(\psi _k(x)\) (for example, those defined in Example 3a).
 Step 0. :

Construct vSVM (or SVM) estimate of conditional probability function (see Sects. 5.5 and 6.3).
 Step 1. :

Find the maximal disagreement value \({{{\mathcal {T}}}}_s\) (108) for vectors
$$\begin{aligned} \varPhi _k=(\psi _k(x_1),\ldots ,\psi _k(x_\ell ))^T, \quad k=1,\ldots ,s,\ldots . \end{aligned}$$  Step 2. :

If the value \({{{\mathcal {T}}}}_s\) is large, i.e.,
$$\begin{aligned} {{{\mathcal {T}}}}_s=\max ({{{\mathcal {T}}}}_1,\ldots ,{{{\mathcal {T}}}}_s,\ldots )>\delta , \end{aligned}$$add the invariant (107) with \(\varPhi _s\); otherwise stop.
 Step 3. :

Find the new approximation of the conditional probability function and go to Step 1.
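The loop of Steps 0–3 can be sketched with a simplified stand-in estimator: plain kernel ridge regression with equality constraints replaces the paper's SVM/vSVM estimators, and the kernel, the threshold \(\delta \), and the candidate predicates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: x uniform on [0,1], true P(y=1|x) = x (hypothetical data).
ell = 60
x = rng.uniform(0.0, 1.0, ell)
Y = (rng.uniform(0.0, 1.0, ell) < x).astype(float)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # RBF kernel matrix
gamma = 1.0                                          # regularization constant

def fit(K, Y, Phis, gamma):
    """Steps 0/3 stand-in: f = K a minimizing |Ka - Y|^2 + gamma |a|^2,
    subject to Phi_s^T K a = Phi_s^T Y for each chosen invariant vector."""
    n = len(Y)
    H = K.T @ K + gamma * np.eye(n)
    if not Phis:
        return K @ np.linalg.solve(H, K.T @ Y)
    B = np.array([p @ K for p in Phis])              # constraint rows Phi^T K
    m = len(Phis)
    kkt = np.block([[H, B.T], [B, np.zeros((m, m))]])
    rhs = np.concatenate([K.T @ Y, np.array([p @ Y for p in Phis])])
    return K @ np.linalg.solve(kkt, rhs)[:n]

# Candidate predicate vectors: zeroth- and first-order moments.
candidates = [np.ones(ell), x.copy()]
Phis, delta = [], 1e-3

while candidates:                                    # Steps 1-3
    f = fit(K, Y, Phis, gamma)
    ts = [abs(p @ (f - Y)) / ell for p in candidates]  # disagreement values T_s
    s = int(np.argmax(ts))
    if ts[s] <= delta:                               # Step 2: all invariants agree
        break
    Phis.append(candidates.pop(s))                   # add the worst-violated invariant

f = fit(K, Y, Phis, gamma)
resid = max((abs(p @ (f - Y)) / ell for p in Phis), default=0.0)
print(len(Phis), f"{resid:.2e}")
```

The equality-constrained minimum is obtained from the KKT system; every added invariant is then satisfied by the refitted approximation up to numerical precision.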
Conclusion
In this paper, we introduced the LUSI paradigm of learning which, in addition to the standard data-driven mechanism of minimizing risk, leverages an intelligence-driven mechanism of preserving statistical invariants (constructed using training data and given predicates). In this new paradigm, one first selects (using invariants) an admissible subset of functions which contains the desired solution, and then chooses the solution using standard training procedures.
The important properties of LUSI are as follows: if the number \(\ell \) of observations is sufficiently large, then (1) the admissible subset of functions always contains a good approximation to the desired solution regardless of the number of invariants used, and (2) the approximation to the desired solution is chosen by methods that provide global minima of the guaranteed risk^{Footnote 16}.
The LUSI method can be used to increase the accuracy of the obtained solution and to decrease the number of necessary training examples.
Notes
 1.
The idea of the new approach is based on analysis of two mathematical facts:

(1)
Direct definition of conditional probability and regression functions (Sect. 2.2).

(2)
Existence of both strong and weak modes of convergence in Hilbert space, which became the foundation for two different mechanisms of generalization: the classical datadriven mechanism and the new intelligencedriven mechanism (Sect. 6).

 2.
The standard definition of the conditional probability function for continuous x is as follows. Let the probability distribution be defined on pairs (x, y). If y takes discrete values from \(\{0,1,\ldots ,k\}\), the conditional probability \(P(y=t|x)\) of \(y=t\) given the vector x is defined as the ratio of two density functions \(p(y=t,x)\) and p(x):
$$\begin{aligned} P(y=t|x)=\frac{p(y=t,x)}{p(x)},\quad t\in \{0,1,\ldots ,k\}. \end{aligned}$$  3.
In Sect. 5.3, we consider a more general set of functions \(f(x)=A^T{{{\mathcal {K}}}}(x)+c\), where c is a constant. In this introduction, in order to simplify the notations, we set \(c=0\).
 4.
In other words, if one wants to find a rule for identification of ducks, the first thing one has to do is to find a set of rules that do not contradict the basic “duck test” of identification, i.e., birds that look like a duck, swim like a duck, and quack like a duck. Then one has to select the rule for identification of ducks within this set of rules.
 5.
In Vapnik and Izmailov (2017), we showed that, in data-driven estimates, \(\ell \) examples \((x_i,y_i), i=1,\ldots ,\ell \) can provide no more than \(\ell \) bits of information. However, using one invariant in intelligence-driven estimates, \(\ell \) examples can provide more than \(\ell \) bits of information (Sect. 6.3).
 6.
The definition of function simplicity is not trivial. In order to formalize the concept of simplicity for a set of functions, K. Popper introduced the concept of falsifiability of the set of functions (Popper 1934). However, the mathematical formalization of his idea contained an error. The corrected formalization of the falsifiability concept leads to the concept of VC dimension—see Corfield et al. (2005, 2009) for details.
 7.
 8.
Note that this idea of solving ill-posed problems is the same as in structural risk minimization in VC theory. In both cases, a structure is defined on the set of functions. When solving well-posed problems, elements of the structure should have finite VC dimension. When solving ill-posed problems, elements of the structure should be compact sets.
 9.
Recall that almost all the points of a high-dimensional ball are close to its surface.
 10.
Instead of (107), it is more accurate to consider the inequality constraints
$$\begin{aligned} -\delta {{{\mathcal {T}}}}\le \varPhi _s^TKA+c\varPhi ^T_s1_\ell -\varPhi _s^TY\le \delta {{{\mathcal {T}}}}, \quad s=1,\ldots ,m. \end{aligned}$$In this case, one has to solve the following quadratic optimization problem: minimize (80) (or (92)) subject to these inequality constraints.
 11.
 12.
These functions define the values of zeroth-order and first-order moments of the conditional probability function \(P(y=1|x)\).
 13.
This is the same methodology that physicists use: “Find a situation (the box \({{{\mathcal {B}}}}\) in Fig. 2) where the existing model of Nature (the approximation \(P_n(y=1|x)\)) contradicts the reality (contradicts data \((x_i,y_i)\) from the box \({{{\mathcal {B}}}}\)), and then fix the model (obtain a new approximation \(P_{n+1}(y=1|x)\)) that does not have the contradictions.” Note that the most difficult part of model refinement is to find a contradictory situation.
 14.
The choice of the best position of the box can be made algorithmically.
 15.
The Student can choose any appropriate weights.
 16.
The idea of two-stage learning is also realized in deep neural networks (DNN), where, at the first stage (using “deep architecture”), an appropriate network is constructed, and then, at the second stage, using standard NN training procedures, the solution is obtained. DNN, however, cannot guarantee either that the constructed network contains a good approximation to the desired function or that it can find the best solution for the given network.
References
Corfield, D., Schölkopf, B., & Vapnik, V. (2005). Popper, falsification and the VCdimension. Technical Report 145, Max Planck Institute for Biological Cybernetics.
Corfield, D., Schölkopf, B., & Vapnik, V. (2009). Falsificationism and statistical learning theory: Comparing the Popper and Vapnik–Chervonenkis dimensions. Journal for General Philosophy of Science, 40(1), 51–58.
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Feb 2018.
Popper, K. (1934). The logic of scientific discovery. London: Hutchinson.
Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. Washington: W.H. Winston.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. (in Russian).
Vapnik, V., & Izmailov, R. (2017). Knowledge transfer in SVM and neural networks. Annals of Mathematics and Artificial Intelligence, 81(1–2), 3–19.
Vapnik, V., & Stefanyuk, A. (1978). Nonparametric methods for estimating probability densities. Automation and Remote Control, 8, 38–52.
Wapnik, W., & Tscherwonenkis, A. (1979). Theorie der Zeichenerkennung. Berlin: Akademie-Verlag.
Wigner, E. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13(1), 1–14.
Acknowledgements
This material is based upon work partially supported by AFRL under contract FA9550-15-1-0502. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. We thank the anonymous reviewers, S. Sugrim, and P. Toccaceli for their careful reading of our paper and their remarks, which helped to improve the paper’s readability.
Editors: Alexander Gammerman, Vladimir Vovk, Henrik Boström, Lars Carlsson.
Appendices
Appendix 1: Invariants with respect to linear transformations
Let us estimate the conditional probability function of the form
Consider the linear transformation \(x'={{{\mathcal {U}}}}x\) of elements \(x\in X\) into elements \(x'\in X\), and let \({{{\mathcal {U}}}}^{-1}x'\) be the inverse transformation. Suppose that our goal is to construct a function for which \(P(y=1|x)=P(y=1|{{{\mathcal {U}}}}^{-1}x)\). For such a function,
Consider the functional
By changing variables \(x={{{\mathcal {U}}}}^{-1}x'\), we obtain
Examples of transformation.

(1)
Shift by vector h
$$\begin{aligned} R=\int P(y=1|x-h)\psi (x)dP(x)= \int P(y=1|x)\psi (x+h)dP(x+h). \end{aligned}$$ 
(2)
Similarity transformation \(z={{{\mathcal {A}}}}x=(a_1x^1,\ldots ,a_nx^n)\), where \(a_1,\ldots ,a_n\) are parameters of coordinate transformation.
$$\begin{aligned} R=\int P(y=1|{{{\mathcal {A}}}}^{-1}x)\psi (x)dP(x)= |{{{\mathcal {A}}}}|\int P(y=1|x)\psi (\mathcal{A} x)dP(\mathcal{A} x), \end{aligned}$$where \(|{{{\mathcal {A}}}}|=a_1a_2\cdots a_n\) (n is the dimensionality of X space).

(3)
Rotation transformation. Let \({{{\mathcal {R}}}}x\) be a rotation transformation. For this transformation, \(|{{{\mathcal {R}}}}|=1\). Then
$$\begin{aligned} R=\int P(y=1|{{{\mathcal {R}}}}^{-1}x)\psi (x)dP(x)=\int P(y=1|x)\psi (\mathcal{R}x)dP(\mathcal{R} x). \end{aligned}$$
Consider the invariants defined by predicate-functions \(\psi _s(x),~s=1,\ldots ,m\)
and suppose we would like to preserve this invariant for linearly transformed vectors \({{{\mathcal {U}}}}^{1}x\). That is, according to (125), along with invariant (126), we would like to preserve the following invariant
We can rewrite this equation as
where \(\psi _s({{{\mathcal {U}}}}x)\) is the corresponding predicatefunction.
To find the approximation of a function that preserves both the statistical invariants defined by predicate-functions \(\psi _s(x)\) and the invariants defined by the same predicates with geometrically transformed vectors \(\psi _s(\mathcal{U}x)\), we have to minimize the functional (80) subject to constraints
and constraints
where \(\varPhi ^{{{\mathcal {U}}}}_s=\left( \psi \left( \mathcal{U}x_1\right) ,\ldots ,\psi \left( {{{\mathcal {U}}}}x_\ell \right) \right) ^T\).
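The doubled set of constraints can be sketched as follows; the predicate \(\psi \) and the similarity transformation \({{{\mathcal {U}}}}\) below are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training data in the plane.
ell = 30
X = rng.normal(0.0, 1.0, (ell, 2))
Y = rng.integers(0, 2, ell).astype(float)

def psi(x):                       # some predicate function (hypothetical)
    return x[0] ** 2 + x[1]

U = np.diag([2.0, 0.5])           # similarity transformation: (a1 x^1, a2 x^2)

Phi   = np.array([psi(x) for x in X])        # invariant vector for psi(x)
Phi_U = np.array([psi(U @ x) for x in X])    # invariant vector for psi(Ux)

# Minimization of (80) is then subject to BOTH sets of constraints:
#   (1/ell) Phi^T   P_ell = (1/ell) Phi^T   Y,
#   (1/ell) Phi_U^T P_ell = (1/ell) Phi_U^T Y.
print(Phi[:3].round(3), Phi_U[:3].round(3))
```

Since \(\psi ({{{\mathcal {U}}}}x)\) is generally a different function of x than \(\psi (x)\), the transformed invariant adds a genuinely new constraint.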
Appendix 2: Pattern recognition in rebalanced environment
The classical setting of pattern recognition problem given in Sect. 1.3 considers the situation when generator G of vectors x is the same in both training and test sessions. One of the important modifications of this setting is the problem of obtaining a classification rule for generator G that is rebalanced for the test session.
Suppose that the new generator of random events x is such that the conditional probabilities \(P(y=k\mid x),~k=1,\ldots ,n\) remain the same for both training and test data. However, the probabilities of classes \(p_*(y=k)\) in the test session are different from the probabilities of classes \(p(y=k)\) in the training session. The probabilities \(p_*(y=k),~k=1,\ldots ,n\) are given, and \(\sum _kp_*(y=k)=1\). The goal is, by using the data generated by P(x), to find the rule of classification of vectors x produced by the generator
that was constructed using the given probabilities \(p_*(y=k)\). We call this replacement of the distribution function P(x) of the training session with the function \(P_*(x)\) of the test session the rebalancing of data (environment).
The necessity of rebalancing the data appears, for example, when one tries to classify a rare disease (with small p) using real observations. In order to control the value of the error in the classification of rare diseases, one can rebalance the probability of the class of interest by increasing the value of p. As we will see, this leads to a special estimate of the V-matrix.
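One natural reading of this rebalancing is an importance reweighting of each training point. The weights \(a(y_i)=p_*(y_i)/{\hat{p}}(y_i)\) below are an assumption consistent with the diagonal matrix S introduced later in this appendix, not a formula taken from the text.

```python
import numpy as np

# Training labels: the rare class y=1 appears with small empirical probability.
Y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

p_hat = np.array([np.mean(Y == 0), np.mean(Y == 1)])   # class frequencies in training
p_star = np.array([0.5, 0.5])                          # desired test-session balance

# Assumed rebalancing weight a(y_i) = p_*(y_i) / p_hat(y_i) per training point.
a = p_star[Y] / p_hat[Y]

# Diagonal matrix S with s_ii = a(y_i), as used in the rebalanced V-matrix.
S = np.diag(a)
print(a.round(4))
```

With these weights, the total weighted mass of the rare class equals that of the frequent class, mimicking the rebalanced test environment.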
Vmatrix for rebalanced pattern recognition problem
For simplicity, we consider the two-class classification problem: suppose we are given training data
defined by the probability distribution function
However, our goal is, using data (128), to construct a rule for classification in the environment defined by the probability distribution function
which is different from (129). While data (128) are generated according to the distribution function \(P(y, x) = P(y|x)P(x)\), we would like to estimate the conditional probability function for the distribution \(P_*(y,x)=P(y|x)P_*(x)\).
Let \(\ell (1)\) and \(\ell (0)\) be the numbers of elements of training data (128) belonging to the class \(y = 1\) and to the class \(y = 0\), respectively. Consider the empirical estimate of the cumulative distribution function for the elements of the first class
and for the elements of the second class
Using (131) and (132), we construct rebalanced cumulative distribution function (129)
The estimate of probability function \(P_{emp}(x, y = 1)\) is
We introduce the notations
In order to estimate the desired conditional probability function for rebalanced data, we use the corrected estimates \(P_{emp}(x)\) and \(P_{emp}(y=1,x)\) instead of P(x) and \(P(y=1,x)\) in (129). We obtain the equation
In order to keep our notations in matrix form, we define the diagonal matrix S with diagonal elements \(s_{ii}=a(y_i)\) (and \(s_{ij}=0\) if \(i\ne j\)). Solving Eq. (132) with the regularization method described in Sect. 5.3, we obtain a solution that differs from the one described in Sect. 5.4 only in the form of its V-matrix. For the rebalanced solution, the \(V_r\)-matrix has the form
where V is a standard estimate of Vmatrix and matrix S defines the rebalancing effect. In order to preserve the invariants for the rebalanced data, the solution has to satisfy the equalities
which are equivalent to the equalities
Therefore, in order to estimate the conditional probability function that maintains invariants (134), one has to minimize the functional
subject to constraints (134). This problem has a closed-form solution
where
Coefficients b, c and \(\mu _s\) are the solutions of the system of linear equations
Vapnik, V., Izmailov, R. Rethinking statistical learning theory: learning using statistical invariants. Machine Learning 108, 381–423 (2019). https://doi.org/10.1007/s10994-018-5742-0
Keywords
 Intelligent teacher
 Privileged information
 Support vector machine
 Neural network
 Classification
 Learning theory
 Regression
 Conditional probability
 Kernel function
 Ill-posed problem
 Reproducing Kernel Hilbert space
 Weak convergence
Mathematics Subject Classification
 68Q32
 68T05
 68T30
 83C32