Typically, the statistical physics based computation of learning curves in supervised learning proceeds along the following steps:
1) A student and teacher scenario is defined, which parameterizes the target rule and fixes the complexity of the student hypothesis.

2) It is assumed that training examples and test instances are generated according to a specific input density, while target labels are provided by the teacher network.

3) The study of large systems in the thermodynamic limit allows one to describe the system in terms of relatively few macroscopic quantities or order parameters.

4) The outcome of stochastic training processes is interpreted as a formal thermal equilibrium, in which thermal averages can be considered.

5) An additional disorder average over a randomly generated set of training data is performed in order to obtain typical results independent of the actual training set.
The following sections illustrate the above points in the context of learning a linearly separable rule [20,21,22], before two concrete example scenarios are analysed in Sect. 2.6.
2.1 Learning a Linearly Separable Rule: Student and Teacher
We consider the supervised learning of a linearly separable classification of N-dimensional data. In our model, the target rule is defined through a teacher perceptron with a fixed weight vector \(\mathbf {w^*} \in \mathbb {R}^N\) and output
$$\begin{aligned} S^*(\boldsymbol{\xi }) = \mathrm {sign}\left[ \mathbf {w^*}\cdot \boldsymbol{\xi }\right] = \pm 1 \text{ for } \text{ any } \boldsymbol{\xi } \in \mathbb {R}^N. \end{aligned}$$
(1)
Here, the feature vector \(\boldsymbol{\xi }\) represents N numerical inputs to the system and \(S^*\) corresponds to the correct output. The teacher weight vector parametrizes an \((N-1)\)-dimensional hyperplane which separates positive from negative responses.
We note that the norm \(|\mathbf {w^*}|\) of the weights is irrelevant for the perceptron response (1). Throughout the following, we therefore consider normalized teacher weights with \(\mathbf {w^*}\cdot \mathbf {w^*}=N.\)
In the learning scenario, information about the rule is only available in the form of a data set which comprises P examples:
$$\begin{aligned} \mathbb {D} = \left\{ \boldsymbol{\xi }^\mu , S^*(\boldsymbol{\xi }^\mu ) \right\} _{\mu =1,2,\ldots , P}. \end{aligned}$$
(2)
Here we assume that the labels \(S^{*\mu }=S^*(\boldsymbol{\xi }^\mu )\) provided in \(\mathbb {D}\) are reliable and represent the rule (1) faithfully. We refrain from considering corruption by different forms of noise, for simplicity, and refer the reader to the literature for the corresponding extensions of the analysis [20, 21].
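As an illustration, the teacher scenario of Eqs. (1) and (2) with the i.i.d. inputs of Eq. (4) can be sketched in a few lines of Python; the dimension N = 100, the set size P = 500 and the seed are arbitrary illustration values, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 500  # illustrative sizes

# teacher weights, normalized such that w* . w* = N, cf. Eq. (1)
w_star = rng.standard_normal(N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)

# data set D of Eq. (2): i.i.d. input components with zero mean
# and unit variance, cf. Eq. (4), labelled reliably by the teacher
xi = rng.standard_normal((P, N))
labels = np.sign(xi @ w_star)  # S*(xi^mu) = sign[w* . xi^mu]
```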
A second simple perceptron serves as the student network in our model. Its adaptive weights \(\mathbf {w}\in \mathbb {R}^N\) parameterize a linearly separable function
$$\begin{aligned} S(\boldsymbol{\xi }) = \mathrm {sign}\left[ \mathbf {w}\cdot \boldsymbol{\xi }\right] . \end{aligned}$$
(3)
The weight vector \(\mathbf {w}\) is chosen in a data-driven training process which is based on the available data \(\mathbb {D}\) and corresponds to the student hypothesis about the unknown target. As a consequence of the invariance
$$ \mathrm{sign} \left[ \, (\lambda \mathbf {w})\cdot \boldsymbol{\xi }\right] = \mathrm{sign}[\, \mathbf {w}\cdot \boldsymbol{\xi }] \text{ for } \text{ arbitrary } \lambda >0$$
we will also consider normalized student weights with \(\mathbf {w}\cdot \mathbf {w}=N\) in the following.
2.2 The Density of Input Data
In realistic learning situations it is expected that the density of input features is correlated with the actual task to a certain extent. In real-world classification problems, for instance, one would expect a more or less pronounced cluster structure which already reflects the class memberships. Clustered or more generally structured input densities have been considered in the statistical physics literature, see [26] for a recent discussion and further references. Here, however, we follow the most frequent approach and resort to the simplifying assumption of an isotropic input density which generates input vectors independently. In a sense, this constitutes a worst case in which the only information about the target rule is contained in the assigned training labels \(S^*(\boldsymbol{\xi })\), while no gap or region of low density in feature space marks the class boundaries.
Specifically, we assume that components of example vectors \(\boldsymbol{\xi }^\mu \) in \(\mathbb {D}\) consist of independent, identically distributed (i.i.d.) random quantities with means and covariances
$$\begin{aligned} \left\langle \xi _j^\mu \right\rangle \, =\, 0, \left\langle \xi _j^\mu \xi _k^\nu \right\rangle \, = \, \delta _{\mu \nu } \, \delta _{jk} \end{aligned}$$
(4)
with the Kronecker symbol \(\delta _{mn}\!=\! 1\) if \(m\!=\! n\) and \(\delta _{mn}\!=\! 0\) otherwise.
2.3 Generalization Error and the Perceptron Order Parameter
The performance of a given weight vector \(\mathbf {w}\) in the student teacher model can be evaluated with respect to a test input \(\boldsymbol{\xi } \notin \mathbb {D}.\) If we assume that the test input follows the same statistics as the training examples, i.e.
$$\begin{aligned} \left\langle \xi _j\right\rangle \, =\, 0, \left\langle \xi _j \xi _k \right\rangle \, = \, \delta _{jk}, \end{aligned}$$
(5)
we can define the so-called generalization error as the expectation value
$$\begin{aligned} \epsilon _g (\mathbf {w},\mathbf {w^*}) = \left\langle \epsilon \left( S(\boldsymbol{\xi }), S^*(\boldsymbol{\xi })\right) \right\rangle \text{ where } \epsilon (S,S^*) = \left\{ \begin{array}{ll} 1 &{} \text{ if } S\ne S^* \\ 0 &{} \text{ else, } \\ \end{array} \right. \end{aligned}$$
(6)
serves as a binary error measure. Hence, the generalization error quantifies the probability for disagreement between student and teacher for a random input vector. It is instructive to work out \(\epsilon _g\) explicitly under the assumption of i.i.d. inputs. To this end, we consider the arguments of the threshold function in student and teacher perceptron:
$$ x= \mathbf {w}\cdot \boldsymbol{\xi }/\sqrt{N} \text{ and } x^* = \mathbf {w^*} \cdot \boldsymbol{\xi } / \sqrt{N}. $$
Assuming that the random input vector \(\boldsymbol{\xi }\) satisfies Eq. (5), x and \(x^*\) correspond to sums of N random quantities. By means of the Central Limit Theorem (CLT), their joint density is given by a two-dimensional Gaussian, which is fully specified by the first and second moments. These can be obtained immediately as
$$ \left\langle x \right\rangle = \left\langle x^* \right\rangle = 0, \left\langle x^2 \right\rangle = \frac{1}{N} \, \sum _{i,j} \, w_i \, w_j \, \left\langle \xi _i \xi _j \right\rangle \, = \ \frac{\mathbf {w}^2}{N} = 1, \left\langle (x^*)^2 \right\rangle = \frac{(\mathbf {w^*})^2}{N} = 1 $$
$$\begin{aligned} \text{ and } \left\langle x \, x^* \right\rangle = \frac{1}{N} \, \sum _{i,j}\, w_i \, w_j^* \, \left\langle \xi _i \xi _j \right\rangle \, = \ \frac{\mathbf {w}\cdot \mathbf {w^*}}{N} \equiv R, \end{aligned}$$
(7)
where we have exploited the normalization of weight vectors. The covariance \(\langle x x^* \rangle \) is given by the scalar product of student and teacher weights. The moments (7) fully specify the two-dimensional normal density \(P(x,x^*)\) and we obtain the generalization error as the probability of observing \(x x^* < 0\):
$$\begin{aligned} \epsilon _g (\mathbf {w},\mathbf {w^*}) \, = \, \left[ \int _{-\infty }^0 \int _{0}^{\infty } + \int _{0}^{\infty } \int _{-\infty }^0 \right] P(x,x^*) dx dx^* =\, \frac{1}{\pi } \, \arccos (R). \end{aligned}$$
(8)
This result can also be obtained immediately by an intuitive argument: The probability for a random vector \(\boldsymbol{\xi }\) to fall into one of the segments between the hyperplanes defined by \(\mathbf {w}\) and \(\mathbf {w^*}\) is directly given by \(\angle \left( \mathbf {w}, \mathbf {w}^* \right) / \pi \), which corresponds to the right hand side of Eq. (8).
In the following, the overlap \(R=\mathbf {w}\cdot \mathbf {w^*}/N\) plays the role of an order parameter. This macroscopic quantity summarizes essential properties of the N microscopic degrees of freedom, i.e. the adaptive student weights \(w_j.\) It is also the central quantity in the following analysis of the training outcome.
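Since Gaussian inputs make x and \(x^*\) exactly normal for any N, the relation \(\epsilon _g = \arccos (R)/\pi \) of Eq. (8) can be checked directly by Monte Carlo sampling. In the sketch below, the dimension, sample size, seed and prescribed overlap R = 0.6 are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_test = 50, 200_000  # illustrative sizes
R = 0.6                  # prescribed student-teacher overlap

# normalized teacher and student with w . w* = N R, cf. Sect. 2.1
w_star = np.zeros(N)
w_star[0] = np.sqrt(N)
w = np.sqrt(N) * np.array([R, np.sqrt(1 - R**2)] + [0.0] * (N - 2))

# empirical disagreement rate on random Gaussian test inputs
xi = rng.standard_normal((n_test, N))
empirical = np.mean(np.sign(xi @ w) != np.sign(xi @ w_star))
predicted = np.arccos(R) / np.pi  # Eq. (8)
```

For R = 0.6 the predicted error is about 0.295, and the empirical rate agrees within the statistical fluctuations of the sample.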
2.4 Training as a Stochastic Process and Thermal Equilibrium
The outcome of any practical training process will clearly depend on the actual choice of an algorithm and its parameters that is used to infer a suitable weight vector \(\mathbf {w}\) from a given data set \(\mathbb {D}\). Generically, the training process is guided by a cost function, such as the quadratic deviation of the student output from the target in regression systems or the number of incorrect responses in a classification problem.
Frequently, gradient based methods can be used for the optimization of continuous weights \(\mathbf {w}\in \mathbb {R}^N\), often incorporating some form of noise as in the popular stochastic gradient descent. The search for optimal weights in a discrete space with, e.g., \(\mathbf {w} \in \{-1,+1\}^N\) could be performed by means of a Metropolis Monte Carlo method, as an example.
The degree to which the system is forced to approach the actual minimum of the cost function is controlled implicitly or explicitly in the training algorithm. Example control parameters are the learning rate in gradient descent or the temperature parameter in Metropolis like schemes. In the statistical physics approach to learning, this concept is taken into account by considering a formal thermal equilibrium situation as outlined below.
In the context of the perceptron student teacher scenario we consider a cost function of the form
$$\begin{aligned} H(\mathbf {w}) = \sum _{\mu =1}^P \epsilon (S^\mu , S^{*\mu }) \text{ with } S^\mu =\mathrm{sign}[\mathbf {w}\cdot \boldsymbol{\xi }^\mu ], S^{*\mu }=\mathrm{sign}[\mathbf {w^*}\cdot \boldsymbol{\xi }^\mu ]. \end{aligned}$$
(9)
With the binary error measure of Eq. (6), the cost function represents the number of disagreements between student and teacher for a given data set.
Without referring to a particular training prescription we can describe the outcome of suitable stochastic procedures in terms of a Gibbs-Boltzmann density of weight vectors
$$\begin{aligned} P_{eq} (\mathbf {w}) = \frac{e^{-\beta H(\mathbf {w})}}{Z} \text{ with } Z=\int d\mu (\mathbf {w}) \,\, e^{-\beta H(\mathbf {w})}. \end{aligned}$$
(10)
It describes a canonical ensemble of trained networks in thermal equilibrium at formal inverse temperature \(\beta =1/T\). The cost function \(H(\mathbf {w})\) plays the role of the energy of state \(\mathbf {w}\) and the normalization Z is known as the partition function. The measure \(d\mu (\mathbf {w})\) is implicitly understood to incorporate restrictions of the N-dimensional integration such as the normalization \(\mathbf {w}^2=N\). Similarly, Z can be written as a sum over all possible weight configurations for systems with \(\mathbf {w}\in \{-1,+1\}^N.\)
In the limit \(\beta \rightarrow \infty , T\rightarrow 0\), only the ground state with minimal energy can be observed in the ensemble, as any other state has an exponentially smaller \(P_{eq}\). Conversely, for \(\beta \rightarrow 0, T\rightarrow \infty \), the energy becomes irrelevant and every state occurs with the same probability. In general, the parameter \(\beta \) controls the mean energy of the system, which can be written as a thermal average of the form
$$\begin{aligned} \left\langle H \right\rangle _\beta \, = \, \int d\mu (\mathbf {w}) \, H(\mathbf {w}) \, \frac{e^{-\beta H(\mathbf {w})}}{Z} \, = \, -\frac{\partial }{\partial \beta } \ln Z. \end{aligned}$$
(11)
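For a very small system with discrete weights, the identity (11) can be verified by exact enumeration, where Z reduces to a sum over all \(2^N\) configurations. The sketch below compares the direct thermal average of H with a finite-difference approximation of \(-\partial \ln Z/\partial \beta \); the system sizes, seed and inverse temperature are arbitrary illustration values:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, P = 8, 12            # tiny system, exact enumeration is feasible
w_star = np.ones(N)     # an Ising-type teacher
xi = rng.standard_normal((P, N))
S_star = np.sign(xi @ w_star)

def H(w):
    """Number of disagreements with the teacher, cf. Eq. (9)."""
    return np.sum(np.sign(xi @ w) != S_star)

# enumerate all 2^N weight configurations w in {-1,+1}^N
states = [np.array(s) for s in itertools.product([-1, 1], repeat=N)]
energies = np.array([H(w) for w in states])

def lnZ(beta):
    return np.log(np.sum(np.exp(-beta * energies)))

beta, h = 0.7, 1e-5
Z = np.sum(np.exp(-beta * energies))
mean_H = np.sum(energies * np.exp(-beta * energies)) / Z   # thermal average
deriv = -(lnZ(beta + h) - lnZ(beta - h)) / (2 * h)          # -d lnZ / d beta
```

Both quantities coincide up to the discretization error of the finite difference.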
Quite generally, thermal averages can be written as appropriate derivatives of the so-called free energy \(F=-\frac{1}{\beta } \ln Z\), which is also central to the following analysis. Introducing the microcanonical entropy S(E) we can rewrite
$$\begin{aligned} Z = \int dE \, e^{-\beta E + S(E)} \text{ where } S(E)\, = \, \ln \, \int d\mu (\mathbf {w}) \, \delta [H(\mathbf {w})-E] \end{aligned}$$
(12)
with the Dirac delta-function \(\delta [\ldots ].\) For large systems in the thermodynamic limit \(N\rightarrow \infty \) we assume that entropy and energy are extensive, i.e. that \(S=N\, s\) and \(E=N\, e\) with \(e,s=\mathcal{O}(1)\). A saddle-point integration yields
$$\begin{aligned} \lim _{N\rightarrow \infty } \, \left( {-\ln Z}/{N} \right) \, = \, \beta F/N \,=\, \beta \, e - s(e) \end{aligned}$$
(13)
where the right hand side \(\beta F/N\) has to be evaluated in its minimum with respect to e for a given \(\beta \).
2.5 Disorder Average and High-Temperature Limit
The consideration of a formal thermal equilibrium in the previous section refers to a particular data set \(\mathbb {D}\), since the energy function \(H(\mathbf {w})\) is defined with respect to the given example data. In order to obtain typical results independent of the particularities of a specific data set, an additional average over randomly generated \(\mathbb {D}\) has to be performed.
In the simplest case, we consider data sets which comprise P independent vectors \(\boldsymbol{\xi }^\mu \) with i.i.d. components that obey (4). Hence the corresponding density factorizes over examples \(\mu =1,2,\ldots P\) and components \(j=1,2,\ldots N\) of the feature vectors in \(\mathbb {D}.\)
The randomness in \(\mathbb {D}\) can be interpreted as an external disorder which determines the actual energy function \(H(\mathbf {w})\) and the corresponding thermal equilibrium. In addition to the thermal average discussed in the previous section, the associated quenched average is denoted as \(\langle \ldots \rangle _{\mathbb {D}}.\) Quantities of interest have to be studied in terms of appropriate averages of the form \(\langle \langle \ldots \rangle _\beta \rangle _{\mathbb {D}}\) which can be derived from the quenched free energy
$$ \left\langle F \right\rangle _{\mathbb {D}} = - \, \left\langle \ln Z \right\rangle _{\mathbb {D}} \big / \beta .$$
The computation of \(\left\langle \ln Z \right\rangle _{\mathbb {D}}\) is, in general, quite involved and requires the application of sophisticated methods such as the replica trick [1, 15, 20,21,22].
We refrain from discussing the conceptual difficulties and mathematical subtleties of the replica approach. Instead we resort to a very much simplifying limit, which has been presented and discussed in [22]. In the extreme setting of learning at high formal temperature with \(\beta \rightarrow 0,\) the so-called annealed approximation
$$ \left\langle \ln Z \right\rangle _{\mathbb {D}} \approx \ln \left\langle Z \right\rangle _{\mathbb {D}} $$
becomes exact and can be exploited to obtain the typical training outcome [20,21,22]. Note that in this limit also
$$\begin{aligned} \left\langle Z \right\rangle _{\mathbb {D}}&= \left\langle \int d\mu (\mathbf {w}) \, e^{-\beta H(\mathbf {w})} \right\rangle _{\mathbb {D}} = \int d\mu (\mathbf {w}) \, e^{-\beta \langle H(\mathbf {w}) \rangle _{\mathbb {D}}} \text{ with } \nonumber \\ \langle H(\mathbf {w}) \rangle _{\mathbb {D}}&= \sum _{\mu =1}^P \, \left\langle \epsilon (S^\mu ,S^{*\mu }) \right\rangle _{\mathbb {D}} = P \, \epsilon _g. \end{aligned}$$
(14)
Here we make use of the fact that the i.i.d. random examples in \(\mathbb {D}\) contribute the same average error which is given by \(\epsilon _g.\) It is expressed as a function of the order parameter R in Eq. (8). We can now perform a saddle point integration in analogy to Eqs. (12, 13) to obtain
$$\begin{aligned} \lim _{N\rightarrow \infty } (-\ln \langle Z \rangle _{\mathbb {D}} / N ) \,=\, \beta \langle F\rangle _{\mathbb {D}}/N \, = \frac{\beta P}{N} \epsilon _g (R) - s(R). \end{aligned}$$
(15)
Again, the right hand side has to be evaluated in its minimum, now with respect to the characteristic order parameter R of the system. The entropy term
$$\begin{aligned} s(R) = \frac{1}{N} \ln \int d\mu (\mathbf {w}) \delta [ \mathbf {w}\cdot \mathbf {w^*} - N R ] \end{aligned}$$
(16)
can be obtained analytically by an additional saddle point integration making use of the integral representation of the \(\delta \)-function [1, 20,21,22]. Since s(R) depends on potential constraints on the weight vectors as represented by \(d\mu (\mathbf {w})\), we postpone the computation to the following sections.
In order to obtain meaningful results from the minimization with respect to R in Eq. (15), we have to assume that the number of examples P scales like
$$\begin{aligned} P= \alpha \, N / \beta \text{ with } \alpha = \mathcal{O} (1). \end{aligned}$$
(17)
Obviously, P should be proportional to the number N of adaptive weights in the system, which is consistent with an extensive energy. In addition, P has to grow like \(\beta ^{-1}\) in the high temperature limit. The weak role of the energy in this limit has to be compensated for by an increased number of example data. In layman’s terms: “Almost nothing is learned from infinitely many examples”. This also makes plausible the identification of the energy with the generalization error. The space of possible input vectors is sampled so well that training set performance and generalization behavior become indistinguishable.
Finally, the quenched free energy per weight, \(f=\langle F\rangle _{\mathbb {D}} /N \) of the perceptron model in the high temperature limit has the form
$$\begin{aligned} \beta f \, = \, \alpha \, \epsilon _g(R) \, - \, s(R), \end{aligned}$$
(18)
where \(\alpha \) plays the role of an effective temperature parameter, which couples the number of examples and the formal temperature of the training process. These quantities cannot be varied independently within the simplifying limit \(\beta \rightarrow 0\) in combination with \(P/N \propto \beta ^{-1}\).
2.6 Two Concrete Examples
Despite the significant simplifications and scaling assumptions, it is possible to obtain non-trivial, interesting results also in the high temperature limit. Very often, more sophisticated approaches, such as the replica method or the annealed approximation for finite training temperatures, confirm the results for \(\beta \rightarrow 0\) qualitatively. Therefore, the simplified treatment has often been used to obtain first, useful insights into the qualitative properties of various learning scenarios. In this brief review, we restrict the discussion to two well-known results for simple model situations. Both concern the training of a simple perceptron in a student teacher scenario. Originally the models were treated in [22] and they have been revisited in several reviews, for instance, [20, 21]. We reproduce the results here as particularly illustrative examples for the statistical physics approach to learning.
The Perceptron with Continuous Weights
Here we consider a student teacher scenario where the student weight vector \(\mathbf {w}\in \mathbb {R}^N\) is normalized (\(\mathbf {w}^2=N\)) but otherwise unrestricted.
The generalization error as a function of the student teacher overlap R is given in Eq. (8). The corresponding entropy, Eq. (16), can be obtained by means of a saddle point integration. Alternatively, one can interpret \(e^{Ns}\) as the volume of an \((N-1)\)-dimensional hypersphere in weight space with radius \(\sqrt{1-R^2}\), see [21] for the geometrical argument. One obtains
$$\begin{aligned} s(R) = \frac{1}{2} \ln (1-R^2) + const., \end{aligned}$$
(19)
where the additive constant does not depend on R. Apart from such irrelevant terms, we obtain the quenched free energy in the limit \(\beta \rightarrow 0\) as
$$\begin{aligned} \beta f \, = \, \alpha \, \frac{1}{\pi } \arccos {R} - \frac{1}{2} \ln (1-R^2). \end{aligned}$$
(20)
In the absence of training data, \(\alpha =0\), the maximum of the entropy term at \(R=0\) governs the behavior of the system. In the high-dimensional feature space, the student weight vector is expected to be orthogonal to the unknown \(\mathbf {w^*}\).
The free energy is displayed in Fig. 1 for three different values of \(\alpha \). As \(\alpha \) is increased, we observe that the minimum of \(\beta f\) is found at larger, positive values of R, reflecting the knowledge about the rule as inferred from the set of examples.
The student teacher overlap \(R(\alpha )\) that corresponds to the minimum of \(\beta f\) is displayed in Fig. 2 (left panel). In this simple case, it can be obtained analytically from the necessary condition for the presence of a minimum:
$$\begin{aligned} \frac{\partial \beta f}{\partial R} = 0 \quad \Rightarrow \quad R(\alpha ) = \frac{\alpha }{\sqrt{\alpha ^2+\pi ^2}}. \end{aligned}$$
(21)
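The stationarity condition behind Eq. (21) can be made explicit by differentiating Eq. (20):
$$ \frac{\partial \, \beta f}{\partial R} \, = \, -\frac{\alpha }{\pi \sqrt{1-R^2}} \, + \, \frac{R}{1-R^2} \, = \, 0 \quad \Rightarrow \quad \frac{R}{\sqrt{1-R^2}} \, = \, \frac{\alpha }{\pi }, $$
and squaring and solving for R yields \(R(\alpha )=\alpha /\sqrt{\alpha ^2+\pi ^2}\).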
By means of Eq. (8) this result translates into a learning curve \(\epsilon _g (\alpha )\), which is shown in the right panel of Fig. 2. One can show that large training sets facilitate perfect generalization with
$$\begin{aligned} R(\alpha ) \approx 1 -\frac{\pi ^2}{2 \alpha ^2} \text{ and } \epsilon _g (\alpha ) \approx \frac{1}{\alpha } \text{ for } \alpha \rightarrow \infty . \end{aligned}$$
(22)
It is interesting to note that the basic asymptotic \(\alpha \)-dependences are recovered in the more sophisticated application of the annealed approximation or the replica formalism [22]. Obviously, an explicit temperature dependence and the correct prefactors cannot be obtained in the simplifying limit.
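As a plausibility check, the analytic overlap of Eq. (21) can be compared with a direct numerical minimization of the free energy (20) on a fine grid; the value \(\alpha = 2\) is an arbitrary illustration choice:

```python
import numpy as np

def beta_f(R, alpha):
    """Quenched free energy of Eq. (20), up to R-independent constants."""
    return alpha * np.arccos(R) / np.pi - 0.5 * np.log(1 - R**2)

alpha = 2.0  # illustrative value
R_grid = np.linspace(-0.999, 0.999, 200_001)
R_num = R_grid[np.argmin(beta_f(R_grid, alpha))]
R_analytic = alpha / np.sqrt(alpha**2 + np.pi**2)  # Eq. (21)
```

The grid minimum agrees with the analytic result up to the grid resolution.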
The Perceptron with Discrete Weights
As an interesting exercise we also revisit the model with discrete student weights [22]. The term Ising perceptron has been coined for the model with weights \(\mathbf {w}\in \{-1,1\}^N\) [21, 22]. Note that the assumed normalization \(\mathbf {w}^2=N\) is trivially satisfied. Moreover, the generalization error is also given by Eq. (8) since its derivation does not depend on details of the weight space.
The corresponding entropy can be obtained by a simple counting argument: In order to obtain an overlap \(\sum _{j} w_j w_j^*=NR\), a number of \(N (1+R)/2\) components must satisfy \(w_j=w_j^*\), while for the remaining \(N(1-R)/2\) components we have \(w_j=-w_j^*\). The associated entropy of mixing is given by the familiar form
$$\begin{aligned} s(R) \, = \, - \left( \frac{1+R}{2}\right) \ln \left( \frac{1+R}{2}\right) - \left( \frac{1-R}{2}\right) \ln \left( \frac{1-R}{2}\right) . \end{aligned}$$
(23)
The resulting free energy (18) as a function of R is displayed in Fig. 3 for three different values of \(\alpha \).
For all \(\alpha >0\), \(\beta f\) displays a local minimum at \(R=1\) with \(f(R=1)=0\). For small \(\alpha \), however, a deeper minimum can be found at an overlap \(0<R<1.\) This is exemplified for \(\alpha =1.5\) in the leftmost panel of Fig. 3. The global minimum of \(\beta f\) determines the thermodynamically stable state of the system.
For training sets with \(\alpha \) larger than a critical value \(\alpha _c\approx 1.69\), the state with \(R=1\) constitutes the global minimum. A competing configuration with \(R<1\) persists as a local minimum, but becomes unstable for \(\alpha > \alpha _d \approx 2.08\), see the center and rightmost panel of Fig. 3.
The learning curves \(R(\alpha )\) and \(\epsilon _g(\alpha )\) reflect the specific \(\alpha \)-dependence of \(\beta f\) in terms of a discontinuous phase transition. In Fig. 4, the solid lines mark the thermodynamically stable state in terms of \(R(\alpha )\) (left panel) and \(\epsilon _g(\alpha )\) (right panel). Dashed lines correspond to local minima of \(\beta f\) and the characteristic values \(\alpha _c\) and \(\alpha _d\) are marked by the dotted and solid vertical lines, respectively.
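The discontinuous transition can be reproduced by a simple numerical scan of the free energy (18) with the entropy of mixing (23). In the sketch below, the two values of \(\alpha \) are arbitrary choices on either side of \(\alpha _c\approx 1.69\):

```python
import numpy as np

def xlogx(x):
    """x ln x with the convention 0 ln 0 = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 1e-12, x * np.log(x + 1e-300), 0.0)

def beta_f(R, alpha):
    # Eq. (18): alpha * eps_g(R) - s(R), with eps_g from Eq. (8)
    # and the entropy of mixing s(R) from Eq. (23)
    return alpha * np.arccos(R) / np.pi + xlogx((1 + R) / 2) + xlogx((1 - R) / 2)

R = np.linspace(-1.0, 1.0, 400_001)
minima = {}
for alpha in (1.5, 1.8):  # below and above alpha_c ~ 1.69
    f = beta_f(R, alpha)
    i = np.argmin(f)
    minima[alpha] = (R[i], f[i])
```

For \(\alpha =1.5\) the global minimum lies at an intermediate overlap \(0<R<1\) with \(\beta f<0\), while for \(\alpha =1.8\) the perfectly generalizing state \(R=1\) with \(\beta f=0\) is thermodynamically stable, in accordance with the first-order transition described above.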
The essential findings of the high temperature treatment do carry over to the training at lower formal temperatures, qualitatively [21, 22]. Most notably, the system displays a freezing transition to perfect generalization. Furthermore, the first order phase transition scenario will have non-trivial effects in practical training. The existence of metastable, poorly generalizing states can delay the success of training significantly. Related hysteresis effects with varying \(\alpha \) have been observed in Monte Carlo simulations of the training process, see [21] and references therein.