In Sect. 2.1, we introduce learning vector quantization for classification tasks with emphasis on the well-established LVQ1 training scheme. We also present a model density of data which has previously been investigated in the mathematical analysis of LVQ training in stationary and specific non-stationary environments. Here, we extend the approach to the presence of virtual concept drift and consider weight decay as an explicit mechanism of forgetting.
Thereafter, Sect. 2.2 presents a student–teacher scenario for the learning of a regression scheme with shallow, layered neural networks of the feedforward type. Emphasis is on the comparison of two important types of hidden unit activations: traditional sigmoidal transfer functions and the popular rectified linear unit (ReLU) activation. We consider gradient-based training in the presence of real concept drift and also introduce weight decay as a mechanism of forgetting.
A unified description of the theoretical approach to analyse the training dynamics in classification and regression systems is given in Sect. 2.3.
Learning vector quantization
The family of LVQ algorithms is widely used for practical classification problems [13, 29, 30, 39]. The popularity of LVQ is due to a number of attractive features: It is straightforward to implement, very flexible and intuitive. Moreover, it constitutes a natural tool for multi-class problems. The actual classification scheme is very often based on Euclidean metrics or other simple measures, which quantify the distance of inputs or feature vectors from the class-specific prototypes. Unlike many other methods, LVQ facilitates direct interpretation of the classifier because prototypes are defined in the same space as the data [13, 39]. The approach is based on the idea of representing classes by more or less typical representatives of the training instances. This suggests that LVQ algorithms should also be capable of tracking changes in the density of samples, a hypothesis that has recently been studied, for instance, in [14, 25].
Nearest prototype classifier
In general, several prototypes can be employed to represent each class. However, we restrict the analysis to the simple case of only one prototype per class in binary classification problems. Hence we consider two prototypes \({\bf w}_k \in \mathbb{R}^N\) each representing one of the classes \(k\in \{1,2\}.\) Together with a distance measure \(d({\bf w},{\boldsymbol{\xi}}),\) the system parameterizes a Nearest Prototype Classification (NPC) scheme: Any given input \({\boldsymbol{\xi} } \in \mathbb{R}^N\) is assigned to the class \(k=1\) if \(d({\bf w}_1,{\boldsymbol{\xi} })< d({\bf w}_2,{\boldsymbol{\xi}})\) and to class 2, otherwise. In practice, ties can be broken arbitrarily.
A variety of distance measures have been used in LVQ, enhancing the flexibility of the approach even further [13, 39]. This includes the conceptually interesting use of adaptive metrics in relevance learning, see [13] and references therein. Here, we restrict our analysis to the simple (squared) Euclidean measure
$$\begin{aligned} d({\bf w}, {\boldsymbol{\xi} })= ({\bf w} - {\boldsymbol{\xi} })^2. \end{aligned}$$
(1)
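To make the scheme concrete, the following minimal numpy sketch (function names are our own choice) implements the squared Euclidean measure of Eq. (1) and the resulting NPC rule:

```python
import numpy as np

def squared_euclidean(w, xi):
    """Squared Euclidean distance d(w, xi) = (w - xi)^2, cf. Eq. (1)."""
    diff = w - xi
    return np.dot(diff, diff)

def npc_classify(w1, w2, xi):
    """Nearest Prototype Classification: assign xi to class 1 if it is
    strictly closer to prototype w1 than to w2, and to class 2 otherwise."""
    return 1 if squared_euclidean(w1, xi) < squared_euclidean(w2, xi) else 2
```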
We assume that the training procedure provides a stream of single examples [5]: At time step \(\mu \, = \, 1,2,\ldots ,\) the vector \({\boldsymbol{\xi} }^{\, \mu }\) is presented, together with its given class label \(\sigma ^\mu =1,2\). Iterative on-line LVQ updates are of the general form [12, 20, 54]
$$\begin{aligned} {\bf w}_k^\mu &= {\bf w}_k^{\mu -1} + \frac{\eta }{N} \, \Delta {\bf w}_k^\mu \quad \text{ with } \nonumber \\ \Delta {\bf w}_k^\mu &= f_k\left[ d_1^{\mu },d_2^{\mu },\sigma ^\mu ,\ldots \right] \left( {\boldsymbol{\xi}}^\mu - {\bf w}_k^{\mu -1}\right) \end{aligned}$$
(2)
where \(d_i^\mu = d({\bf w}_i^{\mu -1},{\boldsymbol{\xi} }^\mu )\) and the learning rate \(\eta\) is scaled with the input dimension N. The precise algorithm is specified by choice of the modulation function \(f_k[\ldots ]\), which depends typically on the Euclidean distances of the data point from the current prototype positions and on the labels \(k,\sigma ^\mu =1,2\) of the prototype and training example, respectively.
The LVQ1 training algorithm
A popular and intuitive LVQ training scheme was already suggested by Kohonen and is known as LVQ1 [29, 30]. Following the NPC concept, it updates only the currently closest prototype in a so-called Winner-Takes-All (WTA) scheme. Formally, the LVQ1 prescription for a system with two competing prototypes is given by Eq. (2) with
$$\begin{aligned} f_k[d_1^\mu ,d_2^\mu ,\sigma ^\mu ] \, = \Theta \left( d_{\widehat{k}}^\mu - d_{k}^\mu \right) \Psi (k,\sigma ^\mu ), \end{aligned}$$
(3)
where \(\widehat{k} = \left\{ \begin{array}{ll} 2 & \text{ if } k=1 \\ 1 & \text{ if } k=2, \end{array} \right. \text{ and } \Psi (k,\sigma )= \left\{ \begin{array}{ll} +1 & \text{ if } k=\sigma \\ -1 & \text{ else. } \end{array} \right.\)
Here, the Heaviside function \(\Theta (\ldots )\) singles out the winning prototype and the factor \(\Psi (k,\sigma ^\mu )\) determines the sign of the update: The WTA update according to Eq. (3) moves the prototype towards the presented feature vector if it carries the same class label \(k=\sigma ^\mu\). On the contrary, if the prototype is meant to represent a different class, its distance from the data point is increased even further. Note that LVQ1 cannot be interpreted as a gradient descent procedure of a suitable cost function in a straightforward way due to discontinuities at the class boundaries, see [12] for a discussion and references.
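A single LVQ1 step according to Eqs. (2) and (3) can be sketched as follows (a minimal sketch; the function name and interface are ours):

```python
import numpy as np

def lvq1_step(w, labels, xi, sigma, eta):
    """One on-line LVQ1 update according to Eqs. (2) and (3).

    w      : array of shape (2, N), the two prototype vectors
    labels : class labels of the prototypes, e.g. (1, 2)
    xi     : current example input, shape (N,)
    sigma  : class label of the example (1 or 2)
    eta    : learning rate, scaled with 1/N as in Eq. (2)
    """
    N = xi.shape[0]
    d = np.array([np.dot(w[k] - xi, w[k] - xi) for k in range(2)])
    winner = int(np.argmin(d))                      # Winner-Takes-All
    psi = 1.0 if labels[winner] == sigma else -1.0  # sign of the update
    w[winner] = w[winner] + (eta / N) * psi * (xi - w[winner])
    return w
```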
Numerous variants and modifications of LVQ have been presented in the literature, aiming at better convergence or classification performance, see [12, 13, 29, 39]. Most of these modifications, however, retain the basic idea of attraction and repulsion of the winning prototypes.
Clustered model data
LVQ algorithms are most suitable for classification schemes which reflect a given cluster structure in the data. In the modelling, we therefore consider a stream of random input vectors \({\boldsymbol{\xi} } \in \mathbb {R}^N\) which are generated independently according to a mixture of two Gaussians [12, 20, 54]:
$$\begin{aligned} P({\boldsymbol{\xi}}) &= {\textstyle \sum _{m=1,2}} \, p_m P({\boldsymbol{\xi}}\mid m) \quad \text{ with contributions } \nonumber \\ P({\boldsymbol{\xi}}\mid m) &= \frac{1}{(2\pi v_m)^{N/2}} \, \exp \left[ -\frac{1}{2 v_m} \left( {\boldsymbol{\xi}} - \lambda {\bf B}_m \right) ^2 \right] . \end{aligned}$$
(4)
The target classification coincides with the cluster membership, i.e., \(\sigma =m\) in Eq. (3). The class-conditional densities \(P({\boldsymbol{\xi} }\!\mid \!m\!=\!1,2)\) correspond to isotropic, spherical Gaussians with variance \(\, v_m\) and mean \(\lambda \, {\bf B}_m\). Prior weights of the clusters are denoted as \(p_m\) and satisfy \(p_1 + p_2 =1\). We assume that the vectors \({\bf B}_m\) are orthonormal with \({\bf B}_1^{\, 2}={\bf B}_2^{\, 2}=1\) and \({\bf B}_1 \cdot {\bf B}_2 =0\). Obviously, the classes \(m=1,2\) are not perfectly separable due to the overlap of the clusters.
We denote conditional averages over \(P({\boldsymbol{\xi }}\mid m)\) by \(\left\langle \cdots \right\rangle _m\), whereas mean values \(\langle \cdots \rangle = \sum _{m=1,2} \, p_m \, \left\langle \cdots \right\rangle _m\) are defined with respect to the full density (4). One obtains, for instance, the conditional and full averages
$$\begin{aligned} \left\langle {\boldsymbol{\xi}} \right\rangle _m &= \lambda \, {\bf B}_m, \quad \langle {\boldsymbol{\xi}}^{\, 2} \rangle _m = v_m \, N + \lambda ^2 \quad \text{ and } \nonumber \\ \langle {\boldsymbol{\xi}}^{\, 2}\rangle &= \left( p_1 v_1 + p_2 v_2 \right) N + \lambda ^2. \end{aligned}$$
(5)
Note that in the thermodynamic limit \(N\rightarrow \infty\) considered later, \(\lambda ^2\) can be neglected in comparison to the terms of \(\mathcal{{O}}(N)\) in Eq. (5).
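For illustration, examples from the mixture density (4) can be generated along the following lines (a sketch with our own naming; the orthonormal cluster centers are chosen here as two unit vectors of the standard basis):

```python
import numpy as np

def draw_example(rng, p1, lam, B1, B2, v1=1.0, v2=1.0):
    """Draw a single pair (xi, sigma) from the mixture density of Eq. (4):
    cluster m = 1, 2 is selected with prior p_m, and xi is Gaussian with
    mean lam * B_m and isotropic variance v_m."""
    m = 1 if rng.random() < p1 else 2
    center, v = (lam * B1, v1) if m == 1 else (lam * B2, v2)
    xi = center + np.sqrt(v) * rng.standard_normal(B1.shape[0])
    return xi, m

rng = np.random.default_rng(0)
N = 500
B1, B2 = np.eye(N)[0], np.eye(N)[1]   # orthonormal cluster centers
xi, sigma = draw_example(rng, p1=0.5, lam=1.0, B1=B1, B2=B2)
```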
Similar clustered densities have been studied in the context of unsupervised learning and supervised perceptron training; see, e.g., [4, 10, 35]. On-line LVQ in stationary situations was analysed in, e.g., [12].
Here, we focus on the question of whether LVQ learning schemes are able to cope with drift in characteristic model situations and whether extensions like weight decay can improve the performance in such settings.
Layered neural networks
The term Soft Committee Machine (SCM) has been established for shallow feedforward neural networks with a single hidden layer and a linear output unit, see for instance [2, 8, 9, 11, 26, 42, 44, 45, 49]. Its structure resembles that of a (crisp) committee machine with binary threshold hidden units, where the network output is given by their majority vote, see [4, 19, 53] and references therein.
The output of an SCM with K hidden units and fixed hidden-to-output weights is of the form
$$\begin{aligned} y({\boldsymbol{\xi} }) = \sum _{k=1}^K \, g({\bf w}_k \cdot {\boldsymbol{\xi} }) \text{ where } {\bf w}_k \in \mathbb {R}^N \end{aligned}$$
(6)
denotes the weight vector connecting the N-dimensional input layer with the k-th hidden unit. A non-linear transfer function \(g(\cdots )\) defines the hidden unit states and the final output is given as their sum.
As specific examples we consider the sigmoidal
$$\begin{aligned} g(x) = \mathrm{{erf}}\left( x/\sqrt{2}\right) \text{ with } g^\prime (x)= \sqrt{{2}/{\pi }} \,\, e^{-x^2/2} \end{aligned}$$
(7)
and the popular rectified linear unit (ReLU):
$$\begin{aligned} g(x) = x \, \Theta (x) \text{ with } g^\prime (x)= \, \Theta (x). \end{aligned}$$
(8)
The activation (7) closely resembles other sigmoidal functions, e.g., the more popular \(\tanh (x)\), but it facilitates the analytical treatment, as first exploited in [8]. In the following, we refer to an SCM with the above sigmoidal activation as Erf-SCM, for brevity.
Similarly, we use the term ReLU-SCM for networks with hidden unit states given by Eq. (8). The ReLU activation has recently gained significant popularity in the context of Deep Learning [22]. This is, among other reasons, due to its simplicity which offers computational ease and numerical stability. According to the literature, ReLU networks have displayed favorable training and generalization behavior in several practical applications and benchmark problems [18, 31, 34, 38, 40].
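As an illustration, the SCM output of Eq. (6) with the two activations (7) and (8) translates into a short numpy sketch (our naming; the erf function is taken from scipy):

```python
import numpy as np
from scipy.special import erf

def g_erf(x):
    """Sigmoidal activation g(x) = erf(x / sqrt(2)), cf. Eq. (7)."""
    return erf(x / np.sqrt(2.0))

def g_relu(x):
    """Rectified linear unit g(x) = x * Theta(x), cf. Eq. (8)."""
    return np.maximum(x, 0.0)

def scm_output(W, xi, g):
    """SCM output y(xi) = sum_k g(w_k . xi), cf. Eq. (6),
    with W a (K, N) array of hidden unit weight vectors."""
    return float(np.sum(g(W @ xi)))
```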
Note that an SCM, cf. Eq. (6), is not a universal approximator. However, this property could be achieved by introducing hidden-to-output weights and adaptive local thresholds \(\vartheta _i \in \mathbb {R}\) in hidden unit activations of the form \(g\left( {\bf w}_i\cdot {\boldsymbol{\xi} } -\vartheta _i\right)\), see [16]. Adaptive hidden-to-output weights have been studied in, for instance, [42] from a statistical physics perspective. Here, however, we restrict ourselves to the simpler model defined above and focus on basic dynamical effects and potential differences of ReLU- versus Erf-SCM in the presence of concept drift.
Regression scheme and on-line learning
The training of a neural network with real-valued output \(y({\boldsymbol{\xi}})\) based on examples \(\left\{ {\boldsymbol{\xi }}^\mu \in \mathbb {R}^N, \tau ^\mu \in \mathbb {R} \right\}\) for a regression problem is frequently guided by the quadratic deviation of the network output from the target values [15, 22, 23]. It serves as a cost function which evaluates the network performance with respect to a single example as
$$\begin{aligned} e^\mu \left( \{{\bf w}_k\}_{k=1}^K\right) = \frac{1}{2} \big ( y^\mu - \tau ^\mu \big )^2 \quad \text{ with } y^\mu = y({\boldsymbol{\xi}}^\mu ). \end{aligned}$$
(9)
In stochastic or on-line gradient descent, updates of the weight vectors are based on the presentation of a single example at time step \(\mu\)
$$\begin{aligned} {\bf w}_k^{\mu } = {\bf w}_k^{\mu -1} + \frac{\eta }{N} \, \Delta {\bf w}_k^{\mu } \text{ with } \Delta {\bf w}_k^\mu = \, - \, \frac{\partial e^\mu }{\partial {\bf w}_k} \end{aligned}$$
(10)
where the gradient is evaluated in \({\bf w}_k^{\mu -1}\). For the SCM architecture specified in Eq. (6), \(\partial y^\mu / {\partial {\bf w}_k} = g'\left( h_k^\mu \right) {\boldsymbol\xi }^\mu ,\) and we obtain
$$\begin{aligned} \Delta {\bf w}_k^{\mu } = - \left( \sum _{i=1}^K g\left( h_i^\mu \right) - \tau ^\mu \right) \, g^\prime \left( h_k^\mu \right) {\boldsymbol \xi }^\mu \end{aligned}$$
(11)
with the inner products \(h^\mu _i = {\bf w}_i^{\mu -1}\cdot {\boldsymbol \xi }^\mu\) of the current weight vectors with the next example input in the stream. Note that the change of weight vectors is proportional to \({\boldsymbol \xi }^\mu\) and can be interpreted as a form of Hebbian Learning [15, 22, 23].
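A sketch of the corresponding on-line update, Eqs. (10) and (11), might read as follows (names and interface are ours; the activation and its derivative enter as callables):

```python
import numpy as np

def scm_gradient_step(W, xi, tau, eta, g, g_prime):
    """One stochastic gradient descent step, cf. Eqs. (10) and (11).

    W          : (K, N) array of student weight vectors
    xi, tau    : current example input and real-valued target
    g, g_prime : hidden unit activation and its derivative
    """
    N = xi.shape[0]
    h = W @ xi                              # inner products h_k
    error = np.sum(g(h)) - tau              # deviation y - tau
    # Hebbian-like update: each correction is proportional to xi
    W -= (eta / N) * error * g_prime(h)[:, None] * xi[None, :]
    return W
```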
Student–teacher scenario and model data
In order to define and model meaningful learning situations, we resort to the consideration of student–teacher scenarios [4, 5, 19, 53].
We assume that the target can be defined in terms of an SCM with a number M of hidden units and a specific set of weights \(\left\{ {\bf B}_m \in \mathbb {R}^N \right\} _{m=1}^M\):
$$\begin{aligned} \tau ({\boldsymbol \xi }) = \sum _{m=1}^M \, g({\bf B}_m \cdot {\boldsymbol \xi }) \text{ and } \tau ^\mu = \tau ({\boldsymbol \xi }^\mu ) = \sum _{m=1}^M g(b_m^\mu ) \end{aligned}$$
(12)
with \(b_m^\mu = {\bf B}_m \cdot {\boldsymbol \xi }^\mu\) for one of the training examples. This so-called teacher network can be equipped with \(M>K\) hidden units in order to model regression schemes which cannot be learnt by an SCM student of the form (6). Conversely, \(K>M\) would correspond to an over-learnable target or an over-sophisticated student. For the discussion of these highly interesting cases in stationary environments, see for instance [8, 9, 42, 44, 45]. In a student–teacher scenario with K and M hidden units, the update of the student weight vectors by on-line gradient descent is given by Eq. (11) with \(\tau ^\mu\) from Eq. (12).
In the following, we will restrict our analysis to perfectly matching student complexity with \(K=M=2\) only, which further simplifies Eq. (11). Extensions to more hidden units and settings with \(K\ne M\) will be considered in forthcoming projects.
In contrast to the model for LVQ-based classification, the vectors \({\bf B}_m\) define the target outputs \(\tau ^\mu = \tau ({\boldsymbol \xi }^\mu )\) explicitly via the teacher network for any input vector. While clustered input densities of the form (4) can also be studied for feedforward networks as in [35, 36], we assume here that the actual input vectors are uncorrelated with the teacher vectors \({\bf B}_m\). Consequently, we can resort to a simpler model density and consider vectors \({\boldsymbol \xi }\) of independent, zero mean, unit variance components with
$$\begin{aligned} P({\boldsymbol \xi }) = {(2\, \pi )^{-N/2}} \, \exp \left[ - \, {\boldsymbol \xi }^{\,2}/2 \right] . \end{aligned}$$
(13)
Note that the density (13) is recovered formally from Eq. (4) by setting \(\lambda =0\) and \(v_1=v_2=1\), for which both clusters in (4) coincide in the origin and the parameters \(p_{1,2}\) become irrelevant.
Note that the student–teacher scenario considered here is different from concepts used in studies of knowledge distillation, see [51] and references therein. In the context of distillation, a teacher network is itself trained on a given data set to approximate the target function. Thereafter, a student network, frequently of a simpler architecture, distills the knowledge in a subsequent training process. In our work, as in most statistical physics-based studies [4, 19, 53], the teacher network is taken to directly define the true target function. A particular architecture is chosen and, together with its fixed weights, it controls the complexity of the task. The teacher network provides correct target outputs for all input data that are generated according to the distribution in Eq. (13). In the actual training process, a sequence of such input vectors and teacher-generated labels is presented to the student network.
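Combining the teacher definition (12), the input density (13) and the gradient step (11), a minimal simulation of the matching case \(K=M=2\) could be sketched as follows (all parameter values are arbitrary illustrative choices; here with ReLU activation):

```python
import numpy as np

def g(x):                                  # ReLU activation, cf. Eq. (8)
    return np.maximum(x, 0.0)

def g_prime(x):
    return (x > 0).astype(float)

rng = np.random.default_rng(1)
N, eta = 500, 0.05                         # dimension and learning rate
B = np.linalg.qr(rng.standard_normal((N, 2)))[0].T  # orthonormal teachers
W = 0.1 * rng.standard_normal((2, N))      # student initialization

for mu in range(20 * N):                   # learning time alpha = mu / N
    xi = rng.standard_normal(N)            # input drawn from Eq. (13)
    tau = np.sum(g(B @ xi))                # teacher output, cf. Eq. (12)
    h = W @ xi
    W -= (eta / N) * (np.sum(g(h)) - tau) * g_prime(h)[:, None] * xi[None, :]
```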
Mathematical analysis of the training dynamics
In the following we sketch the successful theory of on-line learning [4, 5, 19, 43, 53] as, for instance, applied to the dynamics of LVQ algorithms in [12, 20, 54] and to on-line gradient descent in SCM in [8, 9, 26, 42, 44, 45, 49]. We refer the reader to the original publications for details. The extensions to non-stationary situations with concept drifts are discussed in Sect. 2.4.
The mathematical analysis proceeds along the same generic steps in both settings. Our presentation follows closely the descriptions in [14, 47].
We consider adaptive vectors \({\bf w}_{1,2}\in \mathbb {R}^N\) (prototypes in LVQ, student weights in the SCM) while the characteristic vectors \({\bf B}_{1,2}\) specify the target task (cluster centers in LVQ training, SCM teacher vectors for regression).
The consideration of the thermodynamic limit \(N\rightarrow \infty\) is instrumental for the theoretical treatment. The limit facilitates the following key steps which, eventually, yield an exact mathematical description of the training dynamics in terms of ordinary differential equations (ODE):
(a) Order parameters
The many degrees of freedom, i.e., the components of the adaptive vectors, can be characterized in terms of only very few quantities. The definition of these so-called order parameters follows naturally from the mathematical structure of the model. After presentation of a number \(\mu\) of examples, as indicated by corresponding superscripts, we describe the system by the projections for \(i,k,m \in \{1,2\}\)
$$\begin{aligned} R_{im}^\mu ={\bf w}_i^\mu \cdot {\bf B}_m \quad \text{ and } \quad Q_{ik}^\mu ={\bf w}_i^\mu \cdot {\bf w}_k^\mu . \end{aligned}$$
(14)
Obviously, \(Q_{11}^\mu ,Q_{22}^\mu\) and \(Q_{12}^\mu =Q_{21}^\mu\) relate to the norms and mutual overlap of the adaptive vectors, while the quantities \(R_{im}^\mu\) specify their projections into the linear subspace spanned by the characteristic vectors \(\{{\bf B}_1,{\bf B}_2\}\).
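In a simulation, these order parameters are obtained directly from the adaptive and characteristic vectors, e.g. (a trivial sketch with our naming):

```python
import numpy as np

def order_parameters(W, B):
    """Overlaps R_im = w_i . B_m and Q_ik = w_i . w_k, cf. Eq. (14),
    for W and B given as (2, N) arrays of row vectors."""
    return W @ B.T, W @ W.T
```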
(b) Recursions
Recursion relations for the order parameters (14) can be derived directly from the update steps, which are of the generic form \({\bf w}_k^\mu \, = {\bf w}_k^{\mu -1} \, + \eta /N \, \Delta {\bf w}_k^\mu .\) The corresponding inner products yield
$$\begin{aligned} N\left( R_{im}^{\mu } - R_{im}^{\mu -1}\right) &= \eta \, \Delta {\bf w}_i^\mu \cdot {\bf B}_m \nonumber \\ N\left( Q_{ik}^{\mu } - Q_{ik}^{\mu -1}\right) &= \eta \left( {\bf w}^{\mu -1}_i \cdot \Delta {\bf w}^{\mu }_k + {\bf w}^{\mu -1}_k \cdot \Delta {\bf w}^{\mu }_i \right) \nonumber \\&\quad + \, \eta ^2/N \, \Delta {\bf w}^{\mu }_i \cdot \Delta {\bf w}^{\mu }_k. \end{aligned}$$
(15)
Terms of order \(\mathcal{O}(1/N)\) on the r.h.s. will be neglected in the following. Note however that \(\Delta {\bf w}^{\mu }_i \cdot \Delta {\bf w}^{\mu }_k\) comprises contributions of order \(|{\boldsymbol \xi }|^2 \propto N\) for the considered updates (2) and (10).
(c) Averages over the model data
Applying the central limit theorem (CLT) we can perform an average over the random sequence of independent examples.
Note that \(\Delta {\bf w}^\mu _k \propto {\boldsymbol \xi }^\mu\) or \(\Delta {\bf w}^\mu _k \propto \left( {\boldsymbol \xi }^\mu - {\bf w}^{\mu -1}_k\right)\) for the SCM and LVQ, respectively. Consequently, the current input \({\boldsymbol \xi }^\mu\) enters the r.h.s. of Eq. (15) only through its norm \(|{\boldsymbol \xi }^\mu |^2 = \mathcal{{O}}(N)\) and the quantities
$$\begin{aligned} h_i^\mu \, = {\bf w}_i^{\mu -1} \cdot {\boldsymbol \xi }^\mu \text{ and } b_m^\mu \, = {\bf B}_m \cdot {\boldsymbol \xi }^\mu . \end{aligned}$$
(16)
Since these inner products correspond to sums of many independent random quantities in our model, the CLT implies that the projections in Eq. (16) are correlated Gaussian quantities for large N and the joint density \(P(h_1^\mu ,h_2^\mu ,b_1^\mu ,b_2^\mu )\) is given completely by first and second moments.
LVQ: For the clustered density, cf. Eq. (4), the conditional moments read
$$\begin{aligned}&\left\langle h^\mu _{i} \right\rangle _{m} = \lambda R_{{\rm im}}^{\mu -1}, \quad \left\langle b^\mu _{m} \right\rangle _{n} = \lambda \delta _{mn},\nonumber \\&\left\langle h^\mu _{i} h^\mu _{k} \right\rangle _{m} - \left\langle h^\mu _{i} \right\rangle _{m} \left\langle h^\mu _{k} \right\rangle _{m} = v_m \, Q^{\mu -1}_{ik},\nonumber \\&\left\langle h^\mu _{i} b^\mu _{n} \right\rangle _{m} - \left\langle h^\mu _{i} \right\rangle _{m} \left\langle b^\mu _{n} \right\rangle _{m} = v_m \, R^{\mu -1}_{in}, \nonumber \\&\left\langle b^\mu _{l} b^\mu _{n} \right\rangle _{m} - \left\langle b^\mu _{l} \right\rangle _{m} \left\langle b^\mu _{n} \right\rangle _{m} = v_m \, \delta _{ln}, \end{aligned}$$
(17)
with \(i,k,l,m,n \in \{1,2\}\) and the Kronecker-Delta \(\delta _{ij}= 1\) for \(i=j\) and \(\delta _{ij}=0\) else.
SCM: In the simpler case of the isotropic, spherical density (13) with \(\lambda =0\) and \(v_1=v_2=1\) the moments reduce to
$$\begin{aligned}&\left\langle h^\mu _{i} \right\rangle = 0, \quad \left\langle b^\mu _{m} \right\rangle = 0, \quad \left\langle h^\mu _{i} h^\mu _{k} \right\rangle - \left\langle h^\mu _{i} \right\rangle \left\langle h^\mu _{k} \right\rangle = Q^{\mu -1}_{ik}, \nonumber \\&\left\langle h^\mu _{i} b^\mu _{n} \right\rangle - \left\langle h^\mu _{i} \right\rangle \left\langle b^\mu _{n} \right\rangle = R^{\mu -1}_{in}, \quad \left\langle b^\mu _{l} b^\mu _{n} \right\rangle - \left\langle b^\mu _{l} \right\rangle \left\langle b^\mu _{n} \right\rangle = \delta _{ln}. \end{aligned}$$
(18)
Hence, in both cases (LVQ and SCM) the four-dimensional density of \(h_{1,2}^\mu\) and \(b_{1,2}^\mu\) is fully specified by the values of the order parameters in the previous time step and the parameters of the model density. This important result enables us to average the recursion relations (15) over the most recent training example by means of Gaussian integrals. The resulting r.h.s. can be expressed as functions of \(\{ R_{im}^{\mu -1},Q_{ik}^{\mu -1} \}.\) Obviously, the precise form depends on the details of the algorithm and model setup.
(d) Self-Averaging Properties
The self-averaging property of the order parameters allows us to describe the dynamics in terms of mean values: Fluctuations of the stochastic dynamics can be neglected in the limit \(N\rightarrow \infty\). The concept relates to the statistical physics of disordered materials and has been transferred successfully to the study of neural network models and learning processes [4, 19, 53]. A detailed mathematical discussion in the context of sequential on-line learning dynamics is given in [41]. As a consequence, we can interpret the averaged equations (15) directly as deterministic recursions for the actual values of \(\{R_{im}^\mu ,Q_{ik}^\mu \},\) which coincide with their disorder average in the thermodynamic limit.
(e) Continuous Time Limit
In the thermodynamic limit \(N\rightarrow \infty ,\) ratios of the form \((\ldots )/(1/N)\) on the left hand sides of Eq. (15) can be interpreted as derivatives with respect to a continuous learning time \(\alpha\) defined by
$$\begin{aligned} \alpha \, = {\, \mu \, }/{N} \text{ with } {\rm d}\alpha \, \sim \, 1/N. \end{aligned}$$
(19)
This scaling corresponds to the natural assumption that the number of examples should be proportional to the number of adaptive quantities in the system.
Averages are performed over the joint density \(P\left( h_1^\mu ,h_2^\mu ,b_1^\mu ,b_2^\mu \right)\) corresponding to the latest, independently drawn input vector. For simplicity, we omit indices \(\mu\) in the following. The resulting set of coupled ODE is of the form
$$\begin{aligned} \left[ \frac{{\rm d}R_{im}}{{\rm d}\alpha } \right] _{{\rm stat}} = \eta \, F_{im}; \quad \left[ \frac{{\rm d}Q_{ik}}{{\rm d}\alpha }\right] _{{\rm stat}} = \eta \, G^{(1)}_{ik} + \eta ^2 G^{(2)}_{ik}. \end{aligned}$$
(20)
Here, the subscript stat indicates that the ODE describe learning from a stationary density, Eqs. (4) or (13).
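Given explicit expressions for \(F_{im}\), \(G^{(1)}_{ik}\) and \(G^{(2)}_{ik}\), the system (20) can be integrated numerically, for instance by a simple Euler scheme as sketched below (the right-hand sides enter as placeholder callables; names and interface are ours):

```python
import numpy as np

def integrate_odes(R0, Q0, eta, alpha_max, d_alpha, F, G1, G2):
    """Euler integration of the ODE system (20). The callables F, G1, G2
    stand in for the model-specific right-hand sides F_im, G^(1)_ik and
    G^(2)_ik as functions of the current order parameters (R, Q)."""
    R, Q = R0.copy(), Q0.copy()
    trajectory = [(0.0, R.copy(), Q.copy())]
    for step in range(1, int(alpha_max / d_alpha) + 1):
        dR = eta * F(R, Q)
        dQ = eta * G1(R, Q) + eta**2 * G2(R, Q)
        R, Q = R + d_alpha * dR, Q + d_alpha * dQ
        trajectory.append((step * d_alpha, R.copy(), Q.copy()))
    return trajectory
```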
Limit of small learning rates
The dynamics can also be studied in the limit of small learning rates \(\eta \rightarrow 0\). In this case, the term \(\eta ^2 G_{ik}^{(2)}\) can be neglected in Eq. (20). In order to retain non-trivial performance, the small step size has to be compensated for by training with a large number of examples that diverges like \(1/\eta\). Formally, we introduce the quantity \(\widetilde{\alpha }\) in the simultaneous limit
$$\begin{aligned} \widetilde{\alpha } \, = \lim _{\eta \rightarrow 0} \lim _{\alpha \rightarrow \infty } \, (\eta \alpha ), \end{aligned}$$
(21)
which leads to a simplified system of ODE
$$\begin{aligned} \left[ \frac{{\rm d}R_{im}}{{\rm d}\widetilde{\alpha }} \right] _{{\rm stat}} = F_{im}; \quad \left[ \frac{{\rm d}Q_{ik}}{{\rm d}\widetilde{\alpha }}\right] _{{\rm stat}} = G^{(1)}_{ik} \end{aligned}$$
(22)
in rescaled continuous time \(\widetilde{\alpha }\) for \(\eta \rightarrow 0.\)
LVQ: In the classification model we have to insert
$$\begin{aligned}&F_{im} = \left\langle b_m f_i \right\rangle - R_{im} \left\langle f_i \right\rangle , \nonumber \\&G^{(1)}_{ik} = \left\langle h_i f_k + h_k f_i \right\rangle - Q_{ik} \left\langle f_i + f_k \right\rangle \nonumber \\&\text{ and } G^{(2)}_{ik}= {\textstyle \sum _{m=1,2}} \, v_m p_m \left\langle f_i f_k \right\rangle _m \end{aligned}$$
(23)
in Eqs. (20) or (22). The LVQ1 modulation function \(f_i\) is given in Eq. (3), and the conditional averages \(\langle \ldots \rangle _m\) are taken with respect to the density (4).
SCM: In the case of non-linear regression we obtain
$$\begin{aligned}&F_{im} = \langle \rho _i b_m \rangle , \quad G^{(1)}_{ik} = \langle \rho _{i} h_k + \rho _k h_i \rangle , \nonumber \\&\quad \text{ and } G^{(2)}_{ik}= \langle \rho _i \rho _k \rangle \quad \text{ with } \rho _k=-(y-\tau )\, g^\prime (h_k). \end{aligned}$$
(24)
Eventually, the r.h.s. of Eqs. (20) or (22) are expressed in terms of elementary functions of order parameters. For the straightforward, yet lengthy results we refer the reader to the original literature for LVQ [12, 20] and SCM [9, 42, 44, 45], respectively.
(f) Generalization error
After training, the success of learning is quantified in terms of the generalization error \(\epsilon _g\), which is also given as a function of the macroscopic order parameters.
LVQ: In the case of the LVQ model, \(\epsilon _g\) is given as the probability of misclassifying a novel, randomly drawn input vector. The class-specific errors corresponding to data from clusters \(k=1,2\) in Eq. (4) can be considered separately:
$$\begin{aligned} \epsilon _g = p_1 \, \epsilon _g^1 + p_2 \, \epsilon _g^2 \text{ where } \epsilon _g^k \, = \, \bigg \langle \Theta \left( d_{k} - d_{\widehat{k}} \right) \bigg \rangle _k \end{aligned}$$
(25)
is the class-specific misclassification rate, i.e., the probability for an example drawn from a cluster k to be assigned to \(\widehat{k}\ne k\) with \(d_{k} > d_{\widehat{k}}\). For the derivation of the class-wise and total generalization error for systems with two prototypes as functions of the order parameters we also refer to [12]. One obtains
$$\begin{aligned} \epsilon _g^k \, = \, \Phi \left( \frac{ Q_{kk}-Q_{\widehat{k}\widehat{k}}-2\lambda ( R_{kk}-R_{\widehat{k}\widehat{k}})}{2 \sqrt{v_k} \sqrt{Q_{11}-2Q_{12}+ Q_{22}}} \right) \end{aligned}$$
(26)
with the function \(\Phi (z)=\int _{-\infty }^{z} dx \, {e^{-x^2/2}}/{\sqrt{2\pi }}.\)
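A direct transcription of Eqs. (25) and (26) into numpy might read as follows (our naming; \(\Phi\) is the standard normal CDF, available from scipy):

```python
import numpy as np
from scipy.stats import norm

def eps_g_lvq(R, Q, lam, v, p):
    """Generalization error of Eqs. (25) and (26) for two prototypes.
    R, Q : 2x2 arrays of order parameters
    v, p : cluster variances (v1, v2) and prior weights (p1, p2)"""
    denom = 2.0 * np.sqrt(Q[0, 0] - 2.0 * Q[0, 1] + Q[1, 1])
    eps = 0.0
    for k, kh in ((0, 1), (1, 0)):          # class k and its complement k-hat
        z = (Q[k, k] - Q[kh, kh] - 2.0 * lam * (R[k, k] - R[kh, kh])) \
            / (np.sqrt(v[k]) * denom)
        eps += p[k] * norm.cdf(z)           # Phi is the Gaussian CDF
    return eps
```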
SCM: In the regression scenario, the generalization error is defined as an average \(\left\langle \cdots \right\rangle\) of the quadratic deviation between student and teacher output over the isotropic density, cf. Eq. (13):
$$\begin{aligned} \epsilon _g \, = \frac{1}{2} \left\langle \left[ \sum _{k=1}^K g \left( {h_k}\right) - \sum _{m=1}^M g\left( {b_m}\right) \right] ^2 \right\rangle . \end{aligned}$$
(27)
In the simplifying case of \(K=M=2\) we obtain for Erf-SCM:
$$\begin{aligned}\epsilon _g \, &= \frac{1}{3} + \frac{1}{\pi } \sum _{i,k=1}^2 \sin ^{-1}\left( \frac{Q_{ik}}{\sqrt{1+Q_{ii}}\sqrt{1+Q_{kk}}}\right) \nonumber \\ &\quad - \frac{2}{\pi } \sum _{i,m=1}^2 \sin ^{-1}\left( \frac{R_{im}}{\sqrt{2} \sqrt{1+Q_{ii}} } \right) \end{aligned}$$
(28)
and for ReLU-SCM:
$$\begin{aligned}\epsilon _g&= \sum _{i,j=1}^2 \left[ \frac{Q_{ij}}{8}+\frac{\sqrt{Q_{ii}Q_{jj}-Q_{ij}^2}+ Q_{ij}\sin ^{-1}\left( \frac{Q_{ij}}{\sqrt{Q_{ii}Q_{jj}}}\right) }{4\pi } \right] \nonumber \\&\quad -\sum _{i,j=1}^2 \left[ \frac{R_{ij}}{4}+\frac{\sqrt{Q_{ii}-R_{ij}^2}+ R_{ij}\sin ^{-1}\left( \frac{R_{ij}}{\sqrt{Q_{ii}}}\right) }{2\pi } \right] + \frac{\pi +1}{2\pi }. \end{aligned}$$
(29)
Both results are for orthonormal teacher vectors, extensions to general \({\bf B}_m \cdot {\bf B}_n = T_{mn}\) can be found in [45, 47].
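For instance, Eq. (28) for the Erf-SCM translates into the following sketch (our naming; R and Q are the 2x2 arrays of order parameters):

```python
import numpy as np

def eps_g_erf_scm(R, Q):
    """Generalization error of the Erf-SCM, cf. Eq. (28), for K = M = 2
    and orthonormal teacher vectors."""
    eps = 1.0 / 3.0
    for i in range(2):
        for k in range(2):
            eps += np.arcsin(Q[i, k] / (np.sqrt(1.0 + Q[i, i])
                                        * np.sqrt(1.0 + Q[k, k]))) / np.pi
    for i in range(2):
        for m in range(2):
            eps -= 2.0 * np.arcsin(R[i, m] / (np.sqrt(2.0)
                                              * np.sqrt(1.0 + Q[i, i]))) / np.pi
    return eps
```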
(g) Learning curves
The (numerical) integration of the ODE for a particular training algorithm, model density and specific initial conditions \(\{ R_{im}(0), Q_{ik}(0) \}\) yields the temporal evolution of the order parameters in the course of training.
Exploiting the self-averaging properties of the order parameters once more, we obtain the learning curves \(\epsilon _g (\alpha )= \epsilon _g\left( \{ R_{im}(\alpha ), Q_{ik}(\alpha )\}\right)\) or the class-wise \(\epsilon _g^{k}(\alpha )\), respectively. Hence, we determine the typical generalization error after on-line training with \(\alpha \, N\) random examples.
The learning dynamics under concept drift
The analysis summarized in the previous section concerns learning in the presence of a stationary concept, i.e., for a density of the form (4) or (13) which does not change in the course of training. Here, we introduce the effect of concept drift into the modelling framework and consider weight decay as an example mechanism for explicit forgetting.
Virtual drift in classification
As defined above, virtual drifts affect statistical properties of the observed example data while the actual target function remains unchanged.
A variety of virtual drift processes can be addressed in our modelling framework. For example, time-varying label noise in regression or classification could be incorporated in a straightforward way [4, 19, 53]. Similarly, non-stationary cluster variances in the input density, cf. Eq. (4), can be introduced into Eq. (20) for the LVQ system through explicitly time-dependent \(v_\sigma (\alpha )\).
Here we focus on a particularly relevant case in classification, in which a varying fraction of examples represents each of the classes in the data stream. We consider non-stationary, \(\alpha\)-dependent prior probabilities \(p_1(\alpha ) = 1-p_2(\alpha )\) in the mixture density (4). In practical situations, varying class bias can complicate the training significantly and lead to inferior performance [52]. Specifically, we distinguish the following scenarios:
(A) Drift in the training data only
Here we assume that the true target classification is defined by a fixed reference density of data. As a simple example we consider equal priors \(p_1=p_2=1/2\) in a symmetric reference density (4) with \(v_1=v_2\). On the contrary, the characteristics of the observed training data are assumed to be time-dependent. In particular, we study the effect of non-stationary \(p_m(\alpha )\) and weight decay on the learning dynamics. Given the order parameters of the learning systems in the course of training, the corresponding reference generalization error
$$\begin{aligned} \epsilon _{{\rm ref}}(\alpha )= \left( \epsilon _g^1 + \epsilon _g^2\right) /2 \end{aligned}$$
(30)
is obtained by setting \(p_1=p_2=1/2\) in Eq. (25), but inserting \(R_{im}(\alpha )\) and \(Q_{ik}(\alpha )\) as obtained from the integration of the corresponding ODE with time-dependent \(p_1(\alpha )=1-p_2(\alpha )\) in the training process.
(B) Drift in training and test data
In the second interpretation we assume that the variation of \(p_m(\alpha )\) affects training and test data in the same way. Hence, the change of the statistical properties of the data is inevitably accompanied by a modification of the target classification: For instance, the Bayes optimal classifier and its best linear approximation depend explicitly on the actual priors [12].
The learning system is supposed to track the actual drifting concept and we refer to the corresponding generalization error as the tracking error
$$\begin{aligned} \epsilon _{{\rm track}}= p_1(\alpha ) \, \epsilon _g^1 \, +\, p_2(\alpha ) \, \epsilon _g^2. \end{aligned}$$
(31)
In terms of modelling the training dynamics, both scenarios, (A) and (B), require the same straightforward modification of the ODE system: the explicit introduction of \(\alpha\)-dependent quantities \(p_\sigma (\alpha )\) in Eq. (20). The obtained temporal evolution yields the reference error \(\epsilon _{{\rm ref}}(\alpha )\) for the case of drift in the training data (A) and \(\epsilon _{{\rm track}}(\alpha )\) in interpretation (B).
Note that in both interpretations, we consider the very same drift processes affecting the training data. However, the interpretation of the relevant performance measure is different. In (A) only the training data is subject to the drift, but the classifier is evaluated with respect to an idealized static situation representing a fixed target. On the contrary, the tracking error in (B) is thought to be computed with respect to test data available from the stream, at the given time. Alternatively, one could interpret (B) as an example of real drift with a non-stationary target, where \(\epsilon _{{\rm track}}\) represents the corresponding generalization error. However, we will refer to (A) and (B) as virtual drift throughout the following.
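Given the class-wise errors \(\epsilon _g^{1,2}(\alpha )\) from the integration of the ODE, both performance measures are simple weighted combinations, e.g. (a trivial sketch):

```python
def eps_ref(eps1, eps2):
    """Reference error, Eq. (30): class-wise errors weighted with the
    fixed priors p_1 = p_2 = 1/2 of the reference density."""
    return 0.5 * (eps1 + eps2)

def eps_track(eps1, eps2, p1_alpha):
    """Tracking error, Eq. (31): weighted with the current priors."""
    return p1_alpha * eps1 + (1.0 - p1_alpha) * eps2
```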
Real drift in regression
In the presented framework, a real drift can be modelled as a process which displaces the characteristic vectors \({\bf B}_{1,2}\), i.e., the cluster centers in LVQ or the teacher weight vectors in the SCM. Here we focus on the latter case and refer the reader to [47] for earlier results on LVQ training under real drift.
A variety of time dependences could be considered in the model. We restrict ourselves to the analysis of diffusion-like random displacements of vectors \({\bf B}_{1,2} (\mu )\) at each time step. Upon presentation of example \(\mu\), we assume that random vectors \({\bf B}_{1,2}(\mu )\) are generated which satisfy the conditions
$$\begin{aligned}&{\bf B}_1(\mu ) \cdot {\bf B}_1(\mu \!-\!1) = {\bf B}_2(\mu ) \cdot {\bf B}_2(\mu \!-\!1) = \left( 1 - {\delta }/{N}\right) \nonumber \\&{\bf B}_1(\mu )\cdot {\bf B}_2(\mu )= 0 \text{ and } \mid {\bf B}_1(\mu )\mid ^2 = \mid {\bf B}_2(\mu )\mid ^2 = 1. \end{aligned}$$
(32)
Here, \(\delta\) quantifies the strength of the drift process. The displacement of the teacher vectors is very small in an individual training step. For simplicity, we assume that the orthonormality of the teacher vectors is preserved under the drift. In continuous time \(\alpha =\mu /N\), the drift parameter defines a characteristic scale \(1/\delta\) on which the overlap of the current teacher vectors with their initial positions decays: \({\bf B}_{m}(\mu )\cdot {\bf B}_{m}(0)\, = \exp [-\delta \, \mu /N ].\)
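One possible realization of the displacement (32) tilts each teacher vector by a small random vector orthogonal to the current teachers and renormalizes. This is only a sketch of one such construction (the conditions (32) do not fix it uniquely), satisfying the required overlap up to \(\mathcal{O}(1/N)\) corrections:

```python
import numpy as np

def drift_teachers(B, delta, rng):
    """One random displacement step consistent with Eq. (32): each row of
    B (a teacher vector) is tilted by a random direction orthogonal to
    span{B_1, B_2} and renormalized, so that B_m(mu) . B_m(mu-1) equals
    1 - delta/N up to O(1/N^2) corrections."""
    M, N = B.shape
    eps = np.sqrt(2.0 * delta / N)          # yields overlap approx. 1 - delta/N
    for m in range(M):
        noise = rng.standard_normal(N)
        noise -= B.T @ (B @ noise)          # keep mutual orthogonality intact
        noise /= np.linalg.norm(noise)
        B[m] = (B[m] + eps * noise) / np.sqrt(1.0 + eps**2)
    return B
```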
The effect of such a drift process is easily taken into account in the formalism: For a particular student \({\bf w}_i\in \mathbb {R}^N\) we obtain [6, 7, 28, 50]
$$\begin{aligned} \left[ {\bf w}_i\cdot {\bf B}_k(\mu )\right] = \left( 1- {\delta }/{N}\right) \, \left[ {\bf w}_i\cdot {\bf B}_k(\mu -1)\right] \end{aligned}$$
(33)
under the above-specified random displacement. Hence, the drift tends to decrease the quantities \(R_{ik}\), which clearly reduces the success of training compared with the case of stationary teachers. The corresponding ODE in the limit \(N\rightarrow \infty\) under the drift process (32) become
$$\begin{aligned}&\left[ {{\rm d}R_{im}}/{{\rm d}\alpha } \right] _{{\rm drift}} = \left[ {{\rm d}R_{im}}/{{\rm d}\alpha } \right] _{{\rm stat}} - \delta \, R_{im} \quad \text{ and } \nonumber \\&\left[ {{\rm d}Q_{ik}}/{{\rm d}\alpha }\right] _{{\rm drift}} = \left[ {{\rm d}Q_{ik}}/{{\rm d}\alpha }\right] _{{\rm stat}} \end{aligned}$$
(34)
with the terms \(\left[ \cdots \right] _{{\rm stat}}\) for stationary environments taken from Eq. (20). Note that the order parameters \(R_{im}(\alpha )\) now correspond to the inner products \({\bf w}_i^\mu \cdot {\bf B}_m(\alpha )\), as the teacher vectors themselves are time-dependent.
Weight decay
Possible motivations for the introduction of so-called weight decay in machine learning systems range from regularization to reduce the risk of over-fitting in regression and classification [15, 22, 23] to the modelling of forgetful memories in attractor neural networks [24, 37].
Here, we include weight decay to enforce explicit forgetting and to potentially improve the performance of the systems in the presence of real concept drift. We consider the multiplication of all adaptive vectors by a factor \((1-\gamma /N)\) before the generic learning step given by \(\Delta {\bf w}_i^\mu\) in Eq. (2) or Eq. (10) is performed:
$$\begin{aligned} {\bf w}_i^\mu \, = \, \left( 1-{\gamma }/{N}\right) \, {\bf w}_i^{\mu -1} \, + {\eta }/{N} \, \Delta {\bf w}_i^\mu . \end{aligned}$$
(35)
Since the multiplications with \(\left( 1-\gamma /N\right)\) accumulate in the course of training, weight decay enforces an increased influence of the most recent training data as compared to earlier examples. Note that analogous modifications of perceptron training under concept drift have been discussed in [6, 7, 28, 50].
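In a simulation, the modified update (35) amounts to a single additional multiplication per step, e.g. (a minimal sketch with our naming):

```python
def weight_decay_update(w, delta_w, eta, gamma):
    """Generic update with weight decay, cf. Eq. (35): the adaptive
    vector is shrunk by the factor (1 - gamma/N) before the learning
    step (eta/N) * delta_w is added."""
    N = w.shape[0]
    return (1.0 - gamma / N) * w + (eta / N) * delta_w
```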
In the thermodynamic limit \(N\rightarrow \infty\), the modified ODE for training under real drift, cf. Eq. (32), and weight decay, Eq. (35), are obtained as
$$\begin{aligned}&\left[ {{\rm d}R_{im}}/{{\rm d}\alpha } \right] _{{\rm decay}} = \left[ {{\rm d}R_{im}}/{{\rm d}\alpha } \right] _{{\rm stat}} - (\delta +\gamma )\, R_{im} \quad \text{ and } \nonumber \\&\left[ {{\rm d}Q_{ik}}/{{\rm d}\alpha }\right] _{{\rm decay}} = \left[ {{\rm d}Q_{ik}}/{{\rm d}\alpha }\right] _{{\rm stat}} - 2\, \gamma \, Q_{ik} \end{aligned}$$
(36)
where the terms for stationary environments in the absence of weight decay are given in Eq. (20).