1 Introduction

This article is concerned with the automatic construction of non-linear basis functions for linear algorithms. This is of particular importance if the original space of inputs \(\mathcal{X} \subseteq\mathbb{R}^{l}\) cannot support an adequate linear solution.

New algorithms are often initially formulated using a linear function class \(\mathcal{F}_{\mathcal{X}} := \{ f(\boldsymbol{x}) = \sum_{i=1}^{l} w_{i} x_{i} | \boldsymbol{w} \in\mathbb{R}^{l} \}\) of possible solutions \(f: \mathcal{X} \to\mathbb{R}\). Examples include linear regression and the discrimination function for linear classification. We adopt the point of view that the desired solutions are actually defined on an unknown low dimensional space of latent variables Θ, which is embedded in the high dimensional space of sensor observations \(\boldsymbol{x} \in \mathcal{X}\). The restriction to linear functions \(f \in\mathcal{F}_{\mathcal{X}}\), however, can prevent a suitable solution because (a) Θ might be non-linearly embedded in \(\mathcal{X}\) and (b) the true solution f might be non-linear in Θ. This can be compensated for by the introduction of a feature space \(\varPhi:= \{ \boldsymbol{\phi}(\boldsymbol{x}) | \boldsymbol{x} \in\mathcal{X} \}\) with mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}, \forall i \in\{ 1,\ldots,p\}\), such that:

  • \(\phi_{i}(\cdot), \forall i \in\{1,\ldots,p\}\), is non-linear in \(\mathcal{X}\) to encode Θ rather than \(\mathcal{X}\).

  • \(\{\phi_{i}\}_{i=1}^{p}\) constitutes a well-behaved functional basis in Θ, e.g. a Fourier basis.

  • Dimensionality p is as low as possible to reduce the number of training samples required to reliably estimate \(f \in\mathcal{F}_{\varPhi}:= \{ f(\boldsymbol{x}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}) | \boldsymbol{w} \in\mathbb{R}^{p} \}\).

This article outlines an approach to automatically construct such a feature space Φ for a given time series. The critical assumption is that there exists an unknown low dimensional space of latent variables Θ that generated the observed data \(\boldsymbol{x} \in \mathcal{X}\) by some unknown stochastic process \(\varTheta\to\mathcal{X}\). It is important to understand that we do not talk about the true underlying cause of the data, but about a description Θ that suffices to generate the training samples \(\{\boldsymbol{x}_{t}\}_{t=1}^{n} \subset\mathcal{X}\). Additionally, we restrict ourselves here to the most general encoding of Θ, i.e. we assume no additional information about labels or classes of the training samples. This is achieved by an unsupervised learning principle called slow feature analysis (SFA, Wiskott and Sejnowski 2002). SFA extracts temporally coherent features from high dimensional and/or delayed sensor measurements. Given an infinite time series and an unrestricted function class, the learned features will form a Fourier basis in Θ (Wiskott 2003; Franzius et al. 2007). Although numerous studies have highlighted its resemblance to biological sensory processing (Wiskott and Sejnowski 2002; Berkes and Wiskott 2005; Franzius et al. 2007), the method has not yet found its way into the engineering community that focuses on the same problems. One of the reasons is undoubtedly the lack of an easily operated non-linear extension that is powerful enough to generate a Fourier basis.

The major contribution of this article is to provide such an extension in the form of a kernel SFA algorithm and to demonstrate its application on the example of linear classification. An approach to kernelizing SFA has previously been taken by Bray and Martinez (2002) and is reported to work well with a large image data set. Small training sets, however, lead to numerical instabilities in any kernel SFA algorithm. Our goal is to provide an algorithm that can be applied in both regimes and to document the pitfalls that arise in its application to real-world time series. This algorithm has previously been presented in a conference paper, which forms the basis for this article (Böhmer et al. 2011). Beyond the scope of the original paper, we provide evidence for its main hypothesis and evaluate linear classification algorithms in the feature spaces generated by kernel SFA and kernel PCA (Schölkopf et al. 1997).

Although formulated as a linear algorithm, SFA was originally intended to be applied to a space of polynomials, e.g. quadratic (Wiskott and Sejnowski 2002) or cubic (Berkes and Wiskott 2005) ones. The polynomial expansion of potentially high dimensional data, however, spans an impractically large space of coefficients. Hierarchical application of quadratic SFA has been proposed to solve this problem (Wiskott and Sejnowski 2002). Although proven to work in complex tasks (Franzius et al. 2007), this approach involves a multitude of hyper-parameters, and no easy way to counteract the inevitable over-fitting is known so far. It appears biologically plausible, but it is not easy to operate.

Kernel methods are a powerful alternative to polynomial expansions. Here the considered feature maps \(\phi: \mathcal{X} \to\mathbb{R}\) are elements of a reproducing kernel Hilbert space \(\mathcal{H}\). For many optimization problems a representer theorem in the spirit of Wahba (1990) holds if a regularization term is used. Such theorems guarantee that the optimal solution for a given training set exists within the span of kernel functions parametrized by the training samples. Depending on the kernel, continuous functions can be approximated arbitrarily well in \(\mathcal{H}\), and a mapping \(\phi\in\mathcal{H}\) is thus very powerful (Shawe-Taylor and Cristianini 2004).

There are, however, fundamental drawbacks to a kernel approach to SFA. First, choosing feature mappings from a powerful Hilbert space is naturally prone to over-fitting. More to the point, kernel SFA shows numerical instabilities due to its unit variance constraint (see Sects. 3 and 5). This tendency has been shown analytically for the related kernel canonical correlation analysis (Fukumizu et al. 2007). We introduce a regularization term for the SFA objective to enforce a stable solution. Second, kernel SFA is based on a kernel matrix of size \(\mathcal{O}(n^{2})\), where n is the number of training samples. This is not feasible for large training sets. Our approach approximates the optimal solution by projecting onto the span of a sparse subset of the data. The choice of this subset, however, is a crucial decision.

The question of how many samples should be selected can only be answered empirically. We compare two state-of-the-art sparse subset selection algorithms that approach this problem very differently: (1) a fast online algorithm (Csató and Opper 2002) that must recompute the whole solution to change the subset's size, and (2) a computationally costly matching pursuit approach to sparse kernel PCA (Smola and Schölkopf 2000) that incrementally augments the selected subset. To obtain a method that is both fast and incremental, we derive a novel matching pursuit approach to the first algorithm.

Bray and Martinez (2002) have previously introduced a kernel SFA algorithm that incorporates a simplistic sparsity scheme. Instead of the well-established framework of Wiskott and Sejnowski (2002), they utilize the cost function of Stone (2001), based on long- and short-term variances without explicit constraints. Due to a high level of sparsity, their approach does not require function regularization. We show that the same holds for our algorithm if the sparse subset is only a small fraction of the training data. For larger fractions, however, additional regularization becomes inevitable.

Theoretical predictions by Wiskott (2003) and Franzius et al. (2007) suggest that regularized sparse kernel SFA features will resemble a Fourier basis in the space of latent variables Θ for a given real-world time series. To verify this hypothesis, we perform a vowel classification task on spoken words. None of the tested linear classification algorithms (Bishop 2006) were able to solve the task without non-linear basis functions. Our work extends Berkes (2005), who used non-linear SFA to classify images of hand written digits. His work was based on artificial time series rather than real-world data, though. Our results show excellent classification accuracy of more than 97 % and superior encoding of latent variables by kernel SFA in comparison with kernel PCA (Schölkopf et al. 1997). To the best of our knowledge, this work and its predecessor (Böhmer et al. 2011) are the first attempts to apply non-linear SFA as a pre-processing step for audio data.

In the following section, we first review a variety of linear classification algorithms, which we will use to compare the constructed feature spaces Φ. In Sect. 3 we formulate the general SFA optimization problem and derive a regularized sparse kernel SFA algorithm. In Sect. 4 the sparse subset selection is introduced and a novel matching pursuit algorithm derived. Section 5 evaluates both the SFA algorithm and the generated feature space on a vowel classification task, followed by a discussion of our method and possible extensions in Sect. 6.

2 Linear classification algorithms

Classification aims to assign to each test sample \(\boldsymbol{x}^{*} \in \mathcal{X}\) a class \(c^{*} \in \mathcal{C}\), based on a given training set of samples \(\{\boldsymbol{x}_{t}\}_{t=1}^{n}\) and their assigned classes \(\{c_{t}\}_{t=1}^{n}\). In the case of two classes \(\mathcal{C} = \{-1, +1\}\), linear classification aims for a discrimination function \(f \in \mathcal{F}_{\varPhi}= \{ f(\boldsymbol{x}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}) | \boldsymbol{w} \in\mathbb{R}^{p} \}\) with \(\operatorname{sign} (f(\boldsymbol{x}_{t}) ) = c_{t}, \forall t \in\{ 1,\ldots ,n\}\). All of the algorithms below are discussed in depth by Bishop (2006).

2.1 Least squares classification/LDA/FDA

The most straightforward approach to classification is to encode each class label with a 1-of-\(|\mathcal{C}|\) binary label vector and to perform least-squares (LS) regression of \(\boldsymbol{g} \in \mathcal{F}^{| \mathcal{C}|}_{\varPhi}\) against this multivariate target \(\mathbf{T}\) (Bishop 2006). The function \(g_{i} \in \mathcal{F}_{\varPhi}\) with the highest output on a test sample yields the corresponding class. For example, in the two-class case \(\mathcal{C} = \{-1,+1\}\), the discrimination function \(f \in \mathcal{F}_{\varPhi}\) that separates the classes is given by \(f(\boldsymbol{x}) = g_2(\boldsymbol{x}) - g_1(\boldsymbol{x})\). The LS solution is found by minimizing

$$ \min_{\boldsymbol{g} \in \mathcal{F}_\varPhi^{| \mathcal{C}|}} \sum_{i=1}^{|\mathcal{ C}|} \sum_{t=1}^n \bigl( g_i( \boldsymbol{x}_t) - T_{it} \bigr)^2 . $$
(1)

As each linear function \(g_{i} \in \mathcal{F}_{\varPhi}\) is characterized by a vector \(\boldsymbol{w}_i \in\mathbb{R}^{p}\), the above cost function has the unique analytical solution

$$ \boldsymbol{w}_i = \mathbf{C}^{-1} \boldsymbol{\mu}_i , \quad\text{where } C_{ij} := \frac{1}{n} \sum_{t=1}^n \phi_i(\boldsymbol{x}_t) \phi_j( \boldsymbol{x}_t) \quad\text{and} \quad \boldsymbol{ \mu}_i := \frac{1}{n} \sum_{t=1}^n T_{it} \boldsymbol{\phi}(\boldsymbol{x}_t) . $$
(2)

The terms \(\boldsymbol{\mu}_i \in\mathbb{R}^{p}\) and \(\mathbf{C} \in\mathbb{R}^{p \times p}\) are the empirical estimates of a weighted ith class mean and the second moment correlation matrix in feature space Φ. Notice that in zero mean and unit variance feature spaces, which we will compare in Sect. 5, the computation reduces to the estimation of the class means \(\boldsymbol{\mu}_i\). In general, however, the algorithm has a computational complexity of \(\mathcal{O}(p^{2} n + p^{3})\).

If Φ has zero mean w.r.t. all samples and the classes are equally distributed, C is the estimate of the covariance matrix and the solution is equivalent to a Gaussian estimate of the class probability densities under the restriction that all classes share the same covariance (Bishop 2006). This approach of deriving the discrimination function from Gaussian density estimates is also called linear discriminant analysis (LDA). With small modifications to the target weights (Duda and Hart 1973), the solution of (2) also becomes equivalent to Fisher discriminant analysis (FDA, Fisher 1936). As the empirical differences in our experiments were marginal, we restricted ourselves in Sect. 5 to LDA. Note, however, that these algorithms are known to perform poorly if the strong assumption of equal covariances is not met.
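
To make (2) concrete, the following minimal numpy sketch (our own illustration, not code from the paper) computes one weight vector per class from a feature matrix and a 1-of-\(|\mathcal{C}|\) target matrix; the names ls_classifier and classify are ours, and C is assumed to be invertible.

```python
import numpy as np

def ls_classifier(Phi, T):
    """Least-squares classification weights, cf. (2).

    Phi : (n, p) array of feature values phi(x_t) for all training samples.
    T   : (n_classes, n) array of 1-of-|C| binary target labels.
    Returns W with one weight vector w_i per class (shape (n_classes, p)).
    """
    n = Phi.shape[0]
    C = Phi.T @ Phi / n              # second moment matrix C
    M = T @ Phi / n                  # weighted class means mu_i (one per row)
    W = np.linalg.solve(C, M.T).T    # w_i = C^{-1} mu_i
    return W

def classify(W, phi_x):
    """Assign the class whose discriminant g_i(x) = w_i^T phi(x) is largest."""
    return int(np.argmax(W @ phi_x))
```

In a zero mean, unit variance feature space the matrix C reduces to the identity, so only the class means need to be estimated, as noted above.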

2.2 Perceptron classification

The algorithm that started the field of artificial neural networks (Rosenblatt 1962) can classify two-class problems with \(\mathcal{C} = \{-1,+1\}\) by applying a signum function to the output of the discrimination function \(f \in \mathcal{F}_{\varPhi}\), i.e. \(y(\boldsymbol{x}_{t}) = \operatorname{sign} (f(\boldsymbol{x}_{t}) )\). At sample \(\boldsymbol{x}_t\), the online version of this algorithm updates the weight vector \(\boldsymbol{w} \in\mathbb{R}^{p}\) of \(f(\boldsymbol{x}_{t}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}_{t})\) by \(\Delta w_{i}^{(t)} := \phi_{i}(\boldsymbol{x}_{t})\, c_{t}\), but only if the prediction \(y(\boldsymbol{x}_t)\) was incorrect. The update \(\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha \Delta\boldsymbol{w}^{(t)}\) with learning rate 0<α≤1 is repeated until all training samples are classified correctly. The algorithm is guaranteed to converge for linearly separable problems (Rosenblatt 1962).

We do not expect linearly separable problems and thus devised a batch algorithm with a decaying learning rate \(\alpha_i := 1/i\) at iteration i, which is run until convergence:

$$ w_j^{(i+1)} := w_j^{(i)} + \frac{\alpha_i}{\sum_{t=1}^n d^{(i)}_t} \sum_{t=1}^n d^{(i)}_t c_t \phi_j(\boldsymbol{x}_t) , \quad\text{where } d^{(i)}_t := \left \{ \begin{array}{l@{\quad}l} 1 ,& \text{if } y(\boldsymbol{x}_t) \neq c_t \\ 0 ,& \text{otherwise}. \end{array} \right . $$
(3)

Convergence criteria that stay comparable over multiple feature space sizes are hard to formulate and so we ran the algorithm for 100 iterations, which appeared to be sufficient for convergence in all feature spaces Φ. The computational complexity for i iterations of the above algorithm is \(\mathcal{O}(pin)\).
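
A minimal sketch of the batch update (3), assuming labels in {−1,+1}; the function name batch_perceptron is ours and the fixed iteration budget mirrors the procedure described above.

```python
import numpy as np

def batch_perceptron(Phi, c, iterations=100):
    """Batch perceptron with decaying learning rate alpha_i = 1/i, cf. (3).

    Phi : (n, p) feature matrix, c : (n,) labels in {-1, +1}.
    """
    n, p = Phi.shape
    w = np.zeros(p)
    for i in range(1, iterations + 1):
        wrong = np.sign(Phi @ w) != c       # d_t^(i) = 1 for misclassified samples
        if not wrong.any():
            break                           # all training samples classified correctly
        alpha = 1.0 / i                     # decaying learning rate
        # average update over all misclassified samples
        w += alpha * (Phi[wrong].T @ c[wrong]) / wrong.sum()
    return w
```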

2.3 Logistic regression

Rather than the signum function of the perceptron algorithm, the logistic regression algorithm applies the logistic sigmoid function \(\sigma(a) := (1+\exp(-a))^{-1}\) to the discrimination function \(f \in \mathcal{F}_{\varPhi}\) to approximate the probability \(P(c_t = 1 | \boldsymbol{x}_t) \approx y(\boldsymbol{x}_t) := \sigma(f(\boldsymbol{x}_t))\). By encoding the classes as \(\mathcal{C} := \{0,1\}\), one can express the likelihood function in the two-class case (Bishop 2006) for a given discrimination function f(⋅) as

$$ p(c_1,\ldots,c_n | \boldsymbol{x}_1,\ldots, \boldsymbol{x}_n) = \prod_{t=1}^n \bigl(y(\boldsymbol{x}_t) \bigr)^{c_t} \bigl(1 - y( \boldsymbol{x}_t) \bigr)^{1-c_t} . $$
(4)

This likelihood can be maximized by the Newton-Raphson scheme (Fletcher 1987). The method requires the calculation of the gradient and Hessian of the log-likelihood. As the sigmoid σ(⋅) is non-linear, the Hessian \(\mathbf{H}^{(i)}\) is no longer independent of the discrimination function \(f^{(i)} \in \mathcal{F}_{\varPhi}\) and has to be recomputed in every iteration i. The resulting algorithm is called iterative reweighted least squares (Rubin 1983):

$$ \boldsymbol{w}^{(i+1)} := \boldsymbol{w}^{(i)} - \bigl( \mathbf{H}^{(i)} \bigr)^{-1} \sum_{t=1}^n \bigl( y^{(i)}(\boldsymbol{x}_t) - c_t \bigr) \boldsymbol{\phi}(\boldsymbol{x}_t) , \quad\text{where } H^{(i)}_{jk} := \sum_{t=1}^n y^{(i)}(\boldsymbol{x}_t) \bigl( 1 - y^{(i)}(\boldsymbol{x}_t) \bigr) \phi_j(\boldsymbol{x}_t)\, \phi_k(\boldsymbol{x}_t) . $$
(5)

The algorithm showed clear signs of over-fitting in large feature spaces. Regularization of the Hessian, \(\mathbf{H}^{(i)} \to \mathbf{H}^{(i)} + \eta \mathbf{I}\), has proven very effective in compensating for this problem (see Sect. 5.4). As with the perceptron algorithm, we ran the algorithm for 100 iterations, which appeared to be sufficient for convergence. With i iterations the computational complexity is \(\mathcal{O}(p^{2} in + p^{3}i)\).
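
A hedged sketch of iterative reweighted least squares with the regularized Hessian \(\mathbf{H}^{(i)} + \eta\mathbf{I}\); it follows the standard Newton-Raphson formulation for logistic regression (Bishop 2006) and may differ in details from the authors' implementation.

```python
import numpy as np

def logistic_regression_irls(Phi, c, iterations=100, eta=1e-3):
    """Iterative reweighted least squares for logistic regression, cf. (4)-(5).

    Phi : (n, p) feature matrix, c : (n,) labels in {0, 1}.
    eta : Tikhonov-style regularization of the Hessian (see Sect. 5.4).
    """
    n, p = Phi.shape
    w = np.zeros(p)
    for _ in range(iterations):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))                  # sigma(f(x_t))
        grad = Phi.T @ (y - c)                              # gradient of the negative log-likelihood
        R = y * (1.0 - y)                                   # per-sample curvature sigma'(f(x_t))
        H = (Phi * R[:, None]).T @ Phi + eta * np.eye(p)    # regularized Hessian
        w -= np.linalg.solve(H, grad)                       # Newton-Raphson step
    return w
```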

3 Slow feature analysis

Let \(\{\boldsymbol{x}_{t}\}_{t=1}^{n} \subset \mathcal{X}\) be a sequence of n observations. The goal of slow feature analysis (SFA) is to find a set of mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}\), i∈{1,…,p}, such that the values \(\phi_i(\boldsymbol{x}_t)\) change slowly over time (Wiskott and Sejnowski 2002). The change of a mapping \(\phi_i\) over time is measured by the discrete temporal derivative \(\dot{\phi}_{i}(\boldsymbol{x}_{t}) := \phi_{i}(\boldsymbol{x}_{t}) - \phi_{i}(\boldsymbol{x}_{t-1})\). The SFA objective (called slowness, (6)) is to minimize the mean of this squared derivative, where \(\mathbb{E}_{t}[\cdot]\) is the sample mean over all available indices t:

$$ \min s(\phi_i) := \mathbb{E}_t \bigl[ \dot{\phi}_i^2(\boldsymbol{x}_t) \bigr] . $$
(6)

To avoid trivial solutions as well as to deal with mixed sensor observations on different scales, the mappings are forced to change uniformly, i.e. to exhibit zero mean and unit variance ((7) and (8)). Decorrelation ensures that every mapping extracts unique information (9). The last degree of freedom is eliminated by demanding order (10), leading to the following constraints:

$$ \mathbb{E}_t \bigl[ \phi_i(\boldsymbol{x}_t) \bigr] = 0 \quad \text{(zero mean)} , $$
(7)
$$ \mathbb{E}_t \bigl[ \phi_i^2(\boldsymbol{x}_t) \bigr] = 1 \quad \text{(unit variance)} , $$
(8)
$$ \mathbb{E}_t \bigl[ \phi_i(\boldsymbol{x}_t)\, \phi_j(\boldsymbol{x}_t) \bigr] = 0 , \quad\forall j < i \quad \text{(decorrelation)} , $$
(9)
$$ \forall j < i: \ s(\phi_j) \leq s(\phi_i) \quad \text{(order)} . $$
(10)

The principle of slowness, although not the above definition, was employed early on in the context of neural networks (Földiák 1991; Becker and Hinton 1992). Recent variations of SFA differ either in the objective (Bray and Martinez 2002) or the constraints (Einhäuser et al. 2005; Wyss et al. 2006). For some simplified cases, given an infinite time series and an unrestricted function class, it can be shown analytically that SFA solutions converge to trigonometric polynomials w.r.t. the underlying latent variables (Wiskott 2003; Franzius et al. 2007). In reality those conditions are never met and one requires a function class that can be adjusted to the data set at hand.

3.1 Kernel SFA

Let the considered mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}\) be elements of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) with corresponding positive semi-definite kernel \(\kappa: \mathcal{X} \times \mathcal{X} \to\mathbb{R}\). The reproducing property of such kernels yields \(\forall\phi\in \mathcal{H}: \phi(\boldsymbol{x}) = \langle\phi , \kappa(\cdot , \boldsymbol{x}) \rangle_{\mathcal{H}}\), in particular \(\langle\kappa(\cdot,\boldsymbol{x}) , \kappa(\cdot,\boldsymbol{x}^{\prime}) \rangle_{\mathcal{H}} = \kappa (\boldsymbol{x},\boldsymbol{x}^{\prime})\). The representer theorem ensures that the solution lies in the span of support functions parametrized by the training data (Wahba 1990), i.e. \(\phi= \sum_{t=1}^{n} a_{t} \kappa(\cdot, \boldsymbol{x}_{t})\). Together these two relationships form the basis of the kernel trick (see e.g. Shawe-Taylor and Cristianini 2004).

The zero mean constraint can be achieved by centering all involved support functions \(\{\kappa(\cdot, \boldsymbol{x}_{t})\}_{t=1}^{n}\) in \(\mathcal{H}\) (for details see Sect. 3.2). Afterwards, the combined kernel SFA (K-SFA) optimization problem for all p mappings \(\boldsymbol{\phi}(\cdot) \in \mathcal{H}^{p}\) is

$$ \min_{\boldsymbol{\phi}{\in} { \mathcal{H}}^p} {\sum _{i=1}^p } \mathbb{E}_t \bigl[ {\dot{\phi}}^2_i(\boldsymbol{x}_t) \bigr], \quad \text{s.t. } \mathbb{E}_t \bigl[ \boldsymbol{\phi}( \boldsymbol{x}_t) \boldsymbol{\phi}(\boldsymbol{x}_t)^{\top} \bigr] = \mathbf{I} . $$
(11)

Through application of the kernel trick, the problem can be reformulated as

$$ \min_{\mathbf{A} \in\mathbb{R}^{n \times p}} \frac{1}{n-1} \operatorname{tr} \bigl( \mathbf{A}^{\top} \mathbf{K} \mathbf{D} \mathbf{D}^{\top} \mathbf{K}^{\top} \mathbf{A} \bigr) , \quad\text{s.t. } \frac{1}{n} \mathbf{A}^{\top} \mathbf{K} \mathbf{K}^{\top} \mathbf{A} = \mathbf{I} , $$
(12)

where \(K_{ij} = \kappa(\boldsymbol{x}_i, \boldsymbol{x}_j)\) is the kernel matrix and \(\mathbf{D} \in\mathbb{R}^{n \times n-1}\) the temporal derivation matrix with all zero entries except \(D_{t,t} = -1\) and \(D_{t+1,t} = 1, \forall t \in\{1,\ldots,n-1\}\).
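
For illustration, a naive (non-sparse, unregularized) numpy/scipy sketch of (12): it builds the temporal derivation matrix D and solves the constrained trace minimization as a generalized symmetric eigenvalue problem. This is our own sketch, not the paper's Algorithm 1; in particular \(\frac{1}{n}\mathbf{K}\mathbf{K}^{\top}\) is often numerically rank-deficient, which is exactly the instability that motivates the regularization and sparsity introduced below.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_sfa(K, p):
    """Naive (non-sparse, unregularized) kernel SFA, cf. (12).

    K : (n, n) kernel matrix K_ij = kappa(x_i, x_j) of a single time series.
    Returns the coefficient matrix A (n, p) of the p slowest features.
    """
    n = K.shape[0]
    D = np.zeros((n, n - 1))                     # temporal derivation matrix
    D[np.arange(n - 1), np.arange(n - 1)] = -1.0
    D[np.arange(1, n), np.arange(n - 1)] = 1.0
    slow = (K @ D @ D.T @ K.T) / (n - 1)         # slowness objective matrix
    cov = (K @ K.T) / n                          # unit variance / decorrelation constraint
    # Generalized eigenproblem slow*a = lambda*cov*a; cov must be positive definite,
    # which frequently fails numerically -- hence regularization and sparsity.
    vals, vecs = eigh(slow, cov)
    return vecs[:, :p]                           # eigenvectors of the p smallest eigenvalues
```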

Sparse kernel SFA

If one restricts the feature mappings to the span of another set of data \(\{\boldsymbol{z}_{i}\}_{i=1}^{m} \subset \mathcal{X}\) (e.g. a sparse subset of the training data, often called support vectors), the sparse kernel matrix \(\mathbf{K} \in\mathbb{R}^{m \times n}\) is defined as \(K_{ij} = \kappa(\boldsymbol{z}_i, \boldsymbol{x}_j)\) instead. The resulting algorithm will be called sparse kernel SFA. Note that the representer theorem no longer applies and the solution therefore merely approximates the optimal mappings in \(\mathcal{H}\). Both optimization problems have identical solutions if \(\forall t \in\{1,\ldots,n\}: \kappa(\cdot, \boldsymbol{x}_{t}) \in\mathbf{span}(\{\kappa(\cdot, \boldsymbol{z}_{i})\}_{i=1}^{m})\), e.g. if \(\{\boldsymbol{z}_{i}\}_{i=1}^{n} = \{\boldsymbol{x}_{t}\}_{t=1}^{n}\).

Regularized sparse kernel SFA

The Hilbert spaces corresponding to some of the most popular kernels are equivalent to an infinite dimensional space of continuous functions (Shawe-Taylor and Cristianini 2004). One example is the Gaussian kernel \(\kappa(\boldsymbol{x},\boldsymbol{x}') = \exp(-\frac{1}{2\sigma^{2}} \|\boldsymbol{x} - \boldsymbol{x}'\|_{2}^{2})\). Depending on the hyper-parameter σ and the data distribution, this can obviously lead to over-fitting. Less obvious, however, is the tendency of kernel SFA to become numerically unstable for large σ, i.e. to violate the unit variance constraint. Fukumizu et al. (2007) have shown this analytically for the related kernel canonical correlation analysis. Note that neither problem affects sufficiently sparse solutions, as sparsity reduces the function complexity and sparse kernel matrices \(\mathbf{K}\mathbf{K}^{\top}\) are more robust w.r.t. eigenvalue decompositions.

One countermeasure is to introduce a regularization term that stabilizes the sparse kernel SFA algorithm, which will hereafter be called regularized sparse kernel SFA (RSK-SFA). Our approach penalizes the squared Hilbert-norm of the selected functions, \(\| \phi_{i} \|_{ \mathcal{H}}^{2}\), weighted by a regularization parameter λ. Analogous to K-SFA, the kernel trick can be utilized to obtain the new objective:

$$ \min_{\mathbf{A} \in\mathbb{R}^{m \times p}} \frac{1}{n-1} \operatorname{tr} \bigl( \mathbf{A}^{\top} \mathbf{K} \mathbf{D} \mathbf{D}^{\top} \mathbf{K}^{\top} \mathbf{A} \bigr) + \lambda \operatorname{tr} \bigl( \mathbf{A}^{\top} \bar{\mathbf{K}} \mathbf{A} \bigr) , \quad\text{s.t. } \frac{1}{n} \mathbf{A}^{\top} \mathbf{K} \mathbf{K}^{\top} \mathbf{A} = \mathbf{I} , $$
(13)

where \(\bar{K}_{ij} = \kappa(\boldsymbol{z}_{i}, \boldsymbol{z}_{j})\) is the kernel matrix of the support vectors.

3.2 The RSK-SFA algorithm

The RSK-SFA algorithm (Algorithm 1) is closely related to the linear SFA algorithm of Wiskott and Sejnowski (2002). It consists of three phases: (1) fulfilling zero mean by centering, (2) fulfilling unit variance and decorrelation by sphering and (3) minimizing the objective by rotation.

Algorithm 1

Regularized sparse kernel slow feature analysis (RSK-SFA)

Zero mean

To fulfill the zero mean constraint, one centers the support functions \(\{g_{i}\}_{i=1}^{m} \subset \mathcal{H}\) w.r.t. the data distribution, i.e. \(g_{i}(\cdot) := \kappa(\cdot, \boldsymbol{z}_{i}) - \mathbb{E}_{t}[\kappa(\boldsymbol{x}_{t}, \boldsymbol{z}_{i})] \boldsymbol{1}_{\mathcal {H}}(\cdot)\), where \(\forall\boldsymbol{x} \in \mathcal{X}: \boldsymbol{1}_{\mathcal{H}}(\boldsymbol{x}) = \langle\boldsymbol{1}_{\mathcal{H}} , \kappa(\cdot,\boldsymbol{x}) \rangle_{\mathcal{H}} := 1\) is the constant function in Hilbert space \(\mathcal{H}\). Although \(\forall\phi\in\mathbf{span}(\{g_{i}\}_{i=1}^{m}): \mathbb{E}_{t}[\phi(\boldsymbol{x}_{t})] = 0\) already holds, it is advantageous to also center the support functions w.r.t. each other (Schölkopf et al. 1998), i.e. \(\hat{g}_{i} := g_{i} - \mathbb{E}_{j}[g_{j}]\). The resulting transformation of support functions on the training data can be applied directly onto the kernel matrices K and \(\bar{\mathbf{K}}\):

$$ \hat{\mathbf{K}} := \biggl( \mathbf{I} - \frac{1}{m} \boldsymbol{1}_m \boldsymbol{1}_m^{\top} \biggr) \mathbf{K} \biggl( \mathbf{I} - \frac{1}{n} \boldsymbol{1}_n \boldsymbol{1}_n^{\top} \biggr) , $$
(14)
(15)

where \(\boldsymbol{1}_m\) and \(\boldsymbol{1}_n\) are one-vectors of dimensionality m and n, respectively.

Unit variance and decorrelation

Analogous to linear SFA, we first project into the normalized eigenspace of \(\frac{1}{n}\hat{\mathbf{K}} \hat{\mathbf{K}}^{\top}=: \mathbf{U} \boldsymbol{\varLambda} \mathbf{U}^{\top}\). This procedure is called sphering or whitening and fulfills the constraint in (13), invariant to further rotations \(\mathbf{R} \in\mathbb{R}^{m \times p}\) with \(\mathbf{R}^{\top}\mathbf{R} = \mathbf{I}\).

$$ \mathbf{A} := \mathbf{U} \boldsymbol{\varLambda}^{-\frac{1}{2}} \mathbf{R} \quad\Rightarrow\quad {\frac{1}{n}} \mathbf{A}^{\top} \hat{\mathbf{K}} \hat{\mathbf{K}}^{\top} \mathbf{A} = \mathbf{R}^{\top}\mathbf{R} = \mathbf{I}. $$
(16)

Note that an inversion of the diagonal matrix Λ requires the removal of zero eigenvalues and corresponding eigenvectors first.

Minimization of the objective

Application of (16) allows us to solve (13) with a second eigenvalue decomposition:

$$ \boldsymbol{\varLambda}^{-\frac{1}{2}} \mathbf{U}^{\top} \biggl( \frac{1}{n-1} \hat{\mathbf{K}} \mathbf{D} \mathbf{D}^{\top} \hat{\mathbf{K}}^{\top} + \lambda \bar{\mathbf{K}} \biggr) \mathbf{U} \boldsymbol{\varLambda}^{-\frac{1}{2}} =: \mathbf{W} \boldsymbol{\varSigma} \mathbf{W}^{\top} . $$
(17)

Note that R is composed of the eigenvectors to the p smallest eigenvalues of the above expression.

Solution

After the above calculations the i’th RSK-SFA solution is

$$ \phi_i(\cdot) = \sum_{j=1}^m A_{ji}\, \hat{g}_j(\cdot) , \quad\text{with } \mathbf{A} := \mathbf{U} \boldsymbol{\varLambda}^{-\frac{1}{2}} \mathbf{R} \text{ from (16) and (17)}. $$
(18)

Grouping the kernel functions of all support vectors together in a column vector, i.e. \(\boldsymbol{k}(\boldsymbol{x}) := [\kappa(\boldsymbol{z}_1,\boldsymbol{x}), \ldots, \kappa(\boldsymbol{z}_m,\boldsymbol{x})]^{\top}\), the combined solution \(\boldsymbol{\phi}(\cdot) \in\mathcal{H}^{p}\) can be expressed more compactly:

$$ \boldsymbol{\phi}(\boldsymbol{x}) = \mathbf{A}^{\top} \biggl( \mathbf{I} - \frac{1}{m} \boldsymbol{1}_m \boldsymbol{1}_m^{\top} \biggr) \biggl( \boldsymbol{k}(\boldsymbol{x}) - \frac{1}{n} \mathbf{K} \boldsymbol{1}_n \biggr) . $$
(19)

The computational complexity is \(\mathcal{O}(m^{2} n)\). For illustrative purposes, Algorithm 1 exhibits a memory complexity of \(\mathcal{O}(mn)\). An online calculation of \(\hat{\mathbf{K}} \hat{\mathbf{K}}^{\top}\) and \(\hat{\mathbf{K}} \mathbf{D} \mathbf{D}^{\top}\hat{\mathbf{K}}^{\top}\) reduces this to \(\mathcal{O}(m^{2})\).
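
The following numpy sketch summarizes the three phases for reference. It is a simplified rendering of Algorithm 1, not the authors' code: the regularizer is applied to the uncentered support-vector kernel matrix \(\bar{\mathbf{K}}\), zero eigenvalues are removed by a fixed threshold, and all function names are ours.

```python
import numpy as np

def rsk_sfa(K, Kbar, p, lam=0.0):
    """Regularized sparse kernel SFA (a sketch of Algorithm 1).

    K    : (m, n) sparse kernel matrix K_ij = kappa(z_i, x_j).
    Kbar : (m, m) kernel matrix of the support vectors.
    p    : number of slow features, lam : regularization parameter lambda.
    Returns A (m, p) and the data mean needed to evaluate (19).
    """
    m, n = K.shape
    # (1) zero mean: center support functions w.r.t. the data and w.r.t. each other
    k_mean = K.mean(axis=1)                                # E_t[kappa(x_t, z_i)]
    Khat = K - k_mean[:, None]                             # center w.r.t. the data distribution
    Khat -= Khat.mean(axis=0, keepdims=True)               # center support functions w.r.t. each other
    # (2) unit variance and decorrelation: sphere w.r.t. (1/n) Khat Khat^T, cf. (16)
    Lam, U = np.linalg.eigh(Khat @ Khat.T / n)
    keep = Lam > 1e-12                                     # drop zero eigenvalues
    S = U[:, keep] / np.sqrt(Lam[keep])                    # sphering matrix U Lambda^{-1/2}
    # (3) minimize the regularized slowness objective by a rotation R, cf. (17)
    D = np.zeros((n, n - 1))                               # temporal derivation matrix
    D[np.arange(n - 1), np.arange(n - 1)] = -1.0
    D[np.arange(1, n), np.arange(n - 1)] = 1.0
    obj = Khat @ D @ D.T @ Khat.T / (n - 1) + lam * Kbar
    _, R = np.linalg.eigh(S.T @ obj @ S)
    A = S @ R[:, :p]                                       # eigenvectors of the p smallest eigenvalues
    return A, k_mean

def apply_features(A, k_mean, k_x):
    """Evaluate phi(x) for a kernel vector k_x = [kappa(z_1,x), ..., kappa(z_m,x)], cf. (19)."""
    g = k_x - k_mean                                       # center w.r.t. the data distribution
    g -= g.mean()                                          # center w.r.t. the support functions
    return A.T @ g
```

Keeping only the accumulated matrices \(\hat{\mathbf{K}}\hat{\mathbf{K}}^{\top}\) and \(\hat{\mathbf{K}}\mathbf{D}\mathbf{D}^{\top}\hat{\mathbf{K}}^{\top}\) instead of \(\hat{\mathbf{K}}\) itself yields the \(\mathcal{O}(m^2)\) memory variant mentioned above.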

4 Sparse subset selection

Representer theorems guarantee that the target function \(\phi^{*} \in\mathcal{H}\) for training set \(\{\boldsymbol{x}_{t}\}_{t=1}^{n}\) can be found within \(\mathrm{span}(\{\kappa(\cdot,\boldsymbol{x}_{t})\}_{t=1}^{n})\). For sparse K-SFA, however, no such guarantee exists. The quality of such a sparse approximation depends exclusively on the set of support vectors \(\{\boldsymbol{z}_{i}\}_{i=1}^{m}\).

Without restriction on \(\phi^{*}\), it is straightforward to select a subset of the training data, indicated by an index vector \(\boldsymbol{i} \in\mathbb{N}^{m}\) with \(\{\boldsymbol{z}_{j}\}_{j=1}^{m} := \{\boldsymbol{x}_{\boldsymbol{i}_{j}}\}_{j=1}^{m}\), that minimizes the approximation error

$$ \epsilon^{\boldsymbol{i}}_t := \min_{\boldsymbol{a} \in\mathbb{R}^m} \Bigl\| \kappa(\cdot, \boldsymbol{x}_t) - \sum_{j=1}^m a_j \kappa(\cdot, \boldsymbol{z}_j) \Bigr\|_{\mathcal{H}}^2 = K_{tt} - \boldsymbol{K}_{t \boldsymbol{i}} (\mathbf{K}_{\boldsymbol{i} \boldsymbol{i}})^{-1} \boldsymbol{K}_{\boldsymbol{i} t} $$

for all training samples \(\boldsymbol{x}_{t}\), where \(K_{tj}=\kappa(\boldsymbol{x}_{t},\boldsymbol{x}_{j})\) is the full kernel matrix. Finding an optimal subset is an NP-hard combinatorial problem, but there exist several greedy approximations to it.

Online maximization of the affine hull

A widely used algorithm (Csató and Opper 2002), which we will call online maximization of the affine hull (online MAH) in the absence of a generally accepted name, iterates through the data in an online fashion. At time t, sample \(\boldsymbol{x}_t\) is added to the selected subset if \(\epsilon_{t}^{\boldsymbol{i}}\) is larger than some given threshold η. Exploitation of the matrix inversion lemma (MIL) allows an online algorithm with computational complexity \(\mathcal{O}(m^{2} n)\) and memory complexity \(\mathcal{O}(m^{2})\). The downside of this approach is the unpredictable dependence of the final subset size m on the hyper-parameter η: changing the subset size therefore requires a complete re-computation with a different η, and the resulting size, although monotonically dependent on η, cannot be predicted in advance.

Matching pursuit for sparse kernel PCA

This handicap is addressed by matching pursuit methods (Mallat and Zhang 1993). Applied to kernels, a criterion selects the best-fitting sample, followed by an orthogonalization of all remaining candidate support functions in the Hilbert space \(\mathcal{H}\). A resulting sequence of m selected samples therefore contains all selected sequences of length up to m as well. The batch algorithm of Smola and Schölkopf (2000) chooses the sample \(\boldsymbol{x}_j\) that minimizes \(\mathbb{E}_{t} [\epsilon_{t}^{\boldsymbol{i} \cup j} ]\). It was shown later that this algorithm performs sparse PCA in \(\mathcal{H}\) (Hussain and Shawe-Taylor 2008). The algorithm, which we will call matching pursuit for sparse kernel PCA (MP KPCA) in the following, has a computational complexity of \(\mathcal{O}(n^{2}m)\) and a memory complexity of \(\mathcal{O}(n^{2})\). In practice it is therefore not applicable to large data sets.

Algorithm 2

Matching pursuit maximization of the affine hull (MP MAH)

4.1 Matching pursuit for online MAH

The ability to change the size of the selected subset without re-computation is a powerful property of MP KPCA. Run-time and memory consumption, however, make this algorithm infeasible for most applications. To extend the desired property to the fast online algorithm, we derive a novel matching pursuit for online MAH algorithm (MP MAH). Online MAH selects samples with approximation errors that exceed the threshold η and therefore forces the supremum norm \(L_\infty\) of the approximation errors of all samples below η. This is analogous to a successive selection of the worst approximated sample, until the approximation error of all samples drops below η. The matching pursuit approach therefore minimizes the supremum norm \(L_\infty\) of the approximation error. At iteration j, given the current subset \(\boldsymbol{i} \in\mathbb{N}^{j}\),

$$ \boldsymbol{i}_{j+1} := \operatorname{argmin}_t \bigl\| \bigl[ \epsilon^{\boldsymbol{i} \cup t}_1, \ldots, \epsilon^{\boldsymbol{i} \cup t}_n \bigr] \bigr\|_\infty \approx\operatorname{argmax}_t \epsilon^{\boldsymbol{i}}_t. $$
(20)

Straightforward re-computation of the approximation error in every iteration is expensive. Using the matrix inversion lemma (MIL), this computation can be performed iteratively:

$$ \epsilon^{\boldsymbol{i} \cup j}_t = \epsilon^{\boldsymbol{i}}_t - \frac{1}{\epsilon^{\boldsymbol{i}}_j} \bigl(K_{tj} - \boldsymbol{K}_{t \boldsymbol{i}} (\mathbf{K}_{\boldsymbol{i} \boldsymbol{i}})^{-1} \boldsymbol{K}_{\boldsymbol{i} j} \bigr)^2 . $$
(21)

Algorithm 2 iterates between sample selection (20) and error update (21). The complexity is \(\mathcal{O}(m^{2}n)\) in time and \(\mathcal{O}(mn)\) in memory (to avoid re-computation of the columns \(\boldsymbol{K}_{t\boldsymbol{i}}\)).
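
A compact sketch of the selection loop defined by (20) and (21). For clarity it re-inverts \(\mathbf{K}_{\boldsymbol{i}\boldsymbol{i}}\) in every iteration instead of tracking it with the matrix inversion lemma, so it is slower than Algorithm 2 but selects the same subset; the function name mp_mah is ours.

```python
import numpy as np

def mp_mah(K_full, m):
    """Matching pursuit maximization of the affine hull (sketch of Algorithm 2).

    K_full : (n, n) full kernel matrix K_tj = kappa(x_t, x_j).
    m      : number of support vectors to select.
    Returns the list of selected training indices.
    """
    eps = np.diag(K_full).astype(float).copy()   # eps_t for the empty subset
    selected = []
    for _ in range(m):
        j = int(np.argmax(eps))                  # worst approximated sample, cf. (20)
        if eps[j] <= 1e-12:
            break                                # every sample is already well represented
        if selected:
            # residual K_tj - K_{t,i} (K_{i,i})^{-1} K_{i,j}, recomputed directly here
            Kii_inv = np.linalg.inv(K_full[np.ix_(selected, selected)])
            proj = K_full[:, j] - K_full[:, selected] @ (Kii_inv @ K_full[selected, j])
        else:
            proj = K_full[:, j]
        eps = eps - proj ** 2 / eps[j]           # error update, cf. (21)
        selected.append(j)
    return selected
```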

5 Empirical validation

In this section we evaluate, on a two-vowel classification task, the hypothesis that RSK-SFA generates a feature space Φ which resembles a Fourier basis in the space of latent variables Θ of a given time series.

The true Θ is not known and we use the performance of different linear classification algorithms to measure how well RSK-SFA features encode Θ. For comparison, all algorithms are tested on feature spaces generated by RSK-SFA and sparse kernel PCA (Schölkopf et al. 1997). Results in Sect. 5.4 show that RSK-SFA features indeed encode Θ, whereas kernel PCA does not aim for such an encoding. We also demonstrate the importance of sparse subset selection to an efficient encoding. Last but not least, a classification accuracy of more than 97 % in a task that is not linearly solvable demonstrates the excellent performance of our approach.

5.1 Benchmark data sets

The “north Texas vowel database” contains uncompressed audio files with English words of the form h…d, spoken multiple times by multiple persons (Assmann et al. 2008). The natural task is to predict the central vowels of unseen instances of a trained word. To cover both cases of small and large training sets, we selected two data sets: (i) A small set with four training and four test instances for each of the words “heed” and “head”, spoken by the same person. (ii) A large data set containing all instances of “heed” and “head” for all 18 adult subjects. To apply cross-validation, we performed 20 random splits (folds) into 12 training and 6 test subjects, leading to 245±10 training instances and 116±10 test instances.

The spoken words are provided as mono audio streams of varying length at 48 kHz, i.e. as a series of amplitude readings \(\{a_1, a_2, \ldots\}\). SFA requires the space of latent variables Θ to be embedded in the space of observations \(\mathcal{X}\), which is not the case for one-dimensional data. The problem resides in the ambiguity of the observed amplitude readings \(a_t\), i.e. mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}\) cannot distinguish between two latent states which both generate the same observed amplitude \(a_t\). However, Takens' theorem (Takens 1981; Huke 2006) guarantees an embedding of Θ as a manifold in the space \(\mathcal{X} \subset\mathbb{R}^{l}\) of sufficiently many time-delayed observations. Based on the observed pattern of l time-delayed amplitude readings, non-linear SFA and PCA can utilize the resulting one-to-one mapping of latent variables in Θ onto a manifold of observations in \(\mathcal{X}\) to extract basis functions \(\phi_i(\cdot)\) that encode Θ. We therefore defined our samples as \(\boldsymbol{x}_t := [a_{\delta t}, a_{\delta t+\epsilon}, a_{\delta t+2\epsilon}, \ldots, a_{\delta t+(l-1)\epsilon}]\), which is also called a sliding window. We evaluated the parameters δ, ϵ and l empirically and chose δ=50, ϵ=5 and l=500. All algorithms were trained and tested with the joint time series \(\{\boldsymbol{x}_t\}\) of all instances of all subjects in the respective fold. Note, however, that all samples of each word instance were used exclusively for either training or testing. Additionally, in the large data set all instances of one subject were used exclusively for one of the two as well. This ensures that we really test for generalization to unseen instances of the trained words as well as to previously unseen subjects. The above procedure provided us with 3719 (4108) training (test) samples \(\boldsymbol{x}_t \in [-1,1]^{500}\) for the small data set and 97060±4615 (46142±4614) training (test) samples for the large data set.
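
A small helper (ours, not part of the paper) that builds the sliding-window samples from one amplitude stream under the chosen parameters δ=50, ϵ=5 and l=500:

```python
import numpy as np

def sliding_window(a, delta=50, eps=5, l=500):
    """Time-delay embedding x_t := [a_{dt}, a_{dt+e}, ..., a_{dt+(l-1)e}].

    a : 1-d sequence of audio amplitude readings of one word instance.
    Returns an (n, l) array of embedded samples.
    """
    a = np.asarray(a, dtype=float)
    span = (l - 1) * eps                      # amplitude readings covered by one window
    n = (len(a) - span - 1) // delta + 1      # number of complete windows
    idx = delta * np.arange(n)[:, None] + eps * np.arange(l)[None, :]
    return a[idx]
```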

5.2 RSK-SFA performance

We start our analysis with the RSK-SFA solution for different Gaussian kernels, i.e. \(\kappa(\boldsymbol{x}, \boldsymbol{x}') = \exp (- \frac{1}{2\sigma^{2}} {\|\boldsymbol{x} - \boldsymbol{x}'\|_{2}^{2}} )\) with varying σ. To ensure that meaningful information is extracted, we measure the test slowness, i.e. the slowness of the learned feature mappings applied on a previously unseen test sequence drawn from the same distribution. Small values of σ, however, grossly underestimate a feature's output on unseen test samples, as distances are strongly amplified, which changes the feature's slowness. For comparison we normalized all outputs to unit variance on the test set before measuring the test slowness.

Figure 1 shows the mean test slowness of 200 RSK-SFA features on both sets for multiple kernel parameters σ and regularization parameters λ. The small data set (a) uses the complete training set as support vectors, whereas for the large data set (b) a full kernel approach is not feasible. Instead we selected a subset of size 2500 with the MP MAH algorithm (based on kernel parameter σ=5) before training.

Fig. 1

Mean test slowness of 200 RSK-SFA features over varying kernel parameter σ for different regularization parameters λ: (a) small audio data set with all m=n=3719 training samples as support vectors, (b) large audio data set with m=2500 support vectors selected out of n=97060±4615 training samples by MP MAH. The small plots show the variance of the respective training output. Significant deviation from one violates the unit variance constraint and thus demonstrates numerical instability. The legend applies to all plots

In the absence of significant sparseness (Fig. 1a), unregularized kernel SFA (λ=0, equivalent to K-SFA, see (11) and (12)) shows both over-fitting and numerical instability. Over-fitting can be seen for small σ, where the features fulfill the unit variance constraint (small plot) but do not reach the minimal test slowness (main plot). The bad performance for larger σ, on the other hand, must be attributed to numerical instability, as indicated by a significantly violated unit variance constraint. Both can be counteracted by proper regularization. Although optimal regularization parameters λ are quite small and can approach machine precision, there is a wide range of kernel parameters σ for which the same minimal test slowness is reachable. In Fig. 1a, for example, a fitting λ can be found for every σ between 0.5 and 20.

The more common case, depicted in Fig. 1b, is a large training set from which a small subset is selected by MP MAH. Here no regularization is necessary and unregularized sparse kernel SFA (λ=0) learns mappings of minimal slowness in the range from σ=1 to far beyond σ=500. Notice that this renders a time-consuming search for an optimal parameter σ obsolete.

5.3 Sparse subset selection

To evaluate the behavior of RSK-SFA for sparse subsets of different size, Fig. 2 plots the test slowness on the small data set for all sparse subset selection algorithms discussed in Sect. 4. As a baseline, we plotted mean and standard deviation of a random selection scheme. One can observe that all algorithms surpass the random selection significantly but do not differ much from each other. As expected, the MP MAH and Online MAH algorithms perform virtually identically. The novel MP MAH, however, allows fast and unproblematic fine-tuning of the selected subset's size.

Fig. 2

Mean test slowness of 200 RSK-SFA features of the small audio data set (\(\lambda=10^{-7}\), \(\sigma=2\)) over sparse subset size, selected by different algorithms. For random selection the mean of 10 trials is plotted with standard deviation error bars

5.4 Classification performance

To demonstrate that Φ encodes Θ, we compared the accuracy of linear classification algorithms (see Sect. 2) in RSK-SFA and SK-PCA feature spaces of different sizes. Sparse kernel PCA optimizes the PCA objective (maximal variance) on the same function class as RSK-SFA, i.e. with the same kernel and support vectors. Because some iterative algorithms are sensitive to scaling, we scaled the SK-PCA features to unit variance on the test set, to make them fully comparable with RSK-SFA features.

In the evaluated two-class problem (“head” vs. “heed”), class labels are only available for whole words. Individual samples \(\boldsymbol{x}_t\), however, do not necessarily belong to a vowel, but might contain one of the consonants at the beginning or end of the word. Many samples will therefore be ambiguously labeled, which the classifier must interpret as noise. When all samples of a word are classified, the final judgment about the spoken vowel is determined by the sum over all discrimination function outputs, \(\sum_{t=1}^{n} f(\boldsymbol{x}_{t})\), rather than over the individual class predictions. This takes the classifier's certainty into account and yields much better results. It is also equivalent to the empirical temporal mean of the discrimination function output over the whole word.
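
As a minimal illustration (ours), the word-level decision then reduces to the sign of the summed discriminant outputs:

```python
import numpy as np

def classify_word(w, Phi_word):
    """Vowel decision for one word from the summed discriminant outputs.

    w        : (p,) weight vector of a trained two-class discriminant f.
    Phi_word : (n_t, p) feature values of all samples x_t of the word.
    """
    return int(np.sign(np.sum(Phi_word @ w)))   # equals the sign of the temporal mean of f
```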

This shift from training each sample to evaluating only the mean over all samples might look like a completely different problem at first. One could rightfully demand to train the classifiers on the mean feature vector \(\bar{\boldsymbol{\phi}} := \mathbb{E}_{t}[\boldsymbol{\phi}(\boldsymbol{x}_{t})]\) of each word instead of all individual samples thereof. Nonetheless, the authors chose the presented methodology to compare SFA and PCA feature spaces because

  1.

    a training set of only 245±10 word means must lead to over-fitting in high dimensional feature spaces.

  2.

    classification of individual samples \(\boldsymbol{x}_t\) must always result in very poor accuracy, as most of them do not encode the vowel and will thus yield only noise.

  3.

    \(\bar{\boldsymbol{\phi}}\) lives in the same feature space as the individual samples and should be affected by the same insufficiencies of that feature space Φ.

Figure 3 shows mean and standard deviation of the test accuracy over 20 random splits of the large data set into training and test set (see Sect. 5.1) against the number of RSK-SFA or SK-PCA features. Both feature spaces are generated using the same support vectors selected at random or by the novel MP MAH algorithm. To demonstrate that this is not a trivial problem, the algorithms’ performance on the original 500 dimensional delayed amplitude space is given as a dashed line.

Fig. 3

Accuracy of vowel detection on the unseen test set against feature space size for all discussed linear classification algorithms (a–c). Each algorithm was tested on feature spaces generated by RSK-SFA (λ=0, σ=5) and SK-PCA with 2500 support vectors selected at random or with MP MAH. As a comparison the accuracy on the original 500 dimensional space of delayed audio amplitudes is given as well (dashed line). All accuracies are plotted with mean and standard deviation for 20 random splits into training and test sets

It is apparent that in all cases the RSK-SFA feature space is superior to SK-PCA for all but 2500 features. At this point both feature spaces are identical up to rotation and any linear algorithm must therefore perform identically. Moreover, RSK-SFA features based on support vectors selected by MP MAH clearly outperform those based on randomly selected support vectors in all cases. In a comparison between algorithms, perceptron classification (Fig. 3c) yields the highest accuracy with the fewest features, i.e. more than 97 % with 256 RSK-SFA features. More features do not significantly raise the performance. This is particularly interesting because the training does not require a matrix inverse and is therefore the fastest of the evaluated algorithms in large feature spaces. It is also worth mentioning that 8 (32) RSK-SFA features reach an accuracy of 80 % (90 %), whereas the first 64 SK-PCA features apparently do not encode any information about Θ. Regularized logistic regression (Fig. 3b) reaches comparable accuracy (around 95 % with 256 or more features) but exhibits higher variance between folds. The poor performance of LS/LDA/FDA classification (Fig. 3a) must stem from the violated assumption of equal class covariances (see Sect. 2). Note, however, that all above comparisons between feature spaces and support vector selection schemes still hold.

These results demonstrate that, in contrast to SK-PCA, RSK-SFA feature spaces Φ aim to encode Θ. The high accuracy of over 97 % establishes that either (i) the problem is almost linearly separable in Θ or (ii) Φ resembles a functional basis in Θ, presumably a Fourier basis. At this point in time it is not possible to distinguish between (i) and (ii). However, the large number of required features could be interpreted as evidence for (ii).

We therefore conclude that RSK-SFA features encode the space of latent variables Θ of a given time series, presumably by resembling a Fourier basis of Θ.

6 Discussion

This article investigates the hypothesis that sufficiently powerful non-linear slow feature analysis features resemble a Fourier basis in the space of latent variables Θ, which underlies complex real-world time series. To perform a powerful but easily operated non-linear SFA, we derived a kernelized SFA algorithm (RSK-SFA). The novel algorithm is capable of handling small data sets by regularization and large data sets through sparsity. To select a sparse subset for the latter, we developed a matching pursuit approach to a widely used algorithm (MP MAH). In combination with linear classification algorithms, particularly the perceptron algorithm, our results support the hypothesis’ validity and demonstrate excellent performance.

6.1 Comparison to previous works

The major advancement of our approach over the kernel SFA algorithm of Bray and Martinez (2002) is the ability to obtain features that generalize well for small data sets. If one is forced to use a large proportion of the training set as support vectors, e.g. for small training sets of complex data, the solution can violate the unit variance constraint.

As suggested by Bray and Martinez (2002), our experiments show that for large data sets no explicit regularization is needed. The implicit regularization introduced by sparsity is sufficient to generate features that generalize well over a wide range of Gaussian kernels. However, our work documents the dependence of the algorithm's performance on the sparse subset for the first time. In this setting, the subset size m takes on the role of the regularization parameter λ. It is therefore imperative to control m with minimal computational overhead.

We compared two state-of-the-art algorithms that select sparse subsets in polynomial time. Online MAH is well suited to process large data sets, but selects subsets of unpredictable size. A change of m therefore requires a full re-computation without the ability to target a specific size. Matching pursuit for sparse kernel PCA (MP KPCA), on the other hand, returns an ordered list of selected samples. Increasing the subset's size simply requires another iteration of the algorithm. The downside is a quadratic dependency on the training set size, both in time and memory. Both algorithms exhibited similar performance and significantly outperformed a random selection scheme. The subsets selected by the novel matching pursuit approach to online MAH (MP MAH) yielded virtually the same performance as those selected by Online MAH. There is no difference in computation time, but the memory complexity of MP MAH depends linearly on the training set size. However, increasing m works just as with MP KPCA, which makes this algorithm the better choice if one can afford the memory. If not, Online MAH can be applied several times with slowly decreasing hyper-parameter η. Although a subset of suitable size will eventually be found, this approach takes much more time than MP MAH.

6.2 Hyper-parameter selection

For a first assessment of the feature space generated by RSK-SFA, selecting the support vectors randomly is sufficient. For optimal performance, however, one should select the sparse subset with MP MAH or Online MAH, depending on available resources and time (see last section). As shown in Fig. 1b, selecting the Gaussian kernel parameter σ is not an issue in face of sufficient sparsity. Empirically, setting σ such that more than half of all kernel outputs are above 0.5 has yielded sufficient performance in most cases.
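
One hypothetical way to implement this rule of thumb (our suggestion, not the paper's procedure) is to estimate the pairwise distance distribution from random sample pairs and invert the Gaussian kernel at the value 0.5; choosing the median distance makes roughly half of all kernel outputs exceed 0.5, and a slightly larger σ pushes it above one half.

```python
import numpy as np

def sigma_heuristic(X, n_pairs=10000, seed=0):
    """Gaussian kernel width such that about half of all kernel outputs exceed 0.5.

    X : (n, l) array of training samples.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    d = np.linalg.norm(X[i] - X[j], axis=1)
    d = d[d > 0]                                   # ignore accidental identical pairs
    # exp(-d^2 / (2 sigma^2)) >= 0.5  <=>  sigma >= d / sqrt(2 ln 2)
    return np.median(d) / np.sqrt(2.0 * np.log(2.0))
```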

Before raising the regularization parameter λ above 0, it is advisable to check the unit variance constraint on the training data. Particularly if the kernel cannot be adjusted, a violation can be compensated by slowly raising λ until the variance is sufficiently close to one. A numerically stable solution is no guarantee for optimal slowness, though. It is therefore always advisable to track the test slowness on an independent test set. If over-fitting is suspected, try shrinking the sparse subset size m, raising λ, or increasing the kernel width σ, in this order.

6.3 Limitations

The methodology developed in this article will work in most standard machine learning scenarios involving time series. Although they are not unique to our method, certain limitations are worth mentioning.

  1.

    Our method of generating feature spaces Φ is based on the assumption of an underlying space of latent variables Θ. This implies that the generative process \(\varTheta\to\mathcal{X}\) is stationary over time. Preliminary experiments on EEG data, which is known to be non-stationary, yielded no generalizing features (not shown).

  2.

    The generative mapping must be unique. Two elements of Θ which are reliably mapped onto the same element of \(\mathcal{X}\) are not distinguishable in Φ.

  3.

    As the exact nature of Θ is ambiguous, there can be arbitrarily many “unwanted” latent variables, which raises three problems: (i) If the slowest variables are not relevant for the task, they will appear as noise. (ii) The size of a Fourier basis Φ of Θ grows exponentially in the dimensionality of Θ. (iii) Generating a reliable feature space Φ requires training samples from all regions of Θ, which could require infeasible amounts of training data if Θ is high dimensional.

6.4 Including label or importance information

Applied to real-world time series, RSK-SFA features resemble a Fourier basis in Θ. We want to discuss some possible modifications that use label or importance information to shape the generated feature spaces.

  1.

    SFA exploits the temporal succession of samples and thus can only be applied to time series. One could, however, use other available information to modify the approach without much change to the methodology. For example, if one is faced with labeled iid data, Algorithm 1 can be modified with a different objective that minimizes the distance between all samples of each class:

    $$ \min_{\boldsymbol{\phi}\in\mathcal{H}^p} \sum_{k=1}^p \mathbb{E}_i \bigl[ \mathbb{E}_j \bigl[ \delta_{c_i c_j} \bigl(\phi_k(\boldsymbol{x}_i) - \phi_k(\boldsymbol{x}_j) \bigr)^2 \bigr] \bigr] , $$
    (22)

    where \(\delta_{c_{i} c_{j}}\) is the Kronecker delta, which is 1 if \(c_i = c_j\) and 0 otherwise. The two nested expectations induce a quadratic dependency on the training set size n, but can be approximated by randomly drawing sample pairs from the same class. The resulting feature space Φ will not resemble a functional basis in Θ, but reflect a between-classes metric by mapping same-class samples close to each other. Preliminary experiments (not shown) have generated promising feature spaces for linear classification. This approach has also been taken by Berkes (2005), who used SFA on two-step “time series” of same-class samples of hand-written digits.

  2.

    The individual components (or dimensions) of Θ are encoded according to their slowness. This can result in an inefficient encoding if the slowest changing components of Θ are not of any use to the task. It is therefore of interest whether one can use additional information about the samples to restrict Φ to encode only a relevant subspace of Θ, at least in the first p features. If one can define an importance measure \(p(\boldsymbol{x}_{t+1}, \boldsymbol{x}_{t})>0\) with high values for transitions within the desired subspace, the slowness objective in (6) can be modified to

    $$ \min s'(\phi_i) := \frac{1}{n-1} \sum _{t=1}^{n-1} \frac{ (\phi_i(\boldsymbol{x}_{t+1}) - \phi_i(\boldsymbol{x}_t) )^2}{p(\boldsymbol{x}_{t+1}, \boldsymbol{x}_t)} . $$
    (23)

    As a result, the relative slowness of important transitions is reduced and the respective subspace of Θ will be encoded earlier. Increasing the importance will eventually restrict the first p features extracted by the modified algorithm to the subspace spanned by the important transitions (a small code sketch of this weighted objective follows after the list).

  3.

    If only certain subsets of Θ are of consequence to the task, another modification can improve the encoding by including additional information. If the samples can be labeled according to another importance measure q(x)≥0, which marks the important (q(x)≫0) and unimportant samples (q(x)≈0), the unit variance and decorrelation constraints of (11) can be modified, i.e.

    $$ \mathbb{E}_t \bigl[ \boldsymbol{\phi}(\boldsymbol{x}_t) q(\boldsymbol{x}_t) \boldsymbol{\phi}(\boldsymbol{x}_t)^{\top} \bigr] = \mathbf{I} , $$
    (24)

    which would enforce unit variance on the important samples only. Slowness still applies to all samples equally and the resulting feature space Φ would map samples of low importance onto each other. Given a powerful enough function class, the modified algorithm should construct a Fourier basis in Θ on the subset of important samples only. For example, if we knew which samples \(\boldsymbol{x}_t\) contain a vowel (but not which vowel it is), the resulting feature space Φ should exclusively encode the subset of Θ that represents the vowels. When this subset is relatively small, the resulting feature space Φ will approximate the same task with far fewer basis functions.
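
As announced in the second item above, here is a direct transcription of the importance-weighted slowness (23) for a single feature; the helper name and argument layout are ours.

```python
import numpy as np

def weighted_slowness(phi_vals, importance):
    """Importance-weighted slowness s'(phi_i) of one feature, cf. (23).

    phi_vals   : (n,) outputs phi_i(x_t) along the time series.
    importance : (n-1,) weights p(x_{t+1}, x_t) > 0, one per transition.
    """
    diffs = np.diff(phi_vals)                 # phi_i(x_{t+1}) - phi_i(x_t)
    return np.mean(diffs ** 2 / importance)
```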

6.5 Relationship to deep belief networks

The sound performance of the perceptron algorithm in combination with non-linear SFA suggests an interesting connection to deep belief networks (Hinton and Salakhutdinov 2006). It has long been known that multilayer perceptrons (MLP) are so sensitive to initialization that it is virtually impossible to learn a deep architecture (i.e. many layers) from any random initialization thereof (Haykin 1999). Hinton and Osindero (2006) came up with the idea of training the first layer as an auto-encoder, i.e. to predict its own inputs via a hidden layer of perceptrons. The trained auto-encoder is then fixed and the hidden layer used as the input of the next layer, which is trained in the same way. The resulting deep architecture is a generative model of the data and has proven to be a suitable initialization for many problems.

In line with this article's hypothesis, however, we assume that the target function is defined over a space of latent variables Θ. Layer-wise training of an MLP version of SFA might therefore yield an even better initialization, as it already spans a suitable function space rather than simply encoding the data (see Franzius et al. 2007 for an example of a hierarchical SFA). However, repeated application of non-linear SFA, even with the limited function class of one perceptron per feature, sooner or later leads to the problem of over-fitting. Preliminary experiments with repeated application of quadratic SFA lost any generalization ability within a few layers. Without an automatic way to regularize the solution in each layer, this approach will thus not yield a reliable initialization for classical MLP backpropagation (Haykin 1999). Future research must show whether a properly regularized MLP-SFA algorithm can be used to initialize an MLP and whether or not this approach can compete with deep belief networks based on auto-encoders.