Generating feature spaces for linear algorithms with regularized sparse kernel slow feature analysis
Abstract
Without nonlinear basis functions many problems cannot be solved by linear algorithms. This article proposes a method to automatically construct such basis functions with slow feature analysis (SFA). Nonlinear optimization of this unsupervised learning method generates an orthogonal basis on the unknown latent space for a given time series. In contrast to methods like PCA, SFA is thus well suited for techniques that make direct use of the latent space. Real-world time series can be complex, and current SFA algorithms are either not powerful enough or tend to overfit. We make use of the kernel trick in combination with sparsification to develop a kernelized SFA algorithm that provides a powerful function class for large data sets. Sparsity is achieved by a novel matching pursuit approach that can be applied to other tasks as well. For small data sets, however, the kernel SFA approach leads to overfitting and numerical instabilities. To enforce a stable solution, we introduce regularization to the SFA objective. We hypothesize that our algorithm generates a feature space that resembles a Fourier basis in the unknown space of latent variables underlying a given real-world time series. We evaluate this hypothesis on a vowel classification task, in comparison to sparse kernel PCA. Our results show excellent classification accuracy and demonstrate the superiority of kernel SFA over kernel PCA in encoding latent variables.
Keywords
Time series · Latent variables · Unsupervised learning · Slow feature analysis · Sparse kernel methods · Linear classification
1 Introduction
This article is concerned with the automatic construction of nonlinear basis functions for linear algorithms. This is of particular importance if the original space of inputs \(\mathcal{X} \subseteq\mathbb{R}^{l}\) cannot support an adequate linear solution. The constructed feature space Φ should fulfill three requirements:

1. ϕ _{ i }(⋅), ∀i∈{1,…,p}, is nonlinear in \(\mathcal{X}\) to encode Θ rather than \(\mathcal{X}\).

2. \(\{\phi_{i}\}_{i=1}^{p}\) constitutes a well-behaving functional basis in Θ, e.g. a Fourier basis.

3. Dimensionality p is as low as possible to reduce the number of training samples required to reliably estimate \(f \in\mathcal{F}_{\varPhi}:= \{ f(\boldsymbol{x}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}) \mid \boldsymbol{w} \in\mathbb{R}^{p} \}\).
This article outlines an approach to automatically construct such a feature space Φ for a given time series. The critical assumption is that there exists an unknown low dimensional space of latent variables Θ that generated the observed data \(\boldsymbol{x} \in \mathcal{X}\) by some unknown stochastic process \(\varTheta\to\mathcal{X}\). It is important to understand that we do not talk about the true underlying cause of the data, but about a description Θ that suffices to generate the training samples \(\{\boldsymbol{x}_{t}\}_{t=1}^{n} \subset\mathcal{X}\). Additionally, we restrict ourselves here to the most general encoding of Θ, i.e. we assume no additional information about labels or classes of the training samples.^{1} This is achieved by an unsupervised learning principle called slow feature analysis (SFA, Wiskott and Sejnowski 2002). SFA aims for temporally coherent features out of high dimensional and/or delayed sensor measurements. Given an infinite time series and an unrestricted function class, the learned features will be a Fourier basis^{2} in Θ (Wiskott 2003; Franzius et al. 2007). Although there have been numerous studies highlighting its resemblance to biological sensor processing (Wiskott and Sejnowski 2002; Berkes and Wiskott 2005; Franzius et al. 2007), the method has not yet found its way into the engineering community that focuses on the same problems. One of the reasons is undoubtedly the lack of an easily operated nonlinear extension that is powerful enough to generate a Fourier basis.
The major contribution of this article is to provide such an extension in the form of a kernel SFA algorithm and to demonstrate its application on the example of linear classification. An approach to kernelizing SFA has previously been made by Bray and Martinez (2002) and is reported to work well with a large image data set. Small training sets, however, lead to numerical instabilities in any kernel SFA algorithm. Our goal is to provide an algorithm that can be applied to both of the cases above and to document the pitfalls that arise in its application to real-world time series. This algorithm has previously been presented in a conference paper, which forms the basis for this article (Böhmer et al. 2011). Beyond the scope of the original paper, we provide evidence for its main hypothesis and evaluate linear classification algorithms in the feature spaces generated by kernel SFA and kernel PCA (Schölkopf et al. 1997).
Although formulated as a linear algorithm, SFA was originally intended to be applied to the space of polynomials, e.g. quadratic (Wiskott and Sejnowski 2002) or cubic (Berkes and Wiskott 2005). The polynomial expansion of potentially high dimensional data, however, spans an impractically large space of coefficients. Hierarchical application of quadratic SFA has been proposed to solve this problem (Wiskott and Sejnowski 2002). Although proven to work in complex tasks (Franzius et al. 2007), this approach involves a multitude of hyperparameters, and no easy way to counteract the inevitable overfitting is known so far. It appears biologically plausible, but is definitely not easy to operate.
A powerful alternative to polynomial expansions are kernel methods. Here the considered feature maps \(\phi: \mathcal{X} \to\mathbb{R}\) are elements of a reproducing kernel Hilbert space \(\mathcal{H}\). For many optimization problems a representer theorem holds in line with Wahba (1990), if a regularization term is used. Such theorems guarantee that the optimal solution for a given training set exists within the span of kernel functions, which are parametrized by training samples. Depending on the kernel, continuous functions can be approximated arbitrarily well in \(\mathcal{H}\), and a mapping \(\phi\in\mathcal{H}\) is thus very powerful (Shawe-Taylor and Cristianini 2004).
There are, however, fundamental drawbacks in a kernel approach to SFA. First, choosing feature mappings from a powerful Hilbert space is naturally prone to overfitting. More to the point, kernel SFA shows numerical instabilities due to its unit variance constraint (see Sects. 3 and 5). This tendency has been shown analytically for the related kernel canonical correlation analysis (Fukumizu et al. 2007). We introduce a regularization term for the SFA objective to enforce a stable solution. Second, kernel SFA is based on a kernel matrix of size \(\mathcal{O}(n^{2})\), where n is the number of training samples. This is not feasible for large training sets. Our approach approximates the optimal solution by projecting onto a sparse subset of the data. The choice of this subset, however, is a crucial decision.
The question of how many samples should be selected can only be answered empirically. We compare two state-of-the-art sparse subset selection algorithms that approach this problem very differently: (1) a fast online algorithm (Csató and Opper 2002) that must recompute the whole solution to change the subset's size; (2) a computationally costly matching pursuit approach to sparse kernel PCA (Smola and Schölkopf 2000) that incrementally augments the selected subset. To obtain a method that is both fast and incremental, we derive a novel matching pursuit approach to the first algorithm.
Bray and Martinez (2002) have previously introduced a kernel SFA algorithm that incorporates a simplistic sparsity scheme. Instead of the well-established framework of Wiskott and Sejnowski (2002), they utilize the cost function of Stone (2001), based on long and short term variances without explicit constraints. Due to a high level of sparsity, their approach does not require function regularization. We show that the same holds for our algorithm if the sparse subset is only a small fraction of the training data. However, for larger fractions additional regularization becomes inevitable.
Theoretic predictions by Wiskott (2003) and Franzius et al. (2007) suggest that regularized sparse kernel SFA features will resemble a Fourier basis in the space of latent variables Θ for a given real-world time series. To verify this hypothesis, we perform a vowel classification task on spoken words. None of the tested linear classification algorithms (Bishop 2006) was able to solve the task without nonlinear basis functions. Our work extends Berkes (2005), who used nonlinear SFA to classify images of hand written digits. His work was based on artificial time series rather than real-world data, though. Our results show excellent classification accuracy of more than 97 % and superior encoding of latent variables by kernel SFA in comparison with kernel PCA (Schölkopf et al. 1997). To the best of our knowledge, this work and its predecessor (Böhmer et al. 2011) are the first attempts to apply nonlinear SFA as a preprocessing for audio data.
In the following section, we first review a variety of linear classification algorithms, which we will use to compare the constructed feature spaces Φ. In Sect. 3 we formulate the general SFA optimization problem and derive a regularized sparse kernel SFA algorithm. In Sect. 4 the sparse subset selection is introduced and a novel matching pursuit algorithm derived. Section 5 evaluates both the SFA algorithm and the generated feature space on a vowel classification task, followed by a discussion of our method and possible extensions in Sect. 6.
2 Linear classification algorithms
Classification aims to assign to each test sample \(\boldsymbol{x}^{*} \in \mathcal{X}\) a class \(c^{*} \in \mathcal{C}\), based on a given training set of samples \(\{\boldsymbol{x}_{t}\}_{t=1}^{n}\) and their assigned classes \(\{c_{t}\}_{t=1}^{n}\). In the case of two classes \(\mathcal{C} = \{-1, +1\}\), linear classification aims for a discrimination function \(f \in \mathcal{F}_{\varPhi}= \{ f(\boldsymbol{x}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}) \mid \boldsymbol{w} \in\mathbb{R}^{p} \}\) with \(\operatorname{sign} (f(\boldsymbol{x}_{t}) ) = c_{t}, \forall t \in\{ 1,\ldots ,n\}\). All of the algorithms below are discussed in detail by Bishop (2006).
2.1 Least squares classification/LDA/FDA
If Φ has zero mean w.r.t. all samples and the classes are equally distributed, C is the estimate of the covariance matrix and the solution is equivalent to a Gaussian estimate of the class probability densities, with the restriction that all classes share the same covariance (Bishop 2006). This approach of deriving the discrimination function from Gaussian density estimates is also called linear discriminant analysis (LDA). With small modifications to the target weights (Duda and Hart 1973), the solution of (2) also becomes equivalent to Fisher discriminant analysis (FDA, Fisher 1936). As the empirical differences in our experiments were marginal, we restricted ourselves in Sect. 5 to LDA. Note, however, that due to the strong assumption of equal covariances, these algorithms are known to perform poorly if this assumption is not met.
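As a minimal sketch (not the paper's implementation), the least-squares discriminant can be fit directly in a given feature space; the function name and the toy features below are illustrative assumptions:

```python
import numpy as np

def ls_classifier(Phi, c):
    """Fit the least-squares discriminant w by minimizing ||Phi w - c||^2
    for targets c in {-1, +1}; classification is sign(Phi_new @ w).
    Under the equal-covariance assumptions above this is equivalent to LDA."""
    w, *_ = np.linalg.lstsq(Phi, np.asarray(c, dtype=float), rcond=None)
    return w
```

A constant feature (a column of ones) plays the role of the bias term when Φ does not have zero mean.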
2.2 Perceptron classification
The algorithm that started the field of artificial neural networks (Rosenblatt 1962) can classify two-class problems with \(\mathcal{C} = \{-1,+1\}\) by applying a signum function to the output of the discrimination function \(f \in \mathcal{F}_{\varPhi}\), i.e. \(y(\boldsymbol{x}_{t}) = \operatorname{sign} (f(\boldsymbol{x}_{t}) )\). At sample x _{ t }, the online version of this algorithm updates the weight vector w∈ℝ^{ p } of \(f(\boldsymbol{x}_{t}) = \sum_{i=1}^{p} w_{i} \phi_{i}(\boldsymbol{x}_{t})\) by \(\Delta w_{i}^{(t)} := \phi_{i}(\boldsymbol{x}_{t}) c_{t}\), but only if the prediction y(x _{ t }) was incorrect. The update w←w+αΔw ^{(t)} with learning rate 0<α≤1 is repeated until all training samples are classified correctly. The algorithm is guaranteed to converge for linearly separable problems (Rosenblatt 1962).
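The update rule above can be sketched as follows; the zero initialization and the epoch cap are our assumptions, not part of the original formulation:

```python
import numpy as np

def perceptron(Phi, c, alpha=1.0, max_epochs=100):
    """Online perceptron in feature space: on a misclassified sample,
    w <- w + alpha * c_t * phi(x_t); repeat until all samples are correct
    (guaranteed to terminate for linearly separable data)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_t, c_t in zip(Phi, c):
            if np.sign(phi_t @ w) != c_t:   # sign(0) counts as incorrect
                w += alpha * c_t * phi_t
                errors += 1
        if errors == 0:
            break
    return w
```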
2.3 Logistic regression
3 Slow feature analysis
3.1 Kernel SFA
Let the considered mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}\) be elements of a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\), with corresponding positive semidefinite kernel \(\kappa: \mathcal{X} \times \mathcal{X} \to\mathbb{R}\). The reproducing property of those kernels allows \(\forall\phi\in \mathcal{H}: \phi(\boldsymbol{x}) = \langle\phi , \kappa(\cdot , \boldsymbol{x}) \rangle_{\mathcal{H}}\), in particular \(\langle\kappa(\cdot,\boldsymbol{x}) , \kappa(\cdot,\boldsymbol{x}^{\prime}) \rangle_{\mathcal{H}} = \kappa (\boldsymbol{x},\boldsymbol{x}^{\prime})\). The representer theorem ensures^{7} the solution to be in the span^{8} of support functions, parametrized by the training data (Wahba 1990), i.e. \(\phi= \sum_{t=1}^{n} a_{t} \kappa(\cdot, \boldsymbol{x}_{t})\). Together, these two relationships form the basis for the kernel trick (see e.g. Shawe-Taylor and Cristianini 2004).
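As a small numerical illustration of the kernel trick (not from the article; the homogeneous quadratic kernel on 2-d inputs is chosen because its feature map can be written out explicitly):

```python
import numpy as np

# homogeneous quadratic kernel k(x, y) = (x . y)^2 on R^2 (illustrative choice)
def k_poly(x, y):
    return float(x @ y) ** 2

# the corresponding explicit feature map, for which
# <feat(x), feat(y)> = k_poly(x, y) holds exactly
def feat(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# feat(x) @ feat(y) equals k_poly(x, y): inner products in feature
# space are computed by kernel evaluations alone
```

For a Gaussian kernel, as used later in the article, the feature space is infinite dimensional and no such explicit map can be written down, which is exactly why the kernel trick is needed.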
Sparse kernel SFA
If one assumes the feature mappings within the span of another set of data \(\{\boldsymbol{z}_{i}\}_{i=1}^{m} \subset \mathcal{X}\) (e.g. a sparse subset of the training data, often called support vectors), the sparse kernel matrix K∈ℝ^{ m×n } is defined as K _{ ij }=κ(z _{ i },x _{ j }) instead. The resulting algorithm will be called sparse kernel SFA. Note that the representer theorem no longer applies and therefore the solution merely approximates the optimal mappings in \(\mathcal{H}\). Both optimization problems have identical solutions if \(\forall t \in\{1,\ldots,n\}: \kappa(\cdot, \boldsymbol{x}_{t}) \in\mathbf{span}(\{\kappa(\cdot, \boldsymbol{z}_{i})\}_{i=1}^{m})\), e.g. \(\{\boldsymbol{z}_{i}\}_{i=1}^{n} = \{\boldsymbol{x}_{t}\}_{t=1}^{n}\).
Regularized sparse kernel SFA
The Hilbert spaces corresponding to some of the most popular kernels are equivalent to an infinite dimensional space of continuous functions (Shawe-Taylor and Cristianini 2004). One example is the Gaussian kernel \(\kappa(\boldsymbol{x},\boldsymbol{x}') = \exp(-\frac{1}{2\sigma^{2}} \|\boldsymbol{x} - \boldsymbol{x}'\|_{2}^{2})\). Depending on hyperparameter σ and the data distribution, this can obviously lead to overfitting. Less obvious, however, is the tendency of kernel SFA to become numerically unstable for large σ, i.e. to violate the unit variance constraint. Fukumizu et al. (2007) have shown this analytically for the related kernel canonical correlation analysis. Note that both problems do not affect sufficiently sparse solutions, as sparsity reduces the function complexity and sparse kernel matrices KK ^{⊤} are more robust w.r.t. eigenvalue decompositions.
3.2 The RSK-SFA algorithm
Zero mean
Unit variance and decorrelation
Minimization of the objective
Solution
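The steps named above (zero mean, unit variance and decorrelation, minimization of the objective) can be sketched as follows. This is a hedged reconstruction, not the paper's exact equations (11)/(12): the form of the regularizer (λ times the support-vector kernel matrix, approximating the RKHS norm) and the whitening-based eigensolver are our assumptions.

```python
import numpy as np

def rsk_sfa(X, Z, p, sigma=1.0, lam=1e-6):
    """Sketch of regularized sparse kernel SFA.
    X: (n, l) time series samples, Z: (m, l) support vectors,
    p: number of slow features, lam: regularization strength."""
    def gauss(A, B):  # Gaussian kernel matrix between row sets
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    n = len(X)
    K = gauss(Z, X)                          # sparse kernel matrix (m, n)
    K = K - K.mean(axis=1, keepdims=True)    # zero mean over the training set
    Kdot = np.diff(K, axis=1)                # temporal differences
    S = Kdot @ Kdot.T / (n - 1)              # slowness objective matrix
    S = S + lam * gauss(Z, Z)                # RKHS-norm regularizer (assumed form)
    C = K @ K.T / n                          # unit variance / decorrelation constraint

    # whiten the constraint matrix, then solve an ordinary eigenproblem;
    # ascending eigenvalues put the slowest feature first
    w, U = np.linalg.eigh(C)
    keep = w > 1e-8
    W = U[:, keep] / np.sqrt(w[keep])
    evals, V = np.linalg.eigh(W.T @ S @ W)
    A = W @ V[:, :p]                         # coefficients of the p slowest features
    return A, A.T @ K                        # coefficients and training features
```

On the training data the returned features have unit variance and are pairwise decorrelated by construction.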
4 Sparse subset selection
Representer theorems guarantee that the target function \(\phi^{*} \in\mathcal{H}\) for training set \(\{\boldsymbol{x}_{t}\}_{t=1}^{n}\) can be found within \(\mathrm{span}(\{\kappa(\cdot,\boldsymbol{x}_{t})\}_{t=1}^{n})\). For sparse KSFA, however, no such guarantee exists. The quality of such a sparse approximation depends exclusively on the set of support vectors \(\{\boldsymbol{z}_{i}\}_{i=1}^{m}\).
Online maximization of the affine hull
A widely used algorithm (Csató and Opper 2002), which we will call online maximization of the affine hull (online MAH) in the absence of a generally accepted name, iterates through the data in an online fashion. At time t, sample x _{ t } is added to the selected subset if \(\epsilon_{t}^{\boldsymbol{i}}\) is larger than some given threshold η. Exploitation of the matrix inversion lemma (MIL) allows an online algorithm with computational complexity \(\mathcal{O}(m^{2} n)\) and memory complexity \(\mathcal{O}(m^{2})\). The downside of this approach is that the final subset size m depends unpredictably (though monotonically) on the hyperparameter η; changing the subset size therefore requires a complete recomputation with a different η.
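The selection rule can be sketched as follows; for clarity this version re-solves a linear system per candidate instead of maintaining an inverse with the matrix inversion lemma, so it is slower than the O(m²n) original:

```python
import numpy as np

def online_mah(X, kernel, eta):
    """Sketch of online MAH: add sample x_t to the subset if the squared
    projection error of k(., x_t) onto the span of the already selected
    kernel functions exceeds the threshold eta."""
    idx = [0]
    for t in range(1, len(X)):
        Kzz = kernel(X[idx], X[idx])            # (m, m) subset kernel matrix
        kv = kernel(X[idx], X[t:t + 1])[:, 0]   # kernel values to the candidate
        ktt = kernel(X[t:t + 1], X[t:t + 1])[0, 0]
        err = ktt - kv @ np.linalg.solve(Kzz, kv)
        if err > eta:
            idx.append(t)
    return idx
```

With a Gaussian kernel, a duplicate of an already selected sample has projection error near zero and is skipped, while a sample far from the subset has error near one and is added.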
Matching pursuit for sparse kernel PCA
This handicap is addressed by matching pursuit methods (Mallat and Zhang 1993). Applied to kernels, some criterion selects the best fitting sample, followed by an orthogonalization of all remaining candidate support functions in Hilbert space \(\mathcal{H}\). A resulting sequence of m selected samples therefore contains all sequences up to length m as well. The batch algorithm of Smola and Schölkopf (2000) chooses the sample x _{ j } that minimizes^{12} \(\mathbb{E}_{t} [\epsilon_{t}^{\boldsymbol{i} \cup j} ]\). It was shown later that this algorithm performs sparse PCA in \(\mathcal{H}\) (Hussain and Shawe-Taylor 2008). The algorithm, which we will call matching pursuit for sparse kernel PCA (MP KPCA) in the following, has a computational complexity of \(\mathcal{O}(n^{2}m)\) and a memory complexity of \(\mathcal{O}(n^{2})\). In practice it is therefore not applicable to large data sets.
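The selection-plus-orthogonalization loop can be sketched on a full kernel matrix. The residual-reduction score and Schur-complement deflation below follow the general matching-pursuit scheme, not necessarily the exact criterion of Smola and Schölkopf (2000); note the O(n²) memory that makes the batch algorithm impractical for large n:

```python
import numpy as np

def mp_kpca_subset(K, m):
    """Greedy subset selection on a full kernel matrix K (n, n):
    pick the sample whose kernel function reduces the total residual
    most, then deflate (orthogonalize) the remaining candidates in H."""
    R = np.array(K, dtype=float)                 # residual kernel matrix
    idx = []
    for _ in range(m):
        d = np.maximum(np.diag(R), 1e-12)
        scores = (R ** 2).sum(axis=1) / d        # residual reduction per candidate
        if idx:
            scores[idx] = -np.inf                # never re-select a sample
        j = int(np.argmax(scores))
        idx.append(j)
        R = R - np.outer(R[:, j], R[j]) / d[j]   # Schur-complement deflation
    return idx
```

Because the selection is incremental, the first m' < m entries of the returned list are exactly the subset the algorithm would select for size m'.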
4.1 Matching pursuit for online MAH
5 Empirical validation
In this section we evaluate, using a two-vowel classification task, the hypothesis that RSK-SFA generates a feature space Φ which resembles a Fourier basis in the space of latent variables Θ for a given time series.
The true Θ is not known, so we use the performance of different linear classification algorithms to measure how well RSK-SFA features encode Θ. For comparison, all algorithms are tested on feature spaces generated by RSK-SFA and sparse kernel PCA (Schölkopf et al. 1997). Results in Sect. 5.4 show that RSK-SFA features indeed encode Θ, whereas kernel PCA does not aim for such an encoding. We also demonstrate the importance of sparse subset selection for an efficient encoding. Last but not least, a classification accuracy of more than 97 % in a task that is not linearly solvable demonstrates the excellent performance of our approach.
5.1 Benchmark data sets
The “north Texas vowel database”^{14} contains uncompressed audio files with English words of the form h…d, spoken multiple times by multiple persons (Assmann et al. 2008). The natural task is to predict the central vowels of unseen instances of a trained word. To cover both cases of small and large training sets, we selected two data sets: (i) A small set with four training and four test instances for each of the words “heed” and “head”, spoken by the same person. (ii) A large data set containing all instances of “heed” and “head” for all 18 adult subjects. To apply crossvalidation, we performed 20 random splits (folds) into 12 training and 6 test subjects, leading to 245±10 training instances and 116±10 test instances.
The spoken words are provided as mono audio streams of varying length at 48 kHz, i.e. as a series of amplitude readings {a _{1},a _{2},…}. SFA requires the space of latent variables Θ to be embedded in the space of observations \(\mathcal{X}\), which is not the case for one-dimensional data. The problem resides in the ambiguity of the observed amplitude readings a _{ t }, i.e. mappings \(\phi_{i}: \mathcal{X} \to\mathbb{R}\) cannot distinguish between two latent states which both generate the same observed amplitude a _{ t }. However, Takens' theorem (Takens 1981; Huke 2006) guarantees an embedding of Θ as a manifold in the space \(\mathcal{X} \subset\mathbb{R}^{l}\) of sufficiently many time-delayed observations. Based on the observed pattern of l time-delayed amplitude readings, nonlinear SFA and PCA can utilize the resulting one-to-one mapping of latent variables in Θ onto a manifold of observations in \(\mathcal{X}\) to extract basis functions ϕ _{ i }(⋅) that encode Θ. We therefore defined our samples x _{ t }:=[a _{ δt },a _{ δt+ϵ },a _{ δt+2ϵ },…,a _{ δt+(l−1)ϵ }]^{⊤}, which is also called a sliding window.^{15} We evaluated the parameters δ, ϵ and l empirically^{16} and chose δ=50, ϵ=5 and l=500. All algorithms were trained and tested with the joint^{9} time series {x _{ t }} of all instances of all subjects in the respective fold. Note, however, that all samples of each word instance were used exclusively for either training or testing. Additionally, in the large data set all instances of one subject were used exclusively, too. This ensures that we really test for generalization to unseen instances of the trained words as well as to previously unseen subjects. The above procedure provided us with 3719 (4108) training (test) samples x _{ t }∈[−1,1]^{500} for the small and 97060±4615 (46142±4614) training (test) samples for the large data set.
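The sliding-window embedding x _{ t }:=[a _{ δt },a _{ δt+ϵ },…,a _{ δt+(l−1)ϵ }]^{⊤} can be sketched as follows (the defaults match the values chosen above; the function name is ours):

```python
import numpy as np

def sliding_window(a, delta=50, eps=5, l=500):
    """Time-delay embedding of a 1-d amplitude series:
    x_t = [a_{delta*t}, a_{delta*t+eps}, ..., a_{delta*t+(l-1)*eps}]."""
    a = np.asarray(a, dtype=float)
    n = (len(a) - (l - 1) * eps - 1) // delta + 1   # number of full windows
    return np.stack([a[delta * t : delta * t + (l - 1) * eps + 1 : eps]
                     for t in range(n)])
```

With δ=50 at 48 kHz, consecutive samples x _{ t } start roughly one millisecond apart, while each window spans (l−1)ϵ ≈ 2500 amplitude readings.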
5.2 RSK-SFA performance
We start our analysis with the RSK-SFA solution for different Gaussian kernels, i.e. \(\kappa(\boldsymbol{x}, \boldsymbol{x}') = \exp ( -\frac{1}{2\sigma^{2}} \|\boldsymbol{x} - \boldsymbol{x}'\|_{2}^{2} )\) with varying σ. To ensure that meaningful information is extracted, we measure the test slowness, i.e. the slowness of the learned feature mappings applied to a previously unseen test sequence drawn from the same distribution. Small σ, however, grossly underestimate a feature's output on unseen test samples, as distances are strongly amplified. This changes the feature's slowness. For comparison we normalized all outputs to unit variance on the test set before measuring the test slowness.
In the absence of significant sparseness (Fig. 1a), unregularized kernel SFA (λ=0, equivalent to KSFA, see (11) and (12)) shows both overfitting and numerical instability. Overfitting can be seen at small σ, where the features fulfill the unit variance constraint (small plot) but do not reach the minimal test slowness (main plot). The bad performance for larger σ, on the other hand, must be blamed on numerical instability, as indicated by a significantly violated unit variance constraint. Both can be counteracted by proper regularization. Although optimal regularization parameters λ are quite small and can reach computational precision, there is a wide range of kernel parameters σ for which the same minimal test slowness is reachable. In Fig. 1a, for example, a fitting λ can be found between σ=0.5 and σ=20.
The more common case, depicted in Fig. 1b, is a large training set from which a small subset is selected by MP MAH. Here no regularization is necessary and unregularized sparse kernel SFA (λ=0) learns mappings of minimal slowness in the range from σ=1 to far beyond σ=500. Notice that this renders a time consuming search for an optimal parameter σ obsolete.
5.3 Sparse subset selection
5.4 Classification performance
To demonstrate that Φ encodes Θ, we compared the accuracy of linear classification algorithms (see Sect. 2) in RSK-SFA and SK-PCA feature spaces of different sizes. Sparse kernel PCA optimizes the PCA objective (maximal variance) over the same function class as RSK-SFA, i.e. with the same kernel and support vectors. Because some iterative algorithms are sensitive to scaling, we scaled the SK-PCA features to unit variance on the test set to make them fully comparable with RSK-SFA features.
In the evaluated two-class problem (“head” vs. “heed”), class labels are only available for whole words. Individual samples x _{ t }, however, do not necessarily belong to a vowel, but might contain one of the consonants at the beginning or end of the word. Many samples will therefore be ambiguously labeled, which the classifier must interpret as noise. When all samples of a word are classified, the final judgment about the spoken vowel is determined by the sum over all discrimination function outputs \(\sum_{t=1}^{n} f(\boldsymbol{x}_{t})\), rather than over the individual class predictions. This takes the classifier's certainty into account and yields much better results. It is also equivalent to the empirical temporal mean of the discrimination function output over the whole word.^{18}
At first glance, one might expect the following:

1. A training set of 245±10 words must overfit in high dimensional feature spaces.

2. Classification of individual samples x _{ t } must always result in very poor accuracy, as most of them do not encode the vowel and will thus yield only noise.

3. \(\bar{\boldsymbol{\phi}}\) lives in the same feature space as the individual samples and should be affected by the same insufficiencies of that feature space Φ.
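The word-level decision rule described above (summing the discrimination function outputs rather than voting over per-sample signs) can be sketched as:

```python
import numpy as np

def word_decision(f_vals):
    """Word-level class from per-sample discriminant outputs f(x_t):
    sign of the sum, which weights each sample by the classifier's
    certainty, unlike a majority vote over per-sample signs."""
    return 1 if float(np.sum(f_vals)) > 0 else -1

# a few confident samples can outvote many ambiguous ones:
f_vals = [-0.1, -0.1, -0.1, 1.0]   # majority of per-sample signs is negative
# word_decision(f_vals) -> 1 (sum is 0.7), while a sign vote would give -1
```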
It is apparent that in all cases the RSK-SFA feature space is superior to SK-PCA for all but 2500 features. At this point both feature spaces are identical up to rotation, and any linear algorithm must therefore perform identically. Moreover, RSK-SFA features based on support vectors selected by MP MAH clearly outperform those based on randomly selected SV in all cases. In a comparison between algorithms, perceptron classification (Fig. 3c) yields the highest accuracy with the fewest features, i.e. more than 97 % with 256 RSK-SFA features. More features do not significantly raise the performance. This is particularly interesting because the training does not require a matrix inverse and is therefore the fastest of the evaluated algorithms in large feature spaces. It is also worth mentioning that 8 (32) RSK-SFA features reach an accuracy of 80 % (90 %), whereas the first 64 SK-PCA features apparently do not encode any information about Θ. Regularized logistic regression^{20} (Fig. 3b) reaches comparable accuracy (around 95 % with 256 or more features) but exhibits higher variance between folds. The poor performance of LS/LDA/FDA classification (Fig. 3a) must stem from the violated assumption of equal class covariances (see Sect. 2). Note, however, that all above comparisons between feature spaces and SV selection schemes still hold.
These results demonstrate that, in contrast to SK-PCA, RSK-SFA feature spaces Φ aim to encode Θ. The high accuracy of over 97 % establishes that either (i) the problem is almost linearly separable in Θ or (ii) Φ resembles a functional basis in Θ, presumably^{21} a Fourier basis. At this point in time it is not possible to distinguish between (i) and (ii). However, the large number of required features could be interpreted as evidence for (ii).
We therefore conclude that RSK-SFA features encode the space of latent variables Θ of a given time series, presumably by resembling a Fourier basis of Θ.
6 Discussion
This article investigates the hypothesis that sufficiently powerful nonlinear slow feature analysis features resemble a Fourier basis in the space of latent variables Θ which underlies complex real-world time series. To perform a powerful but easily operated nonlinear SFA, we derived a kernelized SFA algorithm (RSK-SFA). The novel algorithm is capable of handling small data sets through regularization and large data sets through sparsity. To select a sparse subset for the latter, we developed a matching pursuit approach to a widely used algorithm (MP MAH). In combination with linear classification algorithms, particularly the perceptron algorithm, our results support the hypothesis' validity and demonstrate excellent performance.
6.1 Comparison to previous works
The major advancement of our approach over the kernel SFA algorithm of Bray and Martinez (2002) is the ability to obtain features that generalize well for small data sets. If one is forced to use a large proportion of the training set as support vectors, e.g. for small training sets of complex data, the unregularized solution can violate the unit variance constraint.
As suggested by Bray and Martinez (2002), our experiments show that for large data sets no explicit regularization is needed. The implicit regularization introduced by sparsity is sufficient to generate features that generalize well over a wide range of Gaussian kernels. However, our work documents the dependence of the algorithm's performance on the sparse subset for the first time. In this setting, the subset size m takes the role of the regularization parameter λ. It is therefore imperative to control m with minimal computational overhead.
We compared two state-of-the-art algorithms that select sparse subsets in polynomial time. Online MAH is well suited to process large data sets, but selects unpredictably large subsets. A change of m therefore requires a full recomputation without the ability to target a specific size. Matching pursuit for sparse kernel PCA (MP KPCA), on the other hand, returns an ordered list of selected samples; increasing the subset's size simply requires continuing the loop of the algorithm. The downside is a quadratic dependency on the training set size, both in time and memory. Both algorithms exhibited similar performance and significantly outperformed a random selection scheme. The subsets selected by the novel matching pursuit approach to online MAH (MP MAH) yielded virtually the same performance as those selected by online MAH. There is no difference in computation time, but the memory complexity of MP MAH depends linearly on the training set's size. However, increasing m works just as with MP KPCA, which makes this algorithm the better choice if one can afford the memory. If not, online MAH can be applied several times with a slowly decreasing hyperparameter η. Although a subset of suitable size will eventually be found, this approach takes much more time than MP MAH.
6.2 Hyperparameter selection
For a first assessment of the feature space generated by RSK-SFA, selecting the support vectors randomly is sufficient. For optimal performance, however, one should select the sparse subset with MP MAH or online MAH, depending on available resources and time (see the last section). As shown in Fig. 1b, selecting the Gaussian kernel parameter σ is not an issue in the face of sufficient sparsity. Empirically, setting σ such that more than half of all kernel outputs are above 0.5 has yielded sufficient performance in most cases.
Before raising the regularization parameter λ above 0, one should first check the unit variance constraint on the training data. Particularly if the kernel cannot be adjusted, a violation can be compensated by slowly raising λ until the variance is sufficiently close to one. A numerically stable solution is no guarantee for optimal slowness, though. It is therefore always advisable to track the test slowness on an independent test set. When overfitting is suspected, try to shrink the sparse subset size m, raise λ, or increase the kernel width σ, in this order.
6.3 Limitations
1. Our method of generating feature spaces Φ is based on the assumption of an underlying space of latent variables Θ. This implies that the generative process \(\varTheta\to\mathcal{X}\) is stationary over time. Preliminary experiments on EEG data (not shown), which are known to be non-stationary, yielded no generalizing features.

2. The generative mapping must be unique. Two elements of Θ which are reliably mapped onto the same element of \(\mathcal{X}\) are not distinguishable in Φ.

3. As the exact nature of Θ is ambiguous, there can be arbitrarily many “unwanted” latent variables, which raises three problems: (i) If the slowest variables are not relevant for the task, they will appear as noise. (ii) The size of a Fourier basis Φ of Θ grows exponentially with the dimensionality of Θ. (iii) Generating a reliable feature space Φ requires training samples from all regions of Θ, which could require infeasibly many training samples if Θ is high dimensional.
6.4 Including label or importance information
1. SFA exploits the temporal succession of samples and thus can only be applied to time series. It is imaginable, however, to use other available information to modify the approach without much change to the methodology. For example, if one is faced with labeled i.i.d. data, Algorithm 1 can be modified with a different objective to minimize the distance between all samples of each class:
$$ \min_{\boldsymbol{\phi}\in\mathcal{H}^p} \sum_{k=1}^p \mathbb{E}_i \bigl[ \mathbb{E}_j \bigl[ \delta_{c_i c_j} \bigl(\phi_k(\boldsymbol{x}_i) - \phi_k(\boldsymbol{x}_j) \bigr)^2 \bigr] \bigr] , $$ (22)
where \(\delta_{c_{i} c_{j}}\) is the Kronecker delta, which is 1 if c _{ i }=c _{ j } and 0 otherwise. The two nested expectations induce a quadratic dependency on the training set size n, but can be approximated by randomly drawing sample pairs from the same class. The resulting feature space Φ will not resemble a functional basis in Θ, but reflect a between-classes metric by mapping same-class samples close to each other. Preliminary experiments (not shown) have generated promising feature spaces for linear classification. This approach has also been taken by Berkes (2005), who used SFA on 2-step “time series” of same-class samples of hand written digits.
 2. The individual components (or dimensions) of Θ are encoded according to their slowness. This can result in an inefficient encoding if the slowest changing components of Θ are of no use to the task. It is therefore of interest whether one can use additional information about the samples to restrict Φ to encode only a relevant subspace of Θ, at least in the first p features. If one can define an importance measure p(x _{ t+1},x _{ t })>0 with high values for transitions within the desired subspace, the slowness objective in (6) can be modified to$$ \min s'(\phi_i) := \frac{1}{n-1} \sum _{t=1}^{n-1} \frac{ (\phi_i(\boldsymbol{x}_{t+1}) - \phi_i(\boldsymbol{x}_t) )^2}{p(\boldsymbol{x}_{t+1}, \boldsymbol{x}_t)} . $$(23)As a result, the relative slowness of important transitions is reduced and the respective subspace of Θ will be encoded earlier. Increasing the importance will eventually encode the subspace spanned by the important transitions exclusively within the first p features extracted by the modified algorithm.
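The importance-weighted slowness (23) is a weighted mean of squared one-step feature differences. A minimal numpy sketch (our own illustration; `weighted_slowness` is a hypothetical helper name):

```python
import numpy as np

def weighted_slowness(features, importance):
    """Importance-weighted slowness s'(phi_i) from (23): squared one-step
    feature differences divided by the transition importance p(x_{t+1}, x_t).
    `features`: (n,) outputs phi_i(x_t); `importance`: (n-1,) positive
    weights, one per transition."""
    diffs = np.diff(features) ** 2
    return float(np.mean(diffs / importance))   # 1/(n-1) * sum over transitions

# usage: a slow sine under uniform transition importance
t = np.linspace(0, 2 * np.pi, 100)
y = np.sin(t)
uniform = np.ones(len(t) - 1)
print(weighted_slowness(y, uniform))
```

Doubling the importance of a transition halves its contribution, which is exactly how the modified objective makes important transitions appear relatively slower.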
 3. If only arbitrary subsets of Θ are of consequence to the task, another modification can improve the encoding by including additional information. If the samples can be labeled according to another importance measure q(x)≥0, which marks the important (q(x)≫0) and unimportant samples (q(x)≈0), the unit variance and decorrelation constraints of (11) can be modified, i.e.$$ \mathbb{E}_t \bigl[ \boldsymbol{\phi}(\boldsymbol{x}_t) q(\boldsymbol{x}_t) \boldsymbol{\phi}(\boldsymbol{x}_t)^{\top} \bigr] = \mathbf{I} , $$(24)which enforces unit variance on the important samples only. Slowness still applies to all samples equally and the resulting feature space Φ would map samples of low importance onto each other.^{22} Given a powerful enough function class, the modified algorithm should construct a Fourier basis in Θ on the subset of important samples only.^{23} For example, if we knew which samples x _{ t } contain a vowel (but not which vowel it is), the resulting feature space Φ should exclusively encode the subset of Θ that represents the vowels. When this subset is relatively small, the resulting feature space Φ approximates the same task with far fewer basis functions.
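Constraint (24) amounts to whitening the features with respect to a q-weighted second-moment matrix. The following sketch (our own illustration, assuming given raw feature outputs; `enforce_weighted_unit_variance` is a hypothetical helper) applies such a whitening transform and verifies the constraint:

```python
import numpy as np

def enforce_weighted_unit_variance(Phi, q):
    """Linearly transform features so the q-weighted second moment (24)
    becomes the identity: E_t[phi(x_t) q(x_t) phi(x_t)^T] = I.
    `Phi`: (n, p) raw feature outputs; `q`: (n,) nonnegative importances."""
    n = Phi.shape[0]
    M = (Phi * q[:, None]).T @ Phi / n   # weighted second-moment matrix
    d, U = np.linalg.eigh(M)             # M is symmetric positive definite
    W = U / np.sqrt(d)                   # whitening transform U diag(d^-1/2)
    return Phi @ W

rng = np.random.default_rng(1)
Phi = rng.normal(size=(500, 3))
q = rng.uniform(0.1, 2.0, size=500)      # importance per sample
Z = enforce_weighted_unit_variance(Phi, q)
M = (Z * q[:, None]).T @ Z / 500
print(np.allclose(M, np.eye(3)))         # prints True: constraint (24) holds
```

Since the weights enter only through the second-moment matrix, samples with q(x)≈0 contribute nothing to the constraint, mirroring the text: unit variance and decorrelation are enforced on the important samples only.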
6.5 Relationship to deep belief networks
The sound performance of the perceptron algorithm in combination with nonlinear SFA suggests an interesting connection to deep belief networks (Hinton and Salakhutdinov 2006). It has long been known that multilayer perceptrons (MLPs) are so sensitive to initialization that it is virtually impossible to learn a deep architecture (i.e. many layers) from any random initialization (Haykin 1999). Hinton and Osindero (2006) proposed^{24} to train the first layer as an autoencoder, i.e. to predict its own inputs via a hidden layer of perceptrons. The trained autoencoder is then fixed and the hidden layer used as the input of the next layer, which is trained in the same way. The resulting deep architecture is a generative model of the data and has proven to be a suitable initialization for many problems.
In line with this article’s hypothesis, however, we assume that the target function is defined over a space of latent variables Θ. Layer-wise training of an MLP version of SFA might therefore yield an even better initialization, as it already spans a suitable function space rather than simply encoding the data (see Franzius et al. 2007 for an example of a hierarchical SFA). However, repeated application of nonlinear SFA, even with the limited function class of one perceptron per feature, sooner or later runs into overfitting. In preliminary experiments, repeated application of quadratic SFA lost all generalization ability within a few layers. Without an automatic way to regularize the solution in each layer, this approach will thus not yield a reliable initialization for classical MLP backpropagation (Haykin 1999). Future research must show whether a properly regularized MLP-SFA algorithm can be used to initialize an MLP and whether or not this approach can compete with deep belief networks based on autoencoders.
Footnotes
 1.
Extensions that include such information are discussed in Sect. 6.
 2.
This holds only for factorizing spaces Θ, i.e. independent boundary conditions for each dimension. However, more general statements can be derived for any ergodic Markov chain.
 3.
Here the ith class label is a \( \mathcal{C}\) dimensional vector which is zero everywhere except for the ith entry, which is one. The result is a matrix T with T _{ it }=1 if c _{ t }=i and 0 otherwise.
 4.
The solution is unique if C is invertible. If not, a MoorePenrose pseudoinverse of C can still yield acceptable results.
 5.
This regularization approach is equivalent to weight decay, i.e. an additional regularization term \(\eta\\boldsymbol{w}\^{2}_{2}\) in the log-likelihood objective.
 6.
The samples are not i.i.d. and must be drawn by an ergodic Markov chain in order for the empirical mean to converge in the limit (Meyn and Tweedie 1993).
 7.
Technically this holds only when the solution is regularized in \(\mathcal{H}\)—which we will do later.
 8.
Note that the solutions ϕ _{ i }(⋅) are thus linear functions of inner products in \(\mathcal{H}\).
 9.
In Sect. 5 we will generate feature mappings from a collection of time series, i.e. multiple words. If all words are assembled into one large time series, the transition from one word to the next can be excluded from optimization by setting the respective entry in D to zero.
 10.
Not all Hilbert spaces \(\mathcal{H}\) contain \(\boldsymbol{1}_{\mathcal{H}}\). Technically we must optimize over \(\mathcal{H} \cup\{\boldsymbol{1}_{\mathcal{H}}\}\). This can be achieved by allowing solutions of the form \(\phi_{i}(\cdot) = \sum_{j=1}^{m} A_{ji} \kappa(\cdot, \boldsymbol{z}_{j}) - c_{i}\).
 11.
Let “:” denote the index vector of all available indices. See Algorithm 2.
 12.
\(\epsilon_{t}^{\boldsymbol{i}}\) is nonnegative and MP KPCA thus minimizes the L _{1} norm of approximation errors.
 13.
An exact minimization of the L _{∞} norm is as expensive as the MP KPCA algorithm. However, since \(\epsilon_{t}^{\boldsymbol{i} \cup t} = 0\), selecting the worst approximated sample x _{ t } effectively minimizes the supremum norm L _{∞}.
 14.
 15.
This violates the i.i.d. assumption of most classification algorithms, as two successive samples are no longer independent. The classification results were excellent, however, indicating that the violation did not influence the algorithms’ performance too strongly.
 16.
Although the choice of embedding parameters changes the resulting slowness in a nontrivial fashion, we want to point out that this change appears to be smooth and the presented shapes remain similar over a wide range of embedding parameters.
 17.
Comparison to linear SFA features is omitted due to scale; e.g., test slowness was slightly above 0.5 for both data sets. RSK-SFA can therefore outperform linear SFA by more than a factor of 10, a magnitude we observed in other experiments as well.
 18.
Note that this is not the same as the discriminant output of the temporal mean over the whole word, i.e. \(\mathbb{E}_{t}[f(\boldsymbol{x}_{t})] = \sum_{i=1}^{p} w_{i} \mathbb {E}_{t}[\phi_{i}(\boldsymbol{x}_{t})] \neq f(\mathbb{E}_{t}[\boldsymbol{x}_{t}])\), as ϕ _{ i }(⋅) are nonlinear.
 19.
Test accuracy is the fraction of correct classifications for a previously unseen test set.
 20.
The unregularized algorithm showed clear signs of overfitting in large feature spaces, i.e. the accuracy dropped for large p. We repeated the training and evaluation procedure with slowly increasing regularization parameter η. The (presented) results stabilized at η=500.
 21.
 22.
Samples of low (or no) importance would ideally be mapped onto the feature output of the last important sample seen in the trajectory. However, due to the structure of the Hilbert space \(\mathcal{H}\) the modified algorithm can produce mappings that generalize well in Θ.
 23.
The feature output on the unimportant part of Θ should be either constant or change as smoothly as possible between adjacent important samples.
 24.
Hinton and Osindero (2006) showed this for restricted Boltzmann machines, but the principle holds for MLP as well.
Acknowledgements
This work has been supported by the Integrated Graduate Program on Human-Centric Communication at Technische Universität Berlin, the German Research Foundation (DFG SPP 1527 autonomous learning), the German Federal Ministry of Education and Research (grant 01GQ0850) and EPSRC grant #EP/H017402/1 (CARDyAL). We want to thank Matthias Franzius, who gave us a sound introduction to nonlinear SFA, and Roland Vollgraf for his contribution to an earlier version of RSK-SFA.
References
 Assmann, P. F., Nearey, T. M., & Bharadwaj, S. (2008). Analysis and classification of a vowel database. Canadian Acoustics, 36(3), 148–149.
 Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.
 Berkes, P., & Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5, 579–602.
 Berkes, P. (2005). Pattern recognition with slow feature analysis. Cognitive Sciences EPrint Archive (CogPrints) 4104.
 Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.
 Böhmer, W., Grünewälder, S., Nickisch, H., & Obermayer, K. (2011). Regularized sparse kernel slow feature analysis. In ECML/PKDD 2011 (Vol. I, pp. 235–248).
 Bray, A., & Martinez, D. (2002). Kernel-based extraction of slow features: complex cells learn disparity and translation invariance from natural images. Neural Information Processing Systems, 15, 253–260.
 Csató, L., & Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 14(3), 641–668.
 Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
 Einhäuser, W., Hipp, J., Eggert, J., Körner, E., & König, P. (2005). Learning viewpoint invariant object representations using a temporal coherence principle. Biological Cybernetics, 93(1), 79–90.
 Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
 Fletcher, R. (1987). Practical methods of optimization (2nd ed.). New York: Wiley.
 Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
 Franzius, M., Sprekeler, H., & Wiskott, L. (2007). Slowness and sparseness lead to place, head-direction, and spatial-view cells. PLoS Computational Biology, 3(8), e166.
 Fukumizu, K., Bach, F. R., & Gretton, A. (2007). Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8, 361–383.
 Haykin, S. (1999). Neural networks: a comprehensive foundation (2nd ed.). New York: Prentice Hall.
 Hinton, G. E., & Osindero, S. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
 Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
 Huke, J. P. (2006). Embedding nonlinear dynamical systems: a guide to Takens’ theorem. Technical report, University of Manchester.
 Hussain, Z., & Shawe-Taylor, J. (2008). Theory of matching pursuit. In Advances in Neural Information Processing Systems (Vol. 21, pp. 721–728).
 Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.
 Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. London: Springer.
 Rosenblatt, F. (1962). Principles of neurodynamics: perceptrons and the theory of brain mechanisms. Washington, DC: Spartan Books.
 Rubin, D. B. (1983). Iteratively reweighted least squares. Encyclopedia of Statistical Sciences, 4, 272–275.
 Schölkopf, B., Smola, A., & Müller, K.-R. (1997). Kernel principal component analysis. In Artificial Neural Networks (ICANN).
 Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.
 Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
 Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the 17th International Conference on Machine Learning (pp. 911–918).
 Stone, J. V. (2001). Blind source separation using temporal predictability. Neural Computation, 13(7), 1559–1574.
 Takens, F. (1981). Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence (pp. 366–381).
 Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
 Wiskott, L. (2003). Slow feature analysis: a theoretical analysis of optimal free responses. Neural Computation, 15(9), 2147–2177.
 Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4), 715–770.