Improved graph-based SFA: information preservation complements the slowness principle

Slow feature analysis (SFA) is an unsupervised learning algorithm that extracts slowly varying features from a multi-dimensional time series. SFA has been extended to supervised learning (classification and regression) by an algorithm called graph-based SFA (GSFA). GSFA relies on a particular graph structure to extract features that preserve label similarities. Processing of high dimensional input data (e.g., images) is feasible via hierarchical GSFA (HGSFA), resulting in a multi-layer neural network. Although HGSFA has useful properties, in this work we identify a shortcoming, namely, that HGSFA networks prematurely discard quickly varying but useful features before they reach higher layers, resulting in suboptimal global slowness and an under-exploited feature space. To counteract this shortcoming, which we call unnecessary information loss, we propose an extension called hierarchical information-preserving GSFA (HiGSFA), where some features fulfill a slowness objective and other features fulfill an information preservation objective. The efficacy of the extension is verified in three experiments: (1) an unsupervised setup where the input data is the visual stimuli of a simulated rat, (2) the localization of faces in image patches, and (3) the estimation of human age from facial photographs of the MORPH-II database. Both HiGSFA and HGSFA can learn multiple labels and offer a rich feature space, feed-forward training, and linear complexity in the number of samples and dimensions. However, the proposed algorithm, HiGSFA, outperforms HGSFA in terms of feature slowness, estimation accuracy, and input reconstruction, giving rise to a promising hierarchical supervised-learning approach. Moreover, for age estimation, HiGSFA achieves a mean absolute error of 3.41 years, which is a competitive performance for this challenging problem.


Introduction
Illustration of three basic extensions to SFA (graph-based, hierarchical, information-preserving), each of them is represented by a different direction. The combined use of these extensions results in 8 different variants of SFA. This article proposes information preservation, which is used in iSFA, iGSFA, HiSFA, and HiGSFA, the last one being the most promising variant of SFA as a minimization of the mean squared reconstruction error with PCA. The resulting network that considers both optimization goals is called hierarchical information-preserving GSFA (HiGSFA), and the algorithm used in each node of the network is called informationpreserving GSFA (iGSFA). The feature vector computed by iGSFA comprises two parts: (1) a slow part, which is a linear mixture of the (nonlinear) features computed using GSFA, and (2) an input-reconstructive part computed using PCA.
The iGSFA algorithm, which is the main building block of HiGSFA, reduces the redundancy between the slow and reconstructive parts by making both parts decorrelated: The reconstructive part does not directly approximate the input data but only a version of it where the slow part has been projected out, called residual data. Moreover, iGSFA also ensures that the scale of the slow components is compatible with the scale of the reconstructive components. This enables meaningful processing by PCA in subsequent layers (PCA is sensitive to input feature magnitudes). Different versions of SFA with and without information preservation are shown in Fig. 1.
A fundamental motivation for investigating H(iG)SFA is that, in contrast to gradient-based training of neural networks, it has a stronger biological feasibility because each iGSFA node adapts only to its local input. Thus, the method allows massive parallelization and scalability. On the other hand, it does not minimize a global loss function explicitly, as gradient-based method do, but our experimental results prove that one can achieve remarkable accuracy with the resulting networks.
The experiments show the versatility and generality of the proposed extensions in three setups: (1) Unsupervised learning, where the input data is the view of a simulated rat moving inside a box. This is solved with HiSFA, a special case of HiGSFA. (2) Supervised learning, where four pose parameters of faces depicted in image patches must be estimated. This problem is closely connected to face detection and tracking.
(3) A supervised problem requiring the estimation of age from facial photographs. All three experiments show the advantages of using information preservation: (a) slower features, (b) better generalization to unseen data, (c) much better input reconstruction (see Fig. 2), and (d) improved classification/regression accuracy (experiments 2 and 3). Furthermore, the computational and memory requirements of HiGSFA are moderate, having the same asymptotic order as those of HGSFA. The same image fully pre-processed (i.e., after pose normalization and face sampling, 96×96 pixels). Linear reconstructions on 75 features extracted with either c PCA, d HiGSFA or e HGSFA. f Average over all pre-processed images of the MORPH-II database. Notice that the reconstruction using HiGSFA features is most similar to that of PCA, whereas the reconstruction using HGSFA features is most similar to the average image.

Slow Feature Analysis (SFA)
Slow features can be computed using a few methods, such as online learning rules (e.g., Földiák 1991;Mitchison 1991), slow feature analysis (SFA, Wiskott 1998;Wiskott and Sejnowski 2002), which is a closed-form algorithm specific for this task and has biologically feasible variants , and an incremental-learning version (inc-SFA, Kompella et al. 2012).
The delta value Δ(y j ) is defined as the time average ( · t ) of the squared derivative of y j and is therefore a measure of the slowness (or rather fastness) of the signal, whereas the constraints require that the output signals are normalized, not constant, and represent different aspects of the input signal. SFA computes an optimal solution to the problem above within a linear feature space (possibly after the data has been expanded nonlinearly). Typically, discrete time is used, and the derivative is approximated asẏ j (t) def = y j (t + 1) − y j (t).

Hierarchical SFA (HSFA) and terminology
Hierarchical SFA (HSFA) was already introduced in the paper that first introduces the SFA algorithm (Wiskott 1998), where it is employed as a model of the visual system for learning invariant representations.
Various publications have followed this biological interpretation. Franzius et al. (2007) have used HSFA to learn invariant features from the simulated view of a rat walking inside a box. In conjunction with a sparseness post-processing step, the extracted features are similar to the responses of place cells in the hippocampus.
Other works have exploited the computational efficiency of HSFA compared to direct SFA. Franzius et al. (2011) have used HSFA for object recognition from images and to estimate pose parameters of single objects moving and rotating over a homogeneous background. Escalante-B. and  have used an 11-layer HGSFA network to accurately find the horizontal position of faces in photographs.
In general, HSFA networks (composed of several SFA nodes) are directed and acyclic. They are usually structured in multiple layers of nodes, following a grid structure. Most HSFA networks in the literature have a similar architecture. Typical differences are how the data is split into smaller blocks and the particular pre-processing done by before the SFA nodes themselves.
For simplicity, we refer to the input data as layer 0. Important parameters that define the structure of the network include: (a) The output dimensionality of the nodes. (b) The fan-in of the nodes, which is the number of nodes (or data elements) in a previous layer that feed into them. (c) The receptive field of the nodes, which refers to all the elements of the input data that (directly or indirectly) provide input to a particular node. (d) The stride of a layer, which tells how far apart the inputs to adjacent nodes in a layer are. If the stride is smaller than the fan-in along the same direction, then at least one node in the previous layer will feed two or more nodes in the current layer. This is called receptive field overlap.

Graph-based SFA (GSFA)
Graph-based SFA (Escalante-B. and Wiskott 2013) is an extension to SFA designed for supervised learning. GSFA extracts a compact set of features that is usually post-processed by a typical supervised algorithm to generate the final label or class estimate.
In GSFA the training data takes the form of a training graph G = (V, E) (illustrated in Fig. 3a), which is a structure composed of a set V of vertices x(n), each vertex being a sample (i.e. an I -dimensional vector), and a set E of edges (x(n), x(n )), which are pairs of samples, with 1 ≤ n, n ≤ N . The index n (or n ) replaces the time variable t of SFA. The edges are undirected and have symmetric weights γ n,n = γ n ,n , which indicate the label similarity between the connected vertices; also, each vertex x(n) has an associated weight v n > 0, which can be used to reflect its frequency. This representation includes the standard time series of SFA as a special case (Fig. 3b). The GSFA optimization problem (Escalante-B. and Wiskott 2013) can be stated as follows. For 1 ≤ j ≤ J , find output features y j (n) = g j (x(n)) with g j ∈ F , where 1 ≤ n ≤ N and F is the function space, such that the objective function (weighted delta value) under the constraints 1 Q n v n y j (n) = 0 (weighted zero mean), 1 Q n v n (y j (n)) 2 = 1 (weighted unit variance), and with R def = n,n γ n,n and Q def = n v n . The factors 1/R and 1/Q provide invariance to the scale of the edge and vertex weights. Typically, a linear function space is used, but the input samples are preprocessed by a nonlinear expansion function.
One should choose the edge weights of the training graph properly, because they control what kind of features are extracted by GSFA. In general, to obtain features useful for classification, one should favor connections between samples from the same class (stronger edge weighs for same-class samples). To obtain features useful for regression, one should favor connections between samples with similar labels. In Sects. 5.2.3 and 5.3.2, we combine various training graphs to learn feature representations that allow the solution of various classification and regression problems simultaneously (multi-label learning).
The rest of the article is organized as follows. In the next section we analyze the advantages and limitations of hierarchical networks for slow feature extraction. The shortcomings we expose motivate the introduction of the iSFA (and iGSFA) algorithms in Sect. 4. In Sect. 5 these algorithms, or more precisely, hierarchical neural networks built with them, are evaluated experimentally using three different problems of supervised and unsupervised DR. We conclude with a discussion section.

Assessment of hierarchical processing by HSFA and HGSFA networks
In this section we analyze HSFA networks in terms of their advantages and limitations. This analysis is crucial, since it justifies the extensions with information preservation (HiSFA and HiGSFA) proposed in Sect. 4.

Advantages of HSFA and HGSFA networks
The central advantages of hierarchical processing in H(G)SFA-compared to direct (G)SFA-have been mentioned in the introduction and Sect. 2.2: (1) It reduces overfitting and can be seen as a regularization method, (2) the nonlinearity of the layers of the neural network accumulates in a compositional manner, so that even when using simple expansions the network as a whole may describe a highly nonlinear feature space, and (3) better computational efficiency than (G)SFA.
Some remarks about these advantages are pertinent: Advantage (1) is explained by the fact that the input dimensionality to the nodes of the hierarchical network is much smaller than the original input dimensionality, whereas the number of samples remains unchanged. Thus, individual nodes are less susceptible to overfitting. Consequently, the gap in generalization performance between HSFA and direct SFA is larger when polynomial expansions are introduced (compared to their linear versions), because the dimensionality of the expanded data is much larger in direct SFA and this translates into stronger overfitting. On the other hand HSFA is not guaranteed to find the optimal (possibly overfitting) solution within the available function space whereas SFA is.
Advantage (2) is valuable in practice, because most real-life problems are nonlinear. A complex feature space may be necessary to extract the slowest hidden parameters and solve the supervised problem with good accuracy.
Advantage (3) is addressed more precisely in the following paragraphs by first recalling the computational complexity of SFA and GSFA. This complexity is then compared with the complexity of a particular HSFA network ("Appendix A"). We focus on the training complexity rather than on the complexity of feature extraction during testing, because the former can be considerable in practice, whereas the latter is relatively lightweight in both HSFA and direct SFA.
Following standard notation of algorithm complexity, computational (time) complexity is denoted by T (e.g., T SFA ). Training linear SFA has a time (computational) complexity where N is the number of samples and I is the input dimensionality (possibly after a nonlinear expansion). The same complexity holds for GSFA if one uses an efficient training graph (e.g., the serial and clustered graphs, see Sects. 5.2.3 and 5.3.2), otherwise (for arbitrary graphs) it can be as large as For large I (i.e., high-dimensional data) direct SFA and direct GSFA are therefore inefficient. 1 In contrast, HSFA and HGSFA can be much more efficient. The exact complexity of HSFA and HGSFA depends on the structure and parameters of the hierarchical network. "Appendix A" proves that the training complexity is linear in I and N for certain networks.

Limitations of HSFA networks
Although HSFA and HGSFA networks show remarkable advantages, they also have some shortcomings in their current form. The analysis in this section focuses on HSFA, but it also applies to other networks in which the nodes have only one criterion for DR, namely slowness maximization, such as HGSFA. Besides the slowness maximization objective, no other restriction is imposed on the nodes; they might be linear or nonlinear, include additive noise, clipping, various passes of SFA, etc.
It is shown here that relying only on slowness to determine which aspects of the data are preserved results in two shortcomings: unnecessary information loss and poor input reconstruction, explained below.
(1) Unnecessary information loss This shortcoming occurs when the nodes of the network discard dimensions of the data that are not significantly slow locally (i.e., at the node level), but which would have been useful for slowness optimization by other nodes in subsequent layers if they had been preserved (and combined with other dimensions).
The following minimal theoretical experiment shows that dimensions crucial to extract global slow features are not necessarily slow locally. Consider four zero-mean, unit-variance signals: s 1 (t), s 2 (t), s 3 (t) and n(t) that can only take binary values, either −1 or +1 (n stands for noise here, and t is time, or more precisely, sample number). Assume these signals are ordered by slowness (Δ s 1 < Δ s 2 < Δ s 3 < Δ n = 2.0) and they are statistically independent. The value 2.0 is precisely the expected Δ value of a random unit-variance i.i.d. noise feature. The same holds for GSFA if the graph is consistent and has no self-loops (Escalante-B. and Wiskott 2016).
Let the 4-dimensional input to the network be (x 1 , x 2 , x 3 , x 4 ) def = (s 2 , s 1 n, s 3 , n) and assume the number of samples is large enough. The direct application of quadratic SFA (QSFA, i.e., a quadratic expansion followed by linear SFA) to this data would allow us to extract the slowest possible feature, namely, x 2 x 4 = (s 1 n)n = s 1 (or equivalently −x 2 x 4 ). However, let us assume that a 2-layer quadratic HSFA (QHSFA) network with 3 nodes is used, where the output of the network is: QSFA QSFA(s 2 , s 1 n), QSFA(s 3 , n) . Each quadratic QSFA node reduces the number of dimensions from 2 to 1. Since Δ s 2 < Δ s 1 n = 2.0, the first bottom node computes QSFA(s 2 , s 1 n) = s 2 , and since Δ s 3 < Δ n = 2.0, the second bottom node computes QSFA(s 3 , n) = s 3 . The top node would then extract QSFA(s 2 , s 3 ) = s 2 . Therefore, the network would miss the slowest feature, s 1 , even though it actually belongs to the feature space spanned by the network.
The problem can be expressed in information theoretic terms: I (s 1 n, s 1 ) = 0 , and I (n, s 1 ) = 0 , but where H is entropy, and I denotes mutual information. 2 Equations (7)-(9) show that it is impossible to locally rule out that a feature contains information that might yield a slow feature (in this case s 1 ), unless one globally observes the whole data available to the network. The problem above could be solved if the network could preserve n and s 1 n by resorting to another criteria besides slowness. For example, if the signals n and s 1 n were 10n and 10s 1 n instead, one could distinguish and preserve them based on their larger variance. Unnecessary information loss is not only a theoretical concern, it can affect applications in practice. Figure 4 shows the Δ values of the slowest features extracted by the first layer of an HGSFA network trained for age estimation from human face images. Most Δ values are approximately 2.0, and only a few of them are significanlty smaller than 2.0. The value 2.0 is crucial; a feature with Δ = 2.0 can be a transformation of relevant inputs, a transformation of irrelevant noise inherent to the inputs, or something between these cases. In fact, if two or more feasible features have the same Δ value, GSFA outputs an arbitrary rotation of them, even though one might prefer features that are transformations of the relevant input rather than noise components. Due to DR only a fraction of the features may be preserved and the rest is discarded even though some of them might still contain useful information.
One might try to preserve many features to reduce information loss. However, this might be impractical, because it would increase the computational cost and contradict the goal of dimensionality reduction.
(2) Poor input reconstruction Input reconstruction is the task of generating an (artificial) input having a given feature representation (or an approximation of it). Wilbert (2012) has studied this task for HSFA.
In the image domain, reconstruction can be useful to extrapolate images by computing distortions to an input image that reflect modifications introduced to the output features. For example, assume SFA was trained to extract body mass index (BMI) from facial images. Reconstruction would allow us to visualize how a particular person would look after losing or gaining a few kilograms.
Experiments have shown that input reconstruction from top-level features extracted by HSFA is a challenging task (Wilbert 2012). We have confirmed this observation through previous experiments using various nonlinear methods for input reconstruction, including local and global methods.
We show here that poor input reconstruction may not be caused by the weakness of the reconstruction algorithms employed but rather by insufficient reconstructive information in the slow features: The extracted features ideally depend only on the hidden slow parameters and are invariant to any other factor. In the BMI estimation example, the extracted features would be strongly related to the BMI (and nonlinear transformations of it). Thus, the features would be mostly invariant to other factors, such as the identity of the person, his or her facial expression, the background, etc. Therefore, in theory only BMI information would be available for reconstruction.
In practice, residual information about the input data can still be found in the extracted features. However, one cannot rely on this information because it is partial (making reconstructions not unique) and highly nonlinear (making it difficult to untangle). Even the features extracted by linear SFA typically result in inaccurate reconstructions. Since HSFA has many layers of SFA nodes, the problem is potentially aggravated.
The connection between the problems of unnecessary information loss and poor input reconstruction is evident if one distinguishes between two types of information: (a) information about the full input data and (b) information about the global slow parameters. Losing (a) results in poor input reconstruction, whereas losing (b) results in unnecessary information loss. Of course, (a) contains (b). Therefore, both problems originate from losing different but related types of information.

Hierarchical information-preserving GSFA (HiGSFA)
In this section, we formally propose HiGSFA, an extension to HGSFA that counteracts the problems of unnecessary information loss and poor input reconstruction described above by extracting reconstructive features in addition to slow features. HiGSFA is a hierarchical implementation of information-preserving GSFA (iGSFA). To simplify the presentation we first focus on information-preserving SFA (iSFA) and explain later how to easily extend iSFA to iGSFA and HiGSFA. We write iSFA with lowercase 'i' to distinguish it from independent SFA (ISFA, Blaschke et al. 2007).
iSFA combines two learning principles: the slowness principle and information preservation. Information preservation requires the maximization of the mutual information between the output features and the input data. However, for finite, discrete and typically unique data samples, it is difficult to measure and maximize mutual information unless one assumes a specific probability model. Therefore, we implement information preservation more practically as the minimization of a reconstruction error. A closely related concept is the explained variance, but this term is avoided here because it is typically restricted to linear transformations.

Algorithm overview (iSFA)
HiSFA improves feature extraction of HSFA networks at the node and global level. The SFA nodes of an HSFA network are replaced by iSFA nodes, leaving the network structure unchanged (although the network structure could be tuned to achieve better accuracy, if desired).
The feature vectors computed by iSFA consist of two parts: (1) a slow part derived from SFA features, and (2) a reconstructive part derived from principal components (PCs). Generally speaking, the slow part captures the slow aspects of the data and is basically composed of standard SFA features, except for an additional linear mixing step to be explained in Sects. 4.2 and 4.4. The reconstructive part ignores the slowness criterion and describes the input linearly as closely as possible (disregarding the part already described by the slow part). In Sect. 5 we show that, although the reconstructive features are not particularly slow, they indeed contribute to global slowness maximization. Figure 5 shows the components of iSFA, and Algorithm 1 summarizes the algorithm compactly. The iSFA algorithm (training phase) is defined as follows. Let X def = (x 1 , . . . , x N ) be the I -dimensional training data, D the output dimensionality, h(·) the nonlinear expansion function, and Δ T ≈ 2.0 (a Δ-threshold, in practice slightly smaller than 2.0). First, the average samplex def = 1 N n x n is removed from the N training samples resulting in the centered data X def = {x n }, with x n def = x n −x, for 1 ≤ n ≤ N . Then, X is expanded via h(·), resulting in vectors z n = h(x n ) of dimensionality I . Afterwards, linear SFA is trained with the expanded data Z def = {z n }, resulting in a weight matrix W SFA and an average expanded vectorz. The slow features extracted from Z are s n = W T SFA (z n −z). The first J components of s n with Δ < Δ T and J ≤ min(I , D) are denoted s n . The remaining components of s n are discarded. S def = {s n } has J features, each of them having zero mean and unit variance. The next steps modify the amplitude of S : The centered data X is approximated from S linearly by using ordinary least squares regression to compute a matrix M and a vector d, such that

Algorithm description (training phase of iSFA)
where A is the approximation of the centered data given by the slow part (i.e., x n ≈ a n def = Ms n +d) and 1 is a vector of 1s of length N . Since X and S are centered, d could be discarded because d = 0. However, when GSFA is used the slow features have only weighted zero mean, and d might improve the approximation of X . Afterwards, the QR decomposition of M M = QR Algorithm 1 Training phase of iSFA Require: D > 0: output dimensionality 1: procedure train(X) % X = (x 1 , . . . , x N ): training samples 2: ∀n : x n ← x n −x %x: average sample 3: ∀n : z n ← h(x n ) % Z = (z 1 , . . . , z N ): expanded samples 4: W SFA ,z ← SFA.train(Z, output_dim = min(I , D)) 5: ∀n : s n ← W SFA T (z n −z) 6: ∀n : s n ← (s n1 , . . . , s n J ) T % Preserve the first J features with Δ < Δ T 7: ∀n : a n ← Ms n + d % For M and d, such that MS + d ≈ X 8: ∀n : y n ← Rs n % For QR = M, the QR decomposition of M 9: ∀n : c n ← W T PCA b n % Only D − J PCs are preserved 12: ∀n : y n = y n |c n % Concatenation of slow and reconstructive parts 13: is computed, where Q is orthonormal and R is upper triangular. Then, the (amplitudecorrected) slow feature part is computed as Section 4.4 justifies the mixing and scaling (12) of the slow features s n .
To obtain the reconstructive part, residual data b n def = x n − a n is computed, i.e., the data that remains after the data linearly reconstructed from y n (or s n ) is removed from the centered data. Afterwards, PCA is trained with {b n }, resulting in a weight matrix W PCA . There is no bias term, because b n is centered. The reconstructive part c n is then defined as the first D − J principal components of b n and computed accordingly.
Thereafter, the slow part y n (J features) and the reconstructive part c n (D − J features) are concatenated, resulting in the D-dimensional output features y n def = y n |c n , where | is vector concatenation.
Finally, the algorithm returns Y = (y 1 , . . . , y N ),x, W SFA ,z, W PCA , J , Q, R, and d. The output features Y are usually computed only during feature extraction. Still, we keep them here to simplify the understanding of the signals involved.
We would like to remark that the algorithm proposed above takes care of the following important considerations: (a) From the total output dimensionality D, it decides how many features the slow and reconstructive part should contain. (b) It minimizes the redundancy between the slow and the reconstructive part, allowing the output features to be more compact and have higher information content. (c) It corrects the amplitudes of the slow features (SFA features have unit variance) to make them compatible with the PCs and allow their processing by PCA in subsequent nodes (PCA is a rotation and projection. Thus, it preserves the amplitude of the original data in the retained directions).

Feature extraction by iSFA
The feature extraction algorithm (described in Algorithm 2) is similar to the training algorithm, except that the parametersx, W SFA ,z, W PCA , J , Q, R, and d have already been learned from the training data. Algorithm 2 processes a single input sample, but it can be easily and efficiently adapted to process multiple input samples by taking advantage of matrix operations.

Algorithm 2 Feature extraction with iSFA
Require: D,x, W SFA ,z, W PCA , J , Q, R, d 1: procedure extract(x) % x: a new sample 2: x ← x −x %x: mean of the training data 3: c ← W T PCA b 10: y = y |c 11: return y 12: end procedure

Mixing and scaling of slow features: QR scaling
In the iSFA algorithm, the J -dimensional slow features s n are transformed into the scaled y features. Such a transformation is necessary to make the amplitude of the slow features compatible with the amplitude of the PCA features, so that PCA processing of the two sets of features together is possible and meaningful in the next layers.
In general, a feature scaling method should ideally offer two key properties of PCA.
(1) If one adds unit-variance noise to one of the output features (e.g., principal components), the reconstruction error also increases by one unit on average. (2) If one adds independent noise to two or more output features simultaneously, the reconstruction error increases additively.
The feature scaling method (10)-(12) used by Algorithm 1 is called QR scaling, and it can be shown that it fulfills the two properties above. QR scaling ensures that the amplitude of the slow features is approximately equal to the reduction in the reconstruction error that each feature allows. In practice, a lower bound on the scales (not shown in the pseudo-code) ensures that all features have amplitudes > 0 even if they do not contribute to reconstruction.
In the iSFA algorithm, we approximate the input linearly asx =ã +b +x, wherẽ a = Qy + d andb = W PCA c (see Sect. 4.5 and recall that W PCA is the Moore-Penrose pseudoinverse of W T PCA ). Approximations are denoted here using tilded variables. Therefore, vector y = y |c fulfills the two key properties of reconstruction of PCA above because matrix Q and matrix W PCA are orthogonal, and because the columns of the two matrices are mutually orthogonal.
One small drawback is that (12) mixes the slow features, which is undesirable when one uses certain expansion functions. Therefore, as an alternative we propose a sensitivity-based scaling method in "Appendix B", which does not mix the slow features.

Input reconstruction from iSFA features
One interesting property of iSFA is that both the slow and the reconstructive features are nonlinear w.r.t. the input data. The slow features are nonlinear due to the expansion function. The residual data is nonlinear because it is computed using the (nonlinear) slow part and the centered data. The reconstructive features are computed using the residual data and are linear w.r.t. the residual data but nonlinear w.r.t. the input data.
Even though the computed features are all nonlinear, iSFA allows a linear approximation of the input (linear input reconstruction). In contrast, standard SFA does not have a standard input reconstruction method, although various gradient-descent and vector-quantization methods have been tried (e.g., Wilbert 2012) with limited success.
The reconstruction algorithm simply computes the input approximation described in Sect. 4.4, and further detailed in Algorithm 3.

Algorithm 3 Linear input reconstruction for iSFA
,ã +b +x 5: returnx 6: end procedure Moreover, the linear reconstruction algorithm has practical properties: It is simpler than the feature extraction algorithm, since the nonlinear expansion h and W SFA are not used, and it has lower computational complexity, because it consists of only two matrix-vector multiplications and three vector additions, none of them using expanded I -dimensional data.
Alternatively, nonlinear reconstruction from iSFA features is also possible, as described in the algorithm below. Assume y is the iSFA feature representation of a sample x, which we denote as y = iSFA(x). Since x is unknown, the reconstruction error cannot be computed directly. However, one can indirectly measure the accuracy of a particular reconstructionx by means of the feature error, which is defined here as e feat def = ||y − iSFA(x)||. The feature error can be minimized forx using the function iSFA(·) as a black box and gradient descent or other generic nonlinear minimization algorithms. Frequently, such algorithms require a first approximation, which can be very conveniently provided by the linear reconstruction algorithm.
Although nonlinear reconstruction methods might result in higher reconstruction accuracy than linear methods in theory, they are typically more expensive computationally. Moreover, in the discussion we explain why minimizing e feat typically increases the reconstruction error in practice.

Some remarks on iSFA and derivation of iGSFA and HiGSFA
Clearly, the computational complexity of iSFA is at least that of SFA, because iSFA consists of SFA and a few additional computations. However, none of the additional computations is done on the expanded I -dimensional data but at most on I or D-dimensional data (e.g., PCA is applied to I -dimensional data, and the QR decomposition is applied to an I × I -matrix, which have an O(N I 2 + I 3 ) and O(I 3 ) complexity, respectively). Therefore, iSFA is slightly slower than SFA but has the same complexity order. Practical experiments (Sect. 5) confirm this observation.
The presentation above focuses on iSFA. To obtain information-preserving GSFA (iGSFA) one only needs to substitute SFA by GSFA inside the iSFA algorithm and provide GSFA with the corresponding training graph during the training phase. Notice that GSFA features have weighted zero mean instead of the simple (unweighted) zero mean enforced by SFA. This difference has already been compensated by vector d. HiGSFA is constructed simply by connecting iGSFA nodes in a hierarchy, just as HSFA is constructed by connecting SFA nodes.
The iGSFA node has been implemented in Python (including iSFA) and is publicly available in Cuicuilco 3 as well as in the MDP toolkit (Zito et al. 2009).

Experimental evaluation of HiGSFA
This section evaluates the proposed extensions, HiSFA and HiGSFA, on three problems: (1) The unsupervised extraction of temporally stable features from images simulating the view of a rat walking inside a box, which is addressed with HiSFA. (2) The extraction of four pose parameters of faces in image patches, which is a multi-label learning problem suitable for HiGSFA. (3) The estimation of age, race, and gender from human face photographs, which is also solved using HiGSFA. The chosen problems exemplify the general applicability of the proposed algorithms.

Experiment 1: Extraction of slow features with HiSFA from the view of a rat
The input data in this first experiment consists of the (visual) sensory input of a simulated rat. The images have been created using the RatLab toolkit (Schönfeld and Wiskott 2013). RatLab simulates the random path of a rat as it explores its environment, rendering its view according to its current position and head-view angle. For simplicity the rat is confined to a square area bounded by four walls having different textures. Franzius et al. (2007) have shown theoretically and in simulations that the slowest features one extracted from this data are trigonometric functions of the position of the rat and its head direction (i.e., the slow configuration/generative parameters).

Training and test images
For this experiment, first a large fixed sequence of 200,000 images was generated. Then, the training data of a single experiment run is selected randomly as a contiguous subsequence of size 40,000 from the full sequence. The test sequence is selected similarly but enforcing no overlap with the training sequence. All images are in color (RGB) and have a resolution of 320×40 pixels, see Fig. 6.

Description of HiSFA and HSFA networks
To evaluate HiSFA we reproduced an HSFA network that has already been used (Schönfeld and Wiskott 2013). This network has three layers of quadratic SFA. Each layer consists of three components: (1) linear SFA, which reduces the data dimensionality to 32 features, (2) a quadratic expansion, and (3) linear SFA, which again reduces the expanded dimensionality to 32 features. The first two layers are convolutional (i.e., the nodes of these layers share the same weight matrices).
For a fair comparison, we built an HiSFA network with exactly the same structure as the HSFA network described above. The HiSFA network has additional hyperparameters Δ T (one for each node), but in order to control the size of the slow part more precisely, we For computational convenience we simplified the iSFA algorithm to enable the use of convolutional layers as in the HSFA network. Convolutional layers can be seen as a single iSFA node cloned at different spatial locations. Thus, the total input to a node consists not of a single input sequence (as assumed by the iSFA algorithm) but of several sequences, one at each location of the node. The iSFA node was modified as follows: (1) the decorrelation step (between the slow and reconstructive parts) is removed and (2) all slow features are given the same variance as the median variance of the features in the corresponding reconstructive part (instead of QR scaling).

Results
To evaluate HSFA and HiSFA we compute Δ values of the 3 slowest features extracted from test data, which are shown in Table 1. The experiments were repeated 5 times, each run using training and test sequences randomly computed using the procedure of Sect. 5.1.1. Table 1 shows that HiSFA extracts clearly slower features than HSFA for both training and test data in the main setup (40 k training images). For instance, for test data Δ 1 , Δ 2 , and Δ 3 are 28-52% smaller in HiSFA than in HSFA. This is remarkable given that HSFA and HiSFA span the same global feature space (same model capacity).
In order to compare the robustness of HSFA and HiSFA w.r.t. the number of images in the training sequence, we also evaluate the algorithms using shorter training sequences of 5 k, 10 k, and 20 k images. As usual, the test sequences have 40k images. Table 1 shows that HiSFA computes slower features than HSFA given the same number of training images. This holds for both training and test data. In fact, when HiSFA is trained using only 5 k images the slowest extracted features are already slower than those extracted by HSFA trained on 40 k images (both for training and test data). In contrast, the performance of HSFA decreases significantly when trained on 10 k and 5 k images. Therefore, HiSFA is much more robust than HSFA w.r.t. the number of training images (higher sample efficiency).

Experiment 2: Estimation of face pose from image patches
The second experiment evaluates the accuracy of HiGSFA compared to HGSFA in a supervised learning scenario. We consider the problem of finding the pose of a face contained in Smallest delta values of each column have been highlighted in bold Smaller values are better. In the proposed setup the training data consists of 40 k images, but we also show the results when using only 20 k, 10 k, and 5 k training images to illustrate the resistance of HiSFA and HSFA to overfitting. All results have been averaged over 5 runs and include the standard error of the mean. For the HiGSFA networks, the size of the slow part has been fixed to 6 features in all nodes, except in the experiment that uses 5 k training images, where the size of the slow part is 3 features. This hyperparameter was tuned using a different run and random seed an image patch. Face pose is described by four parameters: (1) the horizontal and (2) vertical position of the face center (denoted by x-pos and y-pos, respectively), (3) the size of the face relative to the image patch, and (4) the in-plane rotation angle of the eyes-line. Therefore, we solve a regression problem on four real-valued labels. The resulting system can be easily applied to face tracking and to face detection with an additional face discrimination step.

Generation of the image patches
The complete dataset consists of 64,470 images that have been extracted from a few datasets to increase image variability: Caltech (Fink et al. 2003), CAS-PEAL (Gao et al. 2008), FaceTracer (Kumar et al. 2008), FRGC (Phillips et al. 2005), and LFW (Huang et al. 2007).
In each run of the system two disjoint image subsets are randomly selected, one of 55,000 images, used for training, and another of 9000 images, used for testing. The images are processed in two steps. First they are normalized (i.e., centered, scaled, and rotated). Then, they are rescaled to 64 × 64 pixels and are systematically randomized: In the resulting image patches the center of the face deviates horizontally from the image center by at most ±20 Fig. 7 Examples of image patches after pose randomization illustrating the range of variations in pose. The eyes of the subjects have been pixelated for privacy and copyright reasons pixels, vertically by at most ±10 pixels, the angle of the eye-line deviates from the horizontal by at most ±22.5 deg, and the size of the largest and smallest faces differs by a factor of √ 2 (a factor of 2 in their area). The concrete pose parameters are sampled from uniform distributions in the above-mentioned ranges. To increase sample efficiency, each image of the training set is used twice with different random distortions, thus the effective training set has size 110,000. Examples of the final image patches are shown in Fig. 7.

HiGSFA and HGSFA networks
For comparison purposes, we adopt an HGSFA network that has been previously designed (and partially tuned) to estimate facial pose parameters by Escalante-B. and Wiskott (2013) without modification. Most SFA nodes of this network consist of expansion function followed by linear SFA, except for the SFA nodes of the first layer, which have an additional preprocessing step that uses PCA to reduce the number of dimensions from 16 to 13. In contrast to the RatLab networks, this network does not use weight sharing, increasing feature specificity at each node location. For a fair comparison, we construct an HiGSFA network having the same structure as the HGSFA network (e.g., same number of layers, nodes, expansion function, data dimensions, receptive fields). Similarly to experiment 1, we directly set the number of slow features preserved by the iGSFA nodes, which in this experiment varies depending on the layer from 7 to 20 features (these values have been roughly tuned using a run with a random seed not used for testing). These parameters used to construct the networks are shown in Table 2.

Training graphs that encode pose parameters
In order to train HGSFA and HiGSFA one needs a training graph (i.e., a structure that contains the samples and the vertex and edge weights). A few efficient predefined graphs have already been proposed (Escalante-B. and Wiskott 2013), allowing training of GSFA with a complexity of O(N I 2 + I 3 ), which is of the same order as SFA, making this type of graphs competitive in terms of speed. One example of a training graph for classification is the clustered graph (see Sect. 5.3.2) and one for regression is the serial training graph, described below.
Serial training graph (Escalante-B. and Wiskott 2013) The features extracted using this graph typically approximate a monotonic function of the original label and its higher frequency harmonics. To solve a regression problem and generate label estimates in the appropriate domain, a few slow features (e.g., extracted using HGSFA or HiGSFA) are post-processed by an explicit regression step. There are more training graphs suitable for regression (e.g., mixed graph), but this one has consistently given good results in different experiments. Both networks have the same structure. Layer 0 denotes the input image, whereas layer 9 is the top node. The parameter "size of slow part" (used only by HiGSFA) indicates the number of slow features preserved by the nodes of the respective layer and eliminates parameter Δ T Fig. 8 Illustration of a serial training graph used to learn the position of the face center (x-pos). The training images are first ordered by increasing x-pos and then grouped into L = 50 groups of N g = 2200 images each. Each dot represents an image, edges represent connections, and ovals represent the groups. The images of the first and last group have weight 1 and the remaining images have weight 2 (image weights are represented by smaller/bigger dots). All edges have a weight of 1. The serial graphs for learning x-pos, y-pos, angle, and scale are constructed in the same way. Figure 8 illustrates a serial graph useful to learn x-pos. In general, a serial graph is constructed by ordering the samples by increasing label. Then, the samples are partitioned into L groups of size N g = N /L. A representative label ∈ { 1 , . . . , L } is assigned to each group, where 1 < 2 < · · · < L . Edges connect all pairs of samples from two consecutive groups with group labels ( l and l+1 ). Thus, all connections are inter-group, no intra-group connections are present. Notice that since any two vertices of the same group are adjacent to exactly the same neighbors, they are likely to be mapped to similar outputs by GSFA. We use this procedure to construct four serial graphs G x-pos , G y-pos , G angle , and G scale that encode the x-pos, y-pos, angle, and scale label, respectively. All of them have the same parameters: L = 50 groups, N = 110,000 samples, and N g = 2200 samples per group.
Combined graph to learn all pose parameters One disadvantage of current pre-defined graphs is that they allow to learn only a single (real-valued or categorical) label. In order to learn various labels simultaneously, we resort to a method proposed earlier that allows the combination of graphs (Escalante-B. and Wiskott 2016). We use this method here for the first time on real data to create an efficient graph that encodes all four labels combining G x-pos , G y-pos , G angle , and G scale . The combination preserves the original samples of the graphs, but the vertex and edge weights are added, which is denoted G 4L def = G x-pos + G y-pos + G angle + G scale . The combination of graphs guarantees that the slowest optimal free responses of the combined graph span the slowest optimal free responses of the original graphs, as long as three conditions are fulfilled: (1) all graphs have the same samples and are consistent, (2) all graphs have the same (or proportional) node weights, and (3) optimal free responses that are slow in one graph (Δ < 2.0) should not be fast (Δ > 2.0) in any other graph. Since the labels (i.e., pose parameters) have been computed randomly and independently of each other, these conditions are fulfilled on average (e.g., for the G x-pos graph, any feature y that solely depends on y-pos, scale, or angle has Δ G x-pos y ≈ 2.0). However, the naive graph G 4L does not take into account that the 'angle' and 'scale' labels are more difficult to estimate compared to the x-pos and y-pos labels. Thus, the feature representation is dominated by features that encode the easier labels (and their harmonics) and only few features encode the difficult labels, making their estimation even more difficult.
To solve this problem, we determine weighting factors such that each label is represented at least once in the first five slow features. Features that are more difficult to extract have higher weights to avoid an implicit focus on easy features. The resulting graph and weighting factors are: G 4L def = G x-pos + 1.25G y-pos + 1.5G angle + 1.75G scale .
Supervised post-processing We extract 20 slow features from the training dataset to train four separate Gaussian classifiers (GC), one for each label. Ground truth classes are generated by discretizing the labels into 50 values representing class 1-50. After the GC has been trained with this data, the pose estimate (on the test patches) is computed using the soft-GC method (Escalante-B. and Wiskott 2013), which exploits class membership probabilities of the corresponding classifier: Let P(C l |y) be the estimated class probability that the input sample x with feature representation y = g(x) belongs to the group with average label l . Then, the estimated label is˜ def = 50 l=1 l · P(C l |y) .
Equation (14) has been designed to minimize the root mean squared error (RMSE), and although it incurs an error due to the discretization of the labels, the soft nature of the estimation has provided good accuracy and low percentage of misclassifications.

Results: Accuracy of HiGSFA for pose estimation
The HiGSFA and HGSFA networks were trained using graph G 4L described above. Table 3 shows the estimation error of each pose parameter. The results show that HiGSFA yields more accurate estimations than HGSFA for all pose parameters. We would like to remark that the HiGSFA network has the same structure as the HGSFA network, which has been tuned for the current problem. However, one may adapt the network structure specifically for HiGSFA to further exploit the advantages of this algorithm. Concretely, the improved robustness of HiGSFA allows to handle more complex networks (e.g., by increasing the output dimensionality of the nodes or using more complex expansion functions). In the following Best results for each parameter are shown in bold The number of slow features that were post-processed by the soft-GC regression step is indicated in parenthesis. This hyperparameter was tuned for each algorithm and pose parameter using two random runs different from those of the evaluation. All results were averaged over five runs and include the standard error of the mean experiment on age estimation, the network structure of HiGSFA and its hyperparameters are tuned, yielding higher accuracy.

Experiment 3: Age estimation from frontal face photographs
Systems for age estimation from photographs have many applications in areas such as humancomputer interaction, group-targeted advertisement, and security. However, age estimation is a challenging task, because different persons experience facial aging differently depending on several intrinsic and extrinsic factors.
The first system for age estimation based on SFA was a four-layer HSFA network that processes raw images without prior feature extraction (Escalante-B. and Wiskott 2010). The system was trained on synthetic input images created using special software for 3Dface modeling. However, the complexity of the face model was probably too simple, which allowed linear SFA (in fact linear GSFA) to achieve good performance, and left open the question of whether SFA/GSFA could also be successful on real photographs. This subsection first describes the image pre-processing method. Then, a training graph used to learn age, race, and gender simultaneously is presented. Finally, an HiGSFA network is described and evaluated according to three criteria: feature slowness, age estimation error (compared with state-of-the-art algorithms), and linear reconstruction error.

Image database and pre-processing
The MORPH-II database (i.e. MORPH, Album 2, Ricanek Jr. and Tesafaye 2006) is a large database suitable for age estimation. It contains 55,134 images of about 13,000 different persons with ages ranging from 16 to 77 years. The images were taken under partially controlled conditions (e.g. frontal pose, good image quality and lighting), and include variations in head pose and expression. The database annotations include age, gender (M or F), and "race" ("black", "white", "Asian", "Hispanic", and "other", denoted by B, W, A, H, and O, respectively) as well as the coordinates of the eyes. The procedure used to assign the race label does not seem to be documented. Most of the images are of black (77%) or white races (19%), making it probably more difficult to generalize to other races, such as Asian.
We follow the evaluation method proposed by Guo and Mu (2014), which has been used in many other works. In this method, the input images are partitioned in 3 disjoint sets S 1 and S 2 of 10, 530 images, and S 3 of 34,074 images. The racial and gender composition of S 1 and S 2 is the same: about 3 times more images of males than females and the same number of white and black people. Other races are omitted. More exactly, |M B| = |MW | = 3980, |F B| = |F W | = 1285. The remaining images constitute the set S 3 , which is composed as follows: |M B| = 28,872, |F B| = 3187, |MW | = 1, |F W | = 28, |M A| = 141, |M H| = 1667, |M O| = 44, |F A| = 13, |F H| = 102 and |F O| = 19. The evaluation is done twice by using either S 1 and S 1 -test def = S 2 + S 3 or S 2 and S 2 -test def = S 1 + S 3 as training and test sets, respectively.
We pre-process the input images in two steps: pose normalization and face sampling (Fig. 2). The pose-normalization step fixes the position of the eyes ensuring that: (a) the eye line is horizontal, (b) the inter-eye distance is constant, and (c) the output resolution is 256×260 pixels. After pose normalization, a face sampling step selects the head area only, enhances the contrast, and scales down the image to 96×96 pixels.
In addition to the S 1 , S 2 , and S 3 datasets, three extended datasets (DR, S, and T) are defined in this work: A DR-dataset is used to train HiGSFA to perform dimensionality reduction, an S-dataset is used to train the supervised step on top of HiGSFA (a Gaussian classifier), and a T-dataset is used for testing. The DR and S-datasets are created using the same set of training images (either S 1 or S 2 ), and the T-dataset using the corresponding test images, either S 1 -test or S 2 -test.
The images of the DR and S-datasets go through a random distortion step during face sampling, which includes a small random translation of max ±1.4 pixels, a rotation of max ±2 degrees, a rescaling of ±4%, and small fluctuations in the average color and contrast. The exact distortions are sampled uniformly from their respective ranges. Although these small distortions are frequently imperceptible, they teach HiGSFA to become invariant to small errors during image normalization and are necessary due to its feature specificity to improve generalization to test data. Other algorithms that use pre-computed features, such as BIF, or particular structures (e.g., convolutional layers, max pooling) are mostly invariant to such small transformations by construction (e.g., Guo and Mu 2014).
Distortions allow us to increase the number of training images. The images of the DRdataset are used 22 times, each time using a different random distortion, and those of the S-dataset 3 times, resulting in 231,660 and 31,590 images, respectively. The images of the T-dataset are not distorted and used only once.

A multi-label training graph for learning age, gender, and race
We create an efficient training graph by combining three pre-defined graphs: a serial graph for age estimation and two clustered graphs (one for gender and the other for race classification).

Clustered training graphs
The clustered graph generates features useful for classification that are equivalent to those of FDA (see Klampfl and Maass 2010, also compare Berkes 2005aand Berkes 2005b. This graph is illustrated in Fig. 9. The optimization problem associated with this graph explicitly demands that samples from the same class should be mapped to similar outputs. If C is the number of classes, C − 1 output (slow) features can be extracted and passed to a standard classifier, which computes the final class estimate.
The graph used for gender classification is a clustered graph that has only two classes (female/male) of N F = 56,540 and N M = 175,120 samples, respectively. The graph used for race classification is similar to the graph above: Only two classes are considered (B and W), and the number of samples per class is N B = N W = 115,830. Serial training graph for age estimation Serial graphs have been described in Sect. 5.2.3. To extract age-related features, we create a serial graph with L = 32 groups, where each group has 7238 images.
Efficient graph for age, race, and gender estimation We use again the method for multiple label learning (Escalante-B. and Wiskott 2016) to learn age, race, and gender labels, by constructing a graph G 3L that combines a serial graph for age estimation, a clustered graph for gender, and a clustered graph for race. Whereas the vertex weights of the clustered graph are constant, the vertex weights of the serial graph are not (first and last groups have smaller vertex weights), but we believe this does not affect the accuracy of the combined graph significantly. For comparison purposes, we also create a serial graph G 1L that only learns age.
The graph combination method yields a compact feature representation. For example, one can combine a clustered graph for gender (M or F) estimation and another for race (B or W). The first 2 features learned from the resulting graph are then enough for gender and race classification. Alternatively, one could create a clustered graph with four classes (MB, MW, FB, FW), but to ensure good classification accuracy one must keep 3 features instead of 2. Such a representation would be impractical for larger numbers of classes. For example, if the number of classes were C 1 = 10 and C 2 = 12, one would need to extract C 1 C 2 − 1 = 119 features, whereas with the proposed graph combination, one would only need to extract (C 1 − 1) + (C 2 − 1) = 20 features.
Supervised post-processing We use the first 20 or fewer features extracted from the S-dataset to train three separate Gaussian classifiers, one for each label. For race and gender only two classes are considered (B, W, M, F). For age, the images are ordered by increasing age and partitioned in 39 classes of the same size. This hyperparameter has been tuned independently of the number of groups in the age graph, which is 32. The classes have average ages of {16.6, 17.6, 18.4, . . . , 52.8, 57.8} years. To compute these average ages, as well as to order the samples by age in the serial graph, the exact birthday of the persons is used, representing age with a day resolution (e.g., an age may be expressed as 25.216 years).
The final age estimation (on the T-dataset) is computed using the soft-GC method (14), except that 39 groups are used instead of 50. Moreover, to comply with the evaluation protocol, we use integer ground-truth labels and truncate the age estimates to integers. Both networks have the same number of nodes and general structure, but they differ in the type of nodes and in the number of features preserved by them, individually optimized for best performance. Layer 0 denotes the input image, whereas layer 10 is the top node

Evaluated algorithms
We compare HiGSFA to other algorithms: HGSFA, PCA, and state-of-the-art age-estimation algorithms. The structure of the HiGSFA and HGSFA networks is described in Table 4. In both networks, the nodes are simply an instance of iGSFA or GSFA preceded by different linear or nonlinear expansion functions, except in the first layer, where PCA is applied to the pixel data to preserve 20 out of 36 principal components prior to the expansion. The method used to scale the slow features is the sensitivity method, described in "Appendix B". The hyperparameters have been hand-tuned to achieve best accuracy on age estimation using educated guesses, sets S 1 , S 2 and S 3 different to those used for the evaluation, and fewer image multiplicities to speed up the process. The proposed HGSFA/HiGSFA networks are different in several aspects from SFA networks used so far (e.g., Franzius et al. 2007). For example, to improve feature specificity at the lowest layers, no weight sharing is used. Moreover, the input to the nodes (fan-in) originates mostly from the output of 3 nodes in the preceding layer (3×1 or 1×3). Such small fan-ins reduce the computational cost because they minimize the input dimensionality. The resulting networks have 10 layers.
The employed expansion functions consist of different nonlinear functions on subsets of the input vectors and include: (1) The identity function I(x) = x, (2) quadratic terms The max2 function is proposed here inspired by state-ofthe-art CNNs for age estimation (Yi et al. 2015;Xing et al. 2017) that include max pooling or a variant of it. As a concrete example of the nonlinear expansions employed by the HiGSFA network, the expansion of the first layer is I(x 1 , . . . , x 18 ) |0.8ET(x 1 , . . . , x 15 ) | max2(x 1 , . . . , x 17 ) | QT(x 1 , . . . , x 10 ), where | indicates vector concatenation. The expansions used in the remaining layers can be found in the available source code. The parameter Δ T of layers 3 to 10 is set to 1.96. Δ T is not used in layers 1 and 2, and instead the number of slow features is fixed to 3 and 4, resp. The number of features given to the supervised algorithm, shown in Table 5, has been tuned for each DR algorithm and supervised problem.
Since the data dimensionality allows it, PCA is applied directly (it was not resorted to hierarchical PCA) to provide more accurate principal components and smaller reconstruction errors.

Experimental results
The results of HiGSFA, HGSFA and PCA (as well as other algorithms, where appropriate) are presented from three angles: feature slowness, age estimation error, and reconstruction error. Individual scores are reported as a ± b, where a is the average over the test images (S 1 -test and S 2 -test), and b is the standard error of the mean (i.e., half the absolute difference).

Feature slowness
The weighted Δ values of GSFA (Eq. 1) are denoted here as Δ DR,G 3L j and depend on the graph G 3L , which in turn depends on the training data and the labels. To measure slowness (or rather fastness) of test data (T), standard Δ values are computed using the images ordered by increasing age label, Δ T,lin j def = 1 N −1 n (y j (n + 1) − y j (n)) 2 . The last expression is equivalent to a weighted Δ value using a linear graph (Fig. 3b). In all cases, the features are normalized to unit variance before computing their Δ values to allow for fair comparisons in spite of the feature scaling method. Table 6 shows Δ DR,G 3L 1,2,3 (resp. Δ T,lin 1,2,3 ), that is, the Δ values of the three slowest features extracted from the DR-dataset (resp. T-dataset) using the graph G 3L (resp. a linear graph). HiGSFA maximizes slowness better than HGSFA. The Δ T,lin values of the PCA features are larger, which is not surprising, because PCA does not optimize for slowness. Since Δ DR,G 3L and Δ T,lin are computed from different graphs, they should not be compared with each other. Δ T,lin considers transitions between images with the same or very similar ages but arbitrary race and gender. Δ DR,G 3L only considers transitions between images having at least one of a) the same gender, b) the same race, or c) different but consecutive age groups.
Age estimation error We treat age estimation as a regression problem with estimates expressed as an integer number of years, and use three metrics to measure age estimation accuracy: (1) the mean absolute error (MAE) (see Geng et al. 2007), which is the most frequent metric for age estimation, (2) the root mean squared error (RMSE), which is a common loss function for regression. Although it is sensitive to outliers and has been barely used in the literature on age estimation, some applications might benefit from its stronger penalization of large estimation errors. And (3) cumulative scores (CSs, see Geng et al. 2007), which indicate the fraction of the images that have an estimation error below a given threshold. For instance, CS(5) is the fraction of estimates (e.g., expressed as a percentage) having an error of at most 5 years w.r.t. the real age.  Top results for each column are shown in bold Classification rates (CR) for race and gender estimation are also provided. The chance level is the best possible performance when the estimation is constant The accuracies are summarized in Table 7. The MAE of HGSFA is 3.921 years, which is better than that of BIF+3Step, BIF+KPLS, BIF+rKCCA, and a baseline CNN, similar to BIF+rKCCA+SVM, and worse than the MCNNs. The MAE of HiGSFA is 3.497 years, clearly better than HGSFA and also better than MCNNs. At first submission of this publication, HiGSFA achieved the best performance, but newer CNN-based methods have now improved the state-of-the-art performance. In particular, MRCNN yields an MAE of 3.48 years, and Net VGG hybrid an MAE of only 2.96 years. In contrast, PCA has the largest MAE, namely 6.804 years. Table 8 Percentual cumulative scores (the larger the better) for various maximum allowed errors ranging from 0 to 30 years Algorithm cs (0) cs (1) cs (2) cs (3) cs (4) cs (5) cs (6) cs (7) cs (8) cs (9)  MCNN denotes a multi-scale CNN (Yi et al. 2015) that has been trained on images decomposed as 23 48×48-pixel image patches. Each patch has one out of four different scales and is centered on a particular facial landmark. A similar approach was followed by Liu et al. (2017) to propose the use of a multi-region CNN (MRCNN). Xing et al. (2017) evaluated several CNN architectures and loss functions and propose the use of specialpurpose hybrid architecture (Net VGG hybrid ) with five VGG-based branches. One branch estimates a distribution on demographic groups (black female, black male, white female, white male), and the remaining four branches estimate age, where each branch is specialized in a single demographic group. The demographic distribution is used to combine the outputs of the four branches to generate the final age estimate.
In an effort to improve our performance, we also tried support vector regression (SVR) as supervised post-processing and computed an average age estimate using the original images and their mirrored version (HiGSFA-mirroring). This slightly improved the estimation to 3.412 years, becoming better than MRCNN. Mirroring is also done by MCNN and Net VGG hybrid . Detailed cumulative scores for HiGSFA and HGSFA are provided in Table 8, facilitating future comparisons with other methods. The RMSE of HGSFA on test data is 5.148 years, whereas HiGSFA yields an RMSE of 4.583 years, and PCA an RMSE of 8.888 years. The RMSE of other approaches does not seem to be available.
The poor accuracy of PCA for age estimation is not surprising, because principal components might lose wrinkles, skin imperfections, and other information that could reveal age. Another reason is that principal components are too unstructured to be properly untangled by the soft GC method, in contrast to slow features, which have a very specific and simple structure.
The behavior of the estimation errors of HiGSFA is plotted in Fig. 10 as a function of the real age. On average, older persons are estimated much younger than they really are. This is in part due to the small number of older persons in the database, and because the oldest class used in the supervised step (soft-GC) has an average of about 58 years, making this the largest age that can be estimated by the system. The MAE is surprisingly low for persons below 45 years. The most accurate estimation is an MAE of only 2.253 years for 19-year-old persons.
Reconstruction error A reconstruction error is a measure of how much information of the original input is contained in the output features. In order to compute it, we assume a linear global model for input reconstruction.
Let X be the input data and Y the corresponding set of extracted features. A matrix D and a vector c are learned from the DR-dataset using linear regression (ordinary least squares) such thatX def = DY + c1 T approximates X as closely as possible, where 1 is a vector of N ones.

Fig. 10
The average age estimates of HiGSFA are plotted as a function of the real age. The MAE is also computed as a function of the real age and plotted as age ± MAE(age). Thus,X contains the reconstructed samples (i.e.x n def = Dy n + c is the reconstruction of the input x n given its feature representation y n ). Figure 2 shows examples of face reconstructions using features extracted by different algorithms.
The model is linear and global, which means that output features are mapped to the input domain linearly. For PCA this gives the same result as the usual multiplication with the transposed projection matrix plus image average. An alternative (local) approach for HiGSFA would be to use the linear reconstruction algorithm of each node to perform reconstruction from the top of the network to the bottom, one node at a time. However, such a local reconstruction approach is less accurate than the global one.
The normalized reconstruction error, computed on the T-dataset, is then defined as which is the ratio between the energy of the reconstruction error and the variance of the test data except for a factor N /(N − 1). Reconstruction errors of HGSFA, HiGSFA and PCA using 75 features are given in Table 9. The constant reconstructionx (chance level) is the baseline with an error of 1.0. As expected, HGSFA does slightly better than chance level, but worse than HiGSFA, which is closer to PCA. PCA yields the best possible features for the given linear global reconstruction method, and is better than HiGSFA by 0.127. For HiGSFA, from the 75 output features, 8 of them The reported results are the age estimation errors (MAE and RMSE), the reconstruction error (e rec ), the percentual classification rate for race and gender, and the average length of the slow part in the features computed by the nodes of the third HiGSFA layer. All metrics have been computed on test data.
are slow features (slow part), and the remaining 67 are reconstructive. If one uses 67 features instead of 75, PCA yields a reconstruction error of 0.211, which is still better because the PCA features are computed globally.
HiGSFA network with HGSFA hyperparameters We verify that the performance of HiGSFA for age estimation is better than that of HGSFA not simply due to different hyperparameters by evaluating the performance of an HiGSFA network using the hyperparameters of the HGSFA network (the only difference is the use of iGSFA nodes instead of GSFA nodes). The hyperparameter Δ T , not present in HGSFA, is set as in the tuned HiGSFA network. As expected, the change of hyperparameters affected the performance of HiGSFA: The MAE increased to 3.72 years, and the RMSE increased to 4.80 years. Although the suboptimal hyperparameters increased the estimation errors of HiGSFA, it was still clearly superior to HGSFA.
Sensitivity to the delta threshold Δ T The influence of Δ T on estimation accuracy and numerical stability is evaluated by testing different values of Δ T . For simplicity, the same Δ T is used from layers 3 to 10 in this experiment (Δ T is not used in layers 1 and 2, where the number of features in the slow part is constant and equal to 3 and 4 features, respectively). The performance of the algorithm as a function of Δ T is shown in Table 10. The Δ T yielding minimum MAE and used in the optimized architecture is 1.96. The average number of slow features in the third layer changes moderately depending on the value of Δ T , ranging from 2.87 to 4.14 features, and the final metrics change only slightly. This shows that the parameter Δ T is not critical and can be tuned easily.
Evaluation on the FG-NET database The FG-NET database (Cootes 2004) is a small database with 1002 facial images taken under uncontrolled conditions (e.g., many are not frontal) and includes identity and gender annotations. Due to its small size, it is unsuitable to evaluate HiGSFA directly. However, FG-NET is used here to investigate the capability of HiGSFA to generalize to a different test database. The HiGSFA (G 3L ) network that has been trained with images of the MORPH-II database (either with the set S 1 or S 2 ) is tested using images of the FG-NET database. For this experiment, images outside the original age range from 16 to 77 years are excluded.
For age estimation, the MAE is 7.32 ± 0.08 years and the RMSE is 9.51 ± 0.13 years (using 4 features for the supervised step). For gender and race estimation, the classification rates (5 features) are 80.85% ± 0.95% and 89.24% ± 1.06%, resp. The database does not include race annotations, but all inspected subjects appear to be closer to white than to black. Thus, we assumed that all test persons have white race. The most comparable cross-database experiment known to us is a system (Ni et al. 2011) trained on a large database of images from the internet and tested on FG-NET. By restricting the ages to the same 16-77 year range used above, their system achieves an MAE of approximately 8.29 years.

Discussion
In this article, we propose the use of information preservation to enhance HSFA and HGSFA. The resulting algorithms are denoted by HiSFA and HiGSFA, respectively, and significantly improve global slowness, input reconstruction and in supervised learning settings also label estimation accuracy.
We have analyzed the advantages and limitations of HSFA and HGSFA networks, particularly the phenomena of unnecessary information loss and poor input reconstruction. Unnecessary information loss occurs when a node in the network prematurely discards information that would have been useful for slowness maximization in another node higher up in the hierarchy. Poor input reconstruction refers to the difficulty of approximating an input accurately from its feature representation. We show unnecessary information loss is a consequence of optimizing slowness locally, yielding suboptimal global features.
HiSFA and HiGSFA improve the extracted features to address these shortcomings. In the conclusions below we focus on HiGSFA for simplicity, although most of them also apply to HiSFA. The feature vectors computed by iGSFA nodes in an HiGSFA network have two parts: a slow and a reconstructive part. The features of the slow part follow a slowness optimization goal and are slow features (in the sense of SFA) transformed by a linear scaling. The features of the reconstructive part follow the principle of information preservation (i.e. maximization of mutual information between the output and input data), which we implement in practice as the minimization of a reconstruction error. A parameter Δ T (Δ-threshold) balances the lengths of the slow and reconstructive parts, consisting of J and D − J features, respectively, where D is the output dimensionality and J is the number of slow features selected having Δ < Δ T .
Parameter Δ T thus controls the composition of the feature vector. A small Δ T results in more reconstructive features and a large Δ T results in more slow features. In particular, when Δ T < 0, iGSFA becomes equivalent to PCA, and when Δ T ≥ 4.0, iGSFA becomes equivalent to GSFA except for a linear transformation (this assumes positive edge weights and a consistent graph). Theory justifies fixing Δ T slightly smaller than 2.0 (Sect. 3.2), resulting in some features being similar to those of GSFA and other features being similar to those of PCA (on residual data).
Experiment 1 on unsupervised feature extraction from the view of a simulated rat shows that HiSFA yields significantly slower features than HSFA, and it is especially resistant to overfitting, allowing the extraction of slower features using much fewer training samples. Due to its improved feature slowness, HiSFA may be useful to improve simulations for neuroscience based on HSFA and SFA.
A method proposed by Escalante-B. and Wiskott (2016) for the combination of training graphs is used for the first time on real data to estimate either pose or subject (age/race/gender) attributes by combining pre-defined training graphs for the individual attributes into a single training graph, allowing efficient and accurate learning of multiple labels. In Experiment 2 on pose estimation, we experienced the problem that the HGSFA and HiGSFA networks concentrated too much on the easiest labels, yielding very few features that convey infor-mation about the difficult labels. This was easily solved by increasing the weight of the graphs corresponding to less represented labels. For this experiment HiGSFA yielded better label estimation accuracy than HGSFA even though it was constrained to hyperparameters previously tuned for HGSFA. Experiment 3 on age estimation (where the hyperparameters are unconstrained) shows that HiGSFA is superior to HGSFA in terms of feature slowness, input reconstruction and label estimation accuracy. Moreover, HiGSFA offers a competitive accuracy for age estimation, surpassing approaches based on bio-inspired features and some convolutional neural networks, such as a multi-scale CNNs (MCNN), which HiGSFA outperforms by 48.5 days (≈ 1.5 months) mean absolute error. However, HiGSFA is not as accurate as current state-of-theart CNNs for age estimation (Xing et al. 2017), which have a problem-specific structure. The improvement of HiGSFA over HGSFA is large: 154 days (≈ 5 months). This is a significant technical and conceptual advance.
The next sections provide additional insights-mostly conceptual-into the proposed approach, the obtained results, and future work.

Remarks on the approach
We have shown that a single GSFA node in a network cannot always identify from its input which aspects or features contribute to the computation of global slow features. Hence, some of these features may be incorrectly discarded if the data dimensionality is reduced. Therefore, we resort in HiGSFA to a reconstruction goal to compress the local input and increase the chances that the local output features include information relevant to extract global slow features.
Besides the method used in HiGSFA to combine the slowness principle and information preservation, another method is to optimize a single objective function that integrates both criteria, favoring directions that are slow and have a large variance. However, we found in previous experiments that balancing these two criteria is difficult in practice. In addition, splitting the two types of features allows to keep SFA nonlinear and PCA linear.
HiGSFA should not be seen merely as a more sample efficient version of HGSFA, because it also provides a better feature representation and yields slower features. Even assuming unlimited training data and computational resources, features extracted by HGSFA would not necessarily reach the slowness achieved by HiGSFA. This hypothetical scenario is free of overfitting, but information loss would only decrease partially in HGSFA, because the main cause of this problem is not overfitting but blind optimization of slowness locally. One can observe this phenomenon in Table 1 (40 k), where HSFA does not reach the feature slowness of HiSFA even though overfitting is minimal.

Network hyperparameters and training times
By selecting the training graph and network structure appropriately, the computational complexity of HiGSFA (and other hierarchical versions of SFA) is linear w.r.t. the number of samples and their dimensionality, resulting in feasible training times, see "Appendix A". Training a single HiGSFA network (231,660 images of 96×96 pixels) takes only 10 hours, whereas HiSFA takes about 6 hours (including the time needed for data loading, the supervised step, training, and testing) on a single computer (24 virtual cores Xeon-X7542 @ 2.67GHz and 128 GB of RAM) without GPU computing. However, the algorithm could also benefit from GPUs or distributed computing, because nodes within the same layer can be trained independently. For comparison, the system of Guo and Mu (2014) takes 24.5 hours (training and testing) for fewer images and lower resolution. Training times of the various CNNs were not included in the publications.
HiGSFA is more accurate than HGSFA even when using output dimensionalities and network hyperparameters that have been tuned for the latter network. However, HiGSFA can yield even higher accuracies if one uses larger output dimensionalities. This can be explained by various factors: (a) In both networks, the input dimensionality of a single GSFA node is I (the expanded dimension). In HiGSFA the input dimensionality of a single PCA node is I , where typically I I . Hence, the reconstructive features of HiGSFA may overfit less than the slow features of HGSFA that they replace. (b) In HGSFA, the features of all the nodes are attracted to the same optimal free responses, whereas in HiGSFA the reconstructive part is attracted to the local principal components, which are different for each node. Thus, HGSFA layers might potentiate overfitting more than HiGSFA layers. (c) HGSFA networks may benefit less from additional features computed by the nodes, because additional quickly varying features are frequently noise-like and cause more overfitting in later nodes without improving slowness or reconstruction significantly.
In order to train iSFA and iGSFA, one must choose the Δ T hyperparameter. A too small value Δ T ≈ 0.0 results in only PCA features, whereas a too large value Δ T ≈ 4.0 results in only slow features. The recommendation is to start with a value slightly smaller than 2.0. One can then tune this hyperparameter to improve the target metric. The control experiment in Sect. 5.3.4 shows that this tuning is straightforward. Alternatively, one can directly specify the number of slow features J .

Remarks on experiment 3
Age estimation from adult facial photographs appears to be an ideal problem to test the capabilities of HiGSFA. PCA is not very useful for this problem because PCs represent wrinkles, skin texture and other higher-frequency features poorly. Therefore, it is counter-intuitive that feature slowness can be improved by incorporating PCs in HiGSFA. Improvements on feature slowness on other supervised learning problems (such as digit and traffic sign recognition, and the estimation of gender, race, and face pose from images) are less conclusive, because for such problems a few PCs encode discriminative information relatively well.
To estimate age, race, and gender simultaneously, we construct a graph G 3L that combines three pre-defined graphs, encoding sensitivity to the particular labels, and favoring invariance to any other factor. Out of the 75 features extracted in the top-most node, 8 are slow and 67 are reconstructive. The best number of features passed to the supervised step ranges from 4 to 7 slow features, and the reconstructive part is not used. This shows that HiGSFA and HGSFA concentrate the label information in the first features. One can actually replace the iGSFA node on the top of the HiGSFA network by a regular GSFA node, so that all features are slow, without affecting the performance. The superiority in age estimation of HiGSFA over HGSFA is thus not due to the use of principal components in the final supervised step but to the higher quality of the slow features.
The performance of HiGSFA for age estimation on the MORPH-II database is competitive with an MAE of 3.497 years (or 3.412 years for HiGSFA-mirroring). Previous state-of-theart results include an MAE of 3.63 years using a multi-scale CNN (Yi et al. 2015) and 3.92 using BIF+rKCCA+SVM (Guo and Mu 2014). During the preparation of the manuscript, newer results have improved the state of the art to 3.48 years ) and 2.96 years (Xing et al. 2017).
The main goal of this research is not to provide best performance on any particular application. Our specific goal is to improve HSFA and HGSFA. In this sense, the central claim is that the new extensions are better regarding feature slowness, input reconstruction and (in the case of HiGSFA) label estimation accuracy.

Input reconstruction from slow features
Experiment 3 confirms that PCA is more accurate than HiGSFA at reconstruction using the same number of features (here 75), which was expected because PCA features are optimal when reconstruction is linear. From the 75 features extracted by HiGSFA, 67 are reconstructive (8 fewer than in PCA) and they are computed hierarchically (locally), in contrast to the PCA features, which are global. Thus, it is encouraging that the gap between PCA and HiGSFA at input reconstruction is moderate. In turn, HiGSFA is much more accurate than HGSFA, because reconstruction is the secondary goal of HiGSFA, whereas HGSFA does not pursue reconstruction.
Since the HiGSFA network implements a nonlinear transformation, it is reasonable to employ nonlinear reconstruction algorithms. Nonlinear reconstruction can provide more accurate reconstructions in theory, but in our experience it is difficult to train such type of algorithms well enough to perform better on test data than the simpler global linear reconstruction algorithm. An algorithm for nonlinear reconstruction that minimizes the feature error e feat has been described in Sect. 4.5. However, since the number of dimensions is reduced by the network, one can expect many samples with feature error e feat = 0 (or e feat minimal) that differ from any valid input sample fundamentally (i.e., do not belong to the face manifold). To correct this problem, one might need to consider the input distribution and limit the range of valid reconstructions.
Another option is to resort to generative adversarial networks (Goodfellow et al. 2014;Radford et al. 2015;Denton et al. 2015), which have been used in the context of CNNs to generate random inputs with excellent image quality. For HiGSFA networks, such a promising approach might also be applicable to do nonlinear reconstruction, if properly adapted.

Future work
Age and pose estimation accuracy may be improved by using more training samples and more complex hierarchical networks, for instance by increasing the overlap of the receptive fields and using more complex nonlinearities. For age estimation one could use true face-distortion methods instead of simple image transformations.
One key factor for the performance of MCNN and MRCNN is the use of receptive fields centered at specific facial points (e.g., see entries 'MCNN no align' and 'MCNN' in Table 7). For Net VGG hybrid the use of a face normalization procedure and an image distortion method may also be relevant. These ideas could also be applied to HiGSFA, and might particularly boost generalization.
Another direction of research is to investigate a possible connection between information preservation in HiGSFA and other methods used in CNNs, particularly the use of shortcut connections in ResNets (He et al. 2016) and Dense Networks (Huang et al. 2017). These connections provide a new information channel that allows a layer access to the unmodified input of a previous layer. Although these methods are different and have been motivated by other ideas, it appears they may also be justified by the information preservation principle.

Conclusion
We believe it is possible to develop successful learning algorithms based on a few simple but strong learning principles and heuristics, and this is the approach that we try to pursue with HiGSFA. An algorithm that might be strong but cannot be understood and justified analytically would be of less interest to us.
HiGSFA follows two of the suggestions by Krüger et al. (2013) based on findings from neuroscience regarding the primate visual system useful for successful computer vision, namely, hierarchical processing and information-channel separation. The slow and reconstructive parts of the extracted features can be seen as two information channels: The first one encodes information representing the slow parameters, and the second one encodes information representing the remaining aspects of the input. The slow part can be further decomposed. For example, in Experiment 3 one can observe one or more features for each label; the 3rd slowest feature is mostly related to race, the 4th one to gender, and all the remaining features to age.
This work explores the potential of bottom-up training and shows that it can be surprisingly accurate, making this type of architectures attractive for further research by exploring new heuristics and extensions.
The proposed algorithm is general purpose (e.g., it does not know anything about face geometry), but it still provides competitive accuracy, at least for the age/race/gender estimation problem. The results show the improved versatility and robustness of the algorithm and make it a good candidate for many other problems of computer vision on high-dimensional data, particularly those lying at the intersection of image analysis, nonlinear feature extraction, and supervised learning.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Complexity of a quadratic HSFA network
Although existing systems based on HSFA have resorted to hierarchical processing for efficiency reasons, apparently its actual asymptotic complexity has not yet been formally established. In this section, we compute the computational complexity of a concrete quadratic HSFA (QHSFA) network illustrated in Fig. 11. This network operates on data with a 1D structure (i.e., vectors), where all nodes of the network perform quadratic SFA (QSFA, a quadratic expansion followed by linear SFA), and has two important parameters L and k. Parameter L indicates the number of layers and k is used for various purposes and is assumed to be fixed. In the first layer, the nodes have a fan-in and stride of k input values. In the remaining layers, the nodes have a fan-in and stride of two nodes. Each node of this network reduces the dimensionality of the data from k input components to k/2 output components. We denote the total input dimensionality by I . Fig. 11 Example of a 1D QHSFA network with binary fan-ins and no overlap. Each node performs quadratic SFA and reduces the dimensionality from k to k/2 components. The use of such a small fan-in results in a network with a large number L of layers, although L only grows logarithmically with the input dimension I From the structure described above, it follows that the receptive fields of the nodes are non-overlapping, the network's output has k/2 components, the number of layers L is related to I and k: and the total number of nodes in the network is Internally, all the QSFA nodes of the network have the same structure. The input data to a single node has k dimensions, which are increased by the quadratic expansion to k(k + 3)/2 components. Afterwards, linear SFA reduces the dimensionality to k/2. Thus, the complexity of training a single (nonlinear) node is T QSFA (N , k) (5) = O(N (k(k + 3)/2) 2 + (k(k + 3)/2) 3 ) = O(N k 4 + k 6 ).
On the other hand, the number of nodes is M (16,17) = O(I /k). Therefore, the complexity of training the whole network is T QHSFA (N , I , k) (19) = O((N k 4 + k 6 )I /k) = O(I Nk 3 + I k 5 ) .
Thus, the complexity of the complete QHSFA network above is linear w.r.t. the input dimension I , whereas the complexity of direct QSFA (on I -dimensional data) is T QSFA (N , I ) (19) = O(N I 4 + I 6 ), which is linear w.r.t. I 6 . This shows that the QHSFA network is computationally much more efficient than direct QSFA (given that k I ). Since each layer in the QHSFA network is quadratic, in general the output features of the l-th layer can be written as polynomials of degree 2 l on the input values. In particular, the output features of the whole network are polynomials of degree 2 L . However, the actual feature space spanned by the network does not include all polynomials of this degree but only a subset of them due to the restricted connectivity of the network. In contrast, direct QSFA only contains quadratic polynomials (although all of them).
One could try to train direct SFA on data expanded by a polynomial expansion of degree 2 L (to encompass the feature space of QHSFA), but the complexity would be prohibitive due to the large expanded dimensionality 2 L d=0 d+I −1 I −1 . Training SFA with such a highdimensional data appears to be exponential in I .
We are also interested in analyzing the memory complexity of the QHSFA network. Memory (space) complexity is denoted by S. Thus, the memory complexity of linear SFA is where the term N I is due to the input data, and I 2 is due to the covariance matrices. If a quadratic expansion is added, SFA becomes QSFA and the memory complexity becomes One can reduce these complexities by using HSFA/QHSFA and by training the nodes sequentially, one at a time, independently of whether an expansion has been applied or not. In particular, the memory complexity of the QHSFA network is only The excellent computational and memory complexity of the QHSFA network is not exclusive to this simple architecture. It is possible to design more sophisticated networks and preserve a similar computational complexity. For example, training the HiGSFA network proposed in Sect. 5, which has overlapping receptive fields, larger fan-ins, and a 2D structure, has complexity also linear in I and N if one adds more pairs of layers with a fan-in of 1 × 3 and 3 × 1 to match the size of the input data as needed (concrete analysis not provided).

B Sensitivity-based scaling
As mentioned in Sect. 4.4, the QR scaling method (10)-(12) is useful to give the features in the slow part a meaningful scale, but it has the disadvantage that it mixes the slow features. This is irrelevant when one combines a polynomial expansion with SFA because the extracted features are invariant to invertible linear transformations of the input, including mixtures, i.e., SFA(QExp(Ux)) ≡ SFA(QExp(x)), where U is any invertible matrix and QExp is the quadratic expansion). In other words, polynomial SFA can extract the same features from s or y .
However, other expansions, including the 0.8Exp (13) expansion, do not have this property. When the 0.8Exp expansion is combined with SFA, the resulting algorithm is only invariant to scalings of the input components but not to their mixing, i.e., SFA(0.8Exp( x)) ≡ SFA(0.8Exp(x)), where is a diagonal matrix with diagonal elements λ i = 0, but SFA(0.8Exp(Ux)) ≡ SFA(0.8Exp(x)) in general. Thus, using QR scaling would change the features extracted in later nodes fundamentally. Furthermore, the 0.8Exp expansion has been motivated by a model where the input slow features are noisy harmonics of increasing frequency of a hidden parameter and the expansion should be applied to these features directly. Thus, mixing the slow features would break the assumed model and might compromise slowness extraction in the next layers in practice.
Technically, feature mixing by QR scaling could be reverted in the next layer (e.g., by an additional application of linear SFA before the expansion), but such a step would add unnecessary complexity. Therefore, as an alternative to the QR scaling, we also propose a sensitivity based scaling, which scales the slow features without mixing them, as follows.
Clearly, the transformation (25) does not mix the slow features, it only scales them. From the two key reconstruction properties of PCA mentioned in Sect. 4.4 (adding noise of certain variance to either one or many features increases the variance of the reconstruction error by the same amount), the first one (noise on a single feature) is fulfilled, because the columns of M −1 have unit norm since −1 = Diag(1/λ 1 , . . . , 1/λ J ). The second property is not fulfilled, because M −1 is in general not orthogonal (in contrast to Q).
At first glance, it seems like multi-view learning (data fusion, e.g., Xia et al. 2010) might be an alternative for the scaling methods used here. However, multi-view learning actually solves a different problem since we do not join two information channels but actually split them.