This section evaluates the proposed extensions, HiSFA and HiGSFA, on three problems: (1) The unsupervised extraction of temporally stable features from images simulating the view of a rat walking inside a box, which is addressed with HiSFA. (2) The extraction of four pose parameters of faces in image patches, which is a multi-label learning problem suitable for HiGSFA. (3) The estimation of age, race, and gender from human face photographs, which is also solved using HiGSFA. The chosen problems exemplify the general applicability of the proposed algorithms.
Experiment 1: Extraction of slow features with HiSFA from the view of a rat
The input data in this first experiment consists of the (visual) sensory input of a simulated rat. The images have been created using the RatLab toolkit (Schönfeld and Wiskott 2013). RatLab simulates the random path of a rat as it explores its environment, rendering its view according to its current position and head-view angle. For simplicity the rat is confined to a square area bounded by four walls having different textures. Franzius et al. (2007) have shown theoretically and in simulations that the slowest features that can be extracted from this data are trigonometric functions of the position of the rat and its head direction (i.e., the slow configuration/generative parameters).
Training and test images
For this experiment, first a large fixed sequence of 200,000 images was generated. Then, the training data of a single experiment run is selected randomly as a contiguous subsequence of size 40,000 from the full sequence. The test sequence is selected similarly but enforcing no overlap with the training sequence. All images are in color (RGB) and have a resolution of 320\(\times \)40 pixels, see Fig. 6.
Description of HiSFA and HSFA networks
To evaluate HiSFA we reproduced an HSFA network that has already been used (Schönfeld and Wiskott 2013). This network has three layers of quadratic SFA. Each layer consists of three components: (1) linear SFA, which reduces the data dimensionality to 32 features, (2) a quadratic expansion, and (3) linear SFA, which again reduces the expanded dimensionality to 32 features. The first two layers are convolutional (i.e., the nodes of these layers share the same weight matrices).
For a fair comparison, we built an HiSFA network with exactly the same structure as the HSFA network described above. The HiSFA network has an additional hyperparameter \(\varDelta _T\) for each node, but in order to control the size of the slow part more precisely, we removed this parameter and instead fixed the size of the slow part to 6 features in all nodes of the HiSFA network (i.e., \(J=6\)).
For computational convenience we simplified the iSFA algorithm to enable the use of convolutional layers as in the HSFA network. Convolutional layers can be seen as a single iSFA node cloned at different spatial locations. Thus, the total input to a node consists not of a single input sequence (as assumed by the iSFA algorithm) but of several sequences, one at each location of the node. The iSFA node was modified as follows: (1) the decorrelation step (between the slow and reconstructive parts) is removed and (2) all slow features are given the same variance as the median variance of the features in the corresponding reconstructive part (instead of QR scaling).
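The modified scaling step (2) can be sketched in NumPy as follows. The function name and the samples-as-rows layout are our own; this is an illustration of the described simplification, not the original implementation:

```python
import numpy as np

def match_slow_variance(y_slow, y_rec):
    """Scale each slow feature so that its variance equals the median
    variance of the reconstructive features (replacing QR scaling)."""
    target_var = np.median(np.var(y_rec, axis=0))
    current_std = np.std(y_slow, axis=0)
    return y_slow / current_std * np.sqrt(target_var)
```

After this rescaling, every slow feature has exactly the median variance of the reconstructive part, so both parts contribute on a comparable scale to subsequent layers.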
Results
To evaluate HSFA and HiSFA we compute the \(\varDelta \) values of the three slowest features extracted from test data, which are shown in Table 1. The experiments were repeated 5 times, each run using training and test sequences randomly selected following the procedure of Sect. 5.1.1.
Table 1 Delta values of the first three features extracted by HSFA and HiSFA from training data (above) and test data (below)

Table 1 shows that HiSFA extracts clearly slower features than HSFA for both training and test data in the main setup (40k training images). For instance, for test data \(\varDelta _1\), \(\varDelta _2\), and \(\varDelta _3\) are 28–52% smaller in HiSFA than in HSFA. This is remarkable given that HSFA and HiSFA span the same global feature space (same model capacity).
In order to compare the robustness of HSFA and HiSFA w.r.t. the number of images in the training sequence, we also evaluate the algorithms using shorter training sequences of 5k, 10k, and 20k images. As usual, the test sequences have 40k images. Table 1 shows that HiSFA computes slower features than HSFA given the same number of training images. This holds for both training and test data. In fact, when HiSFA is trained using only 5k images the slowest extracted features are already slower than those extracted by HSFA trained on 40k images (both for training and test data). In contrast, the performance of HSFA decreases significantly when trained on 10k and 5k images. Therefore, HiSFA is much more robust than HSFA w.r.t. the number of training images (higher sample efficiency).
Experiment 2: Estimation of face pose from image patches
The second experiment evaluates the accuracy of HiGSFA compared to HGSFA in a supervised learning scenario. We consider the problem of finding the pose of a face contained in an image patch. Face pose is described by four parameters: (1) the horizontal and (2) vertical position of the face center (denoted by x-pos and y-pos, respectively), (3) the size of the face relative to the image patch, and (4) the in-plane rotation angle of the eyes-line. Therefore, we solve a regression problem on four real-valued labels. The resulting system can be easily applied to face tracking and to face detection with an additional face discrimination step.
Generation of the image patches
The complete dataset consists of 64,470 images that have been extracted from several datasets to increase image variability: Caltech (Fink et al. 2003), CAS-PEAL (Gao et al. 2008), FaceTracer (Kumar et al. 2008), FRGC (Phillips et al. 2005), and LFW (Huang et al. 2007). In each run of the system two disjoint image subsets are randomly selected, one of 55,000 images, used for training, and another of 9000 images, used for testing. The images are processed in two steps. First they are normalized (i.e., centered, scaled, and rotated). Then, they are rescaled to 64 \(\times \) 64 pixels and are systematically randomized: In the resulting image patches the center of the face deviates horizontally from the image center by at most \(\pm 20\) pixels, vertically by at most \(\pm 10\) pixels, the angle of the eye-line deviates from the horizontal by at most \(\pm 22.5^\circ \), and the size of the largest and smallest faces differs by a factor of \(\sqrt{2}\) (a factor of 2 in their area). The concrete pose parameters are sampled from uniform distributions in the above-mentioned ranges. To increase sample efficiency, each image of the training set is used twice with different random distortions, so the effective training set contains 110,000 images. Examples of the final image patches are shown in Fig. 7.
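The uniform sampling of the four pose parameters can be sketched as follows. The function name is ours, and the absolute scale interval \([1, \sqrt{2}]\) is an assumption; the source only fixes the ratio between the largest and smallest faces:

```python
import numpy as np

def sample_pose(rng):
    """Draw one set of pose parameters uniformly from the stated ranges."""
    return {
        "x_pos": rng.uniform(-20.0, 20.0),   # horizontal offset in pixels
        "y_pos": rng.uniform(-10.0, 10.0),   # vertical offset in pixels
        "angle": rng.uniform(-22.5, 22.5),   # eye-line rotation in degrees
        "scale": rng.uniform(1.0, 2.0 ** 0.5),  # sizes span a factor of sqrt(2)
    }
```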
HiGSFA and HGSFA networks
For comparison purposes, we adopt, without modification, an HGSFA network previously designed (and partially tuned) by Escalante-B. and Wiskott (2013) to estimate facial pose parameters. Most SFA nodes of this network consist of the expansion function
$$\begin{aligned} {\textsc {0.8Exp}}(x_1, x_2, \dots , x_I) {\mathop {=}\limits ^{{\mathrm {def}}}}(x_1, x_2, \dots , x_I, |x_1|^{0.8}, |x_2|^{0.8}, \dots , |x_I|^{0.8}) \, , \end{aligned}$$
(13)
followed by linear SFA, except for the SFA nodes of the first layer, which have an additional preprocessing step that uses PCA to reduce the number of dimensions from 16 to 13. In contrast to the RatLab networks, this network does not use weight sharing, increasing feature specificity at each node location.
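The expansion of Eq. (13) can be sketched in a few lines of NumPy (the function name is ours):

```python
import numpy as np

def exp08(x):
    """0.8Exp expansion (Eq. 13): concatenate the input vector
    with the element-wise terms |x_i|^0.8."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, np.abs(x) ** 0.8])
```

The absolute value keeps the exponentiation well defined for negative components, and the exponent 0.8 makes the nonlinearity only mildly expansive, doubling the dimensionality instead of growing it quadratically.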
For a fair comparison, we construct an HiGSFA network having the same structure as the HGSFA network (e.g., same number of layers, nodes, expansion function, data dimensions, receptive fields). Similarly to experiment 1, we directly set the number of slow features preserved by the iGSFA nodes, which in this experiment varies depending on the layer from 7 to 20 features (these values have been roughly tuned using a run with a random seed not used for testing). These parameters used to construct the networks are shown in Table 2.
Table 2 Description of the HiGSFA and HGSFA networks used for pose estimation Training graphs that encode pose parameters
In order to train HGSFA and HiGSFA one needs a training graph (i.e., a structure that contains the samples and the vertex and edge weights). A few efficient predefined graphs have already been proposed (Escalante-B. and Wiskott 2013), allowing training of GSFA with a complexity of \(\mathcal {O}(NI^2+I^3)\), which is of the same order as SFA, making this type of graph competitive in terms of speed. One example of a training graph for classification is the clustered graph (see Sect. 5.3.2) and one for regression is the serial training graph, described below.
Serial training graph (Escalante-B. and Wiskott 2013) The features extracted using this graph typically approximate a monotonic function of the original label and its higher frequency harmonics. To solve a regression problem and generate label estimates in the appropriate domain, a few slow features (e.g., extracted using HGSFA or HiGSFA) are post-processed by an explicit regression step. There are more training graphs suitable for regression (e.g., mixed graph), but this one has consistently given good results in different experiments.
Figure 8 illustrates a serial graph useful to learn x-pos. In general, a serial graph is constructed by ordering the samples by increasing label. Then, the samples are partitioned into L groups of size \(N_g = N/L\). A representative label \(\in \{ \ell _1, \ldots , \ell _{L} \}\) is assigned to each group, where \(\ell _1< \ell _2< \cdots < \ell _{L}\). Edges connect all pairs of samples from two consecutive groups with group labels \(\ell _{l}\) and \(\ell _{l+1}\). Thus, all connections are inter-group; no intra-group connections are present. Notice that since any two vertices of the same group are adjacent to exactly the same neighbors, they are likely to be mapped to similar outputs by GSFA. We use this procedure to construct four serial graphs \(\mathcal {G}_{x\text {-pos}}\), \(\mathcal {G}_{y\text {-pos}}\), \(\mathcal {G}_{\text {angle}}\), and \(\mathcal {G}_{\text {scale}}\) that encode the x-pos, y-pos, angle, and scale label, respectively. All of them have the same parameters: \(L=50\) groups, \(N=110{,}000\) samples, and \(N_g = 2200\) samples per group.
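The construction of a serial graph can be sketched as follows. This is a simplified illustration with unit edge weights and the group mean as representative label (both our assumptions), omitting the boundary vertex weights of the full definition:

```python
import numpy as np

def serial_graph_edges(labels, L):
    """Build a simplified serial training graph: order samples by label,
    split them into L equal groups, and connect every sample of group l
    with every sample of group l+1."""
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(labels)          # sample indices ordered by label
    Ng = len(labels) // L               # group size N_g = N / L
    groups = [order[l * Ng:(l + 1) * Ng] for l in range(L)]
    rep = [float(np.mean(labels[g])) for g in groups]  # representative labels
    edges = [(i, j) for l in range(L - 1)
             for i in groups[l] for j in groups[l + 1]]
    return groups, rep, edges
```

Only \((L-1) N_g^2\) inter-group edges are created, which is what keeps GSFA training on this graph as cheap as standard SFA.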
Combined graph to learn all pose parameters One disadvantage of current pre-defined graphs is that they allow learning only a single (real-valued or categorical) label. In order to learn several labels simultaneously, we resort to a method proposed earlier that allows the combination of graphs (Escalante-B. and Wiskott 2016). We use this method here for the first time on real data to create an efficient graph that encodes all four labels combining \(\mathcal {G}_{x\text {-pos}}\), \(\mathcal {G}_{y\text {-pos}}\), \(\mathcal {G}_{\text {angle}}\), and \(\mathcal {G}_{\text {scale}}\). The combination preserves the original samples of the graphs, but the vertex and edge weights are added, which is denoted \(\mathcal {G}'_{\text {4L}} {\mathop {=}\limits ^{{\mathrm {def}}}}\mathcal {G}_{x\text {-pos}} + \mathcal {G}_{y\text {-pos}} + \mathcal {G}_{\text {angle}} + \mathcal {G}_{\text {scale}} \).
The combination of graphs guarantees that the slowest optimal free responses of the combined graph span the slowest optimal free responses of the original graphs, as long as three conditions are fulfilled: (1) all graphs have the same samples and are consistent, (2) all graphs have the same (or proportional) node weights, and (3) optimal free responses that are slow in one graph (\(\varDelta < 2.0\)) should not be fast (\(\varDelta > 2.0\)) in any other graph. Since the labels (i.e., pose parameters) have been computed randomly and independently of each other, these conditions are fulfilled on average (e.g., for the \(\mathcal {G}_{x\text {-pos}}\) graph, any feature \({\mathbf {y}}\) that solely depends on y-pos, scale, or angle has \(\varDelta _{{\mathbf {y}}}^{\mathcal {G}_{x\text {-pos}}} \approx 2.0\)).
However, the naive graph \(\mathcal {G}'_{\text {4L}}\) does not take into account that the ‘angle’ and ‘scale’ labels are more difficult to estimate than the x-pos and y-pos labels. Thus, the feature representation is dominated by features that encode the easier labels (and their harmonics) and only a few features encode the difficult labels, making their estimation even more difficult. To solve this problem, we determine weighting factors such that each label is represented at least once in the first five slow features. Features that are more difficult to extract have higher weights to avoid an implicit focus on easy features. The resulting graph and weighting factors are: \(\mathcal {G}_{\text {4L}} {\mathop {=}\limits ^{{\mathrm {def}}}}\mathcal {G}_{x\text {-pos}} + 1.25 \mathcal {G}_{y\text {-pos}} + 1.5 \mathcal {G}_{\text {angle}} + 1.75 \mathcal {G}_{\text {scale}}\).
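Since the graphs share the same samples, the weighted combination reduces to summing vertex and edge weights with per-graph coefficients. A minimal sketch, using a hypothetical dictionary representation of a graph as a pair (vertex weights, edge weights):

```python
def combine_graphs(graphs, coeffs):
    """Weighted combination of training graphs over the same samples:
    vertex and edge weights are summed with per-graph factors."""
    v_comb, e_comb = {}, {}
    for (v_w, e_w), c in zip(graphs, coeffs):
        for node, w in v_w.items():
            v_comb[node] = v_comb.get(node, 0.0) + c * w
        for edge, w in e_w.items():
            e_comb[edge] = e_comb.get(edge, 0.0) + c * w
    return v_comb, e_comb
```

For \(\mathcal {G}_{\text {4L}}\) the coefficients would be (1, 1.25, 1.5, 1.75), one per pose-parameter graph.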
Supervised post-processing We extract 20 slow features from the training dataset to train four separate Gaussian classifiers (GC), one for each label. Ground truth classes are generated by discretizing the labels into 50 values representing class 1–50. After the GC has been trained with this data, the pose estimate (on the test patches) is computed using the soft-GC method (Escalante-B. and Wiskott 2013), which exploits class membership probabilities of the corresponding classifier: Let \(\mathbf {P}(C_{{\ell }_l} | {\mathbf {y}} )\) be the estimated class probability that the input sample \({\mathbf {x}}\) with feature representation \({\mathbf {y}}= {\mathbf {g}}({\mathbf {x}})\) belongs to the group with average label \({\ell }_l\). Then, the estimated label is
$$\begin{aligned} \tilde{\ell } \; {\mathop {=}\limits ^{{\mathrm {def}}}}\; \sum _{l=1}^{50} {\ell }_l \cdot \mathbf {P}(C_{{\ell }_l} | {\mathbf {y}}) \, . \end{aligned}$$
(14)
Equation (14) has been designed to minimize the root mean squared error (RMSE), and although it incurs an error due to the discretization of the labels, the soft nature of the estimation has provided good accuracy and a low percentage of misclassifications.
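Given the class probabilities produced by the Gaussian classifier, the soft-GC estimate of Eq. (14) is simply a probability-weighted average of the representative class labels (the function name is ours; the probabilities are assumed to sum to one):

```python
import numpy as np

def soft_gc_estimate(class_probs, class_labels):
    """Soft-GC label estimate (Eq. 14): expected value of the
    representative class labels under the classifier's posterior."""
    p = np.asarray(class_probs, dtype=float)
    return float(np.dot(p, np.asarray(class_labels, dtype=float)))
```

Unlike a hard argmax over classes, this expectation can fall between the discretized labels, which is why it approximates the RMSE-optimal estimate despite the discretization.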
Results: Accuracy of HiGSFA for pose estimation
The HiGSFA and HGSFA networks were trained using graph \(\mathcal {G}_{\text {4L}}\) described above. Table 3 shows the estimation error of each pose parameter. The results show that HiGSFA yields more accurate estimations than HGSFA for all pose parameters. We would like to remark that the HiGSFA network has the same structure as the HGSFA network, which has been tuned for the current problem. However, one may adapt the network structure specifically for HiGSFA to further exploit the advantages of this algorithm. Concretely, the improved robustness of HiGSFA makes it possible to handle more complex networks (e.g., by increasing the output dimensionality of the nodes or using more complex expansion functions). In the following experiment on age estimation, the network structure of HiGSFA and its hyperparameters are tuned, yielding higher accuracy.
Table 3 Estimation errors (RMSE) of HGSFA and HiGSFA for each pose parameter

Experiment 3: Age estimation from frontal face photographs
Systems for age estimation from photographs have many applications in areas such as human-computer interaction, group-targeted advertisement, and security. However, age estimation is a challenging task, because different persons experience facial aging differently depending on several intrinsic and extrinsic factors.
The first system for age estimation based on SFA was a four-layer HSFA network that processes raw images without prior feature extraction (Escalante-B. and Wiskott 2010). The system was trained on synthetic input images created using special software for 3D-face modeling. However, the face model was probably too simple, which allowed linear SFA (in fact, linear GSFA) to achieve good performance and left open the question of whether SFA/GSFA could also be successful on real photographs.
This subsection first describes the image pre-processing method. Then, a training graph used to learn age, race, and gender simultaneously is presented. Finally, an HiGSFA network is described and evaluated according to three criteria: feature slowness, age estimation error (compared with state-of-the-art algorithms), and linear reconstruction error.
Image database and pre-processing
The MORPH-II database (i.e. MORPH, Album 2, Ricanek Jr. and Tesafaye 2006) is a large database suitable for age estimation. It contains 55,134 images of about 13,000 different persons with ages ranging from 16 to 77 years. The images were taken under partially controlled conditions (e.g. frontal pose, good image quality and lighting), and include variations in head pose and expression. The database annotations include age, gender (M or F), and “race” (“black”, “white”, “Asian”, “Hispanic”, and “other”, denoted by B, W, A, H, and O, respectively) as well as the coordinates of the eyes. The procedure used to assign the race label does not seem to be documented. Most of the images are of the black (77%) or white (19%) races, making it probably more difficult to generalize to other races, such as Asian.
We follow the evaluation method proposed by Guo and Mu (2014), which has been used in many other works. In this method, the input images are partitioned into three disjoint sets: \(S_1\) and \(S_2\) of 10,530 images each, and \(S_3\) of 34,074 images. The racial and gender composition of \(S_1\) and \(S_2\) is the same: about 3 times more images of males than females and the same number of white and black people. Other races are omitted. More exactly, \(|MB|=|MW|={3980}\), \(|FB|=|FW|={1285}\). The remaining images constitute the set \(S_3\), which is composed as follows: \(|MB|=28{,}872\), \(|FB|={3187}\), \(|MW|=1\), \(|FW|=28\), \(|MA|=141\), \(|MH|={1667}\), \(|MO|=44\), \(|FA|=13\), \(|FH|=102\) and \(|FO|={19}\). The evaluation is done twice by using either \(S_1\) and \(S_1\text {-test} {\mathop {=}\limits ^{{\mathrm {def}}}}S_2 + S_3\) or \(S_2\) and \(S_2\text {-test} {\mathop {=}\limits ^{{\mathrm {def}}}}S_1 + S_3\) as training and test sets, respectively.
We pre-process the input images in two steps: pose normalization and face sampling (Fig. 2). The pose-normalization step fixes the position of the eyes ensuring that: (a) the eye line is horizontal, (b) the inter-eye distance is constant, and (c) the output resolution is 256\(\times \)260 pixels. After pose normalization, a face sampling step selects the head area only, enhances the contrast, and scales down the image to 96\(\times \)96 pixels.
In addition to the \(S_1\), \(S_2\), and \(S_3\) datasets, three extended datasets (DR, S, and T) are defined in this work: A DR-dataset is used to train HiGSFA to perform dimensionality reduction, an S-dataset is used to train the supervised step on top of HiGSFA (a Gaussian classifier), and a T-dataset is used for testing. The DR and S-datasets are created using the same set of training images (either \(S_1\) or \(S_2\)), and the T-dataset using the corresponding test images, either \(S_1\text {-test}\) or \(S_2\text {-test}\).
The images of the DR and S-datasets go through a random distortion step during face sampling, which includes a small random translation of at most \(\pm 1.4\) pixels, a rotation of at most \(\pm 2\) degrees, a rescaling of \(\pm 4\%\), and small fluctuations in the average color and contrast. The exact distortions are sampled uniformly from their respective ranges. Although these small distortions are frequently imperceptible, they teach HiGSFA to become invariant to small errors during image normalization and, given the feature specificity of HiGSFA, are necessary to improve generalization to test data. Other algorithms that use pre-computed features, such as BIF, or particular structures (e.g., convolutional layers, max pooling) are mostly invariant to such small transformations by construction (e.g., Guo and Mu 2014).
Distortions allow us to increase the number of training images. The images of the DR-dataset are used 22 times, each time using a different random distortion, and those of the S-dataset 3 times, resulting in 231,660 and 31,590 images, respectively. The images of the T-dataset are not distorted and used only once.
A multi-label training graph for learning age, gender, and race
We create an efficient training graph by combining three pre-defined graphs: a serial graph for age estimation and two clustered graphs (one for gender and the other for race classification).
Clustered training graphs The clustered graph generates features useful for classification that are equivalent to those of FDA (see Klampfl and Maass 2010, also compare Berkes 2005a and Berkes 2005b). This graph is illustrated in Fig. 9. The optimization problem associated with this graph explicitly demands that samples from the same class should be mapped to similar outputs. If C is the number of classes, \(C-1\) output (slow) features can be extracted and passed to a standard classifier, which computes the final class estimate.
The graph used for gender classification is a clustered graph that has only two classes (female/male) of \(N_F = 56{,}540\) and \(N_M= 175{,}120\) samples, respectively. The graph used for race classification is similar to the graph above: Only two classes are considered (B and W), and the number of samples per class is \(N_B = N_W = 115{,}830\).
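The edge structure of a clustered graph can be sketched as follows. This is a simplified version with uniform intra-class edge weights (the published clustered graph additionally weights vertices and edges by class size):

```python
from collections import defaultdict

def clustered_graph_edges(class_of):
    """Simplified clustered training graph: connect every pair of samples
    that belong to the same class; no inter-class edges."""
    by_class = defaultdict(list)
    for n, c in enumerate(class_of):
        by_class[c].append(n)
    edges = []
    for members in by_class.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                edges.append((members[a], members[b]))
    return edges
```

Because every sample of a class shares the same neighbors, GSFA maps same-class samples to nearly identical outputs, which is what makes the extracted features equivalent to FDA features.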
Serial training graph for age estimation Serial graphs have been described in Sect. 5.2.3. To extract age-related features, we create a serial graph with \(L=32\) groups, where each group has 7238 images.
Efficient graph for age, race, and gender estimation We again use the method for multiple-label learning (Escalante-B. and Wiskott 2016) to learn age, race, and gender labels, by constructing a graph \(\mathcal {G}_\text {3L}\) that combines a serial graph for age estimation, a clustered graph for gender, and a clustered graph for race. Whereas the vertex weights of the clustered graph are constant, the vertex weights of the serial graph are not (the first and last groups have smaller vertex weights), but we believe this does not affect the accuracy of the combined graph significantly. For comparison purposes, we also create a serial graph \(\mathcal {G}_{\text {1L}}\) that only learns age.
The graph combination method yields a compact feature representation. For example, one can combine a clustered graph for gender (M or F) estimation and another for race (B or W). The first 2 features learned from the resulting graph are then enough for gender and race classification. Alternatively, one could create a clustered graph with four classes (MB, MW, FB, FW), but to ensure good classification accuracy one must keep 3 features instead of 2. Such a representation would be impractical for larger numbers of classes. For example, if the number of classes were \(C_1=10\) and \(C_2=12\), one would need to extract \(C_1 C_2 - 1 = 119\) features, whereas with the proposed graph combination, one would only need to extract \((C_1-1) + (C_2 -1) = 20\) features.
Supervised post-processing We use the first 20 or fewer features extracted from the S-dataset to train three separate Gaussian classifiers, one for each label. For race and gender only two classes are considered (B, W, M, F). For age, the images are ordered by increasing age and partitioned into 39 classes of the same size. This hyperparameter has been tuned independently of the number of groups in the age graph, which is 32. The classes have average ages of \(\{16.6, 17.6, 18.4, \dots , 52.8, 57.8\} \) years. To compute these average ages, as well as to order the samples by age in the serial graph, the exact birthday of the persons is used, representing age with a resolution of days (e.g., an age may be expressed as 25.216 years).
The final age estimation (on the T-dataset) is computed using the soft-GC method (14), except that 39 groups are used instead of 50. Moreover, to comply with the evaluation protocol, we use integer ground-truth labels and truncate the age estimates to integers.
Evaluated algorithms
We compare HiGSFA to other algorithms: HGSFA, PCA, and state-of-the-art age-estimation algorithms. The structure of the HiGSFA and HGSFA networks is described in Table 4. In both networks, the nodes are simply an instance of iGSFA or GSFA preceded by different linear or nonlinear expansion functions, except in the first layer, where PCA is applied to the pixel data to preserve 20 out of 36 principal components prior to the expansion. The method used to scale the slow features is the sensitivity method, described in “Appendix B”. The hyperparameters have been hand-tuned to achieve the best accuracy on age estimation using educated guesses, sets \(S_1\), \(S_2\) and \(S_3\) different from those used for the evaluation, and fewer image multiplicities to speed up the process.
The proposed HGSFA/HiGSFA networks are different in several aspects from SFA networks used so far (e.g., Franzius et al. 2007). For example, to improve feature specificity at the lowest layers, no weight sharing is used. Moreover, the input to the nodes (fan-in) originates mostly from the output of 3 nodes in the preceding layer (3\(\times \)1 or 1\(\times \)3). Such small fan-ins reduce the computational cost because they minimize the input dimensionality. The resulting networks have 10 layers.
Table 4 Description of the HiGSFA and HGSFA networks

The employed expansion functions consist of different nonlinear functions on subsets of the input vectors and include: (1) The identity function \({\textsc {I}}({\mathbf {x}}) = {\mathbf {x}}\), (2) quadratic terms \({\textsc {QT}}({\mathbf {x}}) {\mathop {=}\limits ^{{\mathrm {def}}}}\{ x_i x_j \}_{i,j=1}^I\), (3) a normalized version of \({\textsc {QT}}\): \({\textsc {QN}}({\mathbf {x}}) {\mathop {=}\limits ^{{\mathrm {def}}}}\{ \frac{1}{1+||{\mathbf {x}}||^2} x_i x_j \}_{i,j=1}^I\), (4) the terms \({\textsc {0.8ET}}({\mathbf {x}}) {\mathop {=}\limits ^{{\mathrm {def}}}}\{ |x_i|^{0.8} \}_{i=1}^I\) of the \({\textsc {0.8Exp}}\) expansion, and (5) the function \({\textsc {max2}}({\mathbf {x}}) {\mathop {=}\limits ^{{\mathrm {def}}}}\{ \max (x_i,x_{i+1}) \}_{i=1}^{I-1}\). The \({\textsc {max2}}\) function is proposed here inspired by state-of-the-art CNNs for age estimation (Yi et al. 2015; Xing et al. 2017) that include max pooling or a variant of it. As a concrete example of the nonlinear expansions employed by the HiGSFA network, the expansion of the first layer is \({\textsc {I}}(x_1, \dots , x_{18}) \, | {\textsc {0.8ET}}(x_1, \dots , x_{15}) \,|\, {\textsc {max2}}(x_1, \dots , x_{17}) \,|\, {\textsc {QT}}(x_1, \dots , x_{10})\), where | indicates vector concatenation. The expansions used in the remaining layers can be found in the available source code.
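The five building blocks above can be sketched in NumPy as follows. The function names mirror the notation; note that our \({\textsc {QT}}\)/\({\textsc {QN}}\) keep only the terms with \(i \le j\), since the full set \(\{x_i x_j\}_{i,j=1}^I\) contains each product twice:

```python
import numpy as np

def I(x):                 # identity terms
    return np.asarray(x, dtype=float)

def QT(x):                # quadratic terms x_i * x_j (i <= j)
    x = I(x)
    return np.array([x[i] * x[j]
                     for i in range(len(x)) for j in range(i, len(x))])

def QN(x):                # normalized quadratic terms
    x = I(x)
    return QT(x) / (1.0 + np.dot(x, x))

def ET08(x):              # |x_i|^0.8 terms of the 0.8Exp expansion
    return np.abs(I(x)) ** 0.8

def max2(x):              # max of consecutive components
    x = I(x)
    return np.maximum(x[:-1], x[1:])
```

A layer's expansion is then just a concatenation of such blocks applied to (possibly overlapping) slices of the node input, e.g. `np.concatenate([I(x[:18]), ET08(x[:15]), max2(x[:17]), QT(x[:10])])`.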
The parameter \(\varDelta _T\) of layers 3 to 10 is set to 1.96. \(\varDelta _T\) is not used in layers 1 and 2, and instead the number of slow features is fixed to 3 and 4, resp. The number of features given to the supervised algorithm, shown in Table 5, has been tuned for each DR algorithm and supervised problem.
Table 5 Number of output features passed to the supervised step, a Gaussian classifier

Since the data dimensionality allows it, PCA is applied directly (we did not resort to hierarchical PCA) to provide more accurate principal components and smaller reconstruction errors.
Experimental results
The results of HiGSFA, HGSFA and PCA (as well as other algorithms, where appropriate) are presented from three angles: feature slowness, age estimation error, and reconstruction error. Individual scores are reported as \(a \pm b\), where a is the average over the test images (\(S_1\text {-test}\) and \(S_2\text {-test}\)), and b is the standard error of the mean (i.e., half the absolute difference).
Feature slowness The weighted \(\varDelta \) values of GSFA (Eq. 1) are denoted here as \(\varDelta ^{\text {DR},\mathcal {G}_\text {3L}}_j\) and depend on the graph \(\mathcal {G}_\text {3L}\), which in turn depends on the training data and the labels. To measure slowness (or rather fastness) of test data (T), standard \(\varDelta \) values are computed using the images ordered by increasing age label, \(\varDelta ^{\text {T,lin}}_j {\mathop {=}\limits ^{{\mathrm {def}}}}\frac{1}{N-1}\sum _n (y_j(n+1)-y_j(n))^2\). The last expression is equivalent to a weighted \(\varDelta \) value using a linear graph (Fig. 3b). In all cases, the features are normalized to unit variance before computing their \(\varDelta \) values to allow for fair comparisons in spite of the feature scaling method.
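The \(\varDelta ^{\text {T,lin}}\) measure can be computed with a few lines of NumPy (the function name is ours; rows are samples ordered by increasing age, columns are features):

```python
import numpy as np

def delta_values(Y):
    """Delta value of each feature over a linear graph: mean squared
    difference of consecutive samples, after unit-variance normalization."""
    Y = np.asarray(Y, dtype=float)
    Y = Y / np.std(Y, axis=0)      # normalize for fair comparison
    diffs = np.diff(Y, axis=0)     # y_j(n+1) - y_j(n)
    return np.mean(diffs ** 2, axis=0)   # 1/(N-1) * sum of squared diffs
```

As a reference point, uncorrelated unit-variance noise has \(\varDelta \approx 2\), so values well below 2 indicate genuinely slow features.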
Table 6 shows \(\varDelta ^{\text {DR},\mathcal {G}_\text {3L}}_{1,2,3}\) (resp. \(\varDelta ^{\text {T,lin}}_{1,2,3}\)), that is, the \(\varDelta \) values of the three slowest features extracted from the DR-dataset (resp. T-dataset) using the graph \(\mathcal {G}_\text {3L}\) (resp. a linear graph). HiGSFA maximizes slowness better than HGSFA. The \(\varDelta ^\text {T,lin}\) values of the PCA features are larger, which is not surprising, because PCA does not optimize for slowness. Since \(\varDelta ^{\text {DR},\mathcal {G}_\text {3L}}\) and \(\varDelta ^\text {T,lin}\) are computed from different graphs, they should not be compared with each other. \(\varDelta ^\text {T,lin}\) considers transitions between images with the same or very similar ages but arbitrary race and gender. \(\varDelta ^{\text {DR},\mathcal {G}_\text {3L}}\) only considers transitions between images having at least one of a) the same gender, b) the same race, or c) different but consecutive age groups.
Table 6 Average delta values of the first three features extracted by PCA, HGSFA, and HiGSFA on either training (DR) or test (T) data

Age estimation error We treat age estimation as a regression problem with estimates expressed as an integer number of years, and use three metrics to measure age estimation accuracy: (1) the mean absolute error (MAE) (see Geng et al. 2007), which is the most frequent metric for age estimation; (2) the root mean squared error (RMSE), a common loss function for regression that, although sensitive to outliers and barely used in the literature on age estimation, might benefit applications requiring a stronger penalization of large estimation errors; and (3) cumulative scores (CSs, see Geng et al. 2007), which indicate the fraction of the images that have an estimation error below a given threshold. For instance, \(\mathrm {CS}(5)\) is the fraction of estimates (e.g., expressed as a percentage) having an error of at most 5 years w.r.t. the real age.
Table 7 Accuracy in years of state-of-the-art algorithms for age estimation on the MORPH-II database (test data)

The accuracies are summarized in Table 7. The MAE of HGSFA is 3.921 years, which is better than that of BIF+3Step, BIF+KPLS, BIF+rKCCA, and a baseline CNN, similar to BIF+rKCCA+SVM, and worse than the MCNNs. The MAE of HiGSFA is 3.497 years, clearly better than HGSFA and also better than MCNNs. At first submission of this publication, HiGSFA achieved the best performance, but newer CNN-based methods have now improved the state-of-the-art performance. In particular, MRCNN yields an MAE of 3.48 years, and \(\text {Net}^\text {VGG}_\text {hybrid}\) an MAE of only 2.96 years. In contrast, PCA has the largest MAE, namely 6.804 years.
MCNN denotes a multi-scale CNN (Yi et al. 2015) that has been trained on images decomposed as 23 48\(\times \)48-pixel image patches. Each patch has one out of four different scales and is centered on a particular facial landmark. A similar approach was followed by Liu et al. (2017), who propose a multi-region CNN (MRCNN). Xing et al. (2017) evaluated several CNN architectures and loss functions and propose a special-purpose hybrid architecture (\(\text {Net}^\text {VGG}_\text {hybrid}\)) with five VGG-based branches. One branch estimates a distribution on demographic groups (black female, black male, white female, white male), and the remaining four branches estimate age, where each branch is specialized in a single demographic group. The demographic distribution is used to combine the outputs of the four branches to generate the final age estimate.
In an effort to improve performance further, we also tried support vector regression (SVR) for supervised post-processing and computed an average age estimate over the original images and their mirrored versions (HiGSFA-mirroring). This slightly improved the MAE to 3.412 years, surpassing MRCNN. Mirroring is also used by MCNN and \(\text {Net}^\text {VGG}_\text {hybrid}\).
Detailed cumulative scores for HiGSFA and HGSFA are provided in Table 8 to facilitate future comparisons with other methods. The RMSE of HGSFA on test data is 5.148 years, whereas HiGSFA yields an RMSE of 4.583 years and PCA an RMSE of 8.888 years. RMSE values do not appear to have been reported for the other approaches.
The poor accuracy of PCA for age estimation is not surprising, because principal components might discard wrinkles, skin imperfections, and other information that could reveal age. Another reason is that principal components are too unstructured to be properly untangled by the soft GC method, in contrast to slow features, which have a very specific and simple structure.
Table 8 Cumulative scores in percent (larger is better) for maximum allowed errors ranging from 0 to 30 years

The estimation errors of HiGSFA are plotted in Fig. 10 as a function of the real age. On average, older persons are estimated to be much younger than they really are. This is partly due to the small number of older persons in the database, and because the oldest class used in the supervised step (soft GC) has an average age of about 58 years, making this the largest age the system can estimate. The MAE is remarkably low for persons below 45 years; the most accurate estimation is an MAE of only 2.253 years for 19-year-old persons.
Reconstruction error A reconstruction error measures how much information about the original input is contained in the output features. To compute it, we assume a linear global model for input reconstruction.
Let \({\mathbf {X}}\) be the input data and \({\mathbf {Y}}\) the corresponding set of extracted features. A matrix \({\mathbf {D}}\) and a vector \({\mathbf {c}}\) are learned from the DR-dataset using linear regression (ordinary least squares) such that \(\hat{{\mathbf {X}}} {\mathop {=}\limits ^{{\mathrm {def}}}}{\mathbf {D}} {\mathbf {Y}} + {\mathbf {c}}{\mathbf {1}}^T\) approximates \({\mathbf {X}}\) as closely as possible, where \({\mathbf {1}}\) is a vector of N ones. Thus, \(\hat{{\mathbf {X}}}\) contains the reconstructed samples (i.e. \(\hat{{\mathbf {x}}}_n {\mathop {=}\limits ^{{\mathrm {def}}}}{\mathbf {D}} {\mathbf {y}}_n + {\mathbf {c}}\) is the reconstruction of the input \({\mathbf {x}}_n\) given its feature representation \({\mathbf {y}}_n\)). Figure 2 shows examples of face reconstructions using features extracted by different algorithms.
The model is linear and global, which means that output features are mapped to the input domain linearly. For PCA this gives the same result as the usual multiplication with the transposed projection matrix plus image average. An alternative (local) approach for HiGSFA would be to use the linear reconstruction algorithm of each node to perform reconstruction from the top of the network to the bottom, one node at a time. However, such a local reconstruction approach is less accurate than the global one.
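The global linear model can be fitted with ordinary least squares by augmenting the features with a constant row; a minimal sketch under the conventions of the text (samples stored as columns; function names are ours):

```python
import numpy as np

def fit_linear_reconstruction(X, Y):
    """Fit D and c such that X ~ D Y + c 1^T (ordinary least squares).

    X: (d, N) input samples as columns; Y: (k, N) extracted features.
    """
    N = X.shape[1]
    Ya = np.vstack([Y, np.ones((1, N))])            # augment features with a bias row
    W, *_ = np.linalg.lstsq(Ya.T, X.T, rcond=None)  # W stacks [D | c] transposed
    D, c = W[:-1].T, W[-1]
    return D, c

def reconstruct(D, c, Y):
    """Reconstructed samples: x_hat_n = D y_n + c."""
    return D @ Y + c[:, None]
```

For PCA features this recovers the usual reconstruction (projection matrix transposed plus image average), since that map is itself the least-squares optimum.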
The normalized reconstruction error, computed on the T-dataset, is then defined as
$$\begin{aligned} e_\text {rec} {\mathop {=}\limits ^{{\mathrm {def}}}}\frac{\sum _{n=1}^{N} || {\mathbf {x}}_n - \hat{{\mathbf {x}}}_n || ^ 2 }{\sum _{n=1}^{N} || {\mathbf {x}}_n - \bar{{\mathbf {x}}} || ^ 2 } \, , \end{aligned}$$
(15)
which is the ratio between the energy of the reconstruction error and the variance of the test data, up to a factor \(N/(N-1)\).
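Given the reconstructed samples, Eq. (15) amounts to two sums of squared norms; a short sketch (names are ours):

```python
import numpy as np

def normalized_reconstruction_error(X, X_hat):
    """Eq. (15): energy of the reconstruction error divided by the
    (unnormalized) variance of the data around its mean.

    X, X_hat: (d, N) arrays with samples stored as columns.
    """
    x_bar = X.mean(axis=1, keepdims=True)          # sample mean of the data
    return np.sum((X - X_hat) ** 2) / np.sum((X - x_bar) ** 2)
```

By construction, a perfect reconstruction yields 0 and the constant reconstruction \(\bar{{\mathbf {x}}}\) yields exactly 1, matching the chance-level baseline reported below.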
Table 9 Reconstruction errors on test data using 75 features extracted by various algorithms

Reconstruction errors of HGSFA, HiGSFA, and PCA using 75 features are given in Table 9. The constant reconstruction \(\bar{{\mathbf {x}}}\) (chance level) is the baseline, with an error of 1.0. As expected, HGSFA does slightly better than chance level but worse than HiGSFA, which is closer to PCA. PCA yields the best possible features for the given linear global reconstruction method and is better than HiGSFA by 0.127. Of the 75 HiGSFA output features, 8 are slow features (slow part) and the remaining 67 are reconstructive. Even with only 67 features, PCA yields a reconstruction error of 0.211, which is still better than that of HiGSFA because the PCA features are computed globally.
HiGSFA network with HGSFA hyperparameters To verify that HiGSFA outperforms HGSFA for age estimation not simply because of different hyperparameters, we evaluate an HiGSFA network using the hyperparameters of the HGSFA network (the only difference being the use of iGSFA nodes instead of GSFA nodes). The hyperparameter \(\varDelta _T\), which is not present in HGSFA, is set as in the tuned HiGSFA network. As expected, the change of hyperparameters affected the performance of HiGSFA: the MAE increased to 3.72 years and the RMSE to 4.80 years. Although the suboptimal hyperparameters increased the estimation errors of HiGSFA, it remained clearly superior to HGSFA.
Sensitivity to the delta threshold \(\varDelta _T\) The influence of \(\varDelta _T\) on estimation accuracy and numerical stability is evaluated by testing different values of \(\varDelta _T\). For simplicity, the same \(\varDelta _T\) is used in layers 3 to 10 in this experiment (\(\varDelta _T\) is not used in layers 1 and 2, where the number of features in the slow part is constant and equal to 3 and 4, respectively). The performance of the algorithm as a function of \(\varDelta _T\) is shown in Table 10. The value yielding the minimum MAE, \(\varDelta _T = 1.96\), is used in the optimized architecture.
Table 10 Performance of HiGSFA on the MORPH-II database using different \(\varDelta _T\) (default value is \(\varDelta _T=1.96\))

The average number of slow features in the third layer changes moderately with the value of \(\varDelta _T\), ranging from 2.87 to 4.14 features, and the final metrics change only slightly. This shows that the parameter \(\varDelta _T\) is not critical and can be tuned easily.
Evaluation on the FG-NET database The FG-NET database (Cootes 2004) is a small database with 1002 facial images taken under uncontrolled conditions (e.g., many are not frontal) and includes identity and gender annotations. Due to its small size, it is unsuitable for evaluating HiGSFA directly. However, FG-NET is used here to investigate the capability of HiGSFA to generalize to a different test database. The HiGSFA \((\mathcal {G}_\text {3L})\) network trained with images of the MORPH-II database (either with the set \(S_1\) or \(S_2\)) is tested on images of the FG-NET database. For this experiment, images outside the original age range of 16 to 77 years are excluded.
For age estimation, the MAE is 7.32 ± 0.08 years and the RMSE 9.51 ± 0.13 years (using 4 features for the supervised step). For gender and race estimation, the classification rates (using 5 features) are 80.85% ± 0.95% and 89.24% ± 1.06%, respectively. The database does not include race annotations, but all inspected subjects appear closer to white than to black; we therefore labeled all test persons as white.
The most comparable cross-database experiment known to us is a system (Ni et al. 2011) trained on a large database of images from the internet and tested on FG-NET. Restricting the ages to the same 16–77-year range used above, their system achieves an MAE of approximately 8.29 years.