1 Introduction

Gait recognition is a technique for authenticating a person from his/her walking style. It has advantages over other physiological biometric modalities (e.g., DNA, fingerprints, irises, and faces) in that it works even with relatively low-resolution images [32] (e.g., CCTV footage captured at a large distance without subject cooperation), and in that gait is difficult to obscure and imitate. The demand for gait recognition has therefore grown in many applications for surveillance and forensics [1, 9, 24].

However, involving uncooperative subjects makes gait recognition more challenging, as gait may be influenced by various covariates such as views, shoes, surfaces, clothing, carriage, and speed [2, 34]. Among these covariates, speed change is one of the most common challenging factors and often occurs in real scenes depending on the situation (e.g., a perpetrator running from a crime scene). Gait recognition performance may significantly degrade under speed variation because the speed change induces changes in appearance-based gait features (e.g., the gait energy image (GEI) [6] and frequency-domain features [26]), which are often used in the gait recognition community. Extensive efforts to achieve speed-invariant gait recognition have therefore been made, as in the previous studies [5, 16, 17, 23, 27]. While these approaches can mitigate the effect of speed on gait recognition to some extent, most of them work poorly under large speed changes, or suffer from high computational cost, which is an important aspect in real-world applications.

In contrast to the above work, there are approaches to cross-speed gait recognition using dynamic part attenuation because the speed change mainly affects dynamic parts like arm swing and stride length.

Tanawongsuwan and Bobick [41] proposed using silhouettes at the single-support phases as a part of the gait features because the single-support phases, where the limbs are closest together, as shown in Fig. 1, do not vary drastically as the speed changes, while double-support phases change significantly depending on speed. However, a single key-frame at the single-support phase is easily influenced by silhouette segmentation noise, temporary posture changes, and phase estimation errors, which also degrade the gait recognition performance.

Fig. 1 Examples for comparing GEIs, key-frames, and SSGEIs. Nine frames are evenly chosen from a period in both the gallery (2 km/h) and probe (7 km/h) sequences of the same subject. The corresponding GEIs, single key-frame at the single-support phase, and SSGEIs are shown on the right. The subtraction image for each feature is shown at the bottom. It is clear that the SSGEIs can reduce the appearance differences caused by speed variance, posture change, and phase differences simultaneously

Iwashita et al. [12] applied a mutual subspace method (MSM) to cross-speed gait recognition, in which a set of silhouettes at various phases is represented by a subspace and a dissimilarity measure is obtained as the canonical angle between the subspaces for a matching pair. More specifically, in [12], a silhouette at an arbitrary phase, i.e., a transition from a single-support phase to double-support phase, is well approximated on each subspace and the canonical angle is obtained by minimizing the forming angle between a pair of silhouettes represented in each subspace. Therefore, in the cross-speed scenario, the canonical angle is expected to be obtained at the single-support phases, where the effect of the speed changes is minimal. However, the possibility of a false match at phases other than the single-support phase remains because the subspace represents silhouettes at various phases.

To overcome these defects, we propose a speed-invariant and stable gait representation called single-support GEI (SSGEI) for speed-invariant gait recognition. By combining single-support phases with the concept of GEI [6], in which multiple frames are aggregated for silhouette noise reduction, we also aggregate multiple frames of a certain duration around the single-support phase. Because longer duration leads to more stability but less speed invariance, while shorter duration leads to less stability but more speed invariance, we determine the optimal duration that balances the speed invariance and stability using a training set.

In addition to the above cross-speed gait recognition within a walking mode, there are a few studies [5, 8, 48] on those within a running mode or between walking and running modes (i.e., cross-mode gait recognition). In particular, cross-mode gait recognition is much more challenging than recognition within the same mode, because even the proposed SSGEIs are subject to changes in body inclination angle and leg motion between walking and running modes (see the running probe SSGEI (8 km/h) and walking gallery SSGEI (7 km/h) in Fig. 2).

Fig. 2 SSGEIs of walking and running sequences from different subjects. Each column shows the SSGEIs of a running probe sequence (8 km/h) and walking gallery sequence (7 km/h) from the same subject

Of these studies, only Guan and Li’s work [5] tackled cross-mode gait recognition to the best of our knowledge. The method in [5], however, works poorly for cross-mode gait recognition because they apply a common metric learning technique called the random subspace method (RSM), regardless of the mode (i.e., walking or running) and hence they cannot absorb the large differences between the walking and running modes.

Taking a closer look at the walking and running SSGEIs shown in Fig. 2, we note that there are some common changes in the upper bodies and leg motions among different subjects. It is therefore reasonable to consider a geometric transformation to register the pose differences between different gait modes. Such registration is a widely used preprocessing procedure in the face recognition community but, to the best of our knowledge, has not been introduced in the field of cross-speed gait recognition. We therefore generate a generic warping between the walking and running modes across the population to cope with cross-mode gait recognition. The contributions of this work are four-fold.

  1. A speed-invariant and stable gait representation. The proposed SSGEI realizes a good trade-off between speed invariance and stability, which can be intuitively understood by the example shown in Fig. 1. The subject exhibits temporary posture changes as he looks down in several frames in the gallery sequence (2 km/h), while he keeps on walking normally in the probe sequence (7 km/h). Moreover, the selected single-support key-frames may contain slight phase differences. In contrast, GEIs mitigate the temporary posture changes and phase differences, but are directly affected by the changes in dynamic parts like stride and arm swing due to speed variation. Consequently, both key-frames and GEIs suffer from significant differences between the gallery and probe features, which may lead to false matches. However, these differences between gallery and probe images are well suppressed with the proposed SSGEI owing to its balance between speed invariance and stability, which are derived from the concepts of key-frames at the single-support phase and aggregation in the GEI, respectively.

  2. General framework for three speed-invariant gait recognition cases. We designed a general framework based on the SSGEI to appropriately handle three different scenarios of speed-invariant gait recognition, i.e., within-walking, within-running, and cross-mode scenarios. More specifically, we first define two gait mode classes (i.e., walking and running) considering the trade-off between fine warping fields and difficulties in the gait mode classification. We then apply a mode classification technique and subsequently compensate for the cross-mode difference by morphing the SSGEIs of walking and running modes into those of the intermediate mode by free-form deformation (FFD) [35], which is brought into speed-invariant gait recognition for the first time.

  3. State-of-the-art accuracy for speed-invariant gait recognition on two publicly available data sets. We evaluated the proposed method in within-walking, within-running, and cross-mode scenarios with the OU-ISIR Gait Database, Treadmill Data set A [25]. We also tested the within-walking scenario with the CASIA Gait Database, Data set C [38]. The former data set contains the largest speed variations and is the only data set that includes running sequences, while the latter data set contains a larger number of subjects with speed variations in the walking mode, which makes the performance evaluation more statistically reliable. The experimental results for both data sets show that the proposed method yields state-of-the-art accuracies in both verification and identification scenarios.

  4. Low computational cost. The proposed method is also executable with a low computational cost and hence is more suitable for real-world surveillance applications, while the other state-of-the-art methods require relatively high computational costs.

2 Related work

2.1 Speed-invariant gait recognition in the within-walking scenario

Various speed-invariant gait recognition methods have been proposed, and they fall into two categories [16]: i) transforming features from a reference speed to another speed and ii) extracting speed-invariant gait features. The core of the first category is to learn the relationship between features under different walking speeds [17], such as stride normalization for double-support frames [41] and a factorization-based speed transformation model [27]. However, the transformation-based approaches suffer from computationally expensive model fitting and perform relatively poorly when the speed change is large.

In the second category, speed-invariant gait features are employed for gait recognition [11, 12, 14, 16, 23, 39, 40]. For example, in [16], based on Procrustes shape analysis descriptors, the differential composition model (DCM) was introduced to differentiate the effects on each body part caused by speed change. Iwashita et al. [12] applied a mutual subspace method (MSM) to a set of gait silhouette images, and a canonical angle between the gallery and probe subspaces is computed as the dissimilarity measure, which is often attained at the single-support phases in the cross-speed case. They further extended this approach in [11] by dividing the human body into multiple areas and using a matching weight to select the relatively static parts. The elimination of dynamic parts helps to reduce the effects of walking speed variations, but it may fail if temporary posture changes occur on the static parts. Moreover, it is unsuitable for extension to cross-mode gait recognition, where both static and dynamic parts obviously change across the walking and running modes.

Another direction is to directly apply a metric learning approach to cross-speed gait recognition. Guan and Li [5] employed RSM to combine a large number of weak classifiers, which can reduce the generalization errors caused by different walking speeds. The RSM framework achieves significant performance improvements, but it faces two limitations: 1) the accuracy varies because of its random nature and 2) it is time-consuming to calculate because it needs to construct a large number of random subspaces and execute a matching process for each one.

2.2 Cross-mode gait recognition

Most cross-speed gait recognition studies only focus on within-walking cases, yet running gait recognition, particularly cross-mode gait recognition, is worth further investigation because a running perpetrator may often need to be recognized only with his/her walking gallery in real scenes. Yam et al. [48] proposed an analytical model using the biomechanics of human locomotion and a unique mapping was found between walking and running gait features for each subject. A generic mapping across the population may, however, not exist, which limits its use in surveillance for identifying unknown runners by their walking features only. In a unique study that evaluates both the speed changes in each mode and the cross-mode scenario, Guan and Li [5] applied RSM. Although they achieved high accuracies in within-walking and within-running scenarios, the method still performed poorly in cross-mode tasks.

2.3 Gait mode classification

There has been considerable interest in the classification of gait modes or, more generally, of different types of human actions [4]. Yu et al. [50] used a three-layer feedforward network to classify walking, running, and other action types based on the trajectories in eigenspace. Cheng et al. [3] computed a characteristic frequency using the mean motion magnitude between frames. Kim et al. [13] proposed a tensor canonical correlation analysis method. In [21], a real-time human action recognition solution was proposed based on luminance field trajectory analysis and learning. Fihl et al. [4] introduced the duty-factor (i.e., the fraction of the stride duration over which each foot remains on the ground) to characterize gait modes, which is independent of challenging factors such as varying speeds. Although some of the existing methods have a high classification accuracy, they still require relatively high computational costs.

2.4 Deep learning-based gait recognition

To date, deep learning-based approaches have demonstrated state-of-the-art performance in gait recognition, mainly focusing on gait recognition under view variations. While [44] and [43] proposed deep convolutional neural network (CNN) models using raw silhouette images as the inputs, Shiraga et al. [36] designed GEINet, whose input is a single GEI. Some recent works [45, 51] presented CNN models with two inputs, where the similarities of these two inputs were learnt to discriminate between same-subject pairs and different-subject pairs, and in [37], CNN architectures with different inputs and outputs were explored for gait verification and identification scenarios, respectively. These approaches achieved superior performance in comparison to traditional methods. However, a sufficiently large number of training samples is required to obtain reliable CNN models, which makes them unsuitable for data sets with a small sample size.

3 Gait recognition using SSGEI

3.1 Overview

An overview of the proposed framework is shown in Fig. 3. Given a matching pair of gait silhouette sequences, i.e., a gallery and probe, which can be extracted from raw images by a background subtraction-based graph-cut segmentation [28], or recent state-of-the-art deep learning-based semantic segmentation methods such as RefineNet [22], we first generate the size-normalized and registered silhouette sequences by height normalization and registration using the region center [26]. After detecting the gait period, by aggregating multiple single-support frames over the optimal duration of the period, we extract the SSGEI as a gait feature.

Fig. 3 Overview of the proposed framework

Because the subsequent procedure depends on the gait modes (walking or running) of the matching pair, which are not known in advance, we estimate the gait modes based on SSGEIs using a gait mode classifier. In the cross-mode case (i.e., one SSGEI is walking and the other is running), to reduce appearance changes caused by the pose difference between walking and running, we morph both the gallery and probe SSGEIs into intermediate poses using a generic warping field. In addition, residuals may still remain after the generic warping because of the subject-dependent transition between the gait modes (e.g., some subjects may raise their arms higher in running mode than generic subjects, even if their arm swings are similar in walking mode), so we attenuate the SSGEIs at such easily affected positions. We finally apply Gabor filtering and metric learning to the obtained SSGEIs as postprocesses and then compute the L2 distance as the dissimilarity measure. The final performance for verification scenarios (i.e., one-to-one matching) is obtained by comparing the dissimilarity score with an acceptance threshold, while the accuracy for identification scenarios (i.e., one-to-many matching) is calculated using a nearest neighbor classifier, which is the most widely used classifier in the gait identification community.

Details for the procedures are given in the following sections.
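The final matching step described above can be illustrated with a minimal sketch. The function names (`l2_distance`, `verify`, `identify`) and the flat feature vectors are assumptions made for illustration; in the actual pipeline the inputs would be the postprocessed SSGEI features.

```python
import numpy as np

def l2_distance(a, b):
    """L2 distance between two feature arrays, used as the dissimilarity score."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def verify(probe, gallery, threshold):
    """One-to-one matching: accept if the dissimilarity is within the threshold."""
    return l2_distance(probe, gallery) <= threshold

def identify(probe, gallery_features, gallery_ids):
    """One-to-many matching: return the ID of the nearest gallery feature."""
    distances = [l2_distance(probe, g) for g in gallery_features]
    return gallery_ids[int(np.argmin(distances))]
```

Sweeping `threshold` over the range of observed scores would trace out the verification trade-off between false acceptances and false rejections, while `identify` corresponds to rank-1 identification.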

3.2 SSGEI extraction

3.2.1 Representation

A gait period is first detected from the lower body parts of the normalized silhouette sequence. Given the body height H, we set the vertical position of the knee to 0.285H based on anatomical data statistics, as suggested in [7]. A temporal series of the width of the lower body from the foot bottom to the knee is computed and the local maxima and minima are found as the double-support phases and single-support phases, respectively.
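The phase detection described above can be sketched as follows. This is an illustrative sketch only: it assumes that the silhouettes are stored as a binary array of shape (frames, H, W) with the foot bottom on the last row, that the knee height 0.285H is measured upward from the foot bottom, and that the width signal needs no smoothing; the function names are hypothetical.

```python
import numpy as np

def lower_body_width(silhouettes, knee_ratio=0.285):
    """Per-frame width of the region between the knee and the foot bottom.

    silhouettes: binary array of shape (num_frames, H, W), height-normalized,
    with the foot bottom on the last row (assumption of this sketch).
    """
    num_frames, height, _ = silhouettes.shape
    knee_row = int(round(height * (1.0 - knee_ratio)))  # knee at 0.285H above the feet
    lower = silhouettes[:, knee_row:, :]
    occupied = lower.any(axis=1)  # (num_frames, W): columns containing foreground
    widths = np.zeros(num_frames)
    for t in range(num_frames):
        xs = np.flatnonzero(occupied[t])
        if xs.size:
            widths[t] = xs[-1] - xs[0] + 1
    return widths

def local_extrema(widths):
    """Return (maxima, minima) frame indices of the width signal.

    Maxima are taken as double-support phases, minima as single-support phases.
    """
    maxima, minima = [], []
    for t in range(1, len(widths) - 1):
        if widths[t] > widths[t - 1] and widths[t] >= widths[t + 1]:
            maxima.append(t)
        elif widths[t] < widths[t - 1] and widths[t] <= widths[t + 1]:
            minima.append(t)
    return maxima, minima
```

Because one gait period contains three double-support phases, the distance between every second detected maximum gives an estimate of the period in frames. Real silhouette sequences are noisy, so a smoothing step before `local_extrema` would likely be needed in practice.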

Thus, we define a gait period of T [frames] to start from a double-support phase (\(t = t_{ds,1} = 0\)), go through two single-support phases (\(t = t_{ss,1}\) and \(t = t_{ss,2}\)) with an intermediate double-support phase (\(t = t_{ds,2}\)) between them, and finally end with the third double-support phase (\(t = t_{ds,3} = T\)), as shown in Fig. 4.

Fig. 4 Demonstration of the gait period and duration for an SSGEI along with the normalized silhouette sequences from three different walking speeds. The horizontal axis t/T represents a non-dimensional time normalized by gait period T. Note that the frame intervals for three walking speeds are different because of the gait period difference. Two single-support phases are denoted as \(p_{ss,k}\ (k = 1,2)\) in this non-dimensional time domain. The durations composed of multiple single-support phases within the range \([p_{ss,k} - p,\ p_{ss,k} + p]\ (k = 1,2)\) are selected for constructing the SSGEI, where p is a hyperparameter for duration selection

To define the duration around single-support phases in a walking speed rate-invariant way, we convert a time \(t \in \mathbb {Z}\) [frames] into a non-dimensional time \(p = t / T \in \mathbb {R}\), which is normalized by period T. Suppose that we take a duration of 2p around each single-support phase \(p_{ss,k}\ (k = 1,2)\) in the non-dimensional time domain. Then, the duration around the k-th single-support phase is defined as \([p_{ss,k} - p,\ p_{ss,k} + p]\). Note that the duration parameter p is subject to 0 < p ≤ 1/4 (the two durations together cover the whole period if p = 1/4).

Once the durations are defined, we can convert them back into the original time domain and obtain the starting and ending frames for the k-th duration as \(t_{ss, k}^{s}(p) = \lceil (p_{ss, k} - p)T \rceil \) and \(t_{ss, k}^{e}(p) = \lfloor (p_{ss, k} + p)T \rfloor \), respectively, where ⌈⋅⌉ and ⌊⋅⌋ are ceiling and floor functions, respectively.

An SSGEI can now be computed based on the durations. Let a binary silhouette value at position (x, y) from the t-th frame in the size-normalized silhouette sequence be I(x, y, t), where 0 and 1 indicate the background and foreground, respectively. SSGEI S(x, y;p) is defined using duration parameter p as

$$ S(x, y; p) = \frac{1}{2} \sum\limits_{k = 1}^{2} \frac{1}{t_{ss, k}^{e}(p) - t_{ss, k}^{s}(p) + 1} \sum\limits_{t = t_{ss, k}^{s}(p)}^{t_{ss, k}^{e}(p)} I(x, y, t). $$
(1)
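Equation (1), together with the frame-index conversion above, can be sketched in a few lines. The array layout (frames, H, W), the argument `t_start` (frame index of the double-support phase that starts the period), and the function name are assumptions made for illustration.

```python
import math
import numpy as np

def ssgei(silhouettes, t_start, period, p_ss, p):
    """Average the silhouettes around each single-support phase, as in Eq. (1).

    silhouettes: binary array of shape (num_frames, H, W).
    t_start: frame index of the double-support phase starting the period.
    period: gait period T in frames.
    p_ss: the two single-support phases in non-dimensional time.
    p: duration parameter, 0 < p <= 1/4.
    """
    if not 0.0 < p <= 0.25:
        raise ValueError("p must satisfy 0 < p <= 1/4")
    result = np.zeros(silhouettes.shape[1:], dtype=float)
    for p_k in p_ss:
        t_s = math.ceil((p_k - p) * period)   # starting frame of the k-th duration
        t_e = math.floor((p_k + p) * period)  # ending frame of the k-th duration
        if t_e < t_s:
            raise ValueError("duration contains no frame; increase p")
        frames = silhouettes[t_start + t_s : t_start + t_e + 1]
        result += frames.mean(axis=0)         # inner sum divided by the frame count
    return result / len(p_ss)                 # outer factor 1/2 for two phases
```

The ceiling and floor operations are sensitive to floating-point rounding when \((p_{ss,k} \pm p)T\) lands exactly on an integer, so an implementation may prefer to work with rational values such as i/40 multiplied out in integer arithmetic.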

Examples of SSGEIs can be found in Fig. 1. The SSGEI shows its effectiveness clearly when compared with GEI and a single single-support key-frame.

3.2.2 Optimal duration estimation

We next need to carefully select the optimal duration parameter p to realize a good trade-off between the speed invariance and stability of the proposed SSGEI. To this end, we introduce a well-known criterion for discrimination capability, i.e., the Fisher ratio of between-class distance to within-class distance, computed on a training set that includes speed variations. Note that the composition of the training set varies depending on the scenario, i.e., within-walking, within-running, or cross-mode matching. More specifically, for within-walking or within-running scenarios, the training set only includes walking or running SSGEIs, respectively, while for the cross-mode case, the training set includes morphing results at the intermediate pose of walking and running SSGEIs, which are introduced in later sections. The optimal duration parameter p is then obtained as the one that maximizes the Fisher ratio. We refer readers to [46] for more details about the acquisition of the Fisher ratio.
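The selection procedure can be sketched as below. The exact Fisher ratio is defined in [46]; the form used here (between-class scatter of class means divided by within-class scatter, with one class per subject) is a common textbook variant and is an assumption of this sketch, as are the function names and the `extract_features` callback.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Between-class scatter divided by within-class scatter (one class per subject)."""
    features = np.asarray(features, dtype=float)
    features = features.reshape(len(features), -1)  # flatten each SSGEI
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        x = features[labels == c]
        mu = x.mean(axis=0)
        between += len(x) * np.sum((mu - global_mean) ** 2)
        within += np.sum((x - mu) ** 2)
    return between / within

def select_duration(candidates, extract_features, labels):
    """Return the candidate p with the largest Fisher ratio, plus all scores.

    extract_features(p) is a hypothetical callback returning the training
    SSGEIs computed with duration parameter p, in the same order as labels.
    """
    scores = {p: fisher_ratio(extract_features(p), labels) for p in candidates}
    return max(scores, key=scores.get), scores
```

With the candidate set used later in the experiments, `candidates` would be `[i / 40 for i in range(1, 11)]`.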

3.3 Classification of walking and running

Firstly, two gait modes, i.e., walking and running, are defined based on the walking/running pose of a subject, which mainly differs in body inclination angle and leg motion (see Fig. 2). Because dynamic part variation caused by speed changes within the same gait mode may degrade the classification accuracy of walking and running modes, we adopt the SSGEI to address this problem by considering the trade-off between speed invariance and stability within the same mode. Note that we can see relatively common (i.e., subject-independent) pose changes between walking and running modes, as shown in Fig. 2. Moreover, because the gait period of the running mode is generally much shorter than that of the walking mode, we exploit the gait period T [frames] as a useful feature for gait mode classification. More specifically, we define a concatenated feature vector of the gait period T [frames] and SSGEI and feed it to a linear support vector machine for classification into walking or running mode. Note that the gait mode classifier is trained using the training set composed of walking and running SSGEIs with diverse speeds.

3.4 Morphing by FFD

To overcome the large intra-class differences between running and walking modes, we utilize a generic warping field between them across the population. For this purpose, we utilize the notion of FFD with piece-wise linear interpolation, because the FFD provides a high degree of flexibility for describing the transformation of non-rigid objects such as a human and maintains derivative continuity at adjacent regions [35]. It thus preserves the gait characteristics used for person authentication, unlike some example-based view transformation approaches such as [15, 26, 30], which may corrupt the geometric continuity of the gait features.

Instead of a conventional bi-directional cost function to minimize the error between the target and transformed source as well as between the source and inverse-transformed target [19], we introduce a cost function to minimize the error between targets and sources that are both transformed into intermediate SSGEIs. More specifically, we allocate a set of control points on the SSGEI and then define a set of two-dimensional displacement vectors from the walking to intermediate SSGEI on the control points as \(\vec {u}\). We then define a warping field \(F(\vec {u})\) from walking SSGEI to intermediate SSGEI by piece-wise linear interpolation. Similarly, we consider a reverse version of the displacement vector \(-\vec {u}\), and its warping field \(F(-\vec {u})\) from running SSGEI to intermediate SSGEI. We finally match the morphed SSGEIs in the intermediate domain. The advantages of this deformation representation are i) the deformation between walking and running is treated symmetrically and ii) the degree of deformation from each walking and running mode to the intermediate mode is equal to each other (i.e., \(\|\vec {u}\| = \|-\vec {u}\|\)).

Using the above concept, we denote a pair of source and target SSGEIs (i.e., running and walking SSGEIs) as \(S^{S}_{i,j}, S^{T}_{i,j}\in \mathbb {R}^{H_{S} \times W_{S}} (i = 1, \ldots , N_{c}, j = 1, \ldots , k_{i})\), respectively, where \(N_{c}\) and \(k_{i}\) are the number of training subjects and the number of source/target pairs for the i-th training subject, respectively. Suppose a mapping of the warping field from the source to the intermediate mode is obtained as \(F(\vec {u})\) by a piece-wise linear interpolation of \(\vec {u}\). Then, the transformed source SSGEI is represented as \(S^{S}_{i,j} \circ F(\vec {u})\), where ∘ indicates a transformation operator. Similarly, the transformed target SSGEI is represented as \( S^{T}_{i,j} \circ F(-\vec {u})\). Consequently, we obtain the optimal displacement vector \(\vec {u}^{*}\) by minimizing the summation of differences between the transformed source and target SSGEIs in the intermediate domain as

$$ \vec{u}^{*} = \arg\min_{\vec{u}} E(\vec{u}), $$
(2)

where

$$ E(\vec{u}) = \sum\limits_{i=1}^{N_{c}} \sum\limits_{j = 1}^{k_{i}} \Vert S^{S}_{i,j} \circ F(\vec{u}) - S^{T}_{i,j} \circ F(-\vec{u}) {\Vert_{F}^{2}} + \lambda R(\vec{u}). $$
(3)

Here, \(R(\vec {u})\) is a smoothness term, i.e., a linear elastic constraint on the displacements between adjacent control points [19], and λ is a hyperparameter to control the smoothness. We solve the optimization of (2) by gradient descent. More specifically, the displacement vectors \(\vec {u}\) are set to zero at initialization, and then the gradient of \(E(\vec {u})\) is computed to update \(\vec {u}\) iteratively until convergence. As such, we obtain the intermediate SSGEIs \( S^{S^{\prime }}_{i,j} = S^{S}_{i,j} \circ F(\vec {u}^{*})\) and \(S^{T^{\prime }}_{i,j} = S^{T}_{i,j} \circ F(-\vec {u}^{*})\) transformed from the source and the target, as shown in Fig. 6a–f.
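The structure of the cost in (3) can be made concrete with a small sketch that evaluates \(E(\vec {u})\) for a given set of control-point displacements. Several choices here are assumptions made for illustration and are not taken from the paper: the control points lie on a regular grid spanning the image, the warp is a backward (sampling) warp with bilinear interpolation, and \(R(\vec {u})\) is approximated by squared differences between adjacent control points. The gradient descent loop itself is omitted.

```python
import numpy as np

def dense_field(u, shape):
    """Piece-wise (bi)linear interpolation of control displacements u (gh, gw, 2)
    to a dense per-pixel field of shape (h, w, 2), ordered as (dy, dx)."""
    h, w = shape
    gh, gw, _ = u.shape
    gy = np.linspace(0, gh - 1, h)
    gx = np.linspace(0, gw - 1, w)
    y0 = np.clip(np.floor(gy).astype(int), 0, gh - 2)
    x0 = np.clip(np.floor(gx).astype(int), 0, gw - 2)
    fy = (gy - y0)[:, None, None]
    fx = (gx - x0)[None, :, None]
    top = u[y0][:, x0] * (1 - fx) + u[y0][:, x0 + 1] * fx
    bottom = u[y0 + 1][:, x0] * (1 - fx) + u[y0 + 1][:, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def warp(image, field):
    """Sample image at (y + dy, x + dx) with bilinear interpolation."""
    h, w = image.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(yy + field[..., 0], 0, h - 1)
    sx = np.clip(xx + field[..., 1], 0, w - 1)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    fy, fx = sy - y0, sx - x0
    return (image[y0, x0] * (1 - fy) * (1 - fx) + image[y0, x0 + 1] * (1 - fy) * fx
            + image[y0 + 1, x0] * fy * (1 - fx) + image[y0 + 1, x0 + 1] * fy * fx)

def energy(u, sources, targets, lam):
    """E(u): data term over all source/target pairs plus lam times a smoothness term."""
    shape = sources[0].shape
    f_pos, f_neg = dense_field(u, shape), dense_field(-u, shape)
    data = sum(np.sum((warp(s, f_pos) - warp(t, f_neg)) ** 2)
               for s, t in zip(sources, targets))
    smooth = np.sum(np.diff(u, axis=0) ** 2) + np.sum(np.diff(u, axis=1) ** 2)
    return data + lam * smooth
```

Because the source is warped with \(F(\vec {u})\) and the target with \(F(-\vec {u})\), a source and target that differ by a pure horizontal shift of two pixels are aligned by a uniform displacement of one pixel, i.e., each side moves half way toward the other.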

3.5 Attenuation field

While the above generic warping mitigates the intra-subject inter-mode differences to some extent, residuals may still remain because of subject-dependent transitions between the gait modes (e.g., some subjects may raise their arms higher in running mode than generic subjects even if their arm swings are similar in walking mode), as shown in Fig. 2. We therefore introduce an attenuation field to suppress such subject-dependent residuals.

For this purpose, we further employ an outlier detection method [31] in a framework of transportation minimization-based morphing called the earth mover’s morphing framework [29] to determine appearing/disappearing regions between a source and target derived from the subject-dependent residuals. We refer the reader to [29, 31] for more details. Specifically, given a transformed source and target SSGEIs \(S^{S^{\prime }}_{i,j}\) and \(S^{T^{\prime }}_{i,j}\), we regard the brightness at each pixel in the transformed SSGEIs as a sort of mass assigned to the pixel and then try to transport all the pixels in the transformed source SSGEI into those in the transformed target SSGEIs with the minimal cost (i.e., the weighted sum of travelling distances by mass). Here, the subject-dependent residual such as arm swing difference from the generic warping may require a large transportation cost, and hence we prepare an exceptional path to a trash bin, which is automatically assigned to a pixel whose transportation cost exceeds a certain threshold in the transportation minimization framework. Consequently, we regard such trash-bin pixels as outliers for the generic warping field between walking and running modes, and then construct an attenuation field by aggregating the trash-bin pixels over all the training subjects, as shown in Fig. 6i.

Once the attenuation field is obtained, we attenuate the intensities of both the transformed source and target SSGEIs for each pixel, e.g., if an attenuation value at a certain pixel is 70%, the intensities of the transformed SSGEIs at the same pixel are reduced by 70%.
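The attenuation step amounts to a per-pixel scaling, sketched below. The aggregation rule in `build_attenuation_field` (the fraction of training pairs in which a pixel was sent to the trash bin) is an assumption of this sketch, since the text only states that trash-bin pixels are aggregated over the training subjects; both function names are hypothetical.

```python
import numpy as np

def build_attenuation_field(trash_bin_masks):
    """Aggregate binary trash-bin masks (num_pairs, H, W) into values in [0, 1].

    Assumption: the attenuation value is the fraction of training pairs
    in which the pixel was detected as an outlier.
    """
    return np.asarray(trash_bin_masks, dtype=float).mean(axis=0)

def attenuate(ssgei, attenuation_field):
    """Reduce each pixel intensity by its attenuation value (0.7 means reduce by 70%)."""
    a = np.clip(np.asarray(attenuation_field, dtype=float), 0.0, 1.0)
    return np.asarray(ssgei, dtype=float) * (1.0 - a)
```

The same field is applied to both the transformed source and the transformed target SSGEIs before matching.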

3.6 Update morphing

To mitigate the effect of the subject-dependent residuals in the acquisition process of the generic warping field, we recompute the optimal displacement vector \(\vec {u}^{*}\) by introducing the attenuation field. More specifically, when computing the Frobenius norm in (3), we reduce the intensities of both the transformed source and target SSGEIs for each pixel depending on the attenuation field. As such, we obtain the updated displacement vector, and then update the attenuation field in turn.

4 Postprocessing

The effectiveness of Gabor filtering has been demonstrated in the context of biologically inspired image understanding processes [18, 42], and its effectiveness in gait recognition has also been demonstrated in [5, 42, 47]. We therefore also introduce Gabor filtering as a postprocessing step for the proposed SSGEIs in the within-walking and within-running cases, as well as for the transformed SSGEIs in the cross-mode case (referred to as Gabor-SSGEI). In the Gabor feature space, we further employ two-dimensional principal component analysis (2DPCA) [49] to reduce the feature dimensions in the column direction while retaining 99% of the variance in our applications. Two-dimensional linear discriminant analysis (2DLDA) [20] is then exploited to obtain a discriminative projection in the row direction. We refer readers to the supplemental material for the details of postprocessing.

5 Experiments

5.1 Data sets and parameter settings

To evaluate the proposed method, we adopted two publicly available data sets, i.e., the OU-ISIR Gait Database, Treadmill Dataset A (OUTD-A) [25] and the CASIA Gait Database, Dataset C (CASIA-C) [38].

The first data set contains image sequences of 34 subjects with speed variations ranging from 2 km/h to 10 km/h in 1 km/h intervals. We use this data set to evaluate our method in all the following experiments because it has the largest speed variations and is the only data set that includes running sequences. Following the settings of the data set [25], walking speeds from 2 km/h to 7 km/h are used for the within-walking case, while running speeds from 8 km/h to 10 km/h are used for the within-running case. The cross-mode case includes all speeds, where the galleries are walking speeds while the probes are running speeds and vice versa. Nine subjects were used for training parameter p, the generic warping field, and 2DPCA and 2DLDA. The other disjoint 25 subjects were used for testing according to the protocol suggested in [27]. In identification scenarios, we followed an uncooperative setting, i.e., subjects in a specific gallery may have different speeds and/or gait modes, which makes the identification task more challenging than the cooperative setting, i.e., subjects in a specific gallery have the same speed and gait mode. Therefore, gait mode classification was applied to both probe and gallery sequences before subsequent procedures.

The second data set is composed of 153 subjects with three different walking speeds, i.e., slow (fs), normal (fn), and fast (fq) walking. This data set was used for experiments in Section 5.7 to make the performance evaluation more statistically reliable. Following [17], 33 subjects were randomly selected to make up the training set, and the remaining 120 subjects were used for the testing set. To mitigate the effect of the random selection on performance evaluation, we repeated this random selection process 10 times, and report the mean accuracies for the identification scenarios and accuracies in the verification scenarios using the entire set of dissimilarity scores. Eight sequences were collected for each subject, which were composed of four fn sequences, two fs sequences, and two fq sequences. Three fn sequences, one fs sequence, and one fq sequence were chosen as the fn, fs, and fq galleries, respectively, while the other sequences were used as probes. For example, when three fn sequences were used as the fn gallery, the remaining one fn, two fs, and two fq sequences were probes.

In our applications, the dimensions of 2DLDA were all chosen from the candidate set {1, 10, 20, …, 250} to maximize the accuracy on the training set in both the verification and identification scenarios.

5.2 Analysis on the optimal duration parameter

As described in Section 3.2.1, the optimal duration parameter p is selected within 0 < p ≤ 1/4. Concretely speaking, we empirically prepared a discrete set of parameter candidates as p ∈{i/40}(i = 1,2,…,10) at 1/40 intervals (when p = 10/40, the duration includes the whole period). We report the Fisher ratio of the training set as well as the corresponding rank-1 identification rate (i.e., identification accuracy) of the testing set for each parameter candidate p under within-walking and cross-mode cases in Fig. 5. We refer readers to the supplemental material for the within-running case. Note that for the cross-mode case, parameter p was chosen using the transformed SSGEIs. These results show that the best rank-1 identification rates are obtained at the optimal durations by the Fisher ratios for all the three cases, which shows the generality of duration parameter p. As a result, we adopted p = 3/40, 8/40, and 2/40 in our experiments for the within-walking, within-running, and cross-mode cases, respectively.

Fig. 5

Fisher ratio and corresponding rank-1 identification rate of the testing set for each duration parameter candidate using the training set

5.3 Gait mode classification

Because it is an important preprocessing step, we report gait mode classification accuracy. For comparison, we also tested GEI and the key-frame at the single-support phase (simply called the key-frame) concatenated with the gait period in addition to the proposed SSGEI. A training set for the gait mode classification was composed of sequences with multiple periods under nine speeds from the nine training subjects, which summed up to 279 samples, while a testing set from the other 25 test subjects contained 775 samples.

The results in Table 1 show that only SSGEI yields 100% correct classification rates for all three subsets, and hence we can avoid the accuracy decrease due to the misclassification of the gait modes. In other words, the high correct classification rate implies the existence of a common transformation between walking and running SSGEIs among different subjects, and hence indicates the technical soundness of using a generic warping field between walking and running SSGEIs across the population.

Table 1 Correct classification rate [%] of the gait mode using GEI, key-frame, and SSGEI for each subset

5.4 Visualization of morphing process

To better understand the effectiveness of the transformed SSGEI, we visualize the morphing process using two typical examples, i.e., the easiest cross-mode case of running at 8 km/h and walking at 7 km/h and the most difficult cross-mode case of running at 10 km/h and walking at 2 km/h, as shown in Fig. 6.

Fig. 6

Two examples of transformed SSGEIs. a Original running SSGEI. b Generic warping field from running to intermediate SSGEI. c SSGEI transformed from (a). d Original walking SSGEI. e Generic warping field from walking to intermediate SSGEI. f SSGEI transformed from (d). g Subtraction of original SSGEIs (a) and (d). h Subtraction of transformed SSGEIs (c) and (f). i Visualization of the attenuation field. In (i), a brighter value indicates more attenuation. Most of the arm regions as well as partial front and back leg regions are highly attenuated, which is consistent with our intuition that the transition of arm swings and leg motions between walking and running modes is highly dependent on the subject

Given a pair of running and walking SSGEIs (Fig. 6a and d), they are transformed into intermediate SSGEIs (Fig. 6c and f) using the generic warping fields from running (Fig. 6b) and from walking (Fig. 6e). Because of pose differences between the original running and walking SSGEIs, there are relatively large residuals (Fig. 6g) in both examples. By transforming them with the generic warping fields (Fig. 6b and e) as well as the attenuation field (Fig. 6i), the residuals are significantly reduced (Fig. 6h), which illustrates the effectiveness of the morphing techniques.

5.5 Feature comparison

In this section, three features, key-frame, GEI, and SSGEI, were tested with OUTD-A before the postprocessing steps of Gabor filtering and metric learning were applied for all three cases, within-walking, within-running, and cross-mode cases. We evaluated the accuracies in identification and verification scenarios using the rank-1 identification rate and equal error rate (EER) of the false acceptance rate (FAR) and false rejection rate (FRR), respectively.
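For reference, the two measures used above can be computed from a probe-by-gallery dissimilarity matrix roughly as follows. This is a generic sketch, not the evaluation code used in the paper: the function names are introduced here, and the EER is approximated by sweeping the acceptance threshold over the observed scores.

```python
import numpy as np

def rank1_rate(dist, probe_ids, gallery_ids):
    """Fraction of probes whose nearest gallery sample has the same identity."""
    dist = np.asarray(dist, dtype=float)
    nearest = np.argmin(dist, axis=1)
    return float(np.mean(np.asarray(gallery_ids)[nearest] == np.asarray(probe_ids)))

def equal_error_rate(dist, probe_ids, gallery_ids):
    """Approximate EER: the point where FAR and FRR are closest to each other."""
    dist = np.asarray(dist, dtype=float)
    genuine_mask = np.asarray(probe_ids)[:, None] == np.asarray(gallery_ids)[None, :]
    genuine = dist[genuine_mask]
    impostor = dist[~genuine_mask]
    best_gap, eer = np.inf, 1.0
    for t in np.unique(dist):
        far = np.mean(impostor <= t)  # impostor pairs wrongly accepted
        frr = np.mean(genuine > t)    # genuine pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```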

First, the accuracies for both verification (EER with and without z-normalization [33]) and identification scenarios in the within-walking case are shown in the left columns of Table 2(a). Because the within-walking case contains various speed changes between the probe and gallery, GEI performs the worst as it is very sensitive to the walking speed change. Key-frame yields the second-best accuracy, as this method uses frames at the single-support phases, which are insensitive to speed changes, but it is less stable at the same time. In contrast, the proposed SSGEI feature achieves the best accuracy for both the verification and identification scenarios.

Table 2 Overall rank-1 identification rate (denoted as Rank-1)[%], EER with and without z-normalization (denoted as z-EER and EER)[%] of key-frame, GEI, and SSGEI over all combinations of speeds in the probe and gallery for all three cases

Similarly, we show the accuracies in the within-running case in the right columns of Table 2(a). Because the running speed variation in OUTD-A (from 8 km/h to 10 km/h) is smaller than the walking speed variation (from 2 km/h to 7 km/h) and most subjects increase their speed by shortening their gait periods rather than widening their stride length, appearance variation caused by speed changes is limited. The stability is therefore more important than the speed invariance in the within-running case. Hence, the key-frame method performs worse than GEI in the within-running case because it sacrifices the stability by aggregating only two single-support frames. Finally, the proposed SSGEI still yields the highest accuracy among the three features.

As for the cross-mode case, to demonstrate the effectiveness of the morphing process, we compared the above-mentioned three features with and without morphing. As shown in Table 2(b), all three features perform poorly without morphing because of the large pose differences between walking and running. After the morphing procedure, the accuracies of the three features significantly improve, and the morphed SSGEI achieves the best performance.

For a more intuitive understanding, we present typical examples of the above six features in Fig. 7 with a pair of true and false matches and their corresponding subtraction images. The subtraction images illustrate that the morphing procedure greatly reduces the differences between walking and running features. However, morphing a GEI by a generic warping field across various speeds does not work well, because the GEI itself is highly affected by the speed changes, which badly affects the generation of the generic warping field. Although a morphed key-frame reduces the residuals better than a morphed GEI in this visualization example, its accuracy is still the worst of the three morphed features because of its low stability. As a result, only the morphed SSGEI achieves the true match here because of its good trade-off between speed invariance and stability as well as the generic warping field.

Fig. 7

Matching examples of six features in the cross-mode case. The gallery and probe speeds are 7 km/h and 8 km/h, respectively. (a) Probe. (b) False match in the gallery (imposter). (c) True match in the gallery (genuine). (d) Subtraction image for the false match. (e) Subtraction image for the true match. The red bounding box indicates the subtraction with a smaller Euclidean distance

5.6 Contributions of individual components

To confirm the contributions of the individual components of the proposed method, we compared the proposed method with methods excluding individual components: SSGEI + metric learning (excluding Gabor filtering), Gabor-GEI + metric learning and Gabor-key-frame + metric learning (both excluding SSGEI), and Gabor-SSGEI (excluding metric learning). We further combined the proposed SSGEI with a state-of-the-art deep learning-based method, i.e., Local @ Bottom (LB) [45], to compare its contribution with that of the traditional metric learning employed in the proposed method. Considering the limited training samples in OUTD-A, we fine-tuned a model of LB pre-trained on the OU-ISIR Large Population Dataset (OULP) [10], which is one of the largest existing gait datasets and contains over 4,000 subjects with view variations. Data augmentation was not applied to the training samples, because the performance improvement over a network trained without data augmentation is not obvious, as reported in [45]. To make a fair comparison, we fine-tuned a separate network for each of the within-walking, within-running, and cross-mode cases. In the testing stage, gait mode classification was first applied to the pair of probe and gallery SSGEIs, which were then fed into the corresponding network according to the classification results.

We report the accuracies of the above methods for OUTD-A with receiver operating characteristic (ROC) curves with z-normalization and cumulative matching characteristics (CMC) curves. While the ROC curve shows the trade-off between FAR and FRR as the acceptance threshold changes, the CMC curve indicates the rates at which the genuine subjects are included within each rank. We also report the rank-1 identification rate from the CMC curve, and the EER with and without z-normalization from the ROC curve.
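The CMC curve and the score normalization can be illustrated with the short sketch below. It assumes that z-normalization standardizes each probe's dissimilarities over the gallery (the precise definition follows [33]) and that every probe has at least one genuine sample in the gallery; the function names are introduced here for illustration only.

```python
import numpy as np

def z_normalize(dist):
    """Standardize each probe's row of dissimilarities to zero mean, unit variance."""
    dist = np.asarray(dist, dtype=float)
    mean = dist.mean(axis=1, keepdims=True)
    std = dist.std(axis=1, keepdims=True)
    return (dist - mean) / np.where(std > 0, std, 1.0)

def cmc_curve(dist, probe_ids, gallery_ids):
    """cmc[k-1] is the fraction of probes whose genuine subject is within rank k.

    Assumes every probe identity appears at least once in the gallery.
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)          # gallery indices, nearest first
    ranked_ids = gallery_ids[order]           # gallery identities in ranked order
    hits = ranked_ids == probe_ids[:, None]
    first_hit = hits.argmax(axis=1)           # 0-based rank of the first genuine match
    ranks = np.arange(dist.shape[1])
    return (first_hit[:, None] <= ranks[None, :]).mean(axis=0)
```

Because the per-probe normalization is monotonic within each row, it leaves the CMC curve unchanged and only affects threshold-based measures such as the ROC curve and the EER.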

First, we show the accuracies in the within-walking and within-running cases. The ROC curves with z-normalization and the CMC curves for two pairs of speeds, i.e., 7 km/h gallery versus 2 km/h probe for the within-walking case and 8 km/h gallery versus 10 km/h probe for the within-running case, are shown in Fig. 8. The overall accuracies for all speed combinations of the within-walking and within-running cases are also provided in Table 3(a). In the within-walking case, the proposed method yields the best performance in both identification and verification scenarios, which indicates that the individual components substantially contribute to the proposed method. In the within-running case, the proposed method achieves the best accuracy as a whole and yields the second-best EER with z-normalization, which is still a sufficiently low error (0.4%). The deep learning-based framework does not, however, improve the performance compared with the traditional yet effective metric learning method. This is understandable because the dataset we used, although it contains the largest speed variations, is quite small, which easily leads to overfitting of deep learning models. Nonetheless, LB still achieves competitive results in the verification scenarios.

Fig. 8

ROC and CMC curves in OUTD-A for within-walking and within-running cases (left: 7 km/h gallery versus 2 km/h probe, right: 8 km/h gallery versus 10 km/h probe) to analyze individual component contributions

Table 3 Overall rank-1, z-EER, and EER [%] for all speed combinations for the three cases in OUTD-A to analyze the individual component contributions. Metric learning is denoted as ML

Next, we evaluate the accuracies of the cross-mode case. Because the morphing process is an additional important component for the cross-mode case, we added the morphing to the above benchmarks and also prepared Gabor-SSGEI + metric learning (excluding morphing) as another benchmark. To demonstrate the effectiveness of the attenuation field, we also report the results of Gabor-SSGEI + morphing (w/o AF) + metric learning (excluding the attenuation field from the morphing procedure). The ROC curves with z-normalization and the CMC curves for both pairs of speeds in the cross-mode case, i.e., 8 km/h running gallery versus 7 km/h walking probe and 10 km/h running gallery versus 2 km/h walking probe, are shown in Fig. 9, while the overall accuracies for all speed combinations of the cross-mode case are provided in Table 3(b). The proposed method still achieves the best overall performance, as in the within-walking and within-running cases, and Gabor-SSGEI + morphing (w/o AF) + metric learning yields the second-best performance, which shows that mitigating subject-dependent residuals with the attenuation field is necessary for the proposed generic warping between walking and running modes. In contrast, if we exclude the morphing component, the rank-1 identification rate and the EERs with and without z-normalization degrade significantly, from 81.0%, 6.9%, and 12.9% for the proposed method to 58.1%, 15.3%, and 24.8% for Gabor-SSGEI + metric learning (excluding morphing), respectively. This demonstrates that not only the SSGEI, Gabor filtering, and metric learning components but also the additional morphing component makes a considerable contribution to the high accuracy of the proposed method.

Fig. 9

ROC and CMC curves in OUTD-A for the cross-mode case (left: 8 km/h running gallery versus 7 km/h walking probe, right: 10 km/h running gallery versus 2 km/h walking probe) to analyze individual component contributions

5.7 Comparison with state-of-the-art methods

5.7.1 CASIA-C

In this section, the proposed method is compared with three recent benchmark methods of speed-invariant gait recognition for which results on CASIA-C have been reported, i.e., DCM [17], RSM [5], and the mutual subspace method using divided area (MSM-DA) [11]. We first compare the rank-1 identification rates for each combination of the three walking speeds fs, fn, and fq with those of DCM, which also used 120 subjects for the testing set, in Table 4. The proposed method clearly outperforms DCM for all combinations, particularly for large speed changes (e.g., fq versus fs). Next, following the experimental protocol in [5], we evaluated pairs of gallery fn versus probe fs, fn, and fq to compare the results with DCM, RSM, and MSM-DA as well as the baseline (i.e., simply using GEI), as shown in Fig. 10 (Footnote 3). Although it is difficult to make a fair comparison between RSM and the other benchmarks because of the slight difference in gallery size, RSM, MSM-DA, and the proposed method achieve almost saturated accuracies (approximately 100% rank-1 identification rate) for all cases.

Table 4 Rank-1 identification rates [%] of DCM [17] (before slash) and the proposed method (after slash) for each combination of walking speeds fs, fn, and fq on CASIA-C
Fig. 10

Rank-1 identification rates [%] of benchmark algorithms for gallery speed fn on CASIA-C. Here, fs, fn, and fq represent the probe speeds. Note that RSM [5] uses all 153 subjects for the test set because it does not require auxiliary training subjects, while the other four methods use 120 subjects for the testing set

5.7.2 OUTD-A

In this section, the proposed method is compared with additional state-of-the-art methods of speed-invariant gait recognition, i.e., the hidden Markov model (HMM)-based approach [23], stride normalization (SN) [41], speed transformation model (STM) [27], DCM [17], RSM [5], MSM [12], MSM-DA [11], and the state-of-the-art deep learning-based method, i.e., LB [45] using GEI on OUTD-A. For LB, we applied two strategies, i.e., separately fine-tuned different models for within-walking, within-running, and cross-mode cases (LB-sep), and fine-tuned a unified model using all the GEIs regardless of the gait mode variations (LB-uni). Although some of the benchmarks employed different data sets, the number of subjects and speed difference are almost consistent with those in OUTD-A and hence we also follow the same setting as suggested in [5, 27] to make a comparison that is as fair as possible.

More specifically, HMM was evaluated with a different gait data set whose walking speed pair is 3.3 km/h and 4.5 km/h, and hence the other methods were compared using the matching results between 3 km/h and 4 km/h. To compare with SN, which also employed a different gait data set whose walking speed pair is 2.5 km/h and 5.8 km/h, we chose the matching results between 2 km/h and 6 km/h for the other methods.

Results are shown in Table 5. In addition, Table 7 (Footnote 4) lists the rank-1 identification rates averaged over all combinations of speeds in the within-walking, within-running, and cross-mode cases for the last seven methods in Table 5. Because RSM only provided results for the cross-mode case with gallery speeds from 2 km/h to 7 km/h and probe speeds from 8 km/h to 10 km/h, the averaged rank-1 identification rate of the cross-mode case is computed over these 36 combinations. Moreover, the rank-1 identification rates of the 81 combinations of all walking and running speeds for the proposed method are reported in Table 6.
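As a rough illustration of how a 9 × 9 table of rank-1 rates over gallery and probe speeds can be averaged per case, consider the sketch below. It assumes walking speeds of 2 to 7 km/h and running speeds of 8 to 10 km/h, counts the cross-mode case in both gallery/probe directions, and uses a hypothetical input table rather than any reported values.

```python
import numpy as np

speeds = list(range(2, 11))                    # 2..10 km/h, nine speeds in total
walking = np.array([s <= 7 for s in speeds])   # 2-7 km/h treated as walking
running = ~walking                             # 8-10 km/h treated as running

def case_masks():
    """Boolean 9x9 masks over (gallery speed, probe speed) pairs for each case."""
    g_walk, p_walk = walking[:, None], walking[None, :]
    g_run, p_run = running[:, None], running[None, :]
    return {
        "within-walking": g_walk & p_walk,
        "within-running": g_run & p_run,
        "cross-mode": (g_walk & p_run) | (g_run & p_walk),
    }

def averaged_rank1(rank1_table):
    """Average a 9x9 rank-1 table (rows: gallery, columns: probe) within each case."""
    table = np.asarray(rank1_table, dtype=float)
    return {name: float(table[mask].mean()) for name, mask in case_masks().items()}
```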

Table 5 Rank-1 identification rates [%] of the benchmark algorithms for small (3 and 4 km/h) and large (2 and 6 km/h) speed changes on OUTD-A
Table 6 Rank-1 identification rates [%] of the proposed method for all 81 combinations of speeds on OUTD-A

In Tables 5 and 7, the proposed method achieves the second-best performance in the within-walking case, which is competitive with the best one (i.e., MSM-DA) considering the small number of test subjects in this dataset (i.e., 25 subjects). Although MSM-DA obtains the highest accuracies in the walking case by focusing on static parts that are less affected by walking speed variations, we point out that this method is unsuitable for extension to the cross-mode case, where both static parts and dynamic parts vary between the walking and running modes (see Fig. 2). In the within-running and cross-mode cases, the proposed method clearly outperforms the other algorithms, and even yields better results than the state-of-the-art deep learning-based method (i.e., LB) by approximately 10% with respect to the averaged rank-1 identification rate for the cross-mode case.

Table 7 Overall rank-1 identification rates [%] of the proposed method and other benchmarks for all three modes on OUTD-A

5.8 Evaluation of computational time

To evaluate the computational cost, MATLAB code of the proposed method was run on a PC with an Intel Core i7 4.00 GHz processor and 32 GB RAM. The training time of the generic warping field, the optimization time for the duration parameter p and metric learning, as well as the query time of each sequence are listed in Table 8. Although training the generic warping field takes a relatively long time, this process can be done offline beforehand. We further compare the computation time with that of RSM [5] in Table 9 for the within-walking case. Because of the different numbers of gallery sequences (Footnote 5) and machine specifications, we estimate the computation time of the proposed method under a comparable setting. The result illustrates that the computational cost of the proposed method is much lower than that of RSM, and hence the proposed method is more suitable for real applications.

Table 8 Computation time of the proposed method. Metric learning is denoted as ML
Table 9 Computation time [s] of the proposed method and RSM [5] in within-walking case

5.9 Effect of number of gait mode classes

To evaluate the effect of the number of gait mode classes, we tested the performance of the proposed method by classifying the gait mode into three classes, i.e., slow-walking (from 2 km/h to 4 km/h), fast-walking (from 5 km/h to 7 km/h), and running (from 8 km/h to 10 km/h). Considering the effectiveness of the classifier using SSGEI and the gait period T [frames] reported in Section 5.3, we first adopted SSGEI concatenated with the gait period T [frames] to classify the walking and running modes, and then used GEI and the gait period T [frames] to classify slow-walking and fast-walking, which exhibit obvious changes in the dynamic parts affected by the speed variation. The results in Table 10 show that the classification accuracy degrades when three classes are used, which is understandable because the difficulty of classification rises as the number of classes increases.

Table 10 Correct classification rates [%] for each subset by classifying the gait mode into two classes and three classes

The performance in both verification and identification scenarios using three gait modes and two gait modes is compared in Table 11. For a fair comparison, the results of three gait modes in the within-walking case are computed as the overall performance of within slow-walking, within fast-walking, and slow-walking versus fast-walking, while the results in the case of walking versus running are computed over both slow-walking versus running and fast-walking versus running. As shown in Table 11, the performance of three gait modes in the within-walking case is worse than that of two modes, which is mainly caused by the misclassification of gait modes. On the other hand, using three gait modes yields higher identification accuracy in the case of walking versus running, because the finer warping fields of three modes generate better transformation results than a general warping field between walking and running of two gait modes. Therefore, there is a trade-off between fine warping fields and difficulties in the gait mode classification when choosing the appropriate number of gait mode classes.

Table 11 Overall rank-1, z-EER, and EER [%] of using two gait mode classes and three gait mode classes for within-walking, within-running, and walking vs. running cases

6 Conclusion

This paper presented a framework for speed-invariant gait recognition using a speed-invariant and stable gait representation called SSGEI. To realize a good trade-off between speed invariance and stability, SSGEI is computed by aggregating multiple frames over the optimal duration around single-support phases, which is chosen by maximizing the Fisher ratio using a training set. For the challenging cross-mode case, SSGEI is further morphed into intermediate poses between walking and running using an FFD-based generic warping field across the population as well as an attenuation field based on the trash bin concept to suppress subject-dependent residuals. For better performance, Gabor filters and metric learning are combined with SSGEI as postprocessing steps. Comprehensive experiments using two publicly available gait data sets, CASIA-C and OUTD-A, demonstrated the effectiveness and efficiency of the proposed method.

In this work, we applied the proposed SSGEI only to speed-invariant gait recognition. Because the static part enhancement of the SSGEI may also be effective for other covariates in gait recognition (e.g., the forward-backward arm swing observed from a side view may not be observed from a frontal view), a future direction is to evaluate the accuracy of gait recognition under other covariates using the SSGEI. On the other hand, the static parts may be more affected than the dynamic parts for some covariates such as clothing and carrying status, and another future research avenue is therefore to seek a gait representation that highlights the dynamic parts, in contrast to the proposed SSGEI. Additionally, although the performance in the within-walking and within-running cases seems to be saturated, cross-mode gait recognition still requires more exploration. Rather than generating a generic warping field across the population, we plan to extend the method to a subject-dependent deep learning-based framework after sufficient data are collected, which may help improve the accuracy in the cross-mode scenario.