1 Introduction

Gait recognition is a computer vision task that recognizes or identifies people from their body shape and walking style using remotely acquired biometric data. The gait style contains the behavioral and physical characteristics of a person [1]. Compared to physical biometrics such as face, fingerprint, iris, and voice, gait (1) can be captured without any user cooperation or a dedicated scanner, (2) remains effective with long-distance and low-resolution data, and (3) is hard to fake. Nowadays, the widespread use of cameras for video surveillance makes gait recognition a useful tool for 'real world' applications such as social security, crime prevention, and forensic identification.

Despite these advantages, gait is not as robust as other popular physical biometrics. Recognition performance drops drastically due to (1) the inconspicuous inter-class differences between different people, and (2) the large intra-class variations of the same person, such as view angle, carrying, and clothing conditions [2]. To deal with these problems, two kinds of gait recognition approaches have been studied: model-based and appearance-based. Model-based approaches rely on human body structure and movements. They generally fit a human model to the input image and then extract movement patterns. Such approaches are robust against view-angle and appearance-based variations; however, their computational cost is high, and they usually rely on multiple high-resolution cameras [10]. Appearance-based approaches, on the other hand, extract discriminative features directly from the input image; they do not need to fit a model and do not require high-resolution images [30]. Therefore, these approaches are more suitable for outdoor surveillance. Numerous studies on appearance-based gait recognition have focused on representations that compress a gait cycle into one image, e.g., the Gait Energy Image (GEI) [3], Chrono Gait Image (CGI) [4], Gait Entropy Image (GEnI) [5], Gait Flow Image (GFI) [6], Frame Difference Energy Image (FDEI) [7], and Period Energy Image (PEI) [8]. In the literature, these representations are frequently used as input data for deep learning networks. GEI is the most popular feature among them, owing to its low computational cost, easy implementation, and high efficiency. Wu et al. [14] proposed a deep convolutional neural network (CNN) that automatically learns the most discriminative gait features and predicts the similarity between a pair of GEIs. Wang et al. [45] performed gait recognition based on the relationship between GEIs and various parts of the human body, horizontally segmenting the output of non-local neural networks into three sections. Elharrouss et al. [46] utilized multi-task convolutional neural network models and GEIs to estimate the view angle and recognize the gait. Silhouettes and optical flows [12] are also frequently used as input data for feature extraction in many CNN-based network architectures. Hou et al. [47] proposed a network named Gait Lateral Network (GLN), which learns both discriminative and compact representations from silhouettes, and Castro et al. [48] proposed AttenGait, a gait recognition model equipped with trainable attention mechanisms that automatically discover informative areas of the optical flow data. Recently, multi-modal network structures [15,16,17] for gait recognition have achieved remarkable results; these structures aim to build a richer and more compact gait representation by combining or fusing information from several input modalities. However, recognition rates still drop drastically under cross-view angle and appearance-based variations, especially on the CASIA-B dataset. For this reason, in this study, we search for the modalities that combine best with GEI on three different CNN architectures: EfficientNet-B0 [20], MobileNet-V1 [21], and ConvNeXt-base [49]. The main contributions of our study are as follows:

  • We evaluate the performance of different modalities, including silhouettes, optical flows, and the concatenated image of the GEI head and leg regions, to determine the best combination with GEI, and we also evaluate the leg and head regions separately.

  • We evaluate three different multi-modal CNN architectures, based on EfficientNet-B0, MobileNet-V1, and ConvNeXt-base, to determine which best improves the performance of the modality combinations.

  • We examine the success of twelve different cases under cross-view angle and appearance-based variations, namely carrying and clothing conditions. We also consider the identical-view case.

  • We investigate how fusing modality features can improve the gait recognition rate under cross-view angle and appearance-based variations.

  • We evaluate the robustness of the last modality, the concatenated image of the GEI head and leg regions, and of its combination with GEI, particularly against appearance-based variations.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed method. Experiments and results are presented in Section 4. Finally, the last section concludes the paper.

2 Related work

Gait recognition approaches can be grouped into two main categories: model-based and appearance-based. Model-based approaches first model the human body using 2D or 3D body structures and then generate discriminative features by utilizing this model [22,23,24]. For example, Ariyanto and Nixon [22] used a structural model including articulated cylinders with 3D degrees of freedom (DoF) at each joint to model the human lower legs. Feng et al. [24] used human body joint heatmaps extracted from an RGB image, instead of a binary silhouette, to describe the human body pose; the heatmaps were fed into a long short-term memory (LSTM) network to extract temporal features. Liao et al. [25] recently proposed PoseGait, which exploits the 3D human pose estimated from RGB images by a CNN as the input feature for gait recognition. Model-based approaches are robust to cross-view angle variations and appearance-based variations such as carrying and clothing conditions. However, 2D and 3D human body modeling is complex and also requires video of sufficient quality for the model to be fitted well.

Appearance-based approaches generally aim to extract gait representations directly from raw input data. These approaches can be further divided into three main categories. The first category proposes spatio-temporal templates; these approaches encode or compress a gait cycle into one image [3,4,5,6,7,8]. The most popular template in this category is the GEI [3]. The second category aims to extract discriminative features from original silhouettes [11, 13, 26]. Chao et al. [9] proposed a network named GaitSet that extracts frame-level and set-level features from independent silhouettes. Fan et al. [1] presented a temporal part-based framework that consists of two designed components: a Frame-level Part Feature Extractor (FPFE) and a Micro-motion Capture Module (MCM); the FPFE learns part-level spatial features, while the MCM models local micro-motion features and the global understanding of the entire gait sequence. Zhang et al. [10] proposed an effective spatial–temporal feature learning model with an LSTM attention mechanism based on horizontally divided silhouette images for gait recognition. The last category consists of generative approaches, which often transform representations of gait captured under different view angles or conditions into a common view angle or condition [29]. Yu et al. [27] employed Stacked Progressive Autoencoders (SPAE) to transform a GEI from any given view angle and condition to the same view angle and normal walking condition. Another study developed for a similar purpose used generative adversarial networks (GANs) [28]. He et al. [8] proposed multi-task generative adversarial networks (MGANs) to transform view-specific gait features to another view, based on the assumption that gait images under view variations lie on a low-dimensional manifold [30].

Recently, several approaches have fused features from different modalities simultaneously for recognition. For example, Hoffman et al. [15] combined information from a visual RGB image sequence, a depth image sequence, and a four-channel audio stream for multi-modal gait recognition. Castro et al. [16] proposed a unified approach for using audio, depth, and visual information for gait, gender, and shoe type recognition. Zhao et al. [31] presented a multi-modal network, mmGaitSet, which uses GaitSet as the backbone to mine shape-based body features from gait silhouettes and pose-based part features from 2D pose heatmaps. More recently, Jimenez et al. [17] proposed UGaitNet, a multi-modal network that handles and combines four modalities for gait recognition: optical flow, grayscale, depth, and silhouettes. In addition, this network is robust to missing modalities.

In this study, in addition to the optical flow and silhouette modalities most frequently used for gait recognition, the concatenated image of the GEI head and leg regions is obtained as another modality, and the performance of combining GEI with each of these modalities under cross-view angle and appearance-based variations is investigated separately. Furthermore, the individual efficacy of the GEI head and leg regions is assessed. This investigation is carried out on three distinct multi-modal CNN architectures based on EfficientNet-B0 [20], MobileNet-V1 [21], and ConvNeXt-base [49].

3 Proposed methods

In the proposed methods, rather than relying solely on a single modality, we utilize a combination of several modalities. We investigate the impact of different modality combinations with GEI on recognition success. To ensure a fair comparison, we evaluate these different modality combinations across various multi-modal deep CNN architectures, including EfficientNet-B0, MobileNet-V1, and ConvNeXt-base.

GEI is selected as the primary modality due to its frequent usage in the literature and its high efficiency. Subsequently, the recognition performance of combining GEI with the input data most frequently used after GEI, namely silhouettes and optical flows, as well as with the concatenated image of the GEI head and leg regions, is examined across the multi-modal CNN architectures. Additionally, the combination of the head and leg regions alone is evaluated. The implementation details are presented in the remainder of this section.

3.1 Utilized modalities

In this section, we introduce the gait representation modalities that are fed directly into the network architectures: GEIs, silhouettes, optical flows, and the concatenated image of the GEI head and leg regions.

A GEI [3] is obtained by averaging properly aligned silhouettes of gait sequences. Given the preprocessed binary gait silhouette images \({B}_{t}(x,y)\) at time \(t\) in a sequence, the gray-level GEI is defined as Eq. (1):

$$G\left(x,y\right)=\frac{1}{n}\sum_{t=1}^{n}{B}_{t}(x,y)$$
(1)

where \(n\) is the total number of frames in the complete cycle(s) of a silhouette sequence, \(t\) is the frame index in the sequence, and \(x\) and \(y\) are coordinates in the 2D image plane. Some GEI examples from the CASIA-B [34] dataset with cross-view and appearance-based variations are shown in Fig. 1a.
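For clarity, the averaging in Eq. (1) can be sketched in a few lines of Python/NumPy. This is a minimal illustration assuming the silhouettes are already aligned and stacked into a single array; the function and variable names are ours, not from a released implementation.

```python
import numpy as np

def compute_gei(silhouettes):
    """Eq. (1): average a stack of aligned binary silhouettes into one GEI.

    silhouettes: array of shape (n, H, W) with values in {0, 1} (or {0, 255}).
    Returns a gray-level GEI of shape (H, W).
    """
    frames = np.asarray(silhouettes, dtype=np.float32)
    if frames.max() > 1.0:          # tolerate 0/255 binary masks
        frames = frames / 255.0
    return frames.mean(axis=0)      # G(x, y) = (1/n) * sum_t B_t(x, y)
```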

Fig. 1

Four different modalities used for the inputs of the network architectures, obtained from the CASIA-B [34] dataset. a Some GEI examples, b Obtaining the silhouettes (SHs), c Obtaining the optical flows (OFs), d Concatenated image of the GEI head and leg regions (HConL)

The other modality, the silhouette, is the most frequently used input data in state-of-the-art methods [1, 9, 10]. A walking subject occupies only a small portion of a frame, and the effects of the background and of the distance changes caused by different view angles need to be reduced. For this reason, silhouettes taken directly from the CASIA-B [34] dataset are aligned following the method of Takemura et al. [35]. Some examples of silhouette cropping are presented in Fig. 1b.
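The exact alignment follows Takemura et al. [35]; as a rough illustration of the kind of preprocessing involved, the sketch below crops a binary silhouette to the body's vertical extent, centers it horizontally on its mass center, and resizes it. The output size and the square crop are illustrative assumptions, not the settings of [35].

```python
import cv2
import numpy as np

def align_silhouette(sil, out_size=224):
    """Simplified silhouette alignment: vertical crop, horizontal centering, resize."""
    ys, xs = np.nonzero(sil)
    if len(ys) == 0:                               # empty mask: return a blank image
        return np.zeros((out_size, out_size), dtype=sil.dtype)
    body = sil[ys.min():ys.max() + 1, :]           # keep only rows containing the body
    h = body.shape[0]
    cx = int(round(xs.mean()))                     # horizontal center of mass
    half = h // 2                                  # make a square crop around cx
    padded = np.pad(body, ((0, 0), (half, half)))  # pad so the crop never leaves the image
    crop = padded[:, cx:cx + 2 * half]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_NEAREST)
```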

Using optical flow as input data to CNNs has demonstrated high performance in action recognition [36], and the same holds for gait recognition [12]. Therefore, the Farneback method [37] is utilized to obtain the third modality, optical flow. The optical flow is calculated between two consecutive RGB gait frames, and the final image is aligned in the same way as the silhouettes [35]. The process of computing the optical flow is detailed in Fig. 1c.
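As an illustration, dense optical flow between two consecutive frames can be computed with OpenCV's implementation of the Farneback method [37]. The parameter values below are commonly used OpenCV defaults and not necessarily the settings of our experiments.

```python
import cv2

def farneback_flow(frame_prev, frame_next):
    """Dense optical flow between two consecutive RGB gait frames."""
    prev_gray = cv2.cvtColor(frame_prev, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(frame_next, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Convert the (dx, dy) field to magnitude/angle, e.g. for building a flow image.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return flow, mag, ang
```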

Gait recognition by dividing the human body into horizontal parts has been used in previous studies. Rida et al. [50] partitioned the human body into 4 parts based on group Lasso of motion to select the most discriminative body parts. Rokanujjaman et al. [51] divided the human body into 5 parts and asserted the importance of the head, waist, and leg regions according to their positive or negative effects on recognition. Zhang et al. [10] partitioned the human body into four horizontal parts and trained a separate CNN for each local part. Moreover, various previous studies have shown that the GEI leg region is highly distinctive for recognition. Choudhury et al. [38] argued that, compared to other parts of the human body, the limb region of the GEI better captures discriminative information and is least affected by most carrying conditions and clothing variations. Bashir et al. [39] proposed the GEnI to distinguish the dynamic gait information from the static shape information contained in a GEI; in that study, the most dynamic areas in the GEI images with the feature selection mask applied are clearly the arm and leg regions. In another study [40], an adaptive outlier detection method was proposed to mitigate the impact of clothing on human gait recognition by detecting and eliminating clothing effects on silhouettes; the resulting images show that the leg regions remain discriminative. Considering variations in clothing and carrying, and based on these previous studies, only two parts of the human body are selected: the head and leg regions. This selection takes into account the high distinctiveness of the leg region and the low probability of the head region being completely covered. These GEI regions are then concatenated into a single image to improve robustness under clothing and carrying condition variations. The concatenation of the GEI head and leg regions is shown in Fig. 1d.
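A minimal sketch of how the HConL image could be formed is given below. The head and leg row fractions are illustrative assumptions chosen for this example; the exact boundaries used in our experiments follow the description above and Fig. 1d.

```python
import numpy as np

def build_hconl(gei, head_frac=0.15, leg_frac=0.40):
    """Concatenate the head and leg regions of a GEI into one image (HConL).

    head_frac / leg_frac: fractions of the image height taken from the top
    (head) and from the bottom (legs). These values are illustrative only.
    """
    h = gei.shape[0]
    head = gei[: int(round(h * head_frac)), :]
    legs = gei[h - int(round(h * leg_frac)):, :]
    return np.concatenate([head, legs], axis=0)   # stack the two regions vertically
```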

When considering the modalities evaluated in Fig. 1, GEI is one of the most effective features in gait recognition, as it encodes the entire walking cycle in a single image. While several studies [32, 33] have shown its efficiency and stability, it is important to acknowledge the information loss inherent in the averaging process. Silhouettes (SHs), although commonly used in gait recognition, are susceptible to variations in body shape, clothing, and environmental factors [55] such as changes in illumination and dynamic backgrounds. Optical flow (OF), on the other hand, describes the motion between video frames, which facilitates recognition; however, since it offers a global view of the human body, like GEI and SHs, it is also likely to be affected by appearance-based variations. Lastly, the concatenated image of the GEI head and leg regions (HConL) offers a local approach, particularly enhancing robustness against variations in clothing and carrying conditions. Consequently, combining GEI with modalities that have distinct advantages and disadvantages holds promise for achieving a higher recognition rate.

3.2 Network structures

CNN architectures are among the most popular deep learning frameworks. They are designed specifically for images and achieve remarkable success by extracting spatio-temporal features with a relatively small number of parameters. EfficientNet, MobileNet, and ConvNeXt are CNN-based models that have been tested on the ImageNet [41] dataset and achieve high accuracy compared to state-of-the-art CNN models. While EfficientNet-B0 has demonstrated impressive performance in gait recognition [52] using RGB video frames, its evaluation on silhouette data remains a gap. Additionally, the success of MobileNet [53] and ConvNeXt [54] in action recognition has been discussed, but their performance in gait recognition has not yet been assessed. Consequently, for a comprehensive performance assessment, three distinct CNN-based network models, namely EfficientNet-B0, MobileNet-V1, and ConvNeXt-base, are selected for comparative analysis.

Figure 2 depicts the proposed gait recognition frameworks. Two different modalities are given to the two separate branches of the multi-modal network structures. These modality pairs, GEIs and SHs, GEIs and OFs, GEIs and HConLs, as well as the GEI leg and head regions separately, are shown in Fig. 2a, b, c, d, respectively. The branches consist of two fine-tuned CNN architectures with the same characteristics, which extract features from the modalities. The features obtained from the different modalities are then combined or fused by a concatenation (CON) process. During this process, two feature vectors of dimension \(n\) are vertically concatenated, resulting in a final feature dimension of \(2n\). In the network branches, EfficientNet-B0, MobileNet-V1, and ConvNeXt-base are used as the fine-tuned CNN network structures, respectively.
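The two-branch structure described above can be sketched with the Keras functional API as follows. This is a simplified sketch assuming MobileNet-V1 branches, single-image inputs replicated to three channels, and global-average-pooled branch features; the renaming loop is a common workaround for duplicate layer names when the same backbone is instantiated twice, and all names and the class count are placeholders rather than the exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNet

NUM_CLASSES = 74  # training identities of CASIA-B in our protocol

def make_branch(prefix):
    """One fine-tuned branch: ImageNet-pretrained MobileNet-V1 without the classifier."""
    base = MobileNet(weights="imagenet", include_top=False,
                     input_shape=(224, 224, 3), pooling="avg")
    base._name = f"{prefix}_mobilenet"            # avoid name clashes between branches
    for layer in base.layers:
        layer._name = f"{prefix}_{layer.name}"
    return base

def build_multimodal_model():
    inp_gei = layers.Input((224, 224, 3), name="gei")               # gray image replicated to 3 channels
    inp_mod = layers.Input((224, 224, 3), name="second_modality")   # SH, OF, or HConL branch input
    feat_gei = make_branch("gei")(inp_gei)
    feat_mod = make_branch("mod")(inp_mod)
    fused = layers.Concatenate(name="con")([feat_gei, feat_mod])    # n + n -> 2n fused features
    out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
    return Model([inp_gei, inp_mod], out)
```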

Fig. 2

Framework of proposed architectures. a GEIs and SHs b GEIs and OFs c GEIs and HConL d GEI Leg and GEI Head regions separately are passed through the multi-modal fine-tuned CNNs, and features are fused

To understand the impact on recognition performance, all modality combinations are provided to the deep CNN architectures in a multi-modal manner. Two separate convolution branches of each CNN network are utilized for feature extraction from the two modalities, and these features are then combined into a single feature vector. This approach allows us to train the two convolutional branches of the network simultaneously and is useful for evaluating the success of different modality combinations under cross-view and appearance-based variations. Moreover, the fusion process provides a richer and more compact gait representation [16]. Using different CNN architectures also allows us to simultaneously compare the performance of these networks for gait recognition.

For the transfer learning process, each branch is first initialized with the weights of the corresponding model pre-trained on the ImageNet dataset, and the network is then fine-tuned on the gait recognition dataset. Transfer learning enables more efficient training and higher performance because the backbones have already been trained on the rather large ImageNet dataset.

4 Experiments and results

4.1 Datasets and metric

4.1.1 CASIA-B

We chose the CASIA-B [34] gait dataset since it contains the original RGB video frames and a large number of video sequences, and it is also commonly used. It contains 124 subjects and 11 views (0º, 18º, ..., 162º, 180º) with a fixed 18-degree spacing. There are 10 sequences for each subject: 6 sequences of normal walking (NM), 2 sequences of walking with a bag (BG), and 2 sequences of walking in a coat (CL). In the experimental settings, the first 74 subjects under all conditions are used for training and the remaining 50 subjects for testing. In the test set, the first 4 sequences of the NM condition (NM #1–4) are regarded as the gallery, and the remaining 6 sequences, NM #5–6, BG #1–2, and CL #1–2, are used as probe sets, respectively, as shown in Table 1.

Table 1 Experimental settings on CASIA-B

4.1.2 Outdoor-gait

Outdoor-Gait [43] is a comprehensive dataset with complex outdoor backgrounds. It contains 138 people with 3 different clothing conditions (NM: normal, CL: with coat, BG: with bag) in 3 distinct scenes (SCENE-1: simple background, SCENE-2: static and complex background, SCENE-3: dynamic and complex background with moving objects). In the experiments, 69 subjects are used as the training set and the remaining 69 subjects are used as the test set. For each condition, there are a minimum of 2 video sequences in both the gallery and the probe sets.

4.1.3 Rank-1

In the experiments, we use average rank-1 accuracy to evaluate the effectiveness of the proposed models. Rank-1 accuracy is the proportion of probe sequences whose identity is correctly retrieved when each probe is compared against all sequences in the gallery (excluding the identical view). The average rank-1 accuracy is computed as the ratio of the sum of the rank-1 values over all probe view angles to the total number of views (e.g., 11 views for CASIA-B). The average rank-1 accuracy, denoted as \({C}_{A}\), is given in Eq. (2):

$${C}_{A}=\frac{1}{11}\sum_{a=0}^{10}{C}_{a}$$
(2)

where \({C}_{a}\) is the rank-1 accuracy at the view angle \(a\) (excluding identical view) [31].

4.2 Implementation details

Training

The inputs of the networks are six different modalities: GEIs, SHs, OFs, HConLs, and the GEI head and leg parts. They are fed into the networks as the pairs GEIs and SHs, GEIs and OFs, GEI head and GEI leg parts, and GEIs and HConLs. All modalities (the inputs of all networks) are resized to 224 × 224. Due to memory and time costs, the number of training samples for each of SH and OF is set to 30; therefore, 30 SHs and 30 OFs are randomly sampled from the SH and OF training sets, respectively.
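As a small illustration of this input preparation, the sketch below randomly samples up to 30 frames from a SH or OF sequence and resizes each to 224 × 224; the helper name and the fixed seed are ours.

```python
import cv2
import numpy as np

def prepare_inputs(frames, n_samples=30, size=224, seed=0):
    """Randomly sample up to n_samples frames and resize each to size x size."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=min(n_samples, len(frames)), replace=False)
    return np.stack([cv2.resize(frames[i], (size, size)) for i in np.sort(idx)])
```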

We use the Keras API of TensorFlow [44] for all experiments. The models are trained with Nvidia GeForce RTX 3060 GPUs, and the experimental environment is Windows 10. The Stochastic Gradient Descent (SGD) optimizer is used with a learning rate of 0.0001 and a momentum of 0.9. The output layer has a softmax activation function, and cross-entropy is used as the loss function.
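The training configuration described above corresponds roughly to the following Keras setup. The dummy data, batch size, and epoch count below are illustrative placeholders only, not the settings of the experiments.

```python
import numpy as np
import tensorflow as tf

model = build_multimodal_model()   # sketch from Section 3.2
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss="categorical_crossentropy",   # cross-entropy over identity labels
    metrics=["accuracy"],
)
# Dummy tensors standing in for the real modality inputs and one-hot identity labels.
x_gei = np.random.rand(8, 224, 224, 3).astype("float32")
x_mod = np.random.rand(8, 224, 224, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 74, size=8), num_classes=74)
model.fit([x_gei, x_mod], y, batch_size=4, epochs=1)
```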

Testing

In the test phase, the similarity between gallery and probe features is calculated using cosine similarity. The features obtained from GEIs and SHs (CNNGEI + CNNSH), GEIs and OFs (CNNGEI + CNNOF), GEIs and HConLs (CNNGEI + CNNHConL), and the GEI head and GEI leg parts (CNNHead + CNNLeg) are fused.
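A minimal sketch of the test-phase matching is shown below: fused gallery and probe features are L2-normalized, compared with cosine similarity, and rank-1 accuracy is the fraction of probes whose nearest gallery feature shares their identity. The names are ours; averaging over views as in Eq. (2) simply repeats this per probe angle.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Rank-1 identification with cosine similarity between fused feature vectors."""
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float32))
    p = l2_normalize(np.asarray(probe_feats, dtype=np.float32))
    sims = p @ g.T                                   # (num_probe, num_gallery) cosine similarities
    nearest = np.asarray(gallery_ids)[sims.argmax(axis=1)]
    return float((nearest == np.asarray(probe_ids)).mean())
```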

4.3 Analysis of different modality combinations

4.3.1 Experiments on the CASIA-B dataset

Experiments under NM variation

For a comprehensive comparison, all modality combinations are evaluated on three CNN architectures, specifically EfficientNet-B0, MobileNet-V1, and ConvNeXt-base. Additionally, performance evaluations of VGG16 [18] and ResNet-50 [19] are presented for the NM variation. When the difference between angles is small, under NM, GEI achieves a successful performance as a single modality. However, when there is a large difference between angles, the performance drops significantly. This single-modality performance is improved by the different modality combinations. The averaged rank-1 accuracies (%) of the CNNGEI + CNNSH, CNNGEI + CNNOF, CNNHead + CNNLeg, and CNNGEI + CNNHConL combinations for cross-view angles (excluding the identical views) are detailed in Table 2.

Table 2 Comparison of the CNNGEI + CNNSH, CNNGEI + CNNOF, CNNHead + CNNLeg (CNNH + CNNL), and CNNGEI + CNNHConL combinations in terms of averaged rank-1 accuracy (%) for cross-view angles (excluding the identical views), under NM variations

From Table 2, it can be seen that the CNNGEI + CNNOF combination has higher accuracy than the CNNGEI + CNNSH combination for almost all cross-view rank-1 results. This holds for all networks, namely VGG16, ResNet-50, EfficientNet-B0, MobileNet-V1, and ConvNeXt-base. When evaluating network performance, MobileNet often obtains the best results with both the SH and OF modalities, and this is also evident in the mean value. However, the recognition rate of ConvNeXt with the OF modality at the 0º and 180º view angles significantly exceeds that of MobileNet. When the part-based modality combinations are examined, the CNNHead + CNNLeg combination yields results similar to those of the CNNGEI + CNNSH and CNNGEI + CNNOF combinations. However, the CNNGEI + CNNHConL combination significantly enhances the performance of all networks except ConvNeXt, which achieves its highest performance with the CNNHead + CNNLeg combination. The comparison results are also shown in Fig. 3.

Fig. 3

Comparison of all combinations in terms of average rank-1 accuracy (%), under NM variations for all 11 views (excluding identical-view cases)

It can be observed from Fig. 3 that the CNNGEI + CNNHConL combination based on EfficientNet and MobileNet achieves the most successful results among all network-based combinations under the NM variation, followed by the CNNGEI + CNNOF combination. For ConvNeXt, however, the best result is obtained with the CNNHead + CNNLeg combination.

Experiments under BG and CL variations

In this section, all models are also tested under the BG and CL conditions. However, as under the NM condition in the previous section, VGG16 and ResNet-50 achieve rather poor recognition rates for these variations. Therefore, only the EfficientNet, MobileNet, and ConvNeXt recognition rates are presented for the BG and CL variations. The mean rank-1 accuracies (%) of the CNNGEI + CNNSH, CNNGEI + CNNOF, CNNHead + CNNLeg, and CNNGEI + CNNHConL combinations under BG and CL variations for cross-view angles (excluding the identical views) are shown in Table 3.

Table 3 Comparison of the CNNGEI + CNNSH, CNNGEI + CNNOF, CNNHead + CNNLeg (CNNH + CNNL), and CNNGEI + CNNHConL combinations in terms of averaged rank-1 accuracy (%) for cross-view angles (excluding the identical views), under BG and CL variations

It can be concluded from Table 3 that, under the BG variation, the CNNGEI + CNNOF combination exhibits higher performance than the CNNGEI + CNNSH combination across all networks for almost all cross-view rank-1 results. For these combinations, the performances of EfficientNet and MobileNet are similar; however, MobileNet achieves the highest mean rank-1 value in both. With the CNNHead + CNNLeg combination, the recognition success of all three networks increases significantly. Furthermore, with the CNNGEI + CNNHConL combination, these improvements continue, although the tendency is reversed for ConvNeXt. Under the CL variation, the recognition rates of the CNNGEI + CNNSH and CNNGEI + CNNOF combinations decrease significantly for all networks, whereas the part-based modality combinations increase the recognition rate slightly. In particular, the MobileNet-based CNNHead + CNNLeg combination reaches the highest average rank-1 value under the CL variation. The mean rank-1 comparison of each combination under all appearance-based variations, namely NM, BG, and CL, is presented in Fig. 4.

Fig. 4

The mean rank-1 comparison of each combination under NM, BG, and CL appearance-based variations. a CNNGEI + CNNSH b CNNGEI + CNNOF c CNNHead + CNNLeg d CNNGEI + CNNHConL

Examining Fig. 4, the CNNGEI + CNNSH combination yields the best results when based on MobileNet, and the same holds for the CNNGEI + CNNOF combination. For the CNNHead + CNNLeg combination, MobileNet and ConvNeXt perform similarly and achieve superior results compared to EfficientNet. The last combination, CNNGEI + CNNHConL, achieves the highest performance based on EfficientNet and MobileNet under the NM variation, while under the BG and CL variations it is superior only with MobileNet. Finally, the optimal outcome is attained with the MobileNet-based CNNGEI + CNNHConL combination for the BG variation and the MobileNet-based CNNHead + CNNLeg combination for the CL variation.

Comparison with the state-of-the-art methods

The above experiments have shown that combinations of different modalities achieve good performance, especially the CNNGEI + CNNHConL combination. We compare some of the proposed combinations with state-of-the-art methods on the CASIA-B dataset. For this purpose, we organize three comparison groups. The first group of comparisons is made with the state-of-the-art method GaitNet [43], which reports all cross-view angle recognition rates and uses the same experimental settings for the NM variation as in Table 1. From the proposed multi-modal networks, the CNNGEI + CNNHConL combinations based on EfficientNet (Eff + HConL) and MobileNet (Mobile + HConL) are selected for this comparison. The comparison results are presented in Table 4.

Table 4 Cross-view recognition rates of some proposed multi-modal architectures and GaitNet [43] on CASIA-B, under NM variations

The findings presented in Table 4 suggest that the proposed multi-modal networks, Eff + HConL and Mobile + HConL, exhibit performance levels that are very close and comparable to GaitNet. However, it should be noted that when the difference between the angles increases, this is not consistently reflected in the mean value. In general, compared to GaitNet, there is an improvement in performance between angles that are close to each other or symmetric (for example, a gallery angle of 0º and its symmetric 180º, or the reverse).

The second group of comparisons is made in terms of averaged rank-1 accuracy (%) (excluding the identical view) for all appearance-based variations (NM, BG, CL). The state-of-the-art methods are GEI + PCA [42], GEI-Net [13], DeepCNN [14], GaitNet, and PoseGait [25], which have the same experimental settings as in Table 1. From the proposed multi-modal networks, Eff + HConL and Mobile + HConL are chosen for the NM variation, Mobile + HConL for the BG variation, and the MobileNet-based CNNHead + CNNLeg combination (Mobile + H + L) for the CL variation, respectively. The comparison analyses are shown in Table 5.

Table 5 Comparison of some proposed multi-modal networks with cross-view methods on CASIA-B in terms of averaged rank-1 accuracy (%) (excluding the identical views), under all appearance-based variations

It is clear from Table 5 that the Eff + HConL and Mobile + HConL multi-modal networks achieve higher average recognition rates under NM conditions than GEI + PCA, GEI-Net, and the model-based method PoseGait. However, DeepCNN and GaitNet have the best average recognition results. The Mobile + HConL and Mobile + H + L combinations reach the best performance under the BG and CL variations, respectively.

The last comparison is made with MGANs [8], a generative method trained with the first 74 subjects under NM conditions. In Table 6, recognition rates are presented for the cross-view angles 54º, 90º, and 126º, as in MGANs [8]. According to the results in Table 6, the Mobile + HConL multi-modal network achieves the best average recognition rate for these angles.

Table 6 Comparison of some proposed multi-modal networks with MGANs [8] on CASIA-B, under NM variations

4.3.2 Experiments on the Outdoor-Gait dataset

Comparisons of the prepared modality combinations on the Outdoor-Gait dataset are presented in Table 7. From Table 7, it is clear that the CNNGEI + CNNHConL combination achieves the best performance on all networks. Moreover, the MobileNet-based CNNGEI + CNNHConL combination demonstrates the highest level of success compared to the other networks.

Table 7 Comparison of the proposed multi-modal networks on the Outdoor-Gait dataset in terms of averaged rank-1 accuracy (%) under all appearance-based variations

Comparison with the state-of-the-art methods

Among the prepared modality combinations, the MobileNet-based CNNGEI + CNNHConL (Mobile + HConL) and ConvNeXt-based CNNGEI + CNNHConL (ConvNeXt + HConL) combinations are compared with the appearance-based GaitNet and the model-based Human3D [56] methods in Tables 8 and 9, respectively.

Table 8 Comparison of some proposed multi-modal networks with the appearance-based GaitNet method on the Outdoor-Gait dataset in terms of averaged rank-1 accuracy (%) under all appearance-based variations
Table 9 Comparison of some proposed multi-modal networks with the model-based Human3D [56] method on the Outdoor-Gait dataset in terms of averaged rank-1 accuracy (%) under all appearance-based variations

When Table 8 is evaluated, it is seen that Mobile + HConL outperforms GaitNet in the CL variations, namely NM-CL, BG-CL, CL-NM, and CL-BG; in terms of the mean value, the results are comparable. In Table 9, the ConvNeXt-based CNNGEI + CNNHConL combination shows a superior recognition rate in the NM-NM, BG-BG, and CL-CL variations, although this is not observed in the mean value. However, the model-based Human3D method is more costly due to its 3D human body reconstruction process.

4.3.3 Computation and efficiency analysis

It is evident that, among the various combinations prepared in the preceding sections, the MobileNet-based combinations exhibit superior performance. MobileNet aims to maximize accuracy while respecting limited computational resources; it is therefore characterized by its small size, low power consumption, speed, and cost-effectiveness. Consequently, a cost comparison with the GaitSet and GaitNet methods is conducted for the prepared combinations based on multi-modal MobileNet. GaitSet has significantly improved the performance of gait recognition; however, it has a complex network architecture with a large number of parameters and floating-point operations (FLOPs): 2.59 M parameters, 8.6 G FLOPs, and a final feature dimension of 15872 [57]. Similarly, GaitNet utilizes a Fully Convolutional Network (FCN) [58] as its backbone network, with a high number of parameters. The comparison of these networks with the multi-modal MobileNet (mm-MobileNet) is shown in Table 10.
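For reference, the parameter count of a Keras model such as the multi-modal MobileNet sketch in Section 3.2 can be read directly from the model object; FLOPs, in contrast, are usually obtained with a profiler and are not shown here.

```python
model = build_multimodal_model()          # sketch from Section 3.2
print(f"Trainable and non-trainable parameters: {model.count_params():,}")
```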

Table 10 Cost analysis of the networks. Dim represents the dimension of the final feature

Examining Table 10, the number of parameters of mm-MobileNet is very close to that of GaitSet, but its FLOPs are approximately 8 times lower. Furthermore, its parameter count is considerably lower than that of FCN-based networks. Additionally, its final feature dimension, which is crucial for the testing phase, is notably lower than that of GaitSet.

5 Conclusion

Gait recognition approaches often focus on completing the recognition process with a single modality. In this study, we investigated the performance of using different modalities jointly for recognition. For this purpose, we combined the most common gait representations using multi-modal network architectures and obtained more robust and richer representations. We utilized four different modalities: the main modality GEI together with silhouettes, optical flows, and the concatenated image of the GEI head and leg regions (HConL), and evaluated the success of their combinations with GEI under cross-view angle and appearance-based variations. To obtain fair and reliable results, we used MobileNet-V1, EfficientNet-B0, and ConvNeXt-base as the branches of the multi-modal networks, and we also evaluated the success of these networks for gait recognition. We compared the best-performing proposed network combinations with some state-of-the-art methods on the CASIA-B and Outdoor-Gait datasets. Experimental results indicated that combinations of different input modality features perform differently depending on the cross-view angle and appearance-based variations. In particular, the combination of GEI with HConL based on MobileNet and ConvNeXt achieved remarkable recognition rates when the difference between angles is small, and the HConL modality increased recognition rates under appearance-based variations. Moreover, the cost of the multi-modal networks, particularly the MobileNet-based ones, is notably low.