1 Introduction

Gait is a complex function of body weight, limb lengths, skeletal structure and muscular activity. From a biomechanics point of view, human walking consists of synchronized, integrated movements of body joints and hundreds of muscles. Although these movements share the same basic bipedal pattern across all humans, they vary from one individual to another in several aspects, such as their relative timing, rhythmicity, and the magnitudes and forces involved in producing the movements. Since gait is largely determined by an individual's musculoskeletal structure, it is unique to each person. Cues such as walking speed, stride length, rhythm, bounce and swagger, as well as the physical lengths of human limbs, all contribute to unique walking styles [41].

Humans can identify individual people from movement cues alone. In a study of motion perception from a psychological point of view, Johansson used Moving Light Displays (MLD) and showed that the relative movements of certain joints in the human body carry information about personal walking styles and dynamics [35]. It turned out that humans can recognize in less than one second that MLD patterns correspond to a walking human. Cutting and Kozlowski [22] showed that friends can be recognized by their gait on the basis of reduced visual stimuli. Barclay et al. [8] showed that the identity of a friend and the person's gender can be determined from the movement of light spots alone. They investigated both temporal and spatial factors in gender recognition on the basis of data from point-light displays. They showed that, in the spatial domain, shoulder movement for males and hip movement for females are important factors in the recognition. Another finding is that the duration of the dynamic stimulus plays a crucial role in gender recognition.

Much research in biomechanics and clinical gait analysis is dedicated to studying the inter-person and intra-person variability of gait, primarily to determine normal vs. pathological ranges of variation. Several measures have been proposed to quantify the degree of deviation from normal gait, stratify the severity of pathology and measure changes in gait patterns over time [18]. Although the pioneering research on gait analysis was performed using 2D techniques, 2D analysis is now rarely used in clinical practice, and 3D methods are the standard. There are two main reasons for this: parallax errors and perspective errors [37].

In the last decade, gait recognition has been extensively studied as a behavioral biometric technique [17, 19, 20, 70]. Identity recognition on the basis of gait is non-invasive, can be done at a distance, is non-cooperative in nature, and is difficult to disguise. However, the recognition performance of existing algorithms is limited by the influence of a large number of nuisance or so-called covariate factors affecting both the appearance and the dynamics of gait, such as viewpoint variations, variations in footwear and clothing, changes of floor surface characteristics, various carrying conditions and so on. The vast majority of present approaches to gait recognition make use of human body silhouettes extracted from image sequences from a single camera as input [50, 76]. In addition to clothing and carrying variations, the view angle is found to be the most influential covariate factor on recognition performance. Many methods have been developed to establish a cross-view mapping between gallery and probe templates [19]. However, their effectiveness is restricted to small variations of view angle. Thus, how to extract informative features, which of them are robust, and how to represent them in a form best suited for gait recognition is still an active research topic. Clearly, being a natural representation of human gait as perceived by humans, 3D gait data conveys more information than 2D data. Moreover, the research results obtained by clinicians [37] demonstrate that 2D data is often inadequate for carrying out an accurate and reliable biomechanical assessment since many gait features are inherently unique to 3D data. However, until now, only limited research on 3D data-based biometric gait recognition has been done because of the restricted availability of synchronized multi-view data with proper camera calibration [50]. Motivated by this, and taking into account a forecast formulated at Stanford on the evolution of methods for capturing human movement [52] as well as the promising results obtained using marker-less 3D motion tracking [39] and marker-based 3D motion tracking [5, 32] for gait recognition, in this work we introduce a dataset for marker-less 3D motion tracking and 3D gait recognition. Since gait abnormalities are often impossible to detect by eye or with video-based systems, we also share data from marker-based moCap, which is synchronized and calibrated with the marker-less motion capture system.

2 Background and relevant work

At present, the most common methods for precise capture of three-dimensional human movement usually require the attachment of markers or sensors to body segments and can only be applied in laboratory environments [47, 64]. In [16], in order to determine disease-specific gait characteristics, a marker-based motion capture (moCap) system was employed for investigating and understanding body movement. One of the major limitations of marker-based moCap systems is the need to attach markers to the person. The markers are placed manually on the skin with respect to anatomical landmarks. Since the markers are attached to the skin, their position is influenced by the movement of the soft tissue underneath. Benoit et al. [11] investigated the effect of skin movement on the kinematics of the knee joint during gait by comparing skin-mounted markers to bone-pin markers. They found that kinematic analysis is burdened with errors resulting from movement of the soft tissue, which should be considered when interpreting kinematic data. The accuracy of the measurements is also influenced by the placement of the markers. Differences can occur between different clinicians applying the markers to the same patient and also between placements made by the same clinician. In general, marker-based moCap systems are costly and thus not available in many clinical settings [57].

Recently, increasing interest in using the Kinect sensor for motion capture has emerged, including clinical and scientific analysis of gait [9, 21, 65]. The experimental results achieved in [74] showed that the Kinect sensor can follow the trend of the joint trajectories, albeit with considerable error. Gait analysis of healthy subjects using Kinect V2 in single-camera [9] or multi-camera [21] setups showed high accuracy of the motion sensors for estimating gait speed, stride length and stride time, but lower accuracy for other parameters such as stride width or speed variability. A recently proposed 3D method [1] employs the spatiotemporal changes in relative distances and angles among different skeletal joints to represent the gait signature. A comparison of traditional marker-based motion capture technologies and Kinect-based technology for gait analysis is presented in [57]. The experiments demonstrated that the correlation between Kinect and Vicon hip angular displacement was very low and the error was considerable, whereas the correlation between Kinect and Vicon stride timing was high and the error was fairly small.

Over recent years, the research community has shown much interest in gait as a biometric modality [19], primarily due to its non-intrusive nature as well as its ease of use in surveillance [54]. As already mentioned, the pioneering works on human motion analysis and gait recognition fall into the category of marker-based techniques [8, 22]. Since then, various methods [76] and modalities [10, 26, 77] have been proposed to determine one's identity. Usually, the vision-based methods start by extracting the human silhouette in order to obtain spatiotemporal data describing the walking person. The extracted silhouettes are then pre-processed, normalized and aligned. Afterwards, various computer vision and machine learning techniques are utilized to extract and model gait signatures, which are finally stored in a dictionary/database. During authentication, a test gait signature is computed and compared with the dictionary formed in advance.

Human gait analysis and recognition techniques can be divided into two main categories, namely model-free and model-based methods. The methods belonging to the first category characterize the motion of the entire human body using a concise representation without taking the underlying structure into account. They can in turn be divided into two major subcategories based on the way temporal information is preserved. The methods belonging to the first subcategory consider temporal information in the recognition. Liu et al. [42] utilized a population hidden Markov model (pHMM) defined for a set of individuals to model human walking and generated dynamics-normalized stance frames to identify pedestrians. For such probabilistic temporal models, a considerable number of training samples is generally needed to achieve satisfactory performance. The methods from the second subcategory convert an image sequence into a single template. Han and Bhanu [29] proposed the gait energy image (GEI) to improve the accuracy of gait recognition. The disadvantage of template-based methods is that they may lose temporal information. Several limitations can arise in real scenarios since GEI relies heavily on shape information, which is significantly altered by changes of clothing type and carrying conditions. As demonstrated in many studies, e.g. [75], single-view GEI-based gait recognition performance can drop significantly when the view angle changes. Model-based methods infer gait dynamics directly by modeling the underlying kinematics of human motion. A method proposed in [12] uses a motion-based model and elliptic Fourier descriptors to extract the key features. A recently proposed method [25] combines spatiotemporal and kinematic gait features. The fusion of two different kinds of features gives a comprehensive characterization of gait dynamics, which is less sensitive to variation in walking conditions.

Generally speaking, gait-based person identification is achieved by extracting image and/or gait signatures using the appearance or shape of the monitored subject and/or the dynamics of the motion itself [23, 42]. What constrains the application of biometric gait recognition is the influence of several so-called covariate factors, which affect both appearance and dynamics. These include not only viewpoint, footwear, walking surface, clothing and carried luggage, but also illumination and camera setup. Viewpoint is considered the most crucial of these covariate factors [76]. Thus, view-invariance aimed at more reliable gait recognition has been studied by several research groups [19, 33, 34, 43, 58, 69]. Clothing and carrying conditions are other important covariate factors that are frequently investigated [2, 24, 53].

In the last few years, a number of datasets have been designed to study the effect of covariate factors [50] in gait recognition. The SOTON database with five covariates was introduced in 2002 [63]. The USF database [61] was specifically designed to investigate the influence of covariate factors on identity classification. The CASIA Gait Database [13] is a more recently developed challenge dataset for the evaluation of gait recognition techniques. Most current gait recognition approaches perform the recognition on the basis of silhouettes captured in the side view, i.e. when the individual walks in a plane parallel to the camera. When the view angle deviates from the side view, the silhouette-based gait representation becomes less informative, and the recognition performance tends to degrade heavily. In [44, 48], view-invariance has been enhanced using 3D reconstructions.

Three-dimensional approaches to gait recognition are resistant to changes in viewpoint. In general, 3D data provides richer gait information than 2D data and thus has strong potential to improve recognition performance. However, only limited research on 3D data-based gait recognition has been done due to the limited availability of synchronized multi-view data with proper calibration parameters [50]. Though the CMU MoBo [28] and the CASIA [75] multi-view databases have been available for a long time, there are no significant results on 3D data, because either the data was recorded on a treadmill and thus does not represent 3D gait, or the calibration parameters of the multi-camera system are not available. In the MoBo multi-view dataset [28], six cameras are used to provide a full view of the person walking on the treadmill. It comprises 25 individuals, each of whom performed four gait types – slow walk, fast walk, ball walk, and inclined walk. Each sequence is recorded at 30 frames per second and is eleven seconds in duration. An inherent problem of gait analysis based on treadmill walking is that the gait is not natural. The main reason for this is that the gait speed is usually constant and the subjects cannot turn left or right. The INRIA Xmas Motion Acquisition Sequences (IXMAS) database [72] comprises five-view video and 3D body model sequences with eleven activities and ten subjects. The data were collected by five calibrated and synchronized cameras. However, this dataset cannot serve as a benchmark for 3D gait recognition since the gait data has only been registered on closed circular paths. CASIA Dataset B is a multi-view gait database with 124 subjects, where the gait data was captured from eleven views [75]. Three covariate factors, namely view angle, clothing and carrying condition changes, are considered separately. However, neither the camera positions nor the camera orientations are provided for this frequently employed dataset. A multi-biometric tunnel [62] at the University of Southampton is a constrained environment, similar to an airport, for capturing multimodal biometrics of walking pedestrians, i.e. gait, face and ear, in an unobtrusive way for automatic human identification. For gait recognition, the walking subject was recorded at 30 fps by eight synchronized cameras of resolution 640 × 480. The face and ear biometrics were captured by two other high-resolution cameras placed at the exit of the tunnel. The walls of the tunnel were painted with a non-repeating rectangular lattice of different colors to support automatic camera calibration. However, the dataset has limited applicability for 3D tracking-based gait recognition, the main reason being that it provides no ground-truth 3D joint locations. This means that the accuracy of 3D joint tracking and 3D motion estimation cannot be easily determined. 3D tracking [68], 3D motion descriptors [36], and 3D reconstructions [59] can provide useful information for gait recognition.

Since motion capture technology has recently become more affordable, 3D structure-based gait recognition has attracted more interest from researchers [5, 7, 32, 66]. Recently, new benchmark data and evaluation protocols for moCap-based gait recognition have been proposed in [6]. As recently shown in [60], marker-less motion capture systems can provide reliable 3D gait kinematics in the sagittal and frontal planes. In [38], a system for view-independent human gait recognition using marker-less 3D human motion capture has been proposed. The accuracy of the 3D motion reconstruction at the selected joints was determined on the basis of marker-based moCap. In [14], a comparison of marker-less and marker-based motion capture technologies through simultaneous data collection during gait has been presented. A recently conducted study [56] showed that marker-based and marker-less systems have similar ranges of variation in pelvis and lower-limb angles from the start of a squat to peak squat in a single-leg squat.

3 Databases for gait recognition

Table 1 contains a collective summary of the major databases for gait recognition that are cited in the relevant literature. As we can observe, only a few datasets provide calibration data and/or were recorded using synchronized cameras. To our knowledge, only one publicly available dataset contains moCap data; however, that benchmark dataset contains moCap data only.

Table 1 Comparison of major gait recognition databases

Our dataset has been designed for research on vision-based 3D gait recognition. It can also be used for the evaluation of multi-view (where gallery gaits from multiple views are combined to recognize a probe gait from a single view) and cross-view (where probe gait and gallery gait are recorded from two different views) gait recognition algorithms. Unlike related multi-view and cross-view datasets, our dataset has been recorded using synchronized cameras, which allows reducing the influence of motion artifacts between different views. Another application of our dataset is gait analysis and recognition on the basis of moCap data. For such research, the data are stored in the commonly used c3d and Acclaim asf/amc data formats. Last but not least, the dataset can be used to evaluate the accuracy of marker-less motion capture algorithms as well as their performance in 3D gait recognition.

4 3D gait dataset

The motion data were captured by 10 moCap cameras and four calibrated and synchronized video cameras. Figure 1 depicts the placement of the cameras of both systems. During a recording session, each actor was asked to walk within a scene of size 6.5 m × 4.2 m along a line joining cameras C2 and C4 as well as along the diagonal of the scene.

Fig. 1 Camera layout

In a single recording session, every performer walked from right to left, then from left to right, and afterwards along the diagonal from the upper-right to the bottom-left and from the bottom-left to the upper-right corner of the scene. Some performers were also asked to take part in additional recording sessions, i.e. after changing into different clothing and after putting on a backpack. Figure 2 shows sample images from three sessions with the same person. The first row depicts a passage from left to right (the passage from right to left is not shown), the second row depicts a passage along the diagonal from upper-right to bottom-left (the passage from bottom-left to upper-right is not shown), the third and fourth rows illustrate the same passages as above but in different clothing, whereas the fifth row shows sample images from the passage with the backpack (the remaining passage from the backpack session is not shown).

Fig. 2 Sample images from the dataset. First row: a walk from left to right; second row: a walk along the diagonal from upper-right to bottom-left; third and fourth rows: walks in other clothes from left to right and along the diagonal from upper-right to bottom-left, respectively; fifth row: a walk along the diagonal from upper-right to bottom-left with a backpack

The 3D gait dataset consists of 166 data sequences. The data represent the gait of thirty-two people (10 women and 22 men). In 128 data sequences, each of the thirty-two individuals was dressed in his/her own clothes; in 24 data sequences, 6 of the 32 performers (persons #26–#31) changed clothes; and in 14 data sequences, 7 of the performers wore a backpack. Each sequence consists of videos with RGB images of size 960 × 540, recorded by four synchronized and calibrated cameras at 25 frames per second, together with the corresponding moCap data. The moCap data were registered at 100 Hz by a Vicon system consisting of ten MX-T40 cameras of resolution 2352 × 1728 pixels. The synchronization between RGB images and moCap data was realized using a Vicon MX Giganet.

The calibration data of the vision system are stored in the xml data format. The camera model is the well-known Tsai model [67], which was chosen due to its frequent use for calibrating multi-camera systems. Every data sequence consists of:

  • four videos (see sample images in Fig. 2), compressed with the Xvid video codec,

  • foreground images extracted by background subtraction [78], stored as videos compressed with the Xvid codec,

  • images with edges extracted using Sobel masks, stored in the png image format,

  • moCap data in c3d and Acclaim asf/amc data formats.

The moCap data contain the 3D positions of 39 markers placed at the main body joints, see Fig. 3. From this set of markers, 4 were placed on the head, 7 on each arm, 12 on the legs, 5 on the torso and 4 were attached to the pelvis. The marker-less and marker-based systems share a common coordinate system, whose origin is located at the geometric center of the scene, see also Fig. 1.

Fig. 3 Placement of the markers

The GPJATK dataset is freely available at http://bytom.pja.edu.pl/projekty/hm-gpjatk/. The dataset has a size of 6.5 GB and is stored in .7z format. The names of the data sequences follow the format pXsY, where X denotes the person number and Y stands for the sequence number. The naming convention is as follows:

  • s1, s2 - straight walk, s1 from right to left, s2 from left to right, clothing 1

  • s3, s4 - diagonal walk, s3 from right to left, s4 from left to right, clothing 1

  • s5, s6 - straight walk, s5 from right to left, s6 from left to right, clothing 2

  • s7, s8 - diagonal walk, s7 from right to left, s8 from left to right, clothing 2

  • s9, s10 - walk with backpack, s9 from right to left, s10 from left to right

All data sequences, except s9 and s10, have corresponding marker-based moCap data.
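For illustration, the following minimal Python sketch decodes a sequence name according to the naming convention above; the helper name, the returned fields and the mapping table are illustrative and are not part of the dataset distribution.

```python
import re

# Walking path, direction and covariate implied by the sequence number (see the list above).
# The walking path of the backpack sequences (s9, s10) is left unspecified here.
SEQ_INFO = {
    "s1": ("straight", "right-to-left", "clothing 1"),
    "s2": ("straight", "left-to-right", "clothing 1"),
    "s3": ("diagonal", "right-to-left", "clothing 1"),
    "s4": ("diagonal", "left-to-right", "clothing 1"),
    "s5": ("straight", "right-to-left", "clothing 2"),
    "s6": ("straight", "left-to-right", "clothing 2"),
    "s7": ("diagonal", "right-to-left", "clothing 2"),
    "s8": ("diagonal", "left-to-right", "clothing 2"),
    "s9": (None, "right-to-left", "backpack"),
    "s10": (None, "left-to-right", "backpack"),
}

def parse_sequence_name(name):
    """Decodes a name such as 'p26s7' into person number, path, direction and covariate."""
    match = re.fullmatch(r"p(\d+)s(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected sequence name: {name}")
    person, seq = int(match.group(1)), "s" + match.group(2)
    path, direction, covariate = SEQ_INFO[seq]
    has_mocap = seq not in ("s9", "s10")  # s9/s10 have no marker-based moCap data
    return {"person": person, "path": path, "direction": direction,
            "covariate": covariate, "has_mocap": has_mocap}

print(parse_sequence_name("p26s7"))
```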

5 Baseline algorithm for 3D motion tracking

At the beginning of this Section, we outline 3D motion tracking on the basis of image sequences from four calibrated and synchronized cameras. Afterwards, we explain how the accuracy of the system has been calculated using ground-truth from the marker-based moCap system. Finally, we present the accuracy of 3D motion tracking.

5.1 3D motion tracking

The human body can be represented by a 3D articulated model formed by 11 rigid segments representing the key parts of the body. The pelvis is the root node in the kinematic chain and at the same time it is the parent of the upper legs, which are in turn the parents of the lower legs. The model is constructed from truncated cones and is used to generate contours, which are then matched with the image contours. The configuration of the body is parameterized by the position and the orientation of the pelvis in the global coordinate system and the angles between the connected limbs. Figure 4 (left) illustrates the 3D model utilized in marker-less motion tracking, whereas Fig. 4 (right) depicts the 3D model employed by the marker-based system.

Fig. 4 3D models used by marker-less and marker-based motion capture systems
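To make the pose parameterization concrete, the sketch below shows one possible state representation for the articulated model; the field names, the use of Euler angles and the assumption of three angles per child segment are illustrative choices, not details taken from the tracking implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class BodyPose:
    """Illustrative parameterization of the articulated model described above.

    The pelvis (root of the kinematic chain) carries a global position and
    orientation; the remaining segments are described by relative joint angles.
    The exact number of angles per joint is an assumption made for this sketch.
    """
    pelvis_position: np.ndarray = field(default_factory=lambda: np.zeros(3))
    pelvis_orientation: np.ndarray = field(default_factory=lambda: np.zeros(3))  # Euler angles
    joint_angles: np.ndarray = field(default_factory=lambda: np.zeros(10 * 3))   # 10 child segments

    def to_vector(self) -> np.ndarray:
        """State vector x used by the PSO-based optimizer."""
        return np.concatenate([self.pelvis_position,
                               self.pelvis_orientation,
                               self.joint_angles])
```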

Estimating 3D motion can be cast as a non-linear, high-dimensional optimization problem. The degree of similarity between the real and the estimated pose is evaluated using an objective function. Motion tracking can be achieved by a sequence of static PSO-based optimizations, followed by re-diversification of the particles to cover the possible poses in the next frame. In this work the 3D motion tracking is achieved through Annealed Particle Swarm Optimization (APSO) [40]. The fitness function expresses the degree of similarity between the real and the estimated human pose. Figure 5 illustrates the algorithm for calculating the objective function. For a single camera c it is determined in the following manner: \(f_{c}(\mathbf{x}) = 1 - \left(f_{1,c}(\mathbf{x})^{w_{1}} \cdot f_{2,c}(\mathbf{x})^{w_{2}}\right)\), where x stands for the state (pose), whereas \(w_{1}\) and \(w_{2}\) denote weighting coefficients that were determined experimentally. The function \(f_{1}(\mathbf{x})\) reflects the degree of overlap between the extracted body silhouette and the 3D model projected onto the 2D image. The function \(f_{2}(\mathbf{x})\) reflects the edge distance-based fitness value. A background subtraction algorithm [40] is employed to extract the binary image of the person, see Fig. 5b. The binary image is then utilized as a mask to suppress edges not belonging to the person, see Fig. 5d. The projected model edges are then matched with the image edges using the edge distance map, see Fig. 5g. Moreover, in multi-view tracking, the 3D model is projected and then rendered in each camera's view. The fitness value for all four cameras is determined as follows: \(f(\mathbf {x}) = {\sum }_{c=1}^{4} f_{c}(\mathbf {x})\).

Fig. 5 Calculation of the fitness function: input image (a), foreground (b), gradient magnitude (c), masked gradient image (d), edge distance map (e), 3D model (h) projected onto the 2D image plane (i) and overlaid on the binary image (f) and the edge distance map (g)
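A minimal sketch of the per-camera fitness and its aggregation over the four views is given below; the helper names and the exact form of the edge-distance term are assumptions, since the text specifies only the overlap component, the edge-distance component and the weighted product.

```python
import numpy as np

def camera_fitness(silhouette, edge_distance_map, model_mask, model_edges,
                   w1=1.0, w2=1.0):
    """Per-camera fitness f_c(x) = 1 - f1^w1 * f2^w2 (illustrative helpers).

    silhouette        : binary foreground image from background subtraction
    edge_distance_map : distance transform of the masked image edges
    model_mask        : binary rendering of the projected 3D model
    model_edges       : binary edges of the projected 3D model
    """
    # f1: degree of overlap between the extracted body and the projected model.
    overlap = np.logical_and(silhouette, model_mask).sum()
    f1 = overlap / max(model_mask.sum(), 1)

    # f2: edge distance-based fitness; a small average distance gives a value close to 1.
    distances = edge_distance_map[model_edges > 0]
    f2 = 1.0 / (1.0 + distances.mean()) if distances.size else 0.0

    return 1.0 - (f1 ** w1) * (f2 ** w2)

def total_fitness(observations, model_renderings, w1=1.0, w2=1.0):
    """Total fitness f(x): sum of f_c(x) over the four camera views."""
    return sum(camera_fitness(sil, edm, mask, edges, w1, w2)
               for (sil, edm), (mask, edges) in zip(observations, model_renderings))
```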

5.2 Evaluation metrics

Given the placement of the markers on the human body in the marker-based system, see Fig. 3, the virtual positions of the markers were determined for the marker-less system. A set of M = 39 markers has been defined for the 3D articulated model utilized in the marker-less moCap. Given the 3D human pose estimated by the algorithm discussed in Section 5.1, as well as the positions of the virtual markers with respect to the skeleton, the virtual 3D positions of the markers were determined and then utilized to calculate the Euclidean distance between the corresponding physical and virtual markers as follows:

$$ dist(\mathbf{p}, \hat{\mathbf{p}}) = \sqrt{ (x - \hat{x})^{2} + (y - \hat{y})^{2} + (z - \hat{z})^{2}} $$
(1)

where x, y and z represent the X-, Y- and Z-coordinates of the physical markers, whereas \(\hat{x}\), \(\hat{y}\) and \(\hat{z}\) are estimates of the X-, Y- and Z-coordinates of the virtual markers. On the basis of the 3D Euclidean distance between ground-truth and estimated joint positions, the Root Mean Squared Error (RMSE), also referred to as the average joint error over all joints, has been calculated in the following manner:

$$ \text{RMSE}(\mathbf{P}, \hat{\mathbf{P}}) = \frac{1}{MF} \sum\limits_{i=1}^{M} \sum\limits_{j=1}^{F} dist\left( \mathbf{p}_{i}^{(j)}, \hat{\mathbf{p}}_{i}^{(j)}\right) $$
(2)

where \(\mathbf {p}_{i}^{(j)}\in \mathbf {P},\hat {\mathbf {p}}_{i}^{(j)}\in \hat {\mathbf {P}}, \mathbf {P}=\{ \mathbf {p}_1^{(1)}, \dots , \mathbf {p}_M^{(1)}, \dots , {\mathbf {p}}_M^{(F)}\}\) represents the set of ground-truth joint positions, \(\hat {\mathbf {P}}=\left \{ \hat {\mathbf {p}}_{1}^{(1)}, \dots , \hat {\mathbf {p}}_{M}^{(1)}, \dots , \hat {\mathbf {p}}_{M}^{(F)}\right \}\) represents the set of estimated joint positions, M represents the total number of joints, whereas F represents the total number of frames.
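In numpy, the distance of (1) and the average joint error of (2) can be computed as follows; the (F, M, 3) array layout is an assumption made for the sketch.

```python
import numpy as np

def rmse(ground_truth, estimates):
    """Average joint error of (2): mean Euclidean distance over all M markers and F frames.

    ground_truth, estimates : arrays of shape (F, M, 3) holding the physical
    and virtual marker positions, respectively.
    """
    distances = np.linalg.norm(ground_truth - estimates, axis=-1)  # dist of (1), shape (F, M)
    return distances.mean()
```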

For each marker i the average Euclidean distance \(\bar {d}_{i}\) between the physical and virtual markers has been calculated in the following manner:

$$ \bar{d}_{i} = \frac{1}{F} \sum\limits_{j=1}^{F} dist\left(\mathbf{p}_{i}^{(j)}, \hat{\mathbf{p}}_{i}^{(j)}\right) $$
(3)

For each marker i the standard deviation σi has been used to measure the spread of joint errors around the mean error. It has been determined as follows:

$$ \sigma_{i} = \sqrt{\frac{1}{F-1} \sum\limits_{j=1}^{F} \left( dist\left(\mathbf{p}_{i}^{(j)}, \hat{\mathbf{p}}_{i}^{(j)}\right) - \bar{d}_{i}\right)^{2}} $$
(4)

The standard deviation for all M markers has been obtained through averaging σi values.
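A compact numpy sketch of (3) and (4), together with the averaging over markers, could look as follows; it assumes the same (F, M, 3) array layout as the RMSE sketch above.

```python
import numpy as np

def per_marker_statistics(ground_truth, estimates):
    """Per-marker mean error (3) and standard deviation (4).

    Returns (d_bar, sigma): arrays of length M with the average Euclidean
    distance and its spread for each marker, computed over the F frames.
    """
    distances = np.linalg.norm(ground_truth - estimates, axis=-1)  # shape (F, M)
    d_bar = distances.mean(axis=0)                                 # eq. (3)
    sigma = distances.std(axis=0, ddof=1)                          # eq. (4)
    return d_bar, sigma

# Averaging over all M markers gives the overall error and overall spread:
# d_bar_all, sigma_all = d_bar.mean(), sigma.mean()
```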

5.3 Accuracy of 3D motion tracking

Table 2 presents the RMSE errors achieved on sequences p1s1, p1s2, p28s1 and p28s2. The results were obtained in ten runs with different initializations.

Table 2 RMSE for M = 39 markers in four sample image sequences

Figure 6 shows sample plots of the Euclidean distance (1) over time for the head, torso, left forearm, right forearm, left tibia and right tibia, obtained on the sequence p1s1. The average Euclidean distances \(\bar {d}_{i}\) and the standard deviations for the above-mentioned body parts are equal to: 36.8 ± 18.0 mm (head), 37.7 ± 13.0 mm (torso), 32.6 ± 14.9 mm (left forearm), 24.5 ± 10.7 mm (right forearm), 39.3 ± 25.0 mm (left tibia) and 47.5 ± 27.4 mm (right tibia). The discussed results were obtained by APSO with 300 particles and 20 iterations.

Fig. 6 Tracking errors [mm] versus frame number for the p1s1 sequence, achieved by APSO with 300 particles and 20 iterations

The plots in Fig. 7 depict the distance between ankles for marker-less and marker-based motion capture systems, which were obtained on p1s1 and p1s2 sequences.

Fig. 7 Distance [m] over time between the ankles for the marker-less and marker-based motion capture systems

6 Performance of baseline 3D gait recognition

We first discuss the evaluation methodology. Then we outline the Multilinear Principal Component Analysis algorithm, and in Section 6.3 we present the evaluation protocol. In Section 6.4 we present the recognition performance achieved by the baseline algorithm as well as by a convolutional neural network. Finally, in Section 6.5 we analyze the performance of individual identification.

6.1 Methodology

The gait cycle is the basic entity describing gait during ambulation; it spans the time from when one heel strikes the floor to the time at which the same limb contacts the floor again. In our approach, gait is recognized on the basis of a single gait sample, which consists of two strides. Since the number of frames registered in individual gait samples differs only slightly from the average, the time dimension was chosen to be 32, which roughly equals the average number of video frames in each gait sample. Since the marker-based system has a four times higher frame rate, the time dimension for moCap data was set to 128. The data extracted by the motion tracking algorithm were stored in the ASF/AMC data format. For such a single gait cycle, a third-order tensor of size 32 × 11 × 3 is determined for marker-less data, whereas a tensor of size 128 × 25 × 3 is calculated for marker-based data. The marker-less data was filtered by applying a running mean of length nine samples to the original data. For data from the marker-less system, the second dimension of the tensor is equal to the number of bones excluding the pelvis, i.e. 10, plus one element in which the distance between the ankles and the person's height are stored. The third mode accounts for three angles, except for the eleventh vector, which contains the distance between the ankles and the person's height. For data from the marker-based system, the second dimension of the tensor is equal to the number of bones, see also Fig. 4, whereas the third mode accounts for three angles. Such a gait signature was then reduced using the Multilinear Principal Component Analysis (MPCA) algorithm [45], which is overviewed in Section 6.2.
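For concreteness, a minimal numpy sketch of the marker-less gait tensor construction is given below. How gait samples of slightly different length are brought to the fixed time dimension of 32 is not specified above, so the simple resampling used here, as well as the helper names and the placement of the ankle distance and height within the eleventh vector, are assumptions.

```python
import numpy as np

def smooth(series, window=9):
    """Running mean of length nine applied to each channel of the time series."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, series)

def markerless_gait_tensor(angles, ankle_distance, height, time_dim=32):
    """Builds the 32 x 11 x 3 gait tensor for marker-less data.

    angles         : array (T, 10, 3) of per-bone angles (pelvis excluded)
    ankle_distance : array (T,) distance between the ankles over the gait sample
    height         : scalar, person height
    """
    T = angles.shape[0]
    tensor = np.zeros((T, 11, 3))
    tensor[:, :10, :] = angles
    tensor[:, 10, 0] = ankle_distance   # eleventh vector: ankle distance ...
    tensor[:, 10, 1] = height           # ... and person height (third slot unused here)
    tensor = smooth(tensor.reshape(T, -1)).reshape(T, 11, 3)
    # Bring the time axis to the fixed length of 32 samples by simple resampling.
    idx = np.linspace(0, T - 1, time_dim).round().astype(int)
    return tensor[idx]
```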

A benchmark dataset for gait recognition should have two subsets: the gallery and the probe. Gait samples in the gallery set are labeled with person identities and are utilized in training, while the probe set encompasses the test data, i.e. gait samples of unknown identities that are matched against the data from the gallery set. The introduced GPJATK dataset has gallery and probe sets. The gallery set contains known class labels, and a model is learned on this data in order to generalize to the probe set later on. The evaluation methodology is discussed in Section 6.3.

6.2 Feature extraction using multilinear principal component analysis

Tensors are denoted by calligraphic letters and their elements are denoted by indices in brackets. An Nth-order tensor is denoted as \(\mathcal {A} \in \mathbb {R}^{I_{1} \times I_{2} \times {\dots } \times I_{N}}\). The tensor is addressed by N indices \(i_{n}\), where \(n=1,\dots ,N\), and each \(i_{n}\) addresses the n-mode of \(\mathcal {A}\). The n-mode product of a tensor \(\mathcal {A}\) by a matrix \(\mathbf {U} \in \mathbb {R}^{J_{n} \times I_{n}}\) is a tensor \(\mathcal {A} \times _{n} \mathbf {U} \) with elements: \((\mathcal {A} \times _{n} \mathbf {U} )(i_{1}, \dots , i_{n-1}, j_{n}, i_{n+1}, \dots , i_{N}) = {\sum }_{i_{n}} \mathcal {A} (i_{1}, \dots , i_{N}) \boldsymbol {\cdot } \mathbf {U} (j_{n}, i_{n})\). A rank-1 tensor \(\mathcal {A} \) is equal to the outer product of N vectors \(\mathcal {A} = \mathbf {u}^{(1)} \circ \mathbf {u}^{(2)} \circ {\dots } \circ \mathbf {u}^{(N)} \), which means that for all values of the indices, \(\mathcal {A}(i_{1}, i_{2}, \dots , i_{N}) = \mathbf {u}^{(1)}(i_{1}) \boldsymbol {\cdot } \mathbf {u}^{(2)}(i_{2}) \boldsymbol {\cdot } {\dots } \boldsymbol {\cdot } \mathbf {u}^{(N)}(i_{N}) \).
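The n-mode product can be implemented in a few lines of numpy, as in the following illustrative sketch.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """n-mode product A x_n U of a tensor A (I1 x ... x IN) with a matrix U (Jn x In)."""
    # Move the n-th mode to the front, multiply, and move it back.
    t = np.moveaxis(tensor, mode, 0)                 # shape (In, ...)
    shape = t.shape
    t = matrix @ t.reshape(shape[0], -1)             # shape (Jn, product of other modes)
    t = t.reshape((matrix.shape[0],) + shape[1:])
    return np.moveaxis(t, 0, mode)

# Example: projecting a 32 x 11 x 3 gait tensor onto 8 components along the first mode.
X = np.random.rand(32, 11, 3)
U1 = np.random.rand(8, 32)
Y = mode_n_product(X, U1, mode=0)                    # shape (8, 11, 3)
```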

The MPCA operates on third-order gallery gait samples \(\{ \mathcal {X}_{1}, \dots , \mathcal {X}_{L} \in \mathbb {R}^{I_{1} \times I_{2} \times I_{3}} \}\), where L stands for the total number of data samples in the gallery subset. The MPCA algorithm seeks a multilinear transformation \(\{ {\tilde {\mathbf {U}}^{(n)} } \in \mathbb {R}^{I_{n} \times P_{n}},n=1,2,3 \}\), where Pn < In for n = 1,2,3, which transforms the original gait tensor space \( \mathbb {R}^{I_{1}} \otimes \mathbb {R}^{I_{2}} \otimes \mathbb {R}^{I_{3}} \) into a lower-dimensional tensor subspace \( \mathbb {R}^{P_{1}} \otimes \mathbb {R}^{P_{2}} \otimes \mathbb {R}^{P_{3}} \), in which the feature tensor after the projection is obtained in the following manner: \(\mathcal {Y}_{l} = \mathcal {X}_{l} \times _{1} \tilde {\mathbf {U}}^{(1)^{T}} \times _{2} \tilde {\mathbf {U}}^{(2)^{T}} \times _{3} \tilde {\mathbf {U}}^{(3)^{T}}\), where ×n is the n-mode projection, \({l}=1, \dots , {L}\), and \(\tilde {\mathbf {U}}^{(1)} \in \mathbb {R}^{I_{1} \times P_{1}}\) is a projection matrix along the first mode of the tensor, and similarly for \(\tilde {\mathbf {U}}^{(2)}\) and \(\tilde {\mathbf {U}}^{(3)}\), such that the total tensor scatter \({\Psi }_{\mathcal {Y}} = {\sum }_{l=1}^{L} || \mathcal {Y}_{l} - \bar {\mathcal {Y}} ||_{F}^{2} \) is maximized, where \( \bar {\mathcal {Y}} = \frac {1}{L}{\sum }_{l=1}^{L} \mathcal {Y}_{l}\) denotes the mean of the training samples. The solution to this problem can be obtained through an iterative alternating projection method. The projection matrices \(\{ \tilde {\mathbf {U}}^{(n)}, n=1,2,3 \} \) can be viewed as \({\prod }_{n=1}^{3} P_{n}\) so-called EigenTensorGaits [45]: \( \tilde {\mathcal {U}}_{p_{1}p_{2}p_{3}} = \tilde {\mathbf {u}}_{p1}^{(1)} \circ \tilde {\mathbf {u}}_{p2}^{(2)} \circ \tilde {\mathbf {u}}_{p3}^{(3)} \), where \(\tilde {\mathbf {u}}_{p_{n}}^{(n)} \) is the \(p_{n}^{th}\) column of \(\tilde {\mathbf {U}}^{(n)} \).

In the Q-based method [45], for each n, the first Pn eigenvectors are kept in the n-mode such that \(Q^{(1)} = Q^{(2)} = \dots = Q^{(N)} = Q\), where Q is a ratio determined as follows: \(Q^{(n)} = {\sum }_{i_{n}=1}^{P_{n}} \lambda _{i_{n}}^{(n)*} / {\sum }_{i_{n}=1}^{I_{n}} \lambda _{i_{n}}^{(n)*} \), and where \(\lambda _{i_{n}}^{(n)*}\) is the \(i_{n}\)-th full-projection n-mode eigenvalue.
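Under this criterion, the number of retained eigenvectors per mode can be selected as in the sketch below (illustrative only; the eigenvalues are assumed to be sorted in descending order).

```python
import numpy as np

def select_pn(eigenvalues, Q=0.99):
    """Smallest P_n such that the first P_n eigenvalues capture a fraction >= Q.

    eigenvalues: full-projection n-mode eigenvalues, sorted in descending order.
    """
    ratios = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    p_n = int(np.searchsorted(ratios, Q)) + 1
    return min(p_n, len(eigenvalues))
```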

Since the distance between two tensors \(\mathcal {A}\) and \(\mathcal {B}\), measured by the Frobenius norm \(dist(\mathcal {A}, \mathcal {B}) = || \mathcal {A}-\mathcal {B} ||_{F}\), equals the Euclidean distance between their vectorized representations \(vect(\mathcal {A})\) and \(vect(\mathcal {B})\), the feature tensors extracted by the MPCA were vectorized. They were then utilized in gait recognition using k-Nearest Neighbors (kNN), Naïve Bayes (NB), Support Vector Machine (SVM) and Multilayer Perceptron (MLP) classifiers. To the best of our knowledge, MPCA and its extended versions have so far only been used in conjunction with kNN classifiers, cf. [46].
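A minimal sketch of this projection-and-vectorization step followed by a nearest-neighbor match is shown below; the use of scikit-learn and the variable names are assumptions made for illustration (the evaluations in Section 6.4 were carried out with WEKA classifiers).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def project_and_vectorize(samples, U1, U2, U3):
    """Projects each gait tensor with the MPCA matrices and flattens the result.

    samples : array (L, I1, I2, I3); U1, U2, U3 : projection matrices of shape (In, Pn).
    """
    # Equivalent to X x_1 U1^T x_2 U2^T x_3 U3^T followed by vectorization.
    Y = np.einsum("lijk,ip,jq,kr->lpqr", samples, U1, U2, U3)
    return Y.reshape(len(samples), -1)

# gallery_X, gallery_y: training tensors and identity labels; probe_X: test tensors.
# features = project_and_vectorize(gallery_X, U1, U2, U3)
# knn = KNeighborsClassifier(n_neighbors=1).fit(features, gallery_y)
# predictions = knn.predict(project_and_vectorize(probe_X, U1, U2, U3))
```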

6.3 Evaluation protocol

From 166 video sequences, 414 gait samples were extracted (325 – clothing 1, 58 – clothing 2, 31 – backpack). From every sequence, 2 or 3 gait samples were determined and then employed in the evaluations. The performance of the system has been evaluated as follows:

  • clothing 1

    • 10–fold cross–validation on 325 gait samples

    • gallery–probe: 164 gait samples in training set (sequences s1 and s2), 161 gait samples in probe set (sequences s3 and s4)

  • clothing 2

    • gallery–probe: training data – 325 gait samples (clothing 1: sequences s1, s2, s3 and s4), test set – 58 gait samples (clothing 2: sequences s5, s6, s7 and s8 with persons p26–p31).

  • backpack

    • gallery–probe: training data – 325 gait samples (clothing 1: sequences s1, s2, s3 and s4), test set – 31 gait samples (backpack: sequences s9 and s10 with persons p26–p32).

The Correct Classification Rate (CCR) has been used to quantify the deviations between the true identities and their predictions. The CCR is the ratio of the number of correctly classified subjects to the total number of subjects in the test subset. The classification rates for rank 2 and rank 3 were also determined. They correspond to the percentages of gait instances whose correct identity appears among the first two and first three indications of a classifier, respectively.
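A rank-k classification rate can be computed from per-class classifier scores as in the following sketch; the score-matrix layout is an assumption made for illustration.

```python
import numpy as np

def rank_k_ccr(class_scores, true_labels, k=1):
    """Rank-k correct classification rate in percent.

    class_scores : array (N, C) of classifier scores/probabilities per test sample
    true_labels  : array (N,) of integer class indices
    """
    # Indices of the k highest-scoring classes for each test sample.
    top_k = np.argsort(class_scores, axis=1)[:, ::-1][:, :k]
    hits = np.any(top_k == true_labels[:, None], axis=1)
    return hits.mean() * 100.0
```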

6.4 Evaluation of recognition performance

Gait-based person identification has been performed using Naïve Bayes (NB), Support Vector Machine (SVM), Multilayer Perceptron (MLP) and k-Nearest Neighbor (1NN, 3NN and 5NN) classifiers, operating on features extracted by the Multilinear PCA. The classification and the performance evaluation were performed using the WEKA package. Sequential Minimal Optimization (SMO) is one of the common optimization algorithms for solving the Quadratic Programming (QP) problem that arises during the training of SVMs. SMO operates by breaking down the dual form of the SVM optimization problem into many smaller optimization problems that are more easily solvable. The C parameter of the SMO algorithm has been determined using cross-validation and grid search.

6.4.1 Cross–validation on gait data

In order to assess how the results of the statistical analyses will generalize to an independent data set, we performed evaluations on the basis of cross-validation. In the ten-fold cross-validation, the 325 gait samples are randomly partitioned into ten equally sized subsets. Nine of the ten subsets are used as training data, and the remaining one is retained as the validation data. The cross-validation process is then repeated ten times, with each of the ten subsets used exactly once as the validation data. The ten results are then averaged to produce a single evaluation. To evaluate the recognition performance we determined the classification accuracies for ranks 1, 2 and 3.
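The cross-validation loop itself is standard; as an illustration only, a scikit-learn sketch with placeholder data and placeholder MLP hyper-parameters is shown below (the reported results were obtained with the WEKA package, not with this code).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder stand-ins for the 325 vectorized MPCA gait features (144 attributes)
# and the corresponding person identities; replace with the real data.
features = np.random.rand(325, 144)
labels = np.arange(325) % 32

# The MLP hyper-parameters below are placeholders, not the values used in the evaluation.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000)
scores = cross_val_score(mlp, features, labels, cv=10)
print(f"rank-1 CCR: {100 * scores.mean():.1f}% +/- {100 * scores.std():.1f}%")
```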

Table 3 shows the classification accuracies obtained in 10-fold cross-validation on the basis of data from marker-less motion capture. As we can observe, the MLP achieves the best results for ranks 1–3. At this point, it is worth emphasizing the high classification accuracy achieved by the baseline algorithm on data from the marker-less motion capture system: almost 90% classification accuracy for rank 1 and almost 97% for rank 3. These promising results are a strong argument for introducing this dataset to the pattern recognition community.

Table 3 Correctly classified ratio [%] in 10–fold cross-validation using data from marker-less motion capture

The experimental results in Table 4 were obtained in 10-fold cross-validation on the basis of data from marker-based motion capture. As we can observe, the best classification accuracies are obtained by the MLP classifier, which achieves a CCR higher than 98% for rank 1 and a classification accuracy better than 99% for rank 2. The SMO classifier achieves competitive results in comparison to the MLP, whereas the kNN classifiers, which are frequently used in biometric systems as baseline algorithms, achieve far worse results. The practical advantage of kNN classifiers is that they deliver the closest identities, and thus they permit analysis of the causes of misclassification. The availability of precise motion data for several joints allows identification of the most important body segments for gait recognition.

Table 4 Correctly classified ratio [%] in 10–fold cross-validation using data from marker-based motion capture

As seen from the above results, marker-less motion capture systems are able to deliver useful information for view-invariant gait recognition. The results were obtained for Q = 0.99, for which the number of attributes in the vectorized tensor representations is equal to 144. This value of Q gives the best results and is utilized in the remaining evaluations.

6.4.2 Train/test split - clothing 1

In this subsection, experimental results obtained with a train/test data split are presented. The following tables report classification accuracies obtained on data from the marker-less and marker-based motion capture systems.

Table 5 illustrates the classification accuracy for the train/test (train - clothing 1, test - clothing 1) split of data from the marker-less motion capture system. As we can observe, for rank 1 the best classification accuracy is slightly better than 80% and was achieved by the MLP, whereas for ranks 2 and 3 the best results were achieved by the SMO and are equal to 89.1% and 94.2%, respectively. In this multi-view scenario, where a straight walk perpendicular to a pair of cameras was used in gallery/training and a diagonal walk was used in probe/testing, promising results were obtained despite the noisy data from the marker-less system. A classification accuracy slightly better than 94% for rank 3 is promising since the gait data was recorded from significantly different observation perspectives.

Table 5 Correctly classified ratio [%] for train/test (train - clothing 1, test - clothing 1) split of data from marker-less motion capture

Table 6 shows the classification accuracy for the train/test (train - clothing 1, test - clothing 1) split of data from the marker-based motion capture system. As we can observe, the classification accuracies are slightly worse in comparison to the results shown in Table 4.

Table 6 Correctly classified ratio [%] for train/test (train - clothing 1, test - clothing 1) split of data from marker-based motion capture

6.4.3 Train - clothing 1 / test - clothing 2

This subsection is devoted to the analysis of classification results for the clothing covariates. We present results achieved using clothing 1 in gallery/training and clothing 2 in probe/testing.

Table 7 presents the classification accuracy for train/test (train - clothing 1, test - clothing 2) split of data from marker-less motion capture. In such a scenario the best classification accuracy was obtained by the MLP classifier. As we can observe, for rank 3 the classification accuracy is better than 96%.

Table 7 Correctly classified ratio [%] for train/test (train - clothing 1, test - clothing 2) split of data from marker-less motion capture

The experimental results in Table 8 illustrate the classification accuracies obtained on data from the marker-based system. As expected, the classification accuracy does not change noticeably with respect to the previously considered scenarios with data from the marker-based system.

Table 8 Correctly classified ratio [%] for train/test (train - clothing 1, test - clothing 2) split of data from marker-based motion capture

6.4.4 Train - clothing 1 / test - backpack

In this subsection, we present the classification accuracies achieved in a scenario in which clothing 1 was used in gallery/training, whereas the backpack data was used in probe/testing. As shown in Table 9, the system achieves 77.4%, 87.1% and 93.55% accuracies for ranks 1, 2 and 3, respectively.

Table 9 Correctly classified ratio [%] for train/test (train - clothing 1, test - knapsack) split of data from marker-less motion capture

6.4.5 End-to-end biometric gait recognition by convolutional neural network

Recently, deep learning methods such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have shown vast potential for automatically learning features for action recognition and motion analysis [4]. Several CNN-based methods have been applied to biometric gait recognition [70]. Deep learning-based methods usually benefit from big training datasets. Due to limitations of currently available datasets for multi-view [20] and 3D biometric gait recognition [7], limited research has been dedicated to this topic. In [73] a 3D convolutional neural network has been applied to recognize gait in multi-view scenarios. A method proposed in [27] can be applied to cross-view gait recognition. Heatmaps extracted on the basis of CNN-based pose estimates are used to describe the gait in a single frame. An LSTM recurrent neural network is then applied to model gait sequences. Below we show that promising results can be achieved on the proposed dataset in end-to-end biometric gait recognition by a neural network built on 1D convolutions. We evaluated the gait recognition performance in the clothing 2 and backpack scenarios.

One of the benefits of using CNNs for sequence classification is that they can learn from the raw time series data directly, and thus do not necessitate a separate feature extraction step for the subsequent classification. The input of a neural network built on 1D convolutions is a time series with a predefined length. The input data is 33-dimensional and can consist of 25 or 32 time steps. The output consists of a probability distribution over the number of persons (C = 32) in the dataset, i.e. it is a softmax classifier with C neurons. The CNN model consists of a convolutional block and a fully connected layer. The convolutional block comprises two 1D CNN layers, which are followed by a dropout layer for regularization and then a pooling layer. The length of the 1D convolution window in the first layer is set to three, whereas the number of output filters is set to 256. The length of the 1D convolution window in the second layer is set to three and the number of filters is set to 64. The fraction of units to drop is equal to 0.5, whereas the size of the max pooling window is equal to two. After the pooling, the learned features are flattened to one long vector and passed through a fully connected layer with one hundred neurons. Glorot's uniform initialization, also called Xavier uniform initialization, is used to initialize all parameters undergoing learning. The parameters are learned using the Adam optimizer (with the learning rate set to 0.0001, and the exponential decay rates of the first and second moment estimates set to 0.9 and 0.999, respectively) and the categorical cross-entropy as the objective cost function.
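A sketch of this architecture in Keras is shown below; the choice of framework and the ReLU activations are assumptions, as the text specifies only the layer sizes, the initialization, the optimizer settings and the loss.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_gait_cnn(time_steps=32, n_features=33, n_classes=32):
    """1D-convolutional network following the architecture described above."""
    model = keras.Sequential([
        layers.Input(shape=(time_steps, n_features)),
        layers.Conv1D(filters=256, kernel_size=3, activation="relu"),
        layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
        layers.Dropout(0.5),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(100, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Glorot (Xavier) uniform initialization is the Keras default for Conv1D/Dense layers.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4,
                                                  beta_1=0.9, beta_2=0.999),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_gait_cnn()
# model.fit(train_x, train_y_one_hot, epochs=..., batch_size=...)
```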

The experimental results achieved by the convolutional neural network are shown in Table 10. The neural network has been trained on the 325 gait samples belonging to the Clothing 1 covariate. Comparing the results in Tables 7 and 10 for the Clothing 1 – Clothing 2 covariate, we can observe that for rank 1 the CCR is better than the CCR achieved by the NB, equal to the CCR achieved by the SMO, worse than the CCR obtained by the MLP, and better in comparison to the CCRs achieved by the kNN classifiers. For the Clothing 1 – Knapsack covariate, the CCR achieved by the CNN is better in comparison to the CCRs achieved by the kNN classifiers, equal to the CCR achieved by the SMO, and worse in comparison to the CCRs achieved by both the NB and the MLP.

Table 10 Correctly classified ratio [%] achieved by 1D convolutional neural network on data from marker-less motion capture

As seen in Table 11, on the precise motion data acquired by the marker-based system, the convolutional neural network achieved better results than the classical classifiers operating on MPCA features, cf. Table 8. It is worth noting that, in order to relate the results achieved on the basis of marker-less and marker-based data, the discussed evaluation has been performed on time series of length 32, i.e. sub-sampled motion data. The temporal convolutional neural network demonstrated that it is capable of extracting characteristic spatiotemporal 3D patterns generated by human motion.

Table 11 Correctly classified ratio [%] achieved by 1D convolutional neural network on data from marker-based motion capture

6.5 Individual identification based on gait

In this subsection, we analyze the performance of individual identification on the basis of data from the marker-less system. Figure 8 depicts the confusion matrix for the train/test (train - clothing 1, test - clothing 1) split of data from marker-less motion capture, determined on the basis of the MLP classification results.

Fig. 8 Confusion matrix for the train/test (train - clothing 1, test - clothing 1) split of data from marker-less motion capture, achieved by the MLP classifier, see also Table 5

The left plot in Fig. 9 depicts the recognition rates for persons p1-p32, which were achieved by the MLP classifier in train/test (train - clothing 1, test - clothing 1) split of data from marker-less motion capture, see also confusion matrix in Fig. 8. The right plot shows recognition rates for persons p26–p31 in clothing 1/clothing 1, clothing 1/clothing 2 and clothing 1/backpack data splits, respectively.

Fig. 9 Recognition rate [%] vs. person ID: (left) recognition rate for the train/test (train - clothing 1, test - clothing 1) split of data from marker-less motion capture, achieved by the MLP classifier, corresponding to the confusion matrix depicted in Fig. 8; (right) recognition rate for persons p26–p31 in the clothing 1/clothing 1, clothing 1/clothing 2 and clothing 1/backpack data splits

7 Conclusions and discussion

We have introduced a dataset for the analysis of 3D motion as well as the evaluation of gait recognition algorithms. This is a comprehensive dataset that contains synchronized video from multiple camera views with associated 3D ground truth. We compared the biometric gait recognition performance achieved by algorithms on marker-less and marker-based data. We discussed the recognition performance achieved by a convolutional neural network and by classic classifiers operating on handcrafted gait signatures, which were extracted from 3D gait data (third-order tensors) using multilinear principal component analysis. All data are made freely available to the research community. The experimental results obtained by the presented algorithms are promising. The much better results achieved by algorithms operating on marker-based data suggest that the precision of motion estimation has a strong impact on the performance of biometric gait recognition. The availability of synchronized multi-view image sequences with 3D locations of body joints creates a number of possibilities for the extraction of gait signatures with high identification power.