1 Introduction

Biometrics is the science that combines the study of human physiology with computer science to measure and analyze human characteristics. Each individual has unique traits, both physical, such as the color of the iris, the retina, the shape of the hand and the fingerprints, and behavioral, such as the timbre of the voice, the handwriting and the walking style [1]. Biometric recognition aims to exploit these features, taken singly or in combination, to identify individuals. Nowadays there are many contexts in which it is necessary to identify a person: access control to protected systems, to ensure that only authorized persons gain access; identification of people responsible for fraud or crimes; and so on. In access control the subject directly and actively cooperates with the system because he/she wants to gain access, whereas in people identification the collaboration of the subject cannot be guaranteed. Therefore, in situations such as an uncontrolled surveillance scenario it is important to exploit a biometric feature that can be captured at a distance and without the subject’s collaboration: the walking style, also called gait, is a promising candidate for this task.

The study of gait for recognition purposes has recently gained interest due to its advantages over other biometric features [2]. First of all, gait can be captured at a distance, unobtrusively and, as noted above, without the subject’s cooperation. In addition, motion capture does not require sophisticated and expensive instrumentation: a simple and inexpensive RGB or RGBD (video and depth) camera is sufficient. On the other hand, gait is not considered a reliable means of recognizing individuals over time, because the walking style of an individual could change during his/her lifetime, like other behavioral features [3].

In this paper we investigate whether gait truly cannot be considered a distinctive biometric trait over the years, or whether it is instead possible to recognize a person by comparing his/her gait sequences with those observed a few years earlier. To acquire gait samples of different individuals we employ the Microsoft Kinect sensor, a device originally designed for gaming that has quickly become a valuable support for scientific research on gesture recognition, gait and motion analysis, and so on [4]. This device is an RGBD camera which provides, in addition to RGB and depth streams, a skeleton tracking feature that allows a human body to be tracked in real time. This “skeleton” is a body model composed of 20 joints, each with coordinates in 3D space. Using the 3D skeleton data, we have designed a recognition method based on the distance between joints and the sway of joints: the peculiarity of our method is that the sway features do not depend on gait trajectories, so that the acquisition of the skeleton is independent of the relative position of the Kinect sensor and the recorded subject. We have collected two datasets of gait samples three years apart: we employ the old one for training our classification system, and the new one for testing. These datasets have been designed to reproduce realistic unconstrained surveillance scenarios, letting people walk along no preset path, toward the camera or away from it, and even carrying basic accessories: in this way we also analyze how these challenging scenarios influence the accuracy of the gait recognition task.

The rest of this paper is organized as follows: Sect. 2 presents an overview of the related work on gait recognition systems; in Sect. 3 the proposed method for gait features extraction and classification is described, while Sect. 4 presents the datasets of gait sequences we have employed in our study. The experimental analysis is worked out in Sect. 5 and the conclusions are reported in Sect. 6.

2 Related Work

In this paper we propose an application of gait analysis in the fields of surveillance [5] and forensic science [6], but gait analysis has also been widely applied in health care, for example for postural control [7], rehabilitation [8], fall detection for elderly people [9], and so on. Regardless of the specific application context, the literature offers two ways of collecting the human body data used for analyzing gait [10]. In past decades, the most common methods were the appearance-based ones, which employ the body silhouette from 2D images to reconstruct the human shape. Methods of this kind are quite inexpensive in terms of computational cost, but they are usually not robust to variations in scale and camera viewpoint. A notable example of this approach is the work of Wang et al. [11], where background subtraction is combined with image segmentation in order to isolate the spatial silhouettes of a walking person. More recently, Hu et al. [12] have reduced the space of gait features taken from different views by implementing a unitary projection method. Ortells et al. [13] have extracted reliable gait measures from corrupted silhouettes in gait sequences by applying a weighted averaging method.

The second kind of methods is model-based and uses some technique for reconstructing a 3D model of the body: these methods require more computational power than the appearance-based techniques, with the advantage of being independent of viewpoint and scale. Among the model-based methods, Bouchrika and Nixon [14] have proposed a methodology based on elliptic Fourier descriptors for extracting body joints and building motion templates for gait description. Argyropoulos and Araujo [15] have used a channel coding approach for constructing a model of gait employed in their gait recognition system. Jung et al. [16] have developed a system to enable face acquisition and recognition in surveillance environments, by analyzing the 3-D gait trajectory.

Thanks to the spread of RGBD cameras, there has been a significant increase in model-based approaches to gait studies. To mention a few, Ahmed et al. [17] have defined two types of features, i.e. joint relative distance and joint relative angle, which are robust against pose and view variations. Dikovski et al. [18] have investigated the role of different types of features and body parts in the gait recognition process. Andersson and Araujo [19] have extracted anthropometric attributes, gait kinematic and spatio-temporal parameters.

We have already explored the gait recognition task with the Kinect sensor in other works, in particular in [20], where we discriminate between pairs of subjects with similar anthropometric features, and in [21], where we address people recognition in a controlled surveillance scenario. In the present paper we collect new gait data on some of the subjects we had already studied in these works, and we verify whether they are still recognized using the old datasets collected a few years ago. To the best of our knowledge, the present paper is the first attempt at recognizing individuals by their walking style using gait data from the past.

3 Proposed Method

In our analysis we employ the Microsoft Kinect sensor as the input device for capturing the gait sequences. In this section we describe how we collect, clean and manage the skeleton data provided by this sensor to perform the gait recognition task. Kinect provides both RGB and depth streams and, thanks to its skeletal tracking capability, it makes it possible to follow in real time the body movement represented by a human skeleton map. In detail, the body model (shown in Fig. 1) is constructed in each frame of the video stream, and it is made up of 20 joints, labeled \(J_0, \ldots , J_{19}\). Each joint is a 3D point in the Kinect coordinate system (x, y, z), centered on the sensor, where x, y and z represent the horizontal, vertical and depth directions respectively. Thus, we denote by \(J_{k}^i=(J_{k,x}^i, J_{k,y}^i, J_{k,z}^i)\) the coordinates of the \(k\)-th joint at time \(i\), and we define a gait sequence as an uninterrupted sequence of skeleton maps, from when the user is detected by the sensor until he/she leaves the field of view. Moreover, the sensor provides an estimate of the floor plane on which the user is walking, given by the equation \(ax+by+cz+d=0\), where \(\mathbf n =(a, b, c)\) is the normal vector and d is the height of the camera center with respect to the floor.
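As a side note on the floor plane, the signed distance of a joint from the estimated floor follows directly from the plane equation. The identity below is standard geometry rather than part of the proposed method, and the symbol \(h\) is introduced here only for illustration. Evaluating it at the sensor origin recovers the camera height d when \(\mathbf n \) has unit length, which is an assumption the text does not state explicitly.

$$\begin{aligned} h(J_k^i) = \frac{a\,J_{k,x}^i + b\,J_{k,y}^i + c\,J_{k,z}^i + d}{\Vert \mathbf n \Vert }, \qquad h(\mathbf 0) = \frac{d}{\Vert \mathbf n \Vert } . \end{aligned}$$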

Fig. 1. The Kinect skeleton body model.

Collection of Skeleton Maps

We collect a skeleton map only if all of its joints are fully tracked: the sensor provides this kind of information automatically. For each sample, we also keep track of the direction of the walk, distinguishing whether it is toward or away from the camera. Kinect does not provide any automatic mechanism for recognizing frontal and rear poses, so we check this condition by simply monitoring the position of the center of mass (joint \(J_0\)), and in particular its depth coordinate z: if this value decreases over time, the subject is walking toward the camera (recall that the origin of the coordinate system is the sensor), otherwise he/she is walking in the other direction. In this way we can properly assign data to the left/right arms and legs when a change of direction is detected. Obviously this mechanism fails if the subject walks backwards, but within the scope of this paper we assume that this case does not occur. To compare gait samples from different viewpoints and with non-linear walking paths, we also apply a translation and a rotation to each joint: our objective is to obtain, for each collected walk, a sample in a linear direction and fully in front of the sensor, that is, a sample where the z axis coincides with the walking direction. To this end, for each frame we need to detect the angle of the walking direction with respect to the z axis, and to rotate all the joints by this angle. We define the direction of walk as the tangent line to the trajectory of \(J_0\) at the point \((J_{0,x}^{i},J_{0,z}^{i})\); the i-th walking direction angle \(\theta _i\) is then estimated by approximating the derivative with the difference quotient:

$$\begin{aligned} \theta _i = \tan ^{-1} \left( \frac{J_{0,x}^{i+1}-J_{0,x}^{i}}{J_{0,z}^{i+1}-J_{0,z}^{i}} \right) . \end{aligned}$$
(1)
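As a rough illustration of the direction check and of Eq. 1, the sketch below assumes that the trajectory of \(J_0\) is stored as a NumPy array with one row per frame. The function names, the array layout and the handling of frames without movement are assumptions made for illustration; they are not part of the Kinect SDK or of the original implementation.

```python
import numpy as np


def is_walking_toward_camera(j0):
    """Return True if the depth (z) of J_0 decreases over the sequence.

    j0: array of shape (n_frames, 3) holding (x, y, z) of joint J_0
    (hypothetical layout chosen for this sketch).
    """
    j0 = np.asarray(j0, dtype=float)
    # Compare the first and the last depth value; a per-frame check would be
    # more robust to noise but is omitted to keep the sketch short.
    return bool(j0[-1, 2] < j0[0, 2])


def walking_angles(j0):
    """Estimate theta_i of Eq. 1 from consecutive positions of J_0.

    A sequence of n frames yields n - 1 angles (in radians).
    """
    j0 = np.asarray(j0, dtype=float)
    dx = np.diff(j0[:, 0])
    dz = np.diff(j0[:, 2])
    with np.errstate(divide="ignore", invalid="ignore"):
        theta = np.arctan(dx / dz)
    # Assumption: when the subject does not move between two frames (0/0),
    # fall back to an angle of zero.
    return np.nan_to_num(theta, nan=0.0)
```

The plain arctangent is kept, as in Eq. 1, so the angle stays in \([-\pi /2, \pi /2]\) regardless of whether the subject approaches or leaves the sensor.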

To change the local coordinate system we first translate the origin of the axes to \(J_0\) and then rotate the axes according to the walking direction and the floor normal \(\mathbf n \). The transformation matrix is defined as

$$\begin{aligned} \mathbf {T}_i= \begin{bmatrix} \cos \theta _i&0&\sin \theta _i\\ a&b&c \\ -\sin \theta _i&0&\cos \theta _i\\ \end{bmatrix} \cdot \begin{bmatrix} 1&0&0&-J_{0,x}^{i}\\ 0&1&0&-J_{0,y}^{i}\\ 0&0&1&-J_{0,z}^{i}\\ \end{bmatrix} . \end{aligned}$$
(2)

This process is repeated for each frame, obtaining a gait sequence in a new coordinate system (X, Y, Z), where axis Y is the normal vector of the Kinect floor plane and Z coincides with the walking direction.
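The change of coordinates can be sketched for a single frame as follows. This is a minimal illustration that applies the matrix of Eq. 2 exactly as printed, assuming the 20 joints of a frame are stored as rows of a NumPy array with \(J_0\) in row 0 and that the floor normal has unit length. The function name and array layout are hypothetical, and whether the signs of the sine terms match a given sensor's axis orientation should be checked on recorded data.

```python
import numpy as np


def to_walking_frame(joints, theta, normal):
    """Apply the transformation of Eq. 2 to one skeleton frame.

    joints: array (20, 3) of Kinect coordinates, row 0 being J_0.
    theta:  walking direction angle of this frame (Eq. 1), in radians.
    normal: floor normal (a, b, c), assumed here to have unit length.
    """
    joints = np.asarray(joints, dtype=float)
    a, b, c = normal
    # Left factor of Eq. 2, with the entries copied as printed.
    rotation = np.array([
        [np.cos(theta), 0.0, np.sin(theta)],
        [a, b, c],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    # Right factor of Eq. 2: translate the origin to J_0.
    centered = joints - joints[0]
    # Rotate every joint (rows are points, hence the transpose).
    return centered @ rotation.T
```

After the transformation \(J_0\) sits at the origin in every frame, which is why it carries no sway information.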

Features Extraction

We extract two kinds of gait features, the first related to the distance between pairs or groups of joints, and the second related to the sway of single joints. The first kind of features aims at describing the gait from the anatomical point of view, by computing the length of the different body parts. These lengths are computed in every frame as the distance, in 3D space, between pairs or groups of joints. Each distance feature is represented by the temporal average of the corresponding measures, so as to obtain a single value for each kind of distance in a gait sequence. We collect 27 distance features, summarized in Tables 1 and 2.

Table 1. Classical anthropometric measures.
Table 2. Distances between adjacent (a) and not-adjacent (b) joints (\(L=left, R=right, C=center\)).
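A single distance feature between a pair of joints, as described above, could be computed along the following lines. The sketch assumes a gait sequence stored as a NumPy array of shape (frames, 20, 3); the function name and the joint indices passed to it are placeholders, since the actual joint pairs are those listed in Tables 1 and 2. Distances defined over groups of joints are not covered.

```python
import numpy as np


def distance_feature(sequence, k, m):
    """Temporal average of the 3D distance between joints k and m.

    sequence: array (n_frames, 20, 3) of joint coordinates
    (hypothetical layout chosen for this sketch).
    """
    sequence = np.asarray(sequence, dtype=float)
    # Euclidean distance between the two joints in every frame.
    per_frame = np.linalg.norm(sequence[:, k] - sequence[:, m], axis=1)
    # One value per gait sequence: the average over time.
    return float(per_frame.mean())
```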

The second kind of features aims at characterizing the gait style, by computing the sway of the body joints along the lateral (X axis) and vertical (Y axis) directions. For each walking sample, composed of a few strides, we consider the time series \(J_{k,X}^i\) and \(J_{k,Y}^i\) of each joint: for each of these two directions, we compute the temporal average and the median absolute deviation. As a result, for each of the 19 joints (the center of mass is excluded, because it coincides with the origin of the system after the change of coordinates) we obtain 4 sway features (temporal average and median absolute deviation for both lateral and vertical directions), for a total of 76 sway features. Considering both distance and sway features, we collect 103 gait parameters. We chose these particular features because in our previous investigation [21] we noticed that the classical gait parameters (such as stride length and walking speed) are poorly estimated by the Kinect sensor, mainly because of its limited depth range, which allows only 3 or 4 strides in a single gait sequence. On the contrary, features like the distance between elbows and knees or the head oscillation appear to be more effective for the recognition task.
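The 76 sway features can be illustrated with a short sketch. It assumes a sequence already expressed in the (X, Y, Z) coordinates and stored as a NumPy array of shape (frames, 20, 3) with \(J_0\) at index 0. The function names and the ordering of the values in the output vector are arbitrary choices made for illustration.

```python
import numpy as np


def median_absolute_deviation(values, axis=0):
    """Median of the absolute deviations from the median along `axis`."""
    med = np.median(values, axis=axis, keepdims=True)
    return np.median(np.abs(values - med), axis=axis)


def sway_features(sequence):
    """Return the 76 sway features of one gait sequence.

    sequence: array (n_frames, 20, 3) in the (X, Y, Z) coordinates.
    """
    sequence = np.asarray(sequence, dtype=float)
    # Drop J_0 (always at the origin) and keep only X and Y: (n, 19, 2).
    xy = sequence[:, 1:, :2]
    mean = xy.mean(axis=0)                       # (19, 2)
    mad = median_absolute_deviation(xy, axis=0)  # (19, 2)
    # 38 temporal averages followed by 38 median absolute deviations.
    return np.concatenate([mean.ravel(), mad.ravel()])
```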

We use these features to build a feature matrix containing the features of all the gait sequences collected for each subject: this matrix is used for the classification task. We construct it as follows. We start with a matrix where each row contains the 103 features collected in a gait sequence. Then we normalize the features by column, in order to obtain values in the range [0,1]. Finally, we reduce the number of features by means of Principal Component Analysis (PCA): we select the minimum number of components whose explained variances sum to more than a fixed threshold. The classification task is carried out through the Support Vector Machine (SVM), a well-established machine learning technique for supervised classification. Given the purpose of the paper, the training phase is accomplished using the samples in our old dataset, while the testing samples have been collected recently.
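The column-wise normalization and the choice of the number of principal components can be sketched with NumPy alone. The function names are hypothetical, and mapping constant columns to zero is an assumption, since the text does not say how such columns are handled.

```python
import numpy as np


def min_max_normalize(features):
    """Rescale every column of the feature matrix to the range [0, 1]."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # assumption: constant columns are mapped to 0
    return (features - lo) / span


def n_components_for_variance(features, threshold):
    """Smallest number of principal components whose cumulative explained
    variance ratio is greater than `threshold` (e.g. 0.9)."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    # Squared singular values are proportional to the explained variances.
    singular = np.linalg.svd(centered, compute_uv=False)
    ratio = singular**2 / np.sum(singular**2)
    cumulative = np.cumsum(ratio)
    # Index of the first cumulative value strictly above the threshold.
    first = int(np.searchsorted(cumulative, threshold, side="right"))
    return min(first + 1, len(cumulative))
```

In practice the same selection is available in common libraries; the explicit version is shown only to make the thresholding rule visible.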

4 Datasets

In this section we introduce the datasets used in our experiments. In [20] we collected a dataset of 20 subjects (called \(KinectUNITO'13\)): each subject was recorded while walking along a straight corridor, facing the camera or with his/her back to it. For each subject we collected 20 gait samples, 10 for each of the two directions (approaching or moving away from the camera). In this particular configuration each gait sequence follows an approximately straight path. For our study we also collect a brand new dataset, slightly different from the previous one. Thanks to the transformation matrix in Eq. 2, we can compare gait sequences with different walking paths, not only straight ones. For this reason, the new dataset (called \(KinectUNITO'16 \)) was acquired in a large lecture hall, letting the subjects follow a curvilinear path. In addition, to reproduce a more realistic uncontrolled surveillance scenario we also collect gait sequences where the subjects carry an accessory: a shoulder bag, a backpack or a smartphone, for a total of 4 different scenarios (without any accessory or with one of these accessories). For each of these 4 scenarios we collect 4 walking sequences, 2 from the frontal view and 2 from the rear one, for a total of 16 samples per subject. This dataset contains the gait sequences of 10 subjects, 8 males and 2 females aged from 30 to 50, all of whom were involved in the old trial. These scenarios have been chosen for a specific reason: the Kinect sensor is set up only for recognizing people standing in front of the camera, and any other configuration represents a challenge for its skeleton tracking capability. The major inaccuracies in skeleton acquisition take place when the arms are partially occluded, as in rear poses; when something covers the arm, like a shoulder strap; or when the arm is bent to hold something. We have added scenarios covering all these conditions to the dataset in order to analyze in detail how serious the performance loss is in the presence of such obstacles.

The gait sequences for both datasets have been collected using the Kinect for Windows v1.

5 Experimental Results

In this section we present the experiments we have carried out to analyze the performance of the proposed method in terms of person identification accuracy. We apply the methods described in Sect. 3 to the samples from the two datasets presented in Sect. 4: in particular, for the classification task we employ C-SVM, where the parameter C has been selected through a 10-fold cross-validation procedure, with both linear (LIN) and radial basis function (RBF) kernels. The performance has been evaluated as the fraction of samples classified correctly among all samples. We use the samples in \(KinectUNITO'13 \) for training the SVM, while the samples in \(KinectUNITO'16 \) are used in the testing phase, in accordance with the purpose of the paper, which is to investigate whether gait can be a distinctive biometric trait over the years: to do so we try to recognize an individual by comparing his/her current gait sequences with those observed a few years before. As reported in Sect. 4, the Kinect sensor has some difficulty in tracking the skeleton in rear samples. For this reason, in a first experiment we employ only the front samples of the two datasets. We then add the rear samples as well, and we compare the performances to analyze how much the rear poses affect the recognition task.
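The evaluation protocol can be sketched with scikit-learn, which is an assumption of this example: the text does not name the SVM implementation that was used. The grid of candidate values for C and the function names are likewise hypothetical. The sketch selects C by 10-fold cross-validation on the training samples only and then measures accuracy as the fraction of correctly classified test samples.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def fit_c_svm(train_x, train_y, kernel="linear",
              c_grid=(0.1, 1.0, 10.0, 100.0)):
    """Train a C-SVM, choosing C by 10-fold cross-validation.

    kernel: "linear" or "rbf"; c_grid: hypothetical candidate values.
    """
    search = GridSearchCV(SVC(kernel=kernel), {"C": list(c_grid)}, cv=10)
    search.fit(train_x, train_y)
    # GridSearchCV refits the best configuration on the whole training set.
    return search.best_estimator_


def accuracy(model, test_x, test_y):
    """Fraction of test samples classified correctly."""
    predicted = model.predict(test_x)
    return float(np.mean(predicted == np.asarray(test_y)))
```

With 10 training samples per subject, stratified 10-fold cross-validation leaves one sample of each subject in every validation fold.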

We start by describing the first classification experiment. Considering only front samples, we train the SVM with 10 samples per subject. For the testing phase we have four different sub-cases, which we label as follows: walking (a) without any object, (b) with a bag, (c) with a backpack, (d) with a smartphone. For each subject we have two front samples per sub-case, for a total of 8 gait sequences. As described in Sect. 3, we apply PCA in order to reduce the number of features: in this case we fix the explained variance threshold to \(90\%\), which reduces the features from 103 to 20. As kernel function for the SVM we select the linear one, because we observe better performance on this dataset. The results of the classification task are shown in Fig. 2, where the x-axis represents the subjects and the y-axis the recognition accuracy in terms of correctly classified samples: each color of the block represents the number of correctly classified samples in each sub-category. Recall that each sub-category contains 2 samples, for a total of 8 samples per subject. We can notice that 7 subjects out of 10 have been recognized in more than \(50\%\) of cases: three of them in more than \(75\%\) of cases, and one subject is correctly classified in all 8 cases. The worst results concern subjects 2 and 7, who are recognized in only one case. As expected, the classification task is most effective when the individual carries nothing: in this case, all the samples of 7 subjects have been correctly classified, and just 1 subject has never been classified correctly. The backpack is not a great obstacle either: 6 subjects have always been classified correctly, and just 1 subject has always been misclassified. The bag decreases the accuracy further (4 subjects always classified correctly, but 5 subjects always misclassified), and the worst performance is obtained when an individual is walking while talking on the phone: in this scenario only 3 subjects have been classified correctly, while all the others have always been misclassified.

Fig. 2. Classification accuracy considering only front samples.

In the second classification experiment we also consider the rear samples of the two datasets: the training set is now composed of 20 samples per subject, and the testing set contains 16 samples per subject, 4 for each sub-case listed above. In this experiment we fix the PCA threshold to \(80\%\), because we have noticed that higher values lead to over-fitting: the features are reduced from 103 to 14. As kernel function we select the RBF, because we have observed that in this case it yields a higher accuracy than the linear kernel. The accuracy of this second classification task is shown in Fig. 3. As in the first experiment, we notice that 7 subjects out of 10 have been correctly classified in more than \(50\%\) of cases. We can also observe a slight improvement in classification for subject 2, and a marked improvement for subject 7. On the other hand, for some subjects (e.g. number 10) the recognition rate decreases when we also consider the rear samples: this worsening is due to the limits of the Kinect sensor, which is less effective in acquiring rear poses, with a negative impact on the training phase. If we analyze each sub-case, we can notice that the classification accuracy for an individual carrying nothing is still high, at least \(50\%\) for each subject. For the backpack sub-case, the accuracy is slightly worse than in the previous experiment, with 3 subjects classified correctly in less than \(50\%\) of cases: during the skeleton tracking we have noticed that the straps of the backpack sometimes occlude an arm. On the other hand, the classification accuracy when an individual is carrying a bag or making a phone call improves when we also consider rear poses: 5 subjects with a smartphone have been correctly classified at least \(50\%\) of the time, and as many as 9 subjects with a bag have been correctly classified at least \(50\%\) of the time.

Fig. 3. Classification accuracy considering both front and rear samples.

Fig. 4. Comparison between classification with only front samples (pink line) and with both front and rear samples (light blue line). (Color figure online)

Finally, in order to understand whether the rear poses acquired by the Kinect sensor can be used for gait analysis, in Fig. 4 we compare the accuracy of the two experiments, without distinguishing between sub-cases: in this figure the subjects are reported on the x-axis, while the y-axis shows the percentage of samples correctly classified in the two experiments (the pink line for the first one, and the light blue line for the second one). For 7 subjects out of 10 the use of only front samples gives a better accuracy, so apparently the rear poses decrease the performance of the classification task. However, if we look at the graphs more carefully, we notice that the line for front poses presents more peaks, both negative and positive, than the line that also includes rear poses, which is more stable, with only one negative peak of accuracy lower than \(30\%\). The accuracy in the second case would be expected to be slightly higher thanks to the larger number of samples used for training the SVM, but we have to consider that the rear samples are less accurate, because of the limits of the Kinect sensor in acquiring the figure in rear pose.

6 Conclusion

In this paper we have proposed a gait recognition system that collects a rich set of static and dynamic gait features, by computing the distance between joints and the sway of joints respectively, from a set of gait samples acquired through the Kinect sensor: the distinctive trait of our system is that it makes the gait samples invariant with respect to the acquisition viewpoint. The recognition system is based on SVM supervised classification, preceded by a phase of feature space reduction through PCA. The experimental analysis has been performed using two datasets acquired three years apart: both datasets include samples in front and rear poses, but the second one also includes gait samples in which people carry some object. The objective of the study is twofold: we want to understand whether gait can be considered an invariant biometric trait over the years, and we also want to analyze how the use of unusual gait samples, i.e. rear poses or people carrying objects, affects the accuracy of gait recognition. Results show that gait makes it possible to recognize a person even after years, or at least after a few years, making it suitable for forensic and security applications: consider, for example, situations where the perpetrator of a crime was observed through surveillance cameras years before the trial. Moreover, we have observed that the presence of accessories makes the recognition task harder when only frontal acquisitions are used, but the addition of rear poses during the learning phase can make the recognition more stable, at the cost of slightly lowering its accuracy. Future work will be devoted to extending the experiments to a larger dataset of subjects, and to exploiting the proposed gait recognition approach in security and forensic applications, as a support to investigation.