
1 Introduction

In recent years, the task of estimating the human pose has been widely explored in the computer vision community. Many deep learning-based algorithms tackling 2D human pose estimation have been proposed [5, 19, 22], along with a comprehensive set of annotated datasets collected both in the real world [1, 8, 11] and in simulation [7, 17]. However, the majority of these works and data collections are based on standard intensity images (i.e. RGB and gray-level data), while datasets and algorithms based only on depth maps, i.e. images in which the value of each pixel represents the distance between the acquisition device and that point in the scene, have been less explored, even though this kind of data contains fine 3D information and can be used in particular settings, such as the automotive one [4, 18], since depth maps are usually acquired through IR sensors.

A milestone in human pose estimation on depth maps is the work of Shotton et al. [15], based on the Random Forest algorithm, which has been implemented in both commercial versions of the Microsoft Kinect SDK. This real-time algorithm has been widely used to automatically produce body joint annotations in depth-based public datasets. However, these annotations have limited accuracy: in [15], the authors report a mean average precision of 0.655 on synthetic data with full rotations.

For these reasons, in this paper we present Watch-R-Patch, a novel refined set of annotations for the well-known Watch-n-Patch dataset [20], whose original annotations were produced with Shotton et al.'s method [15].

Wrong, imprecise, or missing body joints in the original annotations have been manually corrected for 20 training sequences and 20 testing sequences, equally split between the two scenarios of the dataset, i.e. office and kitchen.

Furthermore, we present a deep learning-based architecture, inspired by [5], that performs human pose estimation on depth images only. The model is trained by combining the original Watch-n-Patch dataset with the manually-refined annotations, obtaining remarkable results. Similar to [15], the proposed system achieves real-time performance and can run at more than 180 fps.

2 Related Work

The majority of the literature regarding the human pose estimation task is focused on intensity images [6, 13, 22]. In [19], a sequential architecture is proposed in order to learn implicit spatial models. Dense predictions, which correspond to the final human body joints, are progressively refined through the different stages of the network. The evolution of this method [5] introduces the concept of Part Affinity Fields, which allows learning the links between the body parts of each subject present in the image.

Only a limited number of works is based on depth maps, i.e. images that provide information about the distance of the objects in the scene from the camera. One plausible limitation of depth-based methods is the lack of rich depth-based datasets specifically collected for the human pose estimation task and containing manual body joint annotations. Indeed, available datasets are often small, both in terms of number of annotated frames and in terms of subjects, limiting their usability for the training of deep neural networks. In 2011, a method to quickly predict the positions of body joints from a single depth image was proposed in [15]. An object recognition approach is adopted in order to recast the human pose estimation task as a per-pixel classification problem. The method is based on the random forest algorithm and on a large annotated dataset, which has not been publicly released. A viewpoint-invariant model for human pose estimation was recently proposed in [9], in which a discriminative model embeds local regions into a particular feature space. This work is based on the Invariant-Top View Dataset, a dataset with frontal and top-view recordings of the subjects.

Recently, approaches performing head detection directly on depth maps were proposed in [2, 3]. In [3], a shallow deep neural network is exploited to classify depth patches as head or non-head in order to obtain an estimate of the head centre joint. The Watch-n-Patch dataset [20, 21] has been collected for the task of unsupervised learning of relations and actions. Its body joint annotations are obtained by applying an off-the-shelf method [15], so they are not particularly accurate, especially when subjects stand in a non-frontal position.

Fig. 1. Annotation tool overview. The tool initially shows the original joint locations (a). Then, each joint can be selected to view its name (b) or to move it to the correct location (c). Missing joints can be added in the right position (d) (e). Finally, the annotations (f) can be saved and the next sequence frame is shown.

3 Dataset

In this section, we first give an overview of the Watch-n-Patch dataset [20]. Then, we present the procedure used to improve the original joint annotations and the statistics of the manually refined annotations, which we refer to as Watch-R(efined)-Patch. The dataset will be publicly available (see footnote 1).

3.1 Watch-n-Patch Dataset

Watch-n-Patch [20] is a challenging RGB-D dataset acquired with the second version of the Microsoft Kinect sensor which, differently from the first version, is a Time-of-Flight depth device. The dataset contains recordings of 7 people performing 21 different kinds of actions. Each recording contains a single subject performing multiple actions in one room, chosen among 8 offices and 5 kitchens.

The dataset contains 458 videos, corresponding to about 230 min and 78k frames. The authors provide both RGB and depth frames (with a spatial resolution of \(1920 \times 1080\) and \(512 \times 424\), respectively) and human body skeletons (composed of 25 body joints) estimated and tracked with the method proposed in [15].

Fig. 2. Watch-R-Patch dataset overview. Kitchen and office sequences are shown in the first and second row, respectively.

3.2 Annotation Procedure

We collect refined annotations for the Watch-n-Patch dataset using a quick and easy-to-use annotation tool. In particular, we develop a system that shows the original body joints (i.e. the Watch-n-Patch joints) on top of the acquired depth image. The user can then move the incorrect joints to the proper positions with the mouse, in a drag-and-drop fashion. Once every incorrect joint has been placed in the correct location, the user can save the new annotation and move to the next frame. It is worth noting that, in this way, the user only has to move the joints that are in the wrong position, while already-correct joints do not need to be moved or inserted. Therefore, the original correct joints are preserved, while wrongly-predicted joints are improved. We ignore finger joints (tip and thumb) since their original annotations are not reliable and these joints are often occluded. An overview of the developed annotation tool is shown in Fig. 1. The annotation tool is publicly released (see footnote 2).

3.3 Statistics

We manually annotate body joints in 20 sequences from the original training set and 20 sequences from the original testing set, equally split between office and kitchen scenarios. To speed up the annotation procedure and increase scene variability, we fine-annotate one frame out of every three in the original sequences; in some test sequences, every frame has been fine-annotated. The overall number of annotated frames is 3329: 1135 in the training set, 766 in the validation set, and 1428 in the testing set. We also propose an official validation set for the refined annotations, composed of a subset of the testing set, in order to standardize the validation and testing procedures.

For additional statistics regarding the annotated sequences and the proposed train, validation, and test splits, please refer to Table 1. A qualitative overview of the dataset is reported in Fig. 2.

Table 1. Statistics of the Watch-R-Patch dataset.

4 Proposed Method

In the development of the human pose estimation architecture, we focus on both the performance (in terms of mean Average Precision (mAP)) and the speed (in terms of frames per second (fps)).

To guarantee high performance, we develop a deep neural network derived from [5], while, to guarantee a high frame rate even on cheap hardware, we do not include the Part Affinity Fields branch (for details about PAFs, see [5]).

4.1 Network Architecture

An overview of the proposed architecture is shown in Fig. 3.

The first part of the architecture is composed of a VGG-like feature extraction block which comprises the first 10 layers of VGG-19 [16] and two layers that gradually reduce the number of feature maps to the desired value. In contrast to [5], we do not use ImageNet pre-trained weights and we train these layers from scratch in conjunction with the rest of the architecture since the input is represented by depth maps in place of RGB images.
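As a concrete illustration, the following is a minimal PyTorch sketch of such a feature extraction block, assuming a single-channel depth input and 128 output feature maps; the exact layer configuration and channel counts are the ones reported in Fig. 3, so the values below are only indicative.

```python
# Minimal sketch of the VGG-like feature extractor: the first 10 convolutional
# layers of VGG-19 followed by two layers that reduce the number of feature
# maps. Channel counts and the single-channel input are assumptions.
import torch
import torch.nn as nn


def vgg_stage(in_ch, out_ch, n_convs, pool=True):
    """n_convs 3x3 conv+ReLU layers, optionally followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return layers


class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=1, out_channels=128):
        super().__init__()
        # First 10 conv layers of VGG-19, trained from scratch on depth maps
        # (no ImageNet pre-training, see the text above).
        vgg = (vgg_stage(in_channels, 64, 2) +
               vgg_stage(64, 128, 2) +
               vgg_stage(128, 256, 4) +
               vgg_stage(256, 512, 2, pool=False))
        # Two layers that gradually reduce the number of feature maps.
        reduce = (vgg_stage(512, 256, 1, pool=False) +
                  vgg_stage(256, out_channels, 1, pool=False))
        self.features = nn.Sequential(*(vgg + reduce))

    def forward(self, depth):           # depth: (B, 1, H, W)
        return self.features(depth)     # F:     (B, 128, H/8, W/8)
```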

The feature extraction module is followed by a convolutional block that produces an initial coarse prediction of human body joints analyzing the image features extracted by the previous block only. The output of this part can be expressed as:

$$\begin{aligned} \mathbf {P}^1 = \phi (\mathbf {F}, \theta ^1) \end{aligned}$$
(1)

where \(\mathbf {F}\) are the feature maps computed by the feature extraction module and \(\phi \) is a parametric function that represents the first convolutional block of the architecture, with parameters \(\theta ^1\). Here, \(\mathbf {P}^1 \in \mathbb {R}^{K \times w \times h}\).
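A possible PyTorch sketch of this first prediction block \(\phi\) is given below; the number and size of its convolutions are an assumption loosely based on [5], while the exact configuration is the one in Fig. 3.

```python
# Sketch of the first prediction block phi of Eq. (1); layer widths and
# kernel sizes are assumptions, the exact ones are given in Fig. 3.
import torch.nn as nn


class InitialBlock(nn.Module):
    def __init__(self, feat_channels=128, n_joints=21):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 512, 1), nn.ReLU(inplace=True),
            nn.Conv2d(512, n_joints, 1),      # one coarse heatmap per joint
        )

    def forward(self, feats):                 # feats = F: (B, 128, h, w)
        return self.block(feats)              # P^1:       (B, K, h, w)
```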

Then, a multi-stage architecture is employed. A common convolutional block is sequentially repeated \(T - 1\) times in order to gradually refine the body joint prediction. At each stage, this block analyzes the concatenation of the features extracted by the feature extraction module and the output of the previous stage, refining the earlier prediction. The output at each step can be represented with

$$\begin{aligned} \mathbf {P}^t = \psi ^t(\mathbf {F} \oplus \mathbf {P}^{t-1}, \theta ^t) \quad \forall t \in [2, T] \end{aligned}$$
(2)

where \(\mathbf {F}\) are the feature maps computed by the feature extraction module, \(\mathbf {P}^{t-1}\) is the prediction of the previous block, \(\oplus \) is the concatenation operation, and \(\psi ^t\) is a parametric function that represents the repeated convolutional block of the architecture with parameters \(\theta ^t\). As in the previous case, \(\mathbf {P}^t \in \mathbb {R}^{K \times w \times h}\).
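Putting the pieces together, the multi-stage refinement of Eq. (2) can be sketched in PyTorch as follows, reusing the FeatureExtractor and InitialBlock sketches above; kernel sizes and widths of the repeated block are again assumptions (see Fig. 3 for the actual configuration).

```python
# Sketch of the repeated refinement block psi^t of Eq. (2) and of the full
# multi-stage forward pass; widths and kernels are assumptions (see Fig. 3).
import torch
import torch.nn as nn


class RefinementBlock(nn.Module):
    def __init__(self, feat_channels=128, n_joints=21):
        super().__init__()
        in_ch = feat_channels + n_joints        # F concatenated with P^{t-1}
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_joints, 1),
        )

    def forward(self, feats, prev):
        return self.block(torch.cat([feats, prev], dim=1))


class PoseNet(nn.Module):
    def __init__(self, n_joints=21, n_stages=6):
        super().__init__()
        self.backbone = FeatureExtractor()
        self.stage1 = InitialBlock(n_joints=n_joints)
        self.stages = nn.ModuleList(
            [RefinementBlock(n_joints=n_joints) for _ in range(n_stages - 1)])

    def forward(self, depth):
        feats = self.backbone(depth)
        preds = [self.stage1(feats)]            # P^1
        for stage in self.stages:               # P^2, ..., P^T
            preds.append(stage(feats, preds[-1]))
        return preds                            # one prediction per stage
```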

The model is implemented in the popular PyTorch framework [14]. Further details regarding the network architecture are reported in Fig. 3.

Fig. 3. Proposed architecture overview. Each block contains its type (C: convolution, MP: max pool), the kernel size, the number of feature maps, and the number of repetitions (if higher than 1). In our experiments, \(K = 21\) and \(T = 6\).

4.2 Training Procedure

The architecture is trained in an end-to-end manner applying the following objective function

$$\begin{aligned} L^t = \sum _{k = 1}^{K} \alpha _k \cdot \sum _{\mathbf {p}} \Vert \mathbf {P}^t_{k}(\mathbf {p}) - \mathbf {H}_{k}(\mathbf {p}) \Vert ^{2}_{2}, \end{aligned}$$
(3)

where K is the number of considered body joints, \(\alpha _k\) is a binary mask with \(\alpha _k = 0\) if the annotation of joint k is missing, t is the current stage, and \(\mathbf {p} \in \mathbb {R}^2\) is the spatial location.

Here, \(\mathbf {P}^t_{k}(\mathbf {p})\) represents the prediction at location \(\mathbf {p}\) for joint k while \(\mathbf {H}_{k} \in \mathbb {R}^{w \times h}\) is the ground-truth heatmap for joint k, defined as

$$\begin{aligned} \mathbf {H}_{k}(\mathbf {p}) = e^{- ||\mathbf {p}-\mathbf {x}_{k}||_2^2 \, \cdot \, \sigma ^{-2}} \end{aligned}$$
(4)

where \(\mathbf {p} \in \mathbb {R}^2\) is the location in the heatmap, \(\mathbf {x}_{k} \in \mathbb {R}^2\) is the location of joint k, and \(\sigma \) is a parameter to control the Gaussian spread. We set \(\sigma = 7\).
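For illustration, a minimal NumPy sketch of this ground-truth heatmap (Eq. 4, with \(\sigma = 7\)) could be written as follows, assuming joint coordinates are already expressed in heatmap pixels.

```python
# Sketch of the ground-truth heatmap of Eq. (4); coordinates are assumed to
# be already scaled to the heatmap resolution.
import numpy as np


def joint_heatmap(x_k, w, h, sigma=7.0):
    """Gaussian heatmap H_k centred on the joint location x_k = (x, y)."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))     # pixel grid p
    d2 = (xs - x_k[0]) ** 2 + (ys - x_k[1]) ** 2         # ||p - x_k||^2
    return np.exp(-d2 / sigma ** 2)                      # shape (h, w)
```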

Therefore, the overall objective function can be expressed as \(L = \sum _{t = 1}^{T} L^t\) where T is the number of stages. In our experiments, \(T = 6\).

As outlined in [5], applying the supervision at every stage of the network mitigates the vanishing gradient problem and, in conjunction with the sequential refining of the body joint prediction, leads to a faster and more effective training of the whole architecture.
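A minimal PyTorch sketch of this objective (Eq. 3 accumulated over the T stages, with the binary mask \(\alpha\) zeroing out missing annotations) is given below; the tensor shapes are assumptions.

```python
# Sketch of the training objective: Eq. (3) summed over all stages
# (intermediate supervision). Shapes are assumptions.
import torch


def pose_loss(preds, heatmaps, alpha):
    """
    preds:    list of T tensors of shape (B, K, h, w), one per stage
    heatmaps: ground-truth heatmaps, shape (B, K, h, w)
    alpha:    binary mask, shape (B, K), 0 where the annotation is missing
    """
    mask = alpha[:, :, None, None]          # broadcast over spatial locations
    loss = 0.0
    for pred in preds:                      # L = sum_t L^t
        loss = loss + ((pred - heatmaps) ** 2 * mask).sum()
    return loss
```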

The network is trained in two steps. In the first step, the original body joint annotations of Watch-n-Patch are employed to train the whole architecture from scratch. It is worth noting that the Watch-n-Patch body joints are inferred by the Kinect SDK, which relies on a random forest-based algorithm [15].

In the second step, the network is finetuned using the training set of the presented dataset. During this phase, we test three procedures: in the first, the whole architecture is finetuned; in the second, the feature extraction block is frozen and not updated; in the last, all the blocks but the last one are frozen and not updated. A minimal sketch of how these variants can be obtained is shown below.
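The sketch assumes `model` is a `PoseNet` instance (from the Sect. 4.1 sketches) already trained on the original annotations; freezing is done by simply disabling gradients.

```python
# Sketch of the three finetuning variants; module names refer to the PoseNet
# sketch above. Pick one variant per experiment.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = PoseNet(n_joints=21, n_stages=6)   # assumed pre-trained on Watch-n-Patch

# Ours:      finetune the whole architecture (nothing frozen).
# Ours_blk:  freeze the feature extraction block only.
set_trainable(model.backbone, False)
# Ours_last: freeze everything but the last repeated block psi^6.
set_trainable(model, False)
set_trainable(model.stages[-1], True)
```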

During both training and finetuning, we apply data augmentation techniques and dropout regularization to improve the generalization capabilities of the model. In particular, we apply random horizontal flip, crop (extracting a portion of \(488 \times 400\) from the original image with size \(512 \times 424\)), resize (to the crop dimension), and rotation (degrees in \([-4^\circ , +4^\circ ]\)). Dropout is applied between the first convolutional block and each repeated block.
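The following is a rough torchvision-based sketch of this augmentation (flip, \(488 \times 400\) crop, small rotation); the handling of the joint coordinates and the resize step are simplified assumptions.

```python
# Rough sketch of the data augmentation; joint handling is simplified and the
# resize step mentioned in the text is omitted for brevity.
import random
import torch
import torchvision.transforms.functional as TF


def augment(depth, joints):
    """depth: (1, 424, 512) tensor; joints: (K, 2) tensor of (x, y) pixels."""
    if random.random() < 0.5:                        # random horizontal flip
        depth = TF.hflip(depth)
        joints[:, 0] = depth.shape[-1] - 1 - joints[:, 0]
    top = random.randint(0, 424 - 400)               # random 488 x 400 crop
    left = random.randint(0, 512 - 488)
    depth = TF.crop(depth, top, left, 400, 488)
    joints = joints - torch.tensor([left, top], dtype=joints.dtype)
    angle = random.uniform(-4.0, 4.0)                # rotation in [-4, +4] degrees
    depth = TF.rotate(depth, angle)
    # joint coordinates would be rotated around the image centre (omitted)
    return depth, joints
```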

In our experiments, we employ the Adam optimizer [10] with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and weight decay set to \(1 \cdot 10^{-4}\). During the training phase, we use a learning rate of \(1 \cdot 10^{-4}\) while, during the finetuning step, we use a learning rate of \(1 \cdot 10^{-4}\) and apply dropout regularization with a dropout probability of 0.5.
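In PyTorch, and interpreting these values as Adam's exponential decay rates, the setup amounts to something like the following sketch (`model` being the PoseNet sketch from Sect. 4.1).

```python
# Sketch of the optimizer setup; betas are interpreted as Adam's beta_1/beta_2.
import torch

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,
                             betas=(0.9, 0.999),
                             weight_decay=1e-4)
```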

5 Experimental Results

5.1 Evaluation Procedure

We adopt an evaluation procedure following the one proposed for the COCO Keypoints Challenge on the COCO website [12].

In detail, we employ the mean Average Precision (mAP) to assess the quality of the human pose estimations with respect to the ground-truth positions. The mAP is defined as the mean of 10 Average Precision values calculated with different Object Keypoint Similarity (OKS) thresholds:

$$\begin{aligned} \text {mAP} = \frac{1}{10} \, \sum _{i = 1}^{10} \text {AP}^{\text {OKS} = 0.45 + 0.05 i} \end{aligned}$$
(5)
Table 2. Comparison of the mAP reached by different methods computed on the Watch-R-Patch dataset. See Sect. 4 for further details.

The OKS is defined as

$$\begin{aligned} \text {OKS} = \frac{\sum _i^K [ \delta (v_i> 0) \cdot \exp {\frac{-d_i^2}{2 s^2 k_i^2}} ]}{\sum _i^K [ \delta (v_i > 0) ]} \end{aligned}$$
(6)

where \(d_i\) is the Euclidean distance between the ground-truth and the predicted location of the keypoint i, s is the area containing all the keypoints, and \(k_i\) is defined as \(k_i = 2 \sigma _i\). Finally, \(v_i\) is a visibility flag: \(v_i=0\) means that keypoint i is not labeled while \(v_i=1\) means that keypoint i is labeled.

The values of \(\varvec{\sigma }\) depend on the dimension of each joint of the human body. In particular, we use the following values: \(\sigma _i = 0.107\) for the spine, the neck, the head, and the hip joints; \(\sigma _i = 0.089\) for the ankle and the foot joints; \(\sigma _i = 0.087\) for the knee joints; \(\sigma _i = 0.079\) for the shoulder joints; \(\sigma _i = 0.072\) for the elbow joints; \(\sigma _i = 0.062\) for the wrist and the hand joints.
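For reference, a simplified NumPy sketch of the evaluation is given below; since each frame contains a single subject, AP at a given threshold is approximated here as the fraction of poses whose OKS exceeds it, which is a simplification of the full COCO matching procedure.

```python
# Simplified sketch of OKS (Eq. 6) and of the mAP over ten OKS thresholds
# (Eq. 5); the single-subject assumption reduces AP to a detection rate.
import numpy as np


def oks(pred, gt, visible, s, sigmas):
    """pred, gt: (K, 2) arrays; visible: (K,) booleans; s: keypoint area;
    sigmas: per-joint values listed above (k_i = 2 * sigma_i)."""
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared distances d_i^2
    k2 = (2.0 * sigmas) ** 2
    e = np.exp(-d2 / (2.0 * s ** 2 * k2))
    return e[visible].mean()


def mean_ap(oks_values):
    """oks_values: array with one OKS value per annotated pose."""
    thresholds = 0.45 + 0.05 * np.arange(1, 11)  # 0.50, 0.55, ..., 0.95
    return float(np.mean([(oks_values >= t).mean() for t in thresholds]))
```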

Table 3. mAP of each body joint present in the Watch-R-Patch dataset.

5.2 Results

Following the evaluation procedure described in Sect. 5.1, we perform extensive experimental evaluations in order to assess the quality of the proposed dataset and method. Results are reported in Table 2.

Fig. 4. Qualitative results obtained on the Watch-R-Patch dataset.

First, we assess the accuracy obtained by our architecture after a training step employing the original Watch-n-Patch dataset. This experiment corresponds to \(\text {Ours}_{{\textit{orig}}}\) in Table 2. As expected, when trained on the Kinect annotations, our model learns to predict human body joints consistently with Shotton et al.'s method [15], reaching a remarkable mAP of 0.777 on the Watch-n-Patch testing set.

We also test the performance of the network employing our annotations as the ground truth. In this case, our method reaches a mAP of 0.729, outperforming Shotton et al.'s method by an absolute margin of 0.119. It is worth noting that our method has been trained on the Kinect annotations only, yet its overall performance on the manually-annotated sequences is considerably higher than that of [15]. We argue that the proposed architecture has better generalization capabilities than the method proposed in [15], even though it has been trained on the predictions of [15], and therefore it obtains a higher mAP when tested on scenes with actual body joint annotations.

Then, we report the results obtained applying different finetuning procedures. In particular, we first train the proposed network on the original Watch-n-Patch annotations, then we finetune the model on the proposed annotations while updating different parts of the architecture. In the experiment \(\text {Ours}_{{\textit{last}}}\), we freeze the parameters of all but the last repeated block, which means updating only the parameters \(\theta ^6\) of the last convolutional block \(\psi ^6\). In \(\text {Ours}_{{\textit{blk}}}\), we freeze the parameters of the feature extraction block, i.e. only the parameters of \(\phi \) and \(\psi ^t\) are updated. Finally, in the experiment \(\text {Ours}\), we finetune the whole network. As shown in Table 2, finetuning the whole architecture leads to the highest \(\text {AP}^{\text {OKS}=0.50}\), \(\text {AP}^{\text {OKS}=0.75}\), and mAP scores. The proposed model, trained on the original Watch-n-Patch dataset and finetuned on the presented annotations, reaches a remarkable mAP of 0.797, outperforming Shotton et al.'s method by an absolute gain of 0.187.

Finally, we report per-joint mAP scores in Table 3. As can be observed, the proposed method outperforms the competitor and the baseline on nearly every joint, confirming the capabilities and the quality of the model and of the employed training procedure. Qualitative results are reported in Fig. 4.

The model runs in real time (5.37 ms per frame, 186 fps) on a workstation equipped with an Intel Core i7-6850K CPU and an Nvidia 1080 Ti GPU.
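As an indication of how such a measurement can be obtained, the following is a minimal GPU timing sketch for the PoseNet model defined in Sect. 4.1; the warm-up length, batch size, and iteration count are assumptions.

```python
# Minimal GPU timing sketch; warm-up, batch size, and iteration count are
# assumptions, not the exact benchmarking protocol of the paper.
import time
import torch

model = PoseNet(n_joints=21, n_stages=6).cuda().eval()
frame = torch.randn(1, 1, 424, 512, device="cuda")    # one depth frame

with torch.no_grad():
    for _ in range(10):                                # warm-up iterations
        model(frame)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(frame)
    torch.cuda.synchronize()
    per_frame = (time.time() - start) / 100

print(f"{per_frame * 1000:.2f} ms per frame, {1.0 / per_frame:.0f} fps")
```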

6 Conclusions

In this paper, we have investigated human pose estimation on depth maps. We have proposed a simple annotation-refinement tool and a novel set of fine joint annotations for a representative subset of the Watch-n-Patch dataset, which we have published free of charge. We have presented a deep learning-based architecture that performs human pose estimation by means of body joints, reaching state-of-the-art results on the challenging fine annotations of the Watch-R-Patch dataset. As future work, we plan to publicly release the annotation tool and to complete the annotation of the Watch-n-Patch dataset.