
1 Introduction

In recent years, the task of estimating the human pose has been widely explored in the computer vision community. Many deep learning-based algorithms tackling 2D human pose estimation have been proposed [5, 19, 22], along with a comprehensive set of annotated datasets collected both in the real world [1, 8, 11] and in simulation [7, 17]. However, the majority of these works and data collections are based on standard intensity images (i.e. RGB and gray-level data), while datasets and algorithms based only on depth maps, i.e. images in which the value of each pixel represents the distance between the acquisition device and that point in the scene, have been less explored, even though this kind of data contains fine 3D information and can be used in particular settings, such as the automotive one [4, 18], since depth maps are usually acquired through IR sensors.

A milestone in human pose estimation on depth maps is the work of Shotton et al. [15], based on the Random Forest algorithm, which has been implemented in both commercial versions of the Microsoft Kinect SDK. This real-time algorithm has been widely used to automatically produce body joint annotations in depth-based public datasets. However, these annotations have limited accuracy: in [15], the authors report a mean average precision of 0.655 on synthetic data with full rotations.

For these reasons, in this paper we present Watch-R-Patch, a novel refined set of annotations for the well-known Watch-n-Patch dataset [20], whose original annotations were produced with Shotton et al.'s method [15].

Wrong, imprecise, or missing body joints in the original annotations have been manually corrected for 20 training sequences and 20 testing sequences, equally split between the two scenarios of the dataset, i.e. office and kitchen.

Furthermore, we present a deep learning-based architecture, inspired by [5], that performs human pose estimation on depth images only. The model is trained by combining the original Watch-n-Patch dataset with the manually-refined annotations, obtaining remarkable results. Similar to [15], the proposed system achieves real-time performance and can run at more than 180 fps.

2 Related Work

The majority of the literature regarding the human pose estimation task is focused on intensity images [6, 13, 22]. In [19], a sequential architecture is proposed in order to learn implicit spatial models. Dense predictions, which correspond to the final human body joints, are progressively refined through the different stages of the network. The evolution of this method [5] introduces the concept of Part Affinity Fields, which allows learning the links between the body parts of each subject present in the image.

Only a limited number of works is based on depth maps, i.e. images that provide information about the distance of the objects in the scene from the camera. One plausible limitation of depth-based methods is the lack of rich depth-based datasets specifically collected for the human pose estimation task and containing manual body joint annotations. Indeed, available datasets are often small, both in terms of number of annotated frames and in terms of subjects, limiting their usability for the training of deep neural networks. In 2011, a method to quickly predict the positions of body joints from a single depth image was proposed in [15]. An object recognition approach is adopted in order to recast the human pose estimation task as a per-pixel classification problem. The method is based on the random forest algorithm and on a large annotated dataset, which has not been publicly released. A viewpoint-invariant model for human pose estimation was recently proposed in [9], in which a discriminative model embeds local regions into a particular feature space. This work is based on the Invariant-Top View Dataset, a dataset with frontal and top-view recordings of the subjects.

Recently, approaches performing head detection directly on depth maps were proposed in [2, 3]. In [3], a shallow deep neural network is exploited to classify depth patches as head or non-head in order to obtain an estimate of the head centre joint. The Watch-n-Patch dataset [20, 21] has been collected for the task of unsupervised learning of relations and actions. Its body joint annotations are obtained by applying an off-the-shelf method [15], so they are not particularly accurate, especially when subjects stand in a non-frontal position.

Fig. 1. Annotation tool overview. The tool initially shows the original joint locations (a). Then, each joint can be selected to view its name (b) or to move it to the correct location (c). Missing joints can be added in the right position (d) (e). Finally, the annotations (f) can be saved and the next sequence frame is shown.

3 Dataset

In this section, we first give an overview of the Watch-n-Patch dataset [20]. Then, we present the procedure used to improve the original joint annotations and the statistics of the manually refined annotations, which we refer to as Watch-R(efined)-Patch. The dataset will be publicly available (see footnote 1).

3.1 Watch-n-Patch Dataset

Watch-n-Patch [20] is a challenging RGB-D dataset acquired with the second version of the Microsoft Kinect sensor which, differently from the first version, is a Time-of-Flight depth device. The dataset contains recordings of 7 people performing 21 different kinds of actions. Each recording contains a single subject performing multiple actions in one room, chosen among 8 offices and 5 kitchens.

The dataset contains 458 videos, corresponding to about 230 min and 78k frames. The authors provide both RGB and depth frames (with a spatial resolution of \(1920 \times 1080\) and \(512 \times 424\), respectively) and human body skeletons (composed of 25 body joints) estimated and tracked with the method proposed in [15].

Fig. 2. Watch-R-Patch dataset overview. Kitchen and office sequences are shown in the first and second row, respectively.

3.2 Annotation Procedure

We collect refined annotations for the Watch-n-Patch dataset using a quick and easy-to-use annotation tool. In particular, we develop a system that shows the original body joints (i.e. the Watch-n-Patch joints) on top of the acquired depth image. The user can then move the incorrect joints to the proper positions with the mouse, in a drag-and-drop fashion. Once every incorrect joint has been placed in the correct location, the user can save the new annotation and move to the next frame. It is worth noting that, in this way, the user only has to move the joints that are in the wrong position, while already-correct joints do not need to be moved or inserted. Therefore, the original correct joints are preserved, while wrongly-predicted joints are improved. We ignore finger joints (tip and thumb) since their original annotations are not reliable and these joints are often occluded. An overview of the developed annotation tool is shown in Fig. 1. The annotation tool is publicly released (see footnote 2).

3.3 Statistics

We manually annotate body joints in 20 sequences from the original training set and 20 sequences from the original testing set, equally split between office and kitchen scenarios. To speed up the annotation procedure and increase scene variability, we fine-annotate one frame out of every three in the original sequences; in some test sequences, every frame has been fine-annotated. The overall number of annotated frames is 3329: 1135 in the training set, 766 in the validation set, and 1428 in the testing set. We also propose an official validation set for the refined annotations, composed of a subset of the testing set, in order to standardize the validation and testing procedures.

For additional statistics regarding the annotated sequences and the proposed train, validation, and test splits, please refer to Table 1. A qualitative overview of the dataset is reported in Fig. 2.

Table 1. Statistics of the Watch-R-Patch dataset.

4 Proposed Method

In the development of the human pose estimation architecture, we focus on both the performance (in terms of mean Average Precision (mAP)) and the speed (in terms of frames per second (fps)).

To guarantee high performance, we develop a deep neural network derived from [5], while, to guarantee a high frame rate even on cheap hardware, we do not include the Part Affinity Fields branch (for details about PAFs, see [5]).

4.1 Network Architecture

An overview of the proposed architecture is shown in Fig. 3.

The first part of the architecture is composed of a VGG-like feature extraction block which comprises the first 10 layers of VGG-19 [16] and two layers that gradually reduce the number of feature maps to the desired value. In contrast to [5], we do not use ImageNet pre-trained weights and we train these layers from scratch in conjunction with the rest of the architecture since the input is represented by depth maps in place of RGB images.
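As a concrete illustration, the following is a minimal PyTorch sketch of such a feature extraction block, assuming a single-channel depth input and 128 output feature maps; the exact layer configuration and channel counts are the ones reported in Fig. 3, so the values below are only indicative.

```python
# Minimal sketch of the VGG-like feature extractor: the first 10 convolutional
# layers of VGG-19 followed by two layers that reduce the number of feature
# maps. Channel counts and the single-channel input are assumptions.
import torch
import torch.nn as nn


def vgg_stage(in_ch, out_ch, n_convs, pool=True):
    """n_convs 3x3 conv+ReLU layers, optionally followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return layers


class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=1, out_channels=128):
        super().__init__()
        # First 10 conv layers of VGG-19, trained from scratch on depth maps
        # (no ImageNet pre-training, see the text above).
        vgg = (vgg_stage(in_channels, 64, 2) +
               vgg_stage(64, 128, 2) +
               vgg_stage(128, 256, 4) +
               vgg_stage(256, 512, 2, pool=False))
        # Two layers that gradually reduce the number of feature maps.
        reduce = (vgg_stage(512, 256, 1, pool=False) +
                  vgg_stage(256, out_channels, 1, pool=False))
        self.features = nn.Sequential(*(vgg + reduce))

    def forward(self, depth):           # depth: (B, 1, H, W)
        return self.features(depth)     # F:     (B, 128, H/8, W/8)
```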

The feature extraction module is followed by a convolutional block that produces an initial coarse prediction of human body joints analyzing the image features extracted by the previous block only. The output of this part can be expressed as:

$$\begin{aligned} \mathbf {P}^1 = \phi (\mathbf {F}, \theta ^1) \end{aligned}$$
(1)

where \(\mathbf {F}\) are the feature maps computed by the feature extraction module and \(\phi \) is a parametric function that represents the first convolutional block of the architecture, with parameters \(\theta ^1\). Here, \(\mathbf {P}^1 \in \mathbb {R}^{K \times w \times h}\).
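A possible PyTorch sketch of this first prediction block \(\phi\) is given below; the number and size of its convolutions are an assumption loosely based on [5], while the exact configuration is the one in Fig. 3.

```python
# Sketch of the first prediction block phi of Eq. (1); layer widths and
# kernel sizes are assumptions, the exact ones are given in Fig. 3.
import torch.nn as nn


class InitialBlock(nn.Module):
    def __init__(self, feat_channels=128, n_joints=21):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(feat_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 512, 1), nn.ReLU(inplace=True),
            nn.Conv2d(512, n_joints, 1),      # one coarse heatmap per joint
        )

    def forward(self, feats):                 # feats = F: (B, 128, h, w)
        return self.block(feats)              # P^1:       (B, K, h, w)
```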

Then, a multi-stage architecture is employed. A common convolutional block is sequentially repeated \(T - 1\) times in order to gradually refine the body joint prediction. At each stage, this block analyzes the concatenation of the features extracted by the feature extraction module and the output of the previous stage, refining the earlier prediction. The output at each step can be represented with

$$\begin{aligned} \mathbf {P}^t = \psi ^t(\mathbf {F} \oplus \mathbf {P}^{t-1}, \theta ^t) \quad \forall t \in [2, T] \end{aligned}$$
(2)

where \(\mathbf {F}\) are the feature maps computed by the feature extraction module, \(\mathbf {P}^{t-1}\) is the prediction of the previous block, \(\oplus \) is the concatenation operation, and \(\psi ^t\) is a parametric function that represents the repeated convolutional block of the architecture with parameters \(\theta ^t\). As in the previous case, \(\mathbf {P}^t \in \mathbb {R}^{K \times w \times h}\).
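Putting the pieces together, the multi-stage refinement of Eq. (2) can be sketched in PyTorch as follows, reusing the FeatureExtractor and InitialBlock sketches above; kernel sizes and widths of the repeated block are again assumptions (see Fig. 3 for the actual configuration).

```python
# Sketch of the repeated refinement block psi^t of Eq. (2) and of the full
# multi-stage forward pass; widths and kernels are assumptions (see Fig. 3).
import torch
import torch.nn as nn


class RefinementBlock(nn.Module):
    def __init__(self, feat_channels=128, n_joints=21):
        super().__init__()
        in_ch = feat_channels + n_joints        # F concatenated with P^{t-1}
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, n_joints, 1),
        )

    def forward(self, feats, prev):
        return self.block(torch.cat([feats, prev], dim=1))


class PoseNet(nn.Module):
    def __init__(self, n_joints=21, n_stages=6):
        super().__init__()
        self.backbone = FeatureExtractor()
        self.stage1 = InitialBlock(n_joints=n_joints)
        self.stages = nn.ModuleList(
            [RefinementBlock(n_joints=n_joints) for _ in range(n_stages - 1)])

    def forward(self, depth):
        feats = self.backbone(depth)
        preds = [self.stage1(feats)]            # P^1
        for stage in self.stages:               # P^2, ..., P^T
            preds.append(stage(feats, preds[-1]))
        return preds                            # one prediction per stage
```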

The model is implemented in the popular PyTorch framework [14]. Further details regarding the network architecture are reported in Fig. 3.

Fig. 3. Proposed architecture overview. Each block contains its type (C: convolution, MP: max pool), the kernel size, the number of feature maps, and the number of repetitions (if higher than 1). In our experiments, \(K = 21\) and \(T = 6\).

4.2 Training Procedure

The architecture is trained in an end-to-end manner applying the following objective function

$$\begin{aligned} L^t = \sum _{k = 1}^{K} \alpha _k \cdot \sum _{\mathbf {p}} \Vert \mathbf {P}^t_{k}(\mathbf {p}) - \mathbf {H}_{k}(\mathbf {p}) \Vert ^{2}_{2}, \end{aligned}$$
(3)

where K is the number of considered body joints, \(\alpha _k\) is a binary mask with \(\alpha _k = 0\) if the annotation of joint k is missing, t is the current stage, and \(\mathbf {p} \in \mathbb {R}^2\) is the spatial location.

Here, \(\mathbf {P}^t_{k}(\mathbf {p})\) represents the prediction at location \(\mathbf {p}\) for joint k while \(\mathbf {H}_{k} \in \mathbb {R}^{w \times h}\) is the ground-truth heatmap for joint k, defined as

$$\begin{aligned} \mathbf {H}_{k}(\mathbf {p}) = e^{- ||\mathbf {p}-\mathbf {x}_{k}||_2^2 \, \cdot \, \sigma ^{-2}} \end{aligned}$$
(4)

where \(\mathbf {p} \in \mathbb {R}^2\) is the location in the heatmap, \(\mathbf {x}_{k} \in \mathbb {R}^2\) is the location of joint k, and \(\sigma \) is a parameter to control the Gaussian spread. We set \(\sigma = 7\).
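For illustration, a minimal NumPy sketch of this ground-truth heatmap (Eq. 4, with \(\sigma = 7\)) could be written as follows, assuming joint coordinates are already expressed in heatmap pixels.

```python
# Sketch of the ground-truth heatmap of Eq. (4); coordinates are assumed to
# be already scaled to the heatmap resolution.
import numpy as np


def joint_heatmap(x_k, w, h, sigma=7.0):
    """Gaussian heatmap H_k centred on the joint location x_k = (x, y)."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))     # pixel grid p
    d2 = (xs - x_k[0]) ** 2 + (ys - x_k[1]) ** 2         # ||p - x_k||^2
    return np.exp(-d2 / sigma ** 2)                      # shape (h, w)
```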

Therefore, the overall objective function can be expressed as \(L = \sum _{t = 1}^{T} L^t\) where T is the number of stages. In our experiments, \(T = 6\).

As outlined in [5], applying the supervision at every stage of the network mitigates the vanishing gradient problem and, in conjunction with the sequential refining of the body joint prediction, leads to a faster and more effective training of the whole architecture.
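A minimal PyTorch sketch of this objective (Eq. 3 accumulated over the T stages, with the binary mask \(\alpha\) zeroing out missing annotations) is given below; the tensor shapes are assumptions.

```python
# Sketch of the training objective: Eq. (3) summed over all stages
# (intermediate supervision). Shapes are assumptions.
import torch


def pose_loss(preds, heatmaps, alpha):
    """
    preds:    list of T tensors of shape (B, K, h, w), one per stage
    heatmaps: ground-truth heatmaps, shape (B, K, h, w)
    alpha:    binary mask, shape (B, K), 0 where the annotation is missing
    """
    mask = alpha[:, :, None, None]          # broadcast over spatial locations
    loss = 0.0
    for pred in preds:                      # L = sum_t L^t
        loss = loss + ((pred - heatmaps) ** 2 * mask).sum()
    return loss
```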

The network is trained in two steps. In the first step, the original body joint annotations of Watch-n-Patch are employed to train the whole architecture from scratch. It is worth noting that the Watch-n-Patch body joints are inferred by the Kinect SDK, which relies on a random forest-based algorithm [15].

In the second step, the network is finetuned using the training set of the presented dataset. During this phase, we test three procedures: in the first, the whole architecture is finetuned; in the second, the feature extraction block is frozen and not updated; in the last, all the blocks but the last one are frozen and not updated. A minimal sketch of how these variants can be obtained is shown below.
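The sketch assumes `model` is a `PoseNet` instance (from the Sect. 4.1 sketches) already trained on the original annotations; freezing is done by simply disabling gradients.

```python
# Sketch of the three finetuning variants; module names refer to the PoseNet
# sketch above. Pick one variant per experiment.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = PoseNet(n_joints=21, n_stages=6)   # assumed pre-trained on Watch-n-Patch

# Ours:      finetune the whole architecture (nothing frozen).
# Ours_blk:  freeze the feature extraction block only.
set_trainable(model.backbone, False)
# Ours_last: freeze everything but the last repeated block psi^6.
set_trainable(model, False)
set_trainable(model.stages[-1], True)
```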

During both training and finetuning, we apply data augmentation techniques and dropout regularization to improve the generalization capabilities of the model. In particular, we apply random horizontal flip, crop (extracting a portion of \(488 \times 400\) from the original image with size \(512 \times 424\)), resize (to the crop dimension), and rotation (degrees in \([-4^\circ , +4^\circ ]\)). Dropout is applied between the first convolutional block and each repeated block.
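The following is a rough torchvision-based sketch of this augmentation (flip, \(488 \times 400\) crop, small rotation); the handling of the joint coordinates and the resize step are simplified assumptions.

```python
# Rough sketch of the data augmentation; joint handling is simplified and the
# resize step mentioned in the text is omitted for brevity.
import random
import torch
import torchvision.transforms.functional as TF


def augment(depth, joints):
    """depth: (1, 424, 512) tensor; joints: (K, 2) tensor of (x, y) pixels."""
    if random.random() < 0.5:                        # random horizontal flip
        depth = TF.hflip(depth)
        joints[:, 0] = depth.shape[-1] - 1 - joints[:, 0]
    top = random.randint(0, 424 - 400)               # random 488 x 400 crop
    left = random.randint(0, 512 - 488)
    depth = TF.crop(depth, top, left, 400, 488)
    joints = joints - torch.tensor([left, top], dtype=joints.dtype)
    angle = random.uniform(-4.0, 4.0)                # rotation in [-4, +4] degrees
    depth = TF.rotate(depth, angle)
    # joint coordinates would be rotated around the image centre (omitted)
    return depth, joints
```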

In our experiments, we employ the Adam optimizer [10] with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and weight decay set to \(1 \cdot 10^{-4}\). During the training phase, we use a learning rate of \(1 \cdot 10^{-4}\) while, during the finetuning step, we use a learning rate of \(1 \cdot 10^{-4}\) and apply dropout regularization with a dropout probability of 0.5.
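In PyTorch, and interpreting these values as Adam's exponential decay rates, the setup amounts to something like the following sketch (`model` being the PoseNet sketch from Sect. 4.1).

```python
# Sketch of the optimizer setup; betas are interpreted as Adam's beta_1/beta_2.
import torch

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,
                             betas=(0.9, 0.999),
                             weight_decay=1e-4)
```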

5 Experimental Results

5.1 Evaluation Procedure

We adopt an evaluation procedure following the one proposed for the COCO Keypoints Challenge on the COCO website [12].

In detail, we employ the mean Average Precision (mAP) to assess the quality of the human pose estimations with respect to the ground-truth positions. The mAP is defined as the mean of 10 Average Precision values calculated with different Object Keypoint Similarity (OKS) thresholds:

$$\begin{aligned} \text {mAP} = \frac{1}{10} \, \sum _{i = 1}^{10} \text {AP}^{\text {OKS} = 0.45 + 0.05 i} \end{aligned}$$
(5)
Table 2. Comparison of the mAP reached by different methods computed on the Watch-R-Patch dataset. See Sect. 4 for further details.

The OKS is defined as

$$\begin{aligned} \text {OKS} = \frac{\sum _i^K [ \delta (v_i> 0) \cdot \exp {\frac{-d_i^2}{2 s^2 k_i^2}} ]}{\sum _i^K [ \delta (v_i > 0) ]} \end{aligned}$$
(6)

where \(d_i\) is the Euclidean distance between the ground-truth and the predicted location of the keypoint i, s is the area containing all the keypoints, and \(k_i\) is defined as \(k_i = 2 \sigma _i\). Finally, \(v_i\) is a visibility flag: \(v_i=0\) means that keypoint i is not labeled while \(v_i=1\) means that keypoint i is labeled.

The values of \(\varvec{\sigma }\) depend on the dimension of each joint of the human body. In particular, we use the following values: \(\sigma _i = 0.107\) for the spine, the neck, the head, and the hip joints; \(\sigma _i = 0.089\) for the ankle and the foot joints; \(\sigma _i = 0.087\) for the knee joints; \(\sigma _i = 0.079\) for the shoulder joints; \(\sigma _i = 0.072\) for the elbow joints; \(\sigma _i = 0.062\) for the wrist and the hand joints.
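For reference, a simplified NumPy sketch of the evaluation is given below; since each frame contains a single subject, AP at a given threshold is approximated here as the fraction of poses whose OKS exceeds it, which is a simplification of the full COCO matching procedure.

```python
# Simplified sketch of OKS (Eq. 6) and of the mAP over ten OKS thresholds
# (Eq. 5); the single-subject assumption reduces AP to a detection rate.
import numpy as np


def oks(pred, gt, visible, s, sigmas):
    """pred, gt: (K, 2) arrays; visible: (K,) booleans; s: keypoint area;
    sigmas: per-joint values listed above (k_i = 2 * sigma_i)."""
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared distances d_i^2
    k2 = (2.0 * sigmas) ** 2
    e = np.exp(-d2 / (2.0 * s ** 2 * k2))
    return e[visible].mean()


def mean_ap(oks_values):
    """oks_values: array with one OKS value per annotated pose."""
    thresholds = 0.45 + 0.05 * np.arange(1, 11)  # 0.50, 0.55, ..., 0.95
    return float(np.mean([(oks_values >= t).mean() for t in thresholds]))
```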

Table 3. mAP of each body joint present in the Watch-R-Patch dataset.

5.2 Results

Following the evaluation procedure described in Sect. 5.1, we perform extensive experimental evaluations in order to assess the quality of the proposed dataset and method. Results are reported in Table 2.

Fig. 4. Qualitative results obtained on the Watch-R-Patch dataset.

First, we assess the accuracy obtained by our architecture after a training step employing the original Watch-n-Patch dataset. This experiment corresponds to \(\text {Ours}_{{\textit{orig}}}\) in Table 2. As expected, when trained on the Kinect annotations, our model learns to predict human body joints consistently with Shotton et al.'s method [15], reaching a remarkable mAP of 0.777 on the Watch-n-Patch testing set.

We also test the performance of the network employing our annotations as the ground truth. In this case, our method reaches a mAP of 0.729, outperforming Shotton et al.'s method by an absolute margin of 0.119. It is worth noting that our method has been trained on the Kinect annotations only, yet its overall performance on the manually-annotated sequences is considerably higher than that of [15]. We argue that the proposed architecture has better generalization capabilities than the method proposed in [15], even though it has been trained on the predictions of [15], and therefore it obtains a higher mAP when tested on scenes with actual body joint annotations.

Then, we report the results obtained applying different finetuning procedures. In particular, we first train the proposed network on the original Watch-n-Patch annotations, then we finetune the model on the proposed annotations while updating different parts of the architecture. In the experiment \(\text {Ours}_{{\textit{last}}}\), we freeze the parameters of all but the last repeated block, which means updating only the parameters \(\theta ^6\) of the last convolutional block \(\psi ^6\). In \(\text {Ours}_{{\textit{blk}}}\), we freeze the parameters of the feature extraction block, i.e. only the parameters of \(\phi \) and \(\psi ^t\) are updated. Finally, in the experiment \(\text {Ours}\), we finetune the whole network. As shown in Table 2, finetuning the whole architecture leads to the highest \(\text {AP}^{\text {OKS}=0.50}\), \(\text {AP}^{\text {OKS}=0.75}\), and mAP scores. The proposed model, trained on the original Watch-n-Patch dataset and finetuned on the presented annotations, reaches a remarkable mAP of 0.797, outperforming Shotton et al.'s method by an absolute gain of 0.187.

Finally, we report per-joint mAP scores in Table 3. As can be observed, the proposed method outperforms the competitor and the baseline on nearly every joint, confirming the capabilities and the quality of the model and of the employed training procedure. Qualitative results are reported in Fig. 4.

The model runs in real time (5.37 ms per frame, 186 fps) on a workstation equipped with an Intel Core i7-6850K CPU and an Nvidia 1080 Ti GPU.
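As an indication of how such a measurement can be obtained, the following is a minimal GPU timing sketch for the PoseNet model defined in Sect. 4.1; the warm-up length, batch size, and iteration count are assumptions.

```python
# Minimal GPU timing sketch; warm-up, batch size, and iteration count are
# assumptions, not the exact benchmarking protocol of the paper.
import time
import torch

model = PoseNet(n_joints=21, n_stages=6).cuda().eval()
frame = torch.randn(1, 1, 424, 512, device="cuda")    # one depth frame

with torch.no_grad():
    for _ in range(10):                                # warm-up iterations
        model(frame)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(frame)
    torch.cuda.synchronize()
    per_frame = (time.time() - start) / 100

print(f"{per_frame * 1000:.2f} ms per frame, {1.0 / per_frame:.0f} fps")
```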

6 Conclusions

In this paper, we have investigated human pose estimation on depth maps. We have proposed a simple annotation-refinement tool and a novel set of fine joint annotations for a representative subset of the Watch-n-Patch dataset, which we have published free of charge. We have presented a deep learning-based architecture that performs human pose estimation by means of body joints, reaching state-of-the-art results on the challenging fine annotations of the Watch-R-Patch dataset. As future work, we plan to publicly release the annotation tool and to complete the annotation of the Watch-n-Patch dataset.