1 Introduction

Articulated hand pose estimation has been a long-standing research problem over the past decades [23, 38, 39], since it plays a significant role in numerous applications such as human-computer interaction and virtual reality. Although 3D hand pose estimation with depth cameras [6, 7, 13, 26, 41] has achieved tremendous success in recent years, progress in monocular RGB-based 3D hand pose estimation [15, 18, 27, 46] remains limited. Given the ubiquity of RGB cameras, RGB-based solutions for 3D hand pose estimation are favored over depth-based solutions in many vision applications.

Fig. 1. Illustration of weakly-supervised 3D hand pose estimation. Different from conventional fully-supervised methods (a) that use 3D labels to guide joint predictions, our proposed weakly-supervised method (b) leverages a reference depth map, which can be easily obtained with a consumer-grade depth camera, to provide weak supervision. Note that we only need the reference depth map during training, where it acts as a regularizer. During testing, the trained model predicts 3D hand pose from RGB-only input.

Compared with depth images, single-view RGB images exhibit inherent depth ambiguity, which makes 3D hand pose estimation from single RGB images a challenging problem. To overcome this ambiguity, recent work on RGB-based 3D hand pose estimation [46] relies on large amounts of labeled data for training, yet comprehensive real-world datasets with complete 3D annotations are difficult to obtain, which limits performance. Specifically, providing 3D annotations for real-world RGB images is typically harder than providing 2D annotations, since 2D locations can be directly marked in the RGB images while 3D locations cannot be easily labeled by human annotators. To address this problem, Zimmermann et al. [46] turned to rendering low-cost synthetic hands with 3D models, from which the ground truth of 3D joints can be easily obtained. Although this method achieves good performance on the synthetic dataset, it does not generalize well to real-image datasets due to the domain shift between image features. Panteleris et al. [22] employed a discriminative approach to localize the 2D keypoints and a model-fitting method to compute the 3D pose. Recently, Mueller et al. [18] leveraged CycleGANs [45] to generate a "real" dataset transferred from a synthetic dataset. However, the limited performance shows that a gap still exists between the generated "real" images and real-world images.

Our proposed weakly-supervised adaptation method addresses this limitation from a novel perspective. We observe that most previous work [18, 27, 46] on hand pose estimation from real-world single-view RGB images focuses on training with complete 3D annotations, which are expensive and time-consuming to obtain, while ignoring the depth images that can be easily captured by commodity RGB-D cameras. Moreover, such low-cost depth images contain rich cues for 3D hand pose, as evidenced by the decent performance of depth-based methods on 3D pose estimation. Based on these observations, we propose to leverage easily captured depth images to compensate for the scarcity of complete 3D annotations during training, while at test time we take only RGB inputs for 3D hand pose estimation. Figure 1 illustrates the concept of our proposed weakly-supervised 3D hand pose estimation method, which alleviates the burden of costly 3D annotation in real-world datasets.

Fig. 2. We present a weakly-supervised approach for 3D hand pose estimation from monocular RGB-only input. Our method with the depth regularizer (column 4) significantly outperforms the other baselines (columns 2 and 3). Note that columns 2–5 are shown from a novel viewpoint for better comparison.

In particular, similar to previous work [1, 32, 37, 42, 44] on body pose estimation, we apply a cascaded network architecture comprising a 2D pose estimation network and a 3D regression network. We note that directly transferring a network trained on a synthetic dataset to a real-world dataset usually produces poor estimation accuracy, due to the domain gap between them. To address this problem, inspired by [4, 19], we augment the architecture with a depth regularizer, which generates depth images from the predicted 3D hand pose and regularizes the 3D predictions by supervising the rendered depth maps, as shown in Fig. 1(b). This network essentially learns the mapping from a 3D pose to its corresponding depth map, which enables knowledge transfer from fully-annotated synthetic images to weakly-labeled real-world images that lack complete 3D annotations. Additionally, we apply the depth regularizer in the fully-supervised setting. The effectiveness of the depth regularizer is experimentally verified for both our weakly-supervised and fully-supervised methods on two benchmark datasets: RHD [46] and STB [43].

To summarize, this work makes the following contributions:

  • We introduce a novel weakly-supervised setting that leverages low-cost depth maps during training for 3D hand pose estimation from RGB images, which relieves the burden of 3D joint labeling.

  • We propose an end-to-end learning-based 3D hand pose estimation model for weakly-supervised adaptation from fully-annotated synthetic images to weakly-labeled real-world images. In particular, we introduce a depth regularizer supervised by easily captured depth images, which considerably enhances estimation accuracy compared with weakly-supervised baselines (see Fig. 2).

  • We conduct experiments on two benchmark datasets, which show that our weakly-supervised approach compares favorably with existing works and our fully-supervised method outperforms all state-of-the-art methods.

2 Related Work

3D hand pose estimation has been studied extensively for a long time, with numerous theoretical innovations and important applications. Early works [17, 23, 28] on 3D hand pose estimation from monocular color input used complex model-fitting schemes that require strong prior knowledge of hand physics or dynamics and maintain multiple hypotheses. These sophisticated methods, however, usually suffer from low estimation accuracy and restricted environments, limiting their prospects in real-world applications. While multi-view approaches [21, 35] alleviate the occlusion problem and provide decent accuracy, they require sophisticated mesh models and optimization strategies that prohibit their use in real-time tasks.

The emergence of low-cost consumer-grade depth sensors in the last few years has greatly promoted research on depth-based 3D hand pose estimation, since the captured depth images provide richer context that significantly reduces depth ambiguity. With the prevalence of deep learning [10], learning-based 3D hand pose estimation from single depth images has also been introduced and can achieve state-of-the-art 3D pose estimation performance in real time. In general, these methods can be classified into generative approaches [16, 20, 34], discriminative approaches [5, 6, 7, 8, 13, 40], and hybrid approaches [25, 30, 31].

Inspired by the great improvement of CNN-based 3D hand pose estimation from depth images [24], deep learning has also been adopted in some recent works on monocular RGB-based applications [18, 46]. In particular, Zimmermann et al. [46] proposed a deep network that learns an implicit 3D articulation prior of joint locations in canonical coordinates, and constructed a synthetic dataset to tackle the problem of insufficient annotations. Mueller et al. [18] embedded a "GANerated" network that transfers synthetic images to "real" ones so as to reduce the domain shift between them. The performance gains achieved by these methods indicate a promising direction, although estimating 3D hand pose from single-view RGB images is far more challenging due to the absence of depth information. Our work, as a follow-up exploration, aims at alleviating the burden of 3D annotations in real-world datasets by bridging the gap between fully-annotated synthetic images and weakly-labeled real-world images.

The closest work in spirit to our approach is Dibra et al. [4], which proposed an end-to-end network that enables adaptation from a synthetic dataset to an unlabeled real-world dataset. However, we want to emphasize that our method differs significantly from [4] in several aspects. Firstly, our work targets 3D hand pose estimation from single RGB input, whereas [4] focuses on depth-based predictions. Secondly, while [4] leverages a rigged 3D hand model to synthesize depth images, we use a simple fully-convolutional network to infer the corresponding depth maps from the predicted 3D hand pose. To the best of our knowledge, our weakly-supervised adaptation is the first learning-based attempt to introduce a depth regularizer into monocular RGB-based 3D hand pose estimation. This presents an alternative solution for this problem and may enable further research on utilizing depth images in RGB-input applications.

3 Methodology

3.1 Overview

Our target is to infer 3D hand pose from a monocular RGB image, where the 3D hand pose is represented by a set of 3D joint coordinates \(\mathbf{\Phi } = \{\phi _k\}_{k=1}^{K} \in \mathbf{\Lambda }_{3D}\). Here \(\mathbf{\Lambda }_{3D}\) is the \(K \times 3\) dimensional hand joint space, with \(K = 21\) in our case. Figure 3 depicts the proposed network architecture, which adopts a cascaded design inspired by [44]. It consists of a 2D pose estimation network (convolutional pose machines, CPM), a 3D regression network, and a depth regularizer. Given a cropped single RGB image containing a hand, we aim to obtain the 2D heatmap and the corresponding depth of each joint from the proposed end-to-end network. The 2D joint locations are denoted as \(\mathbf{\Phi }_{2D} \in \mathbf{\Lambda }_{2D}\), where \(\mathbf{\Lambda }_{2D} \in \mathcal {R}^{K \times 2}\), and the depth values are denoted as \(\mathbf{\Phi }_{z} \in \mathbf{\Lambda }_{z}\), where \(\mathbf{\Lambda }_{z} \in \mathcal {R}^{K \times 1}\). The final output 3D joint locations are represented in the camera coordinate system, where the first two coordinates are converted from the image-plane coordinates using the camera intrinsic matrix, and the third coordinate is the joint depth. Note that the depth regularizer is only utilized during training. During testing, only the 2D estimation network and the regression network are used to predict joint locations.
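To make the coordinate conversion concrete, the following minimal sketch back-projects the image-plane locations and joint depths into camera space under a pinhole camera model; the function name and intrinsic parameters (fx, fy, cx, cy) are our own illustrative choices, not the authors' code.

```python
import torch

def backproject_to_camera(uv, z, fx, fy, cx, cy):
    """Lift 2D image-plane joint locations plus per-joint depth to 3D
    camera coordinates using the pinhole camera intrinsics.

    uv: (K, 2) pixel coordinates; z: (K,) joint depths.
    Returns: (K, 3) joints in the camera coordinate system.
    """
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)
```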

Fig. 3. Overview of our proposed weakly-supervised 3D hand pose regression network, which is trained in an end-to-end manner. During training, cropped images from both the synthetic dataset and the real-image dataset are mixed in each batch as the input to the network. To compensate for the absence of ground-truth joint-depth annotations in the real data, we extend the network with a depth regularizer that leverages the corresponding depth maps, available in both the synthetic and real datasets, to provide weak supervision. During testing, real images only go through the part of the network within the dashed-line box. The obtained 2D heatmaps and joint depths are concatenated as the output of the network.

The depth regularizer is the key component that facilitates the proposed weakly-supervised training, i.e., it relieves the painstaking joint-depth annotation for real-world datasets by making use of rough depth maps, which can be easily captured by consumer-grade depth cameras. In addition, our experiments show that the introduced depth regularizer can also slightly improve the 3D hand pose predictions of fully-supervised methods, since it serves as an additional constraint on the 3D hand pose space.

The entire network is trained with the Rendered Hand Pose Dataset (RHD) created by [46] and a real-world dataset from the Stereo Hand Pose Tracking Benchmark [43]. For ease of presentation, the synthetic dataset and the real-world dataset are denoted as \(I_{RHD}\) and \(I_{STB}\), respectively. Note that for weakly-supervised learning, our model is pretrained on \(I_{RHD}\) and then adapted to \(I_{STB}\) by fusing the training of both datasets. For fully-supervised learning, the two datasets are used independently in the training and evaluation processes.

3.2 2D Pose Estimation Network

For 2D pose estimation, we adopt an encoder-decoder architecture similar to the Convolutional Pose Machines of Wei et al. [36] and [46], which is fully convolutional and produces heatmaps that are successively refined in resolution. The network outputs K low-resolution heatmaps, where the intensity at each location indicates the confidence of a joint being located there. We predict each joint by applying the MMSE (minimum mean square error given a posterior) estimator, which can be viewed as integrating over all locations weighted by their probabilities in the heatmap, as proposed in [29]. We initialize the network with weights adapted from human pose prediction to \(I_{RHD}\), tuned by Zimmermann et al. [46].
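As a rough illustration of the MMSE estimator, the sketch below takes the probability-weighted mean over all heatmap locations (a soft-argmax); normalizing the raw heatmap with a softmax is our assumption, since [29] leaves the normalization choice open.

```python
import torch
import torch.nn.functional as F

def mmse_joint_locations(heatmaps):
    """MMSE estimate of joint positions: the probability-weighted mean
    over all heatmap locations (a soft-argmax).

    heatmaps: (B, K, H, W) raw network outputs.
    Returns: (B, K, 2) expected (x, y) locations in pixel units.
    """
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    exp_x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginalize rows, weight columns
    exp_y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginalize columns, weight rows
    return torch.stack([exp_x, exp_y], dim=-1)
```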

To train this module, we employ the mean square error (L2 loss) between the predicted heatmap \(\hat{\mathbf{\Phi }}_{HM} \in \mathcal {R}^{H \times W}\) and the ground-truth Gaussian heatmap \(G(\mathbf{\Phi }_{2D}^{gt})\) generated from the ground-truth 2D labels \(\mathbf{\Phi }_{2D}^{gt}\) with standard deviation \(\sigma = 1\). The loss function is

$$\begin{aligned} L_{2D}(\hat{\mathbf{\Phi }}_{HM}, \mathbf{\Phi }_{2D}^{gt}) = \sum _{h}^{H}\sum _{w}^{W}\left( \hat{\mathbf{\Phi }}_{HM}^{(h,w)} - G(\mathbf{\Phi }_{2D}^{gt})^{(h,w)}\right) ^{2} . \end{aligned}$$
(1)
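A minimal sketch of Eq. (1), rendering the Gaussian ground-truth heatmaps with \(\sigma = 1\) and summing the squared error; the shapes and the helper name are illustrative, not the authors' code.

```python
import torch

def l2d_loss(pred_hm, gt_uv, sigma=1.0):
    """Eq. (1): squared error between predicted heatmaps and Gaussian
    ground-truth heatmaps rendered from the 2D labels.

    pred_hm: (..., K, H, W) predicted heatmaps.
    gt_uv:   (..., K, 2) ground-truth 2D joint locations (x, y).
    """
    h, w = pred_hm.shape[-2:]
    ys = torch.arange(h, dtype=pred_hm.dtype, device=pred_hm.device).view(h, 1)
    xs = torch.arange(w, dtype=pred_hm.dtype, device=pred_hm.device).view(1, w)
    # squared distance of every pixel to each joint, broadcast to (..., K, H, W)
    d2 = (xs - gt_uv[..., 0, None, None]) ** 2 + (ys - gt_uv[..., 1, None, None]) ** 2
    gt_hm = torch.exp(-d2 / (2.0 * sigma ** 2))
    return ((pred_hm - gt_hm) ** 2).sum()
```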

3.3 Regression Network

The objective of the regression network is to infer the depth of each joint from the obtained 2D heatmaps. Most previous work [2, 32, 46] on 3D human and hand pose estimation from a single image attempts to lift the set of 2D heatmaps into 3D space directly, but a key issue with this strategy is how to distinguish among the multiple 3D poses that can be inferred from a single 2D skeleton. Inspired by [44], our method exploits contextual information to reduce the ambiguity of lifting 2D heatmaps to 3D locations: we extract intermediate image evidence from the 2D pose estimation network and concatenate it with the predicted 2D heatmaps as the input to the regression network. We employ a simple yet effective depth regression network with only two convolutional layers and three fully-connected layers. Note that we infer a scale-invariant and translation-invariant representation of joint depth by subtracting the location of the root keypoint from each hand joint and then normalizing by the distance between a certain pair of keypoints, as done in [18, 46].
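To make the regression head concrete, here is a hedged sketch with two convolutional and three fully-connected layers; the channel widths, strides, and input resolution are our own illustrative assumptions, as the paper does not specify them.

```python
import torch.nn as nn

class DepthRegressionNet(nn.Module):
    """Sketch of the joint-depth regression head: two conv layers over the
    concatenated [intermediate features; 2D heatmaps] tensor followed by
    three fully-connected layers."""

    def __init__(self, in_channels, feat_size=32, num_joints=21):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        flat = 128 * (feat_size // 4) ** 2
        self.fc = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_joints),  # root-relative, scale-normalized depths
        )

    def forward(self, x):
        z = self.conv(x)
        return self.fc(z.flatten(1))
```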

For fully-supervised learning, we simply apply the smooth L1 loss introduced in [9] between our predicted joint depth \(\hat{\mathbf{\Phi }}_{z}\) and the ground-truth label \(\mathbf{\Phi }_{z}^{gt}\). For weakly-supervised learning, no such penalty can be enforced because of the absence of 3D annotations. To address this issue, we introduce a novel depth regularizer as weak supervision for joint depth regression, which is elaborated in Sect. 3.4.

Overall, the loss function of the regression network is defined as

$$\begin{aligned} L_{z}(\hat{\mathbf{\Phi }}_{z}, \mathbf{\Phi }_{z}^{gt}) = \left\{ \begin{array}{ll} smooth_{L1}(\hat{\mathbf{\Phi }}_{z}, \mathbf{\Phi }_{z}^{gt}), &{} \text {if full supervision} \\ 0, &{} \text {if weak supervision} \end{array} \right. \end{aligned}$$
(2)

in which

$$\begin{aligned} smooth_{L1}(x) = \left\{ \begin{array}{ll} 0.5x^{2}, &{} \text {if } |x| < 1 \\ |x| - 0.5, &{} \text {otherwise.} \end{array} \right. \end{aligned}$$
(3)
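A minimal sketch of the supervision switch in Eqs. (2) and (3); we use PyTorch's built-in smooth-L1 loss, whose default threshold of 1 matches Eq. (3), and treat a missing ground-truth tensor as the weakly-supervised case (our own convention).

```python
import torch.nn.functional as F

def lz_loss(pred_z, gt_z=None):
    """Eq. (2)/(3): smooth-L1 depth loss under full supervision; zero when
    only weak supervision is available (gt_z is None for real samples)."""
    if gt_z is None:  # weakly-supervised sample: the depth regularizer supervises instead
        return pred_z.new_zeros(())
    # F.smooth_l1_loss implements Eq. (3) with its default threshold of 1
    return F.smooth_l1_loss(pred_z, gt_z, reduction='sum')
```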

3.4 Depth Regularizer

The purpose of the depth regularizer is to use the easily-captured depth images as an implicit constraint on physical hand structure, applicable in both the weakly-supervised and fully-supervised settings. Figure 4 shows the architecture of the proposed depth regularizer, which is fully convolutional with six layers, inspired by [3, 19]. Each layer contains a transposed convolution followed by a ReLU, which expands the feature map along both image dimensions. In the first five layers, batch normalization [12] and dropout [11] are applied before the ReLU to reduce the dependency on initialization and to alleviate overfitting to the training data. The final layer combines all feature maps to generate the corresponding depth image from the 3D hand pose.
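A hedged sketch of such a renderer follows; only the six-layer transposed-convolution layout with batch normalization and dropout before the ReLU in the first five layers comes from the paper, while the channel widths, kernel sizes, and the initial pose-to-feature-map reshape are our assumptions.

```python
import torch.nn as nn

class DepthRegularizer(nn.Module):
    """Fully-convolutional renderer mapping 3D joint locations to a depth map."""

    def __init__(self, num_joints=21, base=256, p_drop=0.2):
        super().__init__()
        self.base = base
        self.fc = nn.Linear(num_joints * 3, base * 4 * 4)  # pose -> 4x4 seed map
        layers, c = [], base
        for _ in range(5):  # first five layers: upsample with BN + dropout + ReLU
            layers += [
                nn.ConvTranspose2d(c, c // 2, 4, stride=2, padding=1),
                nn.BatchNorm2d(c // 2), nn.Dropout2d(p_drop), nn.ReLU(),
            ]
            c //= 2
        # final layer combines all feature maps into a single depth image
        layers += [nn.ConvTranspose2d(c, 1, 3, stride=1, padding=1)]
        self.deconv = nn.Sequential(*layers)

    def forward(self, pose_3d):  # pose_3d: (B, K, 3) = (2D labels, joint depth)
        x = self.fc(pose_3d.flatten(1)).view(-1, self.base, 4, 4)
        return self.deconv(x)    # (B, 1, 128, 128) rendered depth map
```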

Fig. 4. Network architecture of our proposed depth regularizer. Given 3D hand joint locations as the input, the depth regularizer is able to render the corresponding depth map by gradually enlarging the intermediate feature maps and finally combining them into a single depth image.

Let \((\hat{\mathbf{\Phi }}_{3D}, \mathbf{D})\) denote a training sample, where \(\hat{\mathbf{\Phi }}_{3D}\) is the input of the depth regularizer containing a set of 3D hand joint locations, and \(\mathbf{D}\) is the corresponding depth image. We normalize \(\mathbf{D}\) into \(\mathbf{D}_n\) elementwise:

$$\begin{aligned} \mathbf{D}_{n}^{(i,j)} = \frac{d_{max} - d_{ij}}{d_{range}} \end{aligned}$$
(4)

where \(d_{ij}\) is the depth value at image location \((i,j)\), and \(d_{max}\) and \(d_{range}\) denote the maximum depth value and the depth range, respectively. Note that with this normalization, depth values closer to the camera become larger, and the background is set to 0.
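In code, the normalization of Eq. (4) is a simple elementwise map; how the background pixels are identified is our assumption (the paper only states that they are set to 0).

```python
def normalize_depth(depth, d_max, d_range, bg_mask=None):
    """Eq. (4): map a raw depth image to [0, 1] so that closer points get
    larger values. bg_mask marks background pixels to be zeroed."""
    d_n = (d_max - depth) / d_range
    if bg_mask is not None:
        d_n = d_n.masked_fill(bg_mask, 0.0)
    return d_n
```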

The input of the network \(\hat{\mathbf{\Phi }}_{3D} = \{(\mathbf{\Phi }_{2D}^{gt}, \mathbf{X}_z)\}\) consists of two parts: the ground-truth 2D labels \(\mathbf{\Phi }_{2D}^{gt}\) in the image coordinate system and the joint depths \(\mathbf{X}_z\). Note that we use ground-truth 2D locations rather than our predicted 2D results to simplify the training process, since then no gradients from the depth regularizer need to be back-propagated into the 2D pose estimation network. For the joint depths \(\mathbf{X}_z\), we apply the same normalization:

$$\begin{aligned} \mathbf{X}_{z} = \frac{d_{max} - \hat{\mathbf{\Phi }}_{z} \cdot L_{scale} - d_{root}}{d_{range}} \end{aligned}$$
(5)

where \(\hat{\mathbf{\Phi }}_{z}\) denotes the predicted joint depths from the regression network, which are root-relative and normalized values that can be recovered to global coordinates by multiplying by the hand scale \(L_{scale}\) and shifting by the root depth \(d_{root}\).
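For reference, a one-line sketch of Eq. (5) using the recovery just described; the helper name is illustrative.

```python
def normalize_joint_depth(pred_z, l_scale, d_root, d_max, d_range):
    """Eq. (5): bring root-relative, scale-normalized predicted depths into
    the same normalized space as the depth map of Eq. (4)."""
    global_z = pred_z * l_scale + d_root  # recover absolute joint depth
    return (d_max - global_z) / d_range
```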

To train the depth regularizer, we adopt the L1 norm to minimize the difference between the generated depth image \(\hat{\mathbf{D}}_n\) and the corresponding ground truth \(\mathbf{D}_n\):

$$\begin{aligned} L_{dep}(\hat{\mathbf{D}}_n, \mathbf{D}_n) = \Vert \hat{\mathbf{D}}_n - \mathbf{D}_n\Vert _1 \end{aligned}$$
(6)

3.5 Training

Combining the losses in Eqs. (1), (2), and (6), we obtain the overall loss function as

$$\begin{aligned} L = \lambda _{2D}L_{2D}(\hat{\mathbf{\Phi }}_{HM}, \mathbf{\Phi }_{2D}^{gt}) + \lambda _{z}L_{z}(\hat{\mathbf{\Phi }}_{z}, \mathbf{\Phi }_{z}^{gt}) + \lambda _{dep}L_{dep}(\hat{\mathbf{D}}_n, \mathbf{D}_n) . \end{aligned}$$
(7)

Adam optimization [14] is used for training. For weakly-supervised learning, similar to [44] and [33], we adopt fused training where each mini-batch contains both synthetic and real training examples (half and half), shuffled randomly during the training process. In our experiments, we adopt a three-stage training procedure, which is more effective in practice than direct end-to-end training. In particular, Stage 1 initializes the regression network and fine-tunes the 2D pose estimation network with weights from Zimmermann et al. [46], which are adapted from the Convolutional Pose Machines [36]. Stage 2 initializes the depth regularizer, as described in Sect. 3.4. Stage 3 fine-tunes the whole network end-to-end with all the training data.
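The following sketch outlines one fused training step with a half-synthetic, half-real mini-batch, reusing the loss helpers sketched in Sects. 3.2 and 3.3 (which broadcast over the batch); the model interface and dictionary keys are illustrative assumptions, not the authors' code.

```python
import torch

def fused_training_step(model, optimizer, synth_batch, real_batch,
                        lam_2d=1.0, lam_z=0.1, lam_dep=1.0):
    """One weakly-supervised step combining Eqs. (1), (2), (6), and (7).
    `model` is assumed to return the predicted heatmaps, joint depths, and
    rendered depth map for a batch of cropped RGB images."""
    optimizer.zero_grad()
    loss = torch.zeros((), device=synth_batch['rgb'].device)
    for batch in (synth_batch, real_batch):
        hm, z, depth_map = model(batch['rgb'], batch['gt_2d'])
        loss = loss + lam_2d * l2d_loss(hm, batch['gt_2d'])
        # Eq. (2): depth supervision only when 3D labels exist (synthetic data)
        loss = loss + lam_z * lz_loss(z, batch.get('gt_z'))
        # Eq. (6): reference depth maps are available for both synthetic and real data
        loss = loss + lam_dep * torch.abs(depth_map - batch['depth_n']).sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```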

4 Experiments

4.1 Implementation Details

Our method is implemented in PyTorch. For the first training stage described in Sect. 3.5, we train for 60 epochs with an initial learning rate of \(10^{-7}\), a batch size of 8, and a regularization strength of \(5\times 10^{-4}\). Stage 2 and Stage 3 take 40 and 20 epochs, respectively. During fine-tuning of the whole network, we set \(\lambda _{2D} = 1\), \(\lambda _{z} = 0.1\), and \(\lambda _{dep} = 1\). All experiments are conducted on one GeForce GTX 1080 GPU with CUDA 8.0.
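For completeness, a plausible Stage-1 optimizer configuration matching these hyperparameters, interpreting the regularization strength as Adam weight decay (our assumption):

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in module; use the full network in practice
optimizer = torch.optim.Adam(model.parameters(), lr=1e-7, weight_decay=5e-4)
```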

4.2 Datasets and Metrics

We evaluate our method on two publicly available datasets: Rendered Hand Pose Dataset (RHD) [46] and a real-world dataset from Stereo Hand Pose Tracking Benchmark (STB) [43].

Fig. 5. Left: comparison of 3D PCK results of different baselines with our method on STB [43]. Our proposed weakly-supervised method, w/ 2D + w/ depth regularizer, significantly outperforms the other weakly-supervised baselines (orange and green curves). Right: different annotation schemes in the RHD [46] and STB [43] datasets. Note that we move the root joint location of the STB dataset from the palm to the wrist keypoint to make the two datasets consistent with each other. (Color figure online)

RHD is a synthetic dataset of rendered hand images with a resolution of \(320\times 320\), built upon 20 different characters performing 39 actions; it comprises 41,258 images for training and 2,728 images for testing. All samples are annotated with 2D and 3D keypoint locations, and for each RGB image the corresponding depth image is also provided. This dataset is quite challenging due to the large variations in viewpoints and hand shapes, as well as the large visual diversity induced by random noise and different illuminations. With all labels provided, we train the entire proposed network, including the 2D pose estimation network, the regression network, and the depth regularizer.

STB is a real-world dataset containing two subsets with an image resolution of \(640 \times 480\): the stereo subset STB-BB, captured with a Point Grey Bumblebee2 stereo camera, and the color-depth subset STB-SK, captured with an active depth camera. Note that the two types of images are captured simultaneously with the same resolution, identical camera pose, and similar viewpoints. Both STB-BB and STB-SK provide 2D and 3D annotations of 21 keypoints. For weakly-supervised experiments, we use color-depth pairs in STB-SK with 2D annotations, as well as the root depth (i.e., the wrist in our experiments) and the hand scale (the distance between a certain pair of keypoints). For fully-supervised experiments, both stereo pairs (STB-BB) and color-depth pairs (STB-SK) with 2D and 3D annotations are utilized to train the whole network. Note that all experiments conducted on the STB dataset follow the same training and evaluation protocol used in [18, 46], which trains on 10 sequences and tests on the other two.

We evaluate 3D hand pose estimation performance with two metrics. The first is the area under the curve (AUC) on the percentage of correct keypoints (PCK) score, a popular criterion that evaluates pose estimation accuracy across different error thresholds, as used in [18, 46]. The second is the mean error distance in the z-dimension over all testing frames, which we use to further analyse the impact of the proposed depth regularizer. Following the same conditions used in [18, 46], we assume that the global hand scale and the root depth are known in the experimental evaluations, so that we can report PCK curves based on global 3D hand joint locations computed from the output root-relative articulations.
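As a reference implementation of the first metric, the sketch below computes the PCK curve and its AUC from per-joint Euclidean errors; normalizing the AUC by the threshold span is our assumption about the convention used in [46].

```python
import numpy as np

def pck_curve(pred, gt, thresholds):
    """3D PCK: fraction of keypoints whose Euclidean error falls below each
    threshold, plus the normalized area under that curve.

    pred, gt: (N, K, 3) joint locations in mm; thresholds: (T,) in mm.
    """
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()   # per-joint errors
    pck = np.array([(errors <= t).mean() for t in thresholds])
    auc = np.trapz(pck, thresholds) / (thresholds[-1] - thresholds[0])
    return pck, auc
```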

4.3 Quantitative Results

Weak Supervision. We first evaluate the impact of weak label constraints on the STB dataset, comparing against fully-supervised methods with complete 2D and 3D annotations. Specifically, we compare our proposed weakly-supervised approach (w/ 2D + w/ depth regularizer) with three baselines: (a) w/o 2D + w/o depth regularizer: directly using the model pretrained on the RHD dataset; (b) w/ 2D + w/o depth regularizer: tuning the pretrained network with 2D labels from the STB dataset; and (c) w/ 2D + w/ 3D: the fully-supervised method with complete 2D and 3D annotations.

Fig. 6. The effect of the proposed depth regularizer in the fully-supervised setting on the RHD [46] and STB [43] datasets. Left: 3D PCK on the RHD dataset. Middle: 3D PCK on the STB dataset. Right: mean joint error distances in the z-dimension on the RHD and STB datasets.

As illustrated in the left part of Fig. 5, the fully-supervised method achieves the best performance, while directly transferring the model trained on synthetic data without adaptation (baseline-a) yields the worst estimation results. This is not surprising, since full supervision provides the most effective constraint for the 3D hand pose estimation task and real-world images have considerable domain shift from synthetic ones. Note that these two baselines serve as the upper and lower bounds for our weakly-supervised method. Compared with baseline-a, fine-tuning the pretrained model with the 2D labels of the real images (baseline-b) significantly improves the AUC from 0.667 to 0.807. Moreover, adding our proposed depth regularizer further increases the AUC to 0.889, which demonstrates its effectiveness.

We note that the STB and RHD datasets adopt different schemes for 2D and 3D annotations, as shown in the right part of Fig. 5. In particular, the STB dataset annotates the palm position as the root joint, whereas the RHD dataset uses the wrist position as the root keypoint. Thus, we move the palm joint in STB to the wrist point so as to make the annotations consistent for fused training. To evaluate the noise introduced by moving the root joint, we compare the results of our fully-supervised method on the STB dataset with palm-relative and wrist-relative representations; the original palm-relative representation performs slightly better, reducing the mean error by about 0.6 mm. Besides, it is also noted that the MCP (metacarpophalangeal) joint positions are closer to the wrist joint in the STB dataset, and the labels of the STB dataset are relatively noisy compared with the synthetic RHD dataset (e.g., the thumb DIP joint is annotated in the background). Due to these differences, we argue that there exists a bias between our pose predictions and the ground truth provided by the STB dataset, which might decrease the reported estimation accuracy of our proposed weakly-supervised approach. On the other hand, these inconsistencies also suggest the necessity of the introduced depth regularizer, since it provides certain prior knowledge of hand pose and shape.

Fully-Supervised 3D Hand Pose Estimation. We also evaluate the effectiveness of the depth regularizer in the fully-supervised setting on both the RHD and STB datasets. Note that the two datasets are trained on independently in this case. As presented in Fig. 6 (left) and (middle), our fully-supervised method with the depth regularizer outperforms the one without it on both datasets, with improvements of 0.031 and 0.001 in AUC, respectively. Figure 6 (right) shows the mean joint error in the z-dimension, indicating that adding the depth regularizer slightly improves the fully-supervised results in joint depth estimation.

Fig. 7. Comparisons with state-of-the-art methods on RHD [46] and STB [43]. Left: 3D PCK on the RHD dataset. Right: 3D PCK on the STB dataset.

Comparisons with State-of-the-Arts. Figure 7 shows comparisons with state-of-the-art methods [18, 22, 27, 43, 46] on both the RHD and STB datasets. On the RHD dataset, even without the depth regularizer, our fully-supervised method significantly outperforms the state-of-the-art method [46], improving the AUC from 0.675 to 0.887. On the STB dataset, our fully-supervised method achieves the best results among all existing methods. Note that our weakly-supervised method is also superior to some of the existing works, which demonstrates the potential value of weakly-supervised exploration when complete 3D annotations are difficult to obtain in real-world datasets. It is also noted that the AUC values of our proposed methods in Fig. 7 differ slightly from their counterparts in Sect. 4.3; this is because here we test on the stereo-pair subset STB-BB rather than the color-depth subset STB-SK.

Fig. 8. Samples of depth maps generated by the trained depth regularizer given ground-truth 3D hand joint locations as input. Our trained depth regularizer is able to render plausible and convincing depth maps. Note that the errors are mainly located around the contours of the hand, where the reference depth images (e.g., captured by a depth camera) are typically noisy.

Fig. 9. Visual results of our proposed weakly-supervised approach (columns 1 and 4) and other baselines (columns 2 and 3), compared with the ground truth (column 5). Note that columns 2–5 are shown from a novel viewpoint for easy comparison.

4.4 Qualitative Results

Figure 9 shows visual results of our proposed weakly-supervised approach and the baselines. For better comparison, we show the 3D skeleton reconstructions from a novel view, while the skeleton reconstructions of our method from the original view are overlaid on the input images. It can be seen that, after additionally imposing the depth regularizer with the reference depth images, our weakly-supervised approach yields considerably better estimation accuracy on the real-world dataset, especially in terms of global orientation, which is consistent with the aforementioned quantitative analysis.

Figure 10 shows some visual results of our fully-supervised methods on RHD and STB datasets. We exhibit samples captured from various viewpoints with serious self-occlusions. It can be seen that our fully-supervised approach with the depth regularizer is robust to various hand orientations and complicated pose articulations.

Although the depth regularizer is used only in training and not in testing, it is interesting to see whether it has learned a manifold of hand poses. Thus, we collect some samples of the depth images generated by our trained depth regularizer given ground-truth 3D hand joint locations, as shown in Fig. 8. We can see that the depth regularizer is able to render smooth and convincing depth images for hand poses with large variations and self-occlusions.

Fig. 10. Visual results of our fully-supervised method on the RHD and STB datasets. First row: RHD dataset. Second row: STB dataset. Note that skeletons are shown from a novel viewpoint for easy comparison.

5 Conclusions

Building a large real-world hand dataset with full 3D annotations is one of the major bottlenecks for learning-based approaches to 3D hand pose estimation. To address this problem, we present a way to adapt from a fully-annotated synthetic dataset to a weakly-labeled real-world dataset with the aid of low-cost depth images, which, to our knowledge, is the first exploration of leveraging depth maps to compensate for the absence of complete 3D annotations. Specifically, we introduce a simple yet effective end-to-end architecture consisting of a 2D estimation network, a regression network, and a novel depth regularizer. Quantitative and qualitative experimental results show that our weakly-supervised method compares favorably with existing works and our fully-supervised approach considerably outperforms the state-of-the-art methods. We note that this is only one way to perform weakly-supervised 3D hand pose estimation; there remains a large space for un-/weakly-supervised learning.