
1 Introduction

Parsing people in visual data is central to many applications including mixed-reality interfaces, animation, video editing and human action recognition. Towards this goal, human 2D pose estimation has been significantly advanced by recent efforts [1,2,3,4]. Such methods aim to recover 2D locations of body joints and provide a simplified geometric representation of the human body. There has also been significant progress in 3D human pose estimation [5,6,7,8]. Many applications, however, such as virtual clothes try-on, video editing and re-enactment require accurate estimation of both 3D human pose and shape.

Fig. 1. Our BodyNet predicts a volumetric 3D human body shape and 3D body parts from a single image. We show the input image, the predicted human voxels, and the predicted part voxels.

3D human shape estimation has been mostly studied in controlled settings using specific sensors including multi-view capture [9], motion capture markers [10], inertial sensors [11], and 3D scanners [12]. 3D human shape estimation in uncontrolled single-view settings, however, has received little attention so far. The challenges include the lack of large-scale training data, the high dimensionality of the output space, and the choice of suitable representations for 3D human shape. Bogo et al. [13] present the first automatic method to fit a deformable body model to an image, but rely on accurate 2D pose estimation and introduce hand-designed constraints enforcing elbows and knees to bend naturally. Other recent methods [14,15,16] employ deformable human body models such as SMPL [17] and regress model parameters with CNNs [18, 19]. In this work, we compare against such approaches and demonstrate the advantages of our method.

The optimal choice of 3D representation for neural networks remains an open problem. Recent work explores voxel [20,21,22,23], octree [24,25,26,27], point cloud [28,29,30], and surface [31] representations for modeling generic 3D objects. In the case of human bodies, the common approach has been to regress parameters of pre-defined human shape models [14,15,16]. However, the mapping between the 3D shape and the parameters of deformable body models is highly non-linear and currently difficult to learn. Moreover, regression to a single set of parameters cannot represent multiple hypotheses and can be problematic in ambiguous situations. Notably, skeleton regression methods for 2D human pose estimation, e.g., [32], have recently been overtaken by heatmap-based methods [1, 2] that enable the representation of multiple hypotheses.

In this work we propose and investigate a volumetric representation for body shape estimation, as illustrated in Fig. 1. Our network, called BodyNet, generates likelihoods on the 3D occupancy grid of a person. To efficiently train our network, we propose to regularize BodyNet with a set of auxiliary losses. Besides the main volumetric 3D loss, BodyNet includes a multi-view re-projection loss and multi-task losses. The multi-view re-projection loss, efficiently approximated in voxel space (see Sect. 3.2), increases the importance of the boundary voxels. The multi-task losses are based on additional intermediate network supervision in terms of 2D pose, 2D body part segmentation, and 3D pose. The overall architecture of BodyNet is illustrated in Fig. 2.

To evaluate our method, we fit the SMPL model [17] to the BodyNet output and measure single-view 3D human shape estimation performance on the recent SURREAL [33] and Unite the People [34] datasets. The proposed BodyNet approach demonstrates state-of-the-art performance and improves the accuracy over recent methods. We show significant improvements provided by the end-to-end training and auxiliary losses of BodyNet. Furthermore, our method enables volumetric body part segmentation. BodyNet is fully differentiable and could be used as a subnetwork in future application-oriented methods targeting, e.g., virtual cloth change or re-enactment.

In summary, this work makes several contributions. First, we address single-view 3D human shape estimation and propose a volumetric representation for this task. Second, we investigate several network architectures and propose BodyNet, an end-to-end trainable network that combines a multi-view re-projection loss with intermediate network supervision in terms of 2D pose, 2D body part segmentation, and 3D pose. Third, we outperform previous regression-based methods and demonstrate state-of-the-art performance on two datasets for human shape estimation. In addition, our network is fully differentiable and can provide volumetric body part segmentation.

Fig. 2. BodyNet: End-to-end trainable network for 3D human body shape estimation. The input RGB image is first passed through subnetworks for 2D pose estimation and 2D body part segmentation. These predictions, combined with the RGB features, are fed to another network predicting 3D pose. All subnetworks are combined into a final network that infers the volumetric shape. The 2D pose, 2D segmentation, and 3D pose networks are first pre-trained and then fine-tuned jointly for the task of volumetric shape estimation using multi-view re-projection losses. We fit the SMPL model to the volumetric predictions for the purpose of evaluation.

2 Related Work

3D Human Body Shape. While the problem of localizing 3D body joints has been well explored in the past [5,6,7,8, 35,36,37,38], 3D human shape estimation from a single image has received limited attention and remains a challenging problem. Earlier work [39, 40] proposed to optimize pose and shape parameters of the 3D deformable body model SCAPE [41]. More recent methods use the SMPL [17] body model, which again represents the 3D shape as a function of pose and shape parameters. Given such a model and an input image, Bogo et al. [13] present the optimization method SMPLify, which estimates model parameters from a fit to 2D joint locations. Lassner et al. [34] extend this approach by incorporating silhouette information as additional guidance and improve the optimization performance using densely sampled 2D points. Huang et al. [42] extend SMPLify to multi-view video sequences with temporal priors. Similar temporal constraints have been used in [43]. Rhodin et al. [44] use a sum-of-Gaussians volumetric representation together with contour-based refinement and successfully demonstrate human shape recovery from multi-view videos with optimization techniques. Even though such methods show compelling results, they are inherently limited by the quality of the 2D detections they use and depend on priors on both pose and shape parameters to regularize the highly complex and costly optimization process.

Deep neural networks provide an alternative approach that can be expected to learn appropriate priors automatically from the data. Dibra et al. [45] present one of the first approaches in this direction and train a CNN to estimate the 3D shape parameters from silhouettes, but assume a frontal input view. More recent approaches [14,15,16] train neural networks to predict the SMPL body parameters from an input image. Tan et al. [14] design an encoder-decoder architecture that is trained on silhouette prediction and indirectly regresses model parameters at the bottleneck layer. Tung et al. [15] operate on two consecutive video frames and learn parameters by integrating re-projection losses on optical flow, silhouettes, and 2D joints. Similarly, Kanazawa et al. [16] predict parameters with a re-projection loss on the 2D joints and introduce an adversary whose goal is to distinguish unrealistic human body shapes.

Even though parameters of deformable body models provide a low-dimensional embedding of the 3D shape, predicting such parameters with a network requires learning a highly non-linear mapping. In our work, we opt for an alternative volumetric representation that has been shown to be effective for generic 3D objects [21] and faces [46]. The approach of [21] operates on low-resolution grayscale images of a few rigid object categories such as chairs and tables. We argue that human bodies are more challenging due to significant non-rigid deformations. To accommodate such deformations, we use segmentation and 3D pose as proxies for 3D shape, in addition to 2D pose [46]. By conditioning 3D shape estimation on a given 3D pose, the network can focus on the more complicated problem of shape deformation. Furthermore, we regularize our voxel predictions with an additional re-projection loss, perform end-to-end multi-task training with intermediate supervision, and obtain volumetric body part segmentation.

Others have studied predicting 2.5D projections of human bodies. DenseReg [47] and DensePose [48] estimate image-to-surface correspondences, while [33] outputs quantized depth maps for SMPL bodies. Differently from these methods, our approach generates a full 3D body reconstruction.

Multi-task Neural Networks. Multi-task networks are well studied. A common approach is to output multiple related tasks at the very end of the neural network architecture. Another, more recently explored alternative is to stack multiple subnetworks and provide guidance with intermediate supervision. Here, we only cover related works that employ the latter approach. Guiding CNNs with relevant cues has shown improvements for a number of tasks. For example, 2D facial landmarks have been shown to provide useful guidance for 3D face reconstruction [46], and similarly optical flow for action recognition [49]. However, these methods do not perform joint training. Recent work of [50] jointly learns 2D/3D pose together with action recognition. Similarly, [51] trains for 3D pose with intermediate tasks of 2D pose and segmentation. With this motivation, we make use of 2D pose, 2D human body part segmentation, and 3D pose, which provide cues for 3D human shape estimation. Unlike [51], in our work 3D pose becomes an auxiliary task for the final 3D shape task. In our experiments, we show that training with a joint loss on all these tasks increases the performance of all our subnetworks (see Appendix C.1).

3 BodyNet

BodyNet predicts 3D human body shape from a single image and is composed of four subnetworks trained first independently, then jointly to predict 2D pose, 2D body part segmentation, 3D pose, and 3D shape (see Fig. 2). Here, we first discuss the details of the volumetric representation for body shape (Sect. 3.1). Then, we describe the multi-view re-projection loss (Sect. 3.2) and the multi-task training with the intermediate representations (Sect. 3.3). Finally, we formulate our model fitting procedure (Sect. 3.4).

3.1 Volumetric Inference for 3D Human Shape

For 3D human body shape, we propose to use a voxel-based representation. Our shape estimation subnetwork outputs the 3D shape represented as an occupancy map defined on a fixed-resolution voxel grid. Specifically, given a 3D body, we define a 3D voxel grid roughly centered at the root joint (i.e., the hip joint), where each voxel inside the body is marked as occupied. We voxelize the ground truth meshes (i.e., SMPL) into a fixed-resolution grid using binvox [52, 53]. We assume orthographic projection and rescale the volume such that the xy-plane is aligned with the 2D segmentation mask to ensure spatial correspondence with the input image. After scaling, the body is centered on the z-axis and the remaining areas are padded with zeros.

Our network minimizes the binary cross-entropy loss after applying the sigmoid function on the network output similar to [46]:

$$\begin{aligned} \mathcal {L}_v = -\sum _{x=1}^{W} \sum _{y=1}^{H} \sum _{z=1}^{D} \left[ V_{xyz}\log \hat{V}_{xyz}+(1-V_{xyz})\log (1-\hat{V}_{xyz}) \right] , \end{aligned}$$
(1)

where \(V_{xyz}\) and \(\hat{V}_{xyz}\) denote the ground truth value and the predicted sigmoid output for a voxel, respectively. Width (W), height (H) and depth (D) are 128 in our experiments. We observe that this resolution captures sufficient details.
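For concreteness, a minimal sketch of this voxel loss, written in PyTorch purely for illustration (the tensor shapes follow the \(128^3\) grid above; the framework choice and variable names are ours, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def voxel_bce_loss(logits, target):
    """Binary cross-entropy over the W x H x D occupancy grid (Eq. 1).

    logits : raw network output, shape (B, 128, 128, 128)
    target : ground-truth occupancy in {0, 1}, same shape, float
    """
    # binary_cross_entropy_with_logits applies the sigmoid internally;
    # reduction='sum' matches the summation over all voxels in Eq. (1).
    return F.binary_cross_entropy_with_logits(logits, target, reduction='sum')

# toy usage
logits = torch.randn(2, 128, 128, 128)
target = (torch.rand(2, 128, 128, 128) > 0.5).float()
loss = voxel_bce_loss(logits, target)
```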

The loss \(\mathcal {L}_v\) is used to perform foreground-background segmentation of the voxel grid. We further extend this formulation to perform 3D body part segmentation with a multi-class cross-entropy loss. We define 6 parts (head, torso, left/right leg, left/right arm) and learn a 7-class classification including the background. The weights of this network are initialized from the shape network by copying its output-layer weights to each class. This simple extension allows the network to directly infer 3D body parts without going through the costly SMPL model fitting.
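A similar sketch for the multi-class extension; the 7-channel output layout and the integer part labels are our assumptions about how the formulation above could be implemented:

```python
import torch
import torch.nn.functional as F

def voxel_part_loss(logits, labels):
    """Multi-class cross-entropy over the voxel grid.

    logits : (B, 7, 128, 128, 128) -- one channel per class (background + 6 parts)
    labels : (B, 128, 128, 128) integer class ids in [0, 6], dtype torch.long
    """
    # cross_entropy supports spatial targets of shape (B, d1, d2, d3).
    return F.cross_entropy(logits, labels, reduction='sum')
```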

3.2 Multi-view Re-projection Loss on the Silhouette

Due to the complex articulation of the human body, one major challenge in inferring the volumetric body shape is to ensure high confidence predictions across the whole body. We often observe that the confidences on the limbs away from the body center tend to be lower (see Fig. 5). To address this problem, we employ additional 2D re-projection losses that increase the importance of the boundary voxels. Similar losses have been employed for rigid objects by [54, 55] in the absence of 3D labels and by [21] as additional regularization. In our case, we show that the multi-view re-projection term is critical, particularly to obtain good quality reconstruction of body limbs. Assuming orthographic projection, the front view projection, \(\hat{S}^{FV}\), is obtained by projecting the volumetric grid to the image with the max operator along the z-axis [54]. Similarly, we define \(\hat{S}^{SV}\) as the max along the x-axis:

$$\begin{aligned} \hat{S}^{FV}(x,y) = \max _z \hat{V}_{xyz} \quad \text {and}\quad \hat{S}^{SV}(y,z) = \max _x \hat{V}_{xyz}. \end{aligned}$$
(2)

The true silhouette, \(S^{FV}\), is defined by the ground truth 2D body part segmentation provided by the datasets. We obtain the ground truth side view silhouette from the voxel representation that we computed from the ground truth 3D mesh: \({S}^{SV}(y,z) = \max _x {V}_{xyz}\). We note that our voxels remain slightly larger than the original mesh due to the voxelization step that marks every voxel that intersects with a face as occupied. We define a binary cross-entropy loss per view as follows:

$$\begin{aligned}&\mathcal {L}^{FV}_p = -\sum _{x=1}^{W} \sum _{y=1}^{H} \left[ S^{FV}(x,y)\log \hat{S}^{FV}(x,y)+(1-S^{FV}(x,y))\log (1-\hat{S}^{FV}(x,y)) \right] , \end{aligned}$$
(3)
$$\begin{aligned}&\mathcal {L}^{SV}_p = -\sum _{y=1}^{H} \sum _{z=1}^{D} \left[ S^{SV}(y,z)\log \hat{S}^{SV}(y,z)+(1-S^{SV}(y,z))\log (1-\hat{S}^{SV}(y,z)) \right] . \end{aligned}$$
(4)

We train the shape estimation network initially with \(\mathcal {L}_v\). Then, we continue training with a combined loss: \(\lambda _v\mathcal {L}_v + \lambda ^{FV}_p\mathcal {L}^{FV}_p + \lambda ^{SV}_p\mathcal {L}^{SV}_p\). Sect. 3.3 gives details on how to set the relative weighting of the losses, and Sect. 4.4 demonstrates experimentally the benefits of the multi-view re-projection loss.
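As an illustration of Eqs. (2)-(4), the sketch below projects voxel probabilities to front- and side-view silhouettes with a max operator and evaluates the two re-projection losses; shapes and names are assumptions of ours, not taken from the released implementation:

```python
import torch
import torch.nn.functional as F

def reprojection_losses(voxel_logits, sil_fv, sil_sv):
    """Front- and side-view silhouette losses (Eqs. 3-4) from voxel logits.

    voxel_logits : (B, W, H, D) raw occupancy scores, axes ordered (x, y, z)
    sil_fv       : (B, W, H) ground-truth front-view silhouette in {0, 1}, float
    sil_sv       : (B, H, D) ground-truth side-view silhouette in {0, 1}, float
    """
    probs = torch.sigmoid(voxel_logits)
    # Orthographic projections: max over depth (z) for the front view,
    # max over width (x) for the side view, as in Eq. (2).
    s_fv = probs.max(dim=3).values        # (B, W, H)
    s_sv = probs.max(dim=1).values        # (B, H, D)
    loss_fv = F.binary_cross_entropy(s_fv, sil_fv, reduction='sum')
    loss_sv = F.binary_cross_entropy(s_sv, sil_sv, reduction='sum')
    return loss_fv, loss_sv
```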

3.3 Multi-task Learning with Intermediate Supervision

The input to the 3D shape estimation subnetwork is formed by combining the RGB image with the 2D pose, 2D segmentation, and 3D pose predictions. Here, we present the subnetworks used to predict these intermediate representations and detail our multi-task learning procedure. The architecture of each subnetwork is based on a stacked hourglass network [1], whose output is defined over a spatial grid and is thus convenient for pixel- and voxel-level tasks such as ours.

2D Pose. Following the work of Newell et al. [1], we use a heatmap representation of 2D pose. We predict one heatmap for each body joint, where a Gaussian with fixed variance is centered at the corresponding image location of the joint. The final joint locations are identified as the pixel indices with the maximum value over each output channel. We use the first two stacks of an hourglass network to map the \(3\times 256\times 256\) RGB input to \(16\times 64\times 64\) joint heatmaps as in [1], predicting 16 body joints. The mean-squared error between the ground truth and predicted 2D heatmaps is denoted \(\mathcal {L}^{2D}_{j}\).
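A small sketch of the heatmap encoding and argmax decoding described above; the Gaussian width `sigma` is a placeholder, since its value is not specified here:

```python
import numpy as np

def make_heatmap(x, y, size=64, sigma=1.0):
    """Gaussian target heatmap for one joint at pixel (x, y) on a size x size grid."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

def decode_joints(heatmaps):
    """Pick the argmax location per channel; heatmaps has shape (16, 64, 64)."""
    flat = heatmaps.reshape(heatmaps.shape[0], -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, heatmaps.shape[1:])
    return np.stack([xs, ys], axis=1)   # (16, 2) joint coordinates
```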

2D Part Segmentation. Our body part segmentation network is adopted from [33] and is trained on the SMPL [17] anatomic parts defined by [33]. The architecture is similar to the 2D pose network and again the first two stacks are used. The network predicts one heatmap per body part given the input RGB image, which results in an output resolution of \(15\times 64\times 64\) for 15 body parts. The spatial cross-entropy loss is denoted with \(\mathcal {L}_{s}\).

3D Pose. Estimating 3D joint locations from a single image is an inherently ambiguous problem. To alleviate some of this uncertainty, we assume that the camera intrinsics are known and predict the 3D pose in the camera coordinate system. Extending the notion of 2D heatmaps to 3D, we represent 3D joint locations with 3D Gaussians defined on a voxel grid as in [6]. For each joint, the network predicts a fixed-resolution volume with a single 3D Gaussian centered at the joint location. The \(xy\)-dimensions of this grid are aligned with the image coordinates, and hence the 2D joint locations, while the z dimension represents the depth. We assume this voxel grid is aligned with the 3D body such that the root joint corresponds to the center of the 3D volume. We determine a reasonable depth range in which a human body can fit (roughly 85 cm in our experiments) and quantize this range into 19 bins. We define the overall resolution of the 3D grid to be \(64\times 64\times 19\), i.e., four times smaller in spatial resolution than the input image, as is the case for the 2D pose and segmentation networks. We define one such grid per body joint and regress with the mean-squared error \(\mathcal {L}^{3D}_j\).
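Analogously to the 2D case, a 3D Gaussian target on the \(64\times 64\times 19\) grid could be generated as follows (again, `sigma` and the exact grid indexing are illustrative assumptions):

```python
import numpy as np

def make_3d_heatmap(x, y, z, shape=(64, 64, 19), sigma=1.0):
    """3D Gaussian centred at voxel (x, y, z); x, y follow the image, z the depth bin."""
    xs, ys, zs = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                             np.arange(shape[2]), indexing='ij')
    d2 = (xs - x) ** 2 + (ys - y) ** 2 + (zs - z) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```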

The 3D pose estimation network consists of another two stacks. Unlike 2D pose and segmentation, the 3D pose network takes multiple modalities as input, all spatially aligned with the output of the network. Specifically, we concatenate the RGB channels with the heatmaps corresponding to 2D joints and body parts. We upsample the heatmaps to match the RGB resolution, so the input resolution becomes \((3+16+15)\times 256\times 256\). While 2D pose provides a significant cue for the xy joint locations, some depth information is implicitly contained in the body part segmentation since, unlike a silhouette, it encodes occlusion relations among individual body parts that provide strong 3D cues. For example, a discontinuity on the torso segment caused by an occluding arm segment implies that the arm is in front of the torso. In Appendix C.4, we compare 3D pose prediction with and without this additional information.

Combined Loss and Training Details. The subnetworks are initially trained independently with individual losses, then fine-tuned jointly with a combined loss:

$$\begin{aligned} \mathcal {L} = \lambda ^{2D}_j\mathcal {L}^{2D}_j + \lambda _s\mathcal {L}_s + \lambda ^{3D}_j\mathcal {L}^{3D}_j + \lambda _v\mathcal {L}_v + \lambda ^{FV}_p\mathcal {L}^{FV}_p + \lambda ^{SV}_p\mathcal {L}^{SV}_p. \end{aligned}$$
(5)

The weighting coefficients are set such that the average gradient of each loss across parameters is at the same scale at the beginning of fine-tuning. With this rule, we set \( ( \lambda ^{2D}_j, \lambda _s, \lambda ^{3D}_j, \lambda _v, \lambda ^{FV}_p, \lambda ^{SV}_p ) \propto ( 10^7, 10^3, 10^6, 10^1, 1, 1 ) \) and make the sum of the weights equal to one. We set these weights on the SURREAL dataset and use the same values in all experiments. We found it important to apply this balancing so that the network does not forget the intermediate tasks, but improves the performance of all tasks at the same time.
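One possible realization of this balancing rule is sketched below; it reflects only our reading of "average gradient of each loss across parameters at the same scale" and is not necessarily the authors' exact procedure:

```python
import torch

def balance_loss_weights(losses, shared_params):
    """Scale each loss so that its average absolute gradient over `shared_params`
    matches the others; weights are then normalized to sum to one (Sect. 3.3 rule).

    losses        : list of scalar loss tensors computed on one batch
    shared_params : list of network parameters (tensors with requires_grad=True)
    """
    scales = []
    for loss in losses:
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        flat = torch.cat([g.abs().flatten() for g in grads if g is not None])
        scales.append(flat.mean().item())
    raw = [1.0 / max(s, 1e-12) for s in scales]   # larger gradients -> smaller weight
    total = sum(raw)
    return [w / total for w in raw]
```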

When training our full network (see Fig. 2), we proceed as follows: (i) we train the 2D pose and segmentation networks; (ii) we train the 3D pose network with the 2D pose and segmentation network weights fixed; (iii) we train the 3D shape network with all preceding network weights fixed; (iv) we then continue training the shape network with the additional re-projection losses; (v) finally, we perform end-to-end fine-tuning of all network weights with the combined loss.

Implementation Details. Each of our subnetworks consists of two stacks to keep a reasonable computational cost. We take the first two stacks of the 2D pose network trained on the MPII dataset [56] with 8 stacks [1]. Similarly, the segmentation network is trained on the SURREAL dataset with 8 stacks [33] and the first two stacks are used. Since stacked hourglass networks involve intermediate supervision [1], we can use only part of the network at the cost of a slight performance drop. The weights for the 3D pose and 3D shape networks are randomly initialized and trained on SURREAL with two stacks. Architectural details are given in Appendix B. SURREAL [33], being a large-scale dataset, provides pre-training for the UP dataset [34], on which the networks converge relatively faster. Therefore, we fine-tune the segmentation, 3D pose, and 3D shape networks on UP from those pre-trained on SURREAL. We use the RMSprop [57] algorithm with mini-batches of size 6 and a fixed learning rate of \(10^{-3}\). Color jittering augmentation is applied on the RGB data. For all the networks, we assume that the bounding box of the person is given, and we crop the image to center the person. Code is made publicly available on the project page [58].

3.4 Fitting a Parametric Body Model

While the volumetric output of BodyNet produces good quality results, for some applications it is important to produce a 3D surface mesh, or even a parametric model that can be manipulated. Furthermore, we use the SMPL model for our evaluation. To this end, we process the network output in two steps: (i) we first extract the isosurface from the predicted occupancy map; (ii) we then optimize the parameters of a deformable body model (SMPL in our experiments) so that it fits both the isosurface and the predicted 3D joint locations.

Formally, we define \(\mathbf {V}^n\) as the set of 3D vertices in the isosurface mesh extracted [59] from the network output. SMPL [17] is a statistical model where the vertex locations, \(\mathbf {V}^s(\theta ,\beta )\), are a function of the pose (\(\theta \)) and shape (\(\beta \)) parameters. Given \(\mathbf {V}^n\), our goal is to find \(\{\theta ^\star , \beta ^\star \}\) such that the weighted Chamfer distance, i.e., the distance between the closest point correspondences of \(\mathbf {V}^n\) and \(\mathbf {V}^s(\theta ,\beta )\), is minimized:

$$\begin{aligned} { \{ \theta ^\star , \beta ^\star \} } = \underset{ \{ \theta , \beta \} }{{{\mathrm{\arg \!\min }}}}&\sum _{\mathbf {p}^n \in \mathbf {V}^n} \min _{\mathbf {p}^s \in \mathbf {V}^s(\theta , \beta )} w^n \Vert \mathbf {p}^n-\mathbf {p}^s\Vert _2^2 + \nonumber \\&\sum _{\mathbf {p}^s \in \mathbf {V}^s(\theta , \beta )} \min _{\mathbf {p}^n \in \mathbf {V}^n} w^n\Vert \mathbf {p}^n - \mathbf {p}^s\Vert _2^2 + \lambda \sum _{i=1}^{J} \Vert \mathbf {j}^n_i - \mathbf {j}^s_i(\theta , \beta ) \Vert _2^2 . \end{aligned}$$
(6)

We find it effective to weight the closest point distances by the confidence of the corresponding point in the isosurface, which depends on the voxel predictions of our network. We denote the weight associated with the point \(\mathbf {p}^n\) as \(w^n\). We define an additional term to measure the distance between the predicted 3D joint locations, \(\{\mathbf {j}^n_i\}_{i=1}^{J}\), where J denotes the number of joints, and the corresponding joint locations in the SMPL model, denoted by \(\{\mathbf {j}^s_i(\theta , \beta )\}_{i=1}^{J}\). We weight the contribution of the joint error by a constant \(\lambda \) (empirically set to 5 in our experiments) since J is very small (e.g., 16) compared to the number of vertices (e.g., 6890). In Sect. 4, we show the benefits of fitting to voxel predictions compared to our baseline of fitting to 2D and 3D joints and to 2D segmentation, i.e., to the inputs of the shape network.
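For illustration, the objective of Eq. (6) could be evaluated as follows for fixed vertex sets; assigning the weight \(w^n\) of the matched isosurface point to the second term is our reading of the equation:

```python
import numpy as np
from scipy.spatial import cKDTree

def fit_objective(V_n, w_n, V_s, J_n, J_s, lam=5.0):
    """Weighted two-sided Chamfer distance plus 3D joint term (Eq. 6).

    V_n : (N, 3) isosurface vertices,   w_n : (N,) per-vertex confidences
    V_s : (M, 3) SMPL vertices for the current (theta, beta)
    J_n : (J, 3) predicted 3D joints,   J_s : (J, 3) SMPL joints
    """
    tree_s = cKDTree(V_s)
    tree_n = cKDTree(V_n)
    d_ns, _ = tree_s.query(V_n)     # closest SMPL vertex for each isosurface point
    d_sn, idx = tree_n.query(V_s)   # closest isosurface point for each SMPL vertex
    term1 = np.sum(w_n * d_ns ** 2)
    term2 = np.sum(w_n[idx] * d_sn ** 2)
    term3 = lam * np.sum(np.sum((J_n - J_s) ** 2, axis=1))
    return term1 + term2 + term3
```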

We optimize Eq. (6) in an iterative manner, updating the correspondences at each iteration. We use Powell's dogleg method [60] and Chumpy [61], similar to [13]. When reconstructing the isosurface, we first threshold the voxel predictions (at 0.5 in our experiments) and then apply the marching cubes algorithm [59]. We initialize the SMPL pose parameters to be aligned with our 3D pose predictions and set \(\beta = \mathbf {0}\) (where \(\mathbf {0}\) denotes a vector of zeros).
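A hedged sketch of the isosurface extraction step, using scikit-image's marching cubes (sampling per-vertex confidences from the nearest voxel is our interpretation of how \(w^n\) could be obtained):

```python
import numpy as np
from skimage import measure

def extract_isosurface(voxel_probs, level=0.5):
    """Extract a triangle mesh from the predicted occupancy probabilities.

    voxel_probs : (128, 128, 128) sigmoid outputs of the shape network
    Returns vertices (N, 3), faces (F, 3), and per-vertex confidences
    sampled from the nearest voxel.
    """
    verts, faces, _, _ = measure.marching_cubes(voxel_probs, level=level)
    idx = np.clip(np.round(verts).astype(int), 0,
                  np.array(voxel_probs.shape) - 1)
    confidences = voxel_probs[idx[:, 0], idx[:, 1], idx[:, 2]]
    return verts, faces, confidences
```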

4 Experiments

This section presents the evaluation of BodyNet. We first describe evaluation datasets (Sect. 4.1) and other methods used for comparison in this paper (Sect. 4.2). We then evaluate contributions of additional inputs (Sect. 4.3) and losses (Sect. 4.4). Next, we report performance on the UP dataset (Sect. 4.5). Finally, we demonstrate results for 3D body part segmentation (Sect. 4.6).

4.1 Datasets and Evaluation Measures

SURREAL Dataset [33] is a large-scale synthetic dataset for 3D human body shapes with ground truth labels for segmentation, 2D/3D pose, and SMPL body parameters. Given its scale and rich ground truth, we use SURREAL in this work for training and testing. Previous work demonstrating the successful use of synthetic images of people for training visual models includes [62,63,64]. Given the SMPL shape and pose parameters, we compute the ground truth 3D mesh. We use the standard train split [33]. For testing, we use the middle frame of the middle clip of each test sequence, which makes a total of 507 images. We observed that testing on the full test set of 12,528 images yields similar results. To evaluate the quality of our shape predictions for difficult cases, we define two subsets with extreme body shapes, similar to what is done, for example, in optical flow [65]. We compute the surface distance between the average shape (\(\beta =\mathbf {0}\)) given the ground truth pose and the true shape. We take the \(10^{th}\) (s10) and \(20^{th}\) (s20) percentiles of this distance distribution, which represent the meshes with extreme body shapes.

Unite the People Dataset (UP) [34] is a recent collection of multiple datasets (e.g., MPII [56], LSP [66]) providing additional annotations for each image. The annotations include 2D pose with 91 keypoints, 31 body part segments, and 3D SMPL models. The ground truth is acquired in a semi-automatic way and is therefore imprecise. We evaluate our 3D body shape estimations on this dataset. We report errors on two different subsets of the test set where 2D segmentations as well as pseudo 3D ground truth are available. We use notation T1 for images from the LSP subset [34], and T2 for images used by [14].

3D Shape Evaluation. We evaluate body shape estimation with different measures. Given the ground truth and our predicted volumetric representation, we measure the intersection over union directly on the voxel grid, i.e., voxel IOU. We further assess the quality of the projected silhouette to enable comparison with [14, 16, 34]. We report the intersection over union (silhouette IOU), F1-score computed for foreground pixels, and global accuracy (ratio of correctly predicted foreground and background pixels). We evaluate the quality of the fitted SMPL model by measuring the average error in millimeters between the corresponding vertices in the fit and ground truth mesh (surface error). We also report the average error between the corresponding 91 landmarks defined for the UP dataset [34]. We assume the depth of the root joint and the focal length to be known to transform the volumetric representation into a metric space.
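The 3D and 2D shape metrics described above can be computed straightforwardly; the sketch below is illustrative and uses a 0.5 occupancy threshold by analogy with Sect. 3.4:

```python
import numpy as np

def voxel_iou(pred, gt, thresh=0.5):
    """Intersection over union on the occupancy grid."""
    p = pred > thresh
    g = gt > 0.5
    return np.logical_and(p, g).sum() / max(np.logical_or(p, g).sum(), 1)

def silhouette_metrics(pred_sil, gt_sil):
    """Silhouette IOU, foreground F1, and global pixel accuracy."""
    p, g = pred_sil.astype(bool), gt_sil.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    iou = inter / max(union, 1)
    prec = inter / max(p.sum(), 1)
    rec = inter / max(g.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    acc = (p == g).mean()
    return iou, f1, acc
```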

4.2 Alternative Methods

We demonstrate advantages of BodyNet by comparing it to alternative methods. BodyNet makes use of 2D/3D pose estimation and 2D segmentation. We define alternative methods in terms of the same components combined differently.

SMPLify++. Lassner et al. [34] extended SMPLify [13] with an additional term on 2D silhouette. Here, we extend it further to enable a fair comparison with BodyNet. We use the code from [13] and implement a fitting objective with additional terms on 2D silhouette and 3D pose besides 2D pose (see Appendix D). As shown in Table 2, results of SMPLify++ remain inferior to BodyNet despite both of them using 2D/3D pose and segmentation inputs (see Fig. 3).

Shape Parameter Regression. To validate our volumetric representation, we also implement a regression method by replacing the 3D shape estimation network in Fig. 2 with another subnetwork that directly regresses the 10-dim. shape parameter vector \(\beta \) using an L2 loss. The network architecture corresponds to the encoder part of the hourglass followed by 3 additional fully connected layers (see Appendix B for details). We recover the pose parameters \(\theta \) from our 3D pose prediction (initial attempts to regress \(\theta \) together with \(\beta \) gave worse results). Table 2 demonstrates the inferior performance of the \(\beta \)-regression network, which often produces average body shapes (see Fig. 3). In contrast, BodyNet results in better SMPL fitting due to its accurate volumetric representation.

Fig. 3. SMPL fit on BodyNet predictions compared with other methods. While shape parameter regression and fitting only to the BodyNet inputs (SMPLify++) produce shapes close to the average, BodyNet learns how the true shape observed in the image deviates from the average deformable shape model. Examples taken from the test subset s10 of the SURREAL dataset with extreme shapes.

Table 1. Performance on the SURREAL dataset using alternative combinations of intermediate representations at the input.
Fig. 4. Our predicted 2D pose, segmentation, 3D pose, 3D volumetric shape, and SMPL model alignments. Our 3D shape predictions are consistent with pose and segmentation, suggesting that the shape network relies on the intermediate representations. When one of the auxiliary tasks fails (2D pose on the right), 3D shape can still be recovered with the help of the other cues.

4.3 Effect of Additional Inputs

We first motivate our proposed architecture by evaluating the performance of 3D shape estimation on the SURREAL dataset using alternative inputs (see Table 1). When using only one input, the 3D pose network, which is already trained with additional 2D pose and segmentation inputs, performs best. We observe improvements as more cues, specifically 3D cues, are added. We also note that intermediate representations in terms of 3D pose and 2D segmentation outperform RGB. Adding RGB to the intermediate representations further improves shape results on SURREAL. Figure 4 illustrates intermediate predictions as well as the final 3D shape output. Based on the results in Table 1, we choose to use all intermediate representations as parts of our full network, which we call BodyNet.

Table 2. Volumetric prediction on SURREAL with different versions of our model compared to alternative methods. Note that lines 2–10 use the same modalities (i.e., 2D/3D pose, 2D segmentation). The evaluation is made on the SMPL model fit to our voxel outputs. The average SMPL surface error decreases with the addition of the proposed components.

4.4 Effect of Re-projection Error and End-to-End Multi-task Training

We evaluate the contributions provided by the additional supervision introduced in Sects. 3.2 and 3.3.

Effect of Re-projection Losses. Table 2 (lines 4–10) provides results when the shape network is trained with and without re-projection losses (see also Fig. 5). The voxel network without any additional loss already outperforms the baselines described in Sect. 4.2. When trained with re-projection losses, we observe increased performance both with single-view constraints, i.e., the front view (FV), and with multi-view constraints, i.e., front and side views (FV+SV). The multi-view re-projection loss puts more importance on the body surface, resulting in a better SMPL fit.

Effect of Intermediate Losses. Table 2 (lines 7–10) presents experimental evaluation of the proposed intermediate supervision. Here, we first compare the end-to-end network fine-tuned jointly with auxiliary tasks (lines 9–10) to the networks trained independently from the fixed representations (lines 4–6). Comparison of results on lines 6 and 10 suggests that multi-task training regularizes all subnetworks and provides better performance for 3D shape. We refer to Appendix C.1 for the performance improvements on auxiliary tasks. To assess the contribution of intermediate losses on 2D pose, segmentation, and 3D pose, we implement an additional baseline where we again fine-tune end-to-end, but remove the losses on the intermediate tasks (lines 7–8). Here, we keep only the voxels and the re-projection losses. These networks not only forget the intermediate tasks, but are also outperformed by our base networks without end-to-end refinement (compare lines 8 and 6). On all the test subsets (i.e., full, s20, and s10) we observe a consistent improvement of the proposed components against baselines. Figure 3 presents qualitative results and illustrates how BodyNet successfully learns the 3D shape in extreme cases.

Comparison to the State of the Art. Table 2 (lines 1,10) demonstrates a significant improvement of BodyNet compared to the recent method of Tung et al. [15]. Note that [15] relies on ground truth 2D pose and segmentation on the test set, while our approach is fully automatic. Other works do not report results on the recent SURREAL dataset.

Table 3. Body shape performance and comparison to the state of the art on the UP dataset. Unlike in SURREAL, the 3D ground truth in this dataset is imprecise.

4.5 Comparison to the State of the Art on Unite the People

For the networks trained on the UP dataset, we initialize with the weights pre-trained on SURREAL and fine-tune on the complete training set of UP-3D, where the 2D segmentations are obtained from the provided 3D SMPL fits [34]. We show results of BodyNet trained end-to-end with the multi-view re-projection loss. We provide quantitative evaluation of our method in Table 3 and compare to recent approaches [14, 16, 34]. We note that some works only report 2D metrics measuring how well the 3D shape aligns with the manually annotated segmentation. The ground truth is a noisy estimate obtained in a semi-automatic way [34]; its projection is mostly accurate, but its depth is not. While our results are on par with previous approaches on 2D metrics, we note that the provided manual segmentations and the 3D SMPL fits [34] are noisy and affect both the training and the evaluation [48]. Therefore, we also provide a large set of visual results in Appendices A and E to illustrate our competitive 3D estimation quality. On 3D metrics, our method significantly outperforms both the direct and indirect learning of [14]. We also provide qualitative results in Fig. 4, where we show both the intermediate outputs and the final 3D shape predicted by our method. We observe that the voxel predictions are aligned with the 3D pose predictions and provide a robust SMPL fit. We refer to Appendix E for an analysis of the type of segmentation used as re-projection supervision.

4.6 3D Body Part Segmentation

As described in Sect. 3.1, we extend our method to produce not only the foreground voxels for a human body, but also the 3D part labeling. We report quantitative results on SURREAL in Table 4, where accurate ground truth is available. When the parts are combined, the foreground IOU becomes 58.9, which is comparable to the 58.1 reported in Table 1. We provide qualitative results in Fig. 6 on the UP dataset, where the parts network is only trained on SURREAL. To the best of our knowledge, we present the first end-to-end method for 3D body part labeling from a single image. We infer volumetric body parts directly with a network, without iterative fitting of a deformable model, and obtain successful results. In terms of runtime, BodyNet produces foreground and per-limb voxels in 0.28 s and 0.58 s per image, respectively, on modern GPUs.

Fig. 5. Voxel predictions color-coded based on the confidence values. Notice that our combined 3D and re-projection loss enables our network to make more confident predictions across the whole body. Example taken from SURREAL.

Fig. 6. BodyNet is able to directly regress volumetric body parts from a single image on examples from UP.

Table 4. 3D body part segmentation performance measured per part on SURREAL. The articulated and small limbs appear more difficult than the torso.

5 Conclusion

We have presented BodyNet, a fully automatic, end-to-end, multi-task network architecture that predicts 3D human body shape from a single image. We have shown that joint training with intermediate tasks significantly improves the results. We have also demonstrated that volumetric regression together with a multi-view re-projection loss is effective for representing human bodies. Moreover, this flexible representation allows us to extend our approach and demonstrate impressive results on 3D body part segmentation from a single image. We believe that BodyNet can provide a trainable building block for future methods that make use of 3D body information, such as virtual cloth change. Furthermore, we believe that exploring the limits of using only intermediate representations is an interesting research direction for 3D tasks where acquiring training data is impractical. Another future direction is to study 3D body shape under clothing. The volumetric representation can potentially capture such additional geometry if training data is provided.