
1 Introduction

Parsing people in visual data is central to many applications including mixed-reality interfaces, animation, video editing and human action recognition. Towards this goal, human 2D pose estimation has been significantly advanced by recent efforts [1,2,3,4]. Such methods aim to recover 2D locations of body joints and provide a simplified geometric representation of the human body. There has also been significant progress in 3D human pose estimation [5,6,7,8]. Many applications, however, such as virtual clothes try-on, video editing and re-enactment require accurate estimation of both 3D human pose and shape.

Fig. 1. Our BodyNet predicts a volumetric 3D human body shape and 3D body parts from a single image. We show the input image, the predicted human voxels, and the predicted part voxels.

3D human shape estimation has been mostly studied in controlled settings using specific sensors including multi-view capture [9], motion capture markers [10], inertial sensors [11], and 3D scanners [12]. 3D human shape estimation in uncontrolled single-view settings, however, has received little attention so far. The challenges include the lack of large-scale training data, the high dimensionality of the output space, and the choice of suitable representations for 3D human shape. Bogo et al. [13] present the first automatic method to fit a deformable body model to an image, but rely on accurate 2D pose estimation and introduce hand-designed constraints enforcing elbows and knees to bend naturally. Other recent methods [14,15,16] employ deformable human body models such as SMPL [17] and regress model parameters with CNNs [18, 19]. In this work, we compare against such approaches and demonstrate the advantages of our method.

The optimal choice of 3D representation for neural networks remains an open problem. Recent work explores voxel [20,21,22,23], octree [24,25,26,27], point cloud [28,29,30], and surface [31] representations for modeling generic 3D objects. In the case of human bodies, the common approach has been to regress parameters of pre-defined human shape models [14,15,16]. However, the mapping between the 3D shape and the parameters of deformable body models is highly non-linear and currently difficult to learn. Moreover, regression to a single set of parameters cannot represent multiple hypotheses and can be problematic in ambiguous situations. Notably, skeleton regression methods for 2D human pose estimation, e.g., [32], have recently been overtaken by heatmap-based methods [1, 2] that enable the representation of multiple hypotheses.

In this work we propose and investigate a volumetric representation for body shape estimation, as illustrated in Fig. 1. Our network, called BodyNet, generates likelihoods on the 3D occupancy grid of a person. To efficiently train our network, we propose to regularize BodyNet with a set of auxiliary losses. Besides the main volumetric 3D loss, BodyNet includes a multi-view re-projection loss and multi-task losses. The multi-view re-projection loss, efficiently approximated in voxel space (see Sect. 3.2), increases the importance of the boundary voxels. The multi-task losses are based on additional intermediate network supervision in terms of 2D pose, 2D body part segmentation, and 3D pose. The overall architecture of BodyNet is illustrated in Fig. 2.

To evaluate our method, we fit the SMPL model [17] to the BodyNet output and measure single-view 3D human shape estimation performance on the recent SURREAL [33] and Unite the People [34] datasets. The proposed BodyNet approach demonstrates state-of-the-art performance and improves the accuracy over recent methods. We show significant improvements provided by the end-to-end training and auxiliary losses of BodyNet. Furthermore, our method enables volumetric body part segmentation. BodyNet is fully differentiable and could be used as a subnetwork in future application-oriented methods targeting, e.g., virtual cloth change or re-enactment.

In summary, this work makes several contributions. First, we address single-view 3D human shape estimation and propose a volumetric representation for this task. Second, we investigate several network architectures and propose BodyNet, an end-to-end trainable network that combines a multi-view re-projection loss with intermediate network supervision in terms of 2D pose, 2D body part segmentation, and 3D pose. Third, we outperform previous regression-based methods and demonstrate state-of-the-art performance on two datasets for human shape estimation. In addition, our network is fully differentiable and can provide volumetric body part segmentation.

Fig. 2. BodyNet: End-to-end trainable network for 3D human body shape estimation. The input RGB image is first passed through subnetworks for 2D pose estimation and 2D body part segmentation. These predictions, combined with the RGB features, are fed to another network predicting 3D pose. All subnetworks are combined into a final network that infers the volumetric shape. The 2D pose, 2D segmentation, and 3D pose networks are first pre-trained and then fine-tuned jointly for the task of volumetric shape estimation using multi-view re-projection losses. We fit the SMPL model to the volumetric predictions for the purpose of evaluation.

2 Related Work

3D Human Body Shape. While the problem of localizing 3D body joints has been well explored in the past [5,6,7,8, 35,36,37,38], 3D human shape estimation from a single image has received limited attention and remains a challenging problem. Earlier work [39, 40] proposed to optimize pose and shape parameters of the 3D deformable body model SCAPE [41]. More recent methods use the SMPL [17] body model, which again represents the 3D shape as a function of pose and shape parameters. Given such a model and an input image, Bogo et al. [13] present the optimization method SMPLify, which estimates model parameters from a fit to 2D joint locations. Lassner et al. [34] extend this approach by incorporating silhouette information as additional guidance and improve the optimization performance using densely sampled 2D points. Huang et al. [42] extend SMPLify to multi-view video sequences with temporal priors. Similar temporal constraints have been used in [43]. Rhodin et al. [44] use a sum-of-Gaussians volumetric representation together with contour-based refinement and successfully demonstrate human shape recovery from multi-view videos with optimization techniques. Even though such methods show compelling results, they are inherently limited by the quality of the 2D detections they use and depend on priors on both pose and shape parameters to regularize the highly complex and costly optimization process.

Deep neural networks provide an alternative approach that can be expected to learn appropriate priors automatically from the data. Dibra et al. [45] present one of the first approaches in this direction and train a CNN to estimate the 3D shape parameters from silhouettes, but assume a frontal input view. More recent approaches [14,15,16] train neural networks to predict the SMPL body parameters from an input image. Tan et al. [14] design an encoder-decoder architecture that is trained on silhouette prediction and indirectly regresses model parameters at the bottleneck layer. Tung et al. [15] operate on two consecutive video frames and learn parameters by integrating re-projection losses on optical flow, silhouettes, and 2D joints. Similarly, Kanazawa et al. [16] predict parameters with a re-projection loss on the 2D joints and introduce an adversary whose goal is to distinguish unrealistic human body shapes.

Even though parameters of deformable body models provide a low-dimensional embedding of the 3D shape, predicting such parameters with a network requires learning a highly non-linear mapping. In our work, we opt for an alternative volumetric representation that has been shown to be effective for generic 3D objects [21] and faces [46]. The approach of [21] operates on low-resolution grayscale images of a few rigid object categories such as chairs and tables. We argue that human bodies are more challenging due to significant non-rigid deformations. To accommodate such deformations, we use segmentation and 3D pose as proxies for 3D shape, in addition to 2D pose [46]. By conditioning 3D shape estimation on a given 3D pose, the network can focus on the more complicated problem of shape deformation. Furthermore, we regularize our voxel predictions with an additional re-projection loss, perform end-to-end multi-task training with intermediate supervision, and obtain volumetric body part segmentation.

Others have studied predicting 2.5D projections of human bodies. DenseReg [47] and DensePose [48] estimate image-to-surface correspondences, while [33] outputs quantized depth maps for SMPL bodies. Differently from these methods, our approach generates a full 3D body reconstruction.

Multi-task Neural Networks. Multi-task networks are well studied. A common approach is to output multiple related tasks at the very end of the neural network architecture. Another, more recently explored alternative is to stack multiple subnetworks and provide guidance with intermediate supervision. Here, we only cover related works that employ the latter approach. Guiding CNNs with relevant cues has shown improvements for a number of tasks. For example, 2D facial landmarks have been shown to provide useful guidance for 3D face reconstruction [46], and similarly optical flow for action recognition [49]. However, these methods do not perform joint training. Recent work of [50] jointly learns 2D/3D pose together with action recognition. Similarly, [51] trains for 3D pose with intermediate tasks of 2D pose and segmentation. With this motivation, we make use of 2D pose, 2D human body part segmentation, and 3D pose, which provide cues for 3D human shape estimation. Unlike [51], in our work 3D pose becomes an auxiliary task for the final 3D shape task. In our experiments, we show that training with a joint loss on all these tasks increases the performance of all our subnetworks (see Appendix C.1).

3 BodyNet

BodyNet predicts 3D human body shape from a single image and is composed of four subnetworks trained first independently, then jointly to predict 2D pose, 2D body part segmentation, 3D pose, and 3D shape (see Fig. 2). Here, we first discuss the details of the volumetric representation for body shape (Sect. 3.1). Then, we describe the multi-view re-projection loss (Sect. 3.2) and the multi-task training with the intermediate representations (Sect. 3.3). Finally, we formulate our model fitting procedure (Sect. 3.4).

3.1 Volumetric Inference for 3D Human Shape

For 3D human body shape, we propose to use a voxel-based representation. Our shape estimation subnetwork outputs the 3D shape represented as an occupancy map defined on a fixed-resolution voxel grid. Specifically, given a 3D body, we define a 3D voxel grid roughly centered at the root joint (i.e., the hip joint), where each voxel inside the body is marked as occupied. We voxelize the ground truth meshes (i.e., SMPL) into a fixed-resolution grid using binvox [52, 53]. We assume orthographic projection and rescale the volume such that the xy-plane is aligned with the 2D segmentation mask to ensure spatial correspondence with the input image. After scaling, the body is centered on the z-axis and the remaining areas are padded with zeros.

Our network minimizes the binary cross-entropy loss after applying the sigmoid function on the network output similar to [46]:

$$\begin{aligned} \mathcal {L}_v = -\sum _{x=1}^{W} \sum _{y=1}^{H} \sum _{z=1}^{D} \left[ V_{xyz}\log \hat{V}_{xyz}+(1-V_{xyz})\log (1-\hat{V}_{xyz}) \right] , \end{aligned}$$
(1)

where \(V_{xyz}\) and \(\hat{V}_{xyz}\) denote the ground truth value and the predicted sigmoid output for a voxel, respectively. Width (W), height (H) and depth (D) are 128 in our experiments. We observe that this resolution captures sufficient details.
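For concreteness, a minimal sketch of this voxel loss, written in PyTorch purely for illustration (the tensor shapes follow the \(128^3\) grid above; the framework choice and variable names are ours, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def voxel_bce_loss(logits, target):
    """Binary cross-entropy over the W x H x D occupancy grid (Eq. 1).

    logits : raw network output, shape (B, 128, 128, 128)
    target : ground-truth occupancy in {0, 1}, same shape, float
    """
    # binary_cross_entropy_with_logits applies the sigmoid internally;
    # reduction='sum' matches the summation over all voxels in Eq. (1).
    return F.binary_cross_entropy_with_logits(logits, target, reduction='sum')

# toy usage
logits = torch.randn(2, 128, 128, 128)
target = (torch.rand(2, 128, 128, 128) > 0.5).float()
loss = voxel_bce_loss(logits, target)
```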

The loss \(\mathcal {L}_v\) is used to perform foreground-background segmentation of the voxel grid. We further extend this formulation to perform 3D body part segmentation with a multi-class cross-entropy loss. We define 6 parts (head, torso, left/right leg, left/right arm) and learn a 7-class classification including the background. The weights of this network are initialized from the shape network by copying its output-layer weights to each class. This simple extension allows the network to directly infer 3D body parts without going through the costly SMPL model fitting.
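A similar sketch for the multi-class extension; the 7-channel output layout and the integer part labels are our assumptions about how the formulation above could be implemented:

```python
import torch
import torch.nn.functional as F

def voxel_part_loss(logits, labels):
    """Multi-class cross-entropy over the voxel grid.

    logits : (B, 7, 128, 128, 128) -- one channel per class (background + 6 parts)
    labels : (B, 128, 128, 128) integer class ids in [0, 6], dtype torch.long
    """
    # cross_entropy supports spatial targets of shape (B, d1, d2, d3).
    return F.cross_entropy(logits, labels, reduction='sum')
```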

3.2 Multi-view Re-projection Loss on the Silhouette

Due to the complex articulation of the human body, one major challenge in inferring the volumetric body shape is to ensure high confidence predictions across the whole body. We often observe that the confidences on the limbs away from the body center tend to be lower (see Fig. 5). To address this problem, we employ additional 2D re-projection losses that increase the importance of the boundary voxels. Similar losses have been employed for rigid objects by [54, 55] in the absence of 3D labels and by [21] as additional regularization. In our case, we show that the multi-view re-projection term is critical, particularly to obtain good quality reconstruction of body limbs. Assuming orthographic projection, the front view projection, \(\hat{S}^{FV}\), is obtained by projecting the volumetric grid to the image with the max operator along the z-axis [54]. Similarly, we define \(\hat{S}^{SV}\) as the max along the x-axis:

$$\begin{aligned} \hat{S}^{FV}(x,y) = \max _z \hat{V}_{xyz} \quad \text {and}\quad \hat{S}^{SV}(y,z) = \max _x \hat{V}_{xyz}. \end{aligned}$$
(2)

The true silhouette, \(S^{FV}\), is defined by the ground truth 2D body part segmentation provided by the datasets. We obtain the ground truth side view silhouette from the voxel representation that we computed from the ground truth 3D mesh: \({S}^{SV}(y,z) = \max _x {V}_{xyz}\). We note that our voxels remain slightly larger than the original mesh due to the voxelization step that marks every voxel that intersects with a face as occupied. We define a binary cross-entropy loss per view as follows:

$$\begin{aligned}&\mathcal {L}^{FV}_p = -\sum _{x=1}^{W} \sum _{y=1}^{H} \left[ S^{FV}(x,y)\log \hat{S}^{FV}(x,y)+(1-S^{FV}(x,y))\log (1-\hat{S}^{FV}(x,y)) \right] , \end{aligned}$$
(3)
$$\begin{aligned}&\mathcal {L}^{SV}_p = -\sum _{y=1}^{H} \sum _{z=1}^{D} \left[ S^{SV}(y,z)\log \hat{S}^{SV}(y,z)+(1-S^{SV}(y,z))\log (1-\hat{S}^{SV}(y,z)) \right] . \end{aligned}$$
(4)

We train the shape estimation network initially with \(\mathcal {L}_v\). Then, we continue training with a combined loss: \(\lambda _v\mathcal {L}_v + \lambda ^{FV}_p\mathcal {L}^{FV}_p + \lambda ^{SV}_p\mathcal {L}^{SV}_p\). Sect. 3.3 gives details on how to set the relative weighting of the losses, and Sect. 4.4 demonstrates experimentally the benefits of the multi-view re-projection loss.
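As an illustration of Eqs. (2)-(4), the sketch below projects voxel probabilities to front- and side-view silhouettes with a max operator and evaluates the two re-projection losses; shapes and names are assumptions of ours, not taken from the released implementation:

```python
import torch
import torch.nn.functional as F

def reprojection_losses(voxel_logits, sil_fv, sil_sv):
    """Front- and side-view silhouette losses (Eqs. 3-4) from voxel logits.

    voxel_logits : (B, W, H, D) raw occupancy scores, axes ordered (x, y, z)
    sil_fv       : (B, W, H) ground-truth front-view silhouette in {0, 1}, float
    sil_sv       : (B, H, D) ground-truth side-view silhouette in {0, 1}, float
    """
    probs = torch.sigmoid(voxel_logits)
    # Orthographic projections: max over depth (z) for the front view,
    # max over width (x) for the side view, as in Eq. (2).
    s_fv = probs.max(dim=3).values        # (B, W, H)
    s_sv = probs.max(dim=1).values        # (B, H, D)
    loss_fv = F.binary_cross_entropy(s_fv, sil_fv, reduction='sum')
    loss_sv = F.binary_cross_entropy(s_sv, sil_sv, reduction='sum')
    return loss_fv, loss_sv
```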

3.3 Multi-task Learning with Intermediate Supervision

The input to the 3D shape estimation subnetwork is formed by combining the RGB image with the 2D pose, 2D segmentation, and 3D pose predictions. Here, we present the subnetworks used to predict these intermediate representations and detail our multi-task learning procedure. The architecture of each subnetwork is based on a stacked hourglass network [1], whose output is defined over a spatial grid and is thus convenient for pixel- and voxel-level tasks such as ours.

2D Pose. Following the work of Newell et al. [1], we use a heatmap representation of 2D pose. We predict one heatmap for each body joint, where a Gaussian with fixed variance is centered at the corresponding image location of the joint. The final joint locations are identified as the pixel indices with the maximum value over each output channel. We use the first two stacks of an hourglass network to map the \(3\times 256\times 256\) RGB input to \(16\times 64\times 64\) joint heatmaps as in [1], predicting 16 body joints. The mean-squared error between the ground truth and predicted 2D heatmaps is denoted \(\mathcal {L}^{2D}_{j}\).
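A small sketch of the heatmap encoding and argmax decoding described above; the Gaussian width `sigma` is a placeholder, since its value is not specified here:

```python
import numpy as np

def make_heatmap(x, y, size=64, sigma=1.0):
    """Gaussian target heatmap for one joint at pixel (x, y) on a size x size grid."""
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

def decode_joints(heatmaps):
    """Pick the argmax location per channel; heatmaps has shape (16, 64, 64)."""
    flat = heatmaps.reshape(heatmaps.shape[0], -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, heatmaps.shape[1:])
    return np.stack([xs, ys], axis=1)   # (16, 2) joint coordinates
```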

2D Part Segmentation. Our body part segmentation network is adopted from [33] and is trained on the SMPL [17] anatomic parts defined by [33]. The architecture is similar to the 2D pose network and again the first two stacks are used. The network predicts one heatmap per body part given the input RGB image, which results in an output resolution of \(15\times 64\times 64\) for 15 body parts. The spatial cross-entropy loss is denoted with \(\mathcal {L}_{s}\).

3D Pose. Estimating 3D joint locations from a single image is an inherently ambiguous problem. To alleviate some of this uncertainty, we assume that the camera intrinsics are known and predict the 3D pose in the camera coordinate system. Extending the notion of 2D heatmaps to 3D, we represent 3D joint locations with 3D Gaussians defined on a voxel grid as in [6]. For each joint, the network predicts a fixed-resolution volume with a single 3D Gaussian centered at the joint location. The \(xy\)-dimensions of this grid are aligned with the image coordinates, and hence the 2D joint locations, while the z dimension represents the depth. We assume this voxel grid is aligned with the 3D body such that the root joint corresponds to the center of the 3D volume. We determine a reasonable depth range in which a human body can fit (roughly 85 cm in our experiments) and quantize this range into 19 bins. We define the overall resolution of the 3D grid to be \(64\times 64\times 19\), i.e., four times smaller in spatial resolution than the input image, as is the case for the 2D pose and segmentation networks. We define one such grid per body joint and regress with the mean-squared error \(\mathcal {L}^{3D}_j\).
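Analogously to the 2D case, a 3D Gaussian target on the \(64\times 64\times 19\) grid could be generated as follows (again, `sigma` and the exact grid indexing are illustrative assumptions):

```python
import numpy as np

def make_3d_heatmap(x, y, z, shape=(64, 64, 19), sigma=1.0):
    """3D Gaussian centred at voxel (x, y, z); x, y follow the image, z the depth bin."""
    xs, ys, zs = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]),
                             np.arange(shape[2]), indexing='ij')
    d2 = (xs - x) ** 2 + (ys - y) ** 2 + (zs - z) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```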

The 3D pose estimation network consists of another two stacks. Unlike 2D pose and segmentation, the 3D pose network takes multiple modalities as input, all spatially aligned with the output of the network. Specifically, we concatenate the RGB channels with the heatmaps corresponding to 2D joints and body parts. We upsample the heatmaps to match the RGB resolution, so the input resolution becomes \((3+16+15)\times 256\times 256\). While 2D pose provides a significant cue for the xy joint locations, some depth information is implicitly contained in the body part segmentation since, unlike a silhouette, it encodes occlusion relations among individual body parts that provide strong 3D cues. For example, a discontinuity on the torso segment caused by an occluding arm segment implies that the arm is in front of the torso. In Appendix C.4, we compare 3D pose prediction with and without this additional information.

Combined Loss and Training Details. The subnetworks are initially trained independently with individual losses, then fine-tuned jointly with a combined loss:

$$\begin{aligned} \mathcal {L} = \lambda ^{2D}_j\mathcal {L}^{2D}_j + \lambda _s\mathcal {L}_s + \lambda ^{3D}_j\mathcal {L}^{3D}_j + \lambda _v\mathcal {L}_v + \lambda ^{FV}_p\mathcal {L}^{FV}_p + \lambda ^{SV}_p\mathcal {L}^{SV}_p. \end{aligned}$$
(5)

The weighting coefficients are set such that the average gradient of each loss across parameters is at the same scale at the beginning of fine-tuning. With this rule, we set \( ( \lambda ^{2D}_j, \lambda _s, \lambda ^{3D}_j, \lambda _v, \lambda ^{FV}_p, \lambda ^{SV}_p ) \propto ( 10^7, 10^3, 10^6, 10^1, 1, 1 ) \) and make the sum of the weights equal to one. We set these weights on the SURREAL dataset and use the same values in all experiments. We found it important to apply this balancing so that the network does not forget the intermediate tasks, but improves the performance of all tasks at the same time.
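One possible realization of this balancing rule is sketched below; it reflects only our reading of "average gradient of each loss across parameters at the same scale" and is not necessarily the authors' exact procedure:

```python
import torch

def balance_loss_weights(losses, shared_params):
    """Scale each loss so that its average absolute gradient over `shared_params`
    matches the others; weights are then normalized to sum to one (Sect. 3.3 rule).

    losses        : list of scalar loss tensors computed on one batch
    shared_params : list of network parameters (tensors with requires_grad=True)
    """
    scales = []
    for loss in losses:
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        flat = torch.cat([g.abs().flatten() for g in grads if g is not None])
        scales.append(flat.mean().item())
    raw = [1.0 / max(s, 1e-12) for s in scales]   # larger gradients -> smaller weight
    total = sum(raw)
    return [w / total for w in raw]
```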

When training our full network (see Fig. 2), we proceed as follows: (i) we train the 2D pose and segmentation networks; (ii) we train the 3D pose network with the 2D pose and segmentation network weights fixed; (iii) we train the 3D shape network with all preceding network weights fixed; (iv) we then continue training the shape network with the additional re-projection losses; (v) finally, we perform end-to-end fine-tuning of all network weights with the combined loss.

Implementation Details. Each of our subnetworks consists of two stacks to keep a reasonable computational cost. We take the first two stacks of the 2D pose network trained on the MPII dataset [56] with 8 stacks [1]. Similarly, the segmentation network is trained on the SURREAL dataset with 8 stacks [33] and the first two stacks are used. Since stacked hourglass networks involve intermediate supervision [1], we can use only part of the network at the cost of a slight performance drop. The weights for the 3D pose and 3D shape networks are randomly initialized and trained on SURREAL with two stacks. Architectural details are given in Appendix B. SURREAL [33], being a large-scale dataset, provides pre-training for the UP dataset [34], on which the networks converge relatively faster. Therefore, we fine-tune the segmentation, 3D pose, and 3D shape networks on UP from those pre-trained on SURREAL. We use the RMSprop [57] algorithm with mini-batches of size 6 and a fixed learning rate of \(10^{-3}\). Color jittering augmentation is applied on the RGB data. For all the networks, we assume that the bounding box of the person is given, and we crop the image to center the person. Code is made publicly available on the project page [58].

3.4 Fitting a Parametric Body Model

While the volumetric output of BodyNet produces good quality results, for some applications it is important to produce a 3D surface mesh, or even a parametric model that can be manipulated. Furthermore, we use the SMPL model for our evaluation. To this end, we process the network output in two steps: (i) we first extract the isosurface from the predicted occupancy map; (ii) we then optimize the parameters of a deformable body model (SMPL in our experiments) so that it fits both the isosurface and the predicted 3D joint locations.

Formally, we define \(\mathbf {V}^n\) as the set of 3D vertices in the isosurface mesh extracted [59] from the network output. SMPL [17] is a statistical model where the vertex locations, \(\mathbf {V}^s(\theta ,\beta )\), are a function of the pose (\(\theta \)) and shape (\(\beta \)) parameters. Given \(\mathbf {V}^n\), our goal is to find \(\{\theta ^\star , \beta ^\star \}\) such that the weighted Chamfer distance, i.e., the distance between the closest point correspondences of \(\mathbf {V}^n\) and \(\mathbf {V}^s(\theta ,\beta )\), is minimized:

$$\begin{aligned} { \{ \theta ^\star , \beta ^\star \} } = \underset{ \{ \theta , \beta \} }{{{\mathrm{\arg \!\min }}}}&\sum _{\mathbf {p}^n \in \mathbf {V}^n} \min _{\mathbf {p}^s \in \mathbf {V}^s(\theta , \beta )} w^n \Vert \mathbf {p}^n-\mathbf {p}^s\Vert _2^2 + \nonumber \\&\sum _{\mathbf {p}^s \in \mathbf {V}^s(\theta , \beta )} \min _{\mathbf {p}^n \in \mathbf {V}^n} w^n\Vert \mathbf {p}^n - \mathbf {p}^s\Vert _2^2 + \lambda \sum _{i=1}^{J} \Vert \mathbf {j}^n_i - \mathbf {j}^s_i(\theta , \beta ) \Vert _2^2 . \end{aligned}$$
(6)

We find it effective to weight the closest point distances by the confidence of the corresponding point in the isosurface, which depends on the voxel predictions of our network. We denote the weight associated with the point \(\mathbf {p}^n\) as \(w^n\). We define an additional term to measure the distance between the predicted 3D joint locations, \(\{\mathbf {j}^n_i\}_{i=1}^{J}\), where J denotes the number of joints, and the corresponding joint locations in the SMPL model, denoted by \(\{\mathbf {j}^s_i(\theta , \beta )\}_{i=1}^{J}\). We weight the contribution of the joint error by a constant \(\lambda \) (empirically set to 5 in our experiments) since J is very small (e.g., 16) compared to the number of vertices (e.g., 6890). In Sect. 4, we show the benefits of fitting to voxel predictions compared to our baseline of fitting to 2D and 3D joints and to 2D segmentation, i.e., to the inputs of the shape network.
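For illustration, the objective of Eq. (6) could be evaluated as follows for fixed vertex sets; assigning the weight \(w^n\) of the matched isosurface point to the second term is our reading of the equation:

```python
import numpy as np
from scipy.spatial import cKDTree

def fit_objective(V_n, w_n, V_s, J_n, J_s, lam=5.0):
    """Weighted two-sided Chamfer distance plus 3D joint term (Eq. 6).

    V_n : (N, 3) isosurface vertices,   w_n : (N,) per-vertex confidences
    V_s : (M, 3) SMPL vertices for the current (theta, beta)
    J_n : (J, 3) predicted 3D joints,   J_s : (J, 3) SMPL joints
    """
    tree_s = cKDTree(V_s)
    tree_n = cKDTree(V_n)
    d_ns, _ = tree_s.query(V_n)     # closest SMPL vertex for each isosurface point
    d_sn, idx = tree_n.query(V_s)   # closest isosurface point for each SMPL vertex
    term1 = np.sum(w_n * d_ns ** 2)
    term2 = np.sum(w_n[idx] * d_sn ** 2)
    term3 = lam * np.sum(np.sum((J_n - J_s) ** 2, axis=1))
    return term1 + term2 + term3
```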

We optimize Eq. (6) in an iterative manner, updating the correspondences at each iteration. We use Powell's dogleg method [60] and Chumpy [61], similar to [13]. When reconstructing the isosurface, we first threshold the voxel predictions (at 0.5 in our experiments) and then apply the marching cubes algorithm [59]. We initialize the SMPL pose parameters to be aligned with our 3D pose predictions and set \(\beta = \mathbf {0}\) (where \(\mathbf {0}\) denotes a vector of zeros).
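A hedged sketch of the isosurface extraction step, using scikit-image's marching cubes (sampling per-vertex confidences from the nearest voxel is our interpretation of how \(w^n\) could be obtained):

```python
import numpy as np
from skimage import measure

def extract_isosurface(voxel_probs, level=0.5):
    """Extract a triangle mesh from the predicted occupancy probabilities.

    voxel_probs : (128, 128, 128) sigmoid outputs of the shape network
    Returns vertices (N, 3), faces (F, 3), and per-vertex confidences
    sampled from the nearest voxel.
    """
    verts, faces, _, _ = measure.marching_cubes(voxel_probs, level=level)
    idx = np.clip(np.round(verts).astype(int), 0,
                  np.array(voxel_probs.shape) - 1)
    confidences = voxel_probs[idx[:, 0], idx[:, 1], idx[:, 2]]
    return verts, faces, confidences
```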

4 Experiments

This section presents the evaluation of BodyNet. We first describe evaluation datasets (Sect. 4.1) and other methods used for comparison in this paper (Sect. 4.2). We then evaluate contributions of additional inputs (Sect. 4.3) and losses (Sect. 4.4). Next, we report performance on the UP dataset (Sect. 4.5). Finally, we demonstrate results for 3D body part segmentation (Sect. 4.6).

4.1 Datasets and Evaluation Measures

SURREAL Dataset [33] is a large-scale synthetic dataset for 3D human body shapes with ground truth labels for segmentation, 2D/3D pose, and SMPL body parameters. Given its scale and rich ground truth, we use SURREAL in this work for training and testing. Previous work demonstrating the successful use of synthetic images of people for training visual models includes [62,63,64]. Given the SMPL shape and pose parameters, we compute the ground truth 3D mesh. We use the standard train split [33]. For testing, we use the middle frame of the middle clip of each test sequence, which makes a total of 507 images. We observed that testing on the full test set of 12,528 images yields similar results. To evaluate the quality of our shape predictions for difficult cases, we define two subsets with extreme body shapes, similar to what is done, for example, in optical flow [65]. We compute the surface distance between the average shape (\(\beta =\mathbf {0}\)) given the ground truth pose and the true shape. We take the \(10^{th}\) (s10) and \(20^{th}\) (s20) percentiles of this distance distribution, which represent the meshes with extreme body shapes.

Unite the People Dataset (UP) [34] is a recent collection of multiple datasets (e.g., MPII [56], LSP [66]) providing additional annotations for each image. The annotations include 2D pose with 91 keypoints, 31 body part segments, and 3D SMPL models. The ground truth is acquired in a semi-automatic way and is therefore imprecise. We evaluate our 3D body shape estimations on this dataset. We report errors on two different subsets of the test set where 2D segmentations as well as pseudo 3D ground truth are available. We use notation T1 for images from the LSP subset [34], and T2 for images used by [14].

3D Shape Evaluation. We evaluate body shape estimation with different measures. Given the ground truth and our predicted volumetric representation, we measure the intersection over union directly on the voxel grid, i.e., voxel IOU. We further assess the quality of the projected silhouette to enable comparison with [14, 16, 34]. We report the intersection over union (silhouette IOU), F1-score computed for foreground pixels, and global accuracy (ratio of correctly predicted foreground and background pixels). We evaluate the quality of the fitted SMPL model by measuring the average error in millimeters between the corresponding vertices in the fit and ground truth mesh (surface error). We also report the average error between the corresponding 91 landmarks defined for the UP dataset [34]. We assume the depth of the root joint and the focal length to be known to transform the volumetric representation into a metric space.
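The 3D and 2D shape metrics described above can be computed straightforwardly; the sketch below is illustrative and uses a 0.5 occupancy threshold by analogy with Sect. 3.4:

```python
import numpy as np

def voxel_iou(pred, gt, thresh=0.5):
    """Intersection over union on the occupancy grid."""
    p = pred > thresh
    g = gt > 0.5
    return np.logical_and(p, g).sum() / max(np.logical_or(p, g).sum(), 1)

def silhouette_metrics(pred_sil, gt_sil):
    """Silhouette IOU, foreground F1, and global pixel accuracy."""
    p, g = pred_sil.astype(bool), gt_sil.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    iou = inter / max(union, 1)
    prec = inter / max(p.sum(), 1)
    rec = inter / max(g.sum(), 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    acc = (p == g).mean()
    return iou, f1, acc
```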

4.2 Alternative Methods

We demonstrate advantages of BodyNet by comparing it to alternative methods. BodyNet makes use of 2D/3D pose estimation and 2D segmentation. We define alternative methods in terms of the same components combined differently.

SMPLify++. Lassner et al. [34] extended SMPLify [13] with an additional term on 2D silhouette. Here, we extend it further to enable a fair comparison with BodyNet. We use the code from [13] and implement a fitting objective with additional terms on 2D silhouette and 3D pose besides 2D pose (see Appendix D). As shown in Table 2, results of SMPLify++ remain inferior to BodyNet despite both of them using 2D/3D pose and segmentation inputs (see Fig. 3).

Shape Parameter Regression. To validate our volumetric representation, we also implement a regression method by replacing the 3D shape estimation network in Fig. 2 with another subnetwork that directly regresses the 10-dim. shape parameter vector \(\beta \) using an L2 loss. The network architecture corresponds to the encoder part of the hourglass followed by 3 additional fully connected layers (see Appendix B for details). We recover the pose parameters \(\theta \) from our 3D pose prediction (initial attempts to regress \(\theta \) together with \(\beta \) gave worse results). Table 2 demonstrates the inferior performance of the \(\beta \)-regression network, which often produces average body shapes (see Fig. 3). In contrast, BodyNet results in better SMPL fitting due to its accurate volumetric representation.

Fig. 3. SMPL fit on BodyNet predictions compared with other methods. While shape parameter regression and fitting only to the BodyNet inputs (SMPLify++) produce shapes close to the average, BodyNet learns how the true shape observed in the image deviates from the average deformable shape model. Examples taken from the test subset s10 of the SURREAL dataset with extreme shapes.

Table 1. Performance on the SURREAL dataset using alternative combinations of intermediate representations at the input.
Fig. 4. Our predicted 2D pose, segmentation, 3D pose, 3D volumetric shape, and SMPL model alignments. Our 3D shape predictions are consistent with pose and segmentation, suggesting that the shape network relies on the intermediate representations. When one of the auxiliary tasks fails (2D pose on the right), 3D shape can still be recovered with the help of the other cues.

4.3 Effect of Additional Inputs

We first motivate our proposed architecture by evaluating the performance of 3D shape estimation on the SURREAL dataset using alternative inputs (see Table 1). When using only one input, the 3D pose network, which is already trained with additional 2D pose and segmentation inputs, performs best. We observe improvements as more cues, specifically 3D cues, are added. We also note that intermediate representations in terms of 3D pose and 2D segmentation outperform RGB. Adding RGB to the intermediate representations further improves shape results on SURREAL. Figure 4 illustrates intermediate predictions as well as the final 3D shape output. Based on the results in Table 1, we choose to use all intermediate representations as parts of our full network, which we call BodyNet.

Table 2. Volumetric prediction on SURREAL with different versions of our model compared to alternative methods. Note that lines 2–10 use the same modalities (i.e., 2D/3D pose, 2D segmentation). The evaluation is made on the SMPL model fit to our voxel outputs. The average SMPL surface error decreases with the addition of the proposed components.

4.4 Effect of Re-projection Error and End-to-End Multi-task Training

We evaluate the contributions provided by the additional supervision introduced in Sects. 3.2 and 3.3.

Effect of Re-projection Losses. Table 2 (lines 4–10) provides results when the shape network is trained with and without re-projection losses (see also Fig. 5). The voxel network without any additional loss already outperforms the baselines described in Sect. 4.2. When trained with re-projection losses, we observe increased performance both with single-view constraints, i.e., the front view (FV), and with multi-view constraints, i.e., front and side views (FV+SV). The multi-view re-projection loss puts more importance on the body surface, resulting in a better SMPL fit.

Effect of Intermediate Losses. Table 2 (lines 7–10) presents experimental evaluation of the proposed intermediate supervision. Here, we first compare the end-to-end network fine-tuned jointly with auxiliary tasks (lines 9–10) to the networks trained independently from the fixed representations (lines 4–6). Comparison of results on lines 6 and 10 suggests that multi-task training regularizes all subnetworks and provides better performance for 3D shape. We refer to Appendix C.1 for the performance improvements on auxiliary tasks. To assess the contribution of intermediate losses on 2D pose, segmentation, and 3D pose, we implement an additional baseline where we again fine-tune end-to-end, but remove the losses on the intermediate tasks (lines 7–8). Here, we keep only the voxels and the re-projection losses. These networks not only forget the intermediate tasks, but are also outperformed by our base networks without end-to-end refinement (compare lines 8 and 6). On all the test subsets (i.e., full, s20, and s10) we observe a consistent improvement of the proposed components against baselines. Figure 3 presents qualitative results and illustrates how BodyNet successfully learns the 3D shape in extreme cases.

Comparison to the State of the Art. Table 2 (lines 1,10) demonstrates a significant improvement of BodyNet compared to the recent method of Tung et al. [15]. Note that [15] relies on ground truth 2D pose and segmentation on the test set, while our approach is fully automatic. Other works do not report results on the recent SURREAL dataset.

Table 3. Body shape performance and comparison to the state of the art on the UP dataset. Unlike in SURREAL, the 3D ground truth in this dataset is imprecise.

4.5 Comparison to the State of the Art on Unite the People

For the networks trained on the UP dataset, we initialize with the weights pre-trained on SURREAL and fine-tune on the complete training set of UP-3D, where the 2D segmentations are obtained from the provided 3D SMPL fits [34]. We show results of BodyNet trained end-to-end with the multi-view re-projection loss. We provide quantitative evaluation of our method in Table 3 and compare to recent approaches [14, 16, 34]. We note that some works only report 2D metrics measuring how well the 3D shape aligns with the manually annotated segmentation. The ground truth is a noisy estimate obtained in a semi-automatic way [34]; its projection is mostly accurate, but its depth is not. While our results are on par with previous approaches on 2D metrics, we note that the provided manual segmentations and the 3D SMPL fits [34] are noisy and affect both the training and the evaluation [48]. Therefore, we also provide a large set of visual results in Appendices A and E to illustrate our competitive 3D estimation quality. On 3D metrics, our method significantly outperforms both the direct and indirect learning of [14]. We also provide qualitative results in Fig. 4, where we show both the intermediate outputs and the final 3D shape predicted by our method. We observe that the voxel predictions are aligned with the 3D pose predictions and provide a robust SMPL fit. We refer to Appendix E for an analysis of the type of segmentation used as re-projection supervision.

4.6 3D Body Part Segmentation

As described in Sect. 3.1, we extend our method to produce not only the foreground voxels for a human body, but also the 3D part labeling. We report quantitative results on SURREAL in Table 4, where accurate ground truth is available. When the parts are combined, the foreground IOU becomes 58.9, which is comparable to the 58.1 reported in Table 1. We provide qualitative results in Fig. 6 on the UP dataset, where the parts network is only trained on SURREAL. To the best of our knowledge, we present the first end-to-end method for 3D body part labeling from a single image. We infer volumetric body parts directly with a network, without iterative fitting of a deformable model, and obtain successful results. In terms of runtime, BodyNet produces foreground and per-limb voxels in 0.28 s and 0.58 s per image, respectively, on modern GPUs.

Fig. 5. Voxel predictions color-coded based on the confidence values. Notice that our combined 3D and re-projection loss enables our network to make more confident predictions across the whole body. Example taken from SURREAL.

Fig. 6. BodyNet is able to directly regress volumetric body parts from a single image on examples from UP.

Table 4. 3D body part segmentation performance measured per part on SURREAL. The articulated and small limbs appear more difficult than the torso.

5 Conclusion

We have presented BodyNet, a fully automatic, end-to-end, multi-task network architecture that predicts 3D human body shape from a single image. We have shown that joint training with intermediate tasks significantly improves the results. We have also demonstrated that volumetric regression together with a multi-view re-projection loss is effective for representing human bodies. Moreover, this flexible representation allows us to extend our approach and demonstrate impressive results on 3D body part segmentation from a single image. We believe that BodyNet can provide a trainable building block for future methods that make use of 3D body information, such as virtual cloth change. Furthermore, we believe that exploring the limits of using only intermediate representations is an interesting research direction for 3D tasks where acquiring training data is impractical. Another future direction is to study 3D body shape under clothing. The volumetric representation can potentially capture such additional geometry if training data is provided.