DOVE: Learning Deformable 3D Objects by Watching Videos

Learning deformable 3D objects from 2D images is often an ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability to objects "in the wild". A more natural way of establishing correspondences is by watching videos of objects moving around. In this paper, we present DOVE, a method that learns textured 3D models of deformable object categories from monocular videos available online, without keypoint, viewpoint or template shape supervision. By resolving symmetry-induced pose ambiguities and leveraging temporal correspondences in videos, the model automatically learns to factor out 3D shape, articulated pose and texture from each individual RGB frame, and is ready for single-image inference at test time. In the experiments, we show that existing methods fail to learn sensible 3D shapes without additional keypoint or template supervision, whereas our method produces temporally consistent 3D models, which can be animated and rendered from arbitrary viewpoints. Project page: https://dove3d.github.io/.


Introduction
In applications, we often need to obtain accurate 3D models of deformable objects from just one or a few pictures of them. This is the case in traditional applications such as robotics, but also, increasingly, in consumer applications such as content creation for virtual and augmented reality, using everyday pictures and videos taken with a cellphone.
3D reconstruction from a single image, or even a small number of views, is generally very ambiguous and only solvable by leveraging powerful statistical priors of the 3D world. Learning such priors is, however, very challenging. One approach is to use training data specifically collected for this purpose, for example by using 3D scanners and domes [1][2][3][4][5][6][7][8][9][10] or shape models [11][12][13][14][15]. This is expensive and can be justified only for a few categories, such as human bodies and faces, that are of particular significance in applications. However, scanning is not a viable approach to cover the huge diversity of objects that exist in the real world.
We thus need to develop methods that can learn 3D deformable objects from as cheap supervision as possible, such as leveraging casually-collected images and videos found on the Internet, or crowdsourced datasets such as CO3D [16]. Ideally, our system should take as input a collection of such casual images and videos and learn a model capable of reconstructing the 3D shape, appearance and deformation of a new object from a single image of it. While several authors have looked at this problem before [1,2,[4][5][6]10], so far it has always been necessary to make additional simplifying assumptions compared to the ideal unsupervised setting described above. These assumptions usually come in the form of additional geometric supervision. The most common one is to require 2D masks for the objects, either obtained manually or via a pre-trained segmentation network such as [17,18]. On top of this, there is usually at least one more additional form of geometric supervision, such as providing an initial approximate 3D template of the object, 2D keypoint detections, or approximate 3D camera parameters [15,[19][20][21][22][23][24][25]. There is a small number of works that require no masks or geometric supervision [26], but they come with other limitations, such as relying on a limited viewpoint range.
Our aim in this paper is to learn 3D deformable objects from complex casual videos while only using 2D masks, removing the additional geometric supervision used in prior works that requires expensive manual annotations (keypoints, viewpoint, and templates). In order to compensate for this lack of geometric information, we propose to learn from casual videos rather than still images, unlike most prior works. While this adds some complexity to the method, using videos has the key advantage of allowing one to establish correspondences between different images, for instance by using an off-the-shelf optical flow algorithm. While this information is weaker than externally-provided information such as keypoints, it is nevertheless very helpful in recovering the objects' viewpoint.
Note, though, that videos are only used for supervision: our goal is still to learn a model that can reconstruct a new object instance from a single image.
In order to use videos effectively, we make a number of technical contributions. The first one addresses the challenge of estimating the viewpoint of the 3D objects. Prior works addressed this issue by sampling a large number of possible views [23,27], an approach that [23] calls a camera multiplex. We find that this is unnecessary. While viewpoint estimation is ambiguous, we show that the ambiguity is mostly restricted to a small space of symmetries induced by the 2D projection of the 3D objects onto the image. The result is that, as the model is learned, only a very small number of alternative viewpoints needs to be explored in order to escape from the ambiguity-induced local optima: from, e.g., 40 in [23] to just two per iteration, which largely reduces the memory and time requirements for training.
Our second contribution is the design of the object model. We propose a hierarchical shape representation that explicitly disentangles category-dependent prior shape, instance-dependent deformation, as well as time-dependent articulated and rigid pose. In this way, we automatically factor shape and pose variations at different levels in the video dataset, and leverage instance-specific correspondences within a video and instance-agnostic correspondences across multiple videos. We also enforce a bilateral symmetry on the predicted canonical shape and texture, similar to previous methods [19,23,26,28]. However, differently from these approaches, which assume symmetry at the level of the object instances, here we assume that the canonical (pose-free) shapes are symmetric, while individual articulations can be asymmetric [29,30], which is much more realistic.
We also address the challenge of evaluating these reconstruction methods. Prior works in this area generally lack data with 3D ground truth. Instead, they resort to indirect evaluation by measuring the quality of the 2D correspondences that are established by the 3D models. To address this problem, we create a dataset of views of real-life animal models (toy birds). The data is designed to resemble a subset of the images found in existing datasets such as CUB [31]; however, it additionally comes with 3D scans of the objects, which can be used to test the quality of the 3D reconstructions directly. We use this data to evaluate our and several state-of-the-art algorithms without the need for proxy metrics such as the keypoint re-projection error, which are insufficient to assess the quality of a 3D reconstruction. We hope that this data will be useful for future work in this area.
Overall, our method can successfully learn good 3D shape predictors from videos of animals such as birds and horses. Compared to prior work, our method produces better 3D shape reconstructions, as measured on the new benchmark, when no additional geometric supervision is used.

Related Work
We divide the vast literature of related work into two parts. The first focuses on learning-based approaches for 3D reconstruction with limited supervision. The second highlights related work on 3D reconstruction from images and video.

Unsupervised and Weakly-supervised 3D Reconstruction
A primary motivation for introducing machine learning in 3D reconstruction is to enable reconstruction from single views, which necessitates learning suitable shape priors. In particular, we focus the discussion on unsupervised and weakly-supervised methods that do not require explicit 3D ground truth for training. Early unsupervised works include monocular depth predictors trained from egocentric videos of rigid scenes [38,39].
Others have explored weakly-supervised methods for learning full 3D meshes of object categories [4, 6, 19-21, 23, 24, 28, 40-43]. Many of these methods learn from still images and generally require masks and other additional supervision or prior assumptions, summarized in Table 1. In particular, CMR [19] uses 2D keypoint annotations (in addition to masks) to initialize shape and viewpoint using Structure-from-Motion (SfM). This is extended in follow-up works in various ways. U-CMR [23], TTP [24] and IMR [37] replace the keypoint annotations with a category-specific template shape obtained beforehand. With the template shape, an extensive viewpoint sampling (camera multiplex) can be done to search for the best camera viewpoint for each training image [23]. UMR [28] instead uses part segmentations from SCOPS [33], which also requires supervised ImageNet pretraining. VMR [20] extends CMR with asymmetric deformation, and introduces a test-time adaptation procedure on individual videos by enforcing temporal consistency on the predictions produced by a pre-trained CMR model. Note that we use videos to learn a 3D shape model from scratch, whereas VMR starts with a pre-trained model and only performs online adaptation on videos. CSM [22] and articulated CSM [27] learn to pose an externally-provided (articulated) 3D template of an object category to images. Unsup3D [26] learns symmetric objects, like faces, without masks, but only with limited viewpoint variation.
Adversarial learning has also been explored to remove the need for multiple views during training [44][45][46][47][48][49][50][51][52][53][54]. The idea is to use a discriminator network to tell whether arbitrarily generated views of the learned 3D model are plausible, which provides a signal to learn the geometry. Although this approach does not require viewpoint annotations for individual images, it does rely on a reasonable approximation of the viewpoint distribution in the training data, from which random views are generated. Overall, promising results can be achieved on synthetic data as well as a few real object categories, but general methods usually recover only coarse 3D shapes or 3D feature volumes that are difficult to extract.

Reconstruction from Multiple Views and Videos
Most works using multiple views and videos focus on reconstructing individual instances of an object. Classic SfM methods [55,56] use multiple views of a rigid scene, with pipelines such as KinectFusion [57] and DynamicFusion [58] integrating depth sensors for reconstructing dense static and deformable surfaces. Neural implicit surface representations have recently emerged for multi-view reconstruction [59][60][61]. NeRF [62] and its deformable extensions [63][64][65][66][67][68] synthesize novel views from densely sampled multi-views of a static or mildly dynamic scene using a Neural Radiance Field, from which explicit coarse 3D geometry can be further extracted. A more recent line of work, such as LASR [34] and ViSER [35], optimizes a single 3D deformable model on an individual video sequence, using mask and optical flow supervision. BANMo [36] further extends the pipeline and optimizes over a few video sequences of the same object instance, with the help of a pretrained DensePose [32] model. However, these optimization-based models are typically sensitive to the quality of the sequences and tend to fail when only limited views are observed (see Fig. 6). In contrast, by learning priors over a video dataset, our model can perform inference on a single image.
Other works that learn 3D categories from videos typically require some shape prior, such as a parametric shape model [11,69], and hence mostly focus on the reconstruction of human bodies or faces [15,21,[70][71][72][73][74][75][76]. [77] and [78] consider turntable-like videos to learn to reconstruct rigid object categories. In contrast, our method learns a 3D shape model of a deformable object category from scratch using videos.

Method
Our goal is to learn a function (V, ξ, T) = f(I) that, given a single image I ∈ R^(3×H×W) of an object, predicts its 3D shape V (a mesh), its pose ξ (either a rigid transformation or full articulation parameters) and its texture T (an image). We describe below the key ideas in our method and refer the reader to the sup. mat. for details.
While the predictor f is monocular, we supervise it by making use of video sequences I = {I_t}, t = 1, ..., |I|, where t denotes time. For this, we use a photo-geometric auto-encoding approach. Let M ∈ {0,1}^(H×W) be the 2D mask of the object in image I, which we assume to be given. The model (V, ξ, T) = f(I) encodes the image as a set of photo-geometric parameters; from these, a handcrafted rendering function (Î, M̂) = R(V, ξ, T) reconstructs the image Î and the mask M̂. For supervision, the rendered image and mask are compared to the given ones via two losses:

L_im = λ_im ‖M̂ ⊙ (Î − I)‖_1,    L_mask = λ_mask ‖M̂ − M‖²_2,

where λ_im and λ_mask weigh each loss. Note that the image loss is restricted to the predicted region, as the model only represents the object but not the background. Figure 2 gives an overview of the training pipeline.

Fig. 2: Training Pipeline. From a single frame of a video, we predict the 3D pose, shape and texture of the object using an encoder-decoder. The shape is further disentangled into category shape, instance shape and deformation using linear blend skinning. Using a differentiable rendering step, we can train the model end-to-end by reconstructing the image and by enforcing temporal consistencies.
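The two reconstruction losses above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the L1/L2 norm choices, the channel-last image layout (H, W, 3), and the default weights are assumptions.

```python
import numpy as np

def reconstruction_losses(I, M, I_hat, M_hat, lam_im=1.0, lam_mask=1.0):
    """Sketch of the photo-geometric auto-encoding supervision:
    a masked image loss plus a silhouette (mask) loss.
    Images are (H, W, 3) and masks (H, W) here for simplicity."""
    # image loss restricted to the rendered object region, since the
    # model represents the object but not the background
    im_loss = lam_im * np.abs(M_hat[..., None] * (I_hat - I)).mean()
    # mask loss compares rendered and given silhouettes
    mask_loss = lam_mask * ((M_hat - M) ** 2).mean()
    return im_loss + mask_loss
```

A perfect reconstruction yields zero loss, while any disagreement inside the object region or between the silhouettes increases it.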

Solving the Viewpoint Ambiguity
Decomposing a single image into shape, pose and appearance is ambiguous, which is a major challenge that can easily result in poor reconstructions. Some prior works have addressed this issue by sampling a large number of viewpoints during training, thus giving the optimizer a chance to avoid local optima. However, this is a slow process that requires testing a large number of hypotheses at every iteration (e.g. 40 in [23]) and requires a precise template shape to understand the differences between small viewpoint changes.
Here we note that this is likely unnecessary. The key observation is that the ambiguities arising from image-based reconstruction are not arbitrary; instead, they tend to concentrate around specific symmetries induced by the projection of a 3D object onto an image.

[Inline figure: a silhouette M that is equally consistent with the pose ξ and its mirrored pose qξ]
The image to the right illustrates this idea. Here, given only the mask M, one is unable to choose between the object pose ξ or its mirrored variant qξ, where q is a suitable 'mirror mapping' that rotates the pose back to front (see sup. mat. for details). We argue that, before developing a more nuanced understanding of appearance, the model f is similarly undecided about the pose of the 3D object; however, the number of choices is very limited: either the current prediction ξ = f_ξ(I) is correct, or its mirrored version qξ is.
Concretely, during training we evaluate the loss L(V, ξ, T) for the model prediction and the loss L(V, qξ, T) for the mirrored pose. We find the better of the two poses,

ξ* = arg min_{ξ′ ∈ {ξ, qξ}} L(V, ξ′, T),

and optimize the loss L(V, ξ*, T). In this way, the model is encouraged to flip its prediction whenever L(V, qξ, T) < L(V, ξ, T). This ensures that the model eventually learns the correct pose and does not rely on the flipping towards the end of training.
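The two-hypothesis pose selection can be sketched in a few lines. Here `loss_fn` and `q` are assumed hooks standing in for the rendering loss and the mirror mapping; the point is only that exactly two candidates are compared per iteration, instead of a large camera multiplex.

```python
import numpy as np

def resolve_pose_flip(loss_fn, V, xi, T, q):
    """Evaluate the loss for the predicted pose xi and its mirrored
    version q(xi), and keep whichever reconstructs the input better."""
    candidates = [xi, q(xi)]
    losses = [loss_fn(V, c, T) for c in candidates]
    k = int(np.argmin(losses))
    return candidates[k], losses[k]
```

During training, optimizing the loss of the winning candidate pushes the predictor towards the correct hypothesis, so the explicit flip becomes unnecessary late in training.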

Learning from Videos
We exploit the information in videos by noting that the shape V and texture T of an object are invariant over time, with any time-dependent change limited to the pose ξ. Hence, given a sequence of images I = {I_t}, t = 1, ..., |I|, of the same object and corresponding frame-based predictions (V_t, ξ_t, T_t) = f(I_t), we feed the rendering function (Î_t, M̂_t) = R(V̄, ξ_t, T̄) with the shape and texture averages

V̄ = (1/|I|) Σ_t V_t,    T̄ = (1/|I|) Σ_t T_t.

The idea is that, unless shape and texture agree across predictions, their averages will be blurry and result in poor renderings. Hence, minimizing the rendering loss indirectly encourages these quantities to be consistent over time.
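The averaging mechanism can be sketched as follows, with `render` an assumed rendering hook: each frame is rendered with the sequence-averaged shape and texture but its own pose, so inconsistent per-frame predictions produce blurry averages and a high rendering loss.

```python
import numpy as np

def render_with_averages(frames_V, frames_T, frames_xi, render):
    """Render every frame of a sequence using the time-averaged shape
    and texture together with the frame-specific pose."""
    V_bar = np.mean(frames_V, axis=0)  # sequence-averaged shape
    T_bar = np.mean(frames_T, axis=0)  # sequence-averaged texture
    return [render(V_bar, xi, T_bar) for xi in frames_xi]
```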
Furthermore, while the pose ξ_t does vary over time, pose changes must be compatible with image-level correspondences. Specifically, let F_t ∈ R^(H×W×2) be the optical flow measured between frames I_t and I_{t+1} by an off-the-shelf method such as RAFT [79]. We can render the flow F̂_t = R(V, ξ_t, ξ_{t+1}) by computing the displacement of the object vertices V under the pose change from ξ_t to ξ_{t+1}. We can then add the flow reconstruction loss

L_flow = λ_flow ‖F̂_t − F_t‖_1

to encourage consistent motion of the object. Its influence is controlled by the weight λ_flow.
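A simplified version of the flow loss can be sketched per vertex. This is an assumed simplification of rendering a dense flow map: `P_t` and `P_t1` are the 2D projections of the mesh vertices under poses ξ_t and ξ_{t+1}, and `F_t` is the measured flow already sampled at those vertex locations.

```python
import numpy as np

def flow_reconstruction_loss(P_t, P_t1, F_t, lam_flow=1.0):
    """Compare the rendered flow (displacement of projected vertices
    between consecutive poses) with the measured optical flow."""
    F_hat = P_t1 - P_t                      # rendered flow per vertex
    return lam_flow * np.abs(F_hat - F_t).mean()
```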

Hierarchical Shape Model
Next, we flesh out the shape model. The shape V ∈ R^(3×K) is given by K mesh vertices and represents the shape of a specific object instance in a canonical pose. It is obtained by the predictor f_V(I) = V_base + ∆V_tmpl + ∆V(I), where: V_base is an initial fixed shape (a sphere), ∆V_tmpl is a learnable matrix (initialized as zeros) such that V_tmpl = V_base + ∆V_tmpl gives an average shape for the category (template), and ∆V(I) is a neural network further deforming this template into the specific shape of the object seen in image I. We further restrict V, which is the rest pose, to be bilaterally symmetric by only predicting half of the vertices and obtaining the remaining half via mirroring along the x axis. Note that, while in many prior works the category-level template V_tmpl is given to the algorithm, here it is learned automatically from a sphere.
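The hierarchical composition and the mirroring trick can be sketched as below (a minimal numpy illustration using a (K, 3) vertex layout rather than the paper's (3, K); the function names are assumptions).

```python
import numpy as np

def build_canonical_shape(V_base_half, dV_tmpl_half, dV_ins_half):
    """Compose base shape + category template offset + instance
    deformation for half the vertices, then mirror along x to obtain
    a bilaterally symmetric canonical shape."""
    half = V_base_half + dV_tmpl_half + dV_ins_half  # (K/2, 3)
    mirror = half * np.array([-1.0, 1.0, 1.0])       # reflect x axis
    return np.concatenate([half, mirror], axis=0)    # symmetric (K, 3)
```

Predicting only half of the vertices halves the output dimensionality and enforces the symmetry prior by construction rather than by a loss.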
Finally, the shape V is transformed into the actual mesh observed in the image by a posing function g(V, ξ). We work with two kinds of such functions. The first one is a simple rigid motion g(V, ξ) = g_ξ V, with g_ξ ∈ SE(3). This is used in an initial warm-up phase to allow the model to learn a first version of the template V automatically.
In a second learning phase, we further enrich the model to capture complex articulations of the shape. There are a number of possible parameterizations that could be used for this purpose. For instance, [21] automatically initializes a set of keypoints via spectral analysis of the mesh. Here, we instead initialize a traditional skinning model, given by a system of bones b ∈ {1, ..., B}, ensuring inelastic deformations.
The skinning model is specified by: the bone topology (a tree), the joint location J_b ∈ R^3 of each bone with respect to the parent bone, the relative rotation ξ_b ∈ SO(3) of that bone with respect to the parent, and a row-stochastic matrix of weights w ∈ [0,1]^(K×B) specifying the strength of association of each mesh vertex to each bone. Of these, only the topology is chosen manually (e.g., to account for a different number of legs for objects in the category). The joint locations J_b and the skinning weights w are set automatically based on a simple heuristic (described in the sup. mat.).
While the topology, J_b and w are fixed, the joint rotations ξ_b ∈ SO(3), b = 2, ..., B, and the rigid pose ξ_1 ∈ SE(3) are output by the predictor f to express the deformation of the object as it changes from image to image.
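A generic linear blend skinning step, of the kind the model uses to pose the mesh, can be sketched as follows. This takes per-bone world transforms as 4×4 matrices; composing them along the bone tree from the relative rotations ξ_b is omitted for brevity.

```python
import numpy as np

def linear_blend_skinning(V, bone_transforms, weights):
    """Move each vertex by a weighted blend of per-bone rigid
    transforms. V is (K, 3), bone_transforms is (B, 4, 4), and
    weights is the row-stochastic (K, B) association matrix."""
    K = V.shape[0]
    V_h = np.concatenate([V, np.ones((K, 1))], axis=1)        # homogeneous
    # (B, K, 3): each bone's transform applied to every vertex
    per_bone = np.einsum('bij,kj->bki', bone_transforms, V_h)[..., :3]
    # blend the per-bone positions with the skinning weights -> (K, 3)
    return np.einsum('kb,bki->ki', weights, per_bone)
```

Because the weights are row-stochastic, a vertex fully assigned to one bone moves rigidly with it, while vertices near joints interpolate smoothly between bones.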

Appearance Model and Rendering
We model the appearance of the object using a texture map T ∈ R^(3×H_T×W_T). The vertices of the base mesh V_base are assigned fixed texture uv-coordinates, and the texture inherits the symmetry of the base mesh. Given the posed mesh g(V, ξ) and the texture T, we render an image (Î, M̂) = R(V, ξ, T) of the object using standard perspective-correct texture mapping with barycentric coordinates, implemented using the PyTorch3D differentiable mesh renderer [80].

Symmetry and Geometric Regularizers
An important property of object categories is that they are often symmetric. This does not mean that individual object instances are symmetric, but that the space of objects is [29]. In other words, if image I contains a valid object, so does the mirrored image mI. Furthermore, given the photo-geometric parameters (V, ξ, T) = f(I) for I, the parameters for mI must be given by f(mI) = (mV, mξ, mT), where mV = V (because the rest shape is assumed symmetric), mT is the flipped texture image and mξ is a mirrored version of the pose. Hence, we additionally enforce the pose predictor to satisfy this structure by minimizing the loss

L_pose = ‖f_ξ(mI) − mξ‖²_2, where ξ = f_ξ(I).

Note the relationship between the mirroring operators qξ in Section 3.1 and mξ here: they are the same, up to a further rigid body rotation. The effect is that q appears to rotate the object back to front, and m left to right. This is developed formally in the sup. mat.
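The pose symmetry constraint can be sketched as a simple consistency check between the predictor applied to a flipped image and the mirrored prediction on the original. Here `f_xi`, `mirror_image` and `mirror_pose` are assumed hooks for the pose network and the two mirroring operators.

```python
import numpy as np

def pose_symmetry_loss(f_xi, I, mirror_image, mirror_pose):
    """The pose predicted on the horizontally flipped image should
    match the mirrored version of the pose predicted on the original."""
    xi = f_xi(I)
    xi_flip = f_xi(mirror_image(I))
    return float(((xi_flip - mirror_pose(xi)) ** 2).sum())
```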
We further regularise the learning of the mesh V via a loss L_smooth(V, V_tmpl), which includes: the ARAP loss [81] between V and the template V_tmpl, ensuring that they do not diverge too much, and Laplacian and mesh normal smoothness terms for V.

Learning Formulation
Given a video I = {I_t}, t = 1, ..., |I|, the overall learning loss is thus the sum of the terms introduced above,

L = Σ_t (L_im + L_mask + L_flow) + L_pose + L_smooth,

with each term weighted by its respective λ as described in the previous sections. In practice, we found it important to warm up the model, activating increasingly more refined model components as training progresses. This can be seen as a sort of coarse-to-fine or paced learning strategy.
Learning thus uses the following schedule in three phases: (1) shape learning: the basic model with no instance-specific deformation (i.e., V = V_tmpl), no bone articulation and only the mask loss is optimized in order to obtain an initial template V_tmpl; (2) ambiguity resolution: the pose rectification loss L_pose is activated to resolve the front-to-back ambiguity of Section 3.1; (3) full model: the bones are instantiated, and the instance deformation, skinning model and appearance loss are also activated in order to learn the full model.

Fig. 3: Examples of the 3D Toy Bird Dataset (real photos and 3D scans). Each bird toy was 3D scanned and then photographed "in the wild".
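The three-phase schedule can be expressed as a small configuration table; the field names and the exact per-phase loss lists below are illustrative assumptions, not the paper's configuration file.

```python
# Hypothetical phase schedule mirroring the three training stages.
PHASES = [
    {"name": "shape_learning",       "losses": ["mask"],
     "articulation": False, "instance_deform": False},
    {"name": "ambiguity_resolution", "losses": ["mask", "pose"],
     "articulation": False, "instance_deform": False},
    {"name": "full_model",           "losses": ["mask", "pose", "image", "flow"],
     "articulation": True,  "instance_deform": True},
]

def active_components(phase_idx):
    """Return (articulation, instance_deform, losses) for a phase."""
    p = PHASES[phase_idx]
    return p["articulation"], p["instance_deform"], p["losses"]
```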

Experiments
We perform an extensive set of experiments to evaluate our method and compare to the state of the art. To this end, we collect three datasets: two of real animals, birds and horses, and one using toy birds for which we can obtain ground-truth 3D scans.

Dataset and Implementation Details
Video Datasets.
We experiment with two types of objects: birds and horses. For each category, we extract a collection of short video clips from YouTube. The exact links to these videos and the preprocessing details are included in the sup. mat. We use the off-the-shelf PointRend model [18] to detect and segment the object instances, remove the frames where the object is static, and automatically split the remaining frames into short clips, each containing one single object. The frames and the masks are then cropped around the objects and resized to 128×128 for training. We also run the off-the-shelf RAFT model [79] on the full frames to estimate optical flow between consecutive frames, and account for the cropping and resizing to obtain the correct optical flow for the crops. This procedure produces the final set of training clips.

Fig. 4: Qualitative Examples. We show multiple views of the reconstructed mesh together with a textured view and an animated version of the bird, obtained by rotating the learned bones. We find that the model is able to recover the shape well even when seen from novel viewpoints. The animation is able to generate believable poses.

3D Toy Bird Dataset.
In order to properly evaluate and compare the quality of the reconstructed 3D shapes produced by different methods, we introduce a 3D Toy Bird Dataset, which consists of ground-truth 3D scans of realistic toy bird models and photographs of them taken in real-world environments. Fig. 3 shows examples of the dataset. Specifically, we obtain 23 toy bird models and use the Apple RealityKit Object Capture API [82] to capture accurate 3D scans from turntable videos. For each model, we then take 5 photographs from different viewpoints in 3 different outdoor scenes, resulting in a total of 345 images. We will release the dataset and ground truth for future benchmarking.
Implementation Details.
Our reconstruction model is implemented using three neural networks (f_V, f_ξ, f_T) as well as a set of trainable parameters for the category prior shape ∆V_tmpl. The shape network f_V and the rigid pose network f_ξ are simple encoders with downsampling convolutional layers that take in an image and predict vertex deformations ∆V_ins, skinning parameters ξ_{2:B}, and rigid pose ξ_1 and J_1 as flattened vectors. The texture network f_T is an encoder-decoder that predicts the texture map T from an image. We use Adam optimizers with a learning rate of 10^-4 for all networks, and a learning rate of 0.01 for the category shape parameters ∆V_tmpl. We use a symmetric ico-sphere as the initial mesh. For each training iteration, we randomly sample 8 consecutive frames from 8 sequences. The models are trained in the three phases described in Section 3.6. All details are included in the sup. mat.

Qualitative Results
Figure 4 shows qualitative 3D reconstruction results obtained from our model. Note that videos are no longer needed during inference and the shown predictions come from a single frame.
Despite not requiring any explicit 3D, viewpoint or keypoint supervision, our model learns to reconstruct accurate 3D shapes from only monocular training videos.The reconstructed 3D meshes can be animated with our skinning model by transforming the bones of the learned shape.This animation can also be transferred between instances.

Comparisons with State-of-the-Art Methods
We compare our model with a number of state-of-the-art learning-based reconstruction methods, including CMR [19], U-CMR [23], UMR [28] and VMR [20]. CMR requires 2D keypoint annotations for initializing the 3D shape and viewpoints and also for the training loss. U-CMR removes keypoint supervision but requires a 3D template shape, and UMR replaces that with part segmentation maps from SCOPS [33], which relies on supervised ImageNet pretraining. VMR [20] allows for deformations but requires the same level of supervision as CMR. All of them rely on external geometric supervision to establish correspondences for learning 3D shapes. We train all these methods on our video dataset with only mask supervision and show that, without the additional supervision, all of them reconstruct poor shapes. We also finetune their models pre-trained on CUB [31] with the required keypoint, camera view or template shape supervision on our bird video dataset. Finally, we also train UMR from scratch on our bird video dataset with SCOPS predictions obtained from the pre-trained SCOPS model.
On 3D Toy Bird Scans.
Our toy scan dataset allows for a direct evaluation of shape prediction. We first scale the predicted shapes to match the volume of the scans and roughly align the canonical pose of each method to the scans manually. Each individual predicted shape is further aligned to the ground-truth scan using Iterative Closest Point (ICP) [83], and the symmetric Chamfer distance (the average of scan-to-object and object-to-scan distances) is reported in centimeters in Table 2, assuming a width of 10 cm for each bird. While the reconstruction quality of the other methods is good when they are trained with more geometric supervision, it degrades strongly without this training signal, resulting in worse reconstructions compared to our method. Note that this metric evaluates the individual shape predictions regardless of the viewpoint. Next, we evaluate the consistency across views.
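The symmetric Chamfer distance used above can be sketched for small point sets with a brute-force pairwise distance matrix (for the dense scans in practice, a nearest-neighbour structure such as a k-d tree would be used instead).

```python
import numpy as np

def symmetric_chamfer(A, B):
    """Symmetric Chamfer distance between point sets A (N, 3) and
    B (M, 3): the average of the mean nearest-neighbour distance
    from A to B and from B to A."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```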
On Bird Video Dataset.
Since we do not have ground-truth 3D shapes and viewpoints for direct evaluation on our video test set, we measure reconstruction quality via a mask forward-projection accuracy from one frame to another, using the object masks predicted by PointRend [18] as pseudo ground truth. This evaluates the shape and viewpoint quality, as the object from a past frame is projected to a future frame, which can only align when both shape and pose are estimated correctly, but cannot account for non-rigid deformation between frames. For each test sequence, we predict the shape at frame t and render the object mask from the pose at frame t + ∆t, with an offset ∆t of 0, 5 and 20 frames. We then compute the mean Intersection over Union (mIoU) between the rendered masks and the ground-truth masks at t + ∆t. Table 3 summarizes the results, which suggest that our model achieves both better shape reconstruction and viewpoint consistency. We also compute the metrics on our model with frame-specific deformations predicted at frame t + ∆t applied to the shape predicted at frame t. This further improves the mask reprojection IoU, confirming that our model learns correct frame-specific deformations.
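The per-frame IoU underlying the mIoU metric is standard and can be sketched directly on binary masks:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks.
    Returns 1.0 when both masks are empty (vacuous agreement)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```

The reported mIoU would then be the mean of this quantity over all evaluated (t, t + ∆t) pairs.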
Other methods tend to overfit the shape to the image, resulting in a larger decrease in reprojection accuracy with increasing ∆t. We also compare the distribution of estimated viewpoints/object poses by plotting the elevation and azimuth predicted on the test set in Fig. 5. Our method is able to learn the full azimuth range, while other methods, with the exception of U-CMR, only predict a limited range of views (azimuth) without additional geometric supervision.

Qualitative Comparisons.
Fig. 5 shows a qualitative comparison of the different methods. When methods relying on more geometric supervision (CMR, U-CMR, VMR) are trained without this learning signal, they fail to produce reasonable shape reconstructions. UMR trained without SCOPS part segmentations overfits to the input views, producing inaccurate 3D shapes. Our method reconstructs accurate shape and pose, despite not using keypoint or template supervision. We refer the reader to the sup. mat. for more results. Note that our model is trained on 128×128 images, whereas the other methods train on 256×256 images and, except U-CMR, sample the texture directly from the input image, explaining the difference in texture quality.
On Horse Video Dataset.
For horses, we compare qualitatively with LASR [34] in Fig. 6, which is an optimization-based method for single video sequences. While their reconstruction appears to be convincing in the original viewpoint, the actual mesh often does not resemble the shape of a horse. Running LASR on such a sequence takes over four hours.

Ablation and Analysis
We ablate the different components of our method quantitatively on our toy bird dataset in Table 4 and Fig. 7. We find that all components are necessary for the final performance. The pose distribution in Table 4 shows that the model only learns the full 360-degree (azimuth) view of the object when all components are active. In particular, the two-view ambiguity resolution and the shape symmetry are important to learn the pose, while video training helps to discover the backside of the object. Without a good pose prediction, the reconstructions look reasonable in the input view but turn out to be degenerate from other directions. The model without symmetry produces unrealistic shapes, indicating that symmetry is a useful prior even when learning deformable shapes. Similarly, the shape prior is important to discover fine details (e.g. beak and tail) that are not visible in every image. The full model predicts a full range of viewpoints (Fig. 7) and the most consistent shape (Table 4). We train another model without the learned category prior shape, predicting individual shapes for each bird. The resulting reconstructions are inconsistent across different instances, as shown in Fig. 7. This suggests that the full model is able to leverage the shape prior of the whole category, which is a major benefit of learning in a reconstruction pipeline.

Fig. 5: Visual Comparison. We compare to state-of-the-art methods trained without external geometric supervision in the form of 2D keypoints, viewpoint, or template shape. As UMR leverages weak supervision using part segmentation maps from SCOPS [33], we show versions trained with and without SCOPS. Our method consistently reconstructs reasonable 3D shapes and its predictions cover the full 360-degree (azimuth) view, whereas other methods produce poor reconstructions and their viewpoint predictions collapse to only a limited range, with the exception of U-CMR. Other methods, except for U-CMR, directly copy the texture from the input image using texture flow. Hence, although the texture appears sharper from the input view, it is often incorrect as seen from other views. See the sup. mat. for extended results.

Fig. 6: Comparison with LASR [34]. While the rendering in the original viewpoint looks convincing, the shape produced by LASR is distorted and does not resemble the actual shape of a horse. Since our method trains on multiple sequences, it can learn a consistent shape.

Limitations and Future Work
Our method still requires segmentation masks obtained from an off-the-shelf model as supervision for training. Moreover, the quality of these masks affects the fidelity of our reconstructions. Thus, similar to comparable methods, our reconstructions do not capture fine details well, such as legs and the beak.
The texture prediction sometimes results in low-quality reconstructions, especially when the input image is affected by motion blur. Currently, we have to handcraft a bone structure for each type of animal, for example different structures for horses (quadrupeds) and birds. How to automatically discover plausible bone structures is an interesting question for future work.

Conclusions
We have presented a method to learn articulated 3D representations of deformable objects from monocular videos without explicit geometric supervision, such as keypoints, viewpoints or template shapes. The resulting 3D meshes are temporally consistent and can be animated. The method can be trained from videos and only needs off-the-shelf object detection and optical flow models for pre-processing. For reproducibility, comparison and benchmarking, the dataset, code and models will be released with the paper.

Additional Results
In addition to the material in this PDF, please see the supplementary video for more qualitative results.

Additional Comparisons with State-of-the-Art Methods
Comparison of Learned Rigid Pose Distributions.
Estimating the viewpoint/object pose is a key factor in learning consistent 3D shapes. We plot the elevation and azimuth of the rigid poses (viewpoints) predicted by various methods (CMR [19], U-CMR [23], UMR [28] and VMR [20]) and compare them in Fig. 9. As shown in the plots, without additional geometric supervision (keypoints, viewpoints or template shapes), existing methods are not able to learn correct poses, and the predicted poses tend to collapse to only half of the azimuth range, resulting in poor shape reconstructions that are not consistent across different inputs, as shown in Fig. 13 as well as in the main paper. The only exception is U-CMR [23], which explicitly uses an extensive viewpoint search (camera multiplex) to encourage diverse viewpoint predictions. However, without a good shape template, the resulting viewpoints are still incorrect, leading to poor shapes.
Additional Qualitative Comparisons.
Fig. 13 provides a few more examples comparing the reconstruction results of our model and several state-of-the-art methods. We also show shape reconstructions rendered with a static synthetic texture in Fig. 8, to clearly illustrate the orientation of the predicted shape. Our model learns more accurate 3D shapes, despite not requiring explicit geometric supervision from keypoints, viewpoints or template shapes. More comparisons on entire video sequences are provided in the supplementary video. Fig. 14 shows more qualitative comparisons to LASR [34] on horses.

PCA Analysis on Learned Shape Space
We analyze the learned articulated shape space across the dataset using Principal Component Analysis (PCA). Figure 11 visualizes the first 6 principal components. Each principal component corresponds to a typical bird movement, which means that the model learns meaningful articulations that are reflected in the skeleton and not in the shape deformation component.

Additional Reconstruction Results
Figure 15 shows more bird reconstructions from various viewpoints. The model is robust to a variety of input images, including frontal views and blurry images.

Texture Swapping and Animation
Our model reconstructs the birds in the canonical pose, where the shapes and textures of different birds are aligned in the canonical representation. This allows us to easily edit the texture, for example by swapping it with that of another bird, as shown in Fig. 16. Moreover, with the learned articulation model, we can also easily animate the reconstructed birds in 3D by rotating the bones, as also illustrated in Fig. 16.

Datasets
We collected videos containing birds and horses from YouTube and automatically pre-processed them as follows. We use the PointRend model [18] to obtain detections and segmentations of the object instances.
Each video is then split into short clips containing a single object, and we remove frames that would contain humans after the final cropping. We also filter out static frames from the clips using optical flow computed by the off-the-shelf RAFT model [79]. The optical flow is then recomputed once more to account for the removed frames. Finally, we crop the frames, segmentation masks and optical flow around the detected object bounding boxes and resize them to 128 × 128 for training.
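As an illustration, the static-frame filtering step can be sketched as follows. This is a minimal pure-Python sketch, not the actual pipeline code: `flows` stands for precomputed per-frame optical-flow fields (in practice produced by RAFT), and the `min_motion` threshold is a hypothetical parameter.

```python
# Sketch of the static-frame filtering step. Hedged assumptions: `flows` is a
# list of precomputed (H, W, 2) optical-flow fields (one per frame, computed
# in practice by RAFT), and `min_motion` is a hypothetical threshold.

def mean_flow_magnitude(flow):
    """Average flow magnitude over all pixels of one flow field."""
    total, count = 0.0, 0
    for row in flow:
        for dx, dy in row:
            total += (dx * dx + dy * dy) ** 0.5
            count += 1
    return total / max(count, 1)

def filter_static_frames(frames, flows, min_motion=0.5):
    """Keep only frames whose mean flow magnitude exceeds the threshold."""
    keep = [i for i, flow in enumerate(flows) if mean_flow_magnitude(flow) >= min_motion]
    return [frames[i] for i in keep], keep
```

A frame whose flow field is (near) zero everywhere is considered static and removed; the surviving frame indices are returned so that the flow can be recomputed over the filtered sequence.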

Bird Videos.
We downloaded a single four-hour-long video of birds, filmed with a static camera, from YouTube.

Horse Videos
We searched YouTube for terms such as horse training, horse training in pen and leading horse, and selected 12 videos where a horse is trained without a mounted rider. As the content of these videos varies, we manually and roughly split the videos into clips where the horse is the main focus before applying the automatic pre-processing.

3D Toy Bird Dataset
Figure 12 shows renderings of six 3D models in our 3D toy bird dataset and the corresponding photographs of the toys "in the wild". The visual appearance of the birds in the photos is close to that of the real birds in CUB200 and in our video dataset. The 3D reconstructions are of high quality and allow for precise quantitative evaluation. For future benchmarking and comparisons, we will release the dataset together with the paper.

Broader Impact
Our work focuses on the 3D reconstruction of deformable objects from monocular videos. We expect this work to be most useful for object categories that lack sophisticated 3D ground-truth annotations and 3D shape models. This is mainly the case for animals, as shown for birds and horses here, and thus the work can potentially be of use in behavioral research of animals in the wild. Our 3D Toy Bird dataset does not contain any humans, only toy birds against natural backgrounds, and was captured by the authors. Thus, the copyright of the dataset lies with the authors, and it does not violate the personal privacy of individuals. Overall, we expect this work to impact mostly the research community, with very little impact on society in the short term.
In the long term, we expect the task of obtaining 3D models, and algorithms that can lift objects from images into 3D, to become more and more important, for example in XR and VFX applications. Our method shows that obtaining these models can be achieved without external geometric supervision, which is often difficult to collect for arbitrary object categories.

Mathematical Details
In this section, we expand on the mathematics of shape, pose and symmetries to provide a detailed foundation for the underlying principles.

Shape and Pose
Recall that we represent the shape of each object instance using a hierarchical shape model. The model first predicts the shape of each instance V at rest pose as V = V_base + ∆V_tmpl + ∆V, where V_base is the initial fixed shape parametrized by a symmetric ico-sphere mesh with 642 vertices and 1,280 faces. ∆V_tmpl is a set of trainable parameters initialized as zeros and directly optimized during training, such that V_tmpl = V_base + ∆V_tmpl represents the category-specific template shape shared across all instances of the category. ∆V is an instance-specific deformation predicted from each input image by the shape network f_V. This rest-pose shape V is further transformed into the actual mesh observed in the image by a posing (or skinning) function g(V, ξ), described in Section 7. The rigid pose is predicted by the pose network f_ξ as a 3D rotation parametrized by a forward vector, together with 3D translations along the xyz axes. Specifically, we predict a forward vector, define (0, 1, 0)^T as the up direction, and obtain a basis for the rotation matrix from these two vectors by two consecutive cross products. This effectively disables in-plane rotation, as the objects tend to stay mostly upright. The translations are capped at a range roughly corresponding to 0.4 of the image size after projection.
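The construction of the rotation from the predicted forward vector can be sketched as follows. This is an illustrative sketch rather than the paper's code; it assumes the forward vector is not parallel to the up direction, and the function names are our own.

```python
# Sketch: build a rotation matrix from a predicted forward vector and a fixed
# up direction (0, 1, 0) via two consecutive cross products, which removes
# in-plane roll. Assumes forward is not parallel to up; names are illustrative.

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = (v[0] ** 2 + v[1] ** 2 + v[2] ** 2) ** 0.5
    return (v[0] / n, v[1] / n, v[2] / n)

def rotation_from_forward(forward, up=(0.0, 1.0, 0.0)):
    """Columns of the returned matrix are the (right, up, forward) basis."""
    f = normalize(forward)
    r = normalize(cross(up, f))  # right vector, orthogonal to up and forward
    u = cross(f, r)              # corrected up, orthogonal to f and r
    return [[r[0], u[0], f[0]],
            [r[1], u[1], f[1]],
            [r[2], u[2], f[2]]]
```

Because the up direction is fixed, the resulting basis has no roll component, matching the assumption that the objects stay mostly upright.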
The bone rotations are predicted by the shape network f V simply as a set of Euler angles.

Skinning Equation
Each bone b is rigidly attached to a parent bone π(b), forming a kinematic tree. The vector J_b is the location of the joint between b and π(b), expressed relative to the parent. If V_ib are the coordinates of mesh vertex i expressed relative to bone b, we can find its location in world space by chaining the rigid bone transformations along the kinematic tree, G_b(ξ) = G_π(b)(ξ) · A_b(ξ), where A_b(ξ) is the local transformation of bone b about its joint J_b. Given an initial mesh V at rest, the posed version is found by first expressing the vertices relative to each bone b in the rest configuration ξ_0 using the transformations G_b(ξ_0)^{-1}, and then posing each vertex by applying the transformation G_b(ξ). This is weighted by the strength of association of each vertex to a bone, resulting in the skinning equation [11]: V'_i = Σ_b w_bi G_b(ξ) G_b(ξ_0)^{-1} V_i.

Skinning Weights.
As described in the main paper, we estimate the bone structure using the two most extreme points of the mesh at its rest pose. The one with positive z coordinate is selected as the head end, and the other one as the tail end, ensuring a consistent orientation of the bone structure. The rotation of each individual bone is represented using Euler angles about the xyz axes in its local coordinate frame, where the center of rotation is given by the joint location. Recall Eq. (2) of the main paper, where we use skinning weights w_bi that associate each mesh vertex with the bones. We softly assign each mesh vertex to the bones based on its distances to them. The weights are defined as the inverse of the distance between a vertex [V_ins]_i and a bone b at rest pose, normalized over all bones with a softmax function with temperature: w_bi = softmax_b(1 / (T (d_bi + ε))), where d_bi is the distance between [V_ins]_i and the line segment (s_b1, s_b2) defining bone b at its rest pose, T is the temperature parameter, and ε is a small number to avoid division by zero.
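A minimal sketch of this weight computation, assuming a standard point-to-segment distance and a numerically stable softmax (the helper names are illustrative, not the paper's implementation):

```python
import math

# Sketch of the skinning-weight computation: a softmax (with temperature T)
# over inverse vertex-to-bone distances. `eps` avoids division by zero; the
# point-to-segment distance helper is illustrative.

def point_segment_distance(p, s1, s2):
    """Euclidean distance from 3D point p to the line segment (s1, s2)."""
    d = tuple(b - a for a, b in zip(s1, s2))
    pv = tuple(b - a for a, b in zip(s1, p))
    dd = sum(x * x for x in d)
    t = 0.0 if dd == 0 else max(0.0, min(1.0, sum(a * b for a, b in zip(pv, d)) / dd))
    closest = tuple(a + t * x for a, x in zip(s1, d))
    return math.dist(p, closest)

def skinning_weights(vertex, bones, T=0.1, eps=1e-6):
    """One weight per bone; bones are rest-pose segments (s_b1, s_b2)."""
    inv = [1.0 / (point_segment_distance(vertex, s1, s2) + eps) for s1, s2 in bones]
    m = max(inv)
    exps = [math.exp((x - m) / T) for x in inv]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```

For a vertex equidistant from two bones, the weights come out equal; a lower temperature T concentrates each vertex's weight on its nearest bone.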
Bone Structure Initialization.
We estimate the bone structure using a simple heuristic. Given the shape corresponding to the rest pose, we define a fixed number of bones inside the mesh forming a 'spine', which we set to lie on two line segments going from the two most extreme points of the mesh to its center. We then divide each line segment into equally sized parts that define the origin and the length of each bone.
For quadruped animals, we also define bones forming the four legs. We divide the space into quadrants along the x and z axes and then find the lowest point of the mesh along the y axis in each quadrant. These lowest points are assumed to correspond to the feet. We define a line segment between each of these lowest points and the closest joint on the spine. As for the spine, we then divide each line segment into equally sized parts forming the individual bones.
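The spine part of this heuristic can be sketched as follows; the function and parameter names are illustrative, and the mesh center is approximated here by the vertex mean.

```python
# Sketch of the spine initialization: connect the two most extreme vertices
# along z to the mesh center and split each segment into equal bones. The
# center is approximated by the vertex mean; names are illustrative.

def init_spine(vertices, bones_per_side=2):
    """vertices: list of (x, y, z) tuples; returns bone segments (start, end)."""
    n = len(vertices)
    center = tuple(sum(c) / n for c in zip(*vertices))
    head = max(vertices, key=lambda v: v[2])  # extreme +z point (head end)
    tail = min(vertices, key=lambda v: v[2])  # extreme -z point (tail end)
    bones = []
    for end in (head, tail):
        for k in range(bones_per_side):
            a = tuple(c + (k / bones_per_side) * (e - c) for c, e in zip(center, end))
            b = tuple(c + ((k + 1) / bones_per_side) * (e - c) for c, e in zip(center, end))
            bones.append((a, b))
    return bones
```

The leg bones for quadrupeds would be constructed analogously, connecting each per-quadrant lowest point to its closest spine joint and subdividing the segment.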

Symmetry
We first define the action of the mirror mapping m on several objects. First, m acts on 3D points as the matrix that flips the x axis, m = diag(−1, 1, 1); note that m^{-1} = m. Next, let V = {V_i ∈ R^3}_{i=1,...,|V|} be the vertices of a mesh. The mesh is posed by the transformation g ∈ SE(3), which we also write as g(·, ξ) for a certain vector of pose parameters ξ. The mirror-symmetric pose of g is the conjugate transformation mgm ∈ SE(3). This also defines the meaning of the symbol mξ as the parameters of the conjugate: g(·, mξ) ≡ mg(·, ξ)m. Thus, given a point X ∈ R^3, this definition satisfies the equation mgX = (mgm)mX, or: mg(X, ξ) = g(mX, mξ).
This means that posing the point X and mirroring the result is the same as applying the mirror pose to the mirror point mX.
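This relation can be checked numerically; the following is an illustrative sketch (not the paper's code) where the pose g is written explicitly as a rotation plus a translation.

```python
# Numeric sanity check of m g(X) = (m g m)(m X) for a rigid pose g = (R, t):
# conjugating g by the x-flip m gives the mirror pose (m R m, m t) acting on
# mirrored points. Illustrative sketch, not the paper's code.

M = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]  # the mirror map m; note m = m^-1

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def pose(R, t, X):
    """g(X) = R X + t, a rigid pose with rotation R and translation t."""
    return [a + b for a, b in zip(matvec(R, X), t)]

def mirror_pose(R, t):
    """The conjugate pose m g m, i.e. rotation m R m and translation m t."""
    return matmul(matmul(M, R), M), matvec(M, t)

# Example: a 90-degree rotation about y plus a translation.
Rg = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
tg = [1.0, 2.0, 3.0]
X = [0.5, -1.0, 2.0]
R2, t2 = mirror_pose(Rg, tg)
lhs = matvec(M, pose(Rg, tg, X))   # m g(X)
rhs = pose(R2, t2, matvec(M, X))   # (m g m)(m X)
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```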
Now let T_i be the color of vertex V_i. This determines the color of the pixel u = ΠV_i, the perspective projection of the posed vertex V_i onto the image plane. If we define the mirror image (mI)(u) = I(mu) as the flipped copy of image I, and if I is the image of the object (V, g, T), then mI is the image of the mirrored object (mV, mgm, T). Next, we assume that the mesh is symmetric. This means that every vertex V_i has a symmetric counterpart V_{m(i)}, where m acts as a permutation on the vertex indices: ∀i ∈ {1, ..., |V|} : mV_i = V_{m(i)}.
In fact, this is not an arbitrary permutation: it consists of swaps, meaning that m(m(i)) = i, so that we have m^{-1} = m, just as before.
It is convenient to define the right action of m on the vertex collection V as applying this permutation. Using matrix notation, if we interpret V as a 3 × |V| matrix, this means that in the expression mV the operator m acts as the axis-flipping matrix of Eq. (8), while in the expression Vm it acts as a |V| × |V| permutation matrix. With this convention, the mesh is symmetric if, and only if, mV = Vm.

Mirror Equivariance.
To summarise, if I is the image of (V, g, T), then mI is the image of (mV, mgm, T) in general, and of (Vm, mgm, T) if the mesh is symmetric. Because the order of the mesh vertices is irrelevant for rendering an image, mI is also the image of the object (Vmm^{-1}, mgm, Tm^{-1}) = (V, mgm, Tm). In other words, we obtain the mirror image mI by viewing the same symmetric shape V under a mirror pose and texture. If the texture is also symmetric (i.e., Tm = T), then we only need to mirror the pose.
We conclude that, if the network makes the prediction (V, g, T) = f(I) for image I, then it must make the prediction (V, mgm, Tm) = f(mI) for the mirror image.
Front-to-Back Ambiguity. Consider the rotation matrix r = diag(−1, 1, −1), which consists of a 180-degree rotation around the y axis. Given a symmetric mesh mV = Vm, consider the objects (V, g, T) and (Vm, rmgm, T). If we pose the second, we see that the posed points are exactly the same, except that the z (not x!) axis is inverted. If we image them using the orthographic projection Π, this axis is removed, so the image points are exactly the same in the two cases. Naturally, because the depth changes, the images are different; however, the masks are identical, causing the ambiguity. To conclude, this means that the network (V, g, T) = f(I) is likely to be confused between pose g and pose rmgm. Hence, we define qξ so that: rmg(·, ξ)m = g(·, qξ).
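The ambiguity can be verified numerically on a toy symmetric mesh; the following sketch (illustrative, not the paper's code) confirms that the two objects pose to points differing only in the sign of z.

```python
# Toy numeric check of the front-to-back ambiguity: for a symmetric mesh, the
# object (Vm, r m g m) poses to the same points as (V, g) with only the z
# coordinate negated, so their orthographic silhouettes (which drop z) agree.

M = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]      # mirror m: flips the x axis
R180 = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]  # r: 180-degree rotation about y

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def apply(A, t, x):
    return [sum(A[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]

# Tiny symmetric "mesh": vertex 1 is the x-mirror of vertex 0 (m swaps them).
V = [[0.5, 0.2, 1.0], [-0.5, 0.2, 1.0]]
swap = [1, 0]

# An arbitrary pose g = (Rg, tg): 90-degree rotation about y plus translation.
Rg = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
tg = [0.3, -0.1, 0.7]

# Conjugate pose r m g m has rotation (r m) Rg m and translation (r m) tg.
RM = matmul(R180, M)
R2 = matmul(matmul(RM, Rg), M)
t2 = [sum(RM[i][j] * tg[j] for j in range(3)) for i in range(3)]

for i in range(2):
    p1 = apply(Rg, tg, V[i])        # posed by (V, g)
    p2 = apply(R2, t2, V[swap[i]])  # posed by (Vm, r m g m)
    assert abs(p1[0] - p2[0]) < 1e-9 and abs(p1[1] - p2[1]) < 1e-9  # same x, y
    assert abs(p1[2] + p2[2]) < 1e-9                                # inverted z
```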
Extension to Articulated Pose. We assume that the bone structure is also symmetric, so that we can define for each bone b a symmetric counterpart m(b) (another permutation) such that:

Fig. 1 :
Fig. 1: DOVE: Deformable Objects from VidEos. Given a collection of video clips of an object category as training data, we learn a model that is able to predict a textured, articulated 3D mesh of the object from a single input image.

Fig. 7 :
Fig. 7: Ablation Studies. We train our model without some of the key components and plot the distribution of the predicted poses. Without the two-view-ambiguity resolution or the symmetry constraint, the pose prediction collapses. Video training and learning a shape prior also help improve the poses and shapes.

Fig. 9 :
Fig. 9: Comparison of the Learned Rigid Pose Distributions. We plot the distributions of the rigid poses predicted by various methods. Without additional geometric supervision ("kp" for keypoints, "vp" for viewpoint and "tmpl" for template shape), the rigid poses (viewpoints) predicted by existing methods collapse to only a limited range, resulting in poor, inconsistent shape predictions. For example, UMR predicts only frontal poses.

Fig. 10 :
Fig. 10: Texture Mapping. The texture is mapped from a circle in the texture map to both the left and right sides of the initial sphere.

Fig. 13 :
Fig. 13: Comparisons against Existing Methods. Without additional geometric supervision, state-of-the-art methods fail to learn correct poses and hence produce poor shape reconstructions, whereas our model learns more plausible 3D shapes and accurate poses.

Fig. 14 :
Fig. 14: Additional qualitative comparison with LASR [34]. The shape produced by LASR is distorted and does not resemble the actual shape of a horse. Since our method trains on multiple sequences, it learns consistent horse shapes.

Fig. 15 :
Fig. 15: Additional Reconstruction Results. We show the reconstructed objects from various viewpoints.

Fig. 16 :
Fig. 16: Texture Swapping and Animation. Top: since our model learns a canonical representation for all objects, we can easily swap the texture across different instances. Bottom: we can also easily animate the 3D birds using our learned articulation model.

Table 2 :
Evaluation on Toy Bird Scans. Shape reconstruction quality measured by the bi-directional Chamfer Distance between the predicted shapes and the ground-truth scans. Lower is better.

Table 3 :
Mask Forward Projection IoU. Shape reconstruction quality and temporal consistency measured by projecting the shape predicted at frame t to a different pose at a future frame t + ∆t and comparing the masks at t + ∆t. Higher is better. "finetuned" indicates pretrained models finetuned on our video dataset.

Table 4 :
Ablation Studies with 3D Toy Bird Scans. Every component of our model helps to improve the final performance.

Table 5 :
Architecture of the shape network f_S and the pose network f_P. The networks follow a convolutional encoder structure. n is the number of parameters predicted by each network.

Table 6 :
Architecture of the texture network f_T. The network follows an encoder-decoder structure.