Towards 3D Scene Understanding Using Differentiable Rendering

Periyasamy, Arul Selvam; Behnke, Sven

doi:10.1007/s42979-022-01663-3

Towards 3D Scene Understanding Using Differentiable Rendering

Original Research
Open access
Published: 03 March 2023

Volume 4, article number 245, (2023)
Cite this article

Download PDF

You have full access to this open access article

SN Computer Science Aims and scope Submit manuscript

Towards 3D Scene Understanding Using Differentiable Rendering

Download PDF

1811 Accesses
Explore all metrics

Abstract

Deep learning methods have achieved significant results in many 2D computer vision tasks. To realize similar results in 3D tasks, equipping deep learning pipelines with components that incorporate knowledge about 2D image generation from the 3D scene description is a promising research direction. Rasterization, the standard formulation of the image generation process is not differentiable, and thus not compatible with the deep learning models trained using gradient-based optimization schemes. In recent years, many new approximate differentiable renderers have been proposed to enable compatibility between deep learning methods and image rendering techniques. Differentiable renderers fit naturally into the render-and-compare framework where the 3D scene parameters are estimated iteratively by minimizing the error between the observed image and the image rendered according to the current scene parameter estimate. In this article, we present StilllebenDR, a light-weight, scalable differentiable renderer built as an extension to the openly available Stillleben library. We demonstrate the usability of the proposed differentiable renderer for the task of iterative 3D deformable registration using a latent shape-space model and occluded object pose refinement using order-independent transparency based on analytical gradients and learned scene aggregation.

A survey on Image Data Augmentation for Deep Learning

Article Open access 06 July 2019

Rotation invariance and equivariance in 3D deep learning: a survey

Article Open access 07 June 2024

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

In recent years, convolutional neural networks (CNNs) have been fundamental in achieving impressive results in many 2D computer vision tasks [1, 2]. The invariances and inductive biases embodied in the CNNs enabled efficient learning from images. Lately, CNNs are also used in architectures for 3D computer vision tasks. However, the inductive biases inherent to CNNs alone are not sufficient for 3D scene understanding [3, 4]. Thus, it is essential to imbue knowledge about 2D image generation from 3D scenes into models for 3D scene understanding. To this end, one promising area of research is differentiable rendering.

In computer graphics terminology, the process of generating 2D images from 3D scene descriptions is known as rasterization. The principle design choice of the standard rasterization implementation is the assumption that only one 3D face contributes to a 2D pixel. This discrete pixel assumption enables efficient parallelization. An unfortunate consequence of the discrete pixel assumption is that the standard rasterization is not differentiable. Despite the fact that deep learning methods inherit a lot of parallel processing techniques from the graphics pipelines to realize efficient parallelization on GPUs, the standard rasterization and deep learning methods remain incompatible. To address this limitation, lately, approximate differentiable rendering has been proposed. In this article, we present StilllebenDR, which is built on top of the Stillleben [5]. Stillleben enables online synthetic data generation for training deep learning models using OpenGL for efficient rendering. StilllebenDR is built as an extension to the Stillleben library with minimal overhead. In contrast to the well-known differentiable renderers [6,7,8], StilllebenDR benefits from the optimizations built into the OpenGL library. Thus, StilllebenDR is highly scalable. In our previous work, we used an early version of the StilllebenDR to refine 6D object pose for all objects in a scene using abstract render-and-compare [9]. Here, we present additional use cases for StilllebenDR, namely 3D deformable registration and occluded object pose refinement. Our contributions include:

1.
StilllebenDR, a differentiable renderer with PyTorch integration,
2.
an end-to-end differentiable pipeline for deformable object registration using a latent shape-space model and differentiable rendering, and
3.
a framework for joint object pose optimization and deformable registration to make our pipeline less susceptible to pose initialization errors.

In addition to our conference paper [10] presented at the 17th International Conference on Computer Vision Theory and Applications (VISAPP) we make the following contributions.

Occluded object pose refinement using:

1.
order-independent transparency employing analytical gradients, and
2.
learned scene decomposition—an efficient alternative to optimization based on analytical gradients.

Related Work

Differentiable Rendering

Rasterization is the process of generating 2D images given the 3D scene description. Libraries like OpenGL [11], Vulkan [12], and DirectX [13] offer optimized rasterization implementations. Although the standard formulation of rendering 3D faces of object meshes into discrete pixels is not differentiable, probabilistic formulations like SoftRas [6], PyTorch3D [7], and DIB-R [8] allow for differentiable rendering. Most of the commonly used differentiable renderers are implemented using CUDA with programming interfaces to neural network libraries like TensorFlow [14] or PyTorch [15].

In this work, we present StilllebenDR, a fast differentiable renderer built as an extension to the openly available Stillleben library [5].

Render-and-Compare

Render-and-compare methods estimate 3D scene parameters—often iteratively—by minimizing the rendered and observed images. Krull et al. [16] trained a CNN to output a similarity score of how well the rendered image corresponds to the observed image and used the Metropolis algorithm to search for scene parameters with the highest similarity score. Moreno et al. [17] demonstrated the applicability of their render-and-compare framework for jointly estimating the shape, pose, lighting, and camera parameters of the scene. In our earlier work [9], we used a render-and-compare framework to refine the 6D pose for all objects in the scene simultaneously by employing an earlier version of StilllebenDR. Li et al. [18] trained a CNN to refine object pose using render-and-compare iteratively.

Deformable Registration

Given a canonical model of an object category, 3D deformable registration aims at deforming the canonical model to match an observed instance while maintaining the geometric structure of the category. Our approach for solving deformable registration is closely related to DeepCPD [19]. DeepCPD uses coherent point drift (CPD) to create a low-dimensional shape-space of the object category and employs a CNN to estimate 3D deformation from single-view RGB images. Our pipeline presented in Section “3D Deformable Registration” uses differentiable rendering to estimate a 3D deformation field instead.

Order-Independent Transparency

In contrast to the standard rendering in which along a camera ray only the faces closest to the camera are rendered, order independent transparency [20,21,22] renders all the faces and blends them using alpha compositing. In general, alpha compositing is not to generate natural-looking images. Nevertheless, it is widely used in scientific visualization, computer-aided design (CAD) software, and game engines. In this work, we show a proof-of-concept occluded object pose refinement using order-independent transparency in Section “Occluded Object Pose Refinement Using Order Independent Transparency” (see Figs. 1 and 2).

StilllebenDR

StilllebenDR is built as an extension to the Stillleben library [5], a synthetic data generation pipeline designed to generate data needed for deep learning models on the fly. It provides PyTorch interface and uses OpenGL for rendering and PhysX for physics simulation. OpenGL provides an optimized implementation of a standard rasterization pipeline.

StilllebenDR enables differentiation support with only a minimal overhead to the OpenGL rasterization pipeline. During the rasterization step, a face F constituting of vertices V with colors C is projected on a pixel I. The pixel color I is computed as

$$\begin{aligned} {I}_{\textrm{rgb}} = \sum _{i} {b}_i {C}_i, \end{aligned}$$

(1)

where $b_i$ are the barycentric coordinates and $\sum _{i} {b}_i = 1$. We simplify the notation ${I}_{rgb}$ and use I instead. OpenGL allows users to write shaders, specialized light-weight programs that are designed to run at specific stages of a graphics pipeline. Breaking down the graphics pipeline into a sequence of shaders enables parallelization. In addition to the standard RGB-D channels, we utilize the flexibility of the shaders to render additional channels containing information like barycentric coordinates and vertex indices, as shown in Fig. 3. The vertex shader and the fragment shader are the only two mandatory shaders in an OpenGL pipeline. In the vertex shader, vertices in the mesh coordinate are projected into the clip space, the space covered by the camera frustum. In the fragment shader, the color for rasterized pixels are computed. To generate barycentric coordinates and vertex indices as additional output channels, we make use of one of the less commonly used shaders, namely the geometry shader. The geometry shader is invoked at the end of the vertex shader and is used to generate additional primitives like vertices, lines, or faces. A common use-case for an OpenGL pipeline consisting of the geometry shader is to visualize vertex normals (see Fig. 1). For each vertex, a line corresponding to the normal direction in the geometry shader is generated. We adapt the geometry shader to generate barycentric coordinates and vertex indices (see Fig. 2). For each vertex, a global vertex index and local vertex are generated. The global vertex index is marked as immutable and is rendered as constant values. For each face, one of (0, 0, 1), (0, 1, 0), (1, 0, 0) mutable vector values are assigned to each vertex uniquely. The vector values are interpolated in the later stages of the shader pipeline using the barycentric coordinates. The immutable global vertex indices and the interpolated local vertex indices reflect the corresponding barycentric coordinates and are rendered as additional output channels.

Given the loss L, computed pixel-wise between the rendered and the observed images, the gradient of the loss with respect to different scene parameters can be decomposed into the gradient of the loss with respect to the rendered image, and the gradient of the rendered image with respect to the scene parameters following the chain rule. For example, the gradient of the loss with respect to the vertex ${V}_{i}$ is computed as

$$\begin{aligned} \frac{\partial {L}}{ \partial {V}_{i}}&= \frac{\partial {I}}{ \partial {V}_{i}} \cdot \frac{\partial {L}}{ \partial {I}}, \end{aligned}$$

(2)

where $\frac{\partial {I}}{ \partial {V}_{i}}$ is computed automatically by PyTorch autograd. Using the barycentric weights and the vertex indices stored during the forward rendering step, $\frac{\partial {I}}{ \partial {V}_{i}}$ is computed as

$$\begin{aligned} \frac{\partial {I}}{ \partial {V}_{i}} = {C}_i. \end{aligned}$$

(3)

Similarly, we break down the gradient of the loss function with respect to object pose P as follows:

$$\begin{aligned} \frac{\partial {L}}{ \partial {P}}&= \frac{\partial {I}}{ \partial {P}} \cdot \frac{\partial {L}}{ \partial {I}}. \end{aligned}$$

(4)

The backward pass of the rendering process is depicted in Fig. 4.

3D Deformable Registration

Deformable registration is crucial for robotic manipulation tasks where robots have to transfer the grasping knowledge from the canonical instance to other instances of the same object category (see Fig. 5).

Latent Shape-Space

Given an object category with multiple instances and a canonical model, we use the coherent point drift (CPD) registration algorithm to learn a low-dimensional latent shape-space. The deformation $\tau _i$ of the canonical model $\textbf{C}$ to an instance i is modeled as

$$\begin{aligned} \tau _i\left( \textbf{C}_i, \textbf{W}_i\right) = \textbf{C} + \textbf{G}\left( \textbf{C}, \textbf{C}\right) \textbf{W}_i. \end{aligned}$$

(5)

where $\textbf{W}$ is the deformation field and $\textbf{G}$ is the Gaussian Kernel matrix defined element-wise as

$$\begin{aligned} \textbf{G}\left( y_i,z_i\right) = g_{ij} = \text {exp}\left( -\frac{1}{2\beta ^2}\left| \left| y_i - z_i\right| \right| ^2\right) , \end{aligned}$$

(6)

$\textbf{W}$ is estimated in the M-step of the EM algorithm. Since the shape of $\textbf{W}_i$ depends on the canonical model $\textbf{C}$ and not the instance i, we reduce the dimension of $\textbf{W}_i$ using principle component analysis (PCA). We use latent shape-space of dimension five for all the object categories in our experiments. We refer the reader to Rodriguez et al. [19], and Rodriguez et al. [23] for a detailed explanation of the latent shape-space.

Deformable Registration Pipeline

In Fig. 6, we present a pipeline to deform the canonical model $\textbf{C}$ to fit the observed image ${I}_\textrm{obs}$ of a novel object instance. We generate the deformation field from the latent shape-space parameters $\mathcal {S}$ as described in Section “3D Deformable Registration” and render the deformed canonical model. We denote the rendered image as ${I}_\textrm{rnd}$ using StilllebenDR. The rendered and the observed images are compared pixel-wise using the image comparison function described in Section “Deformable Registration Pipeline”. We implement the image comparison function and mesh deformation generation from $\mathcal {S}$ using PyTorch. The PyTorch autograd engine enables gradient propagation through these steps automatically. The gradient of the mesh parameters with respect to ${I}_\textrm{rnd}$ is provided by StilllebenDR. As shown in Fig. 4, the gradient can be computed only for the faces that are visible in ${I}_\textrm{rnd}$. However, instead of applying the gradients directly to the mesh parameters, we propagate the gradient to the latent shape-space $\mathcal {S}$ and generate the mesh deformation from $\mathcal {S}$. This step ensures that all vertices deform coherently and maintain the geometry of the object category. We perform the forward rendering and gradient-based $\mathcal {S}$ update iteratively until the image comparison error is negligible.

Image Comparison

Comparing images pixel-wise in the RGB color space is error-prone. Zhang et al. [24,25,26] demonstrated the effectiveness of the convolutional neural network feature space for image comparison. Inspired by the learned perceptual image patch similarity metric (LPIPS) [24], we construct the image comparison function as shown in Fig. 7. We extract the features from the last layer before the output layer of the U-Net model [27] for both rendered and observed images. We normalize the features between – 1 and 1 and aggregate the features along the channel dimension. Finally, we compute mean-squared error (MSE) of the aggregated features from the rendered and the observed images.

Experiments

We evaluate the proposed 3D deformable registration on the DeepCPD dataset [19]. The dataset consists of four object categories: bottles, cameras, drills, and sprays (shown in Fig. 8). For each object category, the dataset provides a canonical model and a varying number of instances. Two instances per category are used for testing and the rest are used for training. All the instances of an object category are aligned to have a common coordinate frame. Since the proposed method does not involve training a separate model for modeling the deformation, we use the training instances only to train the U-Net model for semantic segmentation. We compare our method with CLS [28] and DeepCPD [19]. CLS works with point clouds and thus needs depth information, whereas DeepCPD is RGB only.

Deformation Registration with Known Poses

We employ the proposed end-to-end-differentiable pipeline for deformable registration iteratively. We use stochastic gradient descent (SGD) with a momentum of 0.9 and exponential weight decay of 0.95. The meshes provided by the DeepCPD dataset are not watertight. Since the goal of our method is to deform the canonical model to fit the observed instance while maintaining the geometry of the object category, our pipeline does not benefit from having vertex colors. Thus, we use uniform red color for all the vertices in the canonical mesh. The tiny invisible holes on the surface of the meshes develop into larger visible holes during the iterative deformable registration process. This results in not only the rendered image looking unrealistic but also the image comparison being harder. To alleviate this issue, we use the ManifoldPlus algorithm [29] to generate watertight meshes.

In Table 1, we report the average $\ell _2$ distance between subsampled points of the canonical mesh and the test instances. Some qualitative visualizations are shown in Fig. 9. From the visualizations, one can observe that the rendered deformed mesh fits the observed mesh nicely. Our method not only works for objects with simple geometry like bottles but also for objects with complex geometry like drills and sprays. Quantitatively, our method performs only slightly worse than DeepCPD despite not employing any specialized learnable components to model the deformation.

Joint Deformable Registration and Pose Optimization

The accuracy of the deformable registration greatly depends on the quality of the pose estimation. Under the assumption that the exact pose of the observed instance is known, the competing methods CLS and DeepCPD perform slightly better than the proposed method. When the pose estimation is not accurate enough, the accuracy of the deformable registration degrades as well. In contrast to the methods in comparison, our end-to-end-differentiable pipeline can jointly optimize for 6D object pose along with deformable registration. To demonstrate this feature, we randomly sample offsets in the range of [$-$0.05, 0.05] m for the x and y translation components and [-15°and 15°] for the rotation components. Although our method can optimize z translation along with other pose parameters, optimizing both z translation and vertex position jointly is an ill-posed problem. Thus, we include offsets only for x and y translation components. In our experiments, we observed that the pose parameters require fewer updates to converge than the shape parameters. Therefore, we update the shape parameters at a higher frequency than the pose parameters, i.e., we update the pose parameters once per three shape parameter updates. Quantitative results of joint pose and shape optimization are presented in Table 1. The mean error only increases marginally when pose noise is injected, indicating that our method is less susceptible to pose initialization errors than competing methods.

Table 1 Comparison of our approach with CLS [28] and DeepCPD [19]

Full size table

Occluded Object Pose Refinement Using Order Independent Transparency

Render-and-compare frameworks enable joint pose estimation and pose refinement for a single or all objects in a scene [9, 17, 30]. However, occlusion hinders the effectiveness of the render-and-compare framework. To alleviate the issues with occlusion, we present scene peeling. Let us consider the scenario depicted in Fig. 10. In the top row, we show the observed image and the image rendered according to the current pose estimate. In the bottom row, we visualize the scene corresponding to the images in the top row from top view. In the observed scene, the cup is lying in front of the wooden block. But, due to an erroneous pose estimate, the cup is completely occluded by the wooden block in the image rendered according to the current pose estimate. In this scenario, a render-and-compare framework using differentiable rendering will not able to refine the pose of the cup since the gradients of the loss function exist only for the objects in the rendered image. To address this shortcoming, we introduce depth peeling-based aggregated images for object pose refinement. We render a scene multiple passes by discarding all the faces that are rendered in the previous pass. The first rendering pass is the standard rendering pass resulting in an RGB and a depth image (conventional Z-buffer). In the subsequent passes, we discard all the faces rendered already by ignoring the faces with smaller Z-buffer values than the Z-buffer rendered in the previous pass. In the top row of Fig. 11, we show the resulting RGB images of the first two rendering passes of the scene corresponding to the erroneous pose estimate in Fig. 10.

Peeling Images Aggregation

RGB images rendered during different peeling passes are blended into one aggregated image. Formally, given $C^{i}_{j}$, $z^i_j$, the color and the Z-buffer value at image from $j\textrm{th}$ pass at pixel i respectively, and $C_{b}$, the background color, the aggregated color I at pixel i is defined as:

$$\begin{aligned} {I}_{i} = \sum _{i} {{w^{i}_{j}}{C^{i}_{j}} + {w}^{i}_{b} {C}_b}. \end{aligned}$$

(7)

The weight $w_j$ is defined as:

$$\begin{aligned} w^i_j = \frac{\exp ({Z^i_j}/{\tau })}{\sum _{k}{\exp ({Z^i_k}/{\tau }) + \exp ({\epsilon }/{\tau })}} \end{aligned}$$

(8)

and $\sum _{j}{w^{i}_{j}}+{w}^{i}_{b} = 1$, where $\tau$ is the temperature parameter and $\epsilon$ corresponds to background color. Setting $\tau$ to a small value (1$e^{-2}$ in our experiments) allows blending of RGB images from multiple passes evenly, while setting $\tau =0$ is equivalent to the standard rendering. The aggregated images with different values of $\tau$ are shown in the bottom row of Fig. 11. Figure 11a bottom is generated with $\tau$=$1e^{-2}$. RGB images from the two passes are aggregated with equal weighting. The resulting image is smooth blending of the RGB from the two peeling passes. In contrast, Fig. 11a is the result of an aggressive aggregation where the aggregated image is equivalent to the RGB from the first rendering pass.

Occluded Object Pose Refinement Using Analytical Gradients

Using aggregated images with $\tau$=1$e^{-2}$ for image comparison allows gradients for occluded objects. We perform object pose refinement iteratively to minimize the pixel-wise loss between the aggregated image and the observed image shown in Fig. 10. We decay the temperature parameter by a factor of 0.95 after each iteration. In Fig. 12, we show the aggregated image, and the standard RGB image, the top view of the corresponding scene during the iteration process. The position of mug that is completely occluded behind the wooden block in the initial iteration is refined to match the observed scene. During the refinement process, we can observe the position of the object mug being gradually pulled forward.

Table 2 Quantitative comparison of the RGB and the RGB-D models

Full size table

Learning Scene Decomposition

While performing occluded object pose refinement using analytical gradients works for simple scenes, applying such an approach for complex scenes is non-trivial. Furthermore, it needs multiple iterative refinement steps to converge. An alternate approach to analytical gradients is to learn scene decomposition. To this end, we propose the Scene Decomposition Model depicted in Fig. 13. It takes the observed RGB image, the rendered RGB-D images according to the current pose estimate, and the RGB-D images of the peeling passes as input. RGB image features are extracted using a pre-trained ResNet model. We use the features after the third ResNet block, i.e., from an input RGB image of height H and width W, the features extracted are of dimension $128 \times \frac{H}{8} \times \frac{W}{8}$. The depth images are down-scaled and concatenated with the RGB features. The resulting features are of dimension $515 \times \frac{H}{8} \times \frac{W}{8}$. From these features, the Scene Decomposition Model generates a dense pixel-wise z-estimate for objects in each peeling layer employing ResNet-style convolutional blocks without pooling layers—maintaining the image dimension of $\frac{H}{8} \times \frac{W}{8}$.

Similar to the analytical optimization described in Section “Occluded Object Pose Refinement Using Analytical Gradients”, we demonstrate the learned Scene Aggregation Model by optimizing for the z component of the object translation. The model generates dense pixel-wise refined z-estimate for objects in two peeling layers in one forward pass. It is trained using supervised learning to minimize pixel-wise mean squared error (MSE) using a custom-made dataset. The dataset consists of two objects, namely cup, and wooden block placed at random positions in the scene such that one object always occludes the other. We then add noise to the z-component of the objects’ translation to simulate erroneous z-estimate. We sample uniform z-estimate noise in the range of [$-$0.5, 0.5] cm. In addition to rendering the RGB-D images of objects in the original pose (observed image) and in the erroneous pose (rendered image), we render two RGB-D image peels of objects in the erroneous pose. We use only RGB images of the objects in the original pose for training the model. The model learns the 3D geometry of the scene from the rendered and the peel RGB-D images. We use the Stillleben library to create the dataset on the fly to train the model and train the model for 5000 iterations with a batch size of 32. We create a test set of 1000 scenes for quantitative comparison. To establish a strong competing approach, we create a variant of the Scene Decomposition Model that takes RGB-D observed images instead RGB observed images as input (516 input feature maps instead of 515 feature maps). In Table 2, we present the quantitative comparison of the RGB and RGB-D Scene Decomposition Models. The RGB-D model performs only slightly better than the RGB model despite having access to depth information. This demonstrates the ability of the Scene Decomposition Model to learn the scene 3D geometry only from the RGB images. Moreover, both variants perform better for the wooden block object than for the cup object. This can be attributed to the simple geometric shape of wooden block compared to cup consisting of a non-convex surface. Qualitative visualizations of the RGB model are shown in Fig. 14.

Conclusion

We presented StilllebenDR and the end-to-end differentiable pipeline for joint 3D deformable registration and pose refinement from single-view RGB images using StilllebenDR introduced in our previous work (Periyasamy et al. [10]). Designed as a light-weight extension to the Stillleben library that uses OpenGL graphics pipeline for efficient rendering enables StilllebenDR to be fast and scalable. Furthermore, we introduced occluded object pose refinement based on order-independent transparency using StilllebenDR employing analytical gradients. Aggregating RGB images from multiple depth peeling passes facilitates gradient flow to objects that are completely occluded. Moreover, we introduced learned scene decomposition, an efficient alternative to iterative analytical gradient-based optimization.

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author upon reasonable request.

References

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Article Google Scholar
Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
MATH Google Scholar
Cohen N, Shashua A. Inductive bias of deep convolutional networks through pooling geometry. In: International Conference on learning representations (ICLR), 2017. https://openreview.net/forum?id=BkVsEMYel. Accessed 01 Jan 2023.
Xuhong LI, Grandvalet Y, Davoine F. Explicit inductive bias for transfer learning with convolutional networks. In: International Conference on machine learning (ICML), 2018, pp. 80:2825–2834
Schwarz M, Behnke S. Stillleben: realistic scene synthesis for deep learning in robotics. In: IEEE International Conference on robotics and automation (ICRA), 2020; p. 10502-10508
Liu S, Li T, Chen W, Li H. Soft Rasterizer: a differentiable renderer for image-based 3D reasoning. In: IEEE International Conference on computer vision (ICCV), 2019; pp. 7708–7717 .
Ravi N, Reizenstein J, Novotny D, Gordon T, Lo W-Y, Johnson J, Gkioxari G. Accelerating 3D deep learning with PyTorch3D. In: SIGGRAPH Asia 2020 Courses, 2020.
Chen W, Ling H, Gao J, Smith E, Lehtinen J, Jacobson A, Fidler S. Learning to predict 3D objects with an interpolation-based differentiable renderer. In: Advances in neural information processing systems, Curran Associates, Inc. 2019; p. 32 Editors: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R.
Periyasamy AS, Schwarz M, Behnke S. Refining 6D object pose predictions using abstract render-and-compare. In: IEEE-RAS 19th International Conference on humanoid robots (Humanoids), 2019; pp. 739–46.
Periyasamy AS, Schwarz M, Behnke S. Iterative 3D deformable registration from single-view rgb images using differentiable rendering. In: 17th International Conference on computer vision theory and applications (VISAPP), 2022.
Segal M, Akeley K. The OpenGL graphics system: a specification (version 1.1). 1999. https://registry.khronos.org/OpenGL/specs/gl/glspec11.pdf. Accessed 20 Dec 2022.
Vulkan K. A specification (version 1.0). 2018. https://www.khronos.org/news/press/khronos-releases-vulkan-1-0-specification. Accessed 20 Dec 2022.
Microsoft. Directx-specs. 2019. https://microsoft.github.io/DirectX-Specs/. Accessed 20 Dec 2022.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. TensorFlow: a system for Large-Scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016; pp. 265–83 .
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison An, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, Curran Associates, Inc., 2019; pp. 32:8024–35. Editors: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R.
Krull A, Brachmann E, Michel F, Ying Yang M, Gumhold S, Rother C. Learning analysis-by-synthesis for 6D pose estimation in RGB-D images. In: IEEE International Conference on computer vision (ICCV), 2015; pp. 954–62.
Moreno P, Williams Christopher KI , Nash C, Kohli P. Overcoming occlusion with inverse graphics. In: European Conference on Computer Vision (ECCV) Workshops, 2016; 3:170-185
Li Y, Wang G, Ji X, Xiang Y, Fox D. DeepIM: Deep iterative matching for 6D pose estimation. In: European Conference on computer vision (ECCV), 2018; pp. 683–98.
Rodriguez D, Huber F, Behnke S. Category-level 3D non-rigid registration from single-view RGB images. In: IEEE/RSJ International Conference on intelligent robots and systems (IROS), 2020; pp. 10617-10624
Bavoil L, Myers K. Order independent transparency with dual depth peeling. 2008. https://developer.download.nvidia.com/SDK/10/opengl/src/dual_depth_peeling/doc/DualDepthPeeling.pdf. Accessed 20 May 2022.
Everitt C. Interactive order independent transparency. 2008. https://developer.download.nvidia.com/assets/gamedev/docs/OrderIndependentTransparency.pdf. Accessed 20 May 2022.
Maule M, Comba JLD, Torchelsen RP, Bastos R. A survey of raster-based transparency techniques. Comput Graph. 2011;35(6):1023–34.
Article Google Scholar
Rodriguez D, Cogswell C, Koo S, Behnke S. Transferring grasping skills to novel instances by latent space non-rigid registration. In: IEEE International Conference on robotics and automation (ICRA), 2018; pp. 1-8
Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE Conference on computer vision and pattern recognition (CVPR), 2018; pp. 586–95.
Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks. In: IEEE Conference on computer vision and pattern recognition (CVPR), 2015; pp. 4353–61.
Appalaraju S, Chaoji V. Image similarity using deep CNN and curriculum learning. 2017. arXiv:abs/1709.08761.
Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on medical image computing and computer-assisted intervention (MICCAI), 2015; pp. 234–41.
Myronenko A, Song X. Point set registration: coherent point drift. IEEE Trans Pattern Anal Mach Intell (TPAMI). 2010;32(12):2262–75.
Article Google Scholar
Huang J, Zhou Y, Guibas L. ManifoldPlus: a robust and scalable watertight manifold surface generation method for triangle soups. 2020. arXiv:abs/2005.11621.
Kundu A, Li Y, Rehg JM. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In: IEEE Conference on computer vision and pattern recognition (CVPR), 2018; pp. 3559–68 .

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Autonomous Intelligent Systems, University of Bonn, Friedrich-Hirzebruch-Allee 8, 53115, Bonn, Germany
Arul Selvam Periyasamy & Sven Behnke

Authors

Arul Selvam Periyasamy
View author publications
You can also search for this author in PubMed Google Scholar
Sven Behnke
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Arul Selvam Periyasamy.

Ethics declarations

Conflict of interest

Both authors declare that they do not have any conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances on Computer Vision, Imaging and Computer Graphics Theory and Applications” guest edited by Kadi Bouatouch, Augusto Sousa, Mounia Ziat and Helen Purchase.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Periyasamy, A.S., Behnke, S. Towards 3D Scene Understanding Using Differentiable Rendering. SN COMPUT. SCI. 4, 245 (2023). https://doi.org/10.1007/s42979-022-01663-3

Download citation

Received: 22 June 2022
Accepted: 30 December 2022
Published: 03 March 2023
DOI: https://doi.org/10.1007/s42979-022-01663-3

Towards 3D Scene Understanding Using Differentiable Rendering

Abstract

Similar content being viewed by others

A survey on Image Data Augmentation for Deep Learning

Rotation invariance and equivariance in 3D deep learning: a survey

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers

Introduction

Related Work

Differentiable Rendering

Render-and-Compare

Deformable Registration

Order-Independent Transparency

StilllebenDR

3D Deformable Registration

Latent Shape-Space

Deformable Registration Pipeline

Image Comparison

Experiments

Deformation Registration with Known Poses

Joint Deformable Registration and Pose Optimization

Occluded Object Pose Refinement Using Order Independent Transparency

Peeling Images Aggregation

Occluded Object Pose Refinement Using Analytical Gradients

Learning Scene Decomposition

Conclusion

Data Availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation