Towards 3D Scene Understanding Using Differentiable Rendering

Deep learning methods have achieved significant results in many 2D computer vision tasks. To realize similar results in 3D tasks, equipping deep learning pipelines with components that incorporate knowledge about 2D image generation from the 3D scene description is a promising research direction. Rasterization, the standard formulation of the image generation process, is not differentiable and thus not compatible with deep learning models trained using gradient-based optimization schemes. In recent years, many approximate differentiable renderers have been proposed to enable compatibility between deep learning methods and image rendering techniques. Differentiable renderers fit naturally into the render-and-compare framework, where the 3D scene parameters are estimated iteratively by minimizing the error between the observed image and the image rendered according to the current scene parameter estimate. In this article, we present StilllebenDR, a light-weight, scalable differentiable renderer built as an extension to the openly available Stillleben library. We demonstrate the usability of the proposed differentiable renderer for two tasks: iterative 3D deformable registration using a latent shape-space model, and occluded object pose refinement using order-independent transparency, based on either analytical gradients or learned scene aggregation.


Introduction
In recent years, convolutional neural networks (CNNs) have been fundamental in achieving impressive results in many 2D computer vision tasks [1,2]. The invariances and inductive biases embodied in the CNNs enabled efficient learning from images. Lately, CNNs are also used in architectures for 3D computer vision tasks. However, the inductive biases inherent to CNNs alone are not sufficient for 3D scene understanding [3,4]. Thus, it is essential to imbue knowledge about 2D image generation from 3D scenes into models for 3D scene understanding. To this end, one promising area of research is differentiable rendering.
In computer graphics terminology, the process of generating 2D images from 3D scene descriptions is known as rasterization. The principal design choice of the standard rasterization implementation is the assumption that only one 3D face contributes to a 2D pixel. This discrete pixel assumption enables efficient parallelization. An unfortunate consequence of the discrete pixel assumption is that the standard rasterization is not differentiable. Although deep learning methods inherit many parallel processing techniques from graphics pipelines to realize efficient parallelization on GPUs, the standard rasterization and deep learning methods remain incompatible. To address this limitation, approximate differentiable rendering has been proposed in recent years. In this article, we present StilllebenDR, which is built on top of Stillleben [5]. Stillleben enables online synthetic data generation for training deep learning models and uses OpenGL for efficient rendering. StilllebenDR is built as an extension to the Stillleben library with minimal overhead. In contrast to the well-known differentiable renderers [6][7][8], StilllebenDR benefits from the optimizations built into the OpenGL library. Thus, StilllebenDR is highly scalable. In our previous work, we used an early version of StilllebenDR to refine the 6D object pose for all objects in a scene using abstract render-and-compare [9]. Here, we present additional use cases for StilllebenDR, namely 3D deformable registration and occluded object pose refinement. Our contributions include: 1. StilllebenDR, a differentiable renderer with PyTorch integration, 2. an end-to-end differentiable pipeline for deformable object registration using a latent shape-space model and differentiable rendering, and 3. a framework for joint object pose optimization and deformable registration to make our pipeline less susceptible to pose initialization errors.
In addition to our conference paper [10], presented at the 17th International Conference on Computer Vision Theory and Applications (VISAPP), we make the following contributions.
Occluded object pose refinement using: 1. order-independent transparency employing analytical gradients, and 2. learned scene decomposition, an efficient alternative to optimization based on analytical gradients.

Differentiable Rendering
Rasterization is the process of generating 2D images given the 3D scene description. Libraries like OpenGL [11], Vulkan [12], and DirectX [13] offer optimized rasterization implementations. Although the standard formulation of rendering 3D faces of object meshes into discrete pixels is not differentiable, probabilistic formulations like SoftRas [6], PyTorch3D [7], and DIB-R [8] allow for differentiable rendering. Most of the commonly used differentiable renderers are implemented using CUDA with programming interfaces to neural network libraries like TensorFlow [14] or PyTorch [15].
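To illustrate why the discrete formulation blocks gradient flow, the following minimal sketch compares a hard coverage test with a sigmoid-based soft coverage in one dimension. The sigmoid relaxation is only in the spirit of the probabilistic formulations cited above. The function names and the sharpness parameter `sigma` are illustrative assumptions, not the API of any of these libraries.

```python
import math

def hard_coverage(pixel_x, edge_x):
    """Discrete rasterization: the pixel is either covered (1) or not (0)."""
    return 1.0 if pixel_x < edge_x else 0.0

def soft_coverage(pixel_x, edge_x, sigma=0.1):
    """Sigmoid relaxation: coverage varies smoothly with the signed distance to the edge."""
    return 1.0 / (1.0 + math.exp(-(edge_x - pixel_x) / sigma))

def finite_difference(coverage, pixel_x, edge_x, eps=1e-4):
    """Central finite-difference derivative of the coverage w.r.t. the edge position."""
    return (coverage(pixel_x, edge_x + eps) - coverage(pixel_x, edge_x - eps)) / (2 * eps)

# A pixel at x = 0.5 and a face edge at x = 0.8: the pixel is covered in both formulations.
grad_hard = finite_difference(hard_coverage, 0.5, 0.8)  # zero: no signal for the optimizer
grad_soft = finite_difference(soft_coverage, 0.5, 0.8)  # positive: moving the edge changes the pixel
```

With the hard test, the derivative is zero everywhere except exactly at the edge, so a gradient-based optimizer receives no information about how to move the face.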
In this work, we present StilllebenDR, a fast differentiable renderer built as an extension to the openly available Stillleben library [5].

Render-and-Compare
Render-and-compare methods estimate 3D scene parameters, often iteratively, by minimizing the discrepancy between the rendered and the observed images. Krull et al. [16] trained a CNN to output a similarity score of how well the rendered image corresponds to the observed image and used the Metropolis algorithm to search for scene parameters with the highest similarity score. Moreno et al. [17] demonstrated the applicability of their render-and-compare framework for jointly estimating the shape, pose, lighting, and camera parameters of the scene. In our earlier work [9], we used a render-and-compare framework to refine the 6D pose for all objects in the scene simultaneously by employing an earlier version of StilllebenDR. Li et al. [18] trained a CNN to refine the object pose iteratively using render-and-compare.
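The iterative scheme can be sketched on a toy problem. The example below assumes a one-dimensional "renderer" that draws a Gaussian blob at a given position and refines the position by gradient descent on the pixel-wise mean squared error. The renderer, the step size, and all names are hypothetical and chosen only to keep the loop self-contained; they are unrelated to the methods cited above.

```python
import math

def render(position, num_pixels=32, width=3.0):
    """Toy differentiable renderer: a 1D Gaussian blob centered at `position`."""
    return [math.exp(-((x - position) ** 2) / (2 * width ** 2)) for x in range(num_pixels)]

def loss_and_grad(position, observed, width=3.0):
    """Mean squared error between rendered and observed image, and its derivative w.r.t. position."""
    rendered = render(position, len(observed), width)
    n = len(observed)
    loss = sum((r - o) ** 2 for r, o in zip(rendered, observed)) / n
    # Chain rule: dL/dp = sum_x (dL/dI_x) * (dI_x/dp), with dI_x/dp = I_x * (x - p) / width^2.
    grad = sum(2 * (r - o) / n * r * (x - position) / width ** 2
               for x, (r, o) in enumerate(zip(rendered, observed)))
    return loss, grad

def refine(initial, observed, steps=300, lr=10.0):
    """Render, compare, and update the scene parameter until the step budget is used up."""
    position = initial
    for _ in range(steps):
        _, grad = loss_and_grad(position, observed)
        position -= lr * grad
    return position

observed = render(18.0)            # "observed" image with the true position 18
estimate = refine(14.0, observed)  # start from an erroneous estimate
```

Because the blob has smooth tails, the rendered and observed images overlap even when the initial estimate is off, which is what gives the optimizer a usable gradient.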

Deformable Registration
Given a canonical model of an object category, 3D deformable registration aims at deforming the canonical model to match an observed instance while maintaining the geometric structure of the category. Our approach for solving deformable registration is closely related to DeepCPD [19]. DeepCPD uses coherent point drift (CPD) to create a low-dimensional shape-space of the object category and employs a CNN to estimate the 3D deformation from single-view RGB images. Our pipeline presented in Section "3D Deformable Registration" uses differentiable rendering to estimate the 3D deformation field instead.

Order-Independent Transparency
In contrast to standard rendering, in which only the faces closest to the camera along a camera ray are rendered, order-independent transparency [20][21][22] renders all the faces and blends them using alpha compositing. In general, alpha compositing is not meant to generate natural-looking images. Nevertheless, it is widely used in scientific visualization, computer-aided design (CAD) software, and game engines. In this work, we show a proof-of-concept occluded object pose refinement using order-independent transparency in Section "Occluded Object Pose Refinement Using Order Independent Transparency" (see Figs. 1 and 2).

StilllebenDR
StilllebenDR is built as an extension to the Stillleben library [5], a synthetic data generation pipeline designed to generate the data needed for training deep learning models on the fly. Stillleben provides a PyTorch interface and uses OpenGL for rendering and PhysX for physics simulation. OpenGL provides an optimized implementation of the standard rasterization pipeline.
StilllebenDR adds differentiation support with only minimal overhead to the OpenGL rasterization pipeline. During the rasterization step, a face F consisting of vertices V with colors C is projected onto a pixel I. The pixel color is computed as $I_{rgb} = \sum_i b_i C_i$, where $b_i$ are the barycentric coordinates with $\sum_i b_i = 1$. To simplify the notation, we write I instead of $I_{rgb}$. OpenGL allows users to write shaders, specialized light-weight programs that are designed to run at specific stages of a graphics pipeline. Breaking down the graphics pipeline into a sequence of shaders enables parallelization. In addition to the standard RGB-D channels, we utilize the flexibility of the shaders to render additional channels containing information like barycentric coordinates and vertex indices, as shown in Fig. 3. The vertex shader and the fragment shader are the only two mandatory shaders in an OpenGL pipeline. In the vertex shader, vertices in the mesh coordinate frame are projected into the clip space, the space covered by the camera frustum. In the fragment shader, the colors of the rasterized pixels are computed. To generate barycentric coordinates and vertex indices as additional output channels, we make use of one of the less commonly used shaders, namely the geometry shader. The geometry shader is invoked after the vertex shader and is used to generate additional primitives like vertices, lines, or faces. A common use case for an OpenGL pipeline that includes a geometry shader is to visualize vertex normals (see Fig. 1): for each vertex, a line corresponding to the normal direction is generated in the geometry shader.
Figure caption: For each vertex, a line primitive corresponding to the normal direction is generated in the geometry shader. Faces are rasterized into discrete pixels in the fragment shader.
Figure panels: Geometry shader; Faces; Renderer; RGB channels; Vertex indices & Barycentric coordinates.
We adapt the geometry shader to generate barycentric coordinates and vertex indices (see Fig. 2). For each vertex, a global vertex index and a local vertex index are generated. The global vertex index is marked as immutable and is rendered as a constant value. For each face, the mutable vectors (0, 0, 1), (0, 1, 0), and (1, 0, 0) are assigned to its three vertices, one per vertex. These vectors are interpolated in the later stages of the shader pipeline, so the interpolated values at a pixel correspond to its barycentric coordinates. The immutable global vertex indices and the interpolated local vertex indices are rendered as additional output channels.
Given the loss L, computed pixel-wise between the rendered and the observed images, the gradient of the loss with respect to the scene parameters can be decomposed, following the chain rule, into the gradient of the loss with respect to the rendered image and the gradient of the rendered image with respect to the scene parameters. For example, the gradient of the loss with respect to the vertex $V_i$ is computed as $\frac{\partial L}{\partial V_i} = \frac{\partial L}{\partial I} \frac{\partial I}{\partial V_i}$, where $\frac{\partial L}{\partial I}$ is computed automatically by PyTorch autograd.
The remaining factor $\frac{\partial I}{\partial V_i}$ is computed using the barycentric weights and the vertex indices stored during the forward rendering step. Similarly, we break down the gradient of the loss function with respect to the object pose P as $\frac{\partial L}{\partial P} = \frac{\partial L}{\partial I} \frac{\partial I}{\partial P}$. The backward pass of the rendering process is depicted in Fig. 4.
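The role of the stored barycentric weights can be illustrated for the simplest case, the gradient with respect to the vertex colors. This follows directly from the interpolation of the pixel color as a barycentric-weighted sum of the vertex colors: the derivative of the pixel color with respect to the color of vertex i is the weight b_i. The sketch below is a plain-Python illustration under that assumption. The function names are hypothetical, and the gradient with respect to vertex positions, which StilllebenDR also provides, is not covered here.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of the 2D point p with respect to the triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    b0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    b1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return b0, b1, 1.0 - b0 - b1

def shade(bary, colors):
    """Forward pass: I = sum_i b_i * C_i, evaluated per RGB channel."""
    return [sum(b * color[ch] for b, color in zip(bary, colors)) for ch in range(3)]

def backward_colors(bary, grad_pixel):
    """Backward pass: dL/dC_i = b_i * dL/dI, reusing the weights stored in the forward pass."""
    return [[b * g for g in grad_pixel] for b in bary]

# One pixel inside a triangle with red, green, and blue vertices.
bary = barycentric((1.0, 1.0), (0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
pixel = shade(bary, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
# dL/dI would normally come from autograd; here it is a hand-picked placeholder.
grad_colors = backward_colors(bary, [1.0, 0.0, 0.0])
```

Storing the weights and vertex indices as extra output channels means the backward pass needs no second rasterization: each pixel already knows which vertices it came from and how much each contributed.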

3D Deformable Registration
Deformable registration is crucial for robotic manipulation tasks where robots have to transfer the grasping knowledge from the canonical instance to other instances of the same object category (see Fig. 5).

Latent Shape-Space
Given an object category with multiple instances and a canonical model, we use the coherent point drift (CPD) registration algorithm to learn a low-dimensional latent shape-space. The deformation $T_i$ of the canonical model C to an instance i is modeled as $T_i = C + G W_i$, where $W_i$ is the deformation field and G is the Gaussian kernel matrix defined element-wise as $G_{jk} = \exp\left(-\frac{\lVert c_j - c_k \rVert^2}{2\beta^2}\right)$, with $c_j$ and $c_k$ denoting points of the canonical model and $\beta$ the kernel width. $W_i$ is estimated in the M-step of the EM algorithm. Since the shape of $W_i$ depends on the canonical model C and not on the instance i, the deformation fields of all instances have the same size, and we reduce the dimension of $W_i$ using principal component analysis (PCA). We use a latent shape-space of dimension five for all the object categories in our experiments. We refer the reader to Rodriguez et al. [19] and Rodriguez et al. [23] for a detailed explanation of the latent shape-space.
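As a small illustration of the CPD deformation model, the sketch below builds the Gaussian kernel matrix for a handful of canonical points and applies a deformation field in the standard CPD form, where the deformed points are the canonical points plus the kernel matrix times the deformation field. The kernel width `beta`, the example points, and the function names are assumptions made for this example. The EM estimation of the deformation field and the PCA step are not shown.

```python
import math

def gaussian_kernel(points, beta):
    """G[j][k] = exp(-||c_j - c_k||^2 / (2 * beta^2)) for all pairs of canonical points."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-squared_distance(p, q) / (2 * beta ** 2)) for q in points]
            for p in points]

def deform(points, G, W):
    """T = C + G W: each point moves by a kernel-weighted sum of the rows of W."""
    dim = len(points[0])
    return [[p[d] + sum(G[j][k] * W[k][d] for k in range(len(points))) for d in range(dim)]
            for j, p in enumerate(points)]

canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
G = gaussian_kernel(canonical, beta=1.0)
# Push only the first point upwards; the kernel spreads the motion to its neighbors.
W = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
deformed = deform(canonical, G, W)
```

In this example, the nearby second point follows the first one by a factor of exp(-0.5), while the distant third point barely moves. This kernel-induced smoothness is what keeps neighboring points moving coherently.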

Deformable Registration Pipeline
In Fig. 6, we present a pipeline to deform the canonical model C to fit the observed image $I_{obs}$ of a novel object instance. We generate the deformation field from the latent shape-space parameters S as described in Section "Latent Shape-Space" and render the deformed canonical model using StilllebenDR. We denote the rendered image as $I_{rnd}$. The rendered and the observed images are compared pixel-wise using the image comparison function described in Section "Image Comparison". We implement the image comparison function and the generation of the mesh deformation from S using PyTorch, so the PyTorch autograd engine propagates gradients through these steps automatically. The gradient of $I_{rnd}$ with respect to the mesh parameters is provided by StilllebenDR. As shown in Fig. 4, this gradient can be computed only for the faces that are visible in $I_{rnd}$. Therefore, instead of applying the gradients directly to the mesh parameters, we propagate the gradient to the latent shape-space parameters S and generate the mesh deformation from S. This step ensures that all vertices deform coherently and maintain the geometry of the object category. We perform the forward rendering and the gradient-based update of S iteratively until the image comparison error is negligible.
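The effect of routing the gradient through the latent parameters can be sketched as follows. The example assumes a linear PCA decoder that maps S to a deformation field, which is then applied through a kernel matrix G as in CPD. The decoder form, the tiny two-vertex mesh, and all names are hypothetical and serve only to show that a gradient on one visible vertex still moves every vertex once it is expressed in terms of S.

```python
def matmul(A, B):
    """Plain-Python matrix product of two row-major nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def decode(S, W_mean, basis):
    """Assumed linear PCA decoder: W = W_mean + sum_k S[k] * basis[k]."""
    return [[W_mean[i][d] + sum(s * Bk[i][d] for s, Bk in zip(S, basis))
             for d in range(len(W_mean[0]))] for i in range(len(W_mean))]

def vertices(S, C, G, W_mean, basis):
    """Deformed vertices V = C + G W(S)."""
    GW = matmul(G, decode(S, W_mean, basis))
    return [[c + g for c, g in zip(c_row, g_row)] for c_row, g_row in zip(C, GW)]

def grad_latent(grad_V, G, basis):
    """Chain rule: dL/dS[k] is the inner product of dL/dV with G @ basis[k]."""
    grads = []
    for Bk in basis:
        GB = matmul(G, Bk)
        grads.append(sum(gv * gb for g_row, b_row in zip(grad_V, GB)
                         for gv, gb in zip(g_row, b_row)))
    return grads

C = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
G = [[1.0, 0.5], [0.5, 1.0]]
W_mean = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
basis = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]   # a single latent dimension
# Only vertex 0 is visible, so only its row of dL/dV is non-zero.
grad_V = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad_S = grad_latent(grad_V, G, basis)
moved = vertices([0.2], C, G, W_mean, basis)
```

Although the second vertex receives no gradient of its own, changing S displaces it as well, which mirrors the coherence argument made above.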

Image Comparison
Comparing images pixel-wise in the RGB color space is error-prone. Zhang et al. [24][25][26] demonstrated the effectiveness of the convolutional neural network feature space for image comparison. Inspired by the learned perceptual image patch similarity metric (LPIPS) [24], we construct the image comparison function as shown in Fig. 7. We extract the features from the last layer before the output layer of the U-Net model [27] for both rendered and observed images. We normalize the features between -1 and 1 and aggregate the features along the channel dimension. Finally, we compute mean-squared error (MSE) of the aggregated features from the rendered and the observed images.
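A minimal sketch of the comparison step is given below, operating on feature maps that are assumed to have been extracted already. Two details are assumptions, because the text does not pin them down: the normalization is implemented as min-max scaling over the whole feature map, and the aggregation along the channel dimension is implemented as a mean. Feature maps are represented as nested lists with one list of pixel values per channel, and the function names are hypothetical.

```python
def normalize(features):
    """Min-max scale all values of a (channels x pixels) feature map to the range [-1, 1]."""
    flat = [v for channel in features for v in channel]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against constant feature maps
    return [[2.0 * (v - lo) / scale - 1.0 for v in channel] for channel in features]

def aggregate(features):
    """Average over the channel dimension, leaving one value per pixel."""
    return [sum(pixel) / len(pixel) for pixel in zip(*features)]

def image_comparison(features_rendered, features_observed):
    """Mean squared error between the aggregated, normalized feature maps."""
    a = aggregate(normalize(features_rendered))
    b = aggregate(normalize(features_observed))
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two channels and two pixels per feature map.
features_rendered = [[0.0, 2.0], [4.0, 4.0]]
features_observed = [[0.0, 4.0], [4.0, 0.0]]
error = image_comparison(features_rendered, features_observed)
```

In the actual pipeline, these operations would be expressed with PyTorch tensors so that autograd can propagate the error back to the rendered image.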

Experiments
We evaluate the proposed 3D deformable registration on the DeepCPD dataset [19]. The dataset consists of four object categories: bottles, cameras, drills, and sprays (shown in Fig. 8). For each object category, the dataset provides a canonical model and a varying number of instances. Two instances per category are used for testing and the rest are used for training. We compare our method with CLS [28] and DeepCPD [19]. CLS works with point clouds and thus needs depth information, whereas DeepCPD is RGB only.

Deformation Registration with Known Poses
We employ the proposed end-to-end differentiable pipeline for deformable registration iteratively. We use stochastic gradient descent (SGD) with a momentum of 0.9 and exponential weight decay of 0.95. Since the goal of our method is to deform the canonical model to fit the observed instance while maintaining the geometry of the object category, our pipeline does not benefit from having vertex colors. Thus, we use a uniform red color for all the vertices in the canonical mesh. The meshes provided by the DeepCPD dataset are not watertight. The tiny invisible holes on the surface of the meshes develop into larger visible holes during the iterative deformable registration process. As a result, the rendered image looks unrealistic and the image comparison becomes harder. To alleviate this issue, we use the ManifoldPlus algorithm [29] to generate watertight meshes.
In Table 1, we report the average ℓ2 distance between subsampled points of the canonical mesh and the test instances. Some qualitative visualizations are shown in Fig. 9. From the visualizations, one can observe that the rendered deformed mesh fits the observed mesh closely. Our method works not only for objects with simple geometry like bottles but also for objects with complex geometry like drills and sprays. Quantitatively, our method performs only slightly worse than DeepCPD despite not employing any specialized learnable components to model the deformation.

Joint Deformable Registration and Pose Optimization
The accuracy of the deformable registration greatly depends on the quality of the pose estimation. Under the assumption that the exact pose of the observed instance is known, the competing methods CLS and DeepCPD perform slightly better than the proposed method. When the pose estimation is not accurate enough, the accuracy of the deformable registration degrades as well. In contrast to the methods in comparison, our end-to-end differentiable pipeline can jointly optimize for the 6D object pose along with the deformable registration. To demonstrate this feature, we randomly sample offsets in the range of [−0.05, 0.05] m for the x and y translation components and [−15°, 15°] for the rotation components. Although our method can optimize the z translation along with the other pose parameters, optimizing both the z translation and the vertex positions jointly is an ill-posed problem. Thus, we include offsets only for the x and y translation components. In our experiments, we observed that the pose parameters require fewer updates to converge than the shape parameters. Therefore, we update the shape parameters at a higher frequency than the pose parameters, i.e., we update the pose parameters once per three shape parameter updates. Quantitative results of the joint pose and shape optimization are presented in Table 1. The mean error increases only marginally when pose noise is injected, indicating that our method is less susceptible to pose initialization errors than the competing methods.
Fig. 7 Image comparison operation. We compare the rendered canonical and the observed image using U-Net features. We normalize the extracted U-Net features between -1 and 1 and aggregate them along the channel dimension. Finally, we compute the mean-squared error pixel-wise between the aggregated features. Source: Periyasamy et al.
SN Computer Science
Table 1 Comparison of our approach with CLS [28] and DeepCPD [19]. The best values for the RGB and 3D categories are presented in bold font.
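The update schedule for the joint optimization, one pose update per three shape updates, can be sketched on a toy objective. The quadratic objective, the learning rate, and the function names below are assumptions for illustration only. In the actual pipeline, the gradients would come from the differentiable renderer and the image comparison function.

```python
def joint_optimize(grad_shape, grad_pose, shape, pose, steps=300, lr=0.1, pose_every=3):
    """Update the shape at every step and the pose only at every `pose_every`-th step."""
    pose_updates = 0
    for step in range(1, steps + 1):
        shape -= lr * grad_shape(shape, pose)
        if step % pose_every == 0:
            pose -= lr * grad_pose(shape, pose)
            pose_updates += 1
    return shape, pose, pose_updates

# Toy coupled objective L(s, p) = (s - p)^2 + (p - 2)^2 with its minimum at s = p = 2.
def grad_shape(s, p):
    return 2.0 * (s - p)

def grad_pose(s, p):
    return -2.0 * (s - p) + 2.0 * (p - 2.0)

shape, pose, pose_updates = joint_optimize(grad_shape, grad_pose, shape=0.0, pose=0.0)
```

Between two pose updates, the shape parameter has three steps to adapt to the current pose, which reflects the observation that the pose needs fewer updates to converge.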

Occluded Object Pose Refinement Using Order Independent Transparency
Render-and-compare frameworks enable joint pose estimation and pose refinement for a single object or for all objects in a scene [9,17,30]. However, occlusion hinders the effectiveness of the render-and-compare framework. To alleviate the issues with occlusion, we present scene peeling. Let us consider the scenario depicted in Fig. 10. In the top row, we show the observed image and the image rendered according to the current pose estimate. In the bottom row, we visualize the scenes corresponding to the images in the top row from the top view. In the observed scene, the cup is lying in front of the wooden block. However, due to an erroneous pose estimate, the cup is completely occluded by the wooden block in the image rendered according to the current pose estimate. In this scenario, a render-and-compare framework using differentiable rendering will not be able to refine the pose of the cup, since the gradients of the loss function exist only for the objects visible in the rendered image. To address this shortcoming, we introduce depth peeling-based aggregated images for object pose refinement. We render a scene in multiple passes, discarding in each pass all the faces that were rendered in the previous passes. The first rendering pass is the standard rendering pass resulting in an RGB image and a depth image (conventional Z-buffer). In the subsequent passes, we discard all the faces rendered already by ignoring the faces whose depth values are smaller than or equal to the Z-buffer value rendered in the previous pass. In the top row of Fig. 11, we show the resulting RGB images of the first two rendering passes of the scene corresponding to the erroneous pose estimate in Fig. 10.
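The peeling passes can be sketched at the level of a single pixel. The example below assumes that, for each pixel, the list of fragments along the camera ray is available as pairs of depth and color, with smaller depth meaning closer to the camera. This per-pixel fragment list and the function name are simplifications for illustration. An OpenGL implementation would instead re-render the scene and compare against the Z-buffer of the previous pass.

```python
def depth_peel(fragments_per_pixel, num_passes):
    """Return one layer per pass; each layer holds the nearest not-yet-rendered fragment per pixel."""
    passes = []
    previous_depth = [float("-inf")] * len(fragments_per_pixel)
    for _ in range(num_passes):
        layer = []
        for i, fragments in enumerate(fragments_per_pixel):
            # Keep only fragments strictly behind what the previous pass rendered at this pixel.
            remaining = [f for f in fragments if f[0] > previous_depth[i]]
            nearest = min(remaining, key=lambda f: f[0]) if remaining else None
            if nearest is not None:
                previous_depth[i] = nearest[0]
            layer.append(nearest)
        passes.append(layer)
    return passes

# Three pixels: the block hides the cup at pixel 0, only the block at pixel 1, nothing at pixel 2.
scene = [
    [(1.0, "block"), (1.5, "cup")],
    [(1.0, "block")],
    [],
]
first_pass, second_pass = depth_peel(scene, num_passes=2)
```

The first pass reproduces the standard rendering, while the second pass exposes the occluded cup at the pixel where it was hidden.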

Peeling Images Aggregation
RGB images rendered during different peeling passes are blended into one aggregated image. Formally, let $C^i_j$ and $z^i_j$ denote the color and the Z-buffer value at pixel i of the image from the jth pass, and let $C_b$ denote the background color. The aggregated color $I^i$ at pixel i is defined as a weighted combination of the per-pass colors $C^i_j$ and the background color $C_b$. The weights $w_j$ are computed from the Z-buffer values and depend on a temperature parameter. Setting the temperature to a small value (1e-2 in our experiments) blends the RGB images from multiple passes evenly, while setting it to 0 results in an aggressive aggregation. Figure 11a bottom is generated with a temperature of 1e-2: the RGB images from the two passes are aggregated with equal weighting, and the resulting image is a smooth blend of the RGB images from the two peeling passes. In contrast, Fig. 11a is the result of an aggressive aggregation, where the aggregated image is equivalent to the RGB image from the first rendering pass.
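One possible form of such a temperature-controlled aggregation is sketched below. It is an assumed stand-in, not the weighting used in this work: the weights are a softmax over negative depth values, and the background color is used only where no pass produced a fragment. The numeric scale of the temperature `tau` in this sketch is therefore not comparable to the value of 1e-2 reported above. Only the qualitative behavior carries over: a temperature of zero selects the first pass, and larger temperatures blend the passes more evenly.

```python
import math

def aggregate_pixel(layers, background, tau):
    """Blend the per-pass (depth, rgb) entries of one pixel; None marks a pass without a fragment."""
    hits = [layer for layer in layers if layer is not None]
    if not hits:
        return list(background)
    if tau == 0:
        # Aggressive aggregation: only the nearest fragment, i.e. the first pass, contributes.
        nearest = min(hits, key=lambda hit: hit[0])
        return list(nearest[1])
    z_min = min(z for z, _ in hits)
    # Softmax over negative depth; subtracting z_min keeps the exponentials well-behaved.
    exps = [math.exp(-(z - z_min) / tau) for z, _ in hits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * rgb[ch] for w, (_, rgb) in zip(weights, hits)) for ch in range(3)]

layers = [(1.0, (0.6, 0.4, 0.2)), (1.5, (1.0, 1.0, 1.0))]  # wooden block in front, cup behind
background = (0.0, 0.0, 0.0)
sharp = aggregate_pixel(layers, background, tau=0)       # equals the first-pass color
even = aggregate_pixel(layers, background, tau=100.0)    # close to the average of both passes
empty = aggregate_pixel([None, None], background, tau=100.0)
```

Because the blended color depends on the color of the occluded fragment, a loss computed on the aggregated image yields a gradient for the occluded object as well.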

Occluded Object Pose Refinement Using Analytical Gradients
Using aggregated images generated with a temperature of 1e-2 for image comparison provides gradients for occluded objects. We perform object pose refinement iteratively to minimize the pixel-wise loss between the aggregated image and the observed image shown in Fig. 10. We decay the temperature parameter by a factor of 0.95 after each iteration. In Fig. 12, we show the aggregated image, the standard RGB image, and the top view of the corresponding scene during the iteration process. The position of the mug, which is completely occluded behind the wooden block in the initial iteration, is refined to match the observed scene. During the refinement process, we can observe the mug being gradually pulled forward.
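The temperature decay amounts to a geometric schedule, sketched below with the initial value of 1e-2 and the decay factor of 0.95 from the text. The number of iterations and the function name are assumptions chosen for illustration.

```python
def temperature_schedule(initial=1e-2, decay=0.95, iterations=50):
    """Yield the temperature used at each refinement iteration."""
    temperature = initial
    for _ in range(iterations):
        yield temperature
        temperature *= decay

temperatures = list(temperature_schedule())
```

As the temperature shrinks, the aggregated image moves from an even blend of the peeling passes towards the standard rendering, so the later iterations compare increasingly realistic images.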

Learning Scene Decomposition
While performing occluded object pose refinement using analytical gradients works for simple scenes, applying such an approach to complex scenes is non-trivial. Furthermore, it needs multiple iterative refinement steps to converge. An alternative to analytical gradients is to learn scene decomposition. To this end, we propose the Scene Decomposition Model depicted in Fig. 13. It takes the observed RGB image, the RGB-D images rendered according to the current pose estimate, and the RGB-D images of the peeling passes as input. RGB image features are extracted using a pre-trained ResNet model. We use the features after the third ResNet block, i.e., from an input RGB image of height H and width W, the features extracted are of dimension 128 × H/8 × W/8. The depth images are down-scaled and concatenated with the RGB features. The resulting features are of dimension 515 × H/8 × W/8. From these features, the Scene Decomposition Model generates a dense pixel-wise z-estimate for the objects in each peeling layer, employing ResNet-style convolutional blocks without pooling layers and thus maintaining the image dimension of H/8 × W/8. Similar to the analytical optimization described in Section "Occluded Object Pose Refinement Using Analytical Gradients", we demonstrate the learned Scene Decomposition Model by optimizing for the z component of the object translation. The model generates a dense pixel-wise refined z-estimate for the objects in two peeling layers in one forward pass.
Fig. 12 Occluded object pose refinement using scene peeling. a iteration step, b aggregated image, c standard rendered image, d top view of the corresponding scene. The mug is completely occluded by the wooden block in the first iteration, whereas it is pulled forward gradually during the refinement process until it matches the observed image.
The model is trained using supervised learning to minimize the pixel-wise mean squared error (MSE) on a custom-made dataset. The dataset consists of two objects, namely a cup and a wooden block, placed at random positions in the scene such that one object always occludes the other. We then add noise to the z component of the objects' translation to simulate an erroneous z-estimate. We sample uniform z-estimate noise in the range of [−0.5, 0.5] cm. In addition to rendering the RGB-D images of the objects in the original pose (observed image) and in the erroneous pose (rendered image), we render two RGB-D image peels of the objects in the erroneous pose. We use only the RGB images of the objects in the original pose for training the model. The model learns the 3D geometry of the scene from the rendered and the peel RGB-D images. We use the Stillleben library to create the dataset on the fly and train the model for 5000 iterations with a batch size of 32. We create a test set of 1000 scenes for quantitative comparison. To establish a strong competing approach, we create a variant of the Scene Decomposition Model that takes RGB-D observed images instead of RGB observed images as input (516 input feature maps instead of 515). In Table 2, we present the quantitative comparison of the RGB and RGB-D Scene Decomposition Models. The RGB-D model performs only slightly better than the RGB model despite having access to depth information. This demonstrates the ability of the Scene Decomposition Model to learn the 3D geometry of the scene from RGB observations alone. Moreover, both variants perform better for the wooden block than for the cup. This can be attributed to the simple geometric shape of the wooden block compared to the cup, which has a non-convex surface. Qualitative visualizations of the RGB model are shown in Fig. 14.
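The feature dimensions quoted above can be checked with a small bookkeeping sketch. It assumes one reading that is consistent with the stated numbers: each of the four RGB images (observed, rendered, and two peels) contributes 128 feature channels, and each down-scaled depth image contributes one channel. This breakdown and the function names are assumptions, since the text states only the totals.

```python
def input_channels(num_rgb_images, num_depth_images, rgb_feature_channels=128):
    """Channels after concatenating RGB features with one channel per depth image."""
    return num_rgb_images * rgb_feature_channels + num_depth_images

def feature_shape(height, width, channels, stride=8):
    """Shape of the concatenated feature map at 1/8 of the input resolution."""
    return (channels, height // stride, width // stride)

# RGB variant: observed RGB, rendered RGB-D, and two RGB-D peels.
rgb_variant = input_channels(num_rgb_images=4, num_depth_images=3)
# RGB-D variant: the observed image additionally provides a depth channel.
rgbd_variant = input_channels(num_rgb_images=4, num_depth_images=4)
# Example for a hypothetical 480 x 640 input image.
shape = feature_shape(480, 640, rgb_variant)
```

Under this reading, the single extra channel of the RGB-D variant is the only additional information it receives, which fits the small performance gap reported between the two variants.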

Conclusion
We presented StilllebenDR and the end-to-end differentiable pipeline for joint 3D deformable registration and pose refinement from single-view RGB images using StilllebenDR, introduced in our previous work (Periyasamy et al. [10]). Because it is designed as a light-weight extension to the Stillleben library, which uses the OpenGL graphics pipeline for efficient rendering, StilllebenDR is fast and scalable. Furthermore, we introduced occluded object pose refinement based on order-independent transparency using StilllebenDR, employing analytical gradients. Aggregating RGB images from multiple depth peeling passes facilitates gradient flow to objects that are completely occluded. Moreover, we introduced learned scene decomposition, an efficient alternative to iterative analytical gradient-based optimization.