S²P³: Self-Supervised Polarimetric Pose Prediction

This paper proposes the first self-supervised 6D object pose prediction method from multimodal RGB and polarimetric images. The novel training paradigm comprises 1) a physical model to extract geometric information of polarized light, 2) a teacher-student knowledge distillation scheme, and 3) a self-supervised loss formulation through differentiable rendering and an invertible physical constraint. Both networks leverage the physical properties of polarized light to learn robust geometric representations by encoding shape priors and polarization characteristics derived from our physical model. Geometric pseudo-labels from the teacher support the student network without the need for annotated real data. Dense appearance and geometric information of objects is obtained through a differentiable renderer with the predicted pose for self-supervised direct coupling. The student network additionally features our proposed invertible formulation of the physical shape priors, which enables end-to-end self-supervised training by comparing physically derived polarization characteristics against the polarimetric input images. We specifically focus on photometrically challenging objects with texture-less or reflective surfaces and transparent materials, for which the most prominent performance gain is reported.

1 Introduction

"Fiat lux, et facta est lux."¹ Light has been the foundation of many significant scientific findings in history. Early horological devices utilized changing shadows cast by the sun to measure time throughout centuries across different civilizations all over the globe. Based on the constant speed of the electromagnetic (EM) wave with which light travels, it is possible to determine the distance of an object after emitting a light pulse by measuring its return time after reflection: a principle used in many active depth sensors. However, measurements are affected by artifacts such as multi-path interference (MPI) (Y. Cui, Schuon, Chan, Thrun, & Theobalt, 2010) due to reflective materials, ambient light (Jung, Brasch, Leonardis, Navab, & Busam, 2021), or inherently incorrect estimates when the light passes through transparent objects such as glass. This leads to inaccurate depth estimates, most noticeable for photometrically challenging objects (Jung et al., 2022). Still, many methods that learn geometric tasks from images use such geometry information from depth data.

¹ Latin for "let there be light, and there was light".

Fig. 1 S²P³ Pipeline Overview. Our proposed teacher-student training scheme takes four polarization images taken under different polarization filter angles, as well as polarimetric and geometrical representations derived from the analytical physical model, as multi-modal inputs to both the teacher and student networks individually. The student network is optimized not only towards the pseudo labels generated from the teacher, denoted as L_pseudo, but also by L_physics, which minimizes the discrepancy between the polarimetric representation ρ from the input images after the analytical physical model (cf. Inputs) and ρ̂ derived through the inverted physical model from the predicted surface normal of the student network.
6D object pose estimation is one of these geometric tasks and is essential in many computer vision and AR applications, ranging from robotics (P. Wang et al., 2021) to safety-critical autonomous driving (Ost, Mannan, Thuerey, Knodt, & Heide, 2021) and medical applications (Busam et al., 2018). Recent methods integrate geometric information either directly as input (He, Huang, Fan, Chen, & Sun, 2021) or leverage it for self-supervision (G. Wang, Manhardt, Liu, Ji, & Tombari, 2021). Reliable geometric cues can improve pose estimation performance, while unreliable and noisy depth information interferes with the representations a neural network has learned to extract.
Recent approaches integrate the geometric information of polarized light by learning features from both the normals estimated from polarization and their polarization characteristics for the task of 6D object pose estimation in a supervised way (Gao et al., 2022). For photometrically complex objects, it is shown that deteriorated depth measurements are even inferior to this modality, ultimately making the direct geometric measurement obsolete. The authors report impressive results for texture-less, reflective, and translucent objects, outperforming state-of-the-art RGB-only (G. Wang, Manhardt, Tombari, & Ji, 2021) and RGB-D (He et al., 2021) methods. However, an extensive training dataset with ground-truth annotations is required, which may be challenging to obtain in practice, especially with high accuracy (P. Wang et al., 2022).
In S²P³, we study how a neural network can encode the geometric shape priors from polarized light captured with a multi-modal polarization camera for the task of 6D object pose estimation without the need for annotated real data. We leverage the aforementioned supervised polarimetric 6D object pose estimation method (Gao et al., 2022) as a teacher network and pre-train it on synthetically rendered polarimetric image data only. We then utilize its noisy predictions on real data to support a student network with weak labels for guidance. A differentiable renderer is employed to enable self-supervision with dense geometric cues. Additionally, we propose an invertible formulation of the physical polarization model to analytically compute pixel-wise image characteristics from the geometric normal representation after the differentiable rendering of the student with the predicted 6D pose. This analytic inversion closes the self-supervision loop and allows for direct comparison with the input polarization, as illustrated in Figure 1.
While we adopt the architecture of PPP-Net for the teacher network with an additional differentiable renderer, different from (Gao et al., 2022), we use this network to train on synthetic data only. This pre-trained model then produces 6D object pose predictions on real data, which are leveraged in our proposed teacher-student scheme as weak labels. The teacher network, based on PPP-Net, is thus merely one element of the overall method S²P³ introduced here.
Inspired by the advancements in self-supervised learning and the use of differentiable renderers in end-to-end learning pipelines, e.g., in Self6D++ (G. Wang, Manhardt, Liu, et al., 2021), we transfer such knowledge to the multimodal imaging domain of polarization. Unlike Self6D++, where a renderer produces geometric information in terms of a depth map that is then compared against a presumably noisy depth map from an active depth sensor, we carefully study the physical properties of light and integrate encoded shape priors into a self-supervised scheme, as explained in later sections. This is possible through the differentiable analytical derivation of the physical properties from surface normal information.
The full pipeline of S²P³ thus a) includes novel architectural designs for the encoding of physical shape priors that extend the findings from PPP-Net to a student-teacher scheme; b) integrates RGB-agnostic shape information as surface normal maps from a differentiable renderer, offering a more resilient alternative to the issues posed by active depth sensors for photometrically complex objects; c) entails weak pseudo-labels in the form of geometric and pose information for self-supervision from the teacher network; and, most notably, d) proposes an inverted physical model to leverage shape priors. The lightweight student network predicts a pose and encodes it into a surface normal representation through a differentiable renderer. This encoded representation is then utilized to derive the object's analytical polarimetric representation. By integrating this representation into a new physical loss, we achieve complete end-to-end self-supervision using raw polarimetric images.
To this end, we contribute in summary:
1. S²P³ as a hybrid neural-physics approach to learn 6D object pose prediction with photometric challenges through self-supervision with neural encodings of geometric shape priors from multi-modal data.
2. Insights on the interplay of differentiable rendering with the invertible physical model through extensive experiments on objects of varying photometric complexity.

Related Work
We review related work in the realm of polarimetric imaging and 6D object pose estimation, including relevant datasets and recent self-supervised approaches, to provide a solid overview of the research field.

Polarimetric Imaging
Early works on shape from polarization (SfP) investigate how the relation between polarization and the object's surface can be used to estimate surface normals and depth information, but focus on lab scenarios with controlled environmental conditions (Atkinson & Hancock, 2006; Garcia, De Erausquin, Edmiston, & Gruev, 2015; Smith, Ramamoorthi, & Tozza, 2018; Yu, Zhu, & Smith, 2017). These methods rely only on monocular polarization images, but multiple views can also be used for SfP (Atkinson & Hancock, 2005; Z. Cui, Gu, Shi, Tan, & Kautz, 2017), also extending to depth estimation (Verdie et al., 2022) from a freely moving camera. In (Verdie et al., 2022), the goal is to predict dense depth for outdoor scenes with photometrically easy objects in a (partly) supervised manner with depth measurements from an active structured light sensor, while leveraging multi-modal input to account for other artefacts that affect depth predictions. Polarimetric images are also combined with photometric information from either stereo (Atkinson, 2017) or monocular RGB (Zhu & Smith, 2019) to complement each other for depth predictions. Polarized light can also improve initial noisy depth maps from other sensors (Kadambi, Taamazyan, Shi, & Raskar, 2017). (Ba et al., 2020) compute a set of plausible cues from polarimetric images to predict surface normals with a neural network which can disambiguate such cues for SfP. (Lei et al., 2022) present a method for scene-level surface normal estimation from a single polarization image; by introducing a unique real-world dataset and employing an advanced neural architecture with a multi-head self-attention module and viewing encoding, they achieve superior performance in complex scenes. Our approach is inspired by these findings and complements pose estimation with shape priors from physical properties extracted from polarized light.
In S²P³, we build upon these findings to directly regress the object pose.

Self-Supervision
Self-supervised learning avoids the problem of lacking properly labeled data. In the realm of 6D pose estimation, differentiable rendering is used to render synthetic images with a predicted pose that are compared against input images (Sock, Garcia-Hernando, Armagan, & Kim, 2020). Self6D (G. Wang et al., 2020) proposes such an approach, where a network is first trained on synthetic RGB data and then fine-tuned on real RGB-D data without pose annotations in a self-supervised manner. They use depth data to align the visual and geometric cues, which forms the core of the self-supervision stage.
Building on top of Self6D, Self6D++ (G. Wang, Manhardt, Liu, et al., 2021) replaces the one-stage pose regression backbone with the two-stage GDR-Net backbone (G. Wang, Manhardt, Tombari, & Ji, 2021) and additionally introduces a pose refiner on top of the teacher network to improve accuracy and robustness towards occlusions.

Polarimetric 6D Pose Prediction
With recently published annotated datasets for real-world polarimetric category-level (P. Wang et al., 2022) and instance-level (Gao et al., 2022) 6D pose estimation, it is now possible to study methods with this mostly unexplored imaging modality (Jung et al., n.d.). PPP-Net (Gao et al., 2022) investigates the advantages of using polarization for supervised object pose estimation and designs a hybrid pipeline leveraging polarization through a combination of physical model cues with learning, yielding impressive performance for photometrically challenging objects when compared against RGB and RGB-D baselines. However, acquiring real training data with accurate annotations is still difficult and not easily reproducible for other scholars without complex and expensive hardware (Gao et al., 2022; P. Wang et al., 2022). Inspired by the strengths of polarimetric information in supervised learning, we investigate the logical, yet non-trivial, next step of exploring how this interesting modality can be integrated into a self-supervised scheme to reduce the need for annotated data. Different from Self6D (G. Wang et al., 2020) and Self6D++ (G. Wang, Manhardt, Liu, et al., 2021), we leverage polarimetric images and extend the differentiable renderer to yield, besides appearance information, geometric representations in terms of normal maps of the object of interest. We further utilize this representation to compute polarimetric properties used for additional self-supervision through our proposed invertible physical model. To the best of our knowledge, we present the first method to utilize the geometric information from polarization in a self-supervised learning scheme.

Polarimetric Physical Model
Commonly used sensors in computer vision send or receive light to measure the wavelength and energy within some specific spectrum. In addition to this information, the relative oscillation of the electromagnetic wave defines its polarization. Emitted unpolarized natural light becomes polarized after being reflected from a surface; hence it carries information about the object's surface characteristics. The utilization of RGB-D sensors in pose estimation has gained popularity owing to their cost-effectiveness and easy integration into various devices. These sensors utilize active illumination for depth measurement, either through projection of a pattern or time-of-flight measurements. However, they are prone to photometric challenges such as translucency and reflections that can result in erroneous depth estimates. This paper presents a solution to these challenges through the use of surface normals derived from the polarization measurements of an RGB-P sensor (refer to Figure 2). After discussing some issues of RGB-D sensors, this section introduces how the aforementioned information can be measured with a passive sensor with integrated polarization filters. We then introduce how the physical model computes geometric shape priors from the information encoded in the polarimetric images and how our invertible formulation is integrated into our network architecture to enable direct self-supervision.

Photometric Challenges for RGB-D
Commercial depth sensors rely on photometric measurements to estimate depth, using active illumination either by projecting a pattern (e.g., Intel RealSense D series) or using time-of-flight (ToF) measurements (e.g., Kinect v2 / Azure Kinect, Intel RealSense L series). This makes them susceptible to challenges such as reflections and translucency, which can artificially extend the round-trip time of photons or deteriorate the projected pattern. As a result, accurate depth estimation becomes infeasible in such scenarios, as illustrated in Figure 3 for a set of common household objects. The ToF sensor (RealSense L515) used in the experiment struggles to detect the semi-transparent vase, which appears almost invisible to the sensor. Additionally, reflections on the cutlery and the can cause the sensor to generate depth estimates that are significantly further from the true value, while strong reflections at boundaries result in invalidated pixel distances.

Surface Normals from Polarization
Most artificial and natural light is unpolarized, meaning the electromagnetic wave oscillates along all planes perpendicular to the direction of propagation of the light (Fließbach, 2012). When unpolarized light passes through a linear polarizer or is reflected at Brewster's angle from a surface, it becomes perfectly polarized. The refractive index of a material determines how fast light travels through it, how much of it is reflected, and the Brewster's angle of that medium. When light is reflected at the same angle to the surface normal as the incident ray, we call it specular reflection. The remaining part penetrates the object as refracted light, which becomes partially polarized as it traverses the medium. This light wave escapes from the object and creates diffuse reflection (Fließbach, 2012). Figure 4 provides an example that illustrates these concepts.
For real physical objects, the resulting reflection is a combination of specular and diffuse reflection, where the ratio largely depends on the refractive index and the angle of incident light. We propose to use surface normals obtained from polarization to overcome the photometric challenges faced by RGB-D sensors. Our method can be applied to various applications, including pose estimation, where accurate 3D information is crucial.
Fig. 4 Degree of Polarization. The polarization of light changes when it reflects off a translucent surface, resulting in differences in the polarimetric image quadruplet with different polarization angles (P0-P3), which are directly related to the surface normal. In particular, the degree of polarization (DoP) for both the translucent and reflective surfaces is considerably higher than for the rest of the image, as shown in the indicated areas.

Image Formation Model
We present the fundamental polarization image formation model and our invertible physical model that links the polarimetric and geometrical representations. When light with a specific intensity I and wavelength λ reaches the sensor, it passes through the color filter array (CFA), which separates the light into RGB wavebands, as shown in Figure 2. The incoming light also has a degree of polarization (DoP) ρ and an angle of polarization (AoP) ϕ. As light passes through a polarizer array on top of a pixel unit with four different polarization angles φ_pol ∈ {0°, 45°, 90°, 135°}, the oscillation state of light is recorded alongside its wavelength and energy (Kalra et al., 2020). The polarization image formation model in Equation 1 defines the underlying parameters that contribute to the captured polarized intensities as:

I_φpol = I_un (1 + ρ cos(2(ϕ − φ_pol))),   (1)

where the unpolarized intensity I_un can be computed by averaging the polarized intensities I_φpol under the different polarization filter angles φ_pol ∈ {0°, 45°, 90°, 135°}. Expanding Equation 1 for the four filter angles yields the linear system

I_φpol = x_1 + x_2 cos(2φ_pol) + x_3 sin(2φ_pol),   (2)

where the unknowns x_i represent x_1 = I_un, x_2 = I_un ρ cos 2ϕ, and x_3 = I_un ρ sin 2ϕ. We find ϕ and ρ from this over-determined system of linear equations using linear least squares. Depending on the surface properties, the AoP is calculated as:

ϕ = α [π] (diffuse),   ϕ = α − π/2 [π] (specular),   (3)

where [π] indicates the π-ambiguity and α is the azimuth angle of the surface normal n. We can further relate the viewing angle θ ∈ [0, π/2] to the degree of polarization by considering Fresnel coefficients; the DoP is thus given by (Atkinson & Hancock, 2006):

ρ_d = (η − 1/η)² sin²θ / (2 + 2η² − (η + 1/η)² sin²θ + 4 cos θ √(η² − sin²θ)),
ρ_s = 2 sin²θ cos θ √(η² − sin²θ) / (η² − sin²θ − η² sin²θ + 2 sin⁴θ),   (4)

with the refractive index η of the observed object material. Solving Equation 4 for θ, we retrieve three solutions θ_d, θ_s1, θ_s2: one for the diffuse case and two for the specular case. For each case, we can then find the 3D orientation of the surface by calculating the surface normal:

n = (cos α sin θ, sin α sin θ, cos θ)^T.   (5)

We use these plausible normals n_d, n_s1, n_s2 as physical priors per pixel as input to the neural network.
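As a concrete illustration, the over-determined linear system of Equation 2 can be solved per pixel with linear least squares. The sketch below is our own illustrative code under the stated model, not the paper's implementation; function and variable names are assumptions:

```python
import numpy as np

def polarization_from_quadruplet(I, angles_deg=(0.0, 45.0, 90.0, 135.0)):
    """Recover unpolarized intensity I_un, DoP rho and AoP phi per pixel
    from four intensities I of shape (4, H, W) captured under different
    polarization filter angles (cf. Equations 1 and 2)."""
    phis = np.deg2rad(np.asarray(angles_deg))
    # Design matrix of I_phi = x1 + x2 cos(2 phi_pol) + x3 sin(2 phi_pol)
    A = np.stack([np.ones_like(phis), np.cos(2 * phis), np.sin(2 * phis)], axis=1)
    b = I.reshape(4, -1)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)      # least squares, shape (3, H*W)
    I_un = x[0]
    rho = np.sqrt(x[1] ** 2 + x[2] ** 2) / np.maximum(I_un, 1e-8)
    phi = (0.5 * np.arctan2(x[2], x[1])) % np.pi   # AoP, up to the pi-ambiguity
    H, W = I.shape[1:]
    return I_un.reshape(H, W), rho.reshape(H, W), phi.reshape(H, W)
```

With four filter angles and three unknowns, the system is over-determined, so the least-squares fit also averages out per-pixel noise.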
With the help of the physical model defined by Equations 1 and 2, we can now derive physical polarimetric characteristics which encode shape information as geometric normals. More formally, when light gets reflected by the object's surface, the shape information is encoded in the captured polarization intensities. The physical model in our pipeline reveals this implicitly encoded shape information to provide object-centric priors orthogonal to intensity information. We derive a set of explicit object shape priors N_i based on the polarimetric intensities I_φpol and properties ρ, ϕ, as in (Ba et al., 2020; Zou et al., 2020). The ambiguities within this process lead to non-unique solutions as in (Ba et al., 2020), yet we encode them in a pixel-exclusive manner to guide the network to distinguish between different priors and extract meaningful geometrical features.

Invertible Physical Model
Inverting the model and assuming a given normal map of an object, e.g., from a differentiable renderer with an estimated 6D pose as in our training scheme, we define an invertible solution that solves for the polarimetric representation analytically. This closes the loop from the network's prediction: the object's pose, parameterized as a 6D transformation, is transferred through a differentiable renderer into a geometric form and further into encoded physical properties of light reflections that can be compared against the original input information in a self-supervised scheme.
The inverted physical model aims to bring loop closure from the other end by taking the rendered object surface normal map to analytical polarimetric parameters considering different reflection properties. We obtain the viewing angle θ_v from cos θ_v = n · v, where n is the rendered object surface normal map and the viewing vector v is defined as v = −π⁻¹(u, v, K), with π⁻¹ the back-projection operation for pixel (u, v) with camera intrinsics K. The analytical DoP ρ̂ is then derived via the formulations for the diffuse and specular reflection cases:

ρ̂_d = (η − 1/η)² sin²θ_v / (2 + 2η² − (η + 1/η)² sin²θ_v + 4 cos θ_v √(η² − sin²θ_v)),
ρ̂_s = 2 sin²θ_v cos θ_v √(η² − sin²θ_v) / (η² − sin²θ_v − η² sin²θ_v + 2 sin⁴θ_v),   (6)

where η is a constant defined by the refractive index of the object material. The inverted physical model offers the possibility to optimize the model via object shape cues, which is more robust in photometrically challenging scenarios compared to active depth sensors.
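The inversion is straightforward to implement once a rendered normal map is available. The sketch below is an illustrative reimplementation (clamping, epsilon values, and the default refractive index are our assumptions): it back-projects pixels with the camera intrinsics to obtain the viewing cosine and evaluates the analytic diffuse and specular DoP.

```python
import numpy as np

def degree_of_polarization(cos_theta, eta=1.5):
    """Analytic diffuse and specular DoP as a function of the viewing-angle
    cosine and the refractive index eta (cf. Equation 6)."""
    cos_t = np.clip(cos_theta, 0.0, 1.0)
    sin2 = 1.0 - cos_t ** 2
    root = np.sqrt(np.maximum(eta ** 2 - sin2, 0.0))
    rho_d = ((eta - 1.0 / eta) ** 2 * sin2) / (
        2.0 + 2.0 * eta ** 2 - (eta + 1.0 / eta) ** 2 * sin2 + 4.0 * cos_t * root)
    rho_s = (2.0 * sin2 * cos_t * root) / (
        eta ** 2 - sin2 - eta ** 2 * sin2 + 2.0 * sin2 ** 2)
    return rho_d, rho_s

def viewing_cosine(normals, pixels, K):
    """cos(theta_v) = n . v, with v the negated, normalized back-projected
    ray through each pixel; normals (3, N), pixels (2, N)."""
    rays = np.linalg.inv(K) @ np.vstack([pixels, np.ones((1, pixels.shape[1]))])
    v = -rays / np.linalg.norm(rays, axis=0, keepdims=True)
    return np.sum(normals * v, axis=0)
```

In a PyTorch implementation these element-wise operations remain differentiable, which is what lets the physics loss back-propagate through the rendered normals.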

Methodology
The objective of S²P³ is to achieve 6D object pose prediction without relying on annotated real data. To accomplish this, a teacher-student training approach is suggested, which utilizes pre-training on synthetic data and pseudo-labels from the teacher during self-supervision, as depicted in Figure 1. By additionally incorporating the proposed invertible physical model for self-supervision, S²P³ makes full use of the geometric data encoded in the polarimetric images. This section outlines the hybrid polarization-based pipeline for learning object pose and explains the physics-induced self-supervision approach in detail.
4.1 S²P³ Network Architecture

S²P³, consisting of a teacher network (cf. Figure 5) with a larger capacity and a light student network (cf. Figure 6), is illustrated in Figure 7 as a schematic overview. Both networks are pre-trained on synthetic data, whereas the teacher later provides pseudo labels on real data to guide the student network in a self-supervised manner. The detailed architecture illustrates essential extensions, modifications, and important design choices of S²P³ compared against established student-teacher training schemes in the community of 6D object pose estimation (G. Wang, Manhardt, Liu, et al., 2021). These are explained in detail in the following and justified with ablations in our experiments section.

Teacher Network
Inspired by the architecture of PPP-Net (Gao et al., 2022), we propose our polarimetric network with an extended differentiable renderer as the teacher of S²P³ (cf. Figure 5). Here, the inputs of polarimetric intensities and geometrical shape priors are encoded through separate input heads, followed by an explicit decoder to predict an object mask M̃_t, an object normal map Ñ_t, and the dense correspondences as a normalized object coordinate map M̃xyz_t. The spatial and shape correlation of M̃xyz_t and Ñ_t serve as inputs to an object pose estimation module (G. Wang, Manhardt, Tombari, & Ji, 2021), in which the predicted rotation is parameterized in the form of the allocentric continuous 6D representation (Zhou, Barnes, Lu, Yang, & Li, 2019) and the predicted translation as a scale-invariant vector (Li et al., 2019). We further convert them into a standard rotation matrix R̃_t ∈ R^{3×3} and a translation vector t̃_t ∈ R^3 and denote the final pose as P̃_t = [R̃_t | t̃_t]. Here, we extend the neural network of PPP-Net: to compute pixel-wise geometrical pseudo labels from the predicted pose, a differentiable renderer takes the object's CAD model and P̃_t as inputs to render an object mask M̃R_t and an object normal map ÑR_t. All the predicted and rendered quantities serve as weak pseudo labels for the student network.
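The continuous 6D rotation representation (Zhou, Barnes, Lu, Yang, & Li, 2019) maps six numbers to a valid rotation matrix via Gram-Schmidt orthogonalization. A minimal NumPy sketch of this mapping (the allocentric-to-egocentric conversion and batching are omitted):

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Map the continuous 6D rotation representation (two 3-vectors) to a
    rotation matrix via Gram-Schmidt (Zhou et al., 2019)."""
    a1, a2 = np.asarray(x6[:3], float), np.asarray(x6[3:], float)
    b1 = a1 / np.linalg.norm(a1)
    a2_orth = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    b3 = np.cross(b1, b2)                    # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)    # columns b1, b2, b3
```

The map is surjective onto SO(3) and continuous, which avoids the discontinuities of quaternion or Euler-angle regression targets.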

Student Network
We propose a lightweight student network without an explicit geometric decoder, different from Self6D++ (G. Wang, Manhardt, Liu, et al., 2021): the network directly regresses the predicted student pose P̂_s (cf. Figure 6). This also favors fast inference while maintaining high accuracy. Our ablations, discussed later in Table 4, indicate the superiority of our student network design. The teacher network consists of about 5.5 million weights, whereas our lightweight student does not need the explicit decoder, reducing the network to about 5 million weights. While the number of parameters is not significantly reduced, the inference time and also the pose prediction accuracy are greatly improved by not predicting the intermediate geometric representations, as discussed later in the results section. We test this against the design choice of Self6D++ (G. Wang, Manhardt, Liu, et al., 2021) of having the student network identical to the teacher but without a subsequent pose refiner. With our proposed self-supervision, our student network converges towards better predictions without the redundant explicit prediction of intermediate geometric representations. The final output of our student in S²P³ thus only consists of the predicted pose P̂_s.
To link the predictions with geometric and polarimetric properties, we render an object normal map N̂_s and an object mask M̂_s given P̂_s via the differentiable renderer, analogous to the teacher network. In the following, we detail how this polarimetric representation of the geometric information is utilized in a self-supervised loss term.

Physics-Induced Self-Supervised Training Scheme
As detailed before, the polarimetric images contain rich information that we provide as explicit representations to the network to learn neural geometric encodings. This section defines how these representations are further leveraged and integrated into our physics-induced self-supervised scheme: first through implicit and explicit weak pseudo-labels of the teacher network, and second as direct coupling by closing the loop towards the input information of the pipeline.

Loss Formulations
Our proposed optimization scheme comprises two complementary paradigms. The first passes knowledge of the pre-trained teacher to the student in the form of weak labels of the pose P̃_t and related object shape knowledge {M̃_t, Ñ_t, M̃R_t, ÑR_t}, which we define as the pseudo label loss L_pseudo. The second utilizes the inverted physical model to optimize the student prediction P̂_s via raw polarization data in our physical loss term L_physics, detailed below.
To account for potential misalignment between the decoded shape knowledge {M̃_t, Ñ_t} and pose knowledge P̃_t, we compare the predicted mask M̃_t and the rendered mask M̃R_t and normalize the discrepancy to a scalar value δ, which serves as the criterion for choosing the pseudo ground truth for the geometrical regularization term L_geo and as a dynamic weighting term in the overall learning objective. The final formulation is then:

L_pseudo = λ_1 L_pose + L_geo,   (7)

with:

L_geo = L_mask(M̂_s, M*_t) + L_normals(N̂_s, N*_t),   (8)

in which we define L_mask as a mean squared error and L_normals as a cosine similarity loss. The rendered representations {M̃R_t, ÑR_t} are chosen as the geometrical pseudo ground truth {M*_t, N*_t} if δ is within a predefined threshold r; otherwise, the predicted representations are selected, also leading to a reduced weighting factor λ_1 = (1 − δ) on the direct pseudo pose loss L_pose.
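The selection logic described above might be implemented as follows. This is a hypothetical sketch: the exact normalization of δ and the combination of the loss terms are not fully specified in the text, so the names, threshold, and weighting here are assumptions.

```python
import numpy as np

def pseudo_label_loss(mask_s, norm_s, pose_err,
                      mask_t, norm_t, mask_rt, norm_rt, r=0.1):
    """Sketch of L_pseudo: the teacher's predicted vs. rendered mask
    discrepancy delta selects the geometrical pseudo ground truth and
    down-weights the direct pose term when the teacher is inconsistent."""
    delta = np.mean((mask_t - mask_rt) ** 2)           # normalized discrepancy
    if delta <= r:                                     # teacher consistent:
        gt_mask, gt_norm, lam1 = mask_rt, norm_rt, 1.0 # trust rendered labels
    else:                                              # fall back to predictions
        gt_mask, gt_norm, lam1 = mask_t, norm_t, 1.0 - delta
    l_mask = np.mean((mask_s - gt_mask) ** 2)          # MSE on masks
    cos = np.sum(norm_s * gt_norm, axis=0) / (
        np.linalg.norm(norm_s, axis=0) * np.linalg.norm(gt_norm, axis=0) + 1e-8)
    l_normals = np.mean(1.0 - cos)                     # cosine similarity loss
    return lam1 * pose_err + l_mask + l_normals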

Physical Constraints
To enable self-supervision via the invertible physical model, the rendered geometric normal map N̂_s serves as input to solve for the analytical diffuse and specular DoP {ρ̂_d, ρ̂_s} according to Equation 6.
To benefit from the underlying physical process of polarimetric imaging, L_physics deploys a pixel-wise minimum selection mechanism inspired by (Verdie et al., 2022):

L_physics = (1 / |M̂_s|) Σ_{p ∈ M̂_s} min(|ρ̂_d(p) − ρ(p)|, |ρ̂_s(p) − ρ(p)|).   (10)

To avoid the domain gap between an analytically solved intensity map and the real polarimetric images as in (Verdie et al., 2022), we directly formulate the loss function based on polarimetric properties instead of polarimetric intensities.
Hence, the student's output is optimized to align with raw DoP ρ from real polarization images.
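The pixel-wise minimum selection over the diffuse and specular DoP hypotheses can be sketched as below (illustrative code; the masking and normalization details are our assumptions):

```python
import numpy as np

def physics_loss(rho_d_hat, rho_s_hat, rho_raw, mask):
    """Pixel-wise minimum selection between the analytic diffuse and
    specular DoP hypotheses against the raw DoP from the input images."""
    err = np.minimum(np.abs(rho_d_hat - rho_raw), np.abs(rho_s_hat - rho_raw))
    return np.sum(err * mask) / np.maximum(mask.sum(), 1.0)
```

The minimum selection means the loss never penalizes a pixel for being explained by the "wrong" reflection type, which matters because real surfaces mix diffuse and specular reflection per pixel.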
The overall loss combines the knowledge from the teacher and the raw data as:

L = L_pseudo + L_physics.   (11)

Experimental Results
We perform extensive evaluations and ablations on the instance-level polarimetric 6D pose dataset introduced with PPP-Net (Gao et al., 2022).

Synthetic Data Generation
Given a CAD model of an object, we randomly sample camera locations on its upper hemisphere for rendering. To further enforce realistic renderings and to reduce the domain gap, we set up backgrounds with different textures and lighting positions in the Mitsuba 2 renderer (Nimier-David, Vicini, Zeltner, & Jakob, 2019) to acquire 200-800 sets of polarization images for each object. We present illustrations of our synthetic dataset for different viewpoints in Figure 8 to show the variety of sampled poses, objects of different photometric complexity, and their appearance in the image. The synthetic dataset is used to pre-train the teacher and student networks. We render a set of four polarimetric images with different angles of the polarization filter, matching the camera used in the real setup.
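Sampling camera poses on the upper hemisphere can be sketched as below (illustrative only; the radius and elevation ranges are our assumptions, not the paper's values):

```python
import numpy as np

def sample_camera_pose(rng, radius_range=(0.4, 0.8)):
    """Sample a camera position on the upper hemisphere around an object
    at the origin and build a look-at rotation towards it."""
    r = rng.uniform(*radius_range)
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    elevation = rng.uniform(0.1, 0.5 * np.pi)          # stay above the plane
    cam = r * np.array([np.cos(azimuth) * np.cos(elevation),
                        np.sin(azimuth) * np.cos(elevation),
                        np.sin(elevation)])
    z = -cam / np.linalg.norm(cam)                     # optical axis towards origin
    x = np.cross(np.array([0.0, 0.0, 1.0]), z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1), cam            # rotation (columns), position
```

Each sampled pose would then parameterize the renderer's sensor before ray tracing the polarimetric image quadruplet.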
As rendering is very time-consuming, we provide the dataset². We also train a customized object detector on synthetic data to later provide predicted masks and bounding boxes in the real domain.

² https://daoyig.github.io/
We present samples of our real polarimetric dataset with annotated object poses in Figure 9. The objects rendered using ground truth pose labels indicate the high quality of data annotation, and the object models rendered in white color indicate their textureless nature, which supports our design of removing the need for color texture supervision.

S 2 P 3 Training
We detail the two phases of the training: "Synthetic Pre-Training" on rendered data and "Self-supervised Training on Real Data." The former uses synthetic data with 6D pose annotations for supervised pre-training of the teacher and student networks individually. In the latter phase, we use real data to train the student network in a self-supervised fashion by leveraging our proposed novel training scheme and loss function.

Synthetic Pre-Training
Both the teacher and student models go through a pre-training phase in which they receive supervision exclusively based on the 6D pose information derived from ground truth annotations of the synthetic data. During this phase, the loss function has a two-part structure: an L1 loss is utilized for the translation, while a point matching loss is applied for the rotation. Notably, the differentiable renderer is not integrated into this pre-training stage. In terms of computational time, the pre-training process takes several hours, typically ranging from 4 to 5 hours per object. Subsequently, the self-supervised phase is more time-intensive, demanding approximately 10 hours per object.
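The two-part pre-training objective can be sketched as follows (an illustrative reading of the description above; the relative weighting of the two terms is an assumption):

```python
import numpy as np

def pretrain_pose_loss(pts, R_pred, t_pred, R_gt, t_gt):
    """Two-part pre-training objective: a point matching loss on rotation
    plus an L1 loss on translation; pts is an (N, 3) model point cloud."""
    l_rot = np.mean(np.linalg.norm(pts @ R_pred.T - pts @ R_gt.T, axis=1))
    l_trans = np.mean(np.abs(t_pred - t_gt))
    return l_rot + l_trans
```

Formulating the rotation term over transformed model points, rather than on rotation parameters directly, ties the penalty to the object's actual geometry.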

Self-supervised Training on Real Data
We evaluate our method on a specific data split of the instance-level 6D pose estimation dataset introduced in (Gao et al., 2022), containing objects of varying photometric complexity with highly accurate annotations from robotic forward kinematics. The RGB-P data is acquired with the polarization camera Phoenix 5.0 MP PHX050S1-QC comprising a Sony IMX264MYR sensor. As the amount of real data differs between objects, we follow common practice in the instance-level object pose estimation literature by sampling around 15%-20% of the total data for training and the rest for testing (Gao et al., 2022), which results in 200-300 sets of real polarization images as training data for each object, and 1000-2000 sets of images for testing.
We ensure that the pose distribution of the rendered synthetic data split is similar to that of the real domain, so that the domain shift and the influence of our proposed self-supervision scheme can be analyzed comparably later. The predicted bounding box crops out the region containing the object of interest, which is resized to 256 × 256 as input to the networks. The predicted object mask serves as input to the physical model to produce only object-related polarimetric parameters and shape priors.
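For illustration, a crop-and-resize preprocessing step could be sketched as follows (nearest-neighbour sampling; the actual pipeline may use a different interpolation or padding strategy, and the names are our own):

```python
import numpy as np

def crop_and_resize(image, bbox, out_size=256):
    """Crop the predicted bounding-box region (x0, y0, x1, y1) and
    resize it to out_size x out_size via nearest-neighbour sampling."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    # Nearest-neighbour index maps for rows and columns
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows][:, cols]
```

The same crop is applied to the four polarization images and the object mask so that all network inputs stay aligned.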

Implementation Details
We implement our model in PyTorch (Paszke et al., 2019) and train with the ADAM optimizer (Kingma & Ba, 2014) on a commodity desktop PC with an NVIDIA 2080 GPU, an Intel i7 CPU, and 32 GB RAM. The teacher and student networks are trained for 100 epochs per object, both on synthetic and real data. The initial learning rate is set to 1 × 10−4 and halved every 25 epochs. The weights of all encoders are initialized with ImageNet weights. For synthetic pre-training we use a batch size of 8, and for self-supervised training on real data a batch size of 4.
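The learning-rate schedule above (initial rate 1e-4, halved every 25 epochs) is a simple step function; in PyTorch it would presumably correspond to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)`:

```python
def learning_rate(epoch, base_lr=1e-4, step=25, gamma=0.5):
    """Step schedule: the learning rate is multiplied by `gamma`
    every `step` epochs, starting from `base_lr`."""
    return base_lr * gamma ** (epoch // step)
```

Over the 100-epoch training this yields four plateaus: 1e-4, 5e-5, 2.5e-5, and 1.25e-5.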

Refractive Indices
As a material-related coefficient, the refractive index η for each object is listed in Table 1. The index serves as input to both the forward and inverted physical models. The refractive index is assumed to be known, but it has only a minor influence on object pose predictions (cf. also PPP-Net (Gao et al., 2022)). Regarding objects with composite materials, we can observe the plastic cup in our experiments, where the plastic material differs slightly between the beige and green parts and the texture changes as well. Given the results for different refractive indices (cf. Table A6, Refractive Index Ablation, in the Supp. Mat. of PPP-Net (Gao et al., 2022)), we expect that objects composed of different materials can still be handled. An extensive study of composite objects is out of scope for this work, as the PhoCal dataset (P. Wang et al., 2022) does not include other such objects. The results are evaluated using the common Average Distance of Distinguishable Model Points (ADD) metric (Hinterstoisser et al., 2013) for non-symmetric objects, in which 10% of the object's diameter is set as the threshold to judge the average deviation of the transformed model points. For symmetric objects, the average deviation to the closest model points is measured, as in the Average Distance of Indistinguishable Model Points (ADD-S) metric (Hodaň, Matas, & Obdržálek, 2016).
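The ADD and ADD-S metrics can be sketched in a few lines (a straightforward NumPy rendition of the standard definitions; ADD-S uses a brute-force nearest-neighbour search, which is fine for the point counts typically used in evaluation):

```python
import numpy as np

def transform(pts, R, t):
    # Apply a rigid transform (R, t) to an (N, 3) point set
    return pts @ np.asarray(R).T + np.asarray(t)

def add(R_pred, t_pred, R_gt, t_gt, pts):
    """ADD: average distance between corresponding transformed model points."""
    d = transform(pts, R_pred, t_pred) - transform(pts, R_gt, t_gt)
    return np.linalg.norm(d, axis=1).mean()

def add_s(R_pred, t_pred, R_gt, t_gt, pts):
    """ADD-S: average distance to the *closest* transformed model point,
    used for symmetric objects."""
    p = transform(pts, R_pred, t_pred)
    g = transform(pts, R_gt, t_gt)
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)
    return d.min(axis=1).mean()

def recall(distances, diameter, threshold=0.1):
    """Fraction of poses whose ADD(-S) falls below 10% of the object diameter."""
    return (np.asarray(distances) < threshold * diameter).mean()
```

Because ADD-S matches each point to its closest counterpart, it is never larger than ADD for the same pose pair.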

Quantitative Results - Baseline Comparisons
S2P3 proposes to leverage polarimetric information for self-supervised 6D object pose estimation and focuses on photometrically challenging objects, where self-supervised RGB-D methods may fail due to inherent sensor artifacts, and supervised approaches, either RGB-only or RGB-P methods such as PPP-Net (Gao et al., 2022), would require a large amount of annotated real data. Therefore, the experiments are deliberately chosen to analyze the multi-modal self-supervision through physical constraints, its loss functions, as well as the architecture and design choices of the student-teacher scheme in the ablation studies, to yield the best scientific insights into self-supervised polarimetric 6D pose estimation. As such, we compare S2P3 against PPP-Net (Gao et al., 2022) on our data split as a very strong supervised baseline, in order to analyze the effect of self-supervision. PPP-Net already outperforms other strong state-of-the-art RGB-only methods, as reported in (Gao et al., 2022), and is thus a valid upper bound for comparison. Self6D++ (G. Wang, Manhardt, Liu, et al., 2021) is the state-of-the-art self-supervised RGB-D method on many standard benchmark datasets and is thus chosen as a strong baseline for establishing polarimetric self-supervision in S2P3. See also the qualitative results in Figures 10 and 11, as well as cases with object occlusions in Figures 12 and 13, which are discussed later in more detail.
We demonstrate the effectiveness of the self-supervision pipeline with quantitative results in Table 2. Note that PPP-Net (trained on annotated real data) is the identical network we use as our teacher model, but without the differentiable renderer. In our full model S2P3, however, we do not train the teacher in a supervised manner on real data, but only pre-train it on synthetic data. Then the weights of the teacher are frozen and it only provides weak pseudo-labels on real data for the teacher-student scheme. Our model S2P3 consistently outperforms the self-supervised learning-based state-of-the-art RGB-D method Self6D++ (G. Wang, Manhardt, Liu, et al., 2021), and even reaches comparable performance against the fully supervised upper-bound baseline (Gao et al., 2022) for photometrically complex objects.

Table 2 S2P3 Quantitative Results. Average recall of the ADD(-S) metric is reported for different objects with increasing photometric complexity. Self6D++ from (G. Wang, Manhardt, Liu, et al., 2021); PPP-Net from (Gao et al., 2022).

Ablation Studies
Our evaluation comprises several ablation studies to analyze the nuances of our model's components. We assess performance variations between the synthetic and real data domains, particularly in the absence of self-supervision, to answer the following questions: how well can the student and teacher networks perform on real data when trained in a supervised fashion on synthetic or real data, respectively, and how much performance does S2P3 gain when supervised on synthetic data only and self-supervised on real data? We further explore the impact of the student's architecture within the student-teacher paradigm, focusing on whether a lightweight student can match or outperform the teacher when refined on real data. Put simply: do we need a large student model, identical to the teacher network with a decoder and dedicated geometrical predictions, or is the design choice of S2P3 to directly regress the 6D pose with the student beneficial? Additionally, we dissect the influence of the individual loss components, emphasizing the significance of our physically-induced self-supervised loss. Finally, we investigate the role of depth versus polarimetric information, gauging their relative contributions to the model's efficacy.
Ablation on Domain Shift - S2P3's Self-Supervision

The teacher network, which is identical to PPP-Net (Gao et al., 2022) and marked with † in the table, outperforms the smaller student network when trained in a supervised fashion in both scenarios. Our full S2P3 pipeline (where the student is trained self-supervised on real data and the teacher weights are fixed), with our proposed small student network and a teacher that are both only pre-trained on synthetic data (i.e., the synthetically pre-trained networks correspond to the numbers in the lower part of Table 3), achieves impressive results without being trained on annotations from real images, thanks to our proposed self-supervision paradigm. S2P3 even partly outperforms fully supervised training on real data (cf. top rows against S2P3) and achieves results comparable to PPP-Net as the fully supervised upper bound (indicated by †). Notably, the self-supervision of S2P3 improves the results over the synthetically pre-trained student network (cf. Table 3, "Student ⋆" against S2P3). While this trend holds for all objects, the observation is less significant for the fork, which may result from large occlusions of this object in the majority of the data (cf. Figures 12 and 13, where the fork is inside the cup).

Ablation on Network Architecture - Exchanging the Student
We follow the motivation of utilizing a lightweight student for faster inference. We exchange the network architecture of the student in S2P3 with the one normally used as the teacher, i.e., instead of the network in Figure 6 we use the one in Figure 5. The physical constraints improve the large student significantly after self-supervision (cf. the large student with "None" and with "Self-Supervision"). The ablations demonstrate that the lightweight student can achieve better performance than the larger student network after fine-tuning on real data with our self-supervision scheme through the physical constraints L_physics, as employed in S2P3.

Ablation on Loss Terms
We first verify the influence of the various loss terms by training the network without each specific term during the self-supervision stage, as summarized in Table 5. We find that the direct geometrical point-matching loss L_pose is crucial to self-supervision. Without enforcing L_pose for the student against the weak pseudo-labels of the teacher, training easily diverges. The physically-induced self-supervised loss L_physics, derived from our invertible physical derivations, has a larger impact on training results than the geometrical supervision signals from the teacher network, e.g., L_normal and L_mask. The captured real polarimetric images contain more robust underlying object shape information compared to the output of the differentiable renderer.
The overall performance of the model reaches the best accuracy for all objects of varying photometric complexity when all loss ingredients are present, as indicated in the last row of Table 5. These results indicate that convergence of the student can only be guaranteed when the weak labels of the teacher network roughly guide the pose predictions. One explanation for this behavior is that the differentiable renderer would be completely unconstrained without L_pose, thus potentially rendering outputs with pose predictions that fall outside the field of view. Dense supervision of the appearance and geometric representations after differentiable rendering further improves the network's performance, while the boost in pose accuracy is most noticeable with our proposed self-supervised physically-induced loss formulation. The contribution of the self-supervision is also apparent in the qualitative results in Figures 10 and 11. The projected bounding boxes in green show better alignment with the ground truth (blue) after self-supervision, compared against the predictions of the pre-trained teacher (red). Figures 12 and 13 show additional results for cases where part of the object, here fork and knife, is occluded.

Fig. 10 S2P3 Qualitative Results Before and After Self-Supervision. The projected bounding boxes in blue, red, and green represent the ground-truth 6D object poses, the results before, and the results after applying self-supervision, respectively.

Fig. 11 S2P3 Qualitative Results Before and After Self-Supervision (zoomed-in from Figure 10). The projected bounding boxes in blue, red, and green represent the ground-truth 6D object poses, the results before, and the results after applying self-supervision, respectively.

Fig. 12 S2P3 Qualitative Results Before and After Self-Supervision with Occlusions. The projected bounding boxes in blue, red, and green represent the ground-truth 6D object poses, the results before, and the results after applying self-supervision, respectively.

Fig. 13 S2P3 Qualitative Results Before and After Self-Supervision with Occlusions (zoomed-in from Figure 12). The projected bounding boxes in blue, red, and green represent the ground-truth 6D object poses, the results before, and the results after applying self-supervision, respectively.

Ablation on Modalities
RGB-Texture Supervision. For textureless and transparent objects, the rendered object texture is only white, since the objects have no color (cf. also Figure 7 in PhoCal (P. Wang et al., 2022) and Figure 5 in PPP-Net (Gao et al., 2022)). This essentially reduces the RGB-texture loss to the mask loss in our pipeline. Hence, we eliminate the need for texture rendering and instead rely on the physical properties of polarized light.
Depth Supervision. To analyze the importance of accurate and reliable geometric representations for the task of 6D object pose estimation, we train our pipeline with depth maps from a direct time-of-flight (D-ToF) sensor and compare it against the polarimetric S2P3 method with our physically-induced self-supervised loss. For this purpose, we adapt our network with an additional loss term utilizing depth information, while leaving almost all other components unchanged. We let the differentiable renderer of the student network additionally render depth maps D_R given the predicted pose P̂_s, and employ a chamfer distance loss L_chamfer between the point cloud P_R back-projected from the rendered depth D_R and the point cloud P back-projected from the depth map in the polarization camera coordinate system, to optimize for alignment without explicit 3D-3D correspondence registration:

L_chamfer = (1/|P_R|) Σ_{p ∈ P_R} min_{q ∈ P} ||p − q||_2 + (1/|P|) Σ_{q ∈ P} min_{p ∈ P_R} ||q − p||_2.

Besides adding L_chamfer to the pipeline, we remove L_physics to allow a fair comparison between the effectiveness of direct spatial cues from depth and object shape cues from polarimetric physical properties. The results listed in Table 6 indicate that depth cues can be beneficial when their quality is reliable, i.e., the performance on the cup peaks when L_chamfer is introduced to the pipeline. We conduct additional ablations using a pixel-wise depth loss instead of the chamfer distance loss, as reported in Table 6. The experiment illustrates that also with the pixel-wise depth loss, inaccurate depth information injects incorrect geometric guidance into the pipeline, leading to degraded performance on photometrically challenging objects.
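A minimal sketch of the back-projection and the symmetric chamfer distance follows (assuming a standard pinhole intrinsic matrix K; function names are our own, and the brute-force nearest-neighbour search would be replaced by a KD-tree or a GPU kernel in practice):

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) into an (N, 3) point cloud using
    pinhole intrinsics K; pixels with zero depth are discarded."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]

def chamfer(P_r, P):
    """Symmetric chamfer distance: mean nearest-neighbour distance from
    P_r to P plus mean nearest-neighbour distance from P to P_r."""
    d = np.linalg.norm(P_r[:, None, :] - P[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because the chamfer distance matches each point to its nearest neighbour in the other set, no explicit 3D-3D correspondences are needed, which is exactly why it suits the comparison of rendered and sensor-measured point clouds here.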
The inherent limitations of the depth sensor cause severe degradation of the depth quality (Jung et al., 2022). Reflective and semi-transparent objects are measured incorrectly due to their reflective and translucent materials. This is also illustrated in the detailed, large-scale real examples in Figure 14. In such cases, the strong signal from the depth alignment loss introduces incorrect spatial awareness, leading to poor pose prediction performance.
In contrast, the object shape encoded in the polarimetric image modality provides stable geometric information for objects of all material characteristics presented here, across a variety of photometric complexities, e.g., from a matte plastic cup, to reflective stainless-steel cutlery, to translucent and transparent colored glass objects. The analytically retrieved diffuse and specular solutions after the differentiable renderer are stable across all discussed objects. These polarization properties are computed through our invertible model and then utilized in the physics-induced self-supervision scheme against the raw DoP, illustrated in the top left of Figure 14. Please note that L_physics is a pixel-wise minimum loss over the diffuse and specular reflection solutions.
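To make the pixel-wise minimum concrete, the following sketch compares an observed degree of polarization (DoP) map against the two analytic solutions. The diffuse DoP formula is the standard Fresnel-based expression parameterized by the zenith angle θ of the surface normal and the refractive index η; the function and variable names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def dop_diffuse(theta, eta):
    """Degree of polarization for diffuse reflection at zenith angle theta
    (standard Fresnel-based model, cf. Atkinson & Hancock)."""
    s, c = np.sin(theta), np.cos(theta)
    num = (eta - 1.0 / eta) ** 2 * s ** 2
    den = (2 + 2 * eta ** 2 - (eta + 1.0 / eta) ** 2 * s ** 2
           + 4 * c * np.sqrt(eta ** 2 - s ** 2))
    return num / den

def physics_loss(rho_obs, rho_diffuse, rho_specular, mask):
    """Pixel-wise minimum of the residuals against the diffuse and
    specular DoP solutions, averaged over object pixels (sketch)."""
    res = np.minimum(np.abs(rho_obs - rho_diffuse),
                     np.abs(rho_obs - rho_specular))
    return res[mask].mean()
```

Taking the per-pixel minimum means each pixel is explained by whichever reflection mode (diffuse or specular) fits its observed DoP better, which is what makes the constraint robust across matte, reflective, and translucent materials.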

Runtime Analysis
On a desktop PC with an Intel i7 4.20 GHz CPU and an NVIDIA 2080 GPU, given a 512 × 612 image, our student network takes ≈ 7.3 ms to infer the 6D pose of a single object, which is around 30% faster than the teacher model. Additionally, the preprocessing for the physical prior calculation takes 13.0 ms, and the object detection takes 15.4 ms.

Limitations
The performed experiments highlight the importance of reliable geometric priors for the task of 6D object pose estimation. When the depth map is reliable and accurate, the spatial loss term introduced by the source depth map may lead to better performance than purely object-shape-based optimization through polarization. The current model focuses on instance-level pose estimation and does not generalize to objects unseen during training. An interesting future direction is to integrate the idea into a category-level pipeline.

Self-Supervised Polarimetric Pose Prediction
This paper bridges two worlds by combining a hybrid model for polarimetric pose estimation that fuses an invertible physical model with neural shape extraction from data within a self-supervised framework. S2P3 solves instance-level object pose estimation from polarimetric images without annotated real data. In our proposed pipeline, a teacher pre-trained on a small set of synthetic renderings ensures convergence of a lightweight student network through weak pseudo-labels. Our differentiable renderer additionally provides appearance and geometric outputs and enables self-supervision. S2P3 outperforms methods that use depth measurements from active sensors on photometrically challenging objects. We achieve this by carefully integrating distinct design choices into the student-teacher architecture and by proposing our invertible physical model for self-supervision, leveraging XoP properties instead of raw polarimetric data as in (Verdie et al., 2022) to reduce the domain gap. Our contributions are validated through extensive ablation studies.
Our experimental results show the importance of self-supervision through geometric and physical cues for the task of 6D pose estimation and yield scientific insights into the robustness of polarimetric images. These observations are most noticeable for photometrically challenging, texture-less, reflective, or translucent objects.

Fig. 2 Polarization Camera. When unpolarized light reflects off an object surface, the resulting reflection comprises both a refracted and a reflected part, both of which are partially polarized. A polarization sensor captures this reflected light. In front of each sensor pixel there are four polarization filters (PF) arranged at different angles: 0°, 45°, 90°, and 135°. Additionally, a colour filter array (CFA) separates the reflected light into different wavebands.

Fig. 3 Depth Artifacts. The RealSense L515 depth sensor exhibits miscalculations in depth values for common household objects. Specifically, boundaries (1, 3) invalidate pixels, and strong reflections (2, 3) lead to incorrect depth estimates that are too far from the true value. For semi-transparent objects like the vase (4), the depth sensor has difficulty detecting them, resulting in partially invisible objects and inaccurate distance measurements to objects behind them.

Fig. 5 S2P3 Teacher Network. The network takes the shape priors and polarimetric representations, both derived from the analytical physical model from four polarized images, as input. Before retrieving the 6D object pose, intermediate geometrical representations are predicted. A differentiable renderer utilizes the predicted pose to provide a rendered normal map and object mask.

Fig. 6 S2P3 Student Network. Different from the teacher network in Figure 5, the student is more lightweight, as it omits the explicit decoding of predicted geometric representations.

Fig. 7 S2P3 Pipeline Overview. Our proposed teacher-student training scheme takes four polarization images taken under different polarization filter angles, as well as polarimetric and geometrical representations derived from the physical model, as inputs to both the teacher and student networks. The student network is optimized not only towards the pseudo-labels generated by the teacher, denoted as L_pseudo, but also by L_physics, which minimizes the discrepancy between ρ from the physical model and ρ from the inverted physical model. During inference, the lightweight student network only predicts direct pose estimates, as indicated by the gray background color.

Fig. 8 Synthetic Dataset. Samples of objects with varying photometric complexity are illustrated from different viewpoints.

Fig. 14 Examples of Polarimetric and Depth Quality.

Table 3 Domain Shift and S2P3's Self-Supervision. Average recall of the ADD(-S) metric is reported for different objects with increasing photometric complexity for the student and teacher networks individually, when trained in a supervised setting on either real or synthetic data and tested on real data. The full S2P3 pipeline, with synthetic pre-training and self-supervised training of the student on non-annotated real data, is also reported for comparison. "Teacher †" as the upper bound is identical to PPP-Net (Gao et al., 2022). "Student ⋆" corresponds to the setting of S2P3 before applying the proposed self-supervision scheme.

Table 5 Ablation on Loss Terms. Average recall of the ADD(-S) metric is reported.

Table 6 S2P3 Ablations on Depth Modality. Average recall of the ADD(-S) metric is reported.