1 Introduction

With the advent of powerful mobile hardware with increasing sensor capabilities, immersive and elaborate Augmented Reality (AR) and Mixed Reality (MR) applications have become easier to realize than ever before. They range from entertainment applications like gaming to education, architecture, or medical applications. Several subtopics like tracking, registration, and scene understanding in general are relevant for this. One important factor is the availability of depth information within a scene, which can be utilized for tasks like user interaction, accurate occlusion when rendering AR content, and 3D scene reconstruction. While very rough information about the environment geometry, like the simple planes provided by some AR frameworks, might be enough in some scenarios, tasks like 3D scene reconstruction usually require more accurate depth information. For this reason, low-cost depth sensors have become more popular in recent years. Examples include the Structure Sensor by Occipital Inc. [14], the Kinect v1 and v2, the Leap Motion Sensor by Leap Motion [19], or the LiDAR sensor built directly into Apple iPads. Such sensors are based on Time of Flight (ToF) or structured light [25]. The Structure Sensor and its successors in particular are designed to be attached to a mobile device to allow for high flexibility, while the LiDAR sensor in newer iPads is already included in the device, which is why those two sensors are particularly interesting for this study. While sensors have become smaller and more affordable, it remains an interesting possibility to replace them outright with algorithmic solutions like accurate stereo matching or neural networks.

In this study, this topic is addressed by using datasets from the Structure Sensor and the LiDAR sensor to train a neural network that simulates or replicates them. Hereby, the goal is not merely to get a good depth estimation but rather to mimic the sensors including their specific characteristics. This is especially useful for things like prototyping AR applications. A typical problem for AR development is the need to deploy to a physical device due to the lack of tools that provide an accurate simulation of the AR experience. Unity MARS is a tool that tries to address this problem by providing a simulation environment including details like feature tracking. However, depth information that mimics the real-world behavior of a sensor would aid the simulation and prototyping process. Therefore, this paper aims to show a procedure to create a sensor simulation using machine learning.

In this study, Deep Neural Networks (DNNs) are employed to simulate depth sensors. As shown in Sect. 2, there is a wide variety of depth estimation networks as well as large datasets. However, simply using those networks is generally not possible for our stated goals as they are trained for different sensors and therefore reproduce different sensor properties, artifacts, and overall quality. Therefore, we aim to use existing networks as a basis for a transfer learning approach with a smaller dataset in order to adapt them to a different sensor and its specific characteristics. Special consideration has to be given to the fact that some sensor artifacts might negatively impact the learning procedure. For example, depth images from the Structure Sensor are far less accurate above a certain distance threshold and contain additional inaccuracies especially around the edges of objects. We therefore experiment with different data preparation techniques like inpainting and distance clipping in order to handle those problems. The iPad LiDAR sensor, on the other hand, can already provide smoothed depth images which do not exhibit the same problems. As they do not contain holes and we aim to use a small dataset for the transfer learning, we instead experiment with data augmentation techniques. In summary, we employ different data preparation and augmentation strategies and evaluate their influence on the overall model quality.

Ultimately, this paper offers the following key contributions:

  • A method for adapting a neural network to a new domain in order to simulate depth sensors like the Structure Sensor or the Apple iPad LiDAR sensor.

  • Two datasets including RGB-D and tracking data, one created with a Structure Sensor and one with an Apple iPad LiDAR sensor.

  • An evaluation of different strategies for mimicking real depth sensors using a neural network, including inpainting for the Structure Sensor or data augmentation for the LiDAR sensor dataset as well as the use of different backbones for the neural network.

The paper is structured as follows: Sect. 2 presents related research and the state of the art. Central information about our used network, loss function, and the datasets can be found in Sect. 3. Section 4 presents our evaluation in order to show how well the different models and configurations perform, and Sect. 5 provides our conclusion as well as some ideas for future research to extend our work.

2 Related work

2.1 Architectures

The detection of depth data can be achieved by active or passive methods. Active methods involve depth sensors that employ LiDAR, ultrasound, structured light, or Time-of-Flight (ToF) which actively send out signals that interact with the environment in order to estimate the scene depth. In addition to these active sensors, passive methods are also used more frequently which estimate the depth data with approaches like stereo matching or DNNs [15].

In recent years, the field of machine learning has generated a variety of approaches which provide promising results for tasks like detection, classification, and depth estimation. One of the first approaches to estimate depth from a single image is the probabilistic model by Saxena et al. [26]. In 2014, Ladicky et al. [16] showed limitations of monocular depth estimation approaches due to perspective geometry and introduced a simpler classifier by relying on image manipulation. Laina et al. [17] presented a deep learning approach based on ResNet-50 in 2016. This architecture replaced the last layer with upsampling layers for reconstruction and proposed a reverse Huber (berHu) loss function. In 2018, Zamir et al. [32] showed that the amount of labeled data used in the training procedure can be reduced by applying transfer learning. Transfer learning offers the advantage that certain structures are already part of the model, which can be beneficial when adapting the model to a new domain. Xu et al. [31] combined multi-scale features from the encoder with the decoder and showed that learned features from upper layers contain higher-level information. This data provides an indication of a global understanding of the structural aspects of an image or scene [9]. Encoder–decoder architectures like this include a larger pre-trained model as a backbone. The best known encoder backbones are the Residual Network (ResNet), the Densely Connected Convolutional Network (DenseNet), the Squeeze and Excitation Network (SENet), and the Visual Geometry Group Network (VGGNet) [9]. In particular, ResNet allows intermediate layers to be skipped via shortcut connections, which enables faster training. DenseNet, in turn, builds concatenations of all preceding feature maps, supporting feature propagation and reuse, which consequently reduces the number of parameters [10]. SENet consists of a squeeze operator that aggregates the feature map while the excitation operator recalibrates the learned activations [12]. Other methods such as GeoNet [23], SharpNet [24], and Pattern Affinitive Propagation (PAP-Depth) [33] have also been based on the ResNet backbone for depth prediction. Alhashim and Wonka [1] showed that using DenseNet-169 as a backbone provides comparable results without a complex and large architecture. Later, they introduced adaptive bins to estimate the depth map more precisely [2]. In our previous work [21], we employed the model architecture from Alhashim and Wonka [1] in order to simulate the behavior of a Structure Sensor. Hereby, transfer learning was used in conjunction with various data processing techniques.

2.2 Datasets

A variety of different datasets exist which are often originally used for different purposes but are applicable in the context of depth estimation. Important datasets are the KITTI dataset [6], Berkeley 3D Object Dataset (B3DO) [13], NYU V1 dataset [27], NYU Depth-V2 [22], Scene Flow dataset [20], and the depth-in-the-wild dataset [3].

In 2012, Silberman et al. [22] presented a high-quality Kinect dataset (NYU Depth-V2). This consists of 1,449 densely labeled pairs of aligned RGB and depth images and 407,024 unlabeled images in 464 different indoor scenes. All in all, the dataset has high-quality depth maps of indoor environments suitable for 3D reconstruction or depth estimation as in this study.

For outdoor scenes, the KITTI dataset of Geiger et al. [6] is available. The dataset combines high-resolution color and grayscale stereo camera imagery with laser scans, high-precision GPS measurements, and Simultaneous Localization and Mapping (SLAM) data.

Another interesting dataset was compiled by Chen et al. [3] by means of crowd-sourcing. For the Depth-in-the-Wild dataset, users were shown two points inside an image and asked which of the two points is closer. The dataset thus consists of 495,000 different images with annotations of relative depth.

Table 1 gives an overview of different datasets and the corresponding properties.

Table 1 Comparison of various existing datasets

3 Method

3.1 Architecture

The architecture is based on DenseNet, mentioned in Sect. 2, which extends the idea of ResNet. This offers two main advantages: First, there is a strong gradient flow. Second, DenseNet is more diversified, meaning that it retains well-generalized information from earlier layers, which tend to contain richer patterns. Such information is usually lost in the additive skip connections of ResNet. For these reasons, we chose DenseNet over ResNet as the backbone. Accordingly, our study uses an encoder–decoder architecture as described by Alhashim and Wonka [1]. Figure 1 shows an overview of the architecture. To demonstrate the advantage of DenseNet over ResNet, we compare their performance on the LiDAR dataset in Sect. 4.3, which shows that DenseNet provides more accurate results.

Fig. 1 Overview of the model architecture using a DenseNet backbone and multiple upsampling layers

The model uses DenseNet-169 as the backbone, which is pre-trained on ImageNet [4]. The DenseNet architecture follows a feed-forward design in which each layer is directly connected to all subsequent layers within a dense block. The decoder consists of four blocks, each with four layers. Each block includes upsampling through a bilinear layer, followed by a concatenation operation and two 2D convolutional layers, as shown in Fig. 1. The convolutional layers have a kernel size of \(3\times 3\) while the number of filters is determined by the number of filters obtained from the encoding layer. This is given by \(D_n = {E_m} / {2^n}\), where \(D_n\) is the number of filters in decoder block n and \(E_m\) is the number of filters of encoder layer m. DenseNet consists of a series of dense blocks connected by transition layers [10]. The various decoder blocks are connected to the transition layers of the DenseNet architecture in order to access intermediate information.
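
To illustrate this structure, the following Keras sketch builds a DenseNet-169 encoder and attaches decoder blocks according to \(D_n = E_m / 2^n\). It is a minimal sketch, not the authors' exact code: the skip-layer names follow the Keras DenseNet implementation and the single-channel output head is our assumption.

import tensorflow as tf
from tensorflow.keras import layers, Model

def decoder_block(x, skip, filters, name):
    # One decoder block: bilinear upsampling, concatenation with an encoder
    # feature map, and two 3x3 convolutions (cf. Fig. 1).
    x = layers.UpSampling2D(size=2, interpolation="bilinear", name=f"{name}_up")(x)
    x = layers.Concatenate(name=f"{name}_concat")([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu", name=f"{name}_convA")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu", name=f"{name}_convB")(x)
    return x

def build_model(input_shape=(480, 640, 3)):
    # DenseNet-169 encoder pre-trained on ImageNet.
    encoder = tf.keras.applications.DenseNet169(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Intermediate feature maps taken from the first convolution and the
    # transition (pooling) layers of DenseNet-169 (assumed layer names).
    skip_names = ("conv1/relu", "pool1", "pool2_pool", "pool3_pool")
    skips = [encoder.get_layer(n).output for n in skip_names]
    x = encoder.output              # E_m filters (1664 for DenseNet-169)
    e_m = x.shape[-1]
    for n, skip in enumerate(reversed(skips), start=1):
        # D_n = E_m / 2^n filters in decoder block n
        x = decoder_block(x, skip, filters=e_m // 2 ** n, name=f"up{n}")
    depth = layers.Conv2D(1, 3, padding="same", name="depth")(x)  # single-channel depth map
    return Model(encoder.input, depth)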

3.2 Depth datasets

The focus of this work is especially on indoor spaces. For this reason, the NYU V2 dataset was used as it provides high-quality indoor depth maps. Based on this, we created two datasets of our own for two different sensors in order to perform transfer learning on top of the NYU V2 dataset. One dataset was created using an iPad with an Occipital Structure Sensor, while the other one was created with an iPad and its built-in LiDAR sensor. Both datasets are further described in the following sections and are provided in the repository found at https://doi.org/10.5281/zenodo.5115276.

3.2.1 Structure Sensor

For the Structure Sensor, we generated a dataset that consists of 20 scenes with 2,693 images taken with an Apple iPad Pro 12.9 inch (2015) and an Occipital Structure Sensor. 'Image' hereby refers to an RGB-D image where color and depth are stored separately in practice. The focus here was on office and laboratory spaces, which therefore make up a major part of the dataset. The RGB images were recorded with a resolution of \(640\times 360\) stored in 24-bit color, while the raw depth images have a resolution of \(640\times 480\) in a 16-bit floating point format. Due to the network input size, the RGB images were scaled to \(640\times 480\) and the depth images were scaled and slightly cropped accordingly. For further use, both images are saved in a lossless compression format.

As already mentioned, the images were captured with the Structure Sensor which is directly attached to the Apple iPad. The sensor allows for a frame rate of 30 frames per second at the above-mentioned resolution. The data were recorded at 6 frames per second, as a higher frame rate would lead to largely redundant data.

A look at a particular example of the dataset in Fig. 2 shows that depth information is missing in some places. This happens especially when viewing certain reflective surfaces as well as the areas around object edges. These artifacts have two different causes. First, certain surfaces absorb light or do not reflect it sufficiently. Second, the distance and angle between the depth sensor and the camera may be too large to provide accurate depth sensing around edges [18]. As a consequence, black areas appear at the edges of objects and a depth shadowing effect becomes apparent. To counteract these effects, an inpainting procedure is used.

Fig. 2 Example from the Structure Sensor dataset that shows an RGB image with its corresponding depth map. Notice the black areas which are missing depth information caused by reflective surfaces and acute angles

Fig. 3 Example from the LiDAR sensor dataset that shows an RGB image with its corresponding depth map. Contrary to the Structure Sensor dataset, the depth images in this dataset do not exhibit the same artifacts like depth shadowing or noise in distant areas

Another problem with currently available depth sensors is the decrease in accuracy with increasing depth [8, 14], which leads to rather noisy results for far-away objects. This is problematic for the training procedure as well as for other tasks like 3D reconstruction. For this reason, the depth data are truncated at four meters (m) and higher values are removed. We experiment with two different techniques for handling the missing data. In one case, the data are marked as invalid by setting the pixels to zero, while the other approach sets the pixels to the maximum value of 4 m, which creates a wall effect. An example where a depth image is processed by these procedures is shown in Fig. 4, where \(GT_\mathrm{hole}\) is the result of setting the pixels to zero and \(GT_\mathrm{wall}\) shows the aforementioned wall effect. These configurations are meant to facilitate the training process by providing a consistent way to tag invalid data that can be learned by the neural network. In an actual application like 3D reconstruction, where depth data are predicted by a neural network trained with one of these configurations, hole or wall pixels would naturally be removed from the predicted depth image before further processing.
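
The two configurations can be sketched as follows, assuming the raw depth maps are NumPy arrays in meters and that missing measurements are stored as zeros:

import numpy as np

MAX_DEPTH_M = 4.0  # Structure Sensor depth is unreliable beyond this range

def prepare_depth(depth, mode="wall"):
    # Pixels that are missing (zero) or beyond the 4 m threshold are either
    # marked invalid ("hole", set to 0) or clamped to the maximum ("wall"),
    # corresponding to the GT_hole and GT_wall configurations described above.
    depth = depth.astype(np.float32)
    invalid = (depth <= 0.0) | (depth > MAX_DEPTH_M)
    out = depth.copy()
    if mode == "hole":
        out[invalid] = 0.0            # tag invalid regions explicitly
    elif mode == "wall":
        out[invalid] = MAX_DEPTH_M    # "wall" effect at the clipping distance
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out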

3.2.2 Apple iPad LiDAR sensor

In addition to the Structure Sensor dataset, we also created a dataset using an Apple iPad (2020). Here, we do not depend on an external sensor as the device already includes a built-in depth sensor using LiDAR technology. The depth images provided by ARKit have a resolution of \(256\times 192\) stored as 16-bit floating point. Additionally, we activated the smoothedSceneDepth property in ARKit when creating the dataset, which removes holes and reduces noise. The color images are resized to a resolution of \(480\times 360\) and stored as 24-bit color, a quarter of the original size from ARKit, which provides a resolution of \(1920\times 1440\). For the network input, we kept the size of \(640\times 480\), which means that the data are resized accordingly for training and estimation. For this dataset, we chose a rate of 1 frame per second and recorded a total of 3,637 images in 19 different scenes.

Figure 3 shows an example of a color frame and its corresponding depth image. Contrary to the Structure Sensor dataset shown in Fig. 2, the LiDAR depth images appear much smoother and do not include the same artifacts, such as empty regions. This is mostly due to the smoothing applied by ARKit. This also means that special dataset processing techniques regarding hole filling, as in the Structure Sensor dataset, are not necessary in this case. Instead, we are able to feed the data directly into the neural network in order to learn the characteristics of the LiDAR sensor. However, since the data do not exhibit direction-dependent artifacts, we additionally experiment with data augmentation techniques like rotation and mirroring.

3.3 Loss function

The loss function in this paper is based on the loss function of Alhashim and Wonka [1]. It especially considers higher frequencies of an image and determines the error value accordingly; the reason for weighting higher frequencies is to improve the quality of the network with regard to object boundaries. The final loss value L is calculated as a weighted sum of a pairwise depth loss \(L_\mathrm{depth}\), the image gradient loss \(L_\mathrm{grad}\), and the structural similarity loss \(L_\mathrm{SSIM}\):

$$\begin{aligned} L(y, {\hat{y}}) = \lambda _{1} L_\mathrm{depth}(y, {\hat{y}}) + \lambda _{2} L_\mathrm{grad}(y, {\hat{y}}) + \lambda _{3} L_\mathrm{SSIM}(y, {\hat{y}}) \end{aligned}$$
(1)
$$\begin{aligned} L_\mathrm{depth}(y, {\hat{y}}) = \frac{1}{n} \sum _{p}^{n} \left| y_{p} - {\hat{y}}_{p} \right| \end{aligned}$$
(2)

First, the pairwise depth loss is given by Eq. 2. In this way, we obtain a pixel-level measure of the error between the ground truth and the predicted image.

$$\begin{aligned} L_\mathrm{grad}(y, {\hat{y}}) = \frac{1}{n} \sum _{p}^{n} \left| g_{x} (y_{p} - {\hat{y}}_{p}) \right| + \left| g_{y} (y_{p} - {\hat{y}}_{p}) \right| \end{aligned}$$
(3)

Second, \(L_\mathrm{grad}\) calculates the changes in the gradient with respect to the neighboring pixels. The image gradient g is calculated in two directions \(g_{x}\) and \(g_{y}\), which gives the differences in the x and y components of the depth images.

$$\begin{aligned} L_\mathrm{SSIM}(y, {\hat{y}}) = \frac{1- SSIM(y, {\hat{y}})}{2} \end{aligned}$$
(4)

The third part of the loss function is the Structural Similarity Index Measure (SSIM) proposed by Wang et al. [29]. This metric quantifies perceptual similarity based on three components: luminance, contrast, and structure. \(L_\mathrm{SSIM}\) is an inverted version of SSIM so that it can be minimized as a loss [1, 11, 28].

The individual parts of the loss function in Eq. 1 were weighted with \(\lambda _{1} = 0.1\), \(\lambda _{2} = 1\), and \(\lambda _{3} = 1\) as suggested by Alhashim and Wonka [1]. This weighting leads to a training that considers structural similarity and the gradient more strongly than the pairwise depth.
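
The weighted sum can be implemented in TensorFlow roughly as follows. This is a sketch under the assumption that depth maps are batched tensors of shape [batch, height, width, 1] in meters; the max_val passed to the SSIM computation is our assumption and should match the actual depth range.

import tensorflow as tf

def depth_loss(y_true, y_pred, l1=0.1, l2=1.0, l3=1.0, max_depth=4.0):
    # Point-wise depth term (Eq. 2)
    l_depth = tf.reduce_mean(tf.abs(y_true - y_pred))
    # Gradient term (Eq. 3); the gradient of the difference equals the
    # difference of the gradients in x and y
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    l_grad = tf.reduce_mean(tf.abs(dx_t - dx_p) + tf.abs(dy_t - dy_p))
    # Structural similarity term (Eq. 4)
    l_ssim = tf.reduce_mean((1.0 - tf.image.ssim(y_true, y_pred, max_val=max_depth)) / 2.0)
    # Weighted sum (Eq. 1) with lambda_1 = 0.1, lambda_2 = lambda_3 = 1
    return l1 * l_depth + l2 * l_grad + l3 * l_ssim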

4 Evaluation & results

We trained our models using TensorFlow and Keras. For the optimization, we chose the Adam variant AMSGrad with a learning rate of \(10^{-4}\), which was maintained for every experiment. Training was stopped early when the model did not improve for 10 epochs.
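
A corresponding training setup might look as follows; this is only a sketch, where the model, datasets, and loss function are placeholders from the previous sketches, and the epoch budget as well as restore_best_weights are our assumptions.

import tensorflow as tf

def train(model, train_ds, val_ds, loss_fn):
    # AMSGrad variant of Adam with a fixed learning rate of 1e-4
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, amsgrad=True),
                  loss=loss_fn)
    # Stop training after 10 epochs without improvement on the validation loss
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                            restore_best_weights=True)
    return model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[stop])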

The datasets described in Sect. 3.2 are divided into individual rooms and hallways. For each dataset, we use one room for testing while the rest of the dataset is used for training and validation in order to make sure that the networks are evaluated on a room they have not been trained on. For the test sets, we chose one room per dataset that makes up roughly 10% of the total image count and divide the remaining images randomly into training and validation data in order to obtain a rough 80-10-10 split. The datasets were uploaded to a repository (https://doi.org/10.5281/zenodo.5115276) where each room is stored in a numbered subfolder and the subfolder with the largest number contains the test set. For the Structure Sensor dataset, we have a total of 2693 images of which one room with 272 images is used for testing, while the rest of the dataset is divided into 2179 training and 242 validation images. In the case of the LiDAR dataset, which has a total of 3637 images, we chose a room for testing that includes 322 images, while training and validation use 2985 and 331 images, respectively.
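
A possible room-based split under the folder layout described above (numbered room subfolders, with the highest number reserved for testing) is sketched below; the file-name pattern is hypothetical.

import random
from pathlib import Path

def split_rooms(root, val_fraction=0.1, seed=42):
    # Numbered room subfolders; the highest-numbered room becomes the test set.
    rooms = sorted((p for p in Path(root).iterdir() if p.is_dir()),
                   key=lambda p: int(p.name))
    test_room, train_rooms = rooms[-1], rooms[:-1]
    test = sorted(test_room.glob("*_color.png"))  # hypothetical file-name pattern
    frames = [f for room in train_rooms for f in sorted(room.glob("*_color.png"))]
    # Shuffle the remaining frames and hold out roughly 10% for validation.
    random.Random(seed).shuffle(frames)
    n_val = int(len(frames) * val_fraction)
    return frames[n_val:], frames[:n_val], test  # train, validation, test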

The results from the evaluation using the test sets are listed in Sect. 4.2 where the metrics shown in Sect. 4.1 were applied.

4.1 Metrics

The evaluation results listed in Table 2 are obtained using the accuracy and error metrics shown here. We rely on established metrics that were used in prior work regarding depth estimation evaluation [1, 5, 21].

Root mean squared error (RMSE):

$$\begin{aligned} \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum _{p}^{n}{(y_{p} - {\hat{y}}_{p}})^2} \end{aligned}$$
(5)

Average \(log_{10}\) error:

$$\begin{aligned} \mathrm{log}_{10} = \frac{1}{n} \sum _{p}^{n} \left| \mathrm{log}_{10}(y_{p}) - \mathrm{log}_{10}({\hat{y}}_{p}) \right| \end{aligned}$$
(6)

Threshold accuracy (\(\delta _{i}\)):

$$\begin{aligned} {\delta = \max \left( \frac{{y_{p}}}{\hat{y_{p}}}, \frac{\hat{y_{p}}}{{y_{p}}}\right) } \end{aligned}$$
(7)

For the threshold accuracy, \(\delta \) is first computed for every pixel. The resulting values are then compared with three different thresholds \(a_{1} = 1.25^1\), \(a_{2} = 1.25^2\), and \(a_{3} = 1.25^3\). The threshold accuracy \(\delta _{i}\) represents the percentage of pixels for which \(\delta < a_{i}\).
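
A compact NumPy sketch of these metrics, assuming depth maps given as arrays of positive values in meters (the small epsilon guarding the logarithm against zero-valued pixels is our addition):

import numpy as np

def evaluate(y_true, y_pred, eps=1e-6):
    # RMSE (Eq. 5), average log10 error (Eq. 6), and threshold accuracies (Eq. 7)
    y_true = np.maximum(y_true.astype(np.float64), eps)
    y_pred = np.maximum(y_pred.astype(np.float64), eps)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    log10 = np.mean(np.abs(np.log10(y_true) - np.log10(y_pred)))
    delta = np.maximum(y_true / y_pred, y_pred / y_true)
    acc = {f"delta_{i}": np.mean(delta < 1.25 ** i) for i in (1, 2, 3)}
    return {"rmse": rmse, "log10": log10, **acc}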

4.2 Experiments

Table 2 is divided into three sections. The reference section refers to the pre-trained model from Alhashim and Wonka [1] which was originally trained on the NYU V2 dataset. Here, the test sets of the two datasets are evaluated without applying additional training in order to provide a baseline and to show that an existing model is generally not able to adapt to different input data that it was not specifically trained for. The other sections show experiments that are specific to the Structure Sensor dataset (SD) or the LiDAR dataset (LD). All results in Table 2 rely on the DenseNet backbone. A comparison between DenseNet and ResNet for the LiDAR dataset is shown in Sect. 4.3.

Table 2 Here, the various experiments with their respective accuracy and error values are listed

The Structure Sensor section is concerned with various input representations and transfer learning techniques. hole and wall denote the input representations, i.e., the dataset preprocessing. For the hole configuration, invalid distant pixels are set to zero, while wall uses the maximum value (4 m). \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\) have been used to train the network which was pre-trained on ImageNet [4]. \(SD^{T}_\mathrm{hole}\) and \(SD^{T}_\mathrm{wall}\), on the other hand, refer to the same dataset configurations but are used for transfer learning with the network trained on the NYU V2 dataset as a basis.

Since the transfer learning approach shows the best results for SD, we also applied it in the case of the LiDAR dataset (\(LD^{T}\)). Due to the smoothing applied within ARKit, the input data from the Apple iPad LiDAR sensor does not exhibit the same artifacts and distance inaccuracies as the Structure Sensor data, so we did not experiment with different input data processing techniques and used the raw data as recorded from ARKit for the training. However, since the data does not contain direction-dependent artifacts, we instead experimented with two different data augmentation schemes, a common technique to prevent overfitting. For \(LD^{T}_{2x}\), we duplicated the training data and rotated the copy as if the sensor was simply turned upside down. \(LD^{T}_{4x}\) further augments the training set: in addition to the rotation, we also mirrored the dataset horizontally and vertically. The validation set and the test set are not augmented so that the evaluation only considers real-world data.
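
The two augmentation schemes can be sketched as follows, assuming the color image and depth map are NumPy arrays with matching orientation:

import numpy as np

def augment_2x(rgb, depth):
    # LD_2x: add a copy rotated by 180 degrees (sensor turned upside down)
    return [(rgb, depth), (np.rot90(rgb, 2), np.rot90(depth, 2))]

def augment_4x(rgb, depth):
    # LD_4x: additionally add horizontally and vertically mirrored copies
    return augment_2x(rgb, depth) + [
        (np.fliplr(rgb), np.fliplr(depth)),   # horizontal mirror
        (np.flipud(rgb), np.flipud(depth)),   # vertical mirror
    ]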

Figure 4 shows a visual comparison of various example estimations as well as the absolute difference between the predictions and the actual sensor data.

Fig. 4 Here, a visual comparison between the different sensors and the estimated depth maps is shown. The images on the right show the absolute difference and were slightly increased in brightness for illustration purposes. Note that \(SD_\mathrm{ref}\) is compared with \(GT_\mathrm{wall}\) for the evaluation as well as the difference image

4.2.1 Structure Sensor

The reference experiment \(SD_\mathrm{ref}\) was done by feeding the test set into the neural network which was only trained with the NYU V2 dataset. This tests whether the existing network is already able to provide good results when using a different sensor than the Kinect v2. \(SD^{T}_\mathrm{wall}\) instead uses a transfer learning approach, after which the same evaluation is performed. The results show that the existing network cannot generalize to a different domain and that a transfer learning approach can greatly improve the result. We found that the original model can still provide somewhat plausible depth maps for the RGB images from the Structure Sensor dataset, but when actually comparing them with the ground truth data, the depth values are too small, i.e., the point cloud appears to be too close to the camera. This shows that the network is not able to adapt to a camera with markedly different intrinsic parameters. While the difference image for \(SD^T_\mathrm{wall}\) in Fig. 4 still shows some artifacts around edges, the transfer learning mostly corrected the range compared to the reference network.

As mentioned previously, the quality of the Structure Sensor decreases rapidly with increasing range, which is why a threshold of 4 m is introduced. The difference between \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\) lies in the way these regions are treated when processing the data and feeding it into the training procedure. While the data is set as invalid (i.e., to zero) in the case of \(SD_\mathrm{hole}\), it is set to the maximum value in \(SD_\mathrm{wall}\). Table 2 clearly shows that when using a maximum value to mark invalid regions instead of removing them, the neural network can learn a much better representation of the input data. This can also be seen in Fig. 4, where the hole-trained network is unable to learn a good representation of the invalid regions and instead generates obvious artifacts.

Lastly, \(SD^{T}_\mathrm{hole}\) and \(SD^{T}_\mathrm{wall}\) use the same data processing techniques as \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\), respectively, but the training procedure uses a transfer learning approach starting from the network by Alhashim and Wonka [1] instead of the basic network with a DenseNet backbone trained only on ImageNet. As the results in Table 2 show, the transfer learning generally improves the results, but in the case of the hole data processing technique, the network still has problems learning the features, which leads to rather low accuracy and large errors in both cases.

Table 3 This comparison has the same structure as Table 2 but compares the results of a ResNet backbone and a DenseNet backbone on the LiDAR dataset, including the different augmentation factors

4.2.2 LiDAR

As shown in Sect. 4.2.1, transfer learning generally leads to better results. Therefore, we applied it in all experiments using the LiDAR dataset. Similar to \(SD_\mathrm{ref}\), \(LD_\mathrm{ref}\) shows that the existing network trained on NYU V2 is not able to provide accurate results for a different camera, as the range of the resulting data deviates from the test set, even though the result for LD is better than the one for SD. \(LD^{T}\) is obtained by applying a transfer learning which greatly improves the result as shown in Table 2. In fact, the results generally exceed the best results from the Structure Sensor experiments, including the reference network as shown in Fig. 4. This is likely due to the fact that the LiDAR sensor data are much smoother to begin with and include fewer artifacts that may disturb the training procedure. The smoothed depth provided by ARKit is not a pure sensor output and instead includes some internal processing that smooths the data over multiple frames. Since the resulting depth frames from ARKit do not exhibit obvious artifacts or holes due to shadowing that only occur in a specific direction, we augmented the dataset in two different ways in order to test whether this may improve the result. Table 2 shows that \(LD^{T}_{2x}\), which doubled the dataset by rotating the images by 180 degrees, led to a slight improvement. For \(LD^{T}_{4x}\), the dataset is further augmented by adding vertically and horizontally mirrored versions of the images. While the rotation of the images is comparable to a rotation of the sensor, mirroring may introduce samples that deviate from the real-world sensor. Still, the quadruple augmentation slightly improves the results further even though the gains are rather minor.

4.3 Comparison of ResNet and DenseNet

As mentioned in Sect. 3, we chose DenseNet as the backbone as it is advantageous with regard to depth estimation. DenseNet builds on the ideas of ResNet and is a deeper network, so it may be interesting to use the more lightweight ResNet if it still provides accurate results. Therefore, Table 3 shows a comparison between DenseNet-169 and ResNet-101 as the backbone. For ResNet, the decoder blocks were connected in a similar fashion to DenseNet as described in Sect. 3. Here, it becomes clear that DenseNet outperforms ResNet on every metric and dataset. However, ResNet can still provide reasonable results and may be useful in applications where DenseNet requires too much processing time or memory.
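
Swapping the backbone only changes the encoder and the layers from which the skip connections are taken; a brief sketch is shown below, with layer names following the Keras ResNet implementation (again assumptions rather than the authors' exact configuration).

import tensorflow as tf

def build_resnet_encoder(input_shape=(480, 640, 3)):
    # ResNet-101 encoder pre-trained on ImageNet; the decoder blocks from the
    # DenseNet sketch in Sect. 3.1 attach to these intermediate outputs.
    encoder = tf.keras.applications.ResNet101(
        include_top=False, weights="imagenet", input_shape=input_shape)
    skip_names = ("conv1_relu", "conv2_block3_out",
                  "conv3_block4_out", "conv4_block23_out")
    skips = [encoder.get_layer(n).output for n in skip_names]
    return encoder, skips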

5 Conclusion & future work

Depth information is an important feature for topics like 3D reconstruction or AR. Since such information may not always be available, for example due to hardware limitations or in the context of prototyping, a virtual replica of a depth sensor is a possible alternative. In this paper, several approaches for the simulation of depth sensors have been explored and evaluated. We included two different sensors with very distinct characteristics and used various data processing and augmentation techniques in order to achieve models that closely approximate the physical sensors. One sensor is the Structure Sensor, which is attached to an Apple iPad, while the second sensor is the LiDAR sensor built into more recent iPads.

We employ a network architecture that consists of a DenseNet backbone with several upsampling layers as proposed by Alhashim and Wonka [1]. We showed that DenseNet outperforms ResNet as a backbone in terms of accuracy. The model trained on the NYU V2 dataset was tested using RGB input images and evaluated against the Structure Sensor and LiDAR data to see how close it would get to the real data. However, the model is not able to adapt to the new domain as it provides depth images in a different range when fed with the respective RGB images. While it is expected that it cannot replicate a different sensor including its specific artifacts without training on data from it, this also shows that it cannot adapt to a camera with different intrinsic parameters. We therefore created datasets for both sensors in order to provide better approximations. We found that using a transfer learning approach from the model pre-trained on the NYU V2 dataset generally led to the best results, as it can adapt the existing network to the features of the new sensor without the need to create another dataset of the size of NYU V2.

For the Structure Sensor, where the range was limited, we tested two different ways of dealing with regions that are beyond the distance threshold. We found that inpainting the holes with the maximum value led to a superior training procedure compared to removing the invalid regions by setting them to zero. When inpainting before training, the network could replicate the respective regions for the most part, while removing them led to strong artifacts. In the latter case, these artifacts were largely impossible to differentiate from valid depth information, while the wall effect in the inpainting case means that a simple threshold applied to the network prediction can recreate the empty regions.

The experiments using data from the Apple iPad LiDAR sensor show how smoother data clearly facilitates the training procedure as the data provided by ARKit contains far less noise compared to the Structure Sensor. Therefore, no special data processing techniques are required in this case as the results already show a high accuracy. The accuracy can be further improved by relying on data augmentation.

For future work, the models may be further improved by providing larger datasets with more variety, as we relied largely on typical university indoor environments like seminar rooms, offices, and hallways. Further experiments in which the best performing networks are evaluated in tasks like 3D reconstruction would also be interesting for future research. In this context, it is also desirable to simplify the network in order to allow for shorter prediction times and facilitate real-time usage. While the network can provide data in real time, it does not reach the typical 30 frames per second that are possible with physical sensors. As the network is generally better at approximating the LiDAR sensor than the Structure Sensor, further refinement of the method used for the latter can be considered. Due to the applied inpainting, object boundaries appear somewhat unclear in certain cases. This could be remedied by applying sharpening [7] and edge feature extraction [30], which may further improve the depth estimation.