1 Introduction

With the advent of powerful mobile hardware with increasing sensor capabilities, immersive and elaborate Augmented Reality (AR) and Mixed Reality (MR) applications have become easier to realize than ever before. They range from entertainment applications like gaming to education, architecture, or medical applications. Several subtopics like tracking, registration, and scene understanding in general are relevant for this. One important factor is the availability of depth information within a scene, which can be utilized for tasks like user interaction, accurate occlusion when rendering AR content, and 3D scene reconstruction. While very rough information about the environment geometry, like the simple planes provided by some AR frameworks, might be enough in some scenarios, tasks like 3D scene reconstruction usually require more accurate depth information. For this reason, low-cost depth sensors have become more popular in recent years. Examples include the Structure Sensor by Occipital Inc. [14], the Kinect v1 and v2, the Leap Motion Sensor by Leap Motion [19], or the LiDAR sensor built directly into Apple iPads. Such sensors are based on Time of Flight (ToF) or structured light [25]. The Structure Sensor and its successors in particular are designed to be attached to a mobile device to allow for high flexibility, while the LiDAR sensor in newer iPads is already included in the device, which is why those two sensors are particularly interesting for this study. While sensors have become smaller and more affordable, it remains an interesting possibility to replace them outright with algorithmic solutions like accurate stereo matching or neural networks.

In this study, this topic is addressed by using datasets from the Structure Sensor and the LiDAR sensor to train a neural network that simulates or replicates them. Hereby, the goal is not merely to get a good depth estimation but rather to mimic the sensors including their specific characteristics. This is especially useful for things like prototyping AR applications. A typical problem for AR development is the need to deploy to a physical device due to the lack of tools that provide an accurate simulation of the AR experience. Unity MARS is a tool that tries to address this problem by providing a simulation environment including details like feature tracking. However, depth information that mimics the real-world behavior of a sensor would aid the simulation and prototyping process. Therefore, this paper aims to show a procedure to create a sensor simulation using machine learning.

In this study, Deep Neural Networks (DNNs) are employed to simulate depth sensors. As shown in Sect. 2, there is a wide variety of depth estimation networks as well as large datasets. However, simply using those networks is generally not possible for our stated goals as they are trained for different sensors and therefore reproduce different sensor properties, artifacts, and overall quality. Therefore, we aim to use existing networks as a basis for a transfer learning approach with a smaller dataset in order to adapt them to a different sensor and its specific characteristics. Special consideration has to be given to the fact that some sensor artifacts might negatively impact the learning procedure. For example, depth images from the Structure Sensor are far less accurate above a certain distance threshold and contain additional inaccuracies especially around the edges of objects. We therefore experiment with different data preparation techniques like inpainting and distance clipping in order to handle those problems. The iPad LiDAR sensor, on the other hand, can already provide smoothed depth images which do not exhibit the same problems. As they do not contain holes and we aim to use a small dataset for the transfer learning, we instead experiment with data augmentation techniques. In summary, we employ different data preparation and augmentation strategies and evaluate their influence on the overall model quality.

Ultimately, this paper offers the following key contributions:

  • A method for adapting a neural network to a new domain in order to simulate depth sensors like the Structure Sensor or the Apple iPad LiDAR sensor.

  • Two datasets including RGB-D and tracking data, one created with a Structure Sensor and one with an Apple iPad LiDAR sensor.

  • An evaluation of different strategies for mimicking real depth sensors using a neural network, including inpainting for the Structure Sensor or data augmentation for the LiDAR sensor dataset as well as the use of different backbones for the neural network.

The paper is structured as follows: Sect. 2 presents related research and the state of the art. Central information about our used network, loss function, and the datasets can be found in Sect. 3. Section 4 presents our evaluation in order to show how well the different models and configurations perform, and Sect. 5 provides our conclusion as well as some ideas for future research to extend our work.

2 Related work

2.1 Architectures

The detection of depth data can be achieved by active or passive methods. Active methods involve depth sensors that employ LiDAR, ultrasound, structured light, or Time-of-Flight (ToF) which actively send out signals that interact with the environment in order to estimate the scene depth. In addition to these active sensors, passive methods are also used more frequently which estimate the depth data with approaches like stereo matching or DNNs [15].

In recent years, the field of machine learning has generated a variety of approaches which provide promising results for tasks like detection, classification, and depth estimation. One of the first approaches to estimate depth from a single image is the probabilistic model by Saxena et al. [26]. In 2014, Ladicky et al. [16] showed limitations of monocular depth estimation approaches due to perspective geometry and introduced a simpler classifier by relying on image manipulation. Laina et al. [17] presented a deep learning approach based on ResNet-50 in 2016. This architecture replaced the last layer with upsampling layers for reconstruction and proposed a reverse Huber (berHu) loss function. In 2018, Zamir et al. [32] showed that the amount of labeled data used in the training procedure can be reduced by applying transfer learning. Transfer learning offers the advantage that certain structures are already part of the model, which can be beneficial when adapting the model to a new domain. Xu et al. [31] combined multi-scale features from the encoder with the decoder and showed that learned features from upper layers contain higher-level information. This data provides an indication of a global understanding of the structural aspects of an image or scene [9]. Encoder–decoder architectures like this include a larger pre-trained model as a backbone. The best known encoder backbones are the Residual Network (ResNet), the Densely Connected Convolutional Network (DenseNet), the Squeeze and Excitation Network (SENet), and the Visual Geometry Group Network (VGGNet) [9]. In particular, ResNet allows intermediate layers to be skipped via shortcut connections, which enables faster training. DenseNet, in turn, builds concatenations of all preceding feature maps, supporting feature propagation and reuse, which consequently reduces the number of parameters [10]. SENet consists of a squeeze operator that aggregates the feature map while the excitation operator recalibrates the learned activations [12]. Other methods such as GeoNet [23], SharpNet [24], and Pattern Affinitive Propagation (PAP-Depth) [33] have also been based on the ResNet backbone for depth prediction. Alhashim and Wonka [1] showed that using DenseNet-169 as a backbone provides comparable results without a complex and large architecture. Later, they introduced adaptive bins to estimate the depth map more precisely [2]. In our previous work [21], we employed the model architecture from Alhashim and Wonka [1] in order to simulate the behavior of a Structure Sensor. Hereby, transfer learning was used in conjunction with various data processing techniques.

2.2 Datasets

A variety of different datasets exist which are often originally used for different purposes but are applicable in the context of depth estimation. Important datasets are the KITTI dataset [6], Berkeley 3D Object Dataset (B3DO) [13], NYU V1 dataset [27], NYU Depth-V2 [22], Scene Flow dataset [20], and the depth-in-the-wild dataset [3].

In 2012, Silberman et al. [22] presented a high-quality Kinect dataset (NYU Depth-V2). This consists of 1,449 densely labeled pairs of aligned RGB and depth images and 407,024 unlabeled images in 464 different indoor scenes. All in all, the dataset has high-quality depth maps of indoor environments suitable for 3D reconstruction or depth estimation as in this study.

For outdoor scenes, the KITTI dataset of Geiger et al. [6] is available. The dataset combines high-resolution color and grayscale stereo camera imagery with laser scans, high-precision GPS measurements, and Simultaneous Localization and Mapping (SLAM) data.

Another interesting dataset was compiled by Chen et al. [3] by means of crowd-sourcing. For the Depth-in-the-Wild dataset, users were shown two points inside an image and asked which of the two points is closer. The dataset thus consists of 495,000 different images with annotations of relative depth.

Table 1 gives an overview of different datasets and the corresponding properties.

Table 1 Comparison of various existing datasets

3 Method

3.1 Architecture

The architecture is based on DenseNet, mentioned in Sect. 2, which extends the idea of ResNet. This offers two main advantages: First, there is a strong gradient flow. Second, DenseNet is more diversified, meaning that it retains well-generalized information from earlier layers, which tend to contain richer patterns. Such information is usually lost in the additive skip connections of ResNet. For these reasons, we chose DenseNet over ResNet as the backbone. Accordingly, our study uses an encoder–decoder architecture as described by Alhashim and Wonka [1]. Figure 1 shows an overview of the architecture. To demonstrate the advantage of DenseNet over ResNet, we compare their performance on the LiDAR dataset in Sect. 4.3, which shows that DenseNet provides more accurate results.

Fig. 1 Overview of the model architecture using a DenseNet backbone and multiple upsampling layers

The model uses DenseNet-169 as the backbone, which is pre-trained on ImageNet [4]. The DenseNet architecture follows a feed-forward design in which each layer is directly connected to all subsequent layers within a dense block. The decoder consists of four blocks, each with four layers. Each block includes upsampling through a bilinear layer, followed by a concatenation operation and two 2D convolutional layers, as shown in Fig. 1. The convolutional layers have a kernel size of \(3\times 3\) while the number of filters is determined by the number of filters obtained from the encoding layer. This is given by \(D_n = {E_m} / {2^n}\), where \(D_n\) is the number of filters in decoder block n and \(E_m\) is the number of filters of encoder layer m. DenseNet consists of a series of dense blocks connected by transition layers [10]. The various decoder blocks are connected to the transition layers of the DenseNet architecture in order to access intermediate information.
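
To illustrate this structure, the following Keras sketch builds a DenseNet-169 encoder and attaches decoder blocks according to \(D_n = E_m / 2^n\). It is a minimal sketch, not the authors' exact code: the skip-layer names follow the Keras DenseNet implementation and the single-channel output head is our assumption.

import tensorflow as tf
from tensorflow.keras import layers, Model

def decoder_block(x, skip, filters, name):
    # One decoder block: bilinear upsampling, concatenation with an encoder
    # feature map, and two 3x3 convolutions (cf. Fig. 1).
    x = layers.UpSampling2D(size=2, interpolation="bilinear", name=f"{name}_up")(x)
    x = layers.Concatenate(name=f"{name}_concat")([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu", name=f"{name}_convA")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu", name=f"{name}_convB")(x)
    return x

def build_model(input_shape=(480, 640, 3)):
    # DenseNet-169 encoder pre-trained on ImageNet.
    encoder = tf.keras.applications.DenseNet169(
        include_top=False, weights="imagenet", input_shape=input_shape)
    # Intermediate feature maps taken from the first convolution and the
    # transition (pooling) layers of DenseNet-169 (assumed layer names).
    skip_names = ("conv1/relu", "pool1", "pool2_pool", "pool3_pool")
    skips = [encoder.get_layer(n).output for n in skip_names]
    x = encoder.output              # E_m filters (1664 for DenseNet-169)
    e_m = x.shape[-1]
    for n, skip in enumerate(reversed(skips), start=1):
        # D_n = E_m / 2^n filters in decoder block n
        x = decoder_block(x, skip, filters=e_m // 2 ** n, name=f"up{n}")
    depth = layers.Conv2D(1, 3, padding="same", name="depth")(x)  # single-channel depth map
    return Model(encoder.input, depth)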

3.2 Depth datasets

The focus of this work is especially on indoor spaces. For this reason, the NYU V2 dataset was used as it provides high-quality indoor depth maps. Based on this, we created two datasets of our own for two different sensors in order to perform transfer learning on top of the NYU V2 dataset. One dataset was created using an iPad with an Occipital Structure Sensor, while the other one was created with an iPad and its built-in LiDAR sensor. Both datasets are further described in the following sections and are provided in the repository found at https://doi.org/10.5281/zenodo.5115276.

3.2.1 Structure Sensor

For the Structure Sensor, we generated a dataset that consists of 20 scenes with 2,693 images taken with an Apple iPad Pro 12.9 inch (2015) and an Occipital Structure Sensor. 'Image' hereby refers to an RGB-D image where color and depth are stored separately in practice. The focus here was on office and laboratory spaces, which therefore make up a major part of the dataset. The RGB images were recorded with a resolution of \(640\times 360\) stored in 24-bit color, while the raw depth images have a resolution of \(640\times 480\) in a 16-bit floating point format. Due to the network input size, the RGB images were scaled to \(640\times 480\) and the depth images were scaled and slightly cropped accordingly. For further use, both images are saved in a lossless compression format.

As already mentioned, the images were captured with the Structure Sensor which is directly attached to the Apple iPad. The sensor allows for a frame rate of 30 frames per second at the above-mentioned resolution. The data were recorded at 6 frames per second, as a higher frame rate would lead to largely redundant data.

A look at a particular example of the dataset in Fig. 2 shows that depth information is missing in some places. This happens especially when viewing certain reflective surfaces as well as the areas around object edges. These artifacts have two different causes. First, certain surfaces absorb light or do not reflect it sufficiently. Second, the distance and angle between the depth sensor and the camera may be too large to provide accurate depth sensing around edges [18]. As a consequence, black areas appear at the edges of objects and a depth shadowing effect becomes apparent. To counteract these effects, an inpainting procedure is used.

Fig. 2 Example from the Structure Sensor dataset that shows an RGB image with its corresponding depth map. Notice the black areas which are missing depth information caused by reflective surfaces and acute angles

Fig. 3 Example from the LiDAR sensor dataset that shows an RGB image with its corresponding depth map. Contrary to the Structure Sensor dataset, the depth images in this dataset do not exhibit the same artifacts like depth shadowing or noise in distant areas

Another problem with currently available depth sensors is the decrease in accuracy with increasing depth [8, 14], which leads to rather noisy results for far-away objects. This is problematic for the training procedure as well as for other tasks like 3D reconstruction. For this reason, the depth data are truncated at four meters (m) and higher values are removed. We experiment with two different techniques for handling the missing data. In one case, the data are marked as invalid by setting the pixels to zero, while the other approach sets the pixels to the maximum value of 4 m, which creates a wall effect. An example where a depth image is processed by these procedures is shown in Fig. 4, where \(GT_\mathrm{hole}\) is the result of setting the pixels to zero and \(GT_\mathrm{wall}\) shows the aforementioned wall effect. These configurations are meant to facilitate the training process by providing a consistent way to tag invalid data that can be learned by the neural network. In an actual application like 3D reconstruction, where depth data are predicted by a neural network trained with one of these configurations, hole or wall pixels would naturally be removed from the predicted depth image before further processing.
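
The two configurations can be sketched as follows, assuming the raw depth maps are NumPy arrays in meters and that missing measurements are stored as zeros:

import numpy as np

MAX_DEPTH_M = 4.0  # Structure Sensor depth is unreliable beyond this range

def prepare_depth(depth, mode="wall"):
    # Pixels that are missing (zero) or beyond the 4 m threshold are either
    # marked invalid ("hole", set to 0) or clamped to the maximum ("wall"),
    # corresponding to the GT_hole and GT_wall configurations described above.
    depth = depth.astype(np.float32)
    invalid = (depth <= 0.0) | (depth > MAX_DEPTH_M)
    out = depth.copy()
    if mode == "hole":
        out[invalid] = 0.0            # tag invalid regions explicitly
    elif mode == "wall":
        out[invalid] = MAX_DEPTH_M    # "wall" effect at the clipping distance
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out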

3.2.2 Apple iPad LiDAR sensor

In addition to the Structure Sensor dataset, we also created a dataset using an Apple iPad (2020). Here, we do not depend on an external sensor as the device already includes a built-in depth sensor using LiDAR technology. The depth images provided by ARKit have a resolution of \(256\times 192\) stored as 16-bit floating point. Additionally, we activated the smoothedSceneDepth property in ARKit when creating the dataset, which removes holes and reduces noise. The color images are resized to a resolution of \(480\times 360\) and stored as 24-bit color, a quarter of the original size from ARKit, which provides a resolution of \(1920\times 1440\). For the network input, we kept the size of \(640\times 480\), which means that the data are resized accordingly for training and estimation. For this dataset, we chose a rate of 1 frame per second and recorded a total of 3,637 images in 19 different scenes.

Figure 3 shows an example of a color frame and its corresponding depth image. Contrary to the Structure Sensor dataset shown in Fig. 2, the LiDAR depth images appear much smoother and do not include the same artifacts, such as empty regions. This is mostly due to the smoothing applied by ARKit. This also means that special dataset processing techniques regarding hole filling, as in the Structure Sensor dataset, are not necessary in this case. Instead, we are able to feed the data directly into the neural network in order to learn the characteristics of the LiDAR sensor. However, since the data do not exhibit direction-dependent artifacts, we additionally experiment with data augmentation techniques like rotation and mirroring.

3.3 Loss function

The loss function in this paper is based on the loss function of Alhashim and Wonka [1]. It especially considers higher frequencies of an image and determines the error value accordingly; the reason for weighting higher frequencies is to improve the quality of the network with regard to object boundaries. The final loss value L is calculated as a weighted sum of a pairwise depth loss \(L_\mathrm{depth}\), the image gradient loss \(L_\mathrm{grad}\), and the structural similarity loss \(L_\mathrm{SSIM}\):

$$\begin{aligned} L(y, {\hat{y}}) = \lambda _{1} L_\mathrm{depth}(y, {\hat{y}}) + \lambda _{2} L_\mathrm{grad}(y, {\hat{y}}) + \lambda _{3} L_\mathrm{SSIM}(y, {\hat{y}}) \end{aligned}$$
(1)
$$\begin{aligned} L_\mathrm{depth}(y, {\hat{y}}) = \frac{1}{n} \sum _{p}^{n} \left| y_{p} - {\hat{y}}_{p} \right| \end{aligned}$$
(2)

First, the pairwise depth loss is given by Eq. 2. In this way, we obtain a pixel-level measure of the error between the ground truth and the predicted image.

$$\begin{aligned} L_\mathrm{grad}(y, {\hat{y}}) = \frac{1}{n} \sum _{p}^{n} \left| g_{x} (y_{p} - {\hat{y}}_{p}) \right| + \left| g_{y} (y_{p} - {\hat{y}}_{p}) \right| \end{aligned}$$
(3)

Second, \(L_\mathrm{grad}\) calculates the changes in the gradient with respect to the neighboring pixels. The image gradient g is calculated in two directions \(g_{x}\) and \(g_{y}\), which gives the differences in the x and y components of the depth images.

$$\begin{aligned} L_\mathrm{SSIM}(y, {\hat{y}}) = \frac{1- SSIM(y, {\hat{y}})}{2} \end{aligned}$$
(4)

The third part of the loss function is the Structural Similarity Index Measure (SSIM) proposed by Wang et al. [29]. This metric quantifies perceptual similarity based on three components: luminance, contrast, and structure. \(L_\mathrm{SSIM}\) is an inverted version of SSIM so that it can be minimized as a loss [1, 11, 28].

The individual parts of the loss function in Eq. 1 were weighted with \(\lambda _{1} = 0.1\), \(\lambda _{2} = 1\), and \(\lambda _{3} = 1\) as suggested by Alhashim and Wonka [1]. This weighting leads to a training that considers structural similarity and the gradient more strongly than the pairwise depth.
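
The weighted sum can be implemented in TensorFlow roughly as follows. This is a sketch under the assumption that depth maps are batched tensors of shape [batch, height, width, 1] in meters; the max_val passed to the SSIM computation is our assumption and should match the actual depth range.

import tensorflow as tf

def depth_loss(y_true, y_pred, l1=0.1, l2=1.0, l3=1.0, max_depth=4.0):
    # Point-wise depth term (Eq. 2)
    l_depth = tf.reduce_mean(tf.abs(y_true - y_pred))
    # Gradient term (Eq. 3); the gradient of the difference equals the
    # difference of the gradients in x and y
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    l_grad = tf.reduce_mean(tf.abs(dx_t - dx_p) + tf.abs(dy_t - dy_p))
    # Structural similarity term (Eq. 4)
    l_ssim = tf.reduce_mean((1.0 - tf.image.ssim(y_true, y_pred, max_val=max_depth)) / 2.0)
    # Weighted sum (Eq. 1) with lambda_1 = 0.1, lambda_2 = lambda_3 = 1
    return l1 * l_depth + l2 * l_grad + l3 * l_ssim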

4 Evaluation & results

We trained our models using TensorFlow and Keras. For the optimization, we chose the Adam variant AMSGrad with a learning rate of \(10^{-4}\), which was maintained for every experiment. Training was stopped early when the model did not improve for 10 epochs.
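
A corresponding training setup might look as follows; this is only a sketch, where the model, datasets, and loss function are placeholders from the previous sketches, and the epoch budget as well as restore_best_weights are our assumptions.

import tensorflow as tf

def train(model, train_ds, val_ds, loss_fn):
    # AMSGrad variant of Adam with a fixed learning rate of 1e-4
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, amsgrad=True),
                  loss=loss_fn)
    # Stop training after 10 epochs without improvement on the validation loss
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                            restore_best_weights=True)
    return model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[stop])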

The datasets described in Sect. 3.2 are divided into individual rooms and hallways. For each dataset, we use one room for testing while the rest of the dataset is used for training and validation in order to make sure that the networks are evaluated on a room they have not been trained on. For the test sets, we chose one room per dataset that makes up roughly 10% of the total image count and divide the remaining images randomly into training and validation data in order to obtain a rough 80-10-10 split. The datasets were uploaded to a repository (https://doi.org/10.5281/zenodo.5115276) where each room is stored in a numbered subfolder and the subfolder with the largest number contains the test set. For the Structure Sensor dataset, we have a total of 2693 images of which one room with 272 images is used for testing, while the rest of the dataset is divided into 2179 training and 242 validation images. In the case of the LiDAR dataset, which has a total of 3637 images, we chose a room for testing that includes 322 images, while training and validation use 2985 and 331 images, respectively.
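
A possible room-based split under the folder layout described above (numbered room subfolders, with the highest number reserved for testing) is sketched below; the file-name pattern is hypothetical.

import random
from pathlib import Path

def split_rooms(root, val_fraction=0.1, seed=42):
    # Numbered room subfolders; the highest-numbered room becomes the test set.
    rooms = sorted((p for p in Path(root).iterdir() if p.is_dir()),
                   key=lambda p: int(p.name))
    test_room, train_rooms = rooms[-1], rooms[:-1]
    test = sorted(test_room.glob("*_color.png"))  # hypothetical file-name pattern
    frames = [f for room in train_rooms for f in sorted(room.glob("*_color.png"))]
    # Shuffle the remaining frames and hold out roughly 10% for validation.
    random.Random(seed).shuffle(frames)
    n_val = int(len(frames) * val_fraction)
    return frames[n_val:], frames[:n_val], test  # train, validation, test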

The results from the evaluation using the test sets are listed in Sect. 4.2 where the metrics shown in Sect. 4.1 were applied.

4.1 Metrics

The evaluation results listed in Table 2 are obtained using the accuracy and error metrics shown here. We rely on established metrics that were used in prior work regarding depth estimation evaluation [1, 5, 21].

Root mean squared error (RMSE):

$$\begin{aligned} \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum _{p}^{n}{(y_{p} - {\hat{y}}_{p}})^2} \end{aligned}$$
(5)

Average \(log_{10}\) error:

$$\begin{aligned} \mathrm{log}_{10} = \frac{1}{n} \sum _{p}^{n} \left| \mathrm{log}_{10}(y_{p}) - \mathrm{log}_{10}({\hat{y}}_{p}) \right| \end{aligned}$$
(6)

Threshold accuracy (\(\delta _{i}\)):

$$\begin{aligned} {\delta = \max \left( \frac{{y_{p}}}{\hat{y_{p}}}, \frac{\hat{y_{p}}}{{y_{p}}}\right) } \end{aligned}$$
(7)

For the threshold accuracy, \(\delta \) is first computed for every pixel. The resulting values are then compared with three different thresholds \(a_{1} = 1.25^1\), \(a_{2} = 1.25^2\), and \(a_{3} = 1.25^3\). The threshold accuracy \(\delta _{i}\) represents the percentage of pixels for which \(\delta < a_{i}\).
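
A compact NumPy sketch of these metrics, assuming depth maps given as arrays of positive values in meters (the small epsilon guarding the logarithm against zero-valued pixels is our addition):

import numpy as np

def evaluate(y_true, y_pred, eps=1e-6):
    # RMSE (Eq. 5), average log10 error (Eq. 6), and threshold accuracies (Eq. 7)
    y_true = np.maximum(y_true.astype(np.float64), eps)
    y_pred = np.maximum(y_pred.astype(np.float64), eps)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    log10 = np.mean(np.abs(np.log10(y_true) - np.log10(y_pred)))
    delta = np.maximum(y_true / y_pred, y_pred / y_true)
    acc = {f"delta_{i}": np.mean(delta < 1.25 ** i) for i in (1, 2, 3)}
    return {"rmse": rmse, "log10": log10, **acc}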

4.2 Experiments

Table 2 is divided into three sections. The reference section refers to the pre-trained model from Alhashim and Wonka [1] which was originally trained on the NYU V2 dataset. Here, the test sets of the two datasets are evaluated without applying additional training in order to provide a baseline and to show that an existing model is generally not able to adapt to different input data that it was not specifically trained for. The other sections show experiments that are specific to the Structure Sensor dataset (SD) or the LiDAR dataset (LD). All results in Table 2 rely on the DenseNet backbone. A comparison between DenseNet and ResNet for the LiDAR dataset is shown in Sect. 4.3.

Table 2 Here, the various experiments with their respective accuracy and error values are listed

The Structure Sensor section is concerned with various input representations and transfer learning techniques. hole and wall denote the input representations, i.e., the dataset preprocessing. For the hole configuration, invalid distant pixels are set to zero, while wall uses the maximum value (4 m). \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\) have been used to train the network which was pre-trained on ImageNet [4]. \(SD^{T}_\mathrm{hole}\) and \(SD^{T}_\mathrm{wall}\), on the other hand, refer to the same dataset configurations but are used for transfer learning with the network trained on the NYU V2 dataset as a basis.

Since the transfer learning approach shows the best results for SD, we also applied it in the case of the LiDAR dataset (\(LD^{T}\)). Due to the smoothing applied within ARKit, the input data from the Apple iPad LiDAR sensor does not exhibit the same artifacts and distance inaccuracies as the Structure Sensor data, so we did not experiment with different input data processing techniques and used the raw data as recorded from ARKit for the training. However, since the data does not contain direction-dependent artifacts, we instead experimented with two different data augmentation schemes, a common technique to prevent overfitting. For \(LD^{T}_{2x}\), we duplicated the training data and rotated the copy as if the sensor was simply turned upside down. \(LD^{T}_{4x}\) further augments the training set: in addition to the rotation, we also mirrored the dataset horizontally and vertically. The validation set and the test set are not augmented so that the evaluation only considers real-world data.
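
The two augmentation schemes can be sketched as follows, assuming the color image and depth map are NumPy arrays with matching orientation:

import numpy as np

def augment_2x(rgb, depth):
    # LD_2x: add a copy rotated by 180 degrees (sensor turned upside down)
    return [(rgb, depth), (np.rot90(rgb, 2), np.rot90(depth, 2))]

def augment_4x(rgb, depth):
    # LD_4x: additionally add horizontally and vertically mirrored copies
    return augment_2x(rgb, depth) + [
        (np.fliplr(rgb), np.fliplr(depth)),   # horizontal mirror
        (np.flipud(rgb), np.flipud(depth)),   # vertical mirror
    ]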

Figure 4 shows a visual comparison of various example estimations as well as the absolute difference between the predictions and the actual sensor data.

Fig. 4 Here, a visual comparison between the different sensors and the estimated depth maps is shown. The images on the right show the absolute difference and were slightly increased in brightness for illustration purposes. Note that \(SD_\mathrm{ref}\) is compared with \(GT_\mathrm{wall}\) for the evaluation as well as the difference image

4.2.1 Structure Sensor

The reference experiment \(SD_\mathrm{ref}\) was done by feeding the test set into the neural network which was only trained with the NYU V2 dataset. This tests whether the existing network is already able to provide good results when using a different sensor than the Kinect v2. \(SD^{T}_\mathrm{wall}\) instead uses a transfer learning approach, after which the same evaluation is performed. The results show that the existing network cannot generalize to a different domain and that a transfer learning approach can greatly improve the result. We found that the original model can still provide somewhat plausible depth maps for the RGB images from the Structure Sensor dataset, but when actually comparing them with the ground truth data, the depth values are too small, i.e., the point cloud appears to be too close to the camera. This shows that the network is not able to adapt to a camera with markedly different intrinsic parameters. While the difference image for \(SD^T_\mathrm{wall}\) in Fig. 4 still shows some artifacts around edges, the transfer learning mostly corrected the range compared to the reference network.

As mentioned previously, the quality of the Structure Sensor decreases rapidly with increasing range, which is why a threshold of 4 m is introduced. The difference between \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\) lies in the way these regions are treated when processing the data and feeding it into the training procedure. While the data is set as invalid (i.e., to zero) in the case of \(SD_\mathrm{hole}\), it is set to the maximum value in \(SD_\mathrm{wall}\). Table 2 clearly shows that when using a maximum value to mark invalid regions instead of removing them, the neural network can learn a much better representation of the input data. This can also be seen in Fig. 4, where the hole-trained network is unable to learn a good representation of the invalid regions and instead generates obvious artifacts.

Lastly, \(SD^{T}_\mathrm{hole}\) and \(SD^{T}_\mathrm{wall}\) use the same data processing techniques as \(SD_\mathrm{hole}\) and \(SD_\mathrm{wall}\), respectively, but the training procedure uses a transfer learning approach starting from the network by Alhashim and Wonka [1] instead of the basic network with a DenseNet backbone trained only on ImageNet. As the results in Table 2 show, the transfer learning generally improves the results, but in the case of the hole data processing technique, the network still has problems learning the features, which leads to rather low accuracy and large errors in both cases.

Table 3 This comparison has the same structure as Table 2 but compares the results of a ResNet backbone and a DenseNet backbone on the LiDAR dataset, including the different augmentation factors

4.2.2 LiDAR

As shown in Sect. 4.2.1, transfer learning generally leads to better results. Therefore, we applied it in all experiments using the LiDAR dataset. Similar to \(SD_\mathrm{ref}\), \(LD_\mathrm{ref}\) shows that the existing network trained on NYU V2 is not able to provide accurate results for a different camera, as the range of the resulting data deviates from the test set, even though the result for LD is better than the one for SD. \(LD^{T}\) is obtained by applying a transfer learning which greatly improves the result as shown in Table 2. In fact, the results generally exceed the best results from the Structure Sensor experiments, including the reference network as shown in Fig. 4. This is likely due to the fact that the LiDAR sensor data are much smoother to begin with and include fewer artifacts that may disturb the training procedure. The smoothed depth provided by ARKit is not a pure sensor output and instead includes some internal processing that smooths the data over multiple frames. Since the resulting depth frames from ARKit do not exhibit obvious artifacts or holes due to shadowing that only occur in a specific direction, we augmented the dataset in two different ways in order to test whether this may improve the result. Table 2 shows that \(LD^{T}_{2x}\), which doubled the dataset by rotating the images by 180 degrees, led to a slight improvement. For \(LD^{T}_{4x}\), the dataset is further augmented by adding vertically and horizontally mirrored versions of the images. While the rotation of the images is comparable to a rotation of the sensor, mirroring may introduce samples that deviate from the real-world sensor. Still, the quadruple augmentation slightly improves the results further even though the gains are rather minor.

4.3 Comparison of ResNet and DenseNet

As mentioned in Sect. 3, we chose DenseNet as the backbone as it is advantageous with regard to depth estimation. DenseNet builds on the ideas of ResNet and is a deeper network, so it may be interesting to use the more lightweight ResNet if it still provides accurate results. Therefore, Table 3 shows a comparison between DenseNet-169 and ResNet-101 as the backbone. For ResNet, the decoder blocks were connected in a similar fashion to DenseNet as described in Sect. 3. Here, it becomes clear that DenseNet outperforms ResNet on every metric and dataset. However, ResNet can still provide reasonable results and may be useful in applications where DenseNet requires too much processing time or memory.
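
Swapping the backbone only changes the encoder and the layers from which the skip connections are taken; a brief sketch is shown below, with layer names following the Keras ResNet implementation (again assumptions rather than the authors' exact configuration).

import tensorflow as tf

def build_resnet_encoder(input_shape=(480, 640, 3)):
    # ResNet-101 encoder pre-trained on ImageNet; the decoder blocks from the
    # DenseNet sketch in Sect. 3.1 attach to these intermediate outputs.
    encoder = tf.keras.applications.ResNet101(
        include_top=False, weights="imagenet", input_shape=input_shape)
    skip_names = ("conv1_relu", "conv2_block3_out",
                  "conv3_block4_out", "conv4_block23_out")
    skips = [encoder.get_layer(n).output for n in skip_names]
    return encoder, skips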

5 Conclusion & future work

Depth information is an important feature for topics like 3D reconstruction or AR. Since such information may not always be available, for example due to hardware limitations or in the context of prototyping, a virtual replica of a depth sensor is a possible alternative. In this paper, several approaches for the simulation of depth sensors have been explored and evaluated. We included two different sensors with very distinct characteristics and used various data processing and augmentation techniques in order to achieve models that closely approximate the physical sensors. One sensor is the Structure Sensor, which is attached to an Apple iPad, while the second sensor is the LiDAR sensor built into more recent iPads.

We employ a network architecture that consists of a DenseNet backbone with several upsampling layers as proposed by Alhashim and Wonka [1]. We showed that DenseNet outperforms ResNet as a backbone in terms of accuracy. The model trained on the NYU V2 dataset was tested using RGB input images and evaluated against the Structure Sensor and LiDAR data to see how close it would get to the real data. However, the model is not able to adapt to the new domain as it provides depth images in a different range when fed with the respective RGB images. While it is expected that it cannot replicate a different sensor including its specific artifacts without training on data from it, this also shows that it cannot adapt to a camera with different intrinsic parameters. We therefore created datasets for both sensors in order to provide better approximations. We found that using a transfer learning approach from the model pre-trained on the NYU V2 dataset generally led to the best results, as it can adapt the existing network to the features of the new sensor without the need to create another dataset of the size of NYU V2.

For the Structure Sensor, where the range was limited, we tested two different ways of dealing with regions that are beyond the distance threshold. We found that inpainting the holes with the maximum value led to a superior training procedure compared to removing the invalid regions by setting them to zero. When inpainting before training, the network could replicate the respective regions for the most part, while removing them led to strong artifacts. In the latter case, these artifacts were largely impossible to differentiate from valid depth information, while the wall effect in the inpainting case means that a simple threshold applied to the network prediction can recreate the empty regions.

The experiments using data from the Apple iPad LiDAR sensor show how smoother data clearly facilitates the training procedure as the data provided by ARKit contains far less noise compared to the Structure Sensor. Therefore, no special data processing techniques are required in this case as the results already show a high accuracy. The accuracy can be further improved by relying on data augmentation.

For future work, the models may be further improved by providing larger datasets with more variety, as we relied largely on typical university indoor environments like seminar rooms, offices, and hallways. Further experiments in which the best performing networks are evaluated in tasks like 3D reconstruction would also be interesting for future research. In this context, it is also desirable to simplify the network in order to allow for shorter prediction times and facilitate real-time usage. While the network can provide data in real time, it does not reach the typical 30 frames per second that are possible with physical sensors. As the network is generally better at approximating the LiDAR sensor than the Structure Sensor, further refinement of the method used for the latter can be considered. Due to the applied inpainting, object boundaries appear somewhat unclear in certain cases. This could be remedied by applying sharpening [7] and edge feature extraction [30], which may further improve the depth estimation.