1 Introduction

Artificial intelligence (AI) and deep learning are key enablers in automated driving [RMM21]. One important task in the development of automated cars is the perception of the vehicle environment. While several sensors can be used to solve this task, the most promising approaches in industry and research rely at least partly on camera-based perception [RNFP19]. In the perception sub-tasks of object detection and semantic segmentation, deep learning-based approaches already outperform classical methods [GLGL18, ZZXW19].

However, those deep learning-based approaches suffer from several insufficiencies. Following Sämann et al., one central insufficiency is the limited ability to generalize from given training data [SSH20]. If the underlying distribution of the input data differs from the distribution of the training dataset, the prediction accuracy may decrease.

In the context of automated driving, a decrease in prediction accuracy can have dramatic consequences, such as fatal accidents caused by undetected traffic participants, as in the Uber accident [SSWS19]. One possibility to decrease the probability of these consequences is the optimization of the datasets used. A high coverage of all possible situations in the training data is crucial to improve the quality of trained neural networks. However, since the input spaces (e.g., image data) are very high-dimensional and complex, a rule-based assessment of coverage is not feasible [BLO+17].

Hence, in this work, the capability of variational autoencoders (VAEs) to extract meaningful distributions from datasets is used to compare different datasets in an unsupervised fashion. Since the VAE works directly with raw data, no assumptions about the dataset dimensions relevant for the comparison are necessary. Additionally, no already trained perception function is necessary. The underlying data distributions found by the VAE are analyzed to find similarities and novelties, using the latent space of the VAE. This makes it possible to estimate whether a particular dataset is in-domain or out-of-domain of a second dataset, thus allowing the detection of domain shifts between datasets.

During early function development, the presented methods can be used to extract datasets of manageable size that still maintain a high coverage of the relevant information in the complete data pool. These smaller datasets lead to faster training and validation and allow faster iterations. Moreover, the method can detect an unknown domain shift between the training, validation, and test datasets, which could otherwise lead to wrong conclusions about the performance of the function. In later development stages, the dataset comparison enables a prediction of whether an already trained function can be safely used in a different environment. Additionally, for domain shift detections with very high confidence, an online execution in parallel to a perception neural network is conceivable, acting as a warning system for possibly unknown input data.

A domain shift can be caused by a huge variety of factors, some of them potentially unknown to the developer. This makes the application of rule-based approaches impossible. The idea of the presented dataset comparison method is therefore to learn a multivariate distribution for all relevant aspects of one dataset and to find useful metrics to compare these distributions. The metrics used in this investigation are related to known novelty detection techniques for VAEs.

The main contribution of this chapter is the application of various novelty detection methods and various VAE architectures in several experiments, each testing a different kind of domain shift. The applied novelty detection methods comprise five different nearest neighbor approaches applied to the latent spaces. As a benchmark, the reconstruction error is also computed [VGM+20]. The applicability of these methods is investigated on three datasets, each providing one or more inherent domain shifts (e.g., weather, location, or daytime). The presented broad evaluation of different approaches can give insights into the practical application of such approaches in the development of safe perception functions for automated driving.

2 Related Works

In this section, we provide an overview of related topics. There are various ways to use VAEs for anomaly detection in datasets [AC15, SWXS18, VGM+20]. An et al. compute the reconstruction probability of a given test sample by inspecting several samples drawn from the latent distribution of test samples [AC15]. The underlying assumption is that anomalous data samples have a higher variance and will therefore show a lower reconstruction quality. Vasilev et al. propose several methods and metrics to compute novelty scores using VAEs [VGM+20]. They describe 19 different methods for novelty calculation, including latent space-based and reconstruction-based approaches. Their evaluation on a magnetic resonance imaging dataset and on MNIST showed that a variety of metrics outperform classical approaches such as nearest neighbor distance or principal component analysis (PCA) in the feature space.

In the domain of skin disease detection, VAE-based novelty detection methods were applied by Lu et al. [LX18]. They decompose the VAE loss into the reconstruction term and the KL term and use these as a novelty metric. Their results show better performance of the reconstruction-based approaches.

In the automotive domain, the issue of designing meaningful test datasets is addressed on various levels. Bach et al. address the selection of well-adjusted datasets for testing with a visual method for time series in recorded real-world data [BLO+17]. Langner et al. enhanced this approach with a method for dataset design, using the novelty detection capabilities of classical autoencoders [LBR+18]. A related approach uses methods from natural language processing (NLP) to describe sequential contexts in automotive datasets, allowing similarity analysis and novelty detection [RSBS20].

These approaches can only indirectly find domain shifts for perception functions, since they are not designed to work in the feature space itself. For the detection of domain shifts directly in raw data, Löhdefink et al. propose the analysis of the reconstruction error of an autoencoder to predict the quality of a semantic segmentation network [LFK+20]. Sundar et al. address the issue of the diversity of automotive data by training various \(\beta \)-VAEs using different labels such as weather or daytime [SRR+20].

There is a lot of research on application domains, proposed metrics, and network architectures in the area of dataset analysis with VAEs. However, most of the related approaches restrict their analysis to datasets with only a few degrees of freedom (e.g., [VGM+20] and [LX18]) or test their method on only one type of domain shift (e.g., different locations). In this work, a broad spectrum of approaches (different network architectures, different metrics) was evaluated on three datasets including five different domain shifts.

3 Building Blocks of Our Approach

In this work, four different VAE architectures and six different metrics for anomaly detection are examined with respect to their applicability for comparing automotive datasets. Five of the metrics use a nearest neighbor approach in the latent space for the detection of domain shifts, and one uses the reconstruction error in the pixel space.

In this section, the VAE DNN is introduced. Afterwards, the used novelty detection methods are presented, and the section concludes with the explanation for dataset comparison and domain shift detection.

3.1 Variational Autoencoder DNN

The difference between the original autoencoder and the VAE is that the VAE uses a stochastic model. The VAE consists of a stochastic encoder, a latent space, and a stochastic decoder. The general structure of a VAE can be seen in Fig. 1. Simplified, one could say that the VAE transfers the input data into a low-dimensional space to then reconstruct the input data from this space. The underlying stochastic model is now explained.

Fig. 1

Basic variational autoencoder structure, consisting of an encoder to transform data into the latent space and a decoder to reconstruct images from sampled values \(\mathbf {z}\). The encoder receives an image \(\mathbf {x}\) as input, where H is the height of the image, W is the width of the image, and C is the number of color channels

The probabilistic generative model of the VAE \({\mathrm p_{}}_{\boldsymbol{\theta }}(\mathbf {x}|\mathbf {z})\) is a deep neural network (DNN) with network parameters \(\boldsymbol{\theta }\) mapping \(\mathbf {z}\) into \(\mathbf {x}\) and is called the decoder [VGM+20, Dup18]. This decoder model learns a joint distribution of the data \(\mathbf {x}\) and a continuous latent variable \(\mathbf {z} \in \mathbb {R}^{d_\text {lat}}\) [CHK21]. In other words, the decoder describes the distribution of the decoded variable \(\hat{\mathbf {x}}\) given the encoded one \(\mathbf {z}\). The inference model of the VAE is trained to approximate the posterior distribution \(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})\sim \mathcal {N}(\mu (\mathbf {x}),\,\sigma ^{2}(\mathbf {x}))\) with parameters \(\boldsymbol{\phi }\) mapping \(\mathbf {x}\) into \(\mathbf {z}\) and is called the encoder [VGM+20]. The encoder thus describes the distribution of the encoded variable \(\mathbf {z}\) given the input \(\mathbf {x}\).

The probabilistic generative model and the inference model give rise to the first loss term, since the aim is for the input data of the encoder to be reconstructed by the decoder. The reconstruction term is defined as follows: \(\mathbb {E}_{\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})}[\log {\mathrm p_{}}_{\boldsymbol{\theta }}(\mathbf {x}|\mathbf {z})]\) [Dup18]. In order to improve the disentanglement properties of \(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})\), the VAE objective contains a second term that tries to fit \(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})\) to a prior \({\mathrm p_{}}(\mathbf {z})\) [HMP+17]. This term both controls the capacity of the latent information bottleneck and encourages statistical independence of the latent dimensions. This is achieved by defining \({\mathrm p_{}}(\mathbf {z})\) as an isotropic unit Gaussian distribution \({\mathrm p_{}}(\mathbf {z}) = \mathcal {N} (0, I)\) [HMP+17]. In order to train the disentanglement in the bottleneck, the VAE uses the Kullback–Leibler (KL) divergence of the approximate posterior from the prior, \(D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})||{\mathrm p_{}}(\mathbf {z}))\). Before we define the loss function, we introduce the variable \(\beta \), which serves as a weighting factor for the KL term. This allows us to define the loss function for the VAE and \(\beta \)-VAE as follows:

$$\begin{aligned} J(\boldsymbol{\theta },\boldsymbol{\phi }) = \mathbb {E}_{\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})}[\log {\mathrm p_{}}_{\boldsymbol{\theta }}(\mathbf {x}|\mathbf {z})] -\beta D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})||{\mathrm p_{}}(\mathbf {z})). \end{aligned}$$
(1)

The first term of the loss function is the reconstruction term, which is responsible for the VAE encoding an informative latent variable \(\mathbf {z}\) that allows the input data \(\mathbf {x}\) to be reconstructed [CHK21]. The second term regularizes the posterior distribution by pulling the distribution of the encoded latent variables \(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})\) toward the prior \({\mathrm p_{}}(\mathbf {z})\) [CHK21]. If \(\beta > 1\) is selected, the loss function of the \(\beta \)-VAE as shown in (1) is used, and if \(\beta = 1\) is selected, (1) reduces to the standard VAE loss [HMP+17]. The variable \(\beta \) scales the regularization term for the posterior distribution, with \(\beta > 1\) forcing the \(\beta \)-VAE to encode more disentangled latent variables by putting higher pressure on \(D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})||{\mathrm p_{}}(\mathbf {z}))\) and thereby matching the encoded latent variables more closely to the prior [HMP+17, CHK21].
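As a minimal sketch of how the two terms in (1) interact, the following Python snippet evaluates the objective for a single sample. It assumes a diagonal Gaussian posterior and the unit Gaussian prior, for which the KL divergence has a well-known closed form; the function names and the scalar `log_likelihood` input are illustrative assumptions, not part of the original method description. Note that (1) as written is a quantity to be maximized, so a training loop would typically minimize its negative.

```python
import math


def gaussian_kl_to_unit_prior(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def beta_vae_objective(log_likelihood, mu, sigma, beta=1.0):
    """Eq. (1): reconstruction log-likelihood minus the beta-weighted KL term.

    log_likelihood is assumed to be an estimate of E_q[log p(x|z)] for one sample.
    beta = 1 gives the standard VAE objective, beta > 1 the beta-VAE objective.
    """
    return log_likelihood - beta * gaussian_kl_to_unit_prior(mu, sigma)
```

With a posterior equal to the prior (all means 0, all standard deviations 1) the KL term vanishes, so increasing \(\beta \) only penalizes posteriors that deviate from the prior.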

The VAE and \(\beta \)-VAE only use continuous latent variables to model the latent variable \(\mathbf {z}\) [CHK21]. The next VAE architecture that we use for our experiments additionally uses discrete variables to disentangle the generative factors of the observed data [CHK21]. The JointVAE uses the discrete variable \(\mathbf {c}\) to decode the generative factors of the observed data [Dup18, CHK21]. The loss function of the JointVAE is an extension of the loss function of the \(\beta \)-VAE and results in (2).

$$\begin{aligned} \begin{aligned} J(\boldsymbol{\theta },\boldsymbol{\phi }) =&\,\, \mathbb {E}_{\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z},\mathbf {c}|\mathbf {x})}[\log {\mathrm p_{}}_{\boldsymbol{\theta }}(\mathbf {x}|\mathbf {z},\mathbf {c})]-\gamma \big |D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})||{\mathrm p_{}}(\mathbf {z}))-C_{\mathbf {z}}\big |\\ {}&-\gamma \big |D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {c}|\mathbf {x})||{\mathrm p_{}}(\mathbf {c}))-C_{\mathbf {c}}\big |. \end{aligned} \end{aligned}$$
(2)

The term \(D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {c}|\mathbf {x})||{\mathrm p_{}}(\mathbf {c}))\) is the Kullback–Leibler divergence for the discrete latent variables and \(D_{KL}(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z}|\mathbf {x})||{\mathrm p_{}}(\mathbf {z}))\) is the Kullback–Leibler divergence for the continuous latent variables [Dup18]. The variables \(\gamma \), \(C_{\mathbf {z}}\), and \(C_{\mathbf {c}}\) are hyperparameters [Dup18]. The hyperparameters \(C_{\mathbf {z}}\) and \(C_{\mathbf {c}}\) are gradually increased during training and control the amount of information the model can encode [Dup18]. The hyperparameter \(\gamma \) forces the KL divergences to match the capacities \(C_{\mathbf {z}}\) and \(C_{\mathbf {c}}\). As with the standard VAE, \(\mathrm {q}_{\boldsymbol{\phi }}(\mathbf {z},\mathbf {c}|\mathbf {x})\) is the inference model which maps \(\mathbf {x}\) into \(\mathbf {z}\) and \(\mathbf {c}\).
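The capacity-controlled terms in (2) can be illustrated with a short Python sketch. It rests on two assumptions that the text above does not state: the prior over a discrete variable is taken to be uniform, and the capacities are increased linearly from zero to a maximum value. All function names are hypothetical.

```python
import math


def categorical_kl_to_uniform(probs):
    """KL( q(c|x) || p(c) ) for one categorical variable, assuming a uniform prior p(c)."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def capacity(step, total_steps, c_max):
    """Assumed schedule: capacity grows linearly from 0 to c_max and then stays constant."""
    return c_max * min(1.0, step / total_steps)


def joint_vae_objective(log_likelihood, kl_z, kl_c, gamma, c_z, c_c):
    """Eq. (2): gamma penalizes any deviation of each KL term from its current capacity."""
    return log_likelihood - gamma * abs(kl_z - c_z) - gamma * abs(kl_c - c_c)
```

Because the penalty uses the absolute deviation, a KL term that is too small is penalized as well as one that is too large, which is what pushes each KL divergence toward its capacity.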

In contrast to \(\beta \)-VAE and JointVAE, which focus on optimized latent space distributions, the goal of the Nouveau VAE (NVAE) is to reconstruct the best possible images [VK21]. Unlike the other VAEs used here, the NVAE is a hierarchical VAE. This means that the NVAE has a hierarchical encoder and decoder structure and multiple latent spaces on different stages. Another special property is that this VAE has a bidirectional encoder [VK21]. This means that information does not only pass from one encoder to the next, but information from a deeper encoder is also passed back to the higher encoder. A focus during the development of the NVAE was to design expressive neural networks for VAEs [VK21].

In this chapter, the VAE [KW14], \(\beta \)-VAE [HMP+17], JointVAE [Dup18], and Nouveau VAE (NVAE) [VK21] are considered for the five experiments.

3.2 Methods for Novelty Detection

Nearest neighbor approaches for novelty detection assume that the distributions of new and unknown data are different from the distributions of normal data. So, the latent space distance between a test sample distribution \({\mathrm p_{}}_{t}\) and the closest normal sample can be used as a novelty score. Since latent spaces consist of high-dimensional distributions, the selection of a feasible metric to compute meaningful distances is non-trivial. According to Sect. 3.1, the distributions \({\mathrm p_{}}_{n}\) of the normal samples are defined by their parameters \(\boldsymbol{\mu }_n\) and \(\boldsymbol{\sigma }_n\). As the JointVAE has different latent spaces, only the continuous latent space was used for the calculation of the nearest neighbor.

In this chapter, five metrics in the latent space and one metric in the pixel space are evaluated. The five metrics in the latent space can be further divided into two groups. Group one includes the metrics that only use the means (\(\boldsymbol{\mu }\)) of the distribution, and group two includes the metrics that use both means (\(\boldsymbol{\mu }\)) and standard deviations (\(\boldsymbol{\sigma }\)) for the calculation.

First, the metrics from group one are presented that only use means for the calculation. The cosine distance is a widely used metric to determine the similarity of vectors [NB10, Ye11]. Since it is designed for vectors, it can not directly be applied to distributions. Therefore, only the means \(\boldsymbol{\mu }\) of the normal distributions are used for calculating the cosine distance. The cosine distance can be computed by

$$\begin{aligned} m_{\mathrm {cos}}({\mathrm p_{}}_n,{\mathrm p_{}}_t)= \frac{\boldsymbol{\mu }_n\boldsymbol{\mu }_t^T}{\Vert \boldsymbol{\mu }_n \Vert \Vert \boldsymbol{\mu }_t\Vert }, \end{aligned}$$
(3)

where \(\boldsymbol{\mu }_n=(\mu _n(\delta ))\) and \(\boldsymbol{\mu }_t=(\mu _t(\delta ))\) describe the d-dimensional mean vectors of two distributions \({\mathrm p_{}}_n\) and \({\mathrm p_{}}_t\), related to a normal and a test sample.

The Euclidean distance is the distance that can be measured with a ruler between two points and is defined as

$$\begin{aligned} m_{\mathrm {euc}}({\mathrm p_{}}_n,{\mathrm p_{}}_t)= \sqrt{\sum _{\delta =1}^{d} (\mu _{n}(\delta ) - \mu _{t}(\delta ))^{2}}. \end{aligned}$$
(4)

Here and in the following definitions, d is the dimensionality of the latent space. The Manhattan distance is inspired by the grid layout of Manhattan, New York City, and measures the distance between two vectors as the sum of their components’ absolute differences [BR14]. It is defined as

$$\begin{aligned} m_{\mathrm {man}}({\mathrm p_{}}_n,{\mathrm p_{}}_t)= \sum _{\delta =1}^{d} |\mu _{n}(\delta ) - \mu _{t}(\delta )|. \end{aligned}$$
(5)
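The three mean-based metrics in (3)–(5) can be sketched in a few lines of Python. One point deserves care: the expression in (3) is the normalized dot product, which is largest for vectors pointing in the same direction. The sketch therefore exposes this value as `cosine_similarity` and, as an assumption on our side, derives a distance as one minus the similarity so that smaller values mean closer samples, as a nearest neighbor search requires. Function names are illustrative.

```python
import math


def cosine_similarity(mu_n, mu_t):
    """Normalized dot product of two mean vectors, as written in Eq. (3)."""
    dot = sum(a * b for a, b in zip(mu_n, mu_t))
    norm_n = math.sqrt(sum(a * a for a in mu_n))
    norm_t = math.sqrt(sum(b * b for b in mu_t))
    return dot / (norm_n * norm_t)


def cosine_distance(mu_n, mu_t):
    """Assumed convention: distance = 1 - similarity, so identical directions give 0."""
    return 1.0 - cosine_similarity(mu_n, mu_t)


def euclidean_distance(mu_n, mu_t):
    """Eq. (4): square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_n, mu_t)))


def manhattan_distance(mu_n, mu_t):
    """Eq. (5): sum of the absolute component differences."""
    return sum(abs(a - b) for a, b in zip(mu_n, mu_t))
```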

The metrics presented so far only use the means \(\boldsymbol{\mu }\). Next, two metrics are introduced that use both means \(\boldsymbol{\mu }\) and standard deviations \(\boldsymbol{\sigma }\) for the calculation of the nearest neighbor. The next metric was defined by the authors of this chapter, using the well-known z-score. From the test sample’s distribution, a vectorial value \(\mathbf {z}_t=(z_t(\delta ))\) is sampled. Then the distance is computed via

$$\begin{aligned} m_{\mathrm {zsc}}({\mathrm p_{}}_n,{\mathrm p_{}}_t) = \sum _{\delta =1}^{d} \biggr \Vert \frac{z_t(\delta )-\mu _{n}(\delta )}{\sigma _n(\delta )} \biggr \Vert ^2, \end{aligned}$$
(6)

where the difference between \(\mathbf {z}_t\) and the normal sample mean \(\boldsymbol{\mu }_n\) is divided element-wise by the standard deviation \(\boldsymbol{\sigma }_n\). Since we are only interested in the absolute distance of the distributions, we compute the squared L2 norm. The Bhattacharyya distance

$$\begin{aligned} m_{\mathrm {bha}}({\mathrm p_{}}_n,{\mathrm p_{}}_t)= \sum _{\delta =1}^{d} \frac{1}{4} \cdot \log \left( \frac{1}{4} \left( \frac{\sigma _{n}^{2}(\delta )}{\sigma _{t}^{2}(\delta )} + \frac{\sigma _{t}^{2}(\delta )}{\sigma _{n}^{2}(\delta )} + 2\right) \right) + \frac{1}{4} \left( \frac{(\mu _{n}(\delta ) - \mu _{t}(\delta ))^{2}}{\sigma _{n}^{2}(\delta ) + \sigma _{t}^{2}(\delta )} \right) , \end{aligned}$$
(7)

as proposed by Vasilev et al. [VGM+20], is also tested. The Bhattacharyya distance measures the similarity of two distributions [CA79].
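A minimal Python sketch of the two distribution-based metrics in (6) and (7) follows. It assumes diagonal Gaussians described by per-dimension means and standard deviations, and it draws \(\mathbf {z}_t\) as mean plus standard deviation times standard normal noise; the function names and the explicit random generator argument are illustrative choices. In (7), both summands are taken to be inside the sum over the latent dimensions.

```python
import math
import random


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def z_score_distance(z_t, mu_n, sigma_n):
    """Eq. (6): summed squared z-scores of the sampled test value w.r.t. a normal sample."""
    return sum(((z - m) / s) ** 2 for z, m, s in zip(z_t, mu_n, sigma_n))


def bhattacharyya_distance(mu_n, sigma_n, mu_t, sigma_t):
    """Eq. (7): per-dimension Bhattacharyya distance of two Gaussians, summed over dimensions."""
    total = 0.0
    for mn, sn, mt, st in zip(mu_n, sigma_n, mu_t, sigma_t):
        var_n, var_t = sn * sn, st * st
        total += 0.25 * math.log(0.25 * (var_n / var_t + var_t / var_n + 2.0))
        total += 0.25 * (mn - mt) ** 2 / (var_n + var_t)
    return total
```

Because the z-score uses a sampled value, repeated evaluations differ unless the random generator is seeded; the Bhattacharyya distance is deterministic and is zero for identical distributions.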

To also consider a metric which does not use the latent space, the reconstruction error in the pixel space is evaluated. Since the VAE is trained only with normal data, the reconstruction quality should decrease when new data samples are processed. Specifically, we chose the deterministic reconstruction error

$$\begin{aligned} m_{\mathrm {dre}}(\mathbf {x},\hat{\mathbf {x}})= \big \Vert \mathbf {x} - \hat{\mathbf {x}}\big \Vert ^{2}, \end{aligned}$$
(8)

as proposed by Vasilev et al. [VGM+20], since it yielded the best results in their experiments.
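As a small illustration of (8), the following Python sketch computes the squared L2 norm between an image and its reconstruction, both given as flat sequences of pixel values. That the reconstruction is obtained without sampling noise (for example, by decoding the latent mean) is our reading of "deterministic" and should be treated as an assumption; the function name is hypothetical.

```python
def deterministic_reconstruction_error(x, x_hat):
    """Eq. (8): squared L2 norm between a flattened image and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("image and reconstruction must have the same number of values")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))
```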

3.3 Dataset Comparison Approach

In this section, we explain how the methods described in Sect. 3.2 can be leveraged to compare whole datasets by explaining the data pipeline, which is also shown in Fig. 2. As starting points of the pipeline, a training dataset \(\mathcal {D}^\mathrm {tr}\), a validation dataset \(\mathcal {D}^\mathrm {val}\), which is in the distribution of the training dataset, and a test dataset \(\mathcal {D}^\mathrm {dom}\) are needed. \(\mathcal {D}^\mathrm {dom}\) is the dataset for which the method tests whether it is in-domain or out-of-domain of the respective training dataset. The presented metrics should therefore be able to separate \(\mathcal {D}^\mathrm {val}\) and \(\mathcal {D}^\mathrm {dom}\). However, \(\mathcal {D}^\mathrm {val}\) will most probably contain single images with higher novelty scores, and \(\mathcal {D}^\mathrm {dom}\) can contain already known images. Therefore, the metrics most probably cannot perfectly separate \(\mathcal {D}^\mathrm {val}\) and \(\mathcal {D}^\mathrm {dom}\) on the level of single images. This is solved by an analysis of the novelty metric results on a dataset level.

Fig. 2

Approach for dataset comparison based on latent space-based novelty scores. After training, the encoder is used to transform images into their latent space. In this space, different metrics can be applied to calculate nearest neighbor distances between different datasets. Domain shifts can then be detected by analyzing the deviations between the histograms

The first step of the pipeline is the training of a VAE with the \(\mathcal {D}^\mathrm {tr}\) dataset. For the novelty detection, the decoder can be discarded after training. As shown in Fig. 2, the trained encoder can now be used to generate the latent space representations of \(\mathcal {D}^\mathrm {tr}\), \(\mathcal {D}^\mathrm {val}\), and \(\mathcal {D}^\mathrm {dom}\), spanned by the vectors \(\boldsymbol{\mu }^\mathrm {tr}\), \(\boldsymbol{\sigma }^\mathrm {tr}\), \(\boldsymbol{\mu }^\mathrm {val}\), \(\boldsymbol{\sigma }^\mathrm {val}\), \(\boldsymbol{\mu }^\mathrm {dom}\), and \(\boldsymbol{\sigma }^\mathrm {dom}\).
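Once the encoder outputs are available, a nearest neighbor novelty score per sample can be sketched as the smallest distance to any reference sample. The sketch below assumes that the reference set consists of the latent means of \(\mathcal {D}^\mathrm {tr}\) and uses a brute-force search; the Manhattan distance is only an example, and any metric with the same call signature could be passed in. Names are illustrative.

```python
def manhattan_distance(mu_n, mu_t):
    """Sum of absolute component differences between two latent mean vectors."""
    return sum(abs(a - b) for a, b in zip(mu_n, mu_t))


def nearest_neighbor_scores(reference, queries, distance):
    """For each query, return the distance to its closest reference sample.

    reference: latent representations of the training data (assumed reference set)
    queries:   latent representations of the validation or test data
    distance:  callable taking (reference_sample, query_sample)
    """
    return [min(distance(ref, query) for ref in reference) for query in queries]
```

The brute-force search costs one distance evaluation per pair of reference and query samples, which is acceptable for a sketch but may call for an index structure on large datasets.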

Now, the novelty scores provided in Sect. 3.2 can be applied to every latent space representation of all data samples in \(\mathcal {D}^\mathrm {val}\) and \(\mathcal {D}^\mathrm {dom}\). To decide whether \(\mathcal {D}^\mathrm {dom}\) has a large gap to \(\mathcal {D}^\mathrm {val}\), the novelty scores have to be compared. In case of the nearest neighbor approaches, the resulting novelty scores \(m_{\mathrm {val}}=\{(\mathrm {val}_1,w_{\mathrm {val}_1}),\ldots ,(\mathrm {val}_n,w_{\mathrm {val}_n})\}\) and \(m_{\mathrm {dom}}=\{(\mathrm {dom}_1,w_{\mathrm {dom}_1}),\ldots ,(\mathrm {dom}_n,w_{\mathrm {dom}_n})\}\) are represented as histograms, where \(n \in \mathbb {N}\) is the number of buckets, \(\mathrm {val}_a\) and \(\mathrm {dom}_b\) are bucket representatives with \(0 < a,b \le n\) and \(a,b \in \mathbb {N}\), and \(w_{\mathrm {val}_a}\), \(w_{\mathrm {dom}_b}\) are the weights of the buckets [RTG00]. To compare the histograms, we use the earth mover’s distance (EMD) [RTG00]. The EMD measures the difference between the histograms \(m_{\mathrm {val}}\) and \(m_{\mathrm {dom}}\) by calculating the amount of earth \(f_{ab}\) which is to be moved from bucket \(\mathrm {val}_a\) to \(\mathrm {dom}_b\) [RTG00]. To calculate the distance, the EMD also needs the ground distance matrix \(\mathbf {D}=(d_{ab})\), where \(d_{ab}\) is the ground distance between bucket \(\mathrm {val}_a\) and \(\mathrm {dom}_b\) [RTG00]. With these terms, the EMD can be defined as follows:

$$\begin{aligned} EMD(m_{\mathrm {val}},m_{\mathrm {dom}})= \frac{\sum _{a=1}^{n} \sum _{b=1}^{n} d_{ab}f_{ab}}{\sum _{a=1}^{n} \sum _{b=1}^{n} f_{ab}}. \end{aligned}$$
(9)

Here, \(\mathbf {F}=(f_{ab})\) is the flow between \(m_{\mathrm {val}}\) and \(m_{\mathrm {dom}}\) that minimizes the overall cost \(\sum _{a=1}^{n} \sum _{b=1}^{n} d_{ab}f_{ab}\) under the following constraints [RTG00]:

$$\begin{aligned} f_{ab} \ge 0 \qquad 1 \le a\le n, 1 \le b\le n, \end{aligned}$$
(10)
$$\begin{aligned} \sum _{b=1}^{n} f_{ab}\le w_{\mathrm {val}_a} \qquad 1 \le a\le n, \end{aligned}$$
(11)
$$\begin{aligned} \sum _{a=1}^{n} f_{ab}\le w_{\mathrm {dom}_b} \qquad 1 \le b\le n, \end{aligned}$$
(12)
$$\begin{aligned} \sum _{a=1}^{n} \sum _{b=1}^{n} f_{ab} = \min \biggr (\sum _{a=1}^{n} w_{\mathrm {val}_a}, \sum _{b=1}^{n} w_{\mathrm {dom}_b}\biggr ). \end{aligned}$$
(13)

The histograms used are normalized on the x-axis to a range of values between 0 and 1. Additionally, the y-axis is also normalized, such that the histogram bars sum to 1. With this normalization, comparability between different experiments is achieved, since the results are independent of absolute metric values and dataset sizes.
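For one-dimensional histograms with equal total weight, the optimization in (9)–(13) has a simple solution: the EMD equals the accumulated absolute difference of the two cumulative histograms, scaled by the bucket spacing. The Python sketch below uses this shortcut. It makes several assumptions not fixed by the text: both score sets share one min-max normalization of the x-axis, the buckets are equally wide, and the ground distance \(d_{ab}\) is the distance between bucket positions on the normalized axis. Function names and the default bucket count are illustrative.

```python
def normalized_histogram(scores, lo, hi, n_bins):
    """Histogram of scores over [lo, hi] with equal-width buckets whose weights sum to 1."""
    if hi <= lo:
        raise ValueError("hi must be larger than lo")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        idx = min(int((s - lo) / width), n_bins - 1)  # put s == hi into the last bucket
        counts[idx] += 1
    return [c / len(scores) for c in counts]


def emd_1d(w_val, w_dom):
    """EMD of two unit-mass histograms on the same n buckets spanning [0, 1].

    Assumes ground distance d_ab = |a - b| / n, i.e. the bucket spacing on the
    normalized x-axis. The total flow is 1, so the denominator of Eq. (9) is 1.
    """
    n = len(w_val)
    carried = 0.0  # earth that still has to be moved to the next bucket
    work = 0.0
    for a, b in zip(w_val, w_dom):
        carried += a - b
        work += abs(carried)
    return work / n


def compare_scores(scores_val, scores_dom, n_bins=50):
    """Normalize both score sets jointly to [0, 1] and return the EMD of their histograms."""
    lo = min(min(scores_val), min(scores_dom))
    hi = max(max(scores_val), max(scores_dom))
    if hi == lo:
        return 0.0  # all scores identical, nothing to move
    hist_val = normalized_histogram(scores_val, lo, hi, n_bins)
    hist_dom = normalized_histogram(scores_dom, lo, hi, n_bins)
    return emd_1d(hist_val, hist_dom)
```

With this ground distance the result lies between 0 and (n - 1)/n, so values from different experiments remain on a common scale.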

For the reconstruction error, the histogram approach is comparable with the distances to the nearest neighbors. The histograms then are defined over the deterministic reconstruction error instead of the latent space distance.

4 Experimental Setup

In this section, the selection of the datasets is explained first. Afterwards, the experiments are explained and the hyperparameter values are given. The section concludes with the training parameters for the experiments.

4.1 Datasets

For the evaluation of our method, we selected different datasets which all have different focuses. The first dataset is the Oxford RobotCar Dataset [MPLN17]. In this dataset, the same route was driven multiple times over the time period of 1 year, except for a few small deviations. The special feature of this dataset is that different weather conditions and seasons were recorded on the same route. This allows various domain shift analyses, such as changing weather conditions, e.g., sunny weather vs. snow. In the training, validation, snow test, rain test, and night test data splits there are 9130, 3873, 4094, 9757, and 9608 images, respectively. For all experiments, the test split is created to test whether the data is in-domain or out-of-domain.

As a second dataset, the Audi autonomous driving dataset (A2D2) [GKM+20] was used. This dataset is relatively new and was acquired in different German cities. For our evaluation, we compared the images from the city of Gaimersheim with the images from the city of Munich. The assumption with this dataset is that it shows relatively little variation because it was recorded in two German cities in the state of Bavaria. Thus, it should result in a very low dataset distance. A small difference is nevertheless expected because Munich is larger than Gaimersheim, and therefore it is a comparison between an urban and a suburban environment. The dataset is divided into a training dataset of 12550 images, a validation dataset of 3138 images, and a test dataset of 5490 images.

As a third dataset, the nuScenes dataset [CBL+20] was selected. This dataset was selected because it was recorded in Boston and Singapore. The interesting point in the comparison is that there is right-hand traffic in Boston and left-hand traffic in Singapore. A total of 30013 images are used for training, 40152 images for validation, and 20512 images for testing.

All datasets consist of a number of unconnected video sequences. To avoid similar images in different dataset splits, the split between \(\mathcal {D}^\mathrm {tr}\), \(\mathcal {D}^\mathrm {val}\), and \(\mathcal {D}^\mathrm {dom}\) was done sequence-wise. This ensures that every scene occurs in exactly one dataset. Only for the A2D2 dataset, the split between \(\mathcal {D}^\mathrm {tr}\) and \(\mathcal {D}^\mathrm {val}\) was made inside the Gaimersheim sequence, resulting in a few very similar images just before and after the split. However, this should not have a big effect, since the similar images make up only a very small percentage of the whole dataset.
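A sequence-wise split can be sketched as follows: whole sequences, not single frames, are assigned to a split, so that neighboring frames never end up in different datasets. The representation of a frame as a `(sequence_id, frame_path)` pair, the random assignment, and the function name are hypothetical choices for illustration; how the sequences were actually assigned in the experiments is not specified here.

```python
import random


def split_by_sequence(frames, val_fraction, seed=0):
    """Split frames into (train, val) so that no sequence appears in both.

    frames: list of (sequence_id, frame_path) tuples (hypothetical layout)
    val_fraction: approximate share of sequences assigned to validation
    """
    seq_ids = sorted({seq for seq, _ in frames})
    rng = random.Random(seed)
    rng.shuffle(seq_ids)
    n_val = max(1, round(len(seq_ids) * val_fraction))
    val_ids = set(seq_ids[:n_val])
    train = [frame for frame in frames if frame[0] not in val_ids]
    val = [frame for frame in frames if frame[0] in val_ids]
    return train, val
```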

4.2 Experiments

With the three datasets described in Sect. 4.1, five different domain shifts were generated. These domain shifts are listed in Table 1.

Table 1 All our datasets and experiments

The five different domain shifts were investigated with four different VAE variants. Each variant was tested with 6 different metrics, leading to 120 experiments in total.
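The experiment count follows directly from the Cartesian product of domain shifts, architectures, and metrics. The short Python sketch below enumerates the grid; the string labels are our own shorthand for the shifts, architectures, and metrics named in this chapter.

```python
from itertools import product

DOMAIN_SHIFTS = [
    "Oxford: sun vs. snow",
    "Oxford: sun vs. rain",
    "Oxford: sun vs. night",
    "A2D2: Gaimersheim vs. Munich",
    "nuScenes: Boston vs. Singapore",
]
ARCHITECTURES = ["VAE", "beta-VAE", "JointVAE", "NVAE"]
METRICS = [
    "cosine",
    "euclidean",
    "manhattan",
    "z-score",
    "bhattacharyya",
    "reconstruction error",
]

# One experiment per (shift, architecture, metric) combination: 5 * 4 * 6.
EXPERIMENT_GRID = list(product(DOMAIN_SHIFTS, ARCHITECTURES, METRICS))
```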

Experiments 1–3 were performed with the Oxford dataset. For training, images were taken on sunny days and as validation data, images were selected that were taken on an alternative route under the same weather conditions. This results in the following experiments for the Oxford dataset. Experiment 1 tests if the latent space can be used to distinguish sunny days from snowy days. The second experiment tests how rain can be distinguished from sun, and the third experiment investigates how sunny days differ from night scenes (Fig. 3).

Experiment 4 uses the A2D2 dataset. This experiment only has a very small domain shift, since we compare data recorded in the suburban area of the small German city Gaimersheim with data recorded in the urban area of Munich. Experiment 5 uses the nuScenes dataset. This experiment used images from Boston as training data and, as in Experiment 4, used a holdout of the data as validation data. Images from Singapore were used as a test set. Since night scenes were also recorded in Singapore, it has been ensured that no night scenes are in the test dataset.

To allow comparison of the experiments and metrics, baseline experiments were conducted in addition to the domain shift experiments. This is necessary to find out which value the metric has for in-domain data. For this purpose, the validation dataset \(\mathcal {D}^\mathrm {val}\) was divided into \(\mathcal {D}^\mathrm {val_0}\) and \(\mathcal {D}^\mathrm {val_1}\). This makes it possible to define a baseline for each experiment by setting \(\mathcal {D}^\mathrm {val}=\mathcal {D}^\mathrm {val_0}\) and \(\mathcal {D}^\mathrm {dom}=\mathcal {D}^\mathrm {val_1}\).

Fig. 3

These histograms show the results for Experiment 2 and the JointVAE with the z-score and cosine distance. Training was done with the sunny Oxford images (oxford sun), and it was tested whether rainy images (oxford rain) are out-of-domain. The histograms on the left show a clear difference between rain and sun, whereas the histograms on the right largely overlap and are poorly separated

4.3 VAE Hyperparameters

To ensure comparability between the VAE, \(\beta \)-VAE, and JointVAE, we used the same layer architecture and hyperparameters. The goal of this chapter is to give an initial indication of whether a domain shift can be detected with straightforwardly chosen hyperparameters. It is to be expected that better results can be achieved by optimizing the hyperparameters on multiple datasets.

To evaluate the performance of the presented methods, we used the standard VAE, \(\beta \)-VAE, JointVAE, and the NVAE. The architecture was the same for the VAE, \(\beta \)-VAE, and the JointVAE. We use five hidden layers with dimensions 32, 64, 128, 256, and 512. After the hidden layers, we add three residual blocks. The residual blocks are followed by a latent space with 100 neurons. Both \(\beta \) and \(\gamma \) were set to 10.5, and the input size of the images was set to \(512 \times 256\). For the NVAE, we adopted the architecture from the paper [VK21].

4.4 Training

The VAE models used for the domain shift analysis were implemented and trained using PyTorch [PGC+17]. All VAE architectures were trained for 50 epochs with a batch size of 128 images. In addition, a learning rate of 0.0001 in combination with the Adam optimizer [KB15] was used. Figure 4 shows an original image and its reconstructions by the different VAEs. The selected image is from the validation dataset.
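For reference, the hyperparameters stated in Sects. 4.3 and 4.4 for the VAE, \(\beta \)-VAE, and JointVAE can be collected in a small configuration object. The Python sketch below is only a bookkeeping aid: the class and field names are hypothetical, the order of the two input size values is copied as written without assuming which is width and which is height, and the helper for the number of batches per epoch assumes that an incomplete final batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VAETrainingConfig:
    hidden_dims: tuple = (32, 64, 128, 256, 512)  # five hidden layers
    num_residual_blocks: int = 3
    latent_dim: int = 100
    beta: float = 10.5    # weighting of the KL term (beta-VAE)
    gamma: float = 10.5   # weighting of the capacity terms (JointVAE)
    input_size: tuple = (512, 256)  # as stated in the text
    epochs: int = 50
    batch_size: int = 128
    learning_rate: float = 1e-4
    optimizer: str = "adam"

    def batches_per_epoch(self, num_images):
        """Number of batches per epoch, assuming the last partial batch is not dropped."""
        return math.ceil(num_images / self.batch_size)
```

For example, the 9130 Oxford training images would correspond to 72 batches per epoch under this assumption.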

Fig. 4

Figure a shows an image from the Oxford validation dataset \(\mathcal {D}^\mathrm {val}\). Figures b–e show the reconstructions of the VAE architectures used in this chapter. It is clear that the NVAE has the best reconstruction quality. As expected, the reconstruction of the VAE (b), \(\beta \)-VAE (c), and JointVAE (d) is worse compared to the NVAE. However, there are differences in the reconstruction quality of (b)–(d), e.g., in the reconstruction of the JointVAE, in contrast to the VAE and \(\beta \)-VAE, the car at the right edge can be discerned. The fact that these reconstructions are worse is not a problem, because the focus of our method lies on the latent space

5 Experimental Results and Discussion

In this section, the results are presented. Section 5.1 reports the results obtained with the nearest neighbor method, and Sect. 5.2 discusses them.

5.1 Results

To make the results of the different architectures and metrics comparable, we calculated, for each experiment and architecture, the ratio of the out-of-domain distance to the in-domain distance. This is done by

$$\begin{aligned} Q_{EMD} = \frac{EMD(\mathcal {D}^\mathrm {val},\mathcal {D}^\mathrm {dom})}{EMD(\mathcal {D}^\mathrm {val_0},\mathcal {D}^\mathrm {val_1})} \end{aligned}$$
(14)

and leads to \(Q_{EMD} > 1\) for a detected domain shift. The intuition is that the earth mover’s distance between the distributions of latent space distances of two in-domain datasets should be very low, whereas it should be higher when one of the datasets is out-of-domain. The results for the Oxford data are given in Tables 2, 3, and 4. The NVAE achieves the best results in all three experiments. With the NVAE, the reconstruction error detects the domain shift in all three experiments, although the z-score with the NVAE works slightly better on the rain images.
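The ratio in (14) can be illustrated with a minimal sketch. It assumes that each dataset has already been reduced to a one-dimensional sample of latent space distances; the function names `emd_1d` and `q_emd` and the toy numbers are hypothetical and are not taken from the experiments.

```python
from bisect import bisect_right


def emd_1d(u, v):
    """Earth mover's (first Wasserstein) distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    support = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(support, support[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def q_emd(d_val, d_dom, d_val0, d_val1):
    """Ratio of the out-of-domain EMD to the in-domain EMD, as in Eq. (14)."""
    in_domain = emd_1d(d_val0, d_val1)
    if in_domain == 0.0:
        raise ValueError("in-domain EMD is zero; the ratio is undefined")
    return emd_1d(d_val, d_dom) / in_domain


# Toy distance samples (illustrative only, not experimental data).
val0 = [1.0, 2.0, 3.0]
val1 = [1.5, 2.5, 3.5]
val = [1.0, 2.0, 3.0]
dom = [3.0, 4.0, 5.0]
ratio = q_emd(val, dom, val0, val1)
shift_detected = ratio > 1.0
```

For one-dimensional samples, `scipy.stats.wasserstein_distance` computes the same quantity as `emd_1d`; the pure-Python version is used here only to keep the sketch free of dependencies.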

The values in the A2D2 experiment are lower compared to the Oxford values. This was expected, as the domain shift is significantly smaller compared to the Oxford domain shifts. Different architectures and metrics are close to each other in this experiment. The VAE with the z-score has the best result.

In the nuScenes dataset, the ratio is higher than in the A2D2 experiment. For this experiment, different combinations are again nearly the same, but the VAE with the Manhattan distance is the best.

An important aspect of the experimental results is their stability, meaning that a combination of architecture and metric should correctly detect every kind of domain shift. For most combinations, this is not fulfilled, as can be seen in Tables 2, 3, 4, 5, and 6. One example is the VAE with the Bhattacharyya distance: this combination yields values above one in four of the five experiments (Tables 2, 3, 4, and 6) but fails to detect the domain shift in the A2D2 experiment (Table 5).
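The stability criterion described above can be written down as a short check. This is only a sketch: the combination names, experiment labels, and ratio values below are placeholders chosen for illustration and are not the values reported in Tables 2 to 6.

```python
def is_stable(ratios, boundary=1.0):
    """True if every Q_EMD ratio lies strictly above the in-domain/out-of-domain boundary."""
    return all(q > boundary for q in ratios.values())


# Placeholder ratios for illustration only; NOT values from the tables.
example_ratios = {
    ("arch_a", "metric_1"): {
        "snow": 1.8, "rain": 1.4, "night": 2.3, "a2d2": 0.9, "nuscenes": 1.2,
    },
    ("arch_b", "metric_2"): {
        "snow": 1.3, "rain": 1.1, "night": 1.6, "a2d2": 1.05, "nuscenes": 1.4,
    },
}

# Keep only the combinations that detect the shift in every experiment.
stable_combinations = [
    combo for combo, ratios in example_ratios.items() if is_stable(ratios)
]
```

In this toy setting, `arch_a` with `metric_1` is rejected because a single experiment falls below the boundary, which mirrors the behavior described for the VAE with the Bhattacharyya distance.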

Table 2 The result of \(Q_{EMD}\) (14) with \(\mathcal {D}^\mathrm {val}\) from Oxford alternate route and \(\mathcal {D}^\mathrm {dom}\) from Oxford snow
Table 3 The result of \(Q_{EMD}\) (14) with \(\mathcal {D}^\mathrm {val}\) from Oxford alternate route and \(\mathcal {D}^\mathrm {dom}\) from Oxford rain
Table 4 The result of \(Q_{EMD}\) (14) with \(\mathcal {D}^\mathrm {val}\) from Oxford alternate route and \(\mathcal {D}^\mathrm {dom}\) from Oxford night
Table 5 The result of \(Q_{EMD}\) (14) with \(\mathcal {D}^\mathrm {val}\) from Gaimersheim (holdout) and \(\mathcal {D}^\mathrm {dom}\) from Munich
Table 6 The result of \(Q_{EMD}\) (14) with \(\mathcal {D}^\mathrm {val}\) from Boston (holdout) and \(\mathcal {D}^\mathrm {dom}\) from Singapore

5.2 Discussion

Since this chapter aims to provide insights into the applicability of latent space methods for dataset comparison, some central findings are presented here. One goal was to use various experiments to find a combination that detects all domain shifts, if possible irrespective of their extent. To this end, different latent space methods were used, since critics argue that a nearest neighbor search does not work in high-dimensional spaces [AHK01]. Aggarwal et al. [AHK01] report that the Euclidean distance does not work well in high-dimensional spaces and that the Manhattan distance achieves better results. For the NVAE, the latent space dimension of 4096 is significantly higher than the 100 dimensions of the other VAE architectures. The NVAE has this high number of latent space dimensions because we used the parameters from the NVAE paper [VK21], which were adopted to achieve the best possible reconstruction without hyperparameter optimization runs. Our results show a similar pattern: the Manhattan distance can work better than the Euclidean distance. However, neither metric always gives reliable results, and both have multiple values below one in the ratio calculation.

In addition to investigating whether the nearest neighbor method works, we also wanted to find an architecture and metric that work stably, meaning that they correctly detect all tested domain shifts. Three architectures and two metrics were found that do so; they are shown graphically in Fig. 5. The three combinations found have in common that their ratio is not below one in any experiment. The NVAE with the reconstruction error works best with strong domain shifts that result in large visual differences between images, such as rain, snow, and night. If the domain shifts consist of different locations, resulting in a variety of only small dataset differences, the z-score with the VAE or JointVAE works better than the NVAE with the reconstruction error.

Due to the additional categorical latent space of the JointVAE, the discrete latent space is built up better. This leads to the JointVAE working more stably with the z-score than the VAE. The VAE is significantly worse with the Oxford dataset and nuScenes. In rain and snow in particular, the VAE does not work well, as can be seen in Fig. 5. The ratio of 1.01 is so close to the in-domain/out-of-domain boundary that the VAE should rather not be used for weather domain shifts. The fact that the JointVAE works better than the VAE may be due not only to the latent space but also to the loss, which puts more pressure on the latent spaces, so that they are better structured for a nearest neighbor search.

In Fig. 5, the \(\beta \)-VAE is missing. This is due to the A2D2 dataset, where the domain shift was not detected with any of the metrics. This could possibly be improved with a better choice of \(\beta \); the optimal parametrization of each network structure was beyond the scope of this chapter.

As the results show, the NVAE outperforms the other approaches at least in some experiments. However, due to its more complex architecture, one training run takes four times as long as for the other architectures (see Table 7). The inference time, here meaning the time to execute the encoder part, is also significantly longer, by factors between 4 and 5. This aspect should definitely be considered, since a practical application of such approaches would probably involve frequent retraining and encoding of large amounts of images.
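The Manhattan and Euclidean nearest neighbor distances compared in this discussion are both special cases of the Minkowski distance with \(p = 1\) and \(p = 2\). The following minimal sketch assumes a brute-force search over latent vectors stored as plain lists; the function names are hypothetical, and the actual implementation used in the experiments may differ.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def nearest_neighbor_distances(queries, references, p):
    """Distance from each query latent vector to its nearest reference vector."""
    return [min(minkowski(q, r, p) for r in references) for q in queries]


# Toy latent vectors (illustrative only).
references = [[3.0, 4.0], [1.0, 1.0]]
queries = [[0.0, 0.0]]

manhattan_nn = nearest_neighbor_distances(queries, references, p=1)
euclidean_nn = nearest_neighbor_distances(queries, references, p=2)
```

The resulting lists are one-dimensional samples of distances, which is the form of input a histogram comparison such as the earth mover’s distance operates on.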

Fig. 5

In this plot, the stability of the three combinations of VAE and metrics can be seen. These three combinations have a value greater than one for each of the five experiments and therefore give stable correct results. The red dashed line indicates the in-domain/out-of-domain boundary

Table 7 The training time of the different VAE architectures with the nuScenes dataset (30013 images). Training was done on an Nvidia A100 for 50 epochs. We also report the time needed to transform an image into the latent space

In summary, the JointVAE with the z-score is a fast alternative to the NVAE with the reconstruction error.

6 Conclusions

This chapter investigated whether the latent space of a VAE can be used to find domain shifts in image data in the automotive context. One goal was to find a combination of metric and VAE architecture that works in general, i.e., for different datasets and domain shifts.

The novelty of our chapter lies in the use of different types of VAE to evaluate the shift of image data across different datasets and under different conditions.

To find out which combination is the most stable and best compared to the others, we conducted 120 experiments. In these experiments, four different VAE architectures were trained with three different datasets and five different domain shifts were evaluated. The VAE architectures differ in that three of them have the same layer structure and only the loss changes, and one architecture focuses on reconstructing the images.

Six different metrics were used to detect the domain shifts, including the reconstruction error and four different nearest neighbor searches.
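One reading consistent with the counts given above (an assumption, since the breakdown is not stated explicitly) is that every architecture is evaluated with every metric on every domain shift:

$$\begin{aligned} \underbrace{4}_{\text {architectures}} \times \underbrace{6}_{\text {metrics}} \times \underbrace{5}_{\text {domain shifts}} = 120 \end{aligned}$$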

The most stable results were obtained with the JointVAE and the z-score, and with the NVAE and the reconstruction error. By “stable results” we mean that the ratio of earth mover’s distances between the histograms of the nearest neighbor distances for in-domain and domain shift data is greater than one for all experiments. The VAE with the z-score was also above one in all experiments, but in the Oxford dataset it was so marginally above one that this combination of metric and VAE cannot be recommended without limitations. In the experiments, the combination of JointVAE and z-score performed better for more subtle domain shifts such as a change of location, whereas the combination of NVAE and reconstruction error performed better for visually pronounced domain shifts such as snow or rain.

The results of the experiments show that the JointVAE with a latent space metric can be an alternative to the NVAE with the reconstruction error.

In its current state, the method can support function development by selecting the most suitable out-of-domain data. This selection can help to build useful data subsets in which all relevant data are present.

Better results can be expected if the function for which domain shifts are being searched is included in the training. If information from the function, e.g., inner representations of a neural network, is used to build the latent space, a performance drop of this neural network can probably be predicted better. However, this approach would no longer be function-agnostic, since it requires an initially trained function.

In conclusion, the presented methods can be a useful tool to compare and optimize datasets with only a few assumptions and can therefore be one component in a strategy to develop functions with better generalization capabilities. Yet, for a complete approach, they have to be complemented with other methods, such as rule-based or function-specific dataset optimizations, so that the methods mutually compensate for each other’s weaknesses.