1 Introduction

Simulations and generative models, such as Generative Adversarial Networks (GANs), are often used to synthesize realistic training data samples to improve the performance of perception networks (Park et al., 2019; Xu et al., 2021; Löhdefink & Fingscheidt, 2022; Li et al., 2022). Assessing the realism of such synthesized samples is a crucial part of the process. This is usually done by experts, a cumbersome and time-consuming approach. Though a lot of work has been conducted to determine the quality of generated images (Goodfellow et al., 2014; Salimans et al., 2016; Theis et al., 2016; Heusel et al., 2017; Lehmann & Romano, 2006), little work has been published on how to quantify the realism of point clouds (Shu et al., 2019; Triess et al., 2021b). Visual inspection of such data is expensive and not reliable, given that the interpretation of 3D point data is rather unnatural for humans. Because of the subjective nature of such inspections, it is difficult to compare generative approaches with a qualitative measure. This work closes the gap and introduces a quantitative evaluation for LiDAR point clouds.

In recent years, a large number of evaluation measures for GANs has emerged (Borji, 2019). Many of them are image-specific and cannot be applied to point clouds. Existing work on generating realistic LiDAR point clouds mostly relies on qualitative measures to evaluate the generation quality. Alternatively, some works apply annotation transfer (Sallab et al., 2019) or use the Earth Mover's Distance (EMD) as an evaluation criterion (Caccia et al., 2019). However, these methods require either annotations associated with the data or a matching target, i.e. ground truth, for the generated sample. Both are often not feasible when working with large-scale data generation or transfer learning setups.

One main application of data generation is to train downstream perception models, i.e. segmentation or detection models that make use of the generated data. Here it is crucial to reduce the domain gap between the generated data and the target data on which the trained perception model is applied (Triess et al., 2021a). Therefore, the performance of the trained perception model itself can be used as an indication of the realism of the data. However, using this as a proper metric is impractical since it requires re-training the target network on multiple versions of the data to evaluate their realism. A solution is a metric that can determine the realism of the data already while training the generative model.

To address this need, our previous work (Triess et al., 2021b) proposes a reliable metric that gives a quantitative estimate of the realism of generated LiDAR data. Fig. 1 shows the concept of the metric as a distance measure in high-dimensional feature space. The metric is trained to learn relevant features via a proxy classification task. To avoid learning global scene context, we use hierarchical feature set learning to confine features locally in space. To discourage the network from encoding dataset-specific information, we use an adversarial learning technique which enables robust quantification of unseen data distributions. In this work, we extend our previous approach (Triess et al., 2021b) with evaluations on the influence of data realism on segmentation performance and add further ablations of the adversarial training. In summary, our contributions are:

  • We present a learning-based quantitative metric to measure the realism of LiDAR point clouds.

  • We use an adversarial learning technique to suppress irrelevant features, such that the metric can be applied to unseen data.

  • In experiments on generated LiDAR data, we analyze the relationship between data realism and downstream perception performance. We show that our metric is a good indicator for the resulting perception performance.

2 Related Work

First, this section discusses GAN evaluation measures and their applicability to generated LiDAR data. Second, we give a brief overview on metric learning.

2.1 GAN Evaluation Measures

A considerable amount of literature deals with how to evaluate generative models and proposes various evaluation measures. The most important ones are summarized in extensive survey papers (Lucic et al., 2018; Xu et al., 2018; Borji, 2019). They can be divided into two major categories: qualitative and quantitative measures.

Fig. 1 Proposed approach: The realism measure has a tripartite understanding of the 3D world (middle). The left and right images show the color-coded metric scores for query points on two example scenes. Both scenes are from the real-world KITTI dataset and are augmented with dynamic objects from the simulated CARLA dataset. The left image shows inserted cars from CARLA (left) next to real KITTI cars (right). The right image demonstrates the metric results for a synthetic bicycle-and-person object in a KITTI scene. Additionally, the terrain in the background is distorted with noise, which is detected by the metric

2.1.1 Qualitative Evaluation

Qualitative evaluation (Goodfellow et al., 2014; Huang et al., 2017; Zhang et al., 2017; Srivastava et al., 2017; Lin et al., 2018; Chen et al., 2016; Mathieu et al., 2016) relies on visual inspection of a small collection of examples by humans and is therefore subjective in nature. It is a simple way to get an initial impression of the performance of a generative model but cannot be performed in an automated fashion. In other previous work, we use Mean Opinion Score (MOS) testing to verify the realism of generated LiDAR point clouds (Triess et al., 2019). MOS testing was introduced by Ledig et al. (2017a) to provide a qualitative measure of realism in RGB images. In contrast to Ledig et al. (2017a), where untrained people rated the realism, Triess et al. (2019) require LiDAR experts for the testing process to ensure that the raters are sufficiently familiar with the sensor domain. This makes the process even more time-consuming and expensive. Furthermore, the subjective nature of qualitative measures in general makes it difficult to compare performances across different works, even when a large group of raters, e.g. recruited via Mechanical Turk, is used. Therefore, quantitative metrics are crucial.

2.1.2 Quantitative Evaluation

Table 1 GAN evaluation measures: This table categorizes GAN evaluation measures and states their most important pros and cons with respect to our application

Quantitative evaluation is performed over a large collection of examples, often in an automated fashion. Table 1 categorizes a number of quantitative GAN measures into six categories according to their properties.

Feature-based (Salimans et al., 2016; Gurumurthy et al., 2017; Heusel et al., 2017; Che et al., 2017; Zhou et al., 2018; Shu et al., 2019): Feature-based metrics measure the realism of the data by computing a distance in high-dimensional feature spaces. The Inception Score (IS) (Salimans et al., 2016) and the Fréchet Inception Distance (FID) (Heusel et al., 2017) are the two most popular metrics and extract their features with networks trained on the ImageNet dataset (Deng et al., 2009). This makes them exclusively applicable to camera image data. The Fréchet Point Cloud Distance (FPD) (Shu et al., 2019) is applicable to single-object point clouds, as it is based on features from a PointNet (Charles et al., 2017). In contrast to our method, these measures require labels on the target domain to train the feature extractor, cannot handle variable-sized point clouds, and do not provide local scores. Further, a sample can only be compared to one particular distribution, which makes it difficult to obtain a reliable measure on unseen data.

Distribution-based (Goodfellow et al., 2014; Theis et al., 2016; Tolstikhin et al., 2017; Gretton et al., 2012; Achlioptas et al., 2018; Arora et al., 2018; Richardson & Weiss, 2018): Most distribution-based measures are independent of the data modality and can thus be used to evaluate GANs operating on point clouds. They successfully capture the sample diversity and mode collapse of the model, but cannot determine the realism of a single sample. Most approaches are labor-intensive, as they require manual checkpoint selection and several runs over the test data.

Classification (Arjovsky et al., 2017; Radford et al., 2016; Isola et al., 2017; Santurkar et al., 2018; Lehmann & Romano, 2006; Yang et al., 2017): Another common approach is to use classification networks to assess the quality of GAN outputs. The Classifier Two-Sample Test (C2ST), for example, assesses whether two samples are drawn from the same distribution. This requires a freshly trained discriminator for each test on a held-out subset of the data.

Output comparison (Wang et al., 2016; Xiang & Li, 2017): Among others, computing reconstruction errors is one common method to assess generated data. For point clouds, EMD and Chamfer’s Distance (CD) are often used, as they can operate in a permutation-invariant fashion. These metrics also serve as a basis for some distribution-based measures, such as coverage or Minimum Matching Distance (MMD) (Achlioptas et al., 2018). Caccia et al. (2019) use EMD and CD directly as a measure of reconstruction quality on entire scenes captured with a LiDAR scanner. However, this is only applicable to paired translation GANs or supervised approaches, because it requires a known target to measure the reconstruction error.

Model comparison (Im et al., 2016; Olsson et al., 2018; Zhang et al., 2018): There exist two types of model comparison techniques. The first includes simple metrics that capture the performance of the discriminator relative to the current state of the generator. The other type focuses on the evaluation of sample diversity and comparison between several GAN architectures. However, these measures are labor-intensive and of high complexity, as they often require several network combinations and training runs.

Low-level statistics (Khrulkov et al., 2018; Wang et al., 2004): Computing low-level statistics of the underlying data is easy and fast. However, statistics like the Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), sharpness, or contrast are specific to RGB images and not capable of capturing higher-level information.

This work aims at providing a practical quantitative metric to determine the realism of individual generated samples via learned features. Therefore, we consider our proposed method as a combination of the following categories: feature-based, distribution-based, and output comparison.

2.2 Metric Learning

The goal of deep metric learning is to learn a feature embedding such that similar data samples are projected close to each other while dissimilar data samples are projected far apart in the high-dimensional feature space. Common methods use Siamese networks trained with contrastive losses to distinguish between similar and dissimilar pairs of samples (Chicco, 2021). Building on this, triplet-loss architectures train multiple parallel networks with shared weights to achieve the feature embedding (Hoffer & Ailon, 2015; Dong & Shen, 2018). This work uses an adversarial training technique to push features into a similar or dissimilar embedding.

3 Method

3.1 Objective and Properties

The aim of this work is to provide a method to estimate the level of realism for arbitrary LiDAR point clouds. We design the metric to learn relevant realism features directly from distributions of real-world data. The output of the metric can then be interpreted as a distance measure between the input and the learned distribution in a high dimensional space.

Fig. 2 Architecture: The feature extractor \(F_{\theta _F}\) uses hierarchical feature set learning from PointNet++ (Qi et al., 2017) to encode information about each of the Q query points and their K nearest neighbors. The neighborhood features z are then passed to the classifier \(C_{\theta _C}\), which outputs probability scores \(\textbf{p}_C\) for each category (Real, Syn, Misc). During training, z is also fed to the adversaries \(A_{\theta _A}\), which output probability scores \(\textbf{p}_A\) for each dataset of their respective category. A multi-class cross-entropy loss is minimized for the classifier and all three adversaries. Since C should perform as well as possible while A should perform as poorly as possible, the gradient is inverted between the adversarial input and the feature extractor (Beutel et al., 2017). \(\lambda \) is a factor that regulates the influence of the adversarial loss, weighting the ratio of accuracy versus fairness. In our experiments we use a factor of \(\lambda =0.3\)

Based on the discussed aspects of existing point cloud and GAN measures, we expect a useful LiDAR point cloud metric to be:

Quantitative: The realism score is a quantitative measure that determines the distance of the input sample to the internal representation of the learned realistic distribution. The score \(S^\textit{Real}\) has well-defined lower and upper bounds that reach from 0 (unrealistic) to 1 (realistic).

Universal: The metric has to be applicable to any LiDAR input and therefore must be independent from any application or task. This means no explicit ground truth information, such as class labels or bounding boxes, is required.

Transferable: The metric must give a reliable and robust prediction for all inputs, independent of whether the data distribution of the input sample is known by the metric or not. This makes the metric transferable to new and unseen data.

Local: The metric should be able to compute spatially local realism scores for smaller regions within a point cloud. These scores can then be combined with additional information, such as motion, semantics, or distance to provide a detailed analysis of the data. The metric is also expected to focus on identifying the realism of the point cloud properties while ignoring global scene properties as much as possible to reduce domain biases.

Flexible: Point clouds are usually sets of un-ordered points with varying size. Therefore, it is crucial to have a processing that is permutation-invariant and independent of the number of points to process.

Simple: Easy applicability and a fast computation time allow the metric to run in parallel to the training of a neural network for LiDAR data generation. This enables monitoring the realism of the generated samples during the training of the network.

We implement our metric in such a way that the described properties are fulfilled. To differentiate the metric from a GAN discriminator, we emphasize that a discriminator is not transferable to unseen data, since it recognizes only one specific data distribution as realistic.

3.2 Architecture

Figure 2 shows the architecture of our approach. The following describes the components and presents how each part is designed to contribute towards achieving the desired metric properties. The underlying idea of the metric design is to compute a distance measure between different data distributions of realistic and unrealistic LiDAR point cloud compositions. The network learns features indicating realism from data distributions by using a proxy classification task. Specifically, the network is trained to classify point clouds from different datasets into three categories: Real, Syn, Misc. The premise is that the probability space of LiDAR point clouds can be divided into those that derive from real-world data (Real), those that derive from simulations (Syn), and all others (Misc), e.g. distorted or randomized data. Refer to Fig. 1 for an impression. By acquiring prior information about this tripartite data distribution, the metric does not require any target information or labels for inference.

The features are obtained with hierarchical feature set learning, explained in Sect. 3.2.1. Section 3.2.2 outlines our adversarial learning technique.

3.2.1 Feature Extractor

The blue parts of Fig. 2 visualize the PointNet++ (Qi et al., 2017) concept of the feature extractor \(F_{\theta _F}\). It has two abstraction levels, sampling \(Q_1=2048\) and \(Q_2=256\) query points with \(K_1=20\) and \(K_2=10\) nearest neighbors (KNN), respectively. Keeping the number of neighbors and abstraction levels low limits the network to encoding information about local LiDAR-specific statistics instead of global scenery information. On the other hand, the large number of query points helps to cover many different regions within the point cloud and guarantees the local aspect of our method. In contrast to PointNet++, we use KNN search instead of radius search to find the neighboring points. PointNet++ was proposed for point clouds from the ShapeNet dataset (Chang et al., 2015), which have uniformly sampled points on object surfaces. In LiDAR point clouds, points are not uniformly distributed; with increasing distance to the sensor, the distance between neighboring points also increases. We therefore found KNN search more practical than radius search for obtaining meaningful neighborhoods in LiDAR scans.
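To make the neighborhood construction concrete, the following is a minimal PyTorch sketch of the KNN grouping step. It is an illustrative re-implementation, not the original code: the random choice of query points (standing in for the sampling used by PointNet++) and all tensor shapes are assumptions, and only the grouping logic follows the description above.

```python
import torch


def knn_group(points: torch.Tensor, num_queries: int, k: int):
    """Sample query points from a LiDAR scan and gather their K nearest neighbors.

    points: (N, 3) xyz coordinates of one scan.
    Returns the query coordinates (Q, 3) and the neighbor offsets (Q, K, 3),
    centered on each query point so that only local statistics remain.
    """
    idx = torch.randperm(points.shape[0])[:num_queries]   # stand-in for the sampling step
    queries = points[idx]                                  # (Q, 3)
    dists = torch.cdist(queries, points)                   # (Q, N) pairwise L2 distances
    nn_idx = dists.topk(k, largest=False).indices          # (Q, K) indices of nearest points
    neighbors = points[nn_idx]                             # (Q, K, 3)
    return queries, neighbors - queries.unsqueeze(1)


# First abstraction level as described above: 2048 query points, 20 neighbors each.
scan = torch.rand(120_000, 3) * 50.0                       # placeholder LiDAR scan
queries, local_xyz = knn_group(scan, num_queries=2048, k=20)
# Each (K, 3) neighborhood is then fed through a shared MLP and max-pooled
# into one feature vector per query point.
```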

In each abstraction level, we use a 3-layer MLP with filter sizes of \(\left[ 64, 64, 128\right] \) and \(\left[ 128, 128, 256\right] \), respectively. This results in the neighborhood features \(z=F(x,\theta _F)\) of size \(\left[ Q, U_F\right] \) with \(U_F=256\) features for each of the \(Q=256\) query points. The features z are then fed to a densely connected classifier \(C_{\theta _C}\) (yellow block). It consists of a hidden layer with 128 units, to which 50% dropout is applied during training, and the output layer with \(U_C\) units.

The classifier output is a probability vector \(\textbf{p}_{C,q} ={\text {softmax}}(y_C) \in [0,1]^{U_C}\) per query point q. The vector has \(U_C=3\) entries for each of the categories Real, Syn and Misc. The component \(p_{C,q}^\textit{Real}{}\) quantifies the degree of realism in each local region q. The scores \(\textbf{S}=\frac{1}{Q} \sum _q \textbf{p}_{C,q}\) for the entire scene are given by the mean over all query positions. Here, \(S^\textit{Real}{}\) is a measure for the degree of realism of the entire point cloud. A score of 0 indicates low realism while 1 indicates high realism.
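Assuming per-query classifier logits, the reduction from per-query probabilities to the scene-level scores described above can be written in a few lines; the shapes below are illustrative placeholders:

```python
import torch

Q, U_C = 256, 3                      # query points and categories (Real, Syn, Misc)
logits = torch.randn(Q, U_C)         # y_C from the classifier head, one row per query point

p_c = torch.softmax(logits, dim=-1)  # per-query probability vectors p_{C,q}
scores = p_c.mean(dim=0)             # scene scores S = (S_Real, S_Syn, S_Misc)
s_real = scores[0].item()            # realism score of the whole point cloud in [0, 1]
```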

3.2.2 Adversarial Training

To obtain a transferable metric network, our metric leverages a concept often used to design fair network architectures or domain losses (Beutel et al., 2017; Raff & Sylvester, 2018). The idea is to force the feature extractor to encode only information into the latent representation z that is relevant for the realism estimation. This means we actively discourage the feature extractor from encoding information that is specific to the distribution of a single dataset. In other words, using fair-networks terminology (Beutel et al., 2017), we treat the concrete dataset name as a sensitive attribute. With this procedure we can improve the generalization ability towards unknown data.

To achieve this behavior, we add a second output path for adversarial learning that consists of one adversary \(A_{\theta _A}\) for each category (see the orange parts in Fig. 2). Each of the adversaries predicts classification probabilities for all datasets in its respective category. To simplify the following explanation, we assume there is only one adversary. The architecture of the adversary is identical to that of the classifier, except for the number of units in the output layer \(U_A\), which depends on the number of training datasets for the respective category (\(U_A^\textit{Real}{}=2, U_A^\textit{Syn}{}=2, U_A^\textit{Misc}{}=3\)). Following the designs proposed in (Beutel et al., 2017; Raff & Sylvester, 2018), we train all network components by minimizing the losses for both heads, \(\mathcal {L}_C=\mathcal {L}\left( \textbf{y}_C,\hat{\textbf{y}}_C\right) \) and \(\mathcal {L}_A=\mathcal {L}\left( \textbf{y}_A,\hat{\textbf{y}}_A\right) \), but reversing the gradient in the path between the adversary input and the feature extractor. The goal is for C to predict the category \(\textbf{y}_C\) and for A to predict the dataset \(\textbf{y}_A\) as well as possible, but for F to make it hard for A to predict \(\textbf{y}_A\). Training with the reversed gradient results in F encoding as little information as possible that is useful for predicting \(\textbf{y}_A\). The training objective is formulated as

$$\begin{aligned} \begin{aligned}&\min _{\theta _F,\theta _C,\theta _A} \mathcal {L} \Big ( C\big ( F(x;\theta _F);\theta _C \big ), {\hat{y}}_C \Big ) \\&\quad + \mathcal {L} \Big ( A\big ( J_\lambda [ F(x;\theta _F) ];\theta _A \big ), {\hat{y}}_A \Big ) \end{aligned} \end{aligned}$$
(1)

with \(\theta \) being the trainable variables and \(J_\lambda \) a special function

$$\begin{aligned} J_\lambda [F] = F \quad \text {but} \quad \nabla J_\lambda [F] = -\lambda \cdot \nabla F \end{aligned}$$
(2)

such that the forward pass is an identity function while the gradient is inverted in the backward pass during training. The factor \(\lambda \) determines the ratio of accuracy and fairness.
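Equation (2) can be realized with a custom autograd function. The following PyTorch sketch is an illustrative implementation of such a gradient-reversal layer, not the authors' code:

```python
import torch


class GradientReversal(torch.autograd.Function):
    """J_lambda from Eq. (2): identity in the forward pass, gradient scaled by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                     # forward pass is the identity

    @staticmethod
    def backward(ctx, grad_output):
        # Invert and scale the gradient flowing back into the feature extractor;
        # the second return value corresponds to the non-tensor lambda argument.
        return -ctx.lam * grad_output, None


def grad_reverse(z, lam=0.3):
    return GradientReversal.apply(z, lam)
```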

In the applications of the related literature (Beutel et al., 2017; Raff & Sylvester, 2018), the sensitive attribute and the requested attribute are often correlated but have no direct coupling. In our case, this would mean that different data samples from the same dataset could belong to multiple categories. This is not the case: samples from one dataset always belong to the same category. Therefore, our sensitive attribute, the dataset, always directly determines the requested attribute, the category. A single adversary would suppress all information about the sensitive attribute and thus also remove information needed to obtain the requested attribute, which leads to an unwanted decline in classifier performance. Therefore, a separate adversary for each category is needed, such that only the sensitive information regarding the dataset is suppressed, while the requested information about the category is kept intact. The adversaries \(A:\{A^\textit{Real}{}, A^\textit{Syn}{}, A^\textit{Misc}{}\}\) have the trainable variables \(\theta _A:\{\theta _A^\textit{Real}{}, \theta _A^\textit{Syn}{}, \theta _A^\textit{Misc}{}\}\). Each adversary outputs estimates for only the datasets of its respective category. This forces the feature extractor to encode only common features within one category, while not removing important features from other categories. The loss is now defined as \(\mathcal {L}_A=\mathcal {L}_{A^\textit{Real}{}}+\mathcal {L}_{A^\textit{Syn}{}}+\mathcal {L}_{A^\textit{Misc}{}}\).
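A possible way to wire this up is sketched below, reusing the grad_reverse helper from the previous snippet. The batching scheme and the assumption that dataset labels are re-indexed within each category are illustrative choices, not taken from the paper:

```python
import torch
import torch.nn.functional as F


def metric_loss(z, y_cat, y_ds, classifier, adversaries, lam=0.3):
    """z: (B, U_F) neighborhood features, y_cat: (B,) category ids in {0, 1, 2},
    y_ds: (B,) dataset ids within the respective category.
    `adversaries` is the list [A_Real, A_Syn, A_Misc]."""
    loss_c = F.cross_entropy(classifier(z), y_cat)           # L_C
    z_rev = grad_reverse(z, lam)                              # J_lambda[F(x)]
    loss_a = z.new_zeros(())
    for cat_id, adversary in enumerate(adversaries):
        mask = y_cat == cat_id                                # samples of this category
        if mask.any():
            loss_a = loss_a + F.cross_entropy(adversary(z_rev[mask]), y_ds[mask])
    return loss_c + loss_a                                    # L_C + L_A, minimized jointly
```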

4 Experimental Setup

4.1 Datasets

Table 2 Datasets: The table lists the datasets for each category

Table 2 shows the datasets used for this work. We use two groups of datasets: one is used to train and evaluate the metric, while the other is only used for evaluation. With this strict separation of training and evaluation datasets, in addition to the usual training and test splits, we demonstrate that our method is a useful measure on unknown data distributions. In both groups, the datasets stem from one of three categories: Real, Syn, Misc.

Within the Real category, publicly available real-world datasets are used for training (KITTI, nuScenes) and evaluation (PandaSet). For Syn, we use the CARLA simulator, in which we implement the sensor specifications of a Velodyne HDL-64 sensor to create ray-traced range measurements. GeoSet is the second dataset in this category. Here, simple geometric objects, such as spheres and cubes, are randomly scattered on a ground plane in three-dimensional space and ray-traced in a scan pattern. Additionally, we augment the synthetic data with a small amount of noise at training time, such that it is not trivially distinguishable from the other categories. For evaluation, we use the GTAV-LiDAR dataset (Hurl et al., 2019), which contains simulated LiDAR samples from the video game Grand Theft Auto V (GTA V). It has a large, detailed world with realistic graphics, which provides a diverse data collection environment.

Finally, we add a third category, Misc, to allow the network to represent meaningless data distributions, as they often occur during GAN training or in case of sensor failures. Therefore, Misc contains randomized data that is generated at training time. Misc 1 and Misc 2 are generated by linearly increasing the depth over the rows or columns of a virtual LiDAR scanner, respectively. Misc 3 is simple Gaussian noise with varying standard deviations. Misc 4 is only used for evaluation and is created by setting patches of varying height and width of the LiDAR depth projection to the same distance. Varying degrees of Gaussian noise are added to the Euclidean distances of Misc \(\{1,2,4\}\).
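The Misc samples can be produced procedurally; the NumPy sketch below illustrates the idea for Misc 1-3 on an assumed 64 x 2048 virtual scanner grid. All parameter ranges (depths, noise levels) are placeholders, not the values used in the paper:

```python
import numpy as np

H, W = 64, 2048                                       # assumed rows x columns of the scanner


def misc1_rows(max_depth=80.0, noise_std=0.5):
    """Misc 1: depth increases linearly over the rows, plus Gaussian noise."""
    depth = np.repeat(np.linspace(1.0, max_depth, H)[:, None], W, axis=1)
    return depth + np.random.normal(0.0, noise_std, size=(H, W))


def misc2_cols(max_depth=80.0, noise_std=0.5):
    """Misc 2: depth increases linearly over the columns, plus Gaussian noise."""
    depth = np.repeat(np.linspace(1.0, max_depth, W)[None, :], H, axis=0)
    return depth + np.random.normal(0.0, noise_std, size=(H, W))


def misc3_gauss(mean_depth=20.0):
    """Misc 3: pure Gaussian noise with a randomly drawn standard deviation."""
    std = np.random.uniform(0.5, 10.0)
    return np.abs(np.random.normal(mean_depth, std, size=(H, W)))
```

Each generated depth image can then be projected back to xyz points with the virtual scanner geometry, so that the Misc samples pass through the same point-based pipeline as the other datasets.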

In addition to the training data listed in the table, we use 1000 samples from a different split of each dataset to obtain our evaluation results. No annotations or additional information are required to train or apply the metric; all operations are based on the xyz coordinates of the point clouds.

4.2 Up-sampling models

We use the task of up-sampling to demonstrate the application of our metric. Up-sampling is a type of domain adaptation in which the source domain is low-resolution data and the target domain is high-resolution data. In contrast to more complex adaptations, such as simulation-to-real or sensor-to-sensor setups, this lets us focus on evaluating the actual data realism without additional domain gaps introduced by differing scene content. It is nevertheless a complex task, since the model must understand the scene in order to synthesize realistic high-resolution LiDAR outputs. This makes it an ideal testing candidate for our realism metric.

In Sect. 6 we compare the realism of generated samples from five different up-sampling methods to the high-resolution target data. The generation process is based on cylindrical depth projections of the LiDAR point clouds, as proposed in (Triess et al., 2019). We compare two traditional methods, i.e. nearest-neighbor and bilinear interpolation, and three learning-based methods. The generator of all three learning-based methods is adapted from the SRGAN architecture (Ledig et al., 2017a). One version is trained with an \(\mathcal {L}_1\)-loss, another with an \(\mathcal {L}_2\)-loss, and the GAN uses an adversarial loss. The GAN discriminator is also adapted from (Ledig et al., 2017a). We conduct the experiments for \(4\times \) up-sampling in the vertical dimension. Implementation and training details can be found in the appendix.
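For reference, the two traditional baselines can be reproduced directly on the cylindrical depth projection with standard image interpolation. The sketch below is illustrative; the 16-row low-resolution input and the 64 x 2048 target projection size are assumptions, not settings taken from the paper:

```python
import torch
import torch.nn.functional as F

low_res = torch.rand(1, 1, 16, 2048)         # (N, C, H, W) cylindrical depth projection

# 4x up-sampling in the vertical dimension only.
nearest = F.interpolate(low_res, scale_factor=(4, 1), mode="nearest")
bilinear = F.interpolate(low_res, scale_factor=(4, 1), mode="bilinear",
                         align_corners=False)

assert nearest.shape == bilinear.shape == (1, 1, 64, 2048)
```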

4.3 Baselines

As baselines for our metric, we report the reconstruction errors of the up-sampled data. These errors can serve as an indication of the generation quality, but are usually not suitable as a metric for synthesized data, since they require a target sample. In our case, this target is the original high-resolution sample from which we generate the low-resolution sample as input to the up-sampling network. We compute the Chamfer’s Distance (CD), Mean Absolute Error (MAE), and Mean Squared Error (MSE) between the predicted point cloud \(P^p\) and the target \(P^t\). For CD, the point clouds are considered as un-ordered sets \(P=\{p\}\), such that

$$\begin{aligned} \begin{aligned} d_\textit{CD}(P^p,P^t)&= \frac{1}{|P^p|} \sum \limits _{p^p \in P^p} \min \limits _{p^t \in P^t} \Vert p^p-p^t \Vert _2 \\&\quad + \frac{1}{|P^t|} \sum \limits _{p^t \in P^t} \min \limits _{p^p \in P^p} \Vert p^t-p^p \Vert _2 \end{aligned} \end{aligned}$$
(3)

while for \({\text {MAE}}=\Vert p^t_{ij} - p^p_{ij} \Vert _1\) and \({\text {MSE}}=\Vert p^t_{ij} - p^p_{ij} \Vert _2^2\), the point clouds are arranged as projected images \(P=\{p_{ij}\}\) with the indices i and j for the respective row and column of the projection. Typical GAN evaluation measures for point cloud generation are Coverage (Tolstikhin et al., 2017) and MMD (Gretton et al., 2012). Both are based on finding the best match between the generated and the target point cloud. If we assume that the best match is always the original high-resolution scan of the same scene, the metrics simplify to \(\text {Cov}\approx 1.0\) and \(\text {MMD}\approx d_\text {CD}\) due to our paired translation setup. Therefore, we do not report these metrics in addition to the reconstruction errors in the evaluation section.
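A straightforward brute-force PyTorch sketch of these baseline errors is given below; it follows Eq. (3) and the projected-image formulation directly and is meant for illustration rather than efficiency:

```python
import torch


def chamfer_distance(pred, target):
    """Eq. (3) for two un-ordered point sets pred (Np, 3) and target (Nt, 3)."""
    d = torch.cdist(pred, target)                       # (Np, Nt) pairwise L2 distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


def projection_errors(pred_img, target_img):
    """MAE and MSE on row/column-aligned depth projections of shape (H, W)."""
    diff = target_img - pred_img
    return diff.abs().mean(), (diff ** 2).mean()
```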

4.4 Semantic Segmentation

The key application for our metric is to evaluate the generation capabilities of generative models used to improve downstream perception. This enables checkpoint selection or early stopping of GAN training runs under the assumption that better data leads to better perception models. We investigate this in our application experiments. Using the up-sampling models from Sect. 4.2, we transform data from the source (low-resolution) to the target (high-resolution) domain. This step generates pseudo-datasets of different quality for each method. We then use these pseudo-datasets to train semantic segmentation models which are finally evaluated on the target domain. It is expected that if the metric ranks the realism of a generated dataset higher than another one, training with this data also leads to better segmentation performance on the target domain. This is because the data is, according to the metric, more realistic, i.e. the domain gap is smaller (Triess et al., 2021a).

As segmentation models, we use SqueezeSegV2 (Wu et al., 2019) and RangeNet21 (Milioto et al., 2019). Instead of the original 19 classes, we combine some of them and only predict 9 classes. Details on the architecture and training can be found in the appendix.

5 Metric Evaluation

5.1 Balance between Accuracy and Fairness

First, the metric has to be calibrated by choosing a suitable factor \(\lambda \) for the adversarial loss during training. This factor controls the ratio between accuracy and fairness. A well-chosen factor maximizes the difference between a high classifier accuracy and a low adversary accuracy.

Figure 3 shows the classifier accuracy in black and the adversary accuracy in brown (weighted sum over the three category adversaries, shown as dashed lines). With increasing \(\lambda \), the adversarial accuracy decreases slowly, while the classification accuracy suddenly drops. This happens because the classifier gradients are overruled by the reversed gradients of the adversary, hindering the classifier from training properly. Interestingly, the adversarial part of the Real category is significantly more influenced by \(\lambda \) than that of the other two. One reason might be that the Real datasets are in themselves already very diverse, especially compared to the Syn or Misc datasets. The number of different sceneries is higher, but most of the variance is caused by the more diverse appearance of the same object types (e.g. pedestrians) and the additional sensor noise, which is not present in the Syn datasets. This makes it hard for the model to extract only realism-relevant features in the form of common information from the Real datasets while not removing any other relevant information. Thus, the model requires more pressure, in the form of a higher \(\lambda \), to accomplish this challenging task for the Real category, compared to Syn and Misc, where it is easier to extract common information without removing other relevant information.

We use a factor of \(\lambda =0.3\) for all further experiments in this paper (indicated by the gray vertical line). Here, the classifier has a good performance (93%) while the adversary operates slightly above chance level (50%).

Fig. 3 Accuracy versus fairness: Accuracy of the classifier and the adversaries over the loss factor \(\lambda \). At small \(\lambda \), the classification accuracy is high, which means good performance; however, the adversary accuracy is also quite high (at least for Real), which means no fairness in this respect. With increasing \(\lambda \), the network becomes fairer while maintaining its high classification accuracy. At a certain point the network becomes unstable and the classifier deteriorates to chance-level performance

Fig. 4 Metric results: Shown is the metric output S for Real, Syn, and Misc on different datasets. The lower part shows the results for the test splits of the known datasets, while the upper part depicts one unknown dataset from each category. The color of the dataset name indicates the respective category

Fig. 5 Qualitative performance on unknown data: The figure shows the metric results on three unknown datasets. (a) shows the PandaSet dataset as an example for Real. (b) shows the GTAV dataset for Syn; the overall high Real scores seem to be caused by regions that contain cars. (c) shows an example from the Misc 4 dataset

5.2 Overall Dataset Results

We run our metric network on the evaluation datasets, as well as on the test splits of the training datasets. Figure 4 shows the mean of the metric scores S for each of the three categories. The known datasets (lower part) clearly achieve well-separated scores and predict their respective category, e.g. CARLA is classified with a high Syn score.

We obtain notable results on the unknown datasets (upper part). Qualitative example frames are depicted in Fig. 5. The Real dataset PandaSet behaves similarly to the two known Real datasets, KITTI and nuScenes. This shows that the metric focused on encoding realism-relevant features from KITTI and nuScenes, such that PandaSet is easily categorized as real as well. The randomly generated Misc 4 dataset is correctly located within the Misc category, albeit with higher deviations in the scores, leading to Misc scores around 70% and Real scores around 20%. The deviations are caused by the high variance that was used to generate this dataset, where some regions have slightly higher Real or Syn scores.

The Syn dataset GTAV behaves slightly differently. Here, \(S^\textit{Syn}\) is around 60%, while the score for Real is around 35%, and the deviation from those mean values is quite large. The reason for these high deviations and the resulting lower Syn scores is a systematic behavior of the metric caused by the data distribution. Figure 5b shows that the high Real scores mainly stem from regions containing vehicles. GTAV has more detailed car models than CARLA, which therefore appear almost like real vehicles in the point cloud. This example clearly demonstrates the benefit of the locality aspect of our metric, which enables such detailed investigations.

Fig. 6 Learned feature embedding: Shown are t-SNE plots of the feature embedding z for four adversary configurations of the otherwise identical metric network. In (a) the model is trained without an adversary. (b) shows the features when a single adversary is used for training. (c) visualizes the features of our previous method (Triess et al., 2021b), which only used an adversary for the Real category. (d) depicts our approach, where one adversary per category is trained

5.3 Adversary Ablation

The proposed approach uses the adversarial loss to embed features for Real, Syn, and Misc while at the same time omitting dataset-specific information as far as possible. To demonstrate the feature encoding behavior, we train additional metric networks with varying adversary configurations and visualize the learned features on the validation data.

Fig. 7 Metric scores for up-sampling methods: The vertical axis lists five methods to perform \(4\times \) LiDAR scan up-sampling and the high-resolution target data (“KITTI”). The left plot shows the reconstruction errors of different baseline measures. The middle plot shows the three parts of our realism measure. The right plot shows the semantic segmentation results on the original KITTI dataset of a segmentation model trained with the data generated by the respective method. The methods are ordered from top to bottom by increasing human visual judgment ratings

Figure 6 shows t-SNE plots of the neighborhood features z. t-SNE is a dimensionality reduction method that maps data from a high-dimensional space (the z vectors) to a low-dimensional space (a 2D image) while minimizing information loss. Close points in the image have similar representations in z. Each metric category is represented by a different color, while the individual datasets are shown in different shades of this color. The darkest colors belong to the unknown datasets that were never seen by the metric network at training time, i.e. PandaSet, GTAV, and Misc 4. We include them for demonstration purposes regarding the transferability to unseen data.
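The visualization itself can be reproduced with off-the-shelf tooling; a minimal sketch using scikit-learn and matplotlib is shown below. The feature array, the dataset labels, and the perplexity value are placeholders, not the settings used for Fig. 6:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

z = np.random.randn(5000, 256)                     # stacked neighborhood features (placeholder)
dataset_id = np.random.randint(0, 8, size=5000)    # one id per source dataset (placeholder)

z_2d = TSNE(n_components=2, perplexity=30.0, init="pca").fit_transform(z)
plt.scatter(z_2d[:, 0], z_2d[:, 1], c=dataset_id, s=2, cmap="tab10")
plt.show()
```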

The two extreme cases of the configuration are shown in Fig. 6a and b. Figure 6a represents the metric as a simple classifier without an adversary, where each shade of each color forms its own cluster with little overlap to others. This means the features of each dataset are distinct, which makes it hard for the metric to estimate a reasonable score for unseen datasets. Figure 6b, on the other hand, uses one common adversary, which leads to decreased classifier accuracy since features from all sources are forced into a common representation. This can be observed by the mixed colors with no clusters, not even between categories.

A useful metric requires a mix of the two versions above, where features of one category are similar and features from different categories are dissimilar. Therefore, we propose to use per-category adversaries. In our previous work (Triess et al., 2021b) the adversary was only applied for Real, as depicted in Fig. 6c. In this work we use one adversary for each category, as represented by Fig. 6d. In both cases the green colors of the Real datasets are clearly mixed, while at the same time being sufficiently distinguishable from the blue or gray clusters. However, our per-category approach (Fig. 6d) also shows mixed features among the blue and gray points, whereas our previous approach shows more distinct clusters. This is especially visible for Misc, where Fig. 6c has one cluster for each shade but our method better combines them.

Further, the feature visualization shows that the unknown dataset PandaSet is fully integrated into the Real cluster for our method, as opposed to when using no adversary. The clusters of the unknown GTAV dataset mostly overlap with Syn, but also partially with Real. This aligns with the metric results that we saw previously for GTAV, where parts of the data containing vehicles appear quite realistic.

Fig. 8 Qualitative up-sampling and segmentation results: The first row shows the metric results on an up-sampled KITTI scene; the original scan is shown in the column “KITTI”. The colors are soft interpolations of Real, Syn, and Misc. The second row shows the color-coded depth projection of the point cloud. The third row shows the relative error between the generated sample and the original high-resolution sample from the column “KITTI”: a pixel is green if the error is 0% and pink if the error is higher than 10%, with values in between linearly interpolated. The fourth and sixth rows show the segmentation results of models trained on the respective up-sampled data; a legend of the semantic colors is provided in Table 3. The ground truth semantic labels are shown in the leftmost column. For better comparison, the fifth column shows correctly classified pixels in green and wrong classifications in pink. The visualized sample is from the validation split and was neither used to train the metric nor the segmentation network

We conduct the adversary ablation only qualitatively, because it is not possible to compare the quantitative scores of the different versions. A metric trained as in Fig. 6b could have a different allocation of scores in the range [0, 1] than a metric trained as in Fig. 6d.

Table 3 Semantic segmentation performance: The table lists the evaluation results of the DarkNet21 model for point-wise semantic segmentation

6 Metric Application

In this section we demonstrate how our realism measure ranks different datasets generated by neural networks. We then compare these results to our baseline evaluation measures (introduced in Sect. 4.3) and analyze the resulting performance of a segmentation network.

Figure 7 is divided into three parts horizontally. The leftmost plot shows the baseline metrics, the middle shows the results of our metric network, the rightmost plot shows the segmentation performance. The vertical axis on the left lists five different versions of KITTI data, generated as explained in Sect. 4.2. For the displayed segmentation results, different versions of the same model were trained with each of the datasets and then evaluated on the original KITTI data.

The original KITTI data is displayed for reference and has reconstruction errors of zero. The methods are ranked from top to bottom by increasing realism as approximately perceived by humans (Footnote 1). In general, the baseline metrics show a tendency but no clear correlation with the degree of realism and struggle to produce an unambiguous ordering of the methods. Our realism score, on the other hand, sorts the up-sampling methods according to human visual judgment. These results align with the ones in (Triess et al., 2019), which shows that a low reconstruction error does not necessarily imply high realism in the generated outputs. This is the main reason for the emergence of perceptual losses in recent years (Johnson et al., 2016; Ledig et al., 2017b).

The upper row of Fig. 8 shows an example scene for all up-sampling versions with their obtained scores. The \(\mathcal {L}_1\)-CNN produces an almost perfect version of the original high-resolution data, with only some noise at object boundaries. Bilinear interpolation works very well on large surfaces, but produces isolated noise points, especially in regions where the LiDAR usually receives no return, e.g. windows. The \(\mathcal {L}_2\)-CNN can reconstruct the outlines of the scene, but suffers from high noise throughout the entire point cloud. Similarly, the up-sampling GAN suffers from high noise, but is often not able to reconstruct the outlines of the scene and forms random point clusters instead of clear objects. The nearest-neighbor interpolation causes vertically stretched objects, which works fine for walls, poles, and other vertical objects, but fails for the ground.

These differences also cause different behavior in downstream perception in the target domain when the generated data is used for training. The rightmost plot in Fig. 7 shows the overall results, while Table 3 shows class-wise results. Additionally, the bottom row of Fig. 8 visualizes segmented example point clouds produced by the models trained with the respective data. Both segmentation models show trends for the order of the up-sampling methods that are similar to those of the realism metric. The slightly higher Real score for the \(\mathcal {L}_1\)-CNN than for the original KITTI data can also be seen in the segmentation score of the SqueezeSegV2 model, but it is neither significant nor does it occur for DarkNet21. The SqueezeSegV2 behavior on the nearest-neighbor up-sampling also differs from that of DarkNet21 and the metric. We assume that two effects lead to this different behavior: First, as mentioned in footnote 1, it is not clear how exactly the nearest-neighbor interpolation should be judged in terms of realism. Second, SqueezeSegV2 exhibits almost no variance in its performance scores. The combination of these two effects could cause the difference in behavior, but it is not entirely clear how; this requires further investigation, which is left for future work.

The class-wise results in Table 3 show that the \(\mathcal {L}_2\)-CNN and the GAN achieve quite good results for dynamic objects. At the same time, it is very hard to tell which of the point clusters in the 3D visualization of Fig. 8 belong to these objects. This raises the question of why training with this highly distorted data achieves such good performance in the target domain. The question can be answered by looking at the projected LiDAR scan. Here it becomes visible that even regions that suffer from high noise can still be approximately detected by their edge outlines in the projection. The third row of Fig. 8 shows the point-wise relative error between the generated and the target point cloud, with the error clipped to a maximum of 10%. Even for the noisy-looking \(\mathcal {L}_2\)-CNN version, relative errors are quite low and outlines are therefore clearly visible in the depth projection (second row). We take this as an indication that the segmentation model is not influenced by local noise perturbations, but rather learns a more generalized appearance of the object shapes.

7 Discussion

Our experiments show a correlation between the measured training data realism and the final perception performance. Qualitatively, however, the segmentation performance seems to be less affected by reduced point cloud realism than one would expect from judging the 3D visualizations. We believe that this is caused by the selected architectures of the up-sampling and segmentation models. The segmentation networks operate on the 2D projections of the point clouds, which is similar to the projection space used for up-sampling. Even though objects are blurred and unrecognizable when the GAN up-sampling is displayed as raw 3D data, object shapes are still detectable in the 2D projections. We draw two conclusions from this observation:

First, visual judgment is highly dependent on the chosen data representation and its visualization. This is an important reason to use a quantitative metric such as ours on a large amount of data. Second, we believe that our metric might be more reliable for estimating the performance of downstream tasks that operate in 3D space.

A major concept to keep in mind is the difference between domain gap and realism. If the task to solve is to train a method for KITTI-to-nuScenes adaptation, then both the target and the source domain are Real. Our metric can be used to rule out any unrealistic data compositions that form in the transition between those two datasets, e.g. while training a domain adaptation method. However, if the method simply outputs an identity function, the realism score would be at its maximum, while the domain gap would still cause poor perception performance in the target domain. Therefore, downstream tasks depend only in part on the realism of the data, and domain gaps have to be measured differently.

Table 4 Network architecture: Detailed network architecture and input format definition. The ID of each row is used to reference the output of the row. \(\uparrow \) indicates that the layer directly above is an input. N denotes the number of LiDAR measurements. \(Q_j\) are the number of query points at abstraction level j. \(K_j\) are the number of nearest neighbors to search at abstraction level j. U are the number of output units of the classifier and the adversaries
Table 5 SRGAN Generator Architecture: Detailed network architecture and input format definition of the SRGAN generator (Ledig et al., 2017a)
Table 6 SRGAN discriminator architecture: Detailed network architecture and input format definition of the SRGAN discriminator (Ledig et al., 2017a)

8 Conclusion

This paper presented a novel metric to quantify the degree of realism of local regions in LiDAR point clouds. Through adversarial learning, we obtain a feature encoding that adequately captures data realism in general instead of focusing on dataset-specific characteristics. In extensive experiments, we demonstrated the reliability and applicability of our metric on unseen data. The predictions of our method correlate well with visual judgment, unlike reconstruction errors, which serve only as a proxy for realism. In addition, we investigated the influence of data realism on a downstream perception task.

Future work includes designing a generative model that uses synthetic data, e.g. CARLA, as input to generate realistic real-world data, e.g. KITTI. Our metric can then be used in this setup to find the point at which the generated data actually improves the downstream perception performance of a segmentation or object detection model.

Table 7 Class mapping: This table shows the detailed class label mapping of the original dataset label ids to our custom mapping used for the segmentation experiments