1 Introduction

As one of the key challenges in autonomous driving [13] and robot perception [4], panoptic segmentation of 3D point clouds aims to unify semantic and instance segmentation, further achieving fine-grained 3D scene perception. Specifically, each 3D point is expected to be classified into background (stuff) or foreground (things) classes with a specific instance ID. Due to the irregular and disordered nature of 3D point clouds, coupled with the effects of occlusion, noise and incomplete scanning, achieving effective panoptic segmentation remains a major challenge.

Recently, Behley et al. [5] were the first to explore panoptic segmentation on the SemanticKITTI dataset. This pioneering method enriches the dataset with instance-level annotations and leverages both semantic segmentation [6] and object detection [7] techniques for panoptic segmentation. Inspired by this work, several dedicated network architectures have been developed with improved accuracy. These approaches can be divided into proposal-based and proposal-free methods. In particular, proposal-based methods [3, 5, 8–10] explicitly predict bounding boxes or binary masks to split instances from backgrounds for segmentation. In contrast, proposal-free methods [11–15] learn discriminative point-wise embeddings and adopt clustering techniques to group individual points. Owing to their simpler network architectures and lower inference cost, proposal-free methods have drawn increasing interest.

Although substantial progress has been achieved in recent years, existing proposal-free methods still have two limitations. First, previous methods commonly use heuristic techniques (e.g., L2 distance) to distinguish different instances without considering intra-instance variance (the measurement of differences between points belonging to the same instance). In practice, intra-instance variance can be larger than inter-instance variance (the measurement of differences between instances), making these methods sensitive to outliers and prone to dividing an object into fragments (Fig. 1(a)). Second, existing methods formulate point-wise embedding learning and instance clustering as two decoupled steps for separate optimization. Consequently, discriminative embedding cannot be well learned, which hinders further performance improvement.

Figure 1

Panoptic results on the nuScenes dataset. We set random colors for the points of each predicted instance to distinguish them

To address the above issues, we propose a method for panoptic segmentation of 3D point clouds in outdoor scenes using a Gaussian mixture model (GMM), termed GMM-PanopticSeg, which explicitly models the intra-instance variance of an object with a 2D Gaussian distribution. In this way, the point cloud can be simplified to a GMM [16] for panoptic segmentation. First, we formulate intra-instance variance in the embedding space as a Gaussian distribution. Specifically, we develop a distribution estimation module to predict the covariance of the Gaussian distribution to capture the intra-instance variance of diverse objects. Second, as the intra-instance variance is modeled as a Gaussian distribution, we further introduce a unified loss function to achieve joint optimization of embedding learning and instance clustering. Our framework is generic and can be seamlessly integrated with existing panoptic segmentation networks to achieve consistent performance improvements. For example, with Panoptic-PolarNet [12] and DS-Net [13] serving as backbones, our framework achieves an average improvement of 9.6%/5.9% in the \(\mathrm{PQ} ^{\text{Th}}\)/PQ score on the nuScenes dataset.

Overall, the contributions of this paper can be summarized as follows.

1) We propose modeling the intra-instance variance of an object in the embedding space as a 2D Gaussian distribution and employing a Gaussian mixture model to represent the point cloud. To the best of our knowledge, our framework is the first work that explicitly considers intra-instance variance during panoptic segmentation.

2) We introduce a unified loss function to integrate embedding learning and instance clustering for end-to-end joint optimization.

3) Our framework improves the discrimination capability of the embedding and can further boost the performance of previous state-of-the-art approaches on benchmark datasets.

2 Related work

In this section, we first review several point cloud instance segmentation methods. Then, we discuss recent advances in point cloud panoptic segmentation.

2.1 Instance segmentation of 3D point clouds

Existing point cloud instance segmentation techniques can be categorized into boundary-based and grouping-based methods.

Boundary-based methods. This category of methods commonly follows a two-stage pipeline. Specifically, bounding boxes are first predicted as the initial boundaries of instances and then further refined through bounding box regression or binary classification. For example, GSPN [17] uses an analysis-by-synthesis strategy to generate bounding boxes from shape proposals, and subsequently refines these boxes using R-PointNet. 3D-SIS [18] extracts geometry and color features from multiple views to generate bounding boxes and binary masks for instance segmentation. 3D-BoNet [19] leverages global features to directly regress bounding boxes and then matches proposals to instances using the Hungarian algorithm [20]. GICN [21] follows a bottom-up paradigm to first select center points and then predict the corresponding bounding boxes. Although these methods produce promising results, the two-stage pipeline with costly post-processing techniques (e.g., non-maximum suppression) introduces considerable overhead.

Grouping-based methods. Unlike boundary-based methods, grouping-based methods directly learn discriminative point-wise embeddings and adopt clustering techniques for instance segmentation. Specifically, SGPN [22] pulls the points belonging to the same instance close to each other in the feature embedding space and leverages a similarity matrix for grouping. Recently, several works [23–26] have used the averaged embedding of the points belonging to an instance as the optimization goal of the embedding learning process. In this way, point-wise embeddings tend to be similar to their corresponding averaged embeddings, but dissimilar to the others. MASC [27] iteratively merges neighboring nodes into instance groups and clusters points using learnable multi-scale affinity. DyCo3D [28] and Mask3D [29] predict binary masks to assign instance IDs. Other methods [30–33] leverage the positions of objects as additional information and adopt instance centers as the optimization goal of embedding learning. However, due to occlusion and noise, the centers predicted from incomplete partial point clouds usually deviate from the instance centroids, resulting in limited performance.

2.2 Panoptic segmentation of 3D point clouds

Point cloud panoptic segmentation aims to provide unified semantic segmentation and unique instance segmentation results. Owing to their advantages in handling instance ID conflicts, grouping-based methods have attracted increasing interest from researchers for the task of panoptic segmentation. Considering the irregular nature of point clouds, several methods transform them into other representations for panoptic segmentation. Specifically, LPSAD [11] encodes point clouds into a range-view representation to extract point-wise embeddings and utilizes a learnable radius for clustering. DS-Net [13] uses Cylinder3D [34] as the backbone and clusters different instances via dynamic shifting to address the inconsistent accuracy of centers predicted for different instances. Panoptic-PolarNet [12] first projects point clouds onto the bird’s-eye view (BEV) plane and then predicts a 2D heatmap to conduct clustering. Panoptic-PHNet [14] modifies the BEV encoder and employs voxel features to improve segmentation performance. In addition, a k-nearest neighbor (KNN) transformer is used to predict a pseudo heatmap to avoid inconsistency between the heatmap and offset branches. Since such transformations inevitably introduce information loss, recent works directly perform panoptic segmentation on point clouds. PVCL [35] uses contrastive learning to learn stable and discriminative features. GP-S3Net [36] first over-segments foreground points and then employs graph convolutional neural networks (GCNNs) to merge fragments from the same instance. PolarStream [37] uses a polar coordinate system and leverages wedge-shaped point cloud sectors to improve inference efficiency. Recently, mask-based methods [38, 39] have achieved outstanding performance on leaderboards. They use instance prototypes from learnable parameters matched with point-wise embeddings, and perform instance segmentation by predicting binary masks.

3 The proposed method

In this section, we first provide an overview of our framework. Then, we present our distribution estimation module, distribution-instance matching strategy, and loss function in detail.

3.1 Overview

Given a point cloud with N points \(\mathcal{P} \in \mathbb{R}^{N \times d_{\text{in}}}\), our method aims to predict a semantic label and an instance ID (0 for stuff classes) for each point. Here, \(d_{\text{in}}\) is the number of input attributes, including the 3D coordinates and reflection intensity. As illustrated in Fig. 2, our GMM-PanopticSeg method consists of a backbone module, a distribution estimation module (DEM) and a voting-based post-processing module. Note that we follow Ref. [40] to generate instance centers from the heatmap branch (see Ref. [21] for point-based methods). In addition, the heatmap branch can be replaced by other instance center generation modules [13, 14]. Our framework is generic and can be applied to extend different semantic segmentation networks to the panoptic segmentation task.

Figure 2

The overall framework of our GMM-PanopticSeg model. The distribution estimation module (DEM) uses the likelihood probability of each point to model the intra-instance variance. BEV refers to the bird’s eye view

Our GMM-PanopticSeg method is composed of four major steps, as shown in Fig. 2. First, 3D point clouds are fed to the backbone network [34, 41] to predict the point-wise offset to move each point towards its corresponding instance center, resulting in predicted centers \(\mathcal{P}^{\text{ctr}}\). Moreover, point-wise semantic labels and a BEV heatmap are produced. The resultant heatmap is employed to predict \(N^{\text{inst}}\) instance centers \(\mathcal{I}^{\text{ctr}}\) using a window-based search strategy [12, 40]. Second, \(\mathcal{I}^{\text{ctr}}\), \(\mathcal{P}^{\text{ctr}}\) and point-wise features in the backbone are passed to the proposed distribution estimation module to produce a Gaussian distribution per instance and model the whole scene as a Gaussian mixture model [16] (a probabilistic model that assumes that all points are generated from a mixture of Gaussian distributions). Third, we calculate the likelihood probabilities that one point belongs to different Gaussian distributions and assign each point to the instance with the maximum probability. Finally, we merge the instance segmentation result and semantic segmentation result. Specifically, we accumulate the semantic segmentation results of points assigned to the same instance ID and set semantic labels to the highest scoring category.
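To make the last two steps concrete, the following minimal sketch (in PyTorch; not the authors' released code) assigns each thing point to the Gaussian with the maximum likelihood and then applies the majority-voting merge of semantic labels. The tensor names and shapes are assumptions for illustration.

```python
import torch

def assign_and_merge(log_probs, sem_logits):
    """log_probs: (N_inst, N) log-likelihood of each point under each Gaussian.
    sem_logits: (N, C) point-wise semantic scores."""
    inst_ids = log_probs.argmax(dim=0)            # (N,) instance ID per point
    sem_labels = sem_logits.argmax(dim=1)         # (N,) initial semantic labels
    for inst in inst_ids.unique():
        mask = inst_ids == inst
        # accumulate semantic scores over the instance and keep the top class
        sem_labels[mask] = sem_logits[mask].sum(dim=0).argmax()
    return inst_ids, sem_labels

# toy usage: 3 predicted instances, 100 points, 16 classes (as in nuScenes)
inst_ids, sem_labels = assign_and_merge(torch.randn(3, 100), torch.randn(100, 16))
```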

3.2 Distribution estimation module

Our distribution estimation module (DEM) aims to use a 2D Gaussian distribution to model the intra-instance variance of point-wise embeddings on the BEV plane. As Gaussian distributions are characterized by their covariance matrices, we parameterize each covariance matrix using three hyperparameters, which are learned with neural networks. As illustrated in Fig. 3, our distribution estimation module consists of three steps: hyperparameter prediction, covariance matrix generation and distribution learning.

Figure 3

Network architecture of our distribution estimation module. N is the number of input points and \(N^{\text{inst}}\) is the number of predicted instances. σ, k and θ are the parameters of the covariance matrices C. \(\mathcal{\hat{I}}^{\text{ctr}}\) is the refined instance center. MLP1, MLP2 and MLP3 are multi-layer perceptron networks. Ind info denotes the 2D index information and Pos info denotes the 2D position information

Hyperparameter prediction. Intuitively, the relationship between the instance predicted centers \(\mathcal{I}^{\text{ctr}}\) and the point-wise predicted centers \(\mathcal{P}^{\text{ctr}}\) reflects the intra-instance variance and can be used to generate the hyperparameters of the Gaussian distributions. For each point \(\mathcal{P}^{\text{ctr}}_{i}\), we first find its nearest instance predicted center \(\mathcal{I}^{\text{ctr}}_{j}\) and concatenate the center feature with the point feature. Next, the 2D indices of \(\mathcal{P}^{\text{ctr}}_{i}\) and \(\mathcal{I}^{\text{ctr}}_{j}\), together with the difference \(\mathcal{P}^{\text{ctr}}_{i} - \mathcal{I}^{\text{ctr}}_{j}\), are concatenated with the point feature, which is subsequently fed into a two-layer multilayer perceptron (MLP) (i.e., MLP1 in Fig. 3) for point-wise aggregation. Similarly, we further concatenate the 2D coordinates of \(\mathcal{P}^{\text{ctr}}_{i}\) and \(\mathcal{I}^{\text{ctr}}_{j}\) and their difference with the point feature, and pass the concatenation to another MLP (i.e., MLP2). After aggregating the relationships between instance predicted centers and point-wise predicted centers for each point, we gather the features of points belonging to the same predicted instance by their indices and use a pooling operation to produce an instance representation. Finally, the hyperparameters σ, k, and θ and the refined instance center \(\mathcal{\hat{I}}^{\text{ctr}}\) are regressed for each instance using a three-layer MLP (i.e., MLP3).
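The PyTorch-style sketch below gives one rough reading of this step (it is not the released implementation): each point is related to its nearest instance center, per-point features are aggregated by MLP1/MLP2, pooled per instance, and σ, k, θ and a center refinement are regressed by MLP3. All layer widths, the pooling choice and the exact input concatenations are assumptions.

```python
import torch
import torch.nn as nn

class HyperparamHead(nn.Module):
    """Sketch of the MLP1/MLP2/MLP3 pipeline in Fig. 3 (widths/inputs are assumptions)."""
    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(feat_dim + 6, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Linear(hidden + 6, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mlp3 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 5))  # sigma, k, theta, center offset (2)

    def forward(self, point_feat, p_idx, p_ctr, i_idx, i_ctr):
        # point_feat: (N, feat_dim); p_idx/p_ctr: (N, 2) BEV indices / 2D coordinates of
        # the point-wise predicted centers; i_idx/i_ctr: (M, 2) for the instance centers.
        nearest = torch.cdist(p_ctr, i_ctr).argmin(dim=1)                    # (N,)
        ind_info = torch.cat([p_idx, i_idx[nearest], p_idx - i_idx[nearest]], dim=1)
        pos_info = torch.cat([p_ctr, i_ctr[nearest], p_ctr - i_ctr[nearest]], dim=1)
        x = self.mlp1(torch.cat([point_feat, ind_info.float()], dim=1))      # (N, hidden)
        x = self.mlp2(torch.cat([x, pos_info], dim=1))                       # (N, hidden)
        # max-pool the features of points assigned to the same instance
        inst_feat = torch.full((i_ctr.shape[0], x.shape[1]), -1e9)
        inst_feat = inst_feat.scatter_reduce(0, nearest[:, None].expand_as(x), x,
                                             reduce="amax")
        out = self.mlp3(inst_feat)                                           # (M, 5)
        sigma = nn.functional.softplus(out[:, 0])                            # minor axis
        k = 1.0 + nn.functional.softplus(out[:, 1])                          # keep k > 1
        theta = out[:, 2]                                                    # rotation angle
        refined_ctr = i_ctr + out[:, 3:5]                                    # refined center
        return sigma, k, theta, refined_ctr
```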

Covariance matrix generation. A 2D Gaussian distribution can be characterized by a covariance matrix \(\boldsymbol{C} \in \mathbb{R}^{2 \times 2}\). In the previous step, we predict the hyperparameters σ, k, and θ, where σ is the minor axis variance, k measures the ratio between the major axis and minor axis variances, and θ represents the rotation angle of the distribution. The covariance matrix of the Gaussian distribution for the i-th predicted instance can then be obtained as

$$ \boldsymbol{C}_{i} = \begin{bmatrix} \text{cos}(\theta _{i}) & -\text{sin}(\theta _{i}) \\ \text{sin}(\theta _{i}) & \text{cos}(\theta _{i}) \end{bmatrix} \begin{bmatrix} \sigma _{i}^{2} & 0 \\ 0 & k_{i}^{2} \sigma _{i}^{2} \end{bmatrix} \begin{bmatrix} \text{cos}(\theta _{i}) & \text{sin}(\theta _{i}) \\ -\text{sin}(\theta _{i}) & \text{cos}(\theta _{i}) \end{bmatrix} . $$
(1)

Without loss of generality, we keep \(k>1\) so that σ represents the minor axis variance. Here, we use the softplus activation function to keep k and σ positive. Compared with directly regressing the covariance matrix, our method decouples the distribution parameters and reduces instability during training. Note that Eq. (1) can also be generalized to higher-dimensional cases. Our ultimate goal is not to perfectly fit the predicted centers but to distinguish points of different instances through the predicted distribution. Intuitively, points closer to the center of a distribution are more likely to belong to the same instance, and the variance represents the degree of confidence of the distribution, which differs across dimensions and is correlated between them. We adopt 2D Gaussian distributions in this paper due to their simplicity for BEV formulations of point clouds and their low computational complexity.
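The following short sketch shows how Eq. (1) can be implemented to build the per-instance covariance matrices from the predicted hyperparameters (assuming σ and k have already been made positive, e.g., via the softplus mentioned above); the function name is ours.

```python
import torch

def build_covariance(sigma, k, theta):
    """sigma, k, theta: (M,) tensors of per-instance hyperparameters.
    Returns (M, 2, 2) covariance matrices following Eq. (1)."""
    cos, sin = torch.cos(theta), torch.sin(theta)
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin,  cos], dim=-1)], dim=-2)            # (M, 2, 2) rotation
    diag = torch.diag_embed(torch.stack([sigma**2, (k * sigma)**2], dim=-1))  # axis variances
    return rot @ diag @ rot.transpose(-1, -2)                                 # R * D * R^T

# toy usage: an elongated distribution rotated by 45 degrees
C = build_covariance(torch.tensor([0.5]), torch.tensor([3.0]), torch.tensor([torch.pi / 4]))
```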

Distribution learning. After the above steps, each predicted instance is modeled as a Gaussian distribution. However, fitting these 2D Gaussian distributions, characterized by learnable hyperparameters, to diverse objects in point clouds remains a challenge because the ground truth covariance matrix is not available in real-world scenes to provide supervision. To address this issue, we instead maximize the probability that each point in an instance belongs to the corresponding Gaussian distribution. Mathematically, the probability for the i-th predicted instance is calculated as follows:

$$ \begin{aligned} P_{i} &= \prod _{j \in S_{i}}P_{ij} \\ &=\prod_{j \in S_{i}} \frac {1}{2 \uppi \sqrt { \vert \boldsymbol{C}_{i} \vert }}\text{e}^{- \frac {1}{2} (\mathcal{\hat{I}}^{\text{ctr}}_{i}-\mathcal{P}^{ \text{ctr}}_{j})^{\text{T}}\boldsymbol{C}^{-1}_{i}(\mathcal{\hat{I}}^{ \text{ctr}}_{i}-\mathcal{P}^{\text{ctr}}_{j})}, \end{aligned} $$
(2)

where \(P_{ij}\) is the probability that the j-th point belongs to the i-th predicted instance. \(\boldsymbol{C}_{i}\) and \(\mathcal{\hat{I}}^{\text{ctr}}_{i}\) are the covariance matrix and the center of the Gaussian distribution for the i-th predicted instance, respectively. \(S_{i}\) is the point set of the i-th instance and \(\mathcal{P}^{\text{ctr}}_{j}\) is the predicted center of the j-th point. We assume that the probabilities of different instances are independent and use the negative log-likelihood (NLL) as the loss function to encourage the Gaussian distributions to fit the intra-instance variance:

$$ \mathcal{L}_{\text{dl}} = -\sum_{i=0}^{N^{\text{inst}}-1} \text{log}(P_{i}). $$
(3)

Due to the effects of occlusion and noise, mismatches between points and their corresponding instances usually lead to unstable training. To remedy this, points with confidence scores lower than a threshold τ, i.e., \({P_{ij}} / {\sum_{k} P_{kj}} < \tau \), are filtered out of \(\mathcal{L}_{\text{dl}}\). Specifically, for the j-th point, the numerator is the probability that it belongs to the i-th Gaussian distribution, while the denominator is its summed probability over all Gaussian distributions. The smaller the value of \(P_{ij} / \sum_{k} P_{kj}\), the lower the confidence that the j-th point belongs to the i-th Gaussian distribution. Consequently, such points are excluded from Eq. (3) to increase the stability of our Gaussian distributions.
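A compact sketch of Eqs. (2)–(3), including the confidence-based filtering with threshold τ, could look as follows; the variable names are ours and the density uses the standard 2D Gaussian normalization.

```python
import math
import torch

def distribution_loss(p_ctr, inst_ids, centers, covs, tau=0.01):
    """p_ctr: (N, 2) point-wise predicted centers; inst_ids: (N,) predicted instance of
    each point; centers: (M, 2) refined instance centers; covs: (M, 2, 2) covariances."""
    # log density of every point under every Gaussian (also needed for the filter ratio)
    d = centers[:, None, :] - p_ctr[None, :, :]                 # (M, N, 2)
    inv = torch.linalg.inv(covs)                                # (M, 2, 2)
    maha = torch.einsum('mni,mij,mnj->mn', d, inv, d)           # squared Mahalanobis dist.
    log_p = -0.5 * (maha + torch.logdet(covs)[:, None]) - math.log(2 * math.pi)
    p = log_p.exp()                                             # (M, N)
    idx = torch.arange(p.shape[1])
    # keep only points whose relative confidence for their own instance is at least tau
    keep = p[inst_ids, idx] / (p.sum(dim=0) + 1e-12) >= tau
    nll = -log_p[inst_ids, idx]                                 # NLL under assigned Gaussian
    return nll[keep].sum()
```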

3.3 Distribution-instance matching

With each predicted instance modeled by a Gaussian distribution, the whole point cloud can be regarded as a Gaussian mixture model. In practice, the number of predicted instances (i.e., Gaussian distributions) may not be consistent with the number of ground truth objects in the scene, and this mismatch poses challenges for optimization. To address this issue, we propose a distribution-instance matching method, which consists of three steps. First, we calculate the probabilities that associate each point with each predicted distribution (i.e., \(P_{ij}\) in Eq. (2), which associates the j-th point with the i-th distribution). To prevent missing instances, we pre-define Gaussian distributions with identity covariance matrices as padding distributions. Then, for each point, we calculate its probabilities of belonging to the different Gaussian distributions, including both predicted and pre-defined distributions. Second, for the i-th Gaussian distribution, we average the probabilities of the points belonging to the k-th ground truth instance to calculate the distribution-instance matching probability \(P^{\text{match}}_{ik}\). Third, we use the Hungarian algorithm [20] to obtain the optimal matching with the highest probability. The PyTorch-style distribution-instance matching algorithm is displayed in Alg. 1.

Algorithm 1

Distribution-Instance Matching

Note that we also include the pre-defined distributions derived from the ground truth in the matching to avoid the instance-missing problem. We set the selection probability \(P^{\text{s}}\) of each distribution from the DEM to 1 and the selection probability \(P^{\text{s}}\) of the i-th pre-defined distribution to \(0.1 \cdot F_{\mathrm{heat}}(F^{\text{gt}}_{i})\), where \(F^{\text{gt}}_{i}\) is the i-th ground truth center and \(F_{\mathrm{heat}}(\cdot)\) is the heatmap value at that location. Considering that the ground truth centers are potential predicted centers, we associate the selection probability with \(F_{\mathrm{heat}}\) to encourage proposing ground truth centers when instances are missed.
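Since Alg. 1 is only shown as a figure, the sketch below gives our reading of the matching step: pad the predicted Gaussians with identity-covariance distributions at the ground truth centers, weight the per-pair matching probabilities by the selection probabilities described above, and solve the assignment with the Hungarian algorithm. Function and variable names are assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_distributions(point_probs, gt_inst_ids, heat_at_gt):
    """point_probs: (M + G, N) probabilities of every point under the M predicted and the
    G padding Gaussians; gt_inst_ids: (N,) ground truth instance index of each point in
    [0, G); heat_at_gt: (G,) heatmap values at the ground truth centers."""
    g = heat_at_gt.shape[0]
    m = point_probs.shape[0] - g
    # selection probabilities: 1 for predicted Gaussians, 0.1 * heatmap value for padding
    p_sel = torch.cat([torch.ones(m), 0.1 * heat_at_gt])
    # matching probability: mean probability over the points of each ground truth instance
    p_match = torch.stack([point_probs[:, gt_inst_ids == k].mean(dim=1)
                           for k in range(g)], dim=1)                    # (M + G, G)
    cost = -(p_sel[:, None] * p_match)                                   # maximize -> negate
    rows, cols = linear_sum_assignment(cost.numpy())
    return list(zip(rows.tolist(), cols.tolist())), p_match              # (row, gt) pairs
```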

To achieve trainable distribution-instance matching, we introduce a matching loss to consider matched and missed instances separately:

$$ \mathcal{L}_{\text{prob}} = - \frac{1}{N^{\text{gt}}} \sum _{k} \textstyle\begin{cases} \frac {P^{\text{match}}_{ak}}{\sum_{i=0}^{N^{\text{inst}}-1}P^{\text{match}}_{ik}},& { \mathrm{if~matched},} \\ \frac {P^{\text{match}}_{bk}}{\sum_{i=N^{\text{inst}}}^{N^{\text{inst}}+N^{\text{gt}}-1}P^{\text{match}}_{ik}},& { \mathrm{if~missed.}} \end{cases} $$
(4)

If the k-th ground truth instance matches the a-th Gaussian distribution predicted by our distribution estimation module, the upper term is calculated to maximize the matching probability \({P^{\text{match}}_{ak}}\) among all the predicted Gaussian distributions. If the k-th ground truth instance has no matched predicted distribution, it is associated with the pre-defined Gaussian distributions centered at the ground truth instance centers to calculate \(P^{\text{match}}_{bk}\). Then, the lower term is calculated, which encourages our distribution estimation module to predict an additional Gaussian distribution to cover this missed instance.
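Given the matching result, Eq. (4) can be sketched as follows; we assume the predicted distributions occupy the first M rows of the matching-probability matrix and the padding distributions the last G rows, as in the previous sketch.

```python
import torch

def matching_loss(p_match, matches, m):
    """p_match: (M + G, G) matching probabilities; matches: list of (row, gt_index) pairs
    from the Hungarian matching; m: number of predicted (non-padding) distributions."""
    g = p_match.shape[1]
    loss = p_match.new_zeros(())
    for row, k in matches:
        if row < m:   # matched with a predicted Gaussian: normalize over predicted rows
            loss = loss - p_match[row, k] / (p_match[:m, k].sum() + 1e-12)
        else:         # missed: normalize over the padding rows (Gaussians at GT centers)
            loss = loss - p_match[row, k] / (p_match[m:, k].sum() + 1e-12)
    return loss / g
```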

3.4 Loss function

Previous methods consider point-wise embedding learning and instance clustering as two sequential steps and use two independent losses for separate optimization [13]. To address this issue, we introduce a unified loss function for end-to-end joint optimization of the whole framework. The overall loss is as follows:

$$ \mathcal{L} = a\mathcal{L}_{\text{cls}}+b\mathcal{L}_{\text{heatmap}}+c \mathcal{L}_{\text{offset}}+d\mathcal{L}_{\text{dl}}+e\mathcal{L}_{ \text{prob}}, $$
(5)

where we empirically set \(a=1\), \(b=100\), \(c=0.01\), \(d=5\), and \(e=10\). Note that the first three loss terms are already used in existing panoptic segmentation methods: \(\mathcal{L}_{\text{cls}}\) is the loss of the semantic branch, \(\mathcal{L}_{\text{offset}}\) is the L1 (absolute error) loss of the offset branch, and \(\mathcal{L}_{\text{heatmap}}\) is the mean squared error loss of the heatmap branch. \(\mathcal{L}_{\text{dl}}\) and \(\mathcal{L}_{\text{prob}}\) incorporate embedding learning and clustering into unified losses, thereby allowing joint optimization of the whole framework.
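A minimal sketch of Eq. (5) with the weights reported above; the individual loss terms are assumed to be computed by the corresponding branches (e.g., by the two loss sketches above).

```python
def total_loss(l_cls, l_heatmap, l_offset, l_dl, l_prob,
               a=1.0, b=100.0, c=0.01, d=5.0, e=10.0):
    # weighted sum of the semantic, heatmap, offset, distribution-learning and matching losses
    return a * l_cls + b * l_heatmap + c * l_offset + d * l_dl + e * l_prob
```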

4 Experiments

In this section, we first present the experimental setups. Then, we compare our method with previous state-of-the-art methods on two benchmark datasets. Finally, we perform ablation experiments to investigate the effectiveness of our framework.

4.1 Experimental setups

Datasets. In our experiments, we evaluate our method on two widely-used large-scale datasets, namely SemanticKITTI [2] and nuScenes [3].

SemanticKITTI. This dataset is composed of 22 sequences with 43,552 sparse LiDAR scans. Specifically, Sequences 00-07 and 09-10 are used for training (19,130 scans), sequence 08 with 4071 scans is used for validation, and the rest are used for online testing (20,351 scans). For the task of panoptic segmentation, the original annotations are remapped to 19 classes, of which there are 8 thing classes and 11 stuff classes. Each point is labeled with a unique semantic label and instance ID, where ID is set to 0 if it belongs to the stuff classes.

nuScenes. This dataset consists of 1000 sequences. We use 28,130 frames for training, 6019 frames for validation, and 6008 frames for testing. The nuScenes dataset contains 16 classes, 10 of which are thing classes. Point clouds in nuScenes are sparser and objects are denser than in SemanticKITTI, which makes instance segmentation more difficult. Moreover, nuScenes has a more balanced category distribution than SemanticKITTI, which facilitates the learning of semantic segmentation.

Metrics. Following Kirillov et al. [42], we use the widely used panoptic quality (PQ) as the main metric to evaluate the performance of panoptic segmentation.

$$ \text{PQ} = \underbrace{\frac{\sum_{(p,g)\in \text{TP}}\text{IoU}(p,g)}{ \vert \text{TP} \vert }}_{ \text{Segmentation quality (SQ)}} \underbrace{\frac{ \vert \text{TP} \vert }{ \vert \text{TP} \vert +\frac {1}{2} \vert \text{FP} \vert + \frac {1}{2} \vert \text{FN} \vert }}_{ \text{Recognition quality (RQ)}}, $$
(6)

where PQ can be decomposed into the product of recognition quality (RQ) and segmentation quality (SQ). RQ measures the recognition quality, and SQ represents the segmentation quality of recognized objects. Considering that PQ over-penalizes errors for stuff, we also follow Porzi et al. [43] and use \(\mathrm{PQ} ^{\dagger }\) as an additional evaluation metric. In addition, we use the mean intersection-over-union (mIoU) as the metric for semantic segmentation performance.
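For clarity, the following small sketch computes Eq. (6) for one class from matched segment pairs; following the standard PQ definition, a predicted segment and a ground truth segment count as a true positive when their IoU exceeds 0.5.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU values of matched (predicted, ground truth) segment pairs."""
    tp = len(tp_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / tp                            # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)      # recognition quality
    return sq * rq, sq, rq                            # PQ = SQ * RQ

# toy usage: three matched segments, one false positive, two false negatives
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=2)
```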

Implementation details. Since the latest methods are not open source, we choose two representative methods as our baselines. For a fair comparison with previous methods, we use the same hyperparameter settings as in Ref. [12] for training and inference (all unmentioned parameters remain unchanged). τ is empirically set to 0.01 in our experiments. As DropBlock [44] affects the generation of the heatmap, we do not activate DropBlock when training the DEM. We also employ several data augmentation techniques, including instance oversampling, random rotation (\([-\uppi ,\uppi ]\)), random flipping (along both the x-axis and the y-axis), and random scaling (\([0.95,1.05]\)). We use the default Adam optimizer with a learning rate of 0.001 for the backbone and a customized Adam optimizer for the DEM, and train our method on RTX 3090 GPUs.
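As a concrete illustration, a minimal NumPy sketch of the geometric augmentations listed above (random z-axis rotation, axis flipping, and scaling; instance oversampling is omitted) could look as follows; the function name and the (x, y, z, intensity) layout are assumptions.

```python
import numpy as np

def augment(points, rng=None):
    """points: (N, 4) array of (x, y, z, intensity)."""
    rng = rng or np.random.default_rng()
    xyz = points[:, :3].copy()
    theta = rng.uniform(-np.pi, np.pi)                 # random rotation about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    xyz[:, :2] = xyz[:, :2] @ np.array([[c, -s], [s, c]]).T
    if rng.random() < 0.5:                             # random flip along the x-axis
        xyz[:, 0] = -xyz[:, 0]
    if rng.random() < 0.5:                             # random flip along the y-axis
        xyz[:, 1] = -xyz[:, 1]
    xyz *= rng.uniform(0.95, 1.05)                     # random scaling
    return np.concatenate([xyz, points[:, 3:]], axis=1)
```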

Incorporating DEM into DS-Net. Unlike Panoptic-PolarNet, DS-Net does not have a heatmap branch. Hence, we project the sparse 3D features from the backbone onto a dense BEV plane and use U-Net-style upsampling layers as the heatmap branch to predict prototypes. The prototypes and point-wise embeddings are fed into the DEM to generate the predicted distributions.

4.2 Comparison with the state-of-the-arts

Quantitative evaluation on nuScenes. Table 1 shows the quantitative comparison of different methods on the nuScenes test set. The proposed method significantly improves the performance of DS-Net and Panoptic-PolarNet in terms of almost all metrics. Since the additional loss functions focus only on things, the segmentation performance for stuff may be slightly degraded; in other words, stuff receives relatively less attention. Notably, the most significant improvement lies in the \(\mathrm{PQ} ^{\text{Th}}\) score (an average improvement of 9.6%), indicating that our method can substantially improve the instance segmentation accuracy of the baseline network. Following recent works, we merge instance segmentation and semantic segmentation results via a majority-voting strategy. Specifically, points with the same instance ID are updated with the same semantic label such that the semantic segmentation performance is improved. Moreover, Panoptic-PHNet achieves the best performance because it uses a stronger closed-source backbone and test-time augmentation on the nuScenes test set, which increases the inference cost. We compare our method with Panoptic-PHNet under the same backbone and post-processing in the ablation studies.

Table 1 Quantitative results (%) of different approaches on the nuScenes [3] test dataset. PQ means panoptic quality. RQ denotes recognition quality. SQ represents segmentation quality. mIoU denotes intersection over union. \(^{\text{Th}}\) represents foreground classes. \(^{\text{St}}\) denotes stuff classes. § means our reproduced results. The blue number represents the growth compared to the baseline. The red number denotes the reduction compared to the baseline

We further provide quantitative comparisons on the validation set of the nuScenes dataset in Table 2. It is observed that the combination of the baseline networks with our method significantly improves their overall performance. In particular, by utilizing Eq. (4) to handle mismatches between instances and Gaussian distributions, a much higher accuracy is achieved. This can be reflected by the substantial increase in \(\mathrm{RQ} ^{\text{Th}}\) (an average increase of 8.7%). Additionally, the Gaussian distributions predicted by our DEM enable the baseline network to model intra-instance variance, which improves its robustness to occlusion and noise.

Table 2 Quantitative results (%) of different approaches on the nuScenes [3] validation dataset. PQ means panoptic quality. RQ denotes recognition quality. SQ represents segmentation quality. mIoU denotes intersection over union. \(^{\text{Th}}\) represents foreground classes. \(^{\text{St}}\) denotes stuff classes. § means our reproduced results. The blue number represents the growth compared to the baseline. The red number denotes the reduction compared to the baseline

Quantitative evaluation on SemanticKITTI. We have conducted experiments on the SemanticKITTI dataset, and the quantitative results on the validation set are presented in Table 3. For a fair comparison with corresponding baseline networks, their officially released pre-trained models are used for the initialization of our backbones. It can be observed that our framework significantly improves the performance of Panoptic-PolarNet. Since the SemanticKITTI dataset has a lower instance density and relatively rich point information, the main challenge on this dataset is not instance segmentation but semantic segmentation, as also noted in Panoptic-PolarNet [12]. Therefore, the performance improvements achieved on this dataset are relatively smaller than those achieved on the nuScenes dataset. Nevertheless, our GMM-PanopticSeg still improves the \(\mathrm{PQ} ^{\text{Th}}\) on SemanticKITTI from 65.7% to 68.6%, demonstrating the effectiveness of our framework.

Table 3 Quantitative results (%) of different approaches on the SemanticKITTI [2] validation dataset. PQ means panoptic quality. RQ denotes recognition quality. SQ represents segmentation quality. mIoU denotes intersection over union. \(^{\text{Th}}\) represents foreground classes. \(^{\text{St}}\) denotes stuff classes. The blue number represents the growth compared to the baseline. The red number denotes the reduction compared to the baseline

Qualitative results. We provide qualitative comparisons between the baseline and our GMM-PanopticSeg method on the SemanticKITTI and the nuScenes datasets in Fig. 4 and Fig. 5, respectively. Note that stuff is assigned a unique color according to the semantic label and each thing is assigned a random color according to the instance ID. We use black points to represent points that are mapped to the noise. Two important observations in Fig. 4 and Fig. 5 are noted here. First, large instances are segmented into multiple fragments by the baseline network. Since the point clouds on the surface of these large objects are far from their instance centers, high intra-instance variance limits the accuracy of previous methods. Second, dense objects are prone to being assigned wrong instance IDs. This is because the inter-instance differences for dense objects are relatively small, making previous methods sensitive to occlusion and noise. By explicitly modeling intra-instance variance and conducting embedding learning with instance clustering in an end-to-end framework, our method produces more accurate segmentation results for both large and dense objects.

Figure 4

Qualitative results on SemanticKITTI. Note that stuff is assigned a unique color according to the semantic label and each thing is assigned a random color according to the instance ID. We use black points to represent points that are mapped to the noise. The errors made by the baseline method are indicated by the red circle

Figure 5

Qualitative results on nuScenes. Note that stuff is assigned a unique color according to the semantic label and each thing is assigned a random color according to the instance ID. We use black points to represent points that are mapped to the noise. The errors made by the baseline method are indicated by the red circle

4.3 Ablation studies

To verify the effectiveness of the proposed components in our framework, we perform ablation studies in this section. Specifically, we start from Panoptic-PolarNet and build up our GMM-PanopticSeg method step by step, as shown in Table 4.

Table 4 Ablation studies on the nuScenes validation set. PQ means panoptic quality. mIoU denotes intersection over union. L2 means Euclidean norm

Learnable vs. pre-defined Gaussian distribution. One of the major contributions of our method is to model the intra-instance variance using Gaussian distributions characterized by learnable parameters. A straightforward alternative is to use pre-defined Gaussian distributions. To validate the effectiveness of our approach, we design model 1 and model 2 by introducing pre-defined and learnable Gaussian distributions to the baseline to model intra-instance variance, respectively. Note that models 1 and 2 use the same heuristic technique as the baseline for instance segmentation during inference; their Gaussian distributions are used only for embedding learning during training. It is found that pre-defined Gaussian distributions boost the PQ of the baseline from 63.2% to 64.9% (with the mIoU changing from 67.9% to 66.8%). When learnable Gaussian distributions are employed, model 2 achieves further gains (68.5%/69.1% in PQ/mIoU). This demonstrates that learnable Gaussian distributions can better improve the discrimination capability of point embeddings by explicitly modeling the intra-instance variance of an object.

With learnable Gaussian distributions, we can also replace the heuristic technique in the baseline (i.e., L2 distance) with the likelihood probability to make better use of the modeled intra-instance variance. Using the likelihood probability for instance segmentation during inference, model 3 further surpasses model 2 with notable improvements (69.6%/71.3%). Due to occlusion and noise in real-world scenarios, the intra-instance variance may be larger than the inter-instance difference, such that heuristic techniques achieve limited accuracy. By modeling the intra-instance variance with Gaussian distributions, the likelihood probability measure is more robust and can better distinguish different instances.

Visualization of learned Gaussian distributions. We further visualize learned Gaussian distributions for different objects in Fig. 6 and two important observations are reported here. First, larger objects (e.g., vehicles) with higher intra-instance variance have higher variance in their predicted Gaussian distributions than smaller objects (e.g., persons and bicycles). Second, the major axis of the predicted Gaussian distribution is usually along the long side of the object (e.g., cars and motorcycles). In summary, our predicted Gaussian distributions can model the intra-instance variance well for diverse instances.

Figure 6

Visualization of the predicted Gaussian distributions on the SemanticKITTI dataset

End-to-end vs. decoupled optimization. Another major contribution of our framework is the unified loss function that optimizes embedding learning and instance clustering in an end-to-end manner. To validate its effectiveness, we train our method using two different optimization strategies. Specifically, the backbone of model 4 is first optimized and then frozen to train the subsequent modules. In contrast, all modules of model 5 are trained jointly in an end-to-end manner. Table 5 shows that model 5 outperforms model 4 on most metrics. This demonstrates the superiority of our end-to-end optimization paradigm for panoptic segmentation.

Table 5 Ablation studies on the SemanticKITTI validation set. All values are in [%]. PQ means panoptic quality. mIoU denotes intersection over union. \(^{\text{Th}}\) represents foreground classes. \(^{\text{St}}\) denotes stuff classes

Comparison of clustering methods. We compare our method with Meanshift, PHM [14] and LHM [12] in Table 6. The result of using ground truth instance labels is also provided, and the same semantic branch is used for a fair comparison. Our method outperforms existing clustering-based methods and is very close to the result obtained with ground truth instance labels. We can also observe that even if the instance labels are replaced by ground truth labels, the PQ does not change much. This also explains why our method does not significantly improve the PQ score on the SemanticKITTI dataset.

Table 6 Results on the SemanticKITTI validation set. PQ means panoptic quality

Computational consumption. In our design, we only feed the points predicted as things to the DEM. This design has the same effect as feeding all points but reduces the computational cost. The inference cost is presented in Table 7. We measure the average inference cost of our method with Panoptic-PolarNet as the backbone on SemanticKITTI. The DEM is tiny, and the additional computational consumption is mainly spent on calculating the probability of each point under each distribution.

Table 7 Computational consumption on SemanticKITTI. Params means parameters. FLOPs denotes floating point operations

5 Conclusion

In this paper, we introduce a Gaussian mixture model for 3D panoptic segmentation and employ learnable Gaussian distributions to capture the intra-instance variance of different objects. In addition, we propose an end-to-end loss function for the joint optimization of embedding learning and instance clustering. Extensive experiments on different benchmark datasets and backbones validate the effectiveness of the proposed method.