Introduction

Humans are able to perceive various levels of detail and abstraction of a scene. We can not only understand different semantic categories such as bus, car, and sky, but we can also distinguish between individual entities (instances) and their components (parts), such as windows or wheels. In computer vision, the estimation of these parallel layers of abstraction has recently been introduced as panoptic-part segmentation [10]. Yet, there exists no completely unified and joint approach for this problem.

According to [6], a scene is composed of two kinds of regions: stuff and things. Things are countable objects such as persons, cars, or buses, whereas stuff, like the sky or road, is usually amorphous and innumerable. These two categories are addressed by the well-studied tasks of semantic segmentation and instance segmentation. However, neither task alone describes the scene in its entirety. To fill this gap, panoptic segmentation [21] was introduced, which recognizes and segments both stuff and things. Since then, several approaches for panoptic segmentation have been proposed [5, 20, 27, 42, 46, 57].

Part segmentation, or part parsing, on the other hand, seeks to semantically analyze the image at the part level. There has been some effort in this area, where part segmentation is often treated as a semantic segmentation problem [12, 18, 19, 25, 35, 38]. A few methods are instance-aware [11, 25, 63], and even fewer handle multi-class part objects [41, 64].

With the release of datasets for panoptic-part segmentation [10, 40], the first methods for this problem have been proposed [17, 28, 29]. In [10], a baseline approach is presented in which two networks for panoptic and part segmentation are used. These two networks are trained independently, and their results are combined using a uni-directional (top-down) merging strategy. This independent training has significant drawbacks. Using two separate networks incurs a computational overhead, and since the networks are trained in isolation, their predictions are not guaranteed to be consistent, which limits the effectiveness of the merging step. Moreover, independent training introduces learning redundancy, since the segmentation heads could potentially share semantic information.

Afterwards, Panoptic-PartFormer (PPF) [28] was proposed, in which the authors present a unified, combined transformer for things, stuff, and parts that iteratively refines the individual segmentations to achieve consistency. In this design, redundancies are avoided and similarities between tasks are exploited, but we argue that explicitly modeling the multi-task fusion can produce more accurate results.

To this end, and to overcome the limitations of the top-down merging, we have presented a joint panoptic-part fusion (JPPF) for panoptic-part segmentation in [17], in which each sub-task is treated equally to allow for mutual benefits and maximal consistency (cf. Fig. 1). By sharing a backbone for all three tasks, the joint fusion outperforms the top-down baseline while being more efficient at the same time.

Fig. 1

Our joint panoptic-part fusion (JPPF) combines individual predictions into a consistent panoptic-part segmentation

In this work, we re-present our JPPF [17] and extend the experiments, validation, and discussion. In short,

  • we present a single neural network that uses a shared encoder to perform semantic, instance, and part segmentation and fuses them efficiently to produce panoptic-part segmentation.

  • we propose a parameter-free joint panoptic-part fusion (JPPF) module that dynamically considers the logits from the semantic, instance, and part heads and consistently integrates the three predictions.

  • we conduct a thorough analysis of our approach and demonstrate the efficacy, accuracy, and consistency of the joint fusion strategy.

  • we obtain state-of-the-art results for panoptic-part segmentation on various datasets and metrics, surpassing our previous work [17], the top-down baseline [10], and the transformer-based competitor PPF [28].

  • we demonstrate that our approach generalizes to many other datasets without fine-tuning.

Related Work

Towards Panoptic-Part Segmentation

Part-aware panoptic segmentation [10] is a recently introduced problem that brings semantic, instance, and part segmentation together. Several methods have been proposed for these individual tasks, including panoptic segmentation, which itself is a blend of semantic and instance segmentation.

Semantic segmentation PSPNet [62] introduced the pyramid pooling module, which exploits multi-scale features by learning them at several scales and then concatenating and up-sampling them. Chen et al. [2] proposed atrous spatial pyramid pooling (ASPP), which is based on spatial pyramid pooling and combines features from several parallel atrous convolutions with varying dilation rates, as well as global average pooling. Incorporating multi-scale features and capturing global context increases computational complexity. Therefore, Chen et al. [3] introduced the dense prediction cell (DPC), and [54] suggested multi-scale residual units with varying dilation rates to compute high-resolution features at various spatial densities, as well as an efficient atrous spatial pyramid pooling module, called eASPP, to learn multi-scale representations with fewer parameters and a broader receptive field. For encoder-decoder architectures, much effort has been devoted to improving the decoder's up-sampling layers. Chen et al. [4] extend DeepLabV3 [2] by adding an efficient decoder module to enhance segmentation results at object boundaries. Later, Tian et al. [53] suggest replacing the up-sampling with data-dependent up-sampling (DUpsampling), which can recover pixel-wise predictions from low-resolution CNN outputs and take advantage of the redundant label space in semantic segmentation.

Instance segmentation Here, we mainly concentrate on proposal-based approaches. Hariharan et al. [13] proposed a simultaneous object recognition and segmentation technique that uses multi-scale combinatorial grouping (MCG) [45] to generate proposals and then runs them through a CNN for feature extraction. In addition, Hariharan et al. [14] presented a hyper-column pixel descriptor that captures the feature representations of all layers in a CNN for simultaneous object detection and segmentation. Pinheiro et al. [44] proposed the DeepMask network, which employs a CNN to predict the segmentation mask of each object as well as the likelihood of the object being in the patch. FCIS [30] employs position-sensitive inside/outside score maps to simultaneously predict object detection and segmentation. Later, one of the most popular networks for instance segmentation, Mask-RCNN [16], was introduced. It extends Faster-RCNN [48] with an extra network that segments each of the detected objects. RoI-align, which preserves exact spatial positions, replaces RoI-pooling, which performs a coarse spatial quantization for feature encoding.

Part segmentation Dense part-level segmentation, on the other hand, is instance-agnostic and is regarded as a semantic segmentation problem [12, 18, 19, 25, 35, 39, 41, 64]. Most of the research has been conducted on human part parsing [7, 11, 22, 24, 31, 32, 49, 58, 63], and only little work has addressed multi-class part segmentation [41, 64].

Panoptic segmentation The authors of [21] combined the output of two independent networks for semantic and instance segmentation and coined the term panoptic segmentation. Panoptic segmentation approaches can be divided into top-down methods [23, 26, 34, 46, 51, 57] that prioritize semantic segmentation prediction and bottom-up methods [5, 8, 59] that prioritize instance prediction. Our previous work [17] builds on EfficientPS [42] and extends this model to obtain panoptic-part segmentation. This work builds on our previous design of a joint architecture and exploits its modularity to replace individual components.

Panoptic-Part Segmentation

Datasets and Baselines

In recent years, part-aware panoptic segmentation [10] was introduced, which aims at unified scene and part parsing. Along with it, de Geus et al. [10] introduced a baseline model that combines a state-of-the-art panoptic segmentation network and a part segmentation network using heuristics. The panoptic and part segmentations are merged in a top-down or bottom-up manner. In the top-down merge, the prediction from panoptic segmentation is re-used for scene-level semantic classes that do not consist of parts. For partitionable semantic classes, the corresponding segment of the part prediction is extracted. In case of conflicting predictions, a void label is assigned. According to [10], the top-down merge produces better results than the bottom-up approach. In addition, their paper released two datasets with panoptic-part annotations: the Cityscapes panoptic parts (CPP) and Pascal panoptic parts (PPP) datasets [40]. Along with the drawbacks of employing independent networks as mentioned in “Introduction”, there are further concerns with the top-down merge. Due to inconsistencies, it may produce undefined regions around the contours of objects, and due to the imbalance between stuff and things, it also has trouble separating them. These issues are highlighted in Fig. 2. Furthermore, the uni-directional merge assigns higher importance to one of the predictions, neglecting the potential of mutual refinement during fusion. With our unified fusion for semantics, instances, and parts, we resolve these issues, giving equal priority to all individual predictions.
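To make the top-down merging strategy concrete, the following minimal sketch illustrates its per-pixel logic. Function and variable names are hypothetical and the actual implementation in [10] differs in detail; the sketch is only meant to show where void labels arise.

```python
import torch

def top_down_merge(panoptic_sem, panoptic_id, part_pred, parts_of_class, void=255):
    """Hypothetical sketch of uni-directional (top-down) merging as described in [10].

    panoptic_sem:   (H, W) semantic class id per pixel from the panoptic network
    panoptic_id:    (H, W) instance id per pixel (0 for stuff)
    part_pred:      (H, W) part class id per pixel from the part network
    parts_of_class: dict mapping each partitionable semantic class to a list of valid part ids
    """
    part_out = torch.zeros_like(part_pred)  # 0 = "no part" for non-partitionable classes

    for cls, valid_parts in parts_of_class.items():
        mask = panoptic_sem == cls                                    # area claimed by the panoptic prediction
        consistent = mask & torch.isin(part_pred, torch.tensor(list(valid_parts)))
        part_out[consistent] = part_pred[consistent]                  # re-use the part prediction where it agrees
        part_out[mask & ~consistent] = void                           # conflicting predictions become void

    return panoptic_sem, panoptic_id, part_out
```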

Fig. 2

Typical issues with the top-down merging approach of [10] are gaps around the contours of objects due to inconsistencies and difficulties in distinguishing between stuff and things

Unified Models

Panoptic-PartFormer (PPF) [28] was developed in parallel to [17] and follows a similar goal as our line of work: to unify panoptic-part segmentation. However, the authors of [28] approach the unification from the other side. While we suggest a unified fusion module to combine individual results in a well-balanced manner, they propose a shared encoder and transformer-based decoder to predict stuff, things, and parts together with a single model. This way, they achieve remarkable consistency and results. However, even though the prediction of the individual tasks is fully unified in a single architecture with task-specific queries, it is followed by the same uni-directional top-down merging as in [10], leading to void labels where inconsistencies remain.

Li et al. [29] propose a second version of their Panoptic-PartFormer (PPF++), in which they also introduce a new metric, called Part-Whole Quality (PWQ). Compared to the PartPQ of [10], PWQ is supposed to resolve the bias towards the PQ metric of panoptic segmentation. In our experiments, we will consider both these metrics for thorough comparisons.

Unified Panoptic-Part Segmentation

The main contribution of our work is the joint panoptic-part fusion (JPPF) that produces highly dense and consistent panoptic-part segmentations in an efficient manner. To obtain individual predictions for our fusion, in theory any method could be applied. However, we argue that a combined network for all three segmentation tasks produces better results through mutual learning and reasoning. Therefore in this section, we first formalize the problem of panoptic-part segmentation, then explain our unified network architecture presented in [17], and lastly describe the inner workings of JPPF.

Fig. 3

Our overall architecture for panoptic-part segmentation features a shared encoder, three specialized prediction heads, and the unified joint fusion module. Its modular structure allows us to easily replace the feature backbone or to use intermediate results from other approaches to perform a consistent fusion

Panoptic-Part Segmentation

The goal of panoptic-part segmentation is to predict a panoptic-part label (s, id, p) for each pixel of an image I. Here, s represents the semantic scene-level class, p the part-level class, and id the instance identifier of each object. It is important to note that not every pixel carries all components of a panoptic-part label, e.g. stuff is not instantiable, and there are many semantic classes for which a further subdivision into parts does not make sense, e.g. the sky. While the three labels can be obtained independently, a valid panoptic-part label must be consistent, i.e. free of contradictions: e.g. a car cannot share the object identifier of a bicycle or consist of human body parts. Achieving this consistency is the fundamental challenge in panoptic-part segmentation. To achieve it, different strategies can be followed, including naive merging [10], joint prediction [28, 29], or, as in our case, fusion [17].
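To make this label format and its consistency constraints concrete, the following sketch models a panoptic-part label as a per-pixel triple and checks the constraints described above. The class names and lookup tables are illustrative assumptions, not the actual dataset definitions.

```python
from typing import NamedTuple, Optional

class PanopticPartLabel(NamedTuple):
    sem: int                # scene-level semantic class s
    inst: Optional[int]     # instance identifier id (None for stuff)
    part: Optional[int]     # part-level class p (None if the class has no parts)

# Illustrative lookup tables (assumed for this sketch only)
THINGS = {1, 2}                                  # e.g. 1 = car, 2 = person
PARTS_OF = {1: {10, 11}, 2: {20, 21, 22, 23}}    # e.g. car -> {window, wheel}, person -> {head, torso, legs, arms}

def is_consistent(label: PanopticPartLabel) -> bool:
    # Stuff must not carry an instance identifier.
    if label.sem not in THINGS and label.inst is not None:
        return False
    # Things must carry an instance identifier.
    if label.sem in THINGS and label.inst is None:
        return False
    # A part label is only valid if it belongs to the part set of the semantic class.
    if label.part is not None and label.part not in PARTS_OF.get(label.sem, set()):
        return False
    return True
```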

Overall Architecture

To obtain individual predictions for semantics, instances, and parts, our previous work extends EfficientPS [42] by incorporating a part segmentation head. We reuse the backbone, semantic head, and instance head of EfficientPS. Since part segmentation can be regarded as a semantic segmentation problem, we replicate the architecture of the semantic branch of EfficientPS and train it for part-level segmentation. All three resulting heads share a common backbone, in our case EfficientNet [52], which helps to ensure that the predictions made by the heads are consistent with one another. Sharing a single representation for all three tasks improves efficiency and is beneficial during learning, as shown by our experiments in “A Single Shared Encoder”. An overview of the architecture of our proposed model is shown in Fig. 3.
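Schematically, the data flow of this architecture can be summarized as follows. The module names are placeholders for the actual EfficientPS/EfficientNet components; only the shared-encoder structure and the three heads feeding the fusion are illustrated.

```python
import torch.nn as nn

class PanopticPartNet(nn.Module):
    """Schematic sketch: one shared encoder, three task-specific heads, one fusion module."""

    def __init__(self, encoder, semantic_head, instance_head, part_head, fusion):
        super().__init__()
        self.encoder = encoder              # shared backbone (e.g. EfficientNet with FPN)
        self.semantic_head = semantic_head  # dense scene-level semantics
        self.instance_head = instance_head  # masks, boxes, classes, confidences
        self.part_head = part_head          # replica of the semantic branch, trained on (grouped) parts
        self.fusion = fusion                # parameter-free JPPF module

    def forward(self, image):
        feats = self.encoder(image)             # one feature representation for all three tasks
        sem_logits = self.semantic_head(feats)  # C_st+th x H x W
        instances = self.instance_head(feats)   # set of instance predictions
        part_logits = self.part_head(feats)     # (C_p + 1) x H x W
        return self.fusion(sem_logits, instances, part_logits)
```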

Part Segmentation Head

According to previous work [10], the grouping of parts yields better results. We have verified this finding for our architecture in [17] and consequently follow the same principle and group semantically identical parts, e.g. the windows of cars and buses are grouped into a single window class. The grouping of elements allows the network to learn without ambiguity and provides more data per class for training. Additionally, we represent all non-partitionable semantic classes as a single background class within our part head. This avoids redundant predictions across the different heads and further balances the learning of parts versus other classes. Both groupings of classes (the semantic grouping of parts as well as the grouping of the background) can later be reverted into class-specific parts using the additional information of the other prediction heads to obtain a fine-grained panoptic-part segmentation (Fig. 4).
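The following toy mapping illustrates the idea of the grouping and its later reversal; the concrete class names are made up for illustration and do not reproduce the actual label definitions of the datasets.

```python
# Grouped part classes of the part head (illustrative): semantically identical parts of
# different vehicle classes share one class, and everything non-partitionable becomes background.
GROUPED_PART = {
    ("car", "window"): "window",
    ("bus", "window"): "window",
    ("car", "wheel"):  "wheel",
    ("bus", "wheel"):  "wheel",
    ("road", None):    "background",
    ("sky", None):     "background",
}

def ungroup(sem_class: str, grouped_part: str) -> str:
    """Revert the grouping with the help of the semantic prediction,
    e.g. ('bus', 'window') -> 'bus window'."""
    return f"{sem_class} {grouped_part}" if grouped_part != "background" else sem_class
```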

Joint Panoptic-Part Fusion

The mutual combination of the predictions for semantic segmentation, instance segmentation, and part segmentation is the core of our work. Inspired by the panoptic fusion module of EfficientPS [42], we propose a module that jointly fuses the individual results of the three heads by giving each prediction equal priority and thoroughly exploiting coherent predictions. Given the definition of panoptic-part segmentation, we identified four possible cases for fusion: partitionable and non-partitionable stuff, and partitionable and non-partitionable things. In the following, we first describe the required input for our fusion module and then the three combinations that actually occur in the existing datasets (partitionable stuff is not included). However, our approach generalizes to the missing case as well. Figure 5 depicts our JPPF module.

Fig. 4

Illustration of the pre-processing steps in [42] for predictions from the instance head. The remaining instances serve as input for our fusion

Input and pre-processing The inputs to our fusion are the individual dense predictions for semantics, instances, and parts. In our complete architecture, these are obtained from the three prediction heads using the shared backbone, but they could come from any other source that satisfies the preconditions. More precisely, we require three input components:

  1. A map of semantic logits \(S \in \textrm{R}^3\) of shape \(C_{st,th} \times H \times W\) in the interval [0, 1] (e.g. via softmax activation), in which H and W are the spatial dimensions of the input image (potentially resized) and \(C_{st,th} = C_{st} + C_{th}\) is the total number of semantic classes.

  2. A set of instance predictions for the things classes, each consisting of:

     (a) a softmax-activated map of logits M of shape \(H_I \times W_I\) representing the object mask,

     (b) an axis-aligned 2D bounding box,

     (c) a class label c for this object,

     (d) a confidence score in the interval [0, 1].

  3. A map of part logits \(P \in \textrm{R}^3\) of shape \((C_p+1) \times H \times W\) in the interval [0, 1], in which \(C_p\) is the number of (grouped) part classes.

Before the actual fusion, the instance objects are pre-processed, following the steps in [42]. This includes confidence thresholding, confidence-based sorting, spatial resizing and padding of the instance-specific mask logits and box coordinates to the relevant input size, i.e. from \(H_I \times W_I\) to \(H \times W\), and a non-maximum suppression based on the overlap and confidence of the boxes. After filtering, \(N_\textrm{th}\) instances remain. The pre-processing is illustrated in Fig. 4.
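The pre-processing can be sketched roughly as follows. The thresholds, the dictionary-based instance representation, and the direct pasting of mask logits into a full-resolution canvas are simplifying assumptions for illustration; the actual pipeline follows [42] and the sketch assumes boxes lie within the image.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import nms

def preprocess_instances(instances, H, W, conf_thresh=0.5, iou_thresh=0.5):
    """Sketch: keep confident instances, resize/pad their mask logits to H x W, apply NMS.

    `instances` is a list of dicts with keys 'mask' (H_I x W_I logits), 'box' (float tensor
    [x1, y1, x2, y2] already scaled to the fusion resolution), 'cls', and 'score'.
    """
    # 1) confidence thresholding and 2) confidence-based sorting
    kept = [i for i in instances if i["score"] >= conf_thresh]
    kept.sort(key=lambda i: i["score"], reverse=True)
    if not kept:
        return []

    # 3) resize mask logits to the box size and pad them onto a full-resolution canvas
    for inst in kept:
        x1, y1, x2, y2 = [int(v) for v in inst["box"]]
        bh, bw = max(y2 - y1, 1), max(x2 - x1, 1)
        m = F.interpolate(inst["mask"][None, None], size=(bh, bw),
                          mode="bilinear", align_corners=False)[0, 0]
        canvas = torch.zeros(H, W)
        canvas[y1:y1 + bh, x1:x1 + bw] = m
        inst["mask"] = canvas

    # 4) non-maximum suppression based on box overlap and confidence
    boxes = torch.stack([i["box"] for i in kept])
    scores = torch.tensor([i["score"] for i in kept])
    keep_idx = nms(boxes, scores, iou_thresh)
    return [kept[int(i)] for i in keep_idx]   # the remaining N_th instances
```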

Fusion for things For the fusion of things, all three input components are considered, even if the specific class is not further partitionable. In this case, the generic background class of the part head can still support the hypothesis during fusion. The individual instance objects guide our fusion process; however, during the actual fusion, all three predictions are treated equally.

Given a single one of the \(N_\textrm{th} = N_{\textrm{th},np} + N_{\textrm{th},p}\) pre-processed things instances of class c with its mask logits MLI, we first use the resized bounding box to mask the corresponding prediction from the semantic head. Precisely, class c is sliced out of the semantic logits S and all values outside of the bounding box are set to zero to obtain the masked semantic logits MLS.

In case class c is partitionable, the corresponding subset of size \(C_{p,c}\) of the part logits P is selected from the part segmentation head, e.g. for an instance of class \(c=person\), the part logits for head, torso, legs, and arms (\(C_{p,human}=4\)) are selected. These logits are again masked by the corresponding bounding box to produce the third set of masked logits for parts, MLP. If class c cannot be segmented into parts, the background class from the part logits is selected instead and masked likewise. In order to make the fusion operation feasible, we replicate MLS and MLI to match the number of channels in MLP. For example, a person instance contains four parts (head, arms, torso, legs), thus MLP is of shape \(4 \times H \times W\), and MLS and MLI are therefore replicated 4 times to match the shape of MLP. If the instance is not partitionable, MLP consists of the background class only and MLS and MLI are not replicated.
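A possible implementation of this masking and replication step for a single instance is sketched below. The dictionary-based instance representation and the channel lookup `parts_of_class` are assumptions carried over from the pre-processing sketch above, and the class label is assumed to index the corresponding semantic channel.

```python
import torch

def masked_logits_for_instance(inst, sem_logits, part_logits, parts_of_class):
    """Sketch: build MLS, MLI, and MLP for one pre-processed instance of class c.

    sem_logits:  (C_st+th, H, W) semantic logits S
    part_logits: (C_p + 1, H, W) part logits P, channel 0 = generic background
    parts_of_class: dict mapping a partitionable things class to its part channel indices
    """
    c = inst["cls"]
    x1, y1, x2, y2 = [int(v) for v in inst["box"]]
    box_mask = torch.zeros_like(inst["mask"])
    box_mask[y1:y2, x1:x2] = 1.0

    MLI = inst["mask"] * box_mask              # masked instance logits
    MLS = sem_logits[c] * box_mask             # masked semantic logits of class c

    part_ch = parts_of_class.get(c, [0])       # background channel if c is not partitionable
    MLP = part_logits[part_ch] * box_mask      # masked part logits, shape (C_p,c, H, W)

    # Replicate MLS and MLI along the channel dimension to match MLP.
    MLS = MLS.unsqueeze(0).expand_as(MLP)
    MLI = MLI.unsqueeze(0).expand_as(MLP)
    return MLS, MLI, MLP
```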

At this point, we have obtained three sets of masked logits. We now fuse these individual logits to obtain the fused logits for classes with parts (FLP) and for classes without parts (FLNP).

Fig. 5

Illustration of our proposed joint fusion module. For simplicity, we illustrate the process for a single instance object. Semantic, instance, and part predictions are equally balanced and combined

Fusion operation To compute the fused logits for any of the cases, we propose a unified fusion operation. This operation computes the element-wise sum of the sigmoids of the masked logits as well as the sum of the masked logits themselves, and calculates the Hadamard product of both. The procedure is formalized in Eq. 1:

$$\begin{aligned} FL\left( MLL\right) =\left( \sum _{l \in MLL}\sigma (l)\right) \odot \left( \sum _{l \in MLL} l\right) \end{aligned}$$
(1)

In this equation, \(\sigma (\cdot )\) denotes the sigmoid function, \(\odot\) denotes the Hadamard product, and MLL is a set of equally shaped masked logits that are supposed to be fused, e.g. \(MLL = \{MLS, MLI, MLP\}\). This equation describes a generalized version of the fusion proposed by [42] that handles arbitrarily many logits.
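Equation 1 translates directly into a few lines of code. The following sketch assumes the masked logits are provided as a list of equally shaped tensors; it is the same operation for two inputs (stuff) and three inputs (things).

```python
import torch

def fuse(masked_logits):
    """Fusion operation of Eq. 1: (sum of sigmoids) Hadamard-multiplied with (sum of logits)."""
    sig_sum = sum(torch.sigmoid(l) for l in masked_logits)  # consistency-weighting term
    raw_sum = sum(masked_logits)                            # accumulated evidence
    return sig_sum * raw_sum                                # element-wise (Hadamard) product

# Usage (cf. the following paragraphs):
# FLP  = fuse([MLS, MLI, MLP])              # things, partitionable (three inputs)
# FLNP = fuse([MLS, MLI, MLP_background])   # things, non-partitionable
# FLS  = fuse([S_c, P_background])          # stuff (two inputs, no instance information)
```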

Fusion for stuff To generate the fused logits FLS for the stuff classes, each of the \(C_{st}\) channels from the semantic head is selected and fused with the background channel of the part head in the same manner, i.e. according to Eq. 1, but this time with only two sets of logits (no instance information). As mentioned, the same concept would also apply to stuff that is partitionable, i.e. selecting the corresponding parts and replicating the stuff logits, followed by pair-wise fusion.

Overall fusion All three fused logits, FLP, FLNP, and FLS, are concatenated along the channel dimension to obtain the intermediate logits, in which each of the \(N_{pp}\) channels represents a valid panoptic-part label (see “Panoptic-Part Segmentation”). The total number \(N_{pp}\) of label candidates depends on the number of things \(N_\textrm{th}\) predicted by the instance head and the number of parts \(C_{p,c}\) of their classes. Precisely, during fusion there are \(N_{pp} = C_{st} + N_{\textrm{th},np} + \sum _{c \in N_{\textrm{th},p}}{C_{p,c}}\) candidate logits. We produce an intermediate panoptic-part prediction by taking the argmax of these intermediate logits. Finally, we fill an empty canvas with the most probable panoptic-part label for all things, and the remaining areas are filled with the prediction for stuff classes extracted from the semantic segmentation head. During fusion, the fused score increases if the predictions of all three heads are consistent, and likewise it decreases if the predictions do not match each other.
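The assembly of the final prediction can be sketched as follows. This is a simplified view that only stacks the candidate channels and takes the per-pixel argmax; the bookkeeping that maps each channel back to its (s, id, p) triple and the subsequent canvas filling for stuff are condensed into the returned label list.

```python
import torch

def assemble_panoptic_part(FLS, fused_things, stuff_labels, thing_labels):
    """Sketch of the overall fusion.

    FLS:          (C_st, H, W) fused stuff logits
    fused_things: list of per-instance fused logits FLP / FLNP, each of shape (C_p,c, H, W)
    stuff_labels / thing_labels: one (s, id, p) triple per candidate channel
    """
    logits = torch.cat([FLS] + fused_things, dim=0)  # N_pp x H x W candidate logits
    winner = logits.argmax(dim=0)                    # index of the most probable candidate per pixel
    labels = stuff_labels + thing_labels             # winner[y, x] indexes into this list
    return winner, labels
```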

Post-processing Areas of stuff classes below a minimum threshold \(min_{st}=2048\) pixels are filtered out, as in [42].

Experiments and Results

In this section, we will first introduce the relevant datasets and then provide more details on the implementation and training of our model. Afterwards, we compare our JPPF to previous work and investigate our design choices in ablative experiments.

Datasets For most of our experiments, we use the recently introduced Cityscapes panoptic parts (CPP) and Pascal panoptic parts (PPP) datasets [10, 40]. CPP provides pixel-level annotations for 19 semantic categories, of which 11 are stuff and 8 are things classes. Out of the 8 things, 5 classes include annotations at the part level. This finely annotated dataset contains 2975 images for training and 500 for validation. PPP consists of 20 things and 80 stuff classes. Part-level annotations are provided for 16 of the 20 things. As in previous work [40], we only consider a subset of 59 object classes for training and evaluation, including 20 things, 39 stuff, and 58 part classes. These parts are detailed by [41] and [64]. PPP consists of a total of 10,103 images, which are divided into 4998 images for training and 5105 for validation. Besides CPP and PPP, we perform some experiments on a variety of other datasets to demonstrate how our method generalizes across domains.

Metrics For the evaluation of the individual semantic and part segmentations, we apply the mean Intersection-over-Union (mIoU), and for instance segmentation the mean average precision (mAP). For the complete evaluation of the combined panoptic-part segmentation, we use the Part Panoptic Quality (PartPQ) [10], which is an extension of the Panoptic Quality (PQ) proposed by [20]. Because the authors of [29] identified limitations in the expressiveness and interpretability of the PartPQ metric, they have introduced the Part-Whole Quality (PWQ), which we will also consider in our experiments.
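As a reminder, PQ [20] is computed per class over matched pairs of predicted segments p and ground-truth segments g as

$$\begin{aligned} \text {PQ} = \frac{\sum _{(p,g) \in \textit{TP}} \text {IoU}(p,g)}{|\textit{TP}| + \frac{1}{2}|\textit{FP}| + \frac{1}{2}|\textit{FN}|}, \end{aligned}$$

where TP, FP, and FN denote matched, unmatched predicted, and unmatched ground-truth segments, respectively. Roughly speaking, PartPQ keeps this structure but evaluates the IoU term in a part-aware manner for partitionable classes; we refer to [10] for the exact definition.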

Training and implementation details For the Cityscapes data, we use images at the original resolution, i.e. \(1024 \times 2048\) pixels, and resize the input images of PPP to \(384\times 512\) pixels for training. We perform data augmentation, scaling, and hyperparameter initialization as in EfficientPS [42]. We use a multi-step learning rate (lr) and train our network with stochastic gradient descent (SGD) with a momentum of 0.9. For CPP and PPP, we use an initial lr of 0.07 and 0.01, respectively. We begin the training with a warm-up phase in which the lr is increased linearly from \(\frac{1}{3} \cdot lr\) up to lr within 200 iterations. The weights of all InPlace-ABN layers [1] are frozen, and we train the model for 10 additional epochs with a fixed learning rate of \(10^{-4}\). Finally, we unfreeze the weights of the InPlace-ABN layers and train the model for 50k iterations, beginning with an lr of 0.07 (CPP) and 0.01 (PPP) and reducing the lr after 32k and 44k iterations by a factor of 10. Four GPUs are used for the training, with a batch size of 2 per GPU for CPP and 8 per GPU for PPP. Our feature backbone is the most recent version of EfficientNet, EfficientNet-L2 [56]. This is in contrast to our previous work [17], in which we used the earlier EfficientNet-B5 [52]. We initialize the backbone with weights pre-trained on COCO [33]. The impact of this initialization is quantified in Table 1.
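For clarity, the main learning-rate schedule (linear warm-up followed by multi-step decay, as used for the final 50k-iteration phase) can be written as a small function. The function name is ours, the values are shown for the CPP setting, and in practice this is realized through the optimizer's scheduler.

```python
def learning_rate(step, base_lr=0.07, warmup_steps=200, milestones=(32000, 44000), gamma=0.1):
    """Linear warm-up from base_lr / 3 to base_lr, then step decay at the given milestones."""
    if step < warmup_steps:
        return base_lr / 3 + (base_lr - base_lr / 3) * step / warmup_steps
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma   # reduce by a factor of 10 after 32k and 44k iterations
    return lr
```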

Table 1 Comparison of our updated model on CPP with and without pre-trained weights
Table 2 Comparison of EfficientPS trained on panoptic Cityscapes and on Cityscapes panoptic parts (CPP)

Comparison to State-of-the-Art

Our comparison to previous work and the state-of-the-art considers the initially introduced baseline in [10] and the more recent unified transformer-based architecture of [28, 29]. The baseline by [10] uses the panoptic labels of the Cityscapes dataset [6] to train a panoptic segmentation network. Since this data differs slightly from the actual panoptic-part dataset (CPP), a direct, fair comparison is not possible. This deviation is indicated in Table 2, which shows the results of EfficientPS [42] trained on Cityscapes vs. CPP. To make the baseline comparable in terms of data, we re-implement the baseline and train it on the same data. The re-implementation consists of EfficientPS [42] for panoptic segmentation and our part segmentation network with a separate backbone (cf. “Part Segmentation Head”). Top-down merging is then used to combine the two independent results into a panoptic-part segmentation. The re-implementation and results are in line with our previous work in [17].

Finally, we compare our previous and updated model with JPPF to the reproduced baseline, the official baselines of [10], and multiple variants of both versions of the Panoptic-PartFormer (PPF) [28, 29]. The official baseline consists of EfficientPS [42] and BSANet [64] with top-down merging. The results of this comparison on CPP and PPP are shown in Table 3 for single-scale and multi-scale inference.

Table 3 Comparison of results for panoptic-part segmentation on Cityscapes and Pascal panoptic parts [40]
Fig. 6

Qualitative results of our proposed model on Cityscapes panoptic parts (first two rows) and Pascal panoptic parts (last two rows) compared to our reproduced baseline, the ground truth, and the reference image. \(*\) indicates the reproduced baseline, which is detailed in “Comparison to State-of-the-Art”. The results for our JPPF are obtained with the backbone of the previous version. The graphic is adapted from [17]. More visual examples with our updated backbone for both datasets are provided in the appendix in Figs. 8 and 9

On CPP with single-scale testing, JPPF improves the accuracy significantly compared to the reproduced baseline. We surpass the reproduced baseline by 3.7 percentage points (pp) in overall PartPQ and by 5.3 pp in \(\text {PartPQ}_\text {P}\) with our updated backbone. Similarly, for multi-scale testing, our updated model outperforms the baseline by 3.1 pp and 6.1 pp in PartPQ and \(\text {PartPQ}_\text {P}\), respectively. Our JPPF even outperforms the strong transformer-based competitor PPF and the non-peer-reviewed extension PPF++ in terms of PartPQ and PWQ by a small margin. Especially for areas that can be segmented into parts, we achieve more accurate results, indicating the increased consistency after our fusion and leading to a higher PWQ metric. Interestingly, for these two metrics (\(\text {PartPQ}_\text {P}\) and PWQ), even our single-scale results are better than the multi-scale results of any competitor.

For PPP, our model outperforms the top-down combination of DeepLabV3+ [4] and Mask RCNN [16] (Baseline-1), even though this baseline was trained with the original Pascal parts and Pascal panoptic segmentation datasets, which provide more annotations. Baseline-2 (top-down merging of DeepLabV3-ResNeSt269 [2, 61], DetectoRS [47], and BSANet [64]) yields even better results because of its more advanced backbones and hence higher representational capacity. Similarly, the more sophisticated transformer of PPF (and PPF++) together with powerful backbone models achieves the best results for PartPQ and PWQ. However, in partitionable areas (\(\text {PartPQ}_\text {P}\)), we significantly outperform the baselines on PPP. We believe that this advantage can be attributed to the balanced integration of parts in our fusion module. In comparison to top-down merging, our design is also slightly favorable in terms of density, as presented in Table 5.

From Fig. 6, we can see that our proposed fusion is able to reliably segment the parts of very small and distant objects. Our fusion also solves some typical problems of top-down merging, namely the splitting of things by stuff and inconsistent parts within things. As illustrated in Fig. 6, our fusion eliminates unknown regions within objects by giving equal priority to all three individual predictions. In Figs. 8 and 9 we provide more examples and a visual comparison to PPF [28]. There, we also present failure cases of our model to provide insights into its limitations. In some cases, especially on PPP, PPF produces finer details compared to our approach. In cluttered areas where small objects occlude each other, our JPPF seems to perform favorably.

Table 4 Comparison of three independent encoders to our design with a shared feature encoder with different backbones on Cityscapes panoptic parts
Table 5 Comparison between the uni-directional top-down merge [10] and our proposed joint fusion module using various input sources on Cityscapes and Pascal panoptic parts [40]

A Single Shared Encoder

As part of our contribution, we aim to unify semantic, instance, and part segmentation and jointly learn all three tasks in a single, unified model. We validate that these three tasks benefit from a shared feature representation by comparing the individual predictions before fusion to three equivalent networks that have been trained separately, each with its own encoder. As shown in Table 4, both models with a single, shared encoder surpass the individual models in all three tasks. Unsurprisingly, the more recent version of EfficientNet [52] already produces better initial results for our fusion. This experiment clearly indicates that a shared encoder enables the learning of a common feature representation, resulting in more accurate individual outcomes of each head, which are also more consistent by design due to the shared representation.

Joint Fusion

Next, we compare our joint fusion module to the previously presented top-down merging strategy [10] in Table 5. It is important to note that even the recently published state-of-the-art method PPF [28, 29] uses this merging strategy. The proposed fusion module surpasses the top-down merge in terms of PartPQ, \(\text {PartPQ}_\text {P}\), \(\text {PartPQ}_\text {NP}\), and PWQ on all datasets and settings. Even though our proposed fusion is admittedly only slightly better in some cases, it also produces denser results than the uni-directional merge, indicating the improved consistency. The advantages of our fusion are mainly reflected in areas that are partitionable. Since things with part labels are limited in CPP, the impact is best observed on the PPP dataset. On this data, our proposed fusion module is significantly better. Specifically for our design, \(\text {PartPQ}_\text {P}\) is improved by 10.5 pp and 14.9 pp for the different backbones, which corresponds to relative improvements of about 28% and 44%.

Table 6 Detailed run-time analysis of our JPPF and the baseline on full resolution images of Cityscapes panoptic parts using a Nvidia A100 GPU
Table 7 Comparison of model complexity for an input of size \(1200 \times 800\) pixels

Density, Run Time, and Model Size

Our JPPF produces results that are at least as dense as those of the top-down merging (see Table 5). We further assessed the inference time of our proposed model with JPPF, and the results are displayed in Table 6. It is evident that the top-down merging requires more than twice the time compared to our proposed fusion. To obtain a panoptic-part segmentation as proposed by [10], one must first perform a panoptic fusion and then combine it with the part segmentation, which adds extra overhead. Table 7 shows that our approach, in terms of model size and number of floating point operations (FLOPs), is comparable to previous work for the smaller backbones (ResNet50 [15] and EfficientNet-B5 [52]). Our updated backbone (EfficientNet-L2 [56]) is significantly larger and more complex than the initial version, but adds only little overhead in terms of run time (see Table 6).

Fig. 7

Visual results of our JPPF on the Indian Driving Dataset (IDD) [55] without fine-tuning

Generalization

Since our fusion module is free of learned parameters (i.e. there are no trainable layers involved), it is independent of its input sources and expected to generalize well to unseen domains. However, the entire model (including the backbone and the individual prediction heads) is restricted by the typical rigidity of deep neural networks and their sensitivity to shifts in the data distribution. Yet, our complete model generalizes well to other datasets, as demonstrated in Fig. 7, even though it has only been trained on CPP for this experiment. We show the generalization for a more extensive set of other datasets without fine-tuning in Figs. 11, 12 and 13, including a typical failure case. A qualitative comparison in terms of generalization between PPF++ [29] and our method is provided in Fig. 14. For all these results, we have resized the input images to the size of the original CPP dataset, i.e. \(1024 \times 2048\) pixels.

Limitations

During the thorough evaluation of our approach, we have identified some remaining limitations, which we discuss here. Though our fusion operation treats the logits of the three prediction heads equally, the overall process is mainly guided by the prediction of the instance branch. That is, the detected things with their classes and bounding boxes mostly control the information flow during fusion. With respect to the balance and importance of the individual predictions, this is a limitation. As a side effect, the fusion of things is limited to the area within each bounding box. Thus, for very large objects that are not fully covered by their bounding box, the fusion cannot compensate for the initially underestimated area of these objects. Furthermore, we have identified a theoretical limitation in the fusion operation of Eq. 1 itself. Our generalized version is indeed able to handle an arbitrarily sized set of input logits, however there is no explicit mechanism to balance (normalize) the fused output for different numbers of inputs. In practice, highly confident inputs produce similarly confident outputs when they are consistent, independent of the number of inputs (e.g. 2 or 3). For less confident areas, the imbalance between the fused stuff (two input logits) and the fused things (three input logits) might be an issue. Finally, we notice that the post-processing step (filtering out small stuff areas) is the remaining factor that hinders fully dense predictions, i.e. a valid (though not necessarily correct) panoptic-part label for every pixel of the input image.

Conclusion

JPPF is a versatile fusion operation that effectively combines semantic, instance, and part segmentation into a consistent panoptic-part segmentation. It consistently outperforms uni-directional top-down merging for various input sources, e.g. our previous and updated models. Our design with the updated backbone and joint fusion module surpasses the baseline on all datasets, achieves state-of-the-art results on Cityscapes panoptic parts, and ranks between the first and second versions of the Panoptic-PartFormer on Pascal panoptic parts. The advantages of our proposed approach are most visible for partitionable areas. The increased consistency of our model's predictions is also reflected in their increased density. We leave it to future work to find suitable solutions for the limitations discussed above.