1 Introduction

Image segmentation has progressed significantly in the deep learning era, especially class-specific semantic segmentation (SS) [1,2,3,4,5,6,7]. The goal of the SS task is to predict the class label of each pixel in an image from a set of predefined object classes. Adjacent pixels naturally belong together to form a segment when they share the same object category. Despite the considerable improvement, most SS settings follow a closed-set assumption that training and test data come from the same set of known object classes [8,9,10,11]. However, this assumption rarely holds in practice, and it limits the generalization of segmentation models to unknown classes that models do not see during training.

Fig. 1 Different tasks of image segmentation. For a given input image a that contains both known (“person”, “dog” and “vegetation”) and unknown objects (“sheep”, “rail” and “grass”), we show: b open-set semantic segmentation (OSS) by pixel identification, c generic segmentation (GS) by pixel clustering, and d generalized open-set semantic segmentation (GOSS)

Open-set semantic segmentation (OSS) [12,13,14,15,16,17] has recently been proposed to relax the above assumption, which aims to segment an image containing both known and unknown object classes. Unlike SS, OSS identifies the unknown region where pixels belong to unknown classes.

Although OSS aims to identify pixels that do not belong to any of the known classes, it does not provide any further processing or analysis of those identified unknown pixels. We argue that such a setting may limit the broad usage of vision-based intelligent agents when they encounter unfamiliar scenes where several unknown object classes are adjacent to each other rather than separated. Consider a scenario where an intelligent agent enters a new scene, as shown in Fig. 1a. OSS leaves the whole unknown region as one large segment without further processing (see the “black region” in Fig. 1b). The insufficient information provided by the OSS setting might hamper the decision-making of intelligent agents. Hence, we raise two fundamental questions: how can OSS be improved to generate richer representations of images in real-world scenes? And what is a more expressive and versatile image segmentation task beyond OSS? These questions are crucial in real-world applications such as autonomous driving and robotics.

Towards the goal of better handling unknown regions in an image, this paper studies a new image segmentation task called Generalized Open-set Semantic Segmentation (GOSS). It is inspired by the human perception system, which can jointly recognize previously known objects and easily group unknown areas into different segments even without knowing the categories of those unknown objects. GOSS aims to classify pixels belonging to known classes and to group the unknown pixels (see Fig. 1d for a GOSS prediction). As Fig. 1d shows, “unknown rail” (or “unknown 1”) and “unknown sheep” (or “unknown 2”) are segmented out from “unknown grass” (or “unknown 4”). GOSS takes advantage of generic segmentation (GS), which groups pixels into segments sharing similarities [18,19,20,21,22,23]. Compared to OSS, GOSS can detect more “objects” inside unknown regions. We specify two real-world applications where GOSS can help. First, again considering the example in Fig. 1a, self-driving cars may be assisted in avoiding potential obstacles if the “unknown sheep” or “unknown rail” inside the “unknown grass” can be found, as in the GOSS prediction (see Fig. 1d). Second, newly detected “objects” inside unknown regions of images could help accelerate the data annotation process, especially when images from unfamiliar scenes are being labelled.

Table 1 Comparisons of different image segmentation tasks. Compared to traditional segmentation tasks, GOSS takes better care of unknown objects

To enable intelligent agents to perform GOSS, we first build benchmarks using existing segmentation datasets, i.e. COCO-Stuff [11] and Cityscapes [9]. We split the full set of object categories into two sets: known classes and unknown classes. We keep the semantic annotations of known categories; for unknown categories, we use connectivity labelling [23] to convert their original semantic annotations into clustering ground truths. Along with the datasets, a valid metric is also required to assess the quality of both OSS and GS. Although many metrics exist for segmentation tasks, each is limited to measuring a single setting. In this work, we introduce a metric, termed GOSS Quality (\(\textrm{GQ}\)), which evaluates the segmentation quality of both known and unknown objects. With the datasets and evaluation metric at hand, we further establish a trainable framework, the GOSS SegmenTor (GST). The proposed GST adopts a dual-branch architecture with a shared backbone network: one branch conducts pixel classification, and the other performs pixel clustering. Moreover, to learn more discriminative embeddings and thus better process unknown objects, we incorporate a pixel-wise contrastive learning loss into training.

In summary, our contributions are as follows: (1) we present a new image segmentation task called GOSS, which jointly classifies known pixels and groups the unknown pixels identified by OSS; (2) we propose the GQ metric, which measures the quality of both pixel classification and pixel clustering under open-set settings; (3) following the GOSS setting, we build benchmarks by customizing existing datasets; and (4) we present a simple yet effective baseline and its extended version, GST, to facilitate future research.

2 Related work

Image segmentation is one of the most widely explored tasks in computer vision. Throughout image segmentation research, novel segmentation tasks have been crucial in driving research directions and innovations. We provide comparisons between our new setting and relevant older tasks in Table 1.

Open-set Semantic Segmentation (OSS). OSS, capable of identifying unknown objects, has developed rapidly in recent years. Performing OSS is essential for intelligent agents, as they work in open-set settings where many objects have never been seen. A natural solution, studied in [12, 24, 25], determines the unknown regions by directly computing an anomaly score from the logit or confidence vectors produced by the model’s classifier. Alternatively, synthesis approaches [13, 26,27,28,29] detect unexpected anomalies from reconstructed images. In addition, the work in [14] employs metric learning to learn more discriminative features and incrementally labels novel classes using a human-in-the-loop approach. Beyond the existing OSS setting, the proposed GOSS performs holistic segmentation by classifying the known objects and clustering the unknown objects, providing more expressive information about the environment than OSS. With this richer information, GOSS could benefit practical usage in real-world scenarios.

Generic Segmentation (GS). The task of GS is to find groups of pixels that “go together” [30]. In the early days of computer vision, the term “image segmentation” referred to this bottom-up, general (non-semantic) segmentation. More recently, it has often been called “generic segmentation” [21,22,23] to distinguish it from other segmentation tasks. The pipeline of early segmentation methods consists of first extracting local pixel features such as brightness, colour, or texture and then clustering these features based on, e.g. mean shift [19], normalized cuts [18], random walks [31], graph-based representations [20], or the oriented watershed transform [21]. Learning-based image segmentation methods have since become popular. DEL [32] learns a feature embedding corresponding to a similarity measure between two adjacent superpixels. Isaacs et al. [22] propose pixel-wise representations that reflect how segments are related. Super-BPD [23] learns super boundary-to-pixel directions to provide a direction similarity between adjacent pixels. To compare the performance of different image segmentation algorithms, public datasets such as BSDB [21, 33] provide human-labelled, class-agnostic ground truth; however, they do not provide any semantic information.

Image Segmentation as a subtask. Image segmentation is often taken as a subtask jointly solved with other vision problems in a single framework [34,35,36,37,38,39,40,41,42]. Panoptic segmentation [43,44,45] has recently become a standard image segmentation task by unifying semantic and instance segmentation.

3 Format and metric

3.1 Task format

Here, the task format for GOSS is formulated at the pixel level. For the ith pixel of an image, the GOSS output is defined as a pair \({\textbf {goss}}_i = (s_i, g_i)\), where the classification label \(s_i\) indicates the pixel’s semantic class and the clustering (or grouping) label \(g_i\) represents the cluster id. Suppose there are N known semantic classes \({\textbf {L}}^{kn} \in \mathbb {R}^N\) and an unknown class indicator \(L^{uk} \in \mathbb {R}\); then the semantic label set is \({\textbf {L}} = \{{\textbf {L}}^{kn}, L^{uk}\}\), encoded as \({\textbf {L}}:= \{ 0,..., N-1, N\}\). In our formulation, each pixel is predicted as either one of the known classes or the unknown class. In the first case, the pixel has a semantic label while its cluster id is unnecessary: once the ith pixel is labelled with \(s_i \in {\textbf {L}}^{kn}\), its corresponding cluster id \(g_i\) is invalid (denoted by void). When the pixel is predicted as the unknown class, it can be clustered and assigned \(g_i\). Hence, the ith pixel is assigned \({\textbf {goss}}_i=(s_i, void)\) for known classes or \({\textbf {goss}}_i=(N, g_i)\) for the unknown class. In practice, a classification model predicts \(s_i\), and \(g_i\) is determined after the unknown pixels are clustered.
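
To make this encoding concrete, here is a minimal Python sketch of the output pair; all names (make_goss, VOID, the value of N) are illustrative assumptions rather than anything defined in the paper:

```python
# Minimal sketch of the GOSS output encoding; names and values are illustrative.
N = 19        # known classes are encoded 0..N-1; label N marks "unknown"
VOID = None   # placeholder cluster id for pixels of known classes

def make_goss(s_i, g_i=VOID):
    """Build a GOSS pair (s_i, g_i) and check its consistency."""
    if 0 <= s_i < N:            # known class: the cluster id must be void
        assert g_i is VOID
        return (s_i, VOID)
    elif s_i == N:              # unknown class: a cluster id is required
        assert g_i is not VOID
        return (N, g_i)
    raise ValueError("semantic label out of range")

print(make_goss(3))         # -> (3, None): known class, no cluster id
print(make_goss(N, g_i=2))  # -> (19, 2): unknown pixel assigned to cluster 2
```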

3.2 Evaluation metrics

Appropriate evaluation metrics are fundamental in driving the popularization of a new image segmentation task [43, 46, 47]. In this subsection, we briefly review some popular existing metrics for relevant segmentation tasks and then introduce a metric tailored for the proposed GOSS.

Fig. 2 Toy model of ground truth and predicted GOSS of an image. The predicted segments for “unknown” are partitioned into true positives \(\textrm{TP}^{uk}\), false positives \(\textrm{FP}^{uk}\), and false negatives \(\textrm{FN}^{uk}\)

Previous Metrics. Standard metrics for OSS include the false positive rate at 95% true positive rate (FPR at 95% TPR), the area under the receiver operating characteristic curve (AUROC) [48, 49], and the area under the precision-recall curve (AUPR) [50, 51]. Such metrics assess performance based on the overlap of anomaly score distributions between the known and unknown classes. However, they are not suited for evaluating GOSS, since they never explicitly classify the input as known or unknown, whereas GOSS requires each input pixel to be explicitly classified as belonging to a known or unknown class. Well-known metrics for GS include the variation of information [52], the probabilistic rand index [53], the F-measure [54], and segmentation covering [21]. These metrics were originally proposed to evaluate the quality of data clustering or edge detection; as they do not consider multi-class semantic labels, they cannot be directly used to measure the performance of joint GS and OSS.

GOSS Quality. We borrow the idea of segment matching from the panoptic quality (\(\textrm{PQ}\)) [43] in panoptic segmentation (PS) and adapt the panoptic quality into a GOSS quality suitable for evaluating GOSS. As shown in Fig. 1d, the GOSS output can be viewed as a set of predicted segments, similar to the panoptic output of PS. The primary distinction between GOSS and panoptic predictions lies in the capability of GOSS to predict unknown segments.

Fig. 3 The framework of the baseline and GOSS Segmentor. a Baseline. The input image is fed into the encoder for feature extraction. The dual-branch heads are jointly trained for pixel classification and clustering. Furthermore, pixel-wise contrastive learning is leveraged to learn discriminative feature embeddings. The pixel identification module is designed to recognize sets of pixels of the unknown class from the semantic prediction. The final GOSS output is generated by fusing the identified semantic and grouping predictions. b GOSS Segmentor (GST). Confidence adjustment and pixel contrastive learning modules are included

We treat the unknown pixels as a new class in addition to the N known classes, so there are \(N+1\) segmentation classes in total. Using segment matching, a predicted segment from GOSS is matched with its corresponding ground-truth segment when their Intersection over Union (\(\textrm{IoU}\)) is higher than 0.5. As illustrated in Fig. 2, this approach enables the identification of true positives (TP), false positives (FP), and false negatives (FN) among the predicted segments generated by GOSS. We let \(\textrm{GQ}^{kn}\) be the average GOSS quality over the N known classes. Accordingly, \(\textrm{GQ}^{uk}\) is the GOSS quality of the unknown class:

$$\begin{aligned} \text {GQ}^{kn}&= \frac{1}{N} \sum _{j \in {\textbf {L}}^{kn}} \frac{\sum _{(u, \hat{u}) \,\in \, \text {TP}_{j}^{kn}} \text {IoU}(u, \hat{u})}{\text {TP}_{j}^{kn} + \frac{1}{2}\text {FP}_{j}^{kn} + \frac{1}{2}\text {FN}_{j}^{kn}} \end{aligned}$$
(1)
$$\begin{aligned} \text {GQ}^{uk}&= \frac{\sum _{(u, \hat{u}) \in \text {TP}^{uk}} \text {IoU}(u, \hat{u})}{\text {TP}^{uk} + \frac{1}{2}\text {FP}^{uk} + \frac{1}{2}\text {FN}^{uk}} \end{aligned}$$
(2)

where \(\textrm{IoU}(u, \hat{u})\) calculates the Intersection over Union of the predicted segment u and the ground-truth segment \(\hat{u}\). Furthermore, \(\textrm{TP}_{j}^{kn}\), \(\textrm{FP}_{j}^{kn}\), and \(\textrm{FN}_{j}^{kn}\) denote true positives, false positives, and false negatives for the jth known class, respectively. Similarly, \(\textrm{GQ}^{uk}\) is computed specifically for the unknown class with its true positives \(\textrm{TP}^{uk}\), false positives \(\textrm{FP}^{uk}\), and false negatives \(\textrm{FN}^{uk}\).

The metrics \(\textrm{GQ}^{kn}\) and \(\textrm{GQ}^{uk}\) are computed based on the GOSS output (see Fig. 1d). The known and unknown segments on the GOSS prediction are evaluated separately via \(\textrm{GQ}^{kn}\) and \(\textrm{GQ}^{uk}\). However, a unified metric is required to simplify the evaluation. Thus, we define a metric GOSS Quality (\(\textrm{GQ}\)) as:

$$\begin{aligned} \textrm{GQ}= \lambda \cdot \textrm{GQ}^{kn} + (1-\lambda ) \cdot \textrm{GQ}^{uk} \end{aligned}$$
(3)

where we set \(\lambda \) to the most natural value, 0.5, throughout the paper. If we simply averaged \(\textrm{GQ}\) over all \(N+1\) classes, the ratio between known and unknown would be significantly biased (N : 1); in Eq. (3), \(\textrm{GQ}\) instead treats the known and unknown segments equally. We also introduce \(\textrm{GQ}^{clu}\), which only assesses the pixel clustering quality regardless of pixel class (see Fig. 1c). Refer to the supplementary material for more details of \(\textrm{GQ}^{clu}\).
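
To illustrate how the metric comes together, the sketch below computes the per-class quality of Eqs. (1)–(2) from segments given as boolean masks and combines them with Eq. (3); it is a minimal reading of the definitions, and the helper names are ours:

```python
import numpy as np

def iou(a, b):
    """IoU of two boolean segment masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def gq_single_class(pred_segs, gt_segs):
    """Per-class quality of Eqs. (1)/(2): match segments at IoU > 0.5."""
    tp_ious, matched = [], set()
    for p in pred_segs:
        for j, g in enumerate(gt_segs):
            v = iou(p, g)
            # at IoU > 0.5 each prediction can match at most one ground truth
            if v > 0.5 and j not in matched:
                tp_ious.append(v)
                matched.add(j)
                break
    tp = len(tp_ious)
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom > 0 else 0.0

def gq(gq_kn, gq_uk, lam=0.5):
    """Eq. (3): GQ = lam * GQ_kn + (1 - lam) * GQ_uk."""
    return lam * gq_kn + (1 - lam) * gq_uk
```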

3.3 Challenges

Aggregating the pixels of an image into clusters, as GOSS requires, is more challenging than the conventional OSS task: the additional need to cluster objects belonging to unknown classes substantially heightens the difficulty.

4 Methodology

In order to perform GOSS effectively, we propose a baseline framework (see Fig. 3a). The baseline comprises five components: a shared encoder, a pixel classification branch, a pixel clustering branch, an identification module, and a fusion module. We then extend the baseline into a more advanced model, the GOSS Segmentor (GST), shown in Fig. 3b. More details of our design are described next.

4.1 Baseline

GOSS can be modelled as a unified segmentation task incorporating pixel classification and clustering in an open-set scenario. Given an image \(I_o \in \mathbb {R}^{3 \times \hbar _o \times \omega _o}\), we expect the proposed baseline to generate semantic and grouping predictions simultaneously. Hence, we adopt a dual-branch architecture, with one branch for pixel classification and another for pixel clustering. As shown in Fig. 3a, the two branches share the same encoder as a feature extractor. The pixel classification branch computes a prediction map \({\textbf {S}} \in \mathbb {R}^{\hbar _o \times \omega _o}\), while the pixel clustering branch outputs a mask map \({\textbf {G}} \in \mathbb {R}^{\hbar _o \times \omega _o}\) containing the grouped class-agnostic segments. The unknown regions in \({\textbf {S}}\) are identified, yielding \({\textbf {S}}^{ide}\), which is then fused with \({\textbf {G}}\) to obtain the final GOSS output \({\textbf {G}}_{oss} \in \mathbb {R}^{2 \times \hbar _o \times \omega _o}\).

The baseline model is jointly trained with two losses: the classification loss \(\ell _{cla}\) and the clustering loss \(\ell _{clu}\). The total loss is \(\ell _{ws} = \alpha _{cla}\ell _{cla} + \alpha _{clu}\ell _{clu}\) where \(\alpha _{cla}\) and \(\alpha _{clu}\) are positive adjustment weights.
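
The following PyTorch sketch shows the overall shape of this design; it is a schematic under our own naming, with placeholder heads standing in for the actual DeepLabV3+ and Super-BPD branches described below:

```python
import torch.nn as nn

class DualBranchBaseline(nn.Module):
    """Schematic dual-branch model: shared encoder, two task heads."""
    def __init__(self, encoder, cla_head, clu_head):
        super().__init__()
        self.encoder = encoder    # shared feature extractor (e.g. ResNet-50)
        self.cla_head = cla_head  # pixel classification head (DeepLabV3+-style)
        self.clu_head = clu_head  # pixel clustering head (Super-BPD-style)

    def forward(self, image):
        feats = self.encoder(image)  # features shared by both branches
        return self.cla_head(feats), self.clu_head(feats)

def total_loss(l_cla, l_clu, a_cla=1.0, a_clu=1e-4):
    """Joint loss l_ws = a_cla * l_cla + a_clu * l_clu (weights from Sect. 6.1)."""
    return a_cla * l_cla + a_clu * l_clu
```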

4.1.1 Pixel classification

We train the pixel classification branch to classify each pixel as one of N classes, where N is the number of predefined known classes. DeepLabV3+ [4], an existing powerful baseline for semantic segmentation, is chosen as the basic architecture for this branch. The branch is trained with \(\ell _{cla}\), the cross-entropy loss between the predicted semantic map \({\textbf {S}}\) and its ground-truth map.

DeepLabV3+ leverages an encoder–decoder architecture that takes a bottom-up pathway network with features at multiple spatial resolutions and appends a top-down pathway with lateral connections. The top-down pathway progressively upsamples features starting from the deepest layer of the network while concatenating or adding them with higher-resolution features from the bottom-up pathway. The Atrous Spatial Pyramid Pooling (ASPP) layer [3] is employed in the DeepLabV3+ model to enlarge the receptive field.

4.1.2 Pixel identification

Pixel identification, inherited from OSS, recognizes sets of pixels of unknown classes from the semantic prediction. Hereafter, we study several recipes for pixel identification, with which the identified semantic prediction \({\textbf {S}}^{ide} \in \mathbb {R}^{\hbar _o \times \omega _o}\) is computed from \({\textbf {S}}\). Common OSS metrics like AUROC and AUPR assess the distribution overlap between known and unknown classes; pixel identification, in contrast, must state explicitly whether each input pixel is known or not. In other words, a binary classification is necessary.

N-model In the pixel classification branch, it is natural to design the semantic segmentation model with an N-dimensional confidence output \({\textbf {C}} \in \mathbb {R}^{N \times \hbar _o \times \omega _o}\). The N-model is restricted to recognizing the set of predetermined known classes; when an unknown region appears in a test image, it is erroneously classified as one of the known classes. To identify unknown pixels from the N-model’s outputs, we employ the comparable OSS methods Maximum Softmax Probability (MSP) [55] and Maximum Unnormalized Logit (MaxLogit) [12]. Thresholds are used to classify pixels as belonging to a known or unknown class. More details can be found in the supplementary material.
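
A minimal sketch of this thresholding step is given below; the tensor shapes follow the notation above, while the function name and the exact threshold semantics for MaxLogit are our illustrative assumptions:

```python
import torch.nn.functional as F

def identify_unknown(logits, method="msp", thr=0.5):
    """Flag unknown pixels from N-model logits of shape (N, H, W).

    MSP: unknown where the maximum softmax probability falls below thr.
    MaxLogit: unknown where the maximum unnormalized logit falls below thr.
    """
    if method == "msp":
        score, _ = F.softmax(logits, dim=0).max(dim=0)
    elif method == "maxlogit":
        score, _ = logits.max(dim=0)
    else:
        raise ValueError(f"unsupported method: {method}")
    return score < thr  # boolean (H, W) mask of identified unknown pixels
```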

N+1-model As opposed to the N-model, the N+1-model [56, 57] includes the unknown class in the output \({\textbf {C}} \in \mathbb {R}^{(N +1) \times \hbar _o \times \omega _o}\), so it can directly identify the unknown pixels. During training, the N+1-model explicitly treats the “unlabelled” pixels (i.e. the “void” pixels) as unknown pixels. The N+1-model is therefore not applicable if no “void” pixels are provided.

4.1.3 Pixel clustering

The pixel clustering branch is built in parallel with the pixel classification branch; its goal is to partition the whole image into clusters. During training, to generate the corresponding annotations, we convert the semantic labelling of the known classes into connectivity labelling by ignoring the previous semantics of each segment. The top-performing method, super boundary-to-pixel direction (Super-BPD) [23], is selected to establish this branch, which is likewise trained in a supervised manner. Super-BPD uses ground-truth annotations generated by the distance transform algorithm. With the Super-BPD model, a boundary-to-pixel direction (BPD) representation is learned for each pixel (\(\ell _{clu} = \ell _{bpd}\)). Super-BPDs are then extracted from the initial BPDs using component-tree computation, followed by graph partitioning that merges super-BPDs into new segments.

4.1.4 Fusion module

The identification module outputs the identified semantic prediction \({\textbf {S}}^{ide}\). Based on \({\textbf {S}}^{ide}\), the grouping output \({\textbf {G}}\) becomes \({\textbf {G}}^{uk} \in \mathbb {R}^{\hbar _o \times \omega _o}\), where the element \(g_i \rightarrow void\) if the corresponding semantic prediction \(s_i \in [0,...,N-1]\). Afterwards, \({\textbf {S}}^{ide}\) is merged with \({\textbf {G}}^{uk}\) to form the GOSS output \({\textbf {G}}_{oss} = [{\textbf {goss}}_1, {\textbf {goss}}_2,..., {\textbf {goss}}_{\hbar _o \omega _o}]\), where \({\textbf {goss}}_i = (s_i, g_i) \in [0,..., N] \times [1,..., g_{max}]\cup \{void\}\). For the ith pixel in \({\textbf {G}}_{oss}\), \({\textbf {goss}}_i = (s_i, void)\) if \(s_i\in [0, 1,..., N-1]\), or \({\textbf {goss}}_i = (N, g_i)\) if \(s_i=N\). The prediction \({\textbf {G}}_{oss}\) can be viewed as a map composed of a set of segments (see “GOSS prediction” in Fig. 3).
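
The fusion step itself is a simple per-pixel merge; the sketch below is one possible reading, with void represented as -1 so the pair fits in an integer array (that encoding is our choice, not the paper's):

```python
import numpy as np

VOID = -1  # stand-in for "void" so the pair fits in an integer array

def fuse(s_ide, g, n_known):
    """Merge identified semantics s_ide (H, W) with grouping g (H, W).

    Returns G_oss of shape (2, H, W): per pixel (s_i, g_i) for unknown
    pixels (s_i == n_known) and (s_i, void) for known pixels.
    """
    g_uk = np.where(s_ide == n_known, g, VOID)
    return np.stack([s_ide, g_uk], axis=0)
```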

4.2 GOSS Segmentor

As shown in Fig. 3b, the baseline model is extended to a new model that we call GOSS Segmentor (GST). Keeping the original five baseline components, we propose to equip the baseline with a confidence adjustment module and a contrastive learning module.

4.2.1 Confidence adjustment

It is hard for the N+1-model to accommodate unknown object classes, since the model is trained without seeing any examples from these classes. Hence, instead of completely trusting the confidence prediction \({\textbf {C}}\), and specific to the pixel identification of the N+1-model, we propose to modify \({\textbf {C}}\) with a confidence adjustment. In particular, for the ith pixel, its post-softmax confidence score \({\textbf {c}}_i = [{\textbf {c}}_i^{kn}, c_i^{uk}] \in \mathbb {R}^{N+1}\) is re-scaled to \([{\textbf {c}}_i^{kn}, \beta ^{uk} c_i^{uk}]\), where \(\beta ^{uk} \in (1, +\infty )\) is the scale coefficient for the confidence of the unknown class.
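
Implemented directly, the adjustment is a one-line re-scaling; a minimal sketch, assuming the unknown class occupies the last channel, the final label is taken per-pixel by argmax, and \(\beta ^{uk}=5.0\) as in Sect. 6.1:

```python
import torch

def adjust_and_label(c, beta_uk=5.0):
    """Re-scale the unknown-class scores of a softmax map c of shape (N+1, H, W).

    The last channel holds the unknown-class confidence; multiplying it by
    beta_uk > 1 biases the per-pixel argmax towards the unknown class.
    """
    c = c.clone()           # avoid modifying the caller's tensor in place
    c[-1] *= beta_uk
    return c.argmax(dim=0)  # per-pixel label in {0, ..., N}
```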

4.2.2 Pixel contrastive learning

In order to learn more discriminative representations for better GOSS performance, inspired by [58], we adopt a pixel-wise contrastive learning algorithm in which we contrast embeddings with different semantic labels. Let \({\textbf {e}}_i \in \mathbb {R}^{cn}\) be the ith pixel embedding in the feature map \({\textbf {E}} \in \mathbb {R}^{cn \times \hbar _o \times \omega _o}\), where cn, \(\hbar _o\), and \(\omega _o\) are the channel number, height, and width of the feature map. For \({\textbf {e}}_i\), the positive pixel embeddings \({\textbf {e}}_i^{+}\) share the same ground-truth label as \({\textbf {e}}_i\) in the same feature map, while the negative pixel embeddings \({\textbf {e}}_{i}^{-}\) have a ground-truth label different from that of \({\textbf {e}}_i\). The pixel-to-pixel contrastive loss [58, 59] is then defined as:

$$\begin{aligned} \ell _{pc, i} = \frac{1}{|N_{pst, i}|} \sum _{i^+ \in N_{pst, i}} -\log \frac{\exp ({{\textbf {e}}_i \cdot {\textbf {e}}_{i}^{+}}/\tau )}{\exp ({{\textbf {e}}_i \cdot {\textbf {e}}_{i}^{+}}/\tau ) + \sum _{i^{-} \in N_{neg, i}} \exp ({{\textbf {e}}_i \cdot {\textbf {e}}_{i}^{-}}/\tau )} \end{aligned}$$
(4)

where \(N_{pst, i}\) and \(N_{neg, i}\) are the positive and negative embedding sets for the pixel embedding \({\textbf {e}}_i\), and \(\tau \) is the temperature parameter. We employ the semi-hard example sampling strategy from [58] to construct the positive and negative sample sets. Before \(\ell _{pc}\) is calculated, we downscale the ground-truth map to the same size as the feature map \({\textbf {E}}\).
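
For reference, Eq. (4) for a single anchor embedding translates into PyTorch as follows; this sketch takes the sampled positive and negative sets as given and omits the semi-hard sampling of [58]:

```python
import torch

def pixel_contrastive_loss(e_i, e_pos, e_neg, tau=0.1):
    """Eq. (4) for one anchor: e_i (cn,), e_pos (P, cn), e_neg (Q, cn).

    tau is the temperature; its value here is an illustrative assumption.
    """
    pos_logits = (e_pos @ e_i) / tau        # dot products with positives, (P,)
    neg_logits = (e_neg @ e_i) / tau        # dot products with negatives, (Q,)
    neg_term = torch.exp(neg_logits).sum()  # shared denominator term
    frac = torch.exp(pos_logits) / (torch.exp(pos_logits) + neg_term)
    return (-torch.log(frac)).mean()        # average over the positive set
```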

The total loss \(\ell _{ws}\) of GST is obtained by merging the pixel contrastive loss \(\ell _{pc}\) with the classification loss \(\ell _{cla}\) and the clustering loss \(\ell _{clu}\): \(\ell _{ws} = \alpha _{cla}\ell _{cla} + \alpha _{clu}\ell _{clu} + \alpha _{pc}\ell _{pc}\), where \(\alpha _{pc}\) is a positive adjustment weight for \(\ell _{pc}\). Contrastive learning pulls the latent representations of pixels of the same class closer together and pushes those of different classes farther apart. This metric learning technique has been widely used in segmentation tasks, with many existing works reporting better empirical results [60,61,62,63]. In GOSS, we apply contrastive learning to train our model, aiming to generate more representative embeddings for open-set evaluation.

5 Benchmark

Most datasets for OSS, like StreetHazards [12] and Road Anomaly [13], present isolated unknown objects in an image. To ensure that objects of unknown classes naturally appear together (i.e. adjacent) in an image, in this work we split the full set of labelling categories into known and unknown classes at an appropriate ratio. We simulate the training and testing of GOSS using existing semantic segmentation datasets, i.e. COCO-Stuff [11] and Cityscapes [9]. Note that the grouping labels of unknown segments are derived from their original ground-truth semantic labels before the split. Following [23], we convert the original semantic labelling of unknown areas into GS ground truths using connectivity labelling.
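
Connectivity labelling amounts to replacing the semantic ids of unknown regions with per-connected-component cluster ids; a hedged sketch of this conversion, using scipy's connected-component labelling (the function name and id scheme are ours):

```python
import numpy as np
from scipy import ndimage

def connectivity_labels(sem_gt, unknown_ids):
    """Convert semantic labels of unknown classes into cluster ids.

    Each 4-connected region sharing one unknown semantic label becomes
    its own class-agnostic segment (id >= 1); known pixels are set to 0.
    """
    out = np.zeros_like(sem_gt, dtype=np.int32)
    next_id = 1
    for cls in unknown_ids:
        comp, n = ndimage.label(sem_gt == cls)  # connected components of this class
        out[comp > 0] = comp[comp > 0] + next_id - 1
        next_id += n
    return out
```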

5.1 COCO-Stuff-GOSS

COCO-Stuff [11] augments the popular COCO [64] dataset with stuff classes as well as dense pixel annotations. It offers a large-scale semantic multi-class setting containing both “thing” and “stuff” classes. On COCO-Stuff, around 94\(\%\) of the pixels are labelled with one semantic category, and the remaining are “unlabelled” pixels. We customize COCO-Stuff into a new benchmark named COCO-Stuff-GOSS by strictly dividing its existing classes into known and unknown classes. Training and testing images are selected from “train2017” and “val2017”, respectively. Categories defined as unknown are not represented in the training examples. Every selected testing example contains objects from both the known and unknown category sets (or from unknown categories only). The statistics of the benchmark under different splits are shown in Table 2.

VOC Split The “VOC Split” is a common category split [65,66,67,68] that provides 20 “thing” classes defined in PASCAL VOC [47] as “known thing” classes. The remaining 60 “thing” classes are chosen as “unknown thing” classes.

Manual Split We divide the COCO-Stuff categories according to how frequently each specific class appears: we count the number of occurrences of each class and calculate its ratio over the number of all training images. For example, in the “Manual-20/60” split, ensuring that at least one and at most two classes are chosen from each sub-class, we select the 20 most popular “thing” classes and treat the remaining “thing” classes as unknown. All “stuff” classes are set as known classes.

Random Split We also conduct experiments with a “Random Split”, where all classes are randomly re-assigned to known and unknown regardless of their super-class and sub-class. The VOC-20/60 and Manual-20/60 splits do not include “stuff” categories as unknown classes, whereas Random-111/60 ensures that the known (and unknown) classes include specific classes from both the “thing” and “stuff” super-classes. More details can be found in Table 2.

5.2 Cityscapes-GOSS

The Cityscapes [9] dataset consists of 5000 images (2975 train, 500 val, 1525 test) covering urban street scenes in driving scenarios. Dense pixel annotations of 19 classes are provided, that is, 8 “thing” and 11 “stuff” classes. As one goal of the proposed GOSS is to advance self-driving systems, we construct the Cityscapes-GOSS benchmark, dividing the categories under a “manual split”. Unlike the COCO-Stuff-GOSS benchmark, all images are kept regardless of whether they contain unknown categories; pixels of unknown classes are treated as “void” pixels. Table 2 presents more details.

Manual Split We present two versions of the Cityscapes-GOSS benchmark. Following the split in [14], we build the first version, “Manual-16/3”, which takes “car”, “truck”, and “bus” as the “unknown thing” classes. Building on the first version, we additionally designate “building”, “traffic sign”, and “vegetation” as “unknown stuff” to produce a more challenging version, “Manual-13/6”.

Table 2 Details of the different splits of the COCO-Stuff-GOSS/Cityscapes-GOSS benchmarks. The numbers indicate, for each data split, how many known (or unknown) classes are selected and how many training (or testing) images are kept

6 Experiment

Experimental results are presented in this section to demonstrate the rationality and effectiveness of GOSS. Using the baseline and proposed GST, we perform our task on COCO-Stuff-GOSS and Cityscapes-GOSS. The performance is mainly measured via the metric \(\textrm{GQ}\).

6.1 Implementation

For all models, a ResNet-50 [69] pre-trained on ImageNet [70] is utilized as the encoder backbone. All models are trained for 60K/40K iterations with a batch size of 10/2 on COCO-Stuff-GOSS/Cityscapes-GOSS. The “poly” learning rate policy [71] is applied, with the initial learning rate set to \(5\textrm{e}{-5}\). GST models are updated using Adam optimization [72] without weight decay. The weights \(\alpha _{cla}\), \(\alpha _{clu}\), and \(\alpha _{pc}\) are 1.0, \(1\textrm{e}{-4}\), and \(1\textrm{e}{-1}\), respectively. The thresholds in the identification module and the scale \(\beta ^{uk}\) (for +CA) are set to 0.5 (0.75 for Cityscapes-GOSS) and 5.0, respectively. Our models are implemented in PyTorch [73]. We note that the N+1-model cannot be used on the Cityscapes-GOSS dataset: since we keep labels of unknown classes as “void” rather than filtering those images out, the cross-entropy of void pixels must not be added to the loss.
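
As a reminder of how the “poly” policy [71] behaves, the sketch below gives its standard form; the exponent 0.9 is the commonly used default and is our assumption, as the paper does not state it:

```python
def poly_lr(base_lr, it, max_it, power=0.9):
    """'Poly' schedule: lr = base_lr * (1 - it / max_it) ** power."""
    return base_lr * (1.0 - it / max_it) ** power

# e.g. base_lr = 5e-5 over 60K iterations on COCO-Stuff-GOSS
lr_at_30k = poly_lr(5e-5, 30_000, 60_000)  # roughly half-decayed
```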

6.2 Results

The GOSS results on COCO-Stuff-GOSS and Cityscapes-GOSS using various identification methods are reported in Tables 3 and 4, respectively. In addition to the GOSS quality, we also report metrics for the OSS (AUROC and AUPR) and GS (mIoU and \(\textrm{GQ}^{clu}\)) tasks to show that the models perform reasonably on these relevant older tasks.

For COCO-Stuff-GOSS in Table 3, GST is the best-performing model. For example, on the “Manual-20/60” split, GST attains 9.15\(\%\) \(\textrm{GQ}\), outperforming the N-model+MSP by a healthy margin of nearly 1.45\(\%\). Compared to the other baselines, the pixel contrastive learning module helps GST better discriminate between known and unknown pixels in most cases (see “OSS Metric” in Table 3), and it also boosts the clustering accuracy in GS. One baseline, the N-model+MaxLogit with thresholding, sacrifices much of \(\textrm{GQ}^{kn}\) but achieves a high \(\textrm{GQ}^{uk}\): as expected, MaxLogit identifies more unknown areas but does not simultaneously maintain pixel classification accuracy [24]. For Cityscapes-GOSS in Table 4, we find a performance ranking similar to that for COCO-Stuff-GOSS in Table 3. DML [14] outperforms MSP and MaxLogit on all benchmarks except the “VOC-20/60” split of COCO-Stuff-GOSS.

Table 3 GOSS results of GST (N+1-model+CA+CL) on COCO-Stuff-GOSS under three splits. “CA” is the confidence adjustment, and “CL” is the cross-pixel contrastive learning
Table 4 GOSS results of GST (N-model+CL) on Cityscapes-GOSS under “Manual-16/3” and “Manual-13/6” splits
Fig. 4 Visualized segmentation results from GST (N+1-model+CA+CL) on COCO-Stuff-GOSS. The GOSS prediction (f) merges the OSS prediction (c) and the grouping prediction (d). Hence, within the GOSS prediction (f), “objects” inside identified unknown regions can be segmented out. For example, the unknown objects “paper” in the 1st row, “dog/grass” in the 2nd row, and “bottle” in the 3rd row are correctly outlined even though their classes are not known. We also include zoomed-in crops to show the effectiveness of GST

Several examples from the built benchmark are visualized in Fig. 4 to better illustrate the GOSS setting. Taking one example from Fig. 4 (2nd row), GOSS accurately segments the “unknown dogs” out of the “unknown grass” (see Fig. 4f). Compared to the OSS prediction in Fig. 4c, GOSS provides richer information for intelligent agents to make decisions: with the GOSS prediction, robots may avoid the obstacle (“unknown dogs”) when they enter an unfamiliar scene (“unknown grass”). In terms of the GST model, we observe from Fig. 4 that the confidence adjustment module effectively helps the N+1-model detect more unknown regions (compare Fig. 4b and c).

6.3 Analysis

Number of Unknown Classes. Increasing the number of unknown classes makes the GOSS performance drop significantly, as we observe when comparing “Manual-16/3” and “Manual-13/6” on Cityscapes-GOSS (see Table 4).

Training Strategy. For all models in Tables 3 and 4, the pixel classification branch and the pixel clustering branch are trained in a single unified architecture. Here, we study a different training strategy, “Separate”, where the two branches are trained separately and their outputs are then merged. We find that the performance of “Separate” is close to that of our “Single” network (see Table 5). We choose “Single” since it is fast, light, and easy to implement.

Table 5 Ablation study on COCO-Stuff-GOSS under “Random-111/60” split: training strategy. Two strategies are compared

Clustering Method. We also train the pixel clustering branch to group unknown areas in an unsupervised manner by applying the differentiable feature clustering (DFC) loss [74]. Unsupervised DFC achieves basic clustering performance, but it is worse than Super-BPD trained in a supervised setting (see Table 6).

Table 6 Ablation study on COCO-Stuff-GOSS under “Random-111/60” split: clustering method
Table 7 GOSS results of GST (N+1-model+CA+CL) on COCO-Stuff-GOSS under “Manual-20/60” split. “CA” is the confidence adjustment, and “CL” is the cross-pixel contrastive learning

Component Effect. Here, we discuss the effect of each model component (“CA” and “CL”). The results on COCO-Stuff-GOSS under the “Manual-20/60” split are shown in Table 7, indicating that “CA” and “CL” each boost the performance by a certain margin on their own.

Challenging Task. The results in Tables 3 and 4 verify that GOSS is a very challenging task, even though our baseline framework relies on strong backbones and a reasonable architecture. The first main reason is that accurate pixel identification under the open-set setting is non-trivial; for example, the “unknown laptop” is misclassified as the “known tv” in the 1st row of Fig. 4. Furthermore, the clustering branch suffers a performance drop when the model encounters the unfamiliar appearances of objects from unknown categories at test time. There remains significant room for future improvement on the GOSS task.

7 Conclusion

This paper introduced an improved setting referred to as GOSS, building upon the well-defined OSS to generate more comprehensive predictions. The task is to semantically classify pixels as one of the known classes or an unknown class and to cluster the detected unknown pixels. With more information extracted inside unknown regions, GOSS might benefit intelligent agents in their decision-making process. Specific to the new setting, a metric, two benchmarks, and a corresponding baseline model are presented. In future work, the concept of GOSS can be further extended to instance segmentation, image co-segmentation, video segmentation, point cloud segmentation, etc. We hope this work provides a new alternative for more comprehensive pixel-level scene understanding.