GOSS: Towards Generalized Open-set Semantic Segmentation

In this paper, we present and study a new image segmentation task, called Generalized Open-set Semantic Segmentation (GOSS). In the well-known open-set semantic segmentation (OSS) setting, an intelligent agent only detects unknown regions without further processing, which limits its perception of the environment. It stands to reason that a further analysis of the detected unknown pixels would be beneficial. Therefore, we propose GOSS, which unifies the abilities of two well-defined segmentation tasks, OSS and generic segmentation (GS), in a holistic way. Specifically, GOSS classifies pixels belonging to known classes, while pixels of unknown classes are clustered (or grouped) and labelled as such. To evaluate this expanded task, we further propose a metric that balances the pixel classification and clustering aspects. Moreover, we build benchmarks on top of existing datasets and propose a simple neural architecture as a baseline, which jointly predicts pixel classification and clustering under open-set settings. Our experiments on multiple benchmarks demonstrate the effectiveness of our baseline. We believe our new GOSS task can produce a more expressive image understanding for future research. Code will be made available.


Introduction
In the deep learning era, image segmentation, especially class-specific semantic segmentation (SS), has made significant progress [45,7,8,9,11,66]. The goal of the SS task is to predict the class label of each pixel in an image from a set of predefined object classes. Adjacent pixels naturally belong together to form a segment when they share the same object category. Despite the considerable improvement, most SS settings follow a strong closed-set assumption that training and test data come from the same set of known object classes [56,14,67,3]. However, this assumption rarely holds in practice. It limits the generalization of segmentation models to unknown classes which the models do not see during training. (In the remainder of the paper, we use the terms 'cluster' and 'group' interchangeably.)
Recently, open-set semantic segmentation (OSS) [28,43,4,5] has been proposed to relax the above assumption; it aims to segment an image that contains both known and unknown object classes. Different from SS, OSS identifies the regions whose pixels belong to unknown classes. Although OSS identifies pixels that do not belong to one of the known classes, it does not provide any further segmentation amongst those identified unknown pixels. We argue that such a setting may limit the broad usage of vision-based intelligent agents when they encounter unfamiliar scenes in which several unknown object classes are adjacent to each other rather than separated. Consider a scenario in which an intelligent agent enters a new scene, as shown in Figure 1 (a). OSS leaves the whole unknown region as one large segment without further processing (see the 'black region' in Figure 1 (b)). Such insufficient information, provided by the OSS setting, will affect the decision-making of intelligent agents. Hence, we raise fundamental questions: how can OSS be improved to generate richer representations of images in real-world scenes? Moreover, what is a more expressive and versatile image segmentation task beyond OSS? In real-world applications, these questions are crucial. Again considering the example in Figure 1 (a), self-driving cars could be helped to avoid potential obstacles if the 'unknown sheep' or 'unknown rail' inside the 'unknown grass' can be found in the OSS prediction (see Figure 1 (b)). Another practical example is that newly detected 'objects' inside unknown regions of images could help accelerate data annotation, especially when images from unfamiliar scenes are being labeled.
Towards the goal of better handling unknown regions in an image, and inspired by the human perception system, which can jointly recognize previously known objects and easily group unknown areas into different segments even without knowing the categories of those unknown objects, this paper studies a new type of semantic image segmentation task called Generalized Open-set Semantic Segmentation (GOSS). It aims to classify pixels belonging to known classes and group the unknown pixels (see Figure 1 (d) for a GOSS prediction). As we can see from Figure 1 (d), 'unknown rail' (or 'unknown1') and 'unknown sheep' (or 'unknown2') are segmented out from 'unknown grass' (or 'unknown4'). GOSS takes advantage of generic segmentation (GS), which groups pixels into segments sharing similarities [55,12,23,1,33,61]. Compared to OSS, GOSS is able to detect more 'objects' inside unknown regions.
To enable intelligent agents to perform GOSS, we first build benchmarks using existing segmentation datasets, i.e., COCO-Stuff [3] and Cityscapes [14]. We split the full set of object categories into two sets: known classes and unknown classes. We keep the semantic annotations of known categories. For unknown categories, we use connectivity labeling [61] to convert their original semantic annotations into clustering ground truths. Along with the available datasets, GOSS also requires a metric that can encompass the quality of both OSS and GS. Although there are many existing metrics for segmentation tasks, such metrics are limited to measuring a single setting. In this work, we introduce a metric, termed GOSS Quality (GQ), which evaluates the segmentation quality of both known and unknown objects. Having the datasets and evaluation metric at hand, we further establish an end-to-end trainable framework, namely, GOSS Segmentor (GST). The proposed GST adopts a dual-branch architecture with a shared backbone network. To perform the GOSS task, one branch conducts pixel classification, and the other performs pixel clustering. Moreover, to learn more discriminative embeddings and thus better process unknown objects, we incorporate a pixel-wise contrastive learning loss into the training. In summary, our contributions are as follows: 1) We present a new image segmentation task called GOSS, which jointly classifies known pixels and groups the unknown pixels identified by OSS; 2) We propose a metric that measures the quality of both pixel classification and pixel clustering under open-set settings; 3) According to the settings of GOSS, we build benchmarks by customizing existing datasets; 4) We present a simple yet effective model, GST, to facilitate future research.

Related Work
In computer vision, image segmentation is one of the most widely explored problems. Throughout the history of image segmentation research, novel segmentation tasks have played a crucial role in driving research directions and innovations. We provide comparisons between our new setting and relevant older tasks in Table 1.

Open-set Semantic Segmentation (OSS). OSS, which is capable of identifying unknown objects, has developed significantly in recent years. Performing OSS is essential for intelligent agents, as they work in open-set settings where many objects have never been seen before. A natural solution, studied in [28,36,6], determines the unknown regions by directly computing an anomaly score from the logit or confidence vectors provided by the model classifier. Alternatively, synthesis approaches [43,63,19,60,41] have been proposed to detect unexpected anomalies from reconstructed images. In addition, the work [4] employs metric learning to learn more discriminative features and incrementally labels novel classes using a human-in-the-loop approach. Beyond the existing OSS setting, the proposed GOSS performs holistic segmentation by classifying the known objects and clustering the unknown objects, so that it provides more expressive information about the environment than OSS. Having richer information at hand, GOSS can benefit practical usage in real-world scenarios.

Generic Segmentation (GS). The task of GS is to find groups of pixels that 'go together' [59]. In the early days of computer vision, the term 'image segmentation' and bottom-up general (non-semantic) segmentation shared the same meaning. Recently, the latter is often termed 'generic segmentation' [1,33,61] to distinguish it from other segmentation tasks. The pipeline of early segmentation methods consists of first extracting local pixel features such as brightness, color, or texture, and then clustering these features based on, e.g., mean shift [12], normalized cuts [55], random walks [24], graph-based representations [23], or the oriented watershed transform [1]. Learning-based image segmentation methods have now also become popular. DEL [44] learns a feature embedding that corresponds to a similarity measure between two adjacent superpixels. Isaacs et al. [33] propose pixel-wise representations that reflect how segments are related. Super-BPD [61] learns super boundary-to-pixel directions to provide a direction similarity between adjacent pixels. For comparing the performance of different image segmentation algorithms, public datasets such as BSDB [47,1] provide human-labeled, class-agnostic ground truth. However, they do not provide any semantic information.

Image Segmentation as a subtask. Image segmentation is often taken as a subtask jointly solved with other vision problems in a single framework [20,34,40], [2,64,58,54,25,31]. Recently, panoptic segmentation [39,38] has become a common image segmentation task by unifying semantic segmentation and instance segmentation.

Task Format
Here, the task format for GOSS is formulated at the pixel level. For the $i$-th pixel of an image, the GOSS output is defined as a pair $goss_i = (s_i, g_i)$, where the classification label $s_i$ indicates the pixel's semantic class and the clustering (or grouping) label $g_i$ represents its cluster id. Suppose that there are $N$ known semantic classes $L_{kn} \in \mathbb{R}^N$ and an unknown class indicator $L_{uk} \in \mathbb{R}$; then we have the semantic label set $L = \{L_{kn}, L_{uk}\}$, which is encoded by $L := \{0, \ldots, N-1, N\}$. In our formulation, each pixel can be predicted as either one of the known classes or the unknown class. In the first case, the pixel must have a semantic label, while a cluster id is not necessary: once the $i$-th pixel is labelled with $s_i \in L_{kn}$, its corresponding cluster id $g_i$ is invalid (denoted by void). When the pixel is predicted as the unknown class, it can be clustered and assigned $g_i$. Hence, the $i$-th pixel of a known class (or the unknown class) is assigned $goss_i = (s_i, \text{void})$ (or $goss_i = (N, g_i)$). In practice, $s_i$ can be predicted by a classification model, and $g_i$ can be determined after the unknown pixels are clustered.
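To make this label encoding concrete, the following is a minimal Python sketch of the per-pixel GOSS pair, assuming known classes are indexed 0..N-1, the unknown class is N, and None stands for the void cluster id; the names are illustrative rather than the authors' reference implementation.

```python
# A minimal sketch of the GOSS label encoding described above.
from typing import NamedTuple, Optional

N = 20          # number of known classes (hypothetical)
UNKNOWN = N     # label index reserved for the unknown class
VOID = None     # cluster id is invalid (void) for known pixels

class GossLabel(NamedTuple):
    s: int                 # semantic label in {0, ..., N}
    g: Optional[int]       # cluster id, None (void) for known pixels

def make_goss_label(s: int, g: Optional[int] = None) -> GossLabel:
    """Build the per-pixel GOSS pair goss_i = (s_i, g_i)."""
    if s < UNKNOWN:                 # known class: cluster id is void
        return GossLabel(s, VOID)
    return GossLabel(UNKNOWN, g)    # unknown class: keep the cluster id

print(make_goss_label(3))        # GossLabel(s=3, g=None)
print(make_goss_label(N, g=2))   # GossLabel(s=20, g=2)
```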

Evaluation Metrics
Appropriate evaluation metrics play a fundamental role in driving the popularization of a new image segmentation task [39,10,21]. In this subsection, we first briefly review some popular existing metrics for relevant segmentation tasks and then introduce a new metric, tailored for the proposed GOSS.
Previous Metrics. Standard metrics for OSS include the false positive rate at 95% true positive rate (FPR at 95% TPR), the area under the receiver operating characteristic (AUROC) [15,22], and the area under the precision-recall curve (AUPR) [46,53]. Such metrics assess performance based on the overlap of anomaly score distributions between the known and unknown classes. However, they are not suited to evaluating GOSS, since they do not require the input to be explicitly classified as a known or unknown class. GOSS requires each input pixel to be explicitly classified as belonging to a known or unknown class, since GS (or clustering) is only performed on the identified unknown pixels. Well-known metrics for GS include the variation of information [49], the probabilistic Rand index [52], the F-measure [48], and segmentation covering [1]. These metrics were initially proposed to evaluate the quality of either data clustering or edge detection. As no multi-class semantic labels are considered, they cannot be directly used to measure the performance of joint GS and OSS.

GOSS Quality. We borrow the idea of segment matching from the panoptic quality [39] in panoptic segmentation (PS) and adapt the panoptic quality into the GOSS quality to make it suitable for evaluating GOSS. As shown in Figure 1 (d), the GOSS output can be viewed as a set of predicted segments (i.e., each segment or cluster has a unique id $goss_i$), which is similar to the panoptic output. However, unlike GOSS, PS is not able to deal with unknown objects. In comparison with PS, GOSS does not differentiate 'instance-level' segments for either known or unknown objects. For example, in Figure 1 (d), GOSS does not separate the 'two sheep' in 'unknown2' into instances.
We treat the unknown pixels as a new class; thus, there are $N+1$ classes of segments in total. As introduced in Figure 2, with segment matching, each predicted segment from GOSS is matched with at most one ground-truth segment when their IoU is higher than 0.5. Pixels belonging to the same segment have an identical $goss$. We let $\mathrm{GQ}^{kn}$ be the average GOSS quality over the $N$ known classes and, accordingly, $\mathrm{GQ}^{uk}$ the GOSS quality of the unknown class:

$$\mathrm{GQ}^{kn} = \frac{1}{N}\sum_{k=1}^{N}\frac{\sum_{(u,\hat{u})\in \mathrm{TP}^{kn}_k}\mathrm{IoU}(u,\hat{u})}{|\mathrm{TP}^{kn}_k| + \frac{1}{2}|\mathrm{FP}^{kn}_k| + \frac{1}{2}|\mathrm{FN}^{kn}_k|},$$

where $\mathrm{IoU}(u,\hat{u})$ calculates the Intersection over Union between the predicted segment $u$ and the ground-truth segment $\hat{u}$, and $\mathrm{TP}^{kn}_k$, $\mathrm{FP}^{kn}_k$, and $\mathrm{FN}^{kn}_k$ denote the true positives, false positives, and false negatives for the $k$-th known class, respectively. Similarly, $\mathrm{GQ}^{uk}$ is obtained for the unknown class with its true positives $\mathrm{TP}^{uk}$, false positives $\mathrm{FP}^{uk}$, and false negatives $\mathrm{FN}^{uk}$. The metrics $\mathrm{GQ}^{kn}$ and $\mathrm{GQ}^{uk}$ are computed on the GOSS output (see Figure 1 (d)); the known and unknown segments of the GOSS prediction are evaluated separately via $\mathrm{GQ}^{kn}$ and $\mathrm{GQ}^{uk}$. A unified metric is required to simplify the evaluation. Thus, we define the new metric GOSS Quality (GQ) as

$$\mathrm{GQ} = \lambda\,\mathrm{GQ}^{kn} + (1-\lambda)\,\mathrm{GQ}^{uk},$$

where we set $\lambda$ to the most natural value, 0.5, throughout the paper. The reason behind this design is that if we simply averaged GQ over $N+1$ classes, the ratio between known and unknown would be significantly biased ($N:1$); GQ instead takes care of the known and unknown segments equally. We also introduce $\mathrm{GQ}^{clu}$, the GOSS quality of the pixel clustering (see Figure 1 (b)) before it is fused with the pixel classification and identification. Refer to the supplementary material for more details of $\mathrm{GQ}^{clu}$.
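For illustration, here is a minimal sketch of the GQ computation under the matching rule above, assuming segments are available as boolean masks; the greedy matching and all function names are simplifications, not the official evaluation code.

```python
# A minimal sketch of the GQ metric: PQ-style per-class quality plus the
# lambda-weighted fusion of known and unknown qualities.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def quality(preds, gts) -> float:
    """PQ-style quality for one class; preds/gts are lists of boolean masks."""
    matched, iou_sum, tp = set(), 0.0, 0
    for p in preds:
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) > 0.5:   # match at IoU > 0.5
                matched.add(j); iou_sum += iou(p, g); tp += 1
                break
    fp, fn = len(preds) - tp, len(gts) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom > 0 else 0.0

def goss_quality(per_known_class_segs, unknown_segs, lam: float = 0.5) -> float:
    """GQ = lam * GQ_kn + (1 - lam) * GQ_uk."""
    gq_kn = np.mean([quality(p, g) for p, g in per_known_class_segs]) \
        if per_known_class_segs else 0.0
    gq_uk = quality(*unknown_segs)
    return lam * gq_kn + (1.0 - lam) * gq_uk

# toy usage: one known class and one unknown segment, both predicted perfectly
mask = [np.array([[1, 1], [0, 0]], dtype=bool)]
print(goss_quality(per_known_class_segs=[(mask, mask)], unknown_segs=(mask, mask)))
```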

Challenges
We argue that, compared with the existing OSS task, GOSS poses more challenges, since it requires not only identifying unknown regions in an image but also grouping pixels within these regions into clusters. Clustering objects from unknown classes greatly increases the task difficulty.

Methodology
In order to effectively perform GOSS, we propose a baseline framework. The baseline comprises five components: the shared encoder, the pixel classification branch, the pixel clustering branch, the identification module, and the fusion module. We then extend the baseline into a more advanced model, GOSS Segmentor (GST), as shown in Figure 3. More details of our design are described next.

Baseline
GOSS can be modeled as a unified segmentation task that incorporates pixel classification and clustering in an open-set scenario. Given an image $I_o \in \mathbb{R}^{3\times h_o \times \omega_o}$, we expect the proposed baseline to simultaneously generate the semantic and grouping predictions. Hence, we adopt a dual-branch architecture, with one branch for pixel classification and another for pixel clustering. As shown in Figure 3, the two branches share the same encoder as a feature extractor. The branch for pixel classification computes a prediction map $S \in \mathbb{R}^{h_o \times \omega_o}$, while the pixel clustering branch outputs a mask map $G \in \mathbb{R}^{h_o \times \omega_o}$ which contains the grouped class-agnostic segments. The unknown regions in $S$ are identified, denoted by $S_{ide}$, which is further fused with $G$ to obtain the final GOSS output $G_{oss} \in \mathbb{R}^{2\times h_o \times \omega_o}$. The baseline model is jointly trained with two losses, the classification loss $\mathcal{L}_{cla}$ and the clustering loss $\mathcal{L}_{clu}$. The total loss is $\mathcal{L}_{ws} = \alpha_{cla}\,\mathcal{L}_{cla} + \alpha_{clu}\,\mathcal{L}_{clu}$, where $\alpha_{cla}$ and $\alpha_{clu}$ are positive adjustment weights.
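The following is a minimal PyTorch sketch of the dual-branch baseline and its weighted loss, with placeholder convolutional heads standing in for the actual DeepLabV3+ and Super-BPD branches; the MSE term is used only as a stand-in for the BPD loss, and all shapes are illustrative.

```python
# A minimal sketch of the shared-encoder, dual-branch baseline.
import torch
import torch.nn as nn

class DualBranchBaseline(nn.Module):
    def __init__(self, num_known: int, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(                     # shared feature extractor
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(feat_dim, num_known, 1)  # pixel classification
        self.clu_head = nn.Conv2d(feat_dim, 2, 1)          # pixel clustering (e.g. BPD vectors)

    def forward(self, img):
        feat = self.encoder(img)
        return self.cls_head(feat), self.clu_head(feat)

model = DualBranchBaseline(num_known=20)
img = torch.randn(2, 3, 64, 64)
sem_gt = torch.randint(0, 20, (2, 64, 64))
bpd_gt = torch.randn(2, 2, 64, 64)

logits, bpd = model(img)
loss_cla = nn.functional.cross_entropy(logits, sem_gt)
loss_clu = nn.functional.mse_loss(bpd, bpd_gt)     # stand-in for the BPD loss
alpha_cla, alpha_clu = 1.0, 1e-4                   # weights as listed in Implementation
loss = alpha_cla * loss_cla + alpha_clu * loss_clu
loss.backward()
```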

Pixel Classification
We train the branch for pixel classification to classify each pixel as one of $N$ classes, where $N$ is the number of predefined known classes. DeepLabV3+ [9], an existing powerful baseline for semantic segmentation, is chosen as the basic architecture for this branch. The branch is updated under $\mathcal{L}_{cla}$, which is the cross-entropy loss between the predicted semantic map $S$ and its ground-truth map. DeepLabV3+ leverages an encoder-decoder architecture that takes a bottom-up pathway network with features at multiple spatial resolutions and appends a top-down pathway with lateral connections. The top-down pathway progressively upsamples features starting from the deepest layer of the network while concatenating or adding them with higher-resolution features from the bottom-up pathway. The Atrous Spatial Pyramid Pooling (ASPP) layer [8] is employed in the DeepLabV3+ model to enlarge the receptive field.
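As a concrete but hypothetical instantiation of this branch, one could use torchvision's DeepLabV3 with a ResNet-50 backbone and a per-pixel cross-entropy loss; this assumes torchvision ≥ 0.13 and is not necessarily the exact DeepLabV3+ configuration used in the paper.

```python
# A possible classification branch built from torchvision (illustrative only).
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet50

N = 20  # hypothetical number of known classes
# weights=None / weights_backbone=None avoid downloading pre-trained weights
cls_branch = deeplabv3_resnet50(weights=None, weights_backbone=None, num_classes=N)

img = torch.randn(1, 3, 128, 128)
sem_gt = torch.randint(0, N, (1, 128, 128))

out = cls_branch(img)["out"]             # (1, N, 128, 128) per-pixel logits
loss_cla = F.cross_entropy(out, sem_gt)  # L_cla: cross-entropy on the semantic map
```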

Pixel Identification
Pixel identification, as in OSS, is executed to identify the set of pixels belonging to unknown classes from the semantic prediction. Hereafter, we study several recipes for pixel identification, with which the identified semantic prediction $S_{ide} \in \mathbb{R}^{h_o \times \omega_o}$ is computed by processing $S$. Common OSS metrics like AUROC and AUPR assess the distribution overlap between known and unknown classes, but pixel identification is required to clearly state whether an input pixel is known or not. In other words, a binary classification is necessary.

N-model. In the pixel classification branch, it is natural to design the semantic segmentation model with an $N$-dimensional confidence output $C \in \mathbb{R}^{N\times h_o \times \omega_o}$. The N-model is restricted to recognizing the set of predetermined known classes. When an unknown region appears in a test image, it would be erroneously classified as one of the known classes. To identify unknown pixels based on outputs from the N-model, we employ Maximum Softmax Probability (MSP) [29], Maximum Unnormalized Logit (MaxLogit) [28], and Deep Metric Learning (DML) [4], respectively. Thresholds are used to clearly classify pixels as belonging to a known or unknown class. More details can be found in the supplementary material.

N+1-model. As opposed to the N-model, the N+1-model [18,13] includes the unknown class in the output $C \in \mathbb{R}^{(N+1)\times h_o \times \omega_o}$, so that it can directly identify the unknown pixels. During the training stage, the N+1-model explicitly takes the 'unlabeled' pixels (i.e., the 'void' pixels) as unknown pixels. The N+1-model is not valid if no 'void' pixels are provided.
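A minimal sketch of the N-model identification recipes (MSP and MaxLogit) with thresholding is given below; the threshold values and function names are illustrative, and DML is omitted since it requires a dedicated embedding head.

```python
# A minimal sketch of thresholded pixel identification for the N-model.
import torch
import torch.nn.functional as F

def identify_unknown(logits: torch.Tensor, method: str = "msp",
                     threshold: float = 0.5) -> torch.Tensor:
    """logits: (N, H, W) per-pixel class scores from the N-model.
    Returns a boolean (H, W) mask that is True for unknown pixels."""
    if method == "msp":
        score = F.softmax(logits, dim=0).max(dim=0).values   # max softmax probability
        return score < threshold           # low confidence -> unknown
    if method == "maxlogit":
        score = logits.max(dim=0).values   # max unnormalized logit
        return score < threshold
    raise ValueError(f"unknown method: {method}")

logits = torch.randn(20, 64, 64)
unknown_mask = identify_unknown(logits, "msp", threshold=0.5)
```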

Pixel Clustering
The pixel clustering branch is built in parallel with the pixel classification branch. The goal of this branch is to partition the whole image into clusters. During training, to generate the corresponding annotations, we convert the semantic labeling of the known classes into connectivity labeling by ignoring the previous semantics of each segment. The top-performing method, super boundary-to-pixel direction (Super-BPD) [61], is selected to establish the branch. The branch is trained in an end-to-end supervised manner as well. Super-BPD uses ground-truth annotations generated by the distance transform algorithm. Using the Super-BPD model, the boundary-to-pixel direction (BPD) representation of each pixel is learned ($\mathcal{L}_{clu} = \mathcal{L}_{bpd}$). Super-BPDs are extracted from the initial BPDs using a component-tree computation, followed by graph partitioning to merge super-BPDs into new segments.
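The conversion from a semantic map to class-agnostic connectivity labels can be sketched as follows, assuming each connected region of one class becomes a single segment; the subsequent distance-transform-based BPD ground truth of Super-BPD is omitted here.

```python
# A minimal sketch of connectivity labeling: connected regions of a single
# class become individual, class-agnostic segment ids.
import numpy as np
from scipy import ndimage

def connectivity_labels(semantic: np.ndarray) -> np.ndarray:
    """semantic: (H, W) integer class map. Returns an (H, W) map of segment ids."""
    segments = np.zeros_like(semantic, dtype=np.int32)
    next_id = 1
    for cls in np.unique(semantic):
        comp, n = ndimage.label(semantic == cls)           # connected components
        segments[comp > 0] = comp[comp > 0] + next_id - 1  # offset to unique ids
        next_id += n
    return segments

sem = np.array([[0, 0, 1], [2, 2, 1], [2, 0, 0]])
print(connectivity_labels(sem))
```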

Fusion Module
The identification module outputs the identified semantic prediction $S_{ide}$. Based on $S_{ide}$, the grouping output $G$ becomes $G_{uk} \in \mathbb{R}^{h_o \times \omega_o}$, where the element $g_i \to \text{void}$ if the corresponding semantic prediction $s_i \in \{0, \ldots, N-1\}$. Afterward, $S_{ide}$ is merged with $G_{uk}$ to form the GOSS output $G_{oss} \in \mathbb{R}^{2\times h_o \times \omega_o}$. For the $i$-th pixel in $G_{oss}$, $goss_i = (s_i, \text{void})$ if $s_i \in \{0, 1, \ldots, N-1\}$, or $goss_i = (N, g_i)$ if $s_i = N$. The prediction $G_{oss}$ can be viewed as a map composed of a set of segments (see 'GOSS prediction' in Figure 3).
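A minimal sketch of this fusion logic is shown below, where -1 is used (as an assumption) to encode the void cluster id in an array representation.

```python
# A minimal sketch of the fusion step: known pixels keep their semantic label
# with a void cluster id; unknown pixels take label N and their cluster id.
import numpy as np

def fuse(s_ide: np.ndarray, g: np.ndarray, num_known: int) -> np.ndarray:
    """s_ide: (H, W) identified semantic map with values in {0..N};
    g: (H, W) class-agnostic cluster ids. Returns a (2, H, W) GOSS map,
    with -1 encoding the void cluster id."""
    goss = np.stack([s_ide, g]).astype(np.int64)
    goss[1][s_ide < num_known] = -1        # known pixels: cluster id is void
    return goss

s_ide = np.array([[0, 20], [20, 3]])       # 20 == N marks unknown (hypothetical)
g = np.array([[5, 1], [1, 7]])
print(fuse(s_ide, g, num_known=20))
```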

GOSS Segmentor
As shown in Figure 3, the baseline model is extended to a new model that we call GOSS Segmentor (GST). Keeping the original five components of the baseline, we equip it with a confidence adjustment module and a contrastive learning module.

Confidence Adjustment
For the N+1-model, it is hard to accommodate unknown classes of objects, since the model is trained without seeing any examples from these classes. Instead of completely trusting the confidence prediction $C$, specifically for the pixel identification of the N+1-model, we propose to modify $C$ with a confidence adjustment. Particularly, for the $i$-th pixel with confidence scores after softmax $c_i \in \mathbb{R}^{N+1}$, the confidence of the unknown class is rescaled as $\hat{c}_i^{\,uk} = \beta_{uk}\, c_i^{\,uk}$, where $\beta_{uk} \in (1, +\infty)$ is the scale coefficient of the confidence of the unknown class.
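Under this multiplicative-rescaling reading, the adjustment can be sketched as follows; the exact form of the adjustment used in the paper may differ.

```python
# A minimal sketch of the confidence adjustment: the softmax confidence of the
# unknown (last) channel is scaled by beta_uk > 1 before taking the argmax.
import torch
import torch.nn.functional as F

def adjusted_prediction(logits: torch.Tensor, beta_uk: float = 5.0) -> torch.Tensor:
    """logits: (N+1, H, W) scores of the N+1-model; returns per-pixel labels."""
    conf = F.softmax(logits, dim=0)
    conf[-1] = beta_uk * conf[-1]               # boost the unknown-class confidence
    conf = conf / conf.sum(dim=0, keepdim=True) # renormalize (does not change argmax)
    return conf.argmax(dim=0)

logits = torch.randn(21, 64, 64)                # N = 20 known classes + 1 unknown
labels = adjusted_prediction(logits, beta_uk=5.0)
```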

Pixel Contrastive Learning
In order to learn more discriminative representations and better identify and cluster unknown inputs, inspired by [62], we adopt a pixel-wise contrastive learning algorithm in which each pixel embedding is contrasted against embeddings with a different semantic label. For the embedding of the $i$-th pixel, $e_i \in \mathbb{R}^{cn}$, in the feature map $E \in \mathbb{R}^{cn\times h_o \times \omega_o}$, where $cn$ is the channel number, the positive samples are pixel embeddings $e_i^{+}$ with the same ground-truth label as $e_i$ in the same image, while the negative samples are pixel embeddings $e_i^{-}$ with different ground-truth labels. The pixel-to-pixel contrastive loss [62,57] is then defined as

$$\mathcal{L}_{pc}^{i} = \frac{1}{|P_i|}\sum_{e^{+}\in P_i} -\log\frac{\exp(e_i \cdot e^{+}/\tau)}{\exp(e_i \cdot e^{+}/\tau) + \sum_{e^{-}\in N_i}\exp(e_i \cdot e^{-}/\tau)},$$

where $P_i$ and $N_i$ are the sets of positive and negative samples for pixel embedding $e_i$, and $\tau$ is the temperature parameter. We employ the semi-hard example sampling strategy from [62] to construct the positive and negative sample sets.
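A minimal sketch of this loss for a single anchor pixel is given below, assuming the embeddings are L2-normalized and the positive/negative sets have already been sampled (e.g., with the semi-hard strategy mentioned above).

```python
# A minimal sketch of the pixel-to-pixel contrastive (InfoNCE-style) loss for
# one anchor embedding e_i, averaged over its positive set.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(e_i: torch.Tensor, positives: torch.Tensor,
                           negatives: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """e_i: (C,), positives: (P, C), negatives: (Q, C); all L2-normalized."""
    pos_logits = positives @ e_i / tau               # (P,) similarities to positives
    neg_logits = negatives @ e_i / tau               # (Q,) similarities to negatives
    neg_term = torch.logsumexp(neg_logits, dim=0)    # log sum over negatives
    # -log( exp(pos) / (exp(pos) + sum exp(neg)) ), averaged over the positives
    loss = torch.logaddexp(pos_logits, neg_term) - pos_logits
    return loss.mean()

e = F.normalize(torch.randn(64), dim=0)
pos = F.normalize(torch.randn(4, 64), dim=1)
neg = F.normalize(torch.randn(16, 64), dim=1)
print(pixel_contrastive_loss(e, pos, neg))
```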
The pixel contrastive loss is merged with the classification loss $\mathcal{L}_{cla}$ and the clustering loss $\mathcal{L}_{clu}$ to give the total loss of GST: $\mathcal{L}_{ws} = \alpha_{cla}\,\mathcal{L}_{cla} + \alpha_{clu}\,\mathcal{L}_{clu} + \alpha_{pc}\,\mathcal{L}_{pc}$, where $\alpha_{pc}$ is a positive adjustment weight for $\mathcal{L}_{pc}$. The purpose of the contrastive learning scheme is to pull the latent representations of pixels from the same class closer together and push those from different classes further apart. Many existing works have utilized similar metric learning techniques for segmentation tasks and gained better empirical results [26,16,32,65]. In GOSS, we apply contrastive learning when training the model on known, closed-set data, and expect the trained model to generate more representative embeddings for both known and unknown pixels in the open-set evaluation.

Benchmark
Most datasets for OSS, like StreetHazards [28] and Road Anomaly [43], present separate unknown objects in an image. To ensure that objects of unknown classes appear together (i.e., adjacent) in natural images, in this work we split the full set of labeled categories into known and unknown classes at a suitable ratio. We simulate the training and testing of GOSS using existing semantic segmentation datasets, i.e., COCO-Stuff [3] and Cityscapes [14]. Note that the grouping labels of unknown segments are derived from their initial ground-truth semantic labels before the split. Following [61], we convert the original semantic labeling of unknown areas into GS ground truths using connectivity labeling.

COCO-Stuff-GOSS
COCO-Stuff [3] augments the popular COCO [42] dataset with stuff classes as well as dense pixel annotations. It has a large-scale semantic multi-class setting containing both 'thing' and 'stuff' classes. On COCO-Stuff, around 94% of the pixels are labeled with one semantic category, and the remaining pixels are 'unlabeled'. We customize COCO-Stuff to create a new benchmark named COCO-Stuff-GOSS. We strictly divide the existing specific classes of COCO-Stuff into known and unknown classes. Training and testing images are selected from 'train2017' and 'val2017', respectively. Categories defined as unknown are not represented in the training examples. Every selected testing example is composed of objects from the set of known categories as well as the set of unknown categories (or from only unknown categories). The statistics of the benchmark on different splits are shown in Table 2.

VOC Split. The 'VOC Split' is a common category split [51,30,35] that takes the 20 'thing' classes defined in PASCAL VOC [21] as 'known thing' classes. The remaining 60 'thing' classes and 91 'stuff' classes are chosen as 'unknown' classes.

Manual Split. We divide the COCO-Stuff categories according to how frequently each specific class appears. We count the number of occurrences of each class and calculate its ratio over the number of all training images (see the sketch below). For example, in the 'Manual-20/60' split, following the rule that at least one and at most two classes are chosen from each sub-class, we choose the 20 most popular 'thing' classes and treat the remaining 'thing' classes as unknown. Besides, all 'stuff' classes are set as unknown classes.

Random Split. We also conduct experiments with a 'Random Split', where all classes are randomly re-divided into known classes and unknown classes regardless of their super-class and sub-class. The data splits of VOC-20/60 and Manual-20/60 do not include 'stuff' categories as unknown classes, whereas Random-111/60 ensures that the known (or unknown) classes include specific classes from both the 'thing' and 'stuff' super-classes. More details can be found in Table 2.
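The frequency counting used to rank classes for the Manual Split can be sketched as follows; the data loading and the exact per-sub-class selection rules are assumptions, and only the counting itself is shown.

```python
# A minimal sketch of ranking classes by how often they occur across training images.
from collections import Counter

def class_frequencies(image_to_classes: dict, num_images: int) -> dict:
    """image_to_classes: {image_id: set of class ids present in that image}.
    Returns {class id: fraction of training images containing that class}."""
    counts = Counter()
    for classes in image_to_classes.values():
        counts.update(set(classes))
    return {c: n / num_images for c, n in counts.items()}

# toy example with 3 images and 3 classes; pick the 2 most frequent as 'known'
freq = class_frequencies({0: {1, 2}, 1: {2}, 2: {2, 3}}, num_images=3)
top_known = sorted(freq, key=freq.get, reverse=True)[:2]
print(freq, top_known)
```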

Cityscapes-GOSS
The Cityscapes [14] dataset consists of 5000 images (2975 train, 500 val, 1525 test) covering urban street scenes in driving scenarios. Dense pixel annotations of 19 classes are provided, 8 of which come with instance-level segmentation, i.e., 8 'thing' and 11 'stuff' classes. As one goal of the proposed GOSS is to advance self-driving systems, we construct the Cityscapes-GOSS benchmark. We divide the categories under the 'manual split'. As opposed to the COCO-Stuff-GOSS benchmark, all images are kept, regardless of whether they contain unknown categories or not. We consider pixels from unknown classes as 'void' pixels. Table 2 presents more details.

Manual Split. We present two versions of the Cityscapes-GOSS benchmark. Following the split in [4], we build the first version, 'Manual-16/3', which takes 'car', 'truck', and 'bus' as the 'unknown thing' classes. Based on the first version, we additionally set 'building', 'traffic sign', and 'vegetation' as 'unknown stuff', to produce a more challenging version, 'Manual-13/6'.

Experiments
Experimental results are presented in this section to demonstrate the rationality and effectiveness of GOSS. We perform our novel task on COCO-Stuff-GOSS and Cityscapes-GOSS using the baseline and the proposed GST. We also provide details of the original COCO-Stuff [3] and Cityscapes [14] for comparison. The performance is mainly measured via the metric GQ.

Implementation
For all models, a ResNet-50 [27] pre-trained on ImageNet [17] is utilized as the encoder backbone. All models are trained for 60K/40K iterations with a batch size of 10/2 on COCO-Stuff-GOSS/Cityscapes-GOSS. The initial learning rate is set to 5e-5. The weights $\alpha_{cla}$, $\alpha_{clu}$, and $\alpha_{pc}$ are 1.0, 1e-4, and 1e-2, respectively. The thresholds in the identification module and the scale $\beta_{uk}$ (for +CA) are set to 0.5 and 5.0, respectively. Our models are implemented in PyTorch [50]. We note that the N+1-model cannot be used for the Cityscapes-GOSS dataset: since we consider labels of unknown classes to be 'void' instead of directly filtering out such images, the entropy of void pixels is not allowed to be added to the loss.
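For convenience, the hyper-parameters listed above can be collected into a single configuration; the dictionary below only restates the stated values, and details not given here (e.g., the optimizer) are left out.

```python
# A compact restatement of the training hyper-parameters as a plain config dict.
config = dict(
    backbone="resnet50_imagenet",
    iterations={"coco_stuff_goss": 60_000, "cityscapes_goss": 40_000},
    batch_size={"coco_stuff_goss": 10, "cityscapes_goss": 2},
    lr=5e-5,
    loss_weights={"alpha_cla": 1.0, "alpha_clu": 1e-4, "alpha_pc": 1e-2},
    identification_threshold=0.5,
    beta_uk=5.0,   # confidence-adjustment scale (used with +CA)
)
```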

Results
The results of GOSS on COCO-Stuff-GOSS and Cityscapes-GOSS using various identification methods are reported in Tables 3 and 4, respectively. In addition to the GOSS quality, we also provide metrics for the OSS (AUROC and AUPR) and GS (mIoU and $\mathrm{GQ}^{clu}$) tasks to show that the models perform reasonably on these relevant older tasks.
For COCO-Stuff-GOSS in Table 3, GST is the best-performing model. For instance, on the 'Manual-20/60' split, GST attains 9.15% GQ, outperforming the N-model+MSP by a healthy margin of nearly 1.45%. Compared to the other baselines, the pixel contrastive learning module helps GST better discriminate between known pixels and unknown pixels in most cases (see 'OSS Metric' in Table 3). Moreover, it boosts the clustering accuracy in GS. One of the baseline models, N-model+MaxLogit with thresholding, sacrifices much of $\mathrm{GQ}^{kn}$ but achieves a high $\mathrm{GQ}^{uk}$. As expected, MaxLogit identifies more unknown areas; however, it does not simultaneously maintain the pixel classification accuracy [36]. For Cityscapes-GOSS in Table 4, we find a performance ranking similar to that of COCO-Stuff-GOSS in Table 3.
Several examples from the two built benchmarks are visualized in Figure 4 to better illustrate the GOSS setting. Taking one example from the second row of Figure 4, GOSS accurately segments out the 'unknown dogs' from the 'unknown grass' (see Figure 4 (e)). Compared to the OSS prediction in Figure 4 (c), GOSS provides richer information for intelligent agents to make decisions. As for the GST model, it can be observed from the first and second rows of Figure 4 that the confidence adjustment module of GST effectively helps the N+1-model detect more unknown regions (see Figure 4 (b) and (c)).

Analysis
Training Strategy. For all models in Tables 3 and 4, the pixel classification branch and the pixel clustering branch are trained in a unified single architecture. Here, we study a different training strategy, 'Separate', where the two branches are trained separately and their outputs are then merged. We find that the performance of 'Separate' and that of our 'Single' network are close. We finally choose 'Single' since it is fast, light, and easy to implement.
Clustering Method. We train the pixel clustering branch for grouping unknown areas in an unsupervised manner by applying the differentiable feature clustering (DFC) loss [37]. DFC in the unsupervised setting achieves a basic clustering performance, but it is worse than Super-BPD in the supervised setting.
Challenging Task. The results in Tables 3 and 4 verify that GOSS is a very challenging task, despite our baseline framework relying on strong backbones and a reasonable architecture. The first main reason is that it is non-trivial to perform accurate pixel identification under the open-set setting. Furthermore, the clustering branch suffers a performance drop when the model encounters the unfamiliar appearance of objects from unknown categories at test time. There is significant room for future improvement on the task of GOSS.

Table 6 .
Ablation study on COCO-Stuff-GOSS under the 'Random-111/60' split: clustering method. Two clustering models of GST, Super-BPD in a supervised setting and DFC [37] in an unsupervised setting, are compared.

Conclusion
The novel setting referred to as GOSS has been introduced in this paper. Our goal is to build upon the well-defined OSS towards generating more comprehensive predictions. The task is to semantically classify pixels as one of the known classes or the unknown class, as well as to cluster the detected unknown pixels. With more information extracted inside unknown regions, GOSS benefits intelligent agents in their decision-making process. Specific to the new setting, a metric, two benchmarks, and a corresponding baseline model are presented. In future work, the concept of GOSS can be further extended to instance segmentation, image co-segmentation, video segmentation, point cloud segmentation, etc. We hope that this work may provide a new alternative towards more comprehensive pixel-level scene understanding.

Figure 2 .
Figure 2. Toy model of the ground truth and predicted GOSS of an image. The predicted segments for 'unknown' are partitioned into true positives $\mathrm{TP}^{uk}$, false positives $\mathrm{FP}^{uk}$, and false negatives $\mathrm{FN}^{uk}$.

Figure 3 .
Figure 3. The framework of GOSS Segmentor (GST). The input image is fed into the encoder for feature extraction. The dual-branch heads are jointly trained for pixel classification and clustering. Furthermore, pixel-wise contrastive learning is leveraged to learn discriminative feature embeddings. The pixel identification module is designed to recognize sets of pixels of the unknown class from the semantic prediction. The final GOSS output is generated by fusing the identified semantic prediction and the grouping prediction.

Table 1 .
Comparisons of different image segmentation tasks. Compared to traditional segmentation tasks, GOSS takes better care of unknown objects.

Table 2 .
Details of the different splits of the COCO-Stuff-GOSS/Cityscapes-GOSS benchmarks. The numbers in the table indicate, for each data split, how many known (or unknown) classes are selected and how many training (or testing) images are kept.

Table 3 .
GOSS results of GST (N+1-model+CA+CL) on COCO-Stuff-GOSS under three splits. 'CA' is the confidence adjustment and 'CL' is the cross-pixel contrastive learning. The best results of GQ are in bold.

Table 4 .
GOSS results of GST (N-model+CL) on Cityscapes-GOSS under the 'Manual-16/3' split. 'CL' is the cross-pixel contrastive learning. The best results of GQ are in bold. The results of the 'Manual-13/6' split are provided in the supplementary material.