Query Semantic Reconstruction for Background in Few-Shot Segmentation

Few-shot segmentation (FSS) aims to segment unseen classes using a few annotated samples. Typically, a prototype representing the foreground class is extracted from annotated support image(s) and is matched to features representing each pixel in the query image. However, models learnt in this way are insufficiently discriminative, and often produce false positives: misclassifying background pixels as foreground. Some FSS methods try to address this issue by using the background in the support image(s) to help identify the background in the query image. However, the backgrounds of these images are often quite distinct, and hence, the support image background information is uninformative. This article proposes a method, QSR, that extracts the background from the query image itself, and as a result is better able to discriminate between foreground and background features in the query image. This is achieved by modifying the training process to associate prototypes with class labels including known classes from the training data and latent classes representing unknown background objects. This class information is then used to extract a background prototype from the query image. To successfully associate prototypes with class labels and extract a background prototype that is capable of predicting a mask for the background regions of the image, the machinery for extracting and using foreground prototypes is induced to become more discriminative between different classes. Experiments for both 1-shot and 5-shot FSS on both the PASCAL-5i and COCO-20i datasets demonstrate that the proposed method results in a significant improvement in performance for the baseline methods it is applied to. As QSR operates only during training, these improved results are produced with no extra computational complexity during testing.


Introduction
The ability to segment objects is a long-standing goal of computer vision, and recent methods have achieved extraordinary results (He, Zhang, Ren and Sun, 2016; He, Deng, Zhou, Wang and Qiao, 2019; Long, Shelhamer and Darrell, 2015). These results depend on a large number of pixel-level annotations which are time-consuming and costly to produce. When only a few exemplars from a novel class are available, these methods overfit and perform poorly. To deal with this situation, few-shot segmentation (FSS) methods aim to predict a segmentation mask for a novel category using only a few images and their corresponding segmentation ground-truths.
Most current FSS algorithms (Zhang, Lin, Liu, Yao and Shen, 2019b; Siam, Oreshkin and Jagersand, 2019; Zhang, Lin, Liu, Guo, Wu and Yao, 2019a; Lu, He, Zhu, Zhang, Song and Xiang, 2021; Liu, Ding, Jiao, Ji and Ye, 2021; Li, Jampani, Sevilla-Lara, Sun, Kim and Kim, 2021; Wu, Shi, Lin and Cai, 2021; Zhang, Xiao and Qin, 2021) follow a similar sequence of steps. Features are extracted from support and query images by a shared convolutional neural network (CNN) which is pre-trained on ImageNet (Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein et al., 2015; Yang, Liu, Li, Jiao and Ye, 2020; Siam, Doraiswamy, Oreshkin, Yao and Jagersand, 2020; Zhang et al., 2019b). Then the support image ground-truth segmentation mask is used to identify the foreground information in the support features. Generally, the object class is represented by a single foreground prototype feature vector (Wang, Liew, Zou, Zhou and Feng, 2019; Yang et al., 2020; Tian, Zhao, Shu, Yang, Li and Jia, 2020; Zhang et al., 2021; Li et al., 2021). Finally, a decoder is used to calculate the similarity between the foreground prototype and every pixel in the query feature-set to predict the locations occupied by the foreground object in the query image. This standard approach ignores the importance of background features that can be mined for negative samples in order to reduce false positives, and hence, make the model more discriminative. Some FSS methods (Yang et al., 2020; Boudiaf, Kervadec, Masud, Piantanida, Ayed and Dolz, 2021; Wang et al., 2019) extract background information from support images by using the support masks to identify the support image background. RPMMs (Yang et al., 2020) uses the Expectation-Maximization (EM) algorithm to mine more background information in the support images. MLC (Yang, Zhuo, Qi, Shi and Gao, 2021) extracts a global background prototype by averaging together the backgrounds extracted from the whole training data in an offline process, then updates this global background prototype with the support background during training. However, the same category of object may appear against different backgrounds in different images. The background information extracted from or aligned with the support image(s) is, therefore, unlikely to be useful for segmenting the query image. Existing FSS methods ignore the fact that the background information of an image is most relevant for segmenting that specific image.

Email addresses: haoyan.guan@kcl.ac.uk (H. Guan); michael.spratling@kcl.ac.uk (M. Spratling)
In this paper, we are motivated by the issue illustrated in Fig. 1 and design a method that can extract background information from the query image itself to make existing FSS algorithms more discriminative. Our method, Query Semantic Reconstruction (QSR), separates the features extracted from a query image according to known classes and latent classes. Known classes are the categories that appear in the training data, like dog and cat in the example used in Fig. 1. Latent classes are unknown categories like mat and wall which are not explicitly labelled in the training data, but which can appear in the background of the training images. QSR learns to eliminate the foreground information according to the class labels. The remaining classes are used to define a prototype for the background of the query image that excludes contributions from the foreground class.
The extracted foreground and background prototypes are used as input to the prototype decoder module from the underlying, baseline, FSS method. The decoder produces predictions of foreground and background masks. The predictions are compared to a ground-truth mask and the loss is used to tune the parameters of the model. For these foreground and background prototypes to be effective at identifying the foreground and background regions of the query image, the whole model must learn prototypes that discriminate between features representing different semantics in the images. Hence, our method trains the underlying FSS method so that at test time it is able to segment images more accurately. Our method only predicts background masks during training, to optimize the whole model. Hence, during testing the method is identical to that of the baseline.
The main contributions of our work are as follows:
1. To address the long-standing problem of high false positive rates in FSS, and to demonstrate that background information from the query image itself can be usefully employed for segmentation, we propose QSR, which can be applied to many existing FSS algorithms to make them better able to discriminate between foreground and background objects.
2. QSR improves existing FSS methods through optimized training. During testing our method is identical to the baseline, so no additional parameters or extra computation is needed at test-time.
3. We demonstrate the effectiveness of QSR using three different baseline methods: CaNet (Zhang et al., 2019b), ASGNet (Li et al., 2021) and PFENet (Tian et al., 2020). For the PASCAL-5i dataset, QSR improves the mIoU results of 1-shot and 5-shot FSS by 1.0% and 1.5% for CaNet, 1.8% and 2.1% for ASGNet, and 1.9% and 4.8% for PFENet. For the COCO-20i dataset, QSR improves ASGNet by 2.8% and 1.6%, and PFENet by 4.5% and 3.8%.
4. Our method achieves new state-of-the-art performance on PASCAL-5i, with mIoU of 62.7% in 1-shot and 66.7% in 5-shot. On the COCO-20i dataset, our method achieves strong results of 36.9% in 1-shot and 41.2% in 5-shot.

Related Work
Semantic segmentation. Semantic segmentation requires the prediction of per-pixel class labels. The introduction of end-to-end trained fully convolutional networks (Long et al., 2015) has provided the foundation for recent success on this task. Additional innovations to further improve segmentation accuracy have included a multi-scale cascade model named U-Net (Ronneberger, Fischer and Brox, 2015), dilated convolution (Chen, Zhu, Papandreou, Schroff and Adam, 2018) and pyramid pooling (Zhao, Shi, Qi, Wang and Jia, 2017). In contrast to these methods, we explore semantic segmentation in the few-shot scenario.
Few-shot learning. Few-shot learning (FSL) explores methods to enable models to quickly adapt to perform classification of new data. FSL methods can be categorized into generation, optimization or metric learning approaches. Generation methods (Hariharan and Girshick, 2017; Wang, Girshick, Hebert and Hariharan, 2018; Chen, Fu, Zhang, Jiang, Xue and Sigal, 2019; Liu, Sun, Han, Dou and Li, 2020) generate samples or features to augment the novel class data. Optimization approaches (Finn, Abbeel and Levine, 2017; Ravi and Larochelle, 2017) learn commonalities among different tasks, so that a novel task can be fine-tuned on a few annotated samples based on the commonalities. Metric learning methods (Snell, Swersky and Zemel, 2017; Grant, Finn, Levine, Darrell and Griffiths, 2018) learn to produce a feature space that allows samples to be classified by comparing the distance between their features. Most FSL methods focus on image classification and cannot be easily adapted to produce the per-pixel labels required for segmentation.
Few-shot segmentation learning. The first FSS method (Shaban, Bansal, Liu, Essa and Boots, 2017) employed a two-branch comparison framework that has become the basis for FSS methods. PANet (Wang et al., 2019) used prototype feature-vectors to represent support object classes, then compared their similarity with query features to make predictions. Other methods have improved different aspects of this process, for example, by extracting multiple prototypes representing different semantic classes (Yang et al., 2020; Li et al., 2021), by iteratively refining the predictions (Zhang et al., 2019b), or by using a training-free prior mask generation method (Tian et al., 2020). Some methods extract information not only from support images, mining latent classes from the training dataset to search for more prototypes (Yang et al., 2021), or supplementing prototypes with support predictions (Zhang et al., 2021).

Problem Setting
The training set is D_train = {(I_i, M_i)}, where M_i is the semantic segmentation mask for the training image I_i, and the number of image-mask pairs is large. During testing, the model has access to a support set S = {(I_s^j, M_s^j)}, where M_s^j is the semantic segmentation mask for support image I_s^j, and k is the number of image-mask pairs, which is small (typically either 1 or 5, for 1-shot and 5-shot tasks respectively). A query (or test) set Q = {(I_q, M_q)} is used to evaluate the performance of the model, where M_q is the ground-truth mask for image I_q. The model uses the support set S to predict a segmentation mask, M̂_q, for each image in the query set Q.

Figure 2: An overview of our method for 1-shot segmentation. Like other FSS methods, our method extracts a foreground prototype from the support image and uses this to predict a foreground segmentation mask for the query image. QSR (dashed box) operates at training time to learn to represent different semantic categories in the query image, and uses this class information to define a background prototype. The background prototype is then used to predict a segmentation mask for the background regions of the query image via the same decoder as is used for the foreground prediction. To improve the accuracy of this additional prediction, the decoder is induced to become more discriminative. This ability to discriminate between foreground and background objects results in improved performance at test time, when the process illustrated in the dashed region is not used.

Overview
Fig. 2 illustrates our method for 1-shot segmentation. Both support and query images are input into a shared CNN. In common with our baselines, CaNet (Zhang et al., 2019b), ASGNet (Li et al., 2021) and PFENet (Tian et al., 2020), we use a ResNet (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015) for this encoder backbone and use the features generated by stages 2 and 3. All parameter values in stages 2 and 3, and earlier layers, are fixed. These features are concatenated and encoded using a convolution layer. The convolution layer parameters are optimized by the loss function (details in Section 4.3). For CaNet (Zhang et al., 2019b) and ASGNet (Li et al., 2021), this layer has a 3 × 3 convolution kernel shared between the support and query branches. For PFENet (Tian et al., 2020), two independent 1 × 1 convolution layers are defined for the support and query features respectively. After the convolution layer, the CNN produces support features F_s and query features F_q of size c × h × w, where c is the number of channels, and h, w are the height and width.
As for the baseline methods (Zhang et al., 2019b; Li et al., 2021; Tian et al., 2020), masked average pooling (MAP) was used to extract the foreground prototype p_f:

p_f = ( Σ_x F_s(x) 1[M_s(x) = 1] ) / ( Σ_x 1[M_s(x) = 1] )    (1)

where x indexes the spatial locations of features, and 1[⋅] is the indicator function, which equals 1 if the argument is True and 0 otherwise.
Global average pooling (GAP) was used to extract a query prototype p_q from the query features F_q:

p_q = (1 / hw) Σ_x F_q(x)    (2)

Both the foreground and query prototypes were input to our QSR method (defined in Section 4.2). QSR maps different regions of the query image to semantic classes, and uses this class information to generate a background prototype p_b:

p_b = QSR(p_f, p_q)    (3)

In Section 4.3, we describe how we utilise the prototype decoder module from the baseline FSS method. These modules are used to predict the final semantic segmentation masks. The foreground prototype p_f is used to make a foreground prediction M̂_f and the background prototype p_b is used for a background prediction M̂_b. The prototype decoder modules for foreground and background prediction are identical and share parameters. Our method only predicts a background mask during training. During testing the method is identical to the baseline and only uses the foreground prototype to predict the foreground mask.
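The two pooling steps above reduce to a few lines of array code. A minimal NumPy sketch follows; the array shapes and function names are our own choices for illustration, not from the paper:

```python
import numpy as np

def masked_average_pooling(feats, mask):
    """Masked average pooling (MAP): average the support features over the
    foreground pixels indicated by the (downsampled) binary support mask.

    feats: (c, h, w) support feature map
    mask:  (h, w) binary foreground mask
    returns: (c,) foreground prototype
    """
    fg = mask.astype(feats.dtype)                  # 1[M_s(x) = 1]
    denom = max(fg.sum(), 1e-8)                    # number of foreground pixels
    return (feats * fg[None, :, :]).sum(axis=(1, 2)) / denom

def global_average_pooling(feats):
    """Global average pooling (GAP): average the query features over all
    spatial locations to obtain the query prototype."""
    return feats.mean(axis=(1, 2))
```

In both cases the spatial dimensions are averaged away, leaving a single c-dimensional prototype vector per image.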
In this paper, we limited ourselves to being consistent with the baselines: using a frozen backbone CNN and masked average pooling to extract a single foreground prototype. In addition, we also extract only one background prototype, making it possible to share parameters in the decoder module that is applied to both the foreground and background prototypes. Future work might usefully explore improved methods of representing foreground objects, for example, by using multiple prototypes.

Query Semantic Reconstruction
Our method assumes that images contain objects from known classes and latent classes. Known classes are ones corresponding to the labels provided in the training data and we define them as C_k = {k_0, k_1, ..., k_{n_k}}. The number of known classes, n_k, is defined by the training dataset; for example n_k = 15 in PASCAL-5i (Everingham, Van Gool, Williams, Winn and Zisserman, 2010). During training, the foreground class c_f is contained in C_k. Latent classes are given the generic label of 'background' in the training data. However, we define multiple latent classes to represent possible background objects and they are defined as C_l = {l_0, l_1, ..., l_{n_l}}. The number of latent classes, n_l, is a hyper-parameter and the effects of different values were explored in experiments, the results of which are reported in Table 6. The background class must be a member of the set of latent classes or the set of known classes, excluding the class of the foreground object, which can be expressed as:

c_b ∈ C_l ∪ (C_k ⧵ {c_f})    (4)

Mapping between prototype feature-vectors and classes is achieved using a layer of weights. A known class weight matrix W_k, whose size is n_k × c, maps from the 1 × c prototype to the known class labels. Hence, each row vector in W_k represents the corresponding category in C_k = {k_0, k_1, ..., k_{n_k}}. In the same way, a latent class weight matrix W_l, with size n_l × c, maps from a prototype to the latent categories in C_l = {l_0, l_1, ..., l_{n_l}}. W_k and W_l are both randomly initialized.
The known class weights can be learnt directly from the training data. In each episode, (F_s, p_f) is calculated from (I_s, M_s), where c_f ∈ C_k. W_k p_f is used as the prediction for the category of the foreground object. Cross-entropy (CE) loss can then be used to update the known class weights to provide better representations of object class labels:

L_known = CE(W_k p_f, c_f)    (5)

The true latent class labels are unknown, so learning the latent class weights assumes that all categories (both known and latent) should be independent of each other. A possible method to achieve this is the application of contrastive loss (Zbontar, Jing, Misra, LeCun and Deny, 2021; Chen and He, 2021) to constrain each class representation to be independent by maximizing the orthogonality of their representations. A previous FSS method, ASR (Liu et al., 2021), has used contrastive loss to generate orthogonal semantic prototypes for foreground classes. In this paper, we apply the technique used in (Zbontar et al., 2021), a more efficient method, to constrain all class weights to be independent. Specifically, we define W as the concatenation of W_k and W_l (i.e. W has size (n_k + n_l) × c), and first calculate the cross-correlation matrix, X, as:

X = W W^T    (6)

The loss function for learning the latent class weights is defined as:

L_latent = Σ_i (1 − X_ii)^2 + Σ_i Σ_{j≠i} X_ij^2    (7)

where i, j index the locations of the cross-correlation matrix. The latent loss tries to make the cross-correlation matrix close to the identity matrix. This causes each category to be statistically independent of all others.
As illustrated in Fig. 3, a background score, s, is calculated to measure the correlation between each non-foreground class and the query image prototype:

s = softmax(W' p_q)    (8)

where p_q is the query prototype from Eq. (2), and W' denotes W with the row corresponding to the foreground class removed. Finally the background prototype p_b is calculated by back-projecting the scores (which represent the classes predicted to be present in the background) through the weights that represent the classes:

p_b = Σ_i s_i W'[i, :]    (9)

where the colon means the whole dimension. This generates a prototype that represents a mixture of feature-vectors representing the classes believed to be present in the background of the query image.
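The scoring and back-projection steps can be sketched as follows. This is a hedged NumPy illustration of the mechanism as described in the text (score every non-foreground class against the query prototype, then mix the corresponding weight rows); the function names and the exact exclusion of the foreground row are our assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def background_prototype(p_q, W, fg_index):
    """Back-project class scores into a background prototype.

    p_q:      (c,) query prototype (GAP of the query features)
    W:        ((n_k + n_l), c) stacked known and latent class weights
    fg_index: row of the foreground class, excluded from the background
    returns:  (c,) background prototype, a score-weighted mix of W's rows
    """
    scores = W @ p_q                                 # correlation per class
    keep = np.arange(W.shape[0]) != fg_index         # drop the foreground class
    s = softmax(scores[keep])                        # background scores
    return s @ W[keep, :]                            # sum_i s_i * W[i, :]
```

The returned vector lies in the span of the non-foreground class weights, i.e. it represents only classes believed to be in the background.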
In order to be able to share the same decoder with the baseline, c is set to 256. However, such a large value may cause the background prototypes to be redundant. On PASCAL-5i the ratio between the number of classes (n_k + n_l) and c in W is 30:256, compared to about 8:1 in (Zbontar et al., 2021). Although these two ratios are used in unrelated tasks, and we also have the known loss to constrain the W_k part of W, in future work it would be worthwhile making c a hyper-parameter that can be tuned for different datasets.

Prototypes Decoder Module
We use CaNet (Zhang et al., 2019b), ASGNet (Li et al., 2021) and PFENet (Tian et al., 2020) as baselines on which to test our method. These methods have been widely used as the underlying model enhanced by various previous techniques (Yang et al., 2020; Wu et al., 2021; Zhang et al., 2021). Unlike most previous methods that modify the structure of the baseline decoder network, we try to improve it through better training. Each baseline incorporates a prototype decoder module (called the Iterative Optimization Module in CaNet, the FPN in ASGNet and the Feature Enrichment Module in PFENet) that takes as input the foreground prototype and query features, and outputs a predicted segmentation mask M̂_f. In addition to using this module in the standard way, we also use it with the foreground prototype replaced by the background prototype, so that it outputs a background prediction M̂_b. When predicting the background mask with the ASGNet baseline, we use only one background prototype, ignoring its ability to use multiple prototypes. PFENet also uses a prior mask P to supplement M̂_f, and this input is replaced by (1 − P) to predict M̂_b when using PFENet as the baseline.
Based on the two predicted segmentation masks, we define two loss functions which are consistent with those used by the baselines:

L_f = L(M̂_f, M_q)    (10)
L_b = L(M̂_b, 1 − M_q)    (11)

The overall loss combines the losses defined in Eqs. 5, 7, 10 and 11, as follows:

L = L_f + α L_b + β (L_known + L_latent)    (12)

where α and β are parameters to balance the losses. Results for experiments investigating the effects of these hyper-parameters are reported in Table 7. When α = β = 0, L = L_f and the whole method degenerates to the baseline.
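The composition of the overall objective can be made concrete with a short sketch. The grouping of terms under the two balance weights is our inference from the ablation description (where the weight on the background mask loss is reported as making "background and foreground weighted equally"), so treat it as an assumption rather than the paper's exact equation:

```python
def total_loss(L_f, L_b, L_known, L_latent, alpha=1.0, beta=0.5):
    """Assumed form of the combined QSR objective (scalar loss values).

    L_f:      baseline foreground segmentation loss
    L_b:      background segmentation loss
    L_known:  known-class cross-entropy loss
    L_latent: latent-class decorrelation loss
    With alpha = beta = 0 the objective reduces to the baseline loss L_f.
    """
    return L_f + alpha * L_b + beta * (L_known + L_latent)
```

Setting `alpha = beta = 0` recovers the baseline's training objective exactly, which is the degenerate case noted in the text.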
For multi-shot tasks (i.e. when applied to k-shot FSS with k > 1), we use the same method as the corresponding baseline. Specifically, CaNet (Zhang et al., 2019b) uses an attention mechanism to fuse the features generated by each of the k support images. ASGNet (Li et al., 2021) uses super-pixels to generate multiple prototypes from the support images. PFENet (Tian et al., 2020) averages the foreground prototypes from the k support images together. As QSR obtains the background prototype from the query image, QSR is unaffected by the number of support images, which makes it easy to integrate with different baseline methods.
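As one concrete instance of the k-shot strategies above, PFENet-style averaging of per-support prototypes is trivial to express. A NumPy sketch (our own illustration, reusing the MAP definition from Eq. (1)):

```python
import numpy as np

def kshot_foreground_prototype(support_feats, support_masks):
    """PFENet-style k-shot handling: compute a MAP prototype for each
    support image, then average the k prototypes. QSR itself is unchanged
    by k, since its background prototype comes from the query image.

    support_feats: list of k (c, h, w) feature maps
    support_masks: list of k (h, w) binary masks
    returns: (c,) averaged foreground prototype
    """
    protos = []
    for feats, mask in zip(support_feats, support_masks):
        fg = mask.astype(feats.dtype)
        protos.append((feats * fg[None]).sum(axis=(1, 2)) / max(fg.sum(), 1e-8))
    return np.mean(protos, axis=0)
```

With k = 1 this reduces to plain masked average pooling of the single support image.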

Experimental Setup
Datasets. We evaluate our method on two benchmark datasets, PASCAL-5i (Shaban et al., 2017) and COCO-20i (Nguyen and Todorovic, 2019). PASCAL-5i combines the PASCAL VOC2012 (Everingham et al., 2010) and extended SDS (Hariharan, Arbeláez, Girshick and Malik, 2014) datasets. It contains 20 classes which are divided into 4 folds each containing 5 classes. COCO-20i is the MS-COCO dataset (Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick, 2014) with the 80 classes divided into 4 folds each containing 20 classes. Following previous standard practice (Zhang et al., 2019b; Tian et al., 2020), we use 4-fold cross-validation to measure performance on both datasets: testing each fold in turn using a model that has been trained on the other three folds. A random sample of 1,000 query-support pairs is used to test each fold in PASCAL-5i and 20,000 in COCO-20i.
Implementation details. As mentioned above, we use CaNet (Zhang et al., 2019b), ASGNet (Li et al., 2021) and PFENet (Tian et al., 2020) as baselines. The whole model is trained end-to-end. As QSR is only used in the training phase, the model is identical to the baseline during testing. The details specific to QSR were as follows: the class weights (Section 4.2) were initialized from the uniform distribution (−√(1/c), √(1/c)). The loss weights α and β (Eq. (12)) were set to 1.0 and 0.5 in PASCAL-5i, and to 1.0 and 0.1 in COCO-20i. The motivation for reducing β for COCO-20i was that this dataset has more categories. The number of latent classes (Section 4.2) was set to 15 in PASCAL-5i.

Table 1: mIoU (%) results for 1-shot and 5-shot FSS on PASCAL-5i. 'Mean' is the mIoU averaged across folds. The best result for each column is in bold. The methods of the first two rows use VGG16 (Simonyan and Zisserman, 2014) for feature extraction while all others use ResNet-50 (He et al., 2016).
During training, we used the methods and hyper-parameters of the baselines. Specifically, for CaNet (Zhang et al., 2019b), weights were optimised using SGD with a momentum of 0.9 and a weight decay of 0.0005. Training was performed for 200 epochs with a learning rate of 0.00025 and a batch size of 4. For ASGNet (Li et al., 2021), the model was trained with the SGD optimizer and an initial learning rate of 0.0025 with batch size 4 on PASCAL-5i, and 0.005 with batch size 8 on COCO-20i. For PFENet (Tian et al., 2020), SGD was also used as the optimizer. The momentum was set to 0.9 and the weight decay to 0.0001. On PASCAL-5i, 200 epochs were used with a learning rate of 0.0025 and a batch size of 4. On COCO-20i, the PFENet baseline was trained for 50 epochs with a learning rate of 0.005 and a batch size of 8. On both datasets, the learning rate was reduced following the "poly" policy (Chen, Papandreou, Kokkinos, Murphy and Yuille, 2017).
Evaluation metrics. Following standard practice, we use mean intersection over union (mIoU) as the primary evaluation metric. It computes the IoU for each individual foreground class and then calculates the average of these values over all classes (5 in PASCAL-5i). We also report results for FB-IoU, which calculates the mean IoU for the foreground (i.e. for all objects, ignoring class labels) and the background. We additionally use the false positive rate (FPR), defined as FPR = FP / (FP + TN), where FP is the number of background pixels incorrectly labelled as foreground, and TN is the number of background pixels correctly labelled as background.
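Both metrics are simple pixel counts on binary masks. A minimal NumPy sketch (function names are ours; mIoU averages the per-class IoU over the fold's classes):

```python
import numpy as np

def false_positive_rate(pred, gt):
    """FPR = FP / (FP + TN), computed over the background ground-truth pixels.

    pred, gt: binary arrays of the same shape (1 = foreground).
    """
    fp = np.logical_and(pred == 1, gt == 0).sum()  # background labelled foreground
    tn = np.logical_and(pred == 0, gt == 0).sum()  # background labelled background
    return fp / max(fp + tn, 1)

def class_iou(pred, gt):
    """IoU for a single foreground class; mIoU is this averaged over classes."""
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return inter / max(union, 1)
```

A prediction that covers the whole image scores an FPR of 1.0 and an IoU equal to the foreground's share of the image, which is why FPR usefully complements mIoU when diagnosing over-segmentation.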

Comparison with the State-of-the-Art
Table 1 and Table 2 compare our method with other approaches on PASCAL-5i. When QSR is applied to PFENet, the method outperforms the previous state-of-the-art in both the 1-shot and 5-shot settings. For each baseline, the QSR method improves performance on every fold, and overall, for both 1-shot and 5-shot segmentation tasks. This is achieved with only a small increase in the number of learnable parameters, as indicated in the last column of Table 2. These additional parameters are due to matrix W (see Section 4.2), and are only used during training: at test time the proposed method uses an identical number of parameters to the corresponding baseline. The ability to improve performance for three existing FSS methods suggests that QSR may have the potential to provide a general-purpose method of improving the accuracy of FSS approaches. Additional results using a different backbone architecture are shown in Table 3.
Table 3: mIoU (%) results for 1-shot and 5-shot FSS on PASCAL-5i. These results were produced using a different feature extraction backbone than was used for the corresponding results in Table 1.

Method                         1-shot   5-shot
RPMMs (Yang et al., 2020)      30.6     35.5
CWT (Lu et al., 2021)          32.9     41.3
ASR (Liu et al., 2021)         32.6     34.4
RePRI (Boudiaf et al., 2021)   34.1     41.6
MMNet (Wu et al., 2021)        37.2     38.0
SCL (Zhang et al., 2021)       37.0     39.9
MLC (Yang et al., 2021)        33.9     40.6
ASGNet (Li et al., 2021)       34.6     42.5
ASGNet+QSR                     37.4     44.1
PFENet (Tian et al., 2020)     32.4     37.4
PFENet+QSR                     36.9     41.2

These results show that increasing the size of the backbone does not, in this case, improve performance, but that QSR continues to improve performance in comparison with the baseline. Table 4 compares our method with other approaches on COCO-20i. QSR is able to increase performance when used in conjunction with both baselines, and for the ASGNet baseline increases performance to a level that is state-of-the-art. This is achieved with only a small increase in the number of learnable parameters used during training. The number of additional parameters is 15.36k. The reason for the larger increase in parameters here, compared to that for PASCAL-5i, is that matrix W is larger due to the increased number of classes. More detailed results for the proposed method, showing performance on individual folds and with different backbones, are shown in Table 5. These results show that QSR is consistent in improving performance across folds.

Ablation Study
The following ablation studies were conducted with the PFENet baseline using the 1-shot setting on PASCAL-5i.
Numbers of latent classes. Table 6 compares the performance achieved when using different numbers of latent classes, n_l. When n_l = 0 there are no latent classes, only known classes, and the background class must be one of the known classes (see Section 4.2). It can be seen that the best results were produced when n_l = 15, which is equal to the number of categories in the training data (15 in PASCAL-5i). As the number increased, the results became poorer. However, for every value of n_l tested, the performance of the proposed method improves on the results produced by the baseline model (60.8%, see Table 1).
Effects of loss weights. Table 7 shows the impact of different loss weights, α and β (see Eq. (12)), on the results. When α = β = 0, the loss function becomes equivalent to the baseline loss L_f, and the results produced are therefore identical to those of the baseline model. All combinations of non-zero values for α and β produced mIoU results that were better than those of the baseline. For the loss weights tested, the best results were produced with α = 1, meaning that the background and foreground information was weighted equally, and β = 0.5.
Background prototype from support images. Table 8 explores the effects of extracting background information from different images. In the baseline, background information was not used, and the results are the same as for the underlying FSS method. For the results labelled 'Support', the background information was extracted from the support image, rather than the query image. This was achieved by replacing the query features in Eq. (3) with the support features F_s, while keeping all other settings unchanged to allow a fair comparison. It can be seen that this method produces little improvement over the baseline. For the results labelled 'Query', the background information was extracted from the query image. This is our proposed QSR method of extracting background prototypes, which produces a more significant improvement in the results. Hence, extracting background information from the query image is more effective than extracting it from the support image. We believe that this is due to there being a diverse range of backgrounds against which objects from the same category can appear in different images. Extracting foreground and background information from different training images enables the decoder to be trained to correctly distinguish foreground objects from a larger variety of backgrounds.
Importance of prototype reconstruction. Table 9 shows the effects of using different methods to extract the background prototypes. The results labelled 'Mask' used the query image segmentation masks (which are available during training) to obtain the background prototypes directly. Specifically, masked average pooling (Eq. (1)) was used to generate background prototypes, replacing those generated by QSR in Eq. (3). The final loss function in Eq. (12) then becomes L = L_f + L_b. As Table 9 shows, this method improves the results compared to the baseline, which reinforces the idea that using background information can improve the training of the model. However, QSR provides a further improvement in the results, suggesting that the background prototypes created through the proposed method are more effective.

Table 5: mIoU (%) results for 1-shot and 5-shot FSS on COCO-20i. This table shows more detailed results, with performance on each fold, compared to Table 4. In addition, it also shows results for our proposed method when using ResNet-101 as the feature extraction backbone. This allows a more direct comparison with the published results for PFENet using ResNet-101.

Table 6
Effects of different numbers of latent classes, n_l.

Table 7
Effects of different loss weights, α and β.

Model Analysis
The following experiments to analyze QSR were performed with the PFENet baseline using the 1-shot setting on PASCAL-5i.
What latent classes represent. Latent classes (see Section 4.2) are used to represent classes that are undefined in the training dataset, but may correspond to unlabelled background features. To visualise these latent classes we identified the three highest scores (see Eq. (8)) for latent classes. We then generated a background prototype for each of these high-scoring latent classes in turn, and used those prototypes to segment the image. From the results for two example images, it can be seen that each latent class represents a certain area of the background. This shows that the latent weights do represent the unknown categories of the background. However, these categories do not correspond to meaningful categories that might be given distinct labels by a human. This is because QSR constrains the latent classes to be statistically independent from each other and from the known classes. This constraint does not force latent classes to correspond to specific background classes, but allows them to learn combinations of background features. It can also be seen that when the background prototype is generated using all non-foreground classes, in the way we propose, this prototype does an excellent job of identifying almost all background regions in the two example images. This is even the case (as shown for the chair example) when the situation is challenging due to the object occupying a very small proportion of the image and both the background and foreground in the query image having little similarity with the support image.
False positive rate. QSR uses background information during training in order to make the model more discriminative, so that the foreground prototypes extracted during testing are less likely to be matched with the background. The results shown in Table 10 demonstrate that QSR does indeed reduce the FPR compared to the corresponding baseline FSS algorithms.
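The FPR metric reported in Table 10 can be computed as follows. This is a minimal sketch; the function name and binary-mask layout are our own assumptions.

```python
import numpy as np

def false_positive_rate(pred, gt):
    """Fraction of ground-truth background pixels predicted as foreground."""
    background = (gt == 0)
    false_pos = np.logical_and(pred == 1, background).sum()
    return false_pos / max(background.sum(), 1)

pred = np.array([[1, 1], [0, 0]])  # predicted binary mask
gt = np.array([[1, 0], [0, 0]])    # ground-truth binary mask
fpr = false_positive_rate(pred, gt)  # 1 of 3 background pixels mispredicted
```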
Qualitative results. Fig. 5 shows some qualitative results. The far right column above the line shows an example of an unsuccessful segmentation, but one where the false positive rate is nevertheless reduced.

Conclusion
This paper proposes query semantic reconstruction (QSR) for few-shot segmentation. By associating the query image with semantics during training, QSR obtains background information from the query image to mine negative samples, producing a more discriminative model that reduces false positives. QSR improves the performance of three different baselines, and for one of them the improvement is sufficient to produce state-of-the-art results for both the 1-shot and 5-shot settings on PASCAL-5i. Future work might usefully explore improved methods of representing foreground objects, or the use of background information at test time. In addition, due to limited computing resources, we did not tune the number of latent classes (see Section 4.2) on COCO-20i; tuning this hyperparameter may produce better performance.

Table 10
False positive rate (%) results. The smaller the value, the lower the rate of mispredicting background regions as foreground. The results for the baselines were produced using the original authors' code, as no FPR results were reported in their papers.

Figure 1 :
Figure 1: Motivation for our method. Most previous FSS methods (as shown above the dashed line) use a decoder to classify features of the query image, by comparing them to a foreground prototype extracted from the support image and mask. This process often produces false positives: misclassifying the background (e.g. cat) as the foreground (e.g. dog). QSR (as shown below the dashed line) uses background information extracted from the query image at training time to learn a more discriminative decoder, which is achieved by semantic separation and foreground elimination.

Figure 3 :
Figure 3: Query semantic reconstruction (QSR). A query prototype is multiplied with the semantic class weights (optimized during training) to generate score values measuring the correlation between the prototype and each class. The score for the current foreground class is set to zero. The scores are then multiplied with the class weights to reconstruct a background prototype, eliminating any contribution from the foreground class. Note that the foreground class is one of the known classes, but is shown using a different colour for clarity.
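The reconstruction in Figure 3 can be sketched as follows. This is a NumPy sketch under assumed shapes; the names `W`, `q`, and `fg` are illustrative, not the paper's notation.

```python
import numpy as np

D, N = 16, 8                     # feature dim, number of classes (known + latent)
rng = np.random.default_rng(1)
W = rng.standard_normal((N, D))  # semantic class weights
q = rng.standard_normal(D)       # query prototype
fg = 2                           # index of the current foreground class

scores = W @ q                   # correlation of the prototype with each class
scores[fg] = 0.0                 # foreground elimination: zero the foreground score
bg_proto = scores @ W            # background prototype: score-weighted sum of weights
```

Because the foreground score is zeroed before the weighted sum, the foreground class weight vector contributes nothing to the reconstructed background prototype.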

Figure 4 :
Figure 4: Visualized results of latent classes and background predictions. For each class: (a) the prediction results for three latent classes; (b) the final background prediction; (c) the query image with foreground mask; (d) the support image with foreground mask.

Table 2
FB-IoU (%) results of 1-shot and 5-shot FSS on PASCAL-5i. 'Params' is the number of learnable parameters (values preceded by a plus show the number QSR added during training). '−' denotes results that were not provided in the original paper. For methods listed in Table 1 but not here, no relevant data was provided in the published work.

Table 8
Effects of different sources for background prototypes.

Table 9
Effects of methods to reconstruct background prototypes.