Sequential interactive image segmentation

Interactive image segmentation (IIS) is an important technique for obtaining pixel-level annotations. In many cases, target objects share similar semantics, yet existing IIS methods neglect this connection, and in particular the cues provided by representations of previously segmented objects, previous user interactions, and previous prediction masks, all of which can provide suitable priors for the current annotation. In this paper, we formulate a sequential interactive image segmentation (SIIS) task for minimizing user interaction when segmenting sequences of related images, and we provide a practical approach to this task using two pertinent designs. The first is a novel interaction mode: when annotating a new sample, our method automatically proposes an initial click based on previous annotations, which dramatically reduces the interaction burden on the user. The second is an online optimization strategy that provides semantic information when annotating specific targets, optimizing the model with dense supervision from previously labeled samples. Experiments demonstrate the effectiveness of regarding SIIS as a distinct task, and of our methods for addressing it.


Introduction
In the fields of image editing and data annotation, it is crucial to quickly obtain a high-quality pixel-level mask with little human effort. The community has therefore invested much effort in interactive image segmentation (IIS) technology, in which users participate in the segmentation process, iteratively providing interactive information to obtain good masks. To reduce the user burden, research into IIS focuses on two main principles. The first is to carefully design the interaction mode [1][2][3][4][5][6], so that users can provide more information at lower interaction cost. The second is to carefully design the back-end algorithm [7][8][9][10][11][12][13] to make the best use of the information provided by users.
In practice, users often annotate multiple related images, such as images of the same categories in a semantic segmentation task, or images sharing the same substructures in human or scene parsing tasks. Moreover, during inference, an IIS model ultimately obtains an almost exact mask, unlike the uncertain predictions in most computer vision tasks. These observations lead us to consider how information from previous annotations can assist the current annotation task, as illustrated in Fig. 1. Currently, this idea is largely neglected by IIS methods, which deal with each image independently, without considering the useful priors in previous annotations. Recent work [14] was the first to regard interactive segmentation as a sequence and to optimize parameters through user clicks; however, it only makes use of the click information and adopts the incomplete mask as regularization. We take this approach a step further and explore the interaction logic in this task sequence. Moreover, we propose to obtain an accurate mask in individual tasks to better optimize the parameters.

Fig. 1
In the annotation process, previous object representations, user interaction, and final prediction masks can inform the current interaction and prediction.
In this paper, focusing on the key principles of interaction mode and back-end algorithm, we propose a systematic solution to the SIIS problem with two corresponding modules: initial click proposal and online purification optimization. In terms of interaction mode, we design new interaction logic that greatly reduces the user interaction burden via an initial click proposal (ICP), as shown in Figs. 2 and 3. Specifically, ICP maintains a bank of initial click embeddings for each semantic target. When dealing with a new target, ICP uses similarity measurement to propose an initial click at the most likely position of a real interaction. If they accept the proposal, users can further interact to refine the result; otherwise, they may directly correct the proposal with a new click. In terms of the back-end algorithm, we propose an online purification optimization (OPO) strategy for sequential interactive segmentation based on previous interactive results, as shown in Fig. 2. OPO keeps a group of parameters for each semantic target to narrow the semantic gap. With increasing user annotation, our pipeline becomes more efficient for specific semantic targets.
Our contributions can be summarized as follows:
• A formulation of the sequential interactive image segmentation (SIIS) task with two key components, interaction mode and back-end algorithm.
• To improve interaction efficiency, we provide an initial click proposal (ICP) for SIIS to replace real user input.
• To better utilize interaction cues, we apply online purification optimization (OPO) to adapt the model to a specific semantic target using previous annotations.

Fig. 2
Pipeline of the proposed sequential interactive segmentation method. ICP: the initial click proposal is generated automatically and intended to reduce the user burden. OPO: online purification optimization continuously optimizes parameters specialized for the semantic target according to previous annotation masks, to ensure an adaptive semantic proposal space and efficient segmentation. c indicates two convolutional layers. Red and blue dots indicate clicks in foreground and background, respectively. The purple dot indicates the recommended click.

Fig. 3
Initial click proposal (ICP). An initial click proposal is based on the confidence map, a mean of multiple similarity maps measured by image features and all embeddings in the bank from previous initial clicks. The proposal acts as a real user click and provides an initial segmentation. If the proposal is acceptable, the user can further refine it by further interaction. In multi-target scenes, previous masks are erased from the confidence map so that ICP can propose a new initial click for the next target.

Interactive image segmentation
The field of interactive image segmentation has been explored for nearly two decades. Unlike automatic segmentation methods such as semantic segmentation [15], interactive methods take advantage of human participation [16] to segment a specified object. Such methods focus on two issues: interaction mode and back-end algorithm. The former aims to allow users to provide the maximum information with the least interaction. Traditional methods mainly employ scribbles [17][18][19][20][21][22] to denote background and foreground regions. Many variants, such as cross-instance scribbles [23], error-tolerant scribbles [24], bounding boxes [25], and automatic border lassos [26,27], have also been studied by the community. Recently, deep learning has brought stronger perception, making lightweight interaction modes possible. For example, the user can directly click on a target object to select it, and on the background to discard incorrect predictions [28]. Other lightweight modes, such as extreme points [1] and boundary clicks [2,3], have also been investigated. The IOG [6] method, which combines an outside bounding box with an inside click, also achieves excellent results.
Research on the back-end algorithm aims to maximize the use of the interactive information provided by users to make accurate predictions. Traditional methods are mainly based on color features [29][30][31][32]. Recently, methods based on convolutional neural networks [28], recurrent neural networks [33,34], graph convolutional networks [35], and reinforcement learning [36] have been applied to the IIS task. Various architectures are used, e.g., regional refinement blocks [37] and two-stream fusion [38]. Some research [7,8] tries to overcome ambiguity in interactive segmentation. Other important aspects have also received attention, such as training strategies [9], interaction maps [39], and user intents [10,11,40].

Interactive segmentation with online learning
Online learning has been used in many segmentation-related works [41,42]. For the IIS task, user interactions can be used as a reference for prediction and as a supervision signal for fine-tuning models.
BRS [12] first applied this idea to individual interactive segmentation. The model's prediction at the user's click may not be correct or sufficiently confident; fortunately, this uncertainty opens the possibility of model optimization. BRS uses incorrectly predicted clicked pixels as a penalty to fine-tune the input distance map, specializing it to the target object and ensuring that the prediction covers the interaction points well. f-BRS [13] improves BRS by back-propagating through part of the model instead of the whole model, making online training more efficient. Recently, Kontogianni et al. [14] introduced clicked-position supervision to image sequence segmentation, with promising results. Their method employs sparse supervision using only a few positive and negative clicks; these incomplete masks provide a regularization constraint in sparse optimization. However, interactive segmentation has the particular property that the user obtains complete masks after interacting with previous images. Our method goes a step further and directly uses all previous predictions as dense supervision instead of a few user clicks; our OPO approach is based on this idea, utilizing previous final masks to help segment subsequent images.

Method
This section introduces the proposed method. Section 3.1 introduces our modified DeepLab v3+ [43], specifically designed for sequential interactive segmentation. Section 3.2 describes our initial click proposal (ICP) method, which proposes an initial click to the user based on previous initial clicks in the image sequence. Section 3.3 describes our online purification optimization (OPO) method, which optimizes the purification parameters in the modified DeepLab v3+ via online training based on previous annotation masks.

Network architecture
Interactive segmentation essentially involves object segmentation. Most previous click-based interactive segmentation works [8,9,11,13,14] adopt DeepLab v3+ [43] as the segmentation network, using 5 channels (RGB image + positive/negative click maps) as input. This architecture usually works well. However, it has two problems in sequential interactive segmentation.
First, we should utilize feature correlation between images of a specific category. The traditional 5-channel input does not meet our requirements, because the annotation input disturbs the semantic features: correlations between interaction points are significantly enhanced, while semantic similarity largely disappears. Second, in sequential interactive segmentation, the parameters should be optimized continuously, and we need to save specific parameters for each category. Optimizing global parameters would significantly increase the storage burden (in memory or video memory) in real application environments.

For these reasons, we split the original architecture into two parts, a feature extraction part and an interactive segmentation part, as shown in Fig. 2; we call the result modified DeepLab v3+. For feature extraction, we adopt ResNet-101 [44] with an output stride of 16 as the backbone. The features of the last four layers have {256, 512, 1024, 2048} channels. These features are fed into a simple purification module containing a few 1×1 convolutions to reduce and purify the features for each specific category. The channel-reduced features are denoted {R_1, R_2, R_3, R_4}, with {32, 64, 128, 256} channels, 1/8 the size of the original ones. Using the annotation guidance maps (click maps) and these channel-reduced features, we apply a Mini-DeepLab v3+ module, whose architecture is similar to the original one. For the encoder module, the input is the annotation guidance map E_0, comprising two Gaussian maps based on the click points. The input features are gradually combined with the channel-reduced features, as in Eq. (1):

E_i = C(D(E_{i−1}) ⊕ R_i),  i = 1, …, 4,  (1)

where D(·) means down-sampling, ⊕ means feature concatenation, and C means two convolutional layers with kernel size 3×3. The output E_4 of the mini-encoder module is fed into a Mini-ASPP module. Unlike the original one, the down-sampling is only 1/4 instead of 1/8.
The Mini-ASPP output with 64 channels is finally concatenated with E 1 and convolved to give the final predictions. Although the modified DeepLab v3+ has some additions to the original, it is lighter as it has fewer channels. Because of the separation of feature extraction and interactive segmentation, more sequential operations can be implemented, and we can better explore sequential interactive segmentation.
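To make the architecture concrete, the following PyTorch sketch illustrates the purification module (1×1 channel-reduction convolutions, the part specialized per category) and one mini-encoder step of Eq. (1). All class and function names here are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Purification(nn.Module):
    """1x1 convolutions that reduce and purify backbone features per category."""
    def __init__(self, in_chs=(256, 512, 1024, 2048), out_chs=(32, 64, 128, 256)):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=1)
            for c_in, c_out in zip(in_chs, out_chs)
        )

    def forward(self, feats):
        # feats: backbone features from the last four layers
        return [conv(f) for conv, f in zip(self.reduce, feats)]  # R_1..R_4

def mini_encoder_step(E_prev, R_i, conv_block):
    """One step of Eq. (1): E_i = C(D(E_{i-1}) (+) R_i)."""
    # D(.): down-sample the previous stage to the reduced feature's size
    E_down = F.interpolate(E_prev, size=R_i.shape[-2:], mode="bilinear",
                           align_corners=False)
    # (+): concatenate, then C: convolve (two 3x3 layers passed in as conv_block)
    return conv_block(torch.cat([E_down, R_i], dim=1))
```

In this sketch, `conv_block` would be the two 3×3 convolutions denoted C in Eq. (1), instantiated once per encoder stage.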

Initial click proposal
In sequential interactive segmentation, reducing the user's burden via interaction logic is an important problem. Our initial click proposal (ICP) method maintains a click bank that records, for each category, the feature vectors at pixels where previous initial clicks were located; it is initialized as an empty bank. When the user intends to perform interactive segmentation for a specific category, the similarities between the feature vectors of all pixels and those of previous initial clicks in this bank are calculated, and the most similar pixel is marked as the initial click proposal for initial segmentation. If the proposal is unsatisfactory, the user can select an initial point manually instead. After the user selects the initial click manually, or adopts a suitable initial click proposal, the corresponding feature vector is stored in the click bank.

How should the initial click proposal be chosen? We utilize cosine similarity φ to find the recommended point. Suppose the target image is T and the chosen image features are F, with F(p) denoting the feature vector at pixel p. Then the recommended point p_n is given by Eq. (2):

p_n = argmax_{p ∈ T\I} (1/(n−1)) Σ_{i=1}^{n−1} φ(F(p), F(p_i)),  (2)

where p_n is the recommended point, p_1, …, p_{n−1} are the previous initial clicks, and I is the ignore mask, initially set to ∅ and used for segmentation of multiple instances. In practice, we perform mean filtering on the confidence map before selecting the maximum point, to suppress occasional extreme points. We use the last layer R_4 to generate channel-reduced features for the initial click proposal, as it provides more semantic information (this choice is discussed further later).

Different approaches can be used for basing interaction around initial click proposals. For example, if the user decides the recommended point is not satisfactory, the user can manually select an initial click using the middle mouse button, and continue to use the left and right mouse buttons for interactive segmentation.
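The proposal step, computing a confidence map as the mean of cosine-similarity maps against the click bank, smoothing it, and taking the maximum outside the ignore mask, can be sketched as follows with NumPy. The function name, the 5×5 mean-filter size, and the bank representation are illustrative assumptions.

```python
import numpy as np

def propose_initial_click(feat, bank, ignore_mask=None, k=5):
    """feat: (C, H, W) feature map; bank: list of (C,) feature vectors from
    previous initial clicks. Returns the (y, x) of the proposed click."""
    if not bank:
        return None                     # empty bank: user must click manually
    C, H, W = feat.shape
    f = feat.reshape(C, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    sims = []
    for v in bank:
        v = v / (np.linalg.norm(v) + 1e-8)
        sims.append((v @ f).reshape(H, W))   # cosine-similarity map
    conf = np.mean(sims, axis=0)             # mean over all bank entries
    # mean filtering to suppress occasional extreme points
    pad = k // 2
    padded = np.pad(conf, pad, mode="edge")
    smoothed = np.zeros_like(conf)
    for dy in range(k):
        for dx in range(k):
            smoothed += padded[dy:dy + H, dx:dx + W]
    smoothed /= k * k
    if ignore_mask is not None:
        smoothed[ignore_mask] = -np.inf      # skip already-annotated instances
    return np.unravel_index(np.argmax(smoothed), smoothed.shape)
```

For multi-instance scenes, the caller would grow `ignore_mask` with each finished annotation so that subsequent proposals land on new targets.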
For multi-category annotation, the ICP method maintains a click bank of initial clicks for each category. When the user selects a semantic tag for annotation, the corresponding initial click proposal is generated and shown. If multiple instances of the same category are present in an image, the ICP method can make multiple suggestions. As Fig. 3 shows, every time the user finishes an annotation A, the ignore mask I is updated to I ∪ A, so subsequent recommended points will not repeat previous ones.

Online purification optimization
For this first exploration of online training based on previous annotation results, we adopt the most concise training settings, similar to those used for training the baseline model. The core difference is that when training the baseline model, the numbers of foreground and background points are within [1, 10] and [0, 10], respectively, to better simulate user operations; for online training, they are within [1, 5] and [0, 5], to reduce the impact of interaction points. The batch size is set to 8, and we train for four iterations after each complete interactive segmentation. The images and instances are selected from all previously annotated ones. During online training, we use stochastic gradient descent for optimization, with a fixed learning rate of 5 × 10^−3.

Unlike other online learning methods, we only optimize a small set of parameters, those in the purification module; we call this online purification optimization (see Fig. 2). The purification module is composed of multiple 1×1 convolutions, and its parameters determine how the original features are extracted. In sequential interactive segmentation, the features on which each category depends are often different. Through this module, the features are regrouped and integrated, and its parameters are adapted through online learning. We call it a purification module because it extracts, from the original complex image features, those relevant to a specific category: before the module, the features are impure, representing all kinds of image content; after it, each reduced feature better represents a specific kind of object. As later shown in Fig. 6, the segmentation outcomes from the initial clicks reflect parameters fitted to the corresponding characteristics of the category.
In real usage, every time the user completes an instance segmentation, a round of online learning for that category is performed. For each category, the segmentation system saves a set of parameters for the purification module. Because the number of parameters is extremely small, as shown in Table 6, the storage space required is small and does not cause a burden. When the user selects a semantic tag to be annotated, the corresponding parameters are used. Online learning tasks usually struggle to run in real time, but interactive segmentation does not face this problem: in most cases, the time the user spends thinking is much longer than the time the computer spends processing during interactive segmentation. If a specific interval δ is used, so that the model trained on the first (n − δ) samples is used when segmenting the nth object, the effect of simultaneous user interaction can be fully achieved. Generally, as long as δ is greater than 1, real-time requirements can be met.
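A minimal PyTorch sketch of how per-category purification parameters might be stored and optimized online under the settings above (SGD, learning rate 5 × 10^−3, four iterations per round); the manager class and all names are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OPOManager:
    """Keeps one set of purification parameters per semantic category and
    runs one online optimization round after each completed annotation."""
    def __init__(self, purification: nn.Module, lr=5e-3):
        self.purification = purification
        self.lr = lr
        self.store = {}   # category -> state_dict of purification parameters

    def switch(self, category):
        """Load the saved parameters for the category about to be annotated."""
        if category in self.store:
            self.purification.load_state_dict(self.store[category])

    def online_step(self, model, batch, category, iters=4):
        # Only the purification parameters are optimized; the backbone and
        # the rest of the network stay fixed.
        opt = torch.optim.SGD(self.purification.parameters(), lr=self.lr)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(iters):
            images, clicks, masks = batch   # previously annotated samples
            pred = model(images, clicks)
            loss = loss_fn(pred, masks)     # dense supervision from old masks
            opt.zero_grad()
            loss.backward()
            opt.step()
        # snapshot the category-specialized parameters
        self.store[category] = {k: v.detach().clone()
                                for k, v in self.purification.state_dict().items()}
```

Because only the small purification module is snapshotted, the per-category storage cost stays negligible, which mirrors the argument made above.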

Datasets
Several datasets were used, as follows.
Augmented PASCAL VOC [45,46] is a widely used semantic segmentation dataset with 20 categories. Following previous works, we used the training set (25,832 instances) for training and the validation set (3427 instances) for testing.

COCO [47] is a large-scale dataset provided by Microsoft. We used three settings for testing. For individual interactive segmentation, we followed Ref. [11]. For sequential interactive segmentation, we followed Ref. [14], including COCO (Unseen 6k) and COCO (Donut, Bench, Umbrella, Bed).
CoSOD3k [48,49] is a dataset for co-salient object detection, having many categories. We selected the whole set with 4874 instances across 160 categories for the test.
CoCA [50] is another dataset for co-salient object detection; it has special categories which are atypical and do not appear in other datasets, ideal for studying independent semantic tasks. We selected the whole set with 2143 instances across 80 special categories for testing.
Fashionpedia [51] is a dataset of fashion images. We used the 8781 part masks across 46 categories in the validation set for testing, using it for exploring segmenting object parts in sequential interactive segmentation.
LeedsButterfly [52] is a dataset with 832 images of butterflies.

Metrics
For evaluating interactive segmentation, we used the same metric as most other interactive segmentation works. A robot user was used to select the next point in the center of the largest error region. We record the mean number of clicks (NoC) used in an interactive process until each instance reaches the specified intersection over union (IoU) score (represented as @XX%); lower values are better. The recorded NoC values when using online training are mean values over 5 experiments.
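The simulated robot user above can be sketched as follows with NumPy: find the largest connected region where prediction and ground truth disagree, and click its center. Interpreting "center" as the deepest interior point of the largest 4-connected error region (via iterative erosion) is our own assumption about the standard protocol.

```python
import numpy as np
from collections import deque

def next_robot_click(pred, gt):
    """Return (y, x, is_positive) for the center of the largest error region,
    or None when pred already matches gt."""
    error = pred != gt
    H, W = error.shape
    seen = np.zeros_like(error, dtype=bool)
    best_region, best_size = None, 0
    # find the largest 4-connected error region via BFS flood fill
    for sy in range(H):
        for sx in range(W):
            if error[sy, sx] and not seen[sy, sx]:
                region, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    region.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < H and 0 <= nx < W and \
                                error[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(region) > best_size:
                    best_region, best_size = region, len(region)
    if best_region is None:
        return None
    mask = np.zeros((H, W), dtype=bool)
    for y, x in best_region:
        mask[y, x] = True
    # "center": the pixel surviving the most erosions (farthest from the border)
    depth = np.zeros((H, W), dtype=int)
    cur, d = mask.copy(), 0
    while cur.any():
        d += 1
        depth[cur] = d
        inner = np.zeros_like(cur)
        inner[1:-1, 1:-1] = (cur[1:-1, 1:-1] & cur[:-2, 1:-1] & cur[2:, 1:-1]
                             & cur[1:-1, :-2] & cur[1:-1, 2:])
        cur = inner
    y, x = np.unravel_index(np.argmax(depth), depth.shape)
    return y, x, bool(gt[y, x])   # positive click if the pixel is foreground
```

The NoC metric then simply counts how many such clicks are needed before the IoU between `pred` and `gt` reaches the target threshold.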

Implementation details
We took ResNet-101 [44] pre-trained on ImageNet [53] as the backbone. We set the batch size to 8 and trained for 30 epochs, using binary cross-entropy loss for baseline training. We adopted exponential learning rate decay with an initial learning rate of 7 × 10^−3 and γ = 0.95 per epoch. For parameter optimization, we used stochastic gradient descent with momentum 0.9 and weight decay 5 × 10^−4. We cropped and resized images to 384 × 384 with random flip and random clip augmentation. For annotation simulation, we used a strategy similar to Ref. [11] and the same iterative training strategy as Ref. [9]. All experiments were implemented in the PyTorch [54] framework and run on a single NVIDIA Titan XP GPU.
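The reported optimization settings map directly onto standard PyTorch components; a minimal sketch (the stand-in model is illustrative, not the actual network):

```python
import torch
import torch.nn as nn

# Stand-in for the real network: 5-channel input (RGB + two click maps).
model = nn.Conv2d(5, 1, kernel_size=3, padding=1)

# SGD with momentum 0.9, weight decay 5e-4, initial lr 7e-3,
# exponential decay with gamma = 0.95 applied once per epoch.
optimizer = torch.optim.SGD(model.parameters(), lr=7e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
criterion = nn.BCEWithLogitsLoss()      # binary cross-entropy baseline loss

for epoch in range(30):
    # ... iterate over 384x384 crops with random flip / clip augmentation ...
    scheduler.step()                    # lr *= 0.95 at the end of each epoch
```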

Ablation study & discussion
We conducted ablation experiments on our core methods using the six selected datasets. Table 1 shows the NoC metric, for different target thresholds, for all datasets. Overall, regardless of dataset or target threshold (@85% or @90%), using ICP and OPO improves the results, demonstrating the effectiveness of the proposed methods. We now analyze the effects of the two core modules, ICP and OPO, on different types of datasets, taking the @85% results as an example.
When testing on the PASCAL validation set, whose categories are the same as those of the training set, the parameters in the purification module have been fully fitted to these seen categories. Here ICP is highly effective, giving an 11.37% improvement, but the improvement from OPO is quite limited. This is reasonable: the fully fitted features are well suited to providing the click proposal, while it is difficult to further improve these parameters through online training with limited samples of already-seen categories.
COCO and CoSOD3k have a small number of categories that overlap with the training set, while the classes in CoCA are unique. These three datasets are rich in categories, but the number of samples in each class is limited. The improvements are 7.04%, 10.61%, and 4.67% with ICP, and 7.70%, 7.97%, and 13.66% with OPO; combining ICP and OPO, they reach 14.32%, 19.09%, and 18.21%. These results show that our method brings noticeable improvements even when there are few samples per category.
Fashionpedia is the most difficult, because the segmentation targets are fashion items while the training samples are whole instances, in particular entire human bodies. The improvement is only 1.33% with ICP, but reaches 18.25% with OPO. We speculate that the neural network tends to treat the human body as a unified category, so feature similarity causes mismatches for clothing items, accessories, etc. Parameter optimization is still helpful in improving the results for such objects.
For LeedsButterfly, which contains only a few butterfly categories, the improvement brought by ICP and OPO is significant. OPO brings a 42.04% improvement over the baseline. ICP added to the baseline yields a 36.62% improvement, with a ∆NoC of 0.79; after adding OPO, the benefit of ICP is even more significant, reaching 76.52% with a ∆NoC of 0.96.

Q1: Is the modified network architecture efficient? Table 2 shows the primary metrics for the two network architectures, including the number of parameters (Params), floating-point operations (FLOPs), and seconds per click (SPC). Our modified network is more lightweight. This does not mean that our network is better than the original one; rather, the unique design of our ICP and OPO modules requires this change.
Q2: Could features from another layer of the backbone network be used for the ICP module? Table 3 explores this issue using features from other layers of the purification module. Results are best with the R_4 features, followed by R_3, R_1, and R_2. This is consistent with our intuition: the highest-level feature information is most conducive to a good initial click proposal. An additional experiment using multi-scale features for the initial click proposal further improves results.
Q3: Can ICP really save time for users and reduce the interaction burden? Table 4 shows the results of a user study verifying the effectiveness of the proposed ICP module. 40 images in 4 categories from the COCO [14,47] dataset were selected as the study set; for each category, half of the data provided a correct recommendation point and the other half did not. 20 volunteers participated; each was asked to complete two tasks: (a) find and click on an object of the given category in randomly provided pictures, and (b) judge whether the recommended point was correct, given randomly provided pictures with corresponding recommendation points. Results show that judging a recommended point as wrong takes longer than judging it as correct, but both take less time than clicking on the object directly. This indicates that the ICP module can reduce user interaction time in practical applications.

Q4: How does the quality of initial clicks affect the final segmentation? We carried out additional experiments replacing initial click proposals with random clicks, using two random strategies for comparison: selecting a random click from the full image, and replacing a correct initial click proposal with a random point on an object of the given category. Mean results over five experiments are shown in Table 5. Results are worse when the random click is selected from either the full image or the foreground. Because the ICP module is mainly used to locate objects of the given category, the difference between selecting from the foreground and from the full image is relatively minor.
Q5: Why do we only optimize the purification parameters? Table 6 compares optimizing the purification parameters with optimizing global parameters. The number of purification module parameters (Params) is less than 2% of that of the whole network, yet optimizing only these causes no significant loss in results. For sequential interactive segmentation, unique parameters must be saved for each category, so such a small parameter count is clearly appropriate. With fewer parameters, optimization speed, measured in seconds per batch (SPB), is also faster, which further helps the task.

Q6: Does performance improve with increasing online training data? Figure 4(a) shows NoC trends on the LeedsButterfly data when online training is stopped after a specified number (abscissa) of samples. With increasing online training data, results initially improve, later reaching a stable state.
Q7: Does the method require access to the whole training dataset during online learning? No: we can set up a memory bank, and training samples can always be chosen from this bank. Figure 4(b) illustrates the NoC trend on LeedsButterfly for different memory bank sizes. Results suffer if the bank is too small, but beyond a certain size they are stable and suitable for practical applications.

Table 7 and Fig. 6 provide quantitative and qualitative comparisons of our method to other state-of-the-art methods. Here, we further elaborate on the inference process of "B+OPO". When the user labels an image, the OPO module switches to the specific parameters for the current working category. The user iteratively provides foreground and background clicks for annotation; after each interaction, the image and click information are input to our network to generate the corresponding mask, and the user can continue to add clicks for refinement until the mask meets the user's needs. The online training phase is carried out after a new image has been annotated completely: the newly annotated image and several previous images are randomly chosen for an online training step. As the previously annotated masks satisfied the user, they can naturally serve as ground-truth labels for supervision. We randomly simulate clicks in these images according to the corresponding annotated masks; the images and these clicks are fed into the network as in the standard training phase, generating the predictions for optimization. We employ binary cross-entropy loss between predictions and previously annotated masks. Note that, during online training, only the parameters of the OPO module are optimized, through stochastic gradient descent, to become more suitable for this category, while the parameters of other modules are fixed. As the process continues, labeling of objects becomes better and better.
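The memory-bank idea answering Q7 can be sketched as follows; the capacity, record format, and class name are illustrative assumptions, not details taken from the paper.

```python
import random

class AnnotationMemoryBank:
    """Fixed-size memory bank per category: online training samples are drawn
    from here, so the whole annotation history need not be stored."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.bank = {}            # category -> list of (image, mask) records

    def add(self, category, record):
        items = self.bank.setdefault(category, [])
        items.append(record)
        if len(items) > self.capacity:
            items.pop(0)          # drop the oldest annotation

    def sample(self, category, batch_size=8):
        items = self.bank.get(category, [])
        if not items:
            return []
        return random.sample(items, min(batch_size, len(items)))
```

Each completed annotation is `add`ed, and each online training round `sample`s its batch from the bank, matching the observation that results stabilize once the bank exceeds a modest size.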
Table 7 shows the NoC metric for IIS methods on these category-rich datasets. Due to the different emphases of individual and sequential interactive segmentation, we provide these results only as an intuitive comparison; these methods are carefully designed and focus on individual IIS.

Comparison to individual IIS methods
Our method mainly addresses the problem of sequential interactive segmentation and makes some modifications to the basic network. Its performance is comparable to or even surpasses these cutting-edge methods, showing that regarding interactive segmentation as a sequential process is beneficial.

Table 8 provides a comparison to the only other state-of-the-art method [14] for sequential interactive segmentation. This method utilizes correction clicks for online training, while our approach goes a step further and thus achieves better results. Because our method focuses on semantic objects, we mostly use semantic data for this comparison. Through denser supervision, we make the model more sensitive to each category of objects, and thus perform better.

Figures 5 and 6 show some results of the proposed method. In Fig. 5, we see that the initial click proposals are located in the interiors of objects of the chosen categories, such as animals and vehicles, which can reduce the interaction burden on the user. Figure 6 compares segmentation results with the baseline and other methods using 1, 2, and 3 clicks. With optimized parameters, our method achieves more accurate results for the same number of interactions, both for whole objects (first three rows) and object parts (fourth row). It can also help users to segment a target from a scene with multiple instances (third row).

Conclusions
In this paper, we formulate the task of sequential interactive image segmentation (SIIS). To solve it, we systematically explore SIIS from the points of view of interaction mode and back-end algorithm. For the former, we provide an initial click proposal (ICP) module, which utilizes the semantic embeddings of previous target objects to recommend an initial click to serve as input for the current annotation. For the latter, we put forward an online purification optimization (OPO) module, which fine-tunes model parameters for each target category using previous accurate annotations. Extensive experiments demonstrate the utility of the sequential interactive image segmentation concept and the effectiveness of our method.

Declaration of competing interest
The authors have no competing interests to declare that are relevant to the content of this article.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.