A survey of recent interactive image segmentation methods

Image segmentation is one of the most basic tasks in computer vision and remains an initial step of many applications. In this paper, we focus on interactive image segmentation (IIS), often referred to as foreground-background separation or object extraction, guided by user interaction. We provide an overview of the IIS literature by covering more than 150 publications, especially recent works that have not been surveyed before. Moreover, we try to give a comprehensive classification of them according to different viewpoints and present a general and concise comparison of the most recently published works. Furthermore, we survey widely used datasets, evaluation metrics, and available resources in the field of IIS.


Introduction
The main goal of image segmentation is to divide an image into homogeneous regions according to common characteristics such as spatial position, color, shape, texture, and motion (in the case of video segmentation). Emulating the human perceptual system's ability to segment or divide an image into meaningful regions is still challenging and has been widely studied since the early days of computer vision. While many works, including recent ones, have been presented in the literature to review segmentation algorithms [1,2], semantic segmentation techniques [3], and medical image segmentation [4,5], few works have been dedicated to interactive image segmentation (IIS) methods, although research in this area is very active and periodic, up-to-date overviews remain necessary. In fact, except for the comparative evaluation in Ref. [6] of some IIS techniques proposed before 2010 and the work of Ref. [7] in 2013, the only work to do so, as far as we are aware, is Ref. [1], which briefly addressed IIS as part of its survey.
IIS, "supervised segmentation", and "semiautomatic segmentation" all refer to the task of extracting an image region or object of interest from the background (BG) using prior knowledge provided by user interaction. This interaction, either in the form of points or scribbles marking the object of interest and/or the BG, or in the form of a bounding box (BB) or polygon delimiting the region of interest (ROI), allows the user to provide useful constraints (on size, color, location, objectness, etc.) to guide the segmentation process. This can improve results as well as reduce runtime compared to automatic segmentation methods [7]. In fact, many computer vision applications (medical imaging, image editing, object recognition, and object tracking) need such user intervention to obtain accurate segmentation results, which are then used as input for other high-level processing.
IIS methods can be classified in different ways depending on the criteria used.
User interaction: the type of user interaction required can be used to divide methods into seed-based and ROI-based approaches [8] or into active and passive interaction-based approaches [9].
Methodology: the methodology used to segment the desired object can be based either on contours or label propagation [1]. In this work, IIS methods are divided into: contour, graph cut (GC), random walk (RW), and region merging (RM)/region growing (RG) based methods. In addition to these categories, with the recent great success of convolutional neural networks (CNNs) [10] in image segmentation and classification, a new family of CNN-based IIS methods has appeared, providing high accuracy. IIS techniques that do not adopt any of the above methodologies are classified as "other methods".
Processing level: IIS techniques can be categorized into pixel-wise, superpixel-wise, or hybrid (pixel/superpixel) methods according to the type of image units used in segmentation [7]. Figure 1 summarizes the different categories of IIS methods according to these criteria.
Due to the importance and progress of research in the field of IIS, the aim of this work is to update the literature in this area and to present a recent reference for researchers. We do not tackle the task of interactive video object segmentation (VOS) which usually takes motion information into consideration to segment the desired object. For a good survey of VOS methods, the reader can refer to Ref. [11]. Furthermore, the application of IIS to medical imaging is not addressed in this survey; it has already been discussed in Refs. [4,5]. In the rest of this paper, we will review the different families of IIS methods classified according to user interaction (Section 2), the adopted methodology (Section 3), and the level of processing (Section 4). Moreover, the most frequently used datasets and evaluation metrics will be reviewed in Sections 5 and 6 respectively. Existing evaluations of IIS methods in the literature and a general comparison of the most recent works are presented in Section 7. Before concluding, Section 8 provides links to code and software for different IIS methods.

Classification of IIS methods based on user interaction
To define the image content to be segmented, IIS methods need some kind of user interaction. Many kinds of interaction can be involved, for example some points, line segments, or strokes to mark the object and/or the BG. These interactions provide seeds. Another popular type of interaction is to delimit the desired object with a BB, polygon, or any closed contour to define the ROI. According to the nature of the user interaction, IIS techniques can be divided into seed-based and ROI-based methods. From another perspective, some IIS methods provide the possibility for active intervention or online assistance from the user during the segmentation process [9]. Such methods can be called active interaction-based approaches, while others are considered to be passive. In this section, we consider different IIS methods according to the type of interaction without giving details of the algorithms; these will be addressed in the methodology classification section.

Seed-based methods.
Early seed-based IIS approaches include intelligent scissors [12] and live wire [13]. These methods can be classified as boundary seed-based methods since they require accurate seed points (or anchor points) on the boundary of the desired object. In the same subcategory, Ref. [14] is a variant of Ref. [13] which improves its speed, and Riverbed [15] requires fewer seeds to segment the object. The other subcategory of seed-based IIS methods is region seed-based approaches. These methods do not impose hard constraints on user interaction like the first family, but the result of segmentation is very sensitive to the number of seeds, and different interactions may give different results. The seeds may be just a few points [16][17][18][19][20][21][22][23][24][25][26] or segments of points (scribbles, strokes). Examples of region seed-based methods are interactive GC (IGC) [49], random walk (RW) [50], and their many variants. These approaches exploit priors from object and BG regions marked by the user, together with other constraints, via an energy minimization framework to segment the object of interest. Many other works using active contours (AC) [72][73][74] or region merging (RM) [75,76] are also region seed-based methods. An example of a hybrid that combines boundary seeds and region seeds is proposed in Ref. [77]. Based on initial boundary seeds, the object contour is traced, then a set of foreground (FG) and BG region seeds is selected from adjacent contour points, and the final segmentation is achieved using GC [49] or an image foresting transform [78].
ROI-based methods. Another popular way to guide segmentation is to draw an ROI, and then prior knowledge about FG and BG is extracted from pixels inside and outside the ROI, respectively. GrabCut [79] is the most widely used IIS method; it allows the user to draw a BB separating the object of interest from the BG. Many works have been presented in the literature to improve GrabCut's performance, for instance, One-Cut [80], Loose-Cut [81], Point-Cut [82], Super-Cut [83], Dense-Cut [84], Deep-Cut [85], Deep GrabCut [86], and Neutro-Connectedness Cut (NC-Cut) [8]. Also using a provided BB, the authors in Refs. [87,88] solved object segmentation as a figure-ground classification problem. As discussed in Ref. [81], the performance of many BB-based IIS techniques degrades when the input BB does not tightly cover the FG object. To overcome this issue, some works introduce the concept of a tightness prior, such as Pin-Point [89] and Mil-Cut [90]. Since ROI-based methods use a simpler form of interaction, there are many such approaches, and our list is not exhaustive [91][92][93][94][95][96]. Some methods in the literature accept either seeds or an ROI as user interaction [53,97-100]. Figure 2 gives some examples of the different modalities of user input for IIS.
In general, it is easier for users to indicate the candidate object using an ROI by making some mouse clicks to specify a BB or a polygon. However, due to the complexity of the object boundary and its appearance, segmentation accuracy is usually limited by how tightly the ROI is delimited [81]. Therefore, many ROI-based approaches integrate iterative refinement steps [8,79] and/or add new constraints to their framework to achieve accurate object segmentation [8,81].
On the other hand, seed-based algorithms can tackle complex-shaped objects as long as sufficient user inputs are given. Sometimes more rounds of interaction are needed than for ROI-based algorithms [101]. However, some authors have fused multiple cues (edges, regions, and geometric cues) in their framework to segment an object of interest driven by a single touch [102]. Another work [103] minimizes the use of user-provided seeds by first generating many segmentations of the input image automatically, and then allowing the user to click on the boundary of the desired object to refine the final segmentation. Others have exploited all modalities of user interaction by integrating BB-based, seed/scribble-based, and query-based (see Section 2.2) mechanisms in a unified scheme for fast IIS [104].
Regardless of the interaction modality, the key challenge of any IIS method is to minimize the amount of effort by the user while accurately segmenting the desired object. Recently, this goal has been increasingly achieved by DL-based techniques [17,23,105-109], which have demonstrated high performance with minimal user input.

Active interaction-based methods versus passive interaction-based methods
The principle of active IIS is to give an initial estimate of the image segmentation, on the basis of which human-machine interaction is then required to capture the user's intent. The authors in Ref. [110] propose an active IIS method which suggests uncertain regions to the user based on non-local uncertainty measurements, and then watershed cut [111] is applied to get the final segmentation, guided by user selection. Another method using this kind of segmentation is introduced in Ref. [112], which asks the user some binary questions based on the probability distribution over a set of sampled segmentations, and then, based on the user's responses, the computer estimates the labels of image regions. In Ref. [9] and its extended version [113], IIS from 1-Bit Feedback is presented using superpixels, entropy, and transductive inference to propose a sequence of informative yes-no questions to the user. Segmentation is then performed progressively according to the user's answers until the desired result is obtained. The work presented in Ref. [114] incorporates the user's feedback via a constrained spectral clustering method to achieve IIS. The strategy used alternates between updating the segmentation and requesting the user to select pairwise constraints. The process finishes when the user is satisfied with the segmentation result or when a maximum number of iterations is reached. More recent work in Ref. [104] starts with a user ROI covering the desired FG object. Then, the proposed system provides an active user-assistance mechanism. The most uncertain region is presented as a query for the user, who responds with a true-or-false answer to label it.
A few works use online query-based human-machine interaction to guide segmentation. Other related works have used this concept for image cosegmentation [115], video segmentation [116], and 3D reconstruction [117]. In passive interaction-based methods, which make up the majority of IIS techniques, it is the user's responsibility to choose the scribbles that guide segmentation; these provide the input used by the computer to perform segmentation. Results are updated when the user modifies the input strokes.
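As an illustration of the query step shared by these active methods, the following minimal Python sketch selects the most uncertain region by binary entropy and clamps it to the user's yes/no answer. It is not the algorithm of any cited work: the region names, the foreground probabilities, and the helper functions `binary_entropy`, `next_query`, and `apply_answer` are hypothetical, and a real system would also propagate each answer to neighboring regions.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def next_query(fg_prob, answered):
    """Pick the unanswered region whose label is most uncertain."""
    candidates = [r for r in fg_prob if r not in answered]
    return max(candidates, key=lambda r: binary_entropy(fg_prob[r]))


def apply_answer(fg_prob, region, is_foreground):
    """Clamp the queried region to the user's yes/no answer."""
    fg_prob[region] = 1.0 if is_foreground else 0.0


# Hypothetical per-region foreground probabilities from an initial estimate.
fg_prob = {"r0": 0.95, "r1": 0.55, "r2": 0.10, "r3": 0.30}
answered = set()

query = next_query(fg_prob, answered)      # region closest to p = 0.5
apply_answer(fg_prob, query, True)         # simulated "yes" from the user
answered.add(query)
second = next_query(fg_prob, answered)     # next most uncertain region
```

Each answer carries one bit of information, so asking about the highest-entropy region is a natural greedy choice under these assumptions.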
Although passive interaction is widely used not just for the task of image segmentation, but in many computer vision applications, active interaction presents a very important alternative in some specific scenarios where input devices that receive binary signals can be used to collect user responses and exploit them without touching the computer. For example, in a sterilized environment, a physically touched computer control for medical image segmentation is inappropriate, or, for instance, when tiny screens on wearable computers have limited interface capabilities [9,113].

Classification of IIS methods based on methodology
Based on the adopted methodology, IIS methods can be classified in different ways [7]. Here, we divide IIS techniques into contour-based methods, GC-based methods, RW-based methods, RG/RM-based methods, deep learning (DL) techniques, and others. We try to discuss the majority of the methods in these categories, especially recent works that have not been covered in previous surveys. Our proposed classification is not strict because many IIS methods employ a variety of techniques in their algorithms, so can belong to more than one category.

Contour-based methods
The main principle of contour-based methods is to extract object contours using edge features and prior knowledge provided by the initial user interaction. For example, intelligent scissors [12], which can be seen as an implementation of live wire [13], has been integrated successfully as a tool in GIMP [118], and is one of the earliest contour-based methods. The object contour is extracted by calculating the shortest path linking the seed points using Dijkstra's algorithm. A variant of Ref. [13] is proposed in Ref. [14] using a faster shortest path algorithm for improved speed. Riverbed [15] is another IIS method based on boundary seeds which requires fewer user interactions and uses an optimum boundary tracking process to extract the object contour.
The supervised active contours model (ACM) is the best known family of contour-based IIS methods. For instance, interactive convex active contours (CAC) [72] takes as input user FG and BG strokes and the segmentation result obtained by any other IIS technique such as Ref. [53] or [50]. Then the energy equation of the convex active contours in Ref. [119] is reformulated to take into account prior knowledge from the user interaction and the probability map of the initial segmentation. Finally, evolution of the contours is performed by minimizing the energy using the split Bregman method [120]. To improve the performance of the CAC method, the authors in Ref. [121] introduce a geodesic energy term in their model; a seed refinement technique is used to update the segmentation iteratively. Recently, to tackle the case of noisy images with inhomogeneous intensity, the authors in Ref. [122] proposed a multi-phase level-set model by combining denoising a constrained surface with a denoising fidelity term. In Ref. [74], an interactive ACM with kernel descriptor is presented. First, a color kernel descriptor is employed to compute image patch features, and these patches are grouped into clusters. Then, template feature sets are used to formulate an energy functional and evolution is performed via the level set method based on the initial contour provided by user interaction.
Using an ACM, the work of Ref. [123] presents a selective segmentation method based on the Chan-Vese model [124] and the geodesic ACM [125], which can segment noisy images given an initial contour and some boundary object seeds. Another supervised ACM based on a self-organizing map (SOM) is presented in Ref. [126]. It combines a variational level set method with the weights of the neurons of two SOMs to preserve the image intensity distribution. The authors in Refs. [127] and [128] respectively use nonparametric kernel density estimation (KDE) and a parametric Gaussian mixture model (GMM) in the level set framework to guide evolution of the contour. Minimum paths in the ACM framework have demonstrated good performance for boundary extraction in many works, such as the geodesically linked ACM introduced in Refs. [129] and [130] and the Finsler minimal path model [131][132][133]. The method in Ref. [134] incorporates discriminative classification models and distance transforms with the level set to avoid local minima and extract accurate object boundaries. Another IIS framework proposed in Ref. [135] uses level sets and the Dempster-Shafer theory of evidence. Recently, a novel local region-based ACM for supervised segmentation that uses Bayes' theorem was proposed [136].
Contour-based methods for IIS are very efficient for accurate boundary extraction and very suitable for deformable object segmentation. Several methods based on ACM have been introduced in recent years to deal with intensity inhomogeneity and the presence of noise [122,126]. Common shortcomings of this category of methods are the need for manual adjustment of the initial parameters and a lengthy processing time [137].
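The shortest-path principle behind live wire and intelligent scissors can be sketched in a few lines of Python. The example below is a simplified illustration rather than the implementation of Refs. [12,13]: it assumes a hypothetical per-pixel cost map (low values on strong edges), 4-connectivity, and a single pair of boundary seeds, whereas the original methods use richer local cost terms.

```python
import heapq


def live_wire_path(cost, start, goal):
    """Minimum-cost 4-connected path between two seed pixels (Dijkstra).

    cost[r][c] is a hypothetical per-pixel cost, low on strong edges.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal seed to the start seed.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy cost map: the cheap pixels (cost 1) trace an object boundary.
cost = [
    [9, 9, 9, 9],
    [1, 1, 9, 9],
    [9, 1, 1, 1],
    [9, 9, 9, 9],
]
path = live_wire_path(cost, (1, 0), (2, 3))
```

In an interactive tool, the path between consecutive seeds would be recomputed as the user moves the cursor, so that the contour snaps to nearby edges.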

GC-based methods
The first GC-based IIS method (IGC) [49,138] solved the problem of image labelling in the Markov random field (MRF) framework [139] by optimizing the following energy functional using the min-cut/max-flow algorithm [140]:
$$E(L_b) = \sum_{i} U_i(l_i) + \sum_{(i,j) \in N_s} V_{i,j}(l_i, l_j) \qquad (1)$$
where $L_b = \{l_i\}$ is the final segmentation of the image and $l_i \in \{0, 1\}$ such that $l_i = 1$ if the $i$th pixel belongs to the FG and $l_i = 0$ otherwise.
The first term of the energy equation, $U_i$, is a unary potential calculated from the intensity histogram. The second term, $V_{i,j}$, is the pairwise potential encouraging spatial coherence. $i$ and $j$ index pixels and $N_s$ is the set of pairs of adjacent pixels.
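To make the roles of the unary and pairwise potentials concrete, the following Python sketch evaluates such an energy on a five-pixel one-dimensional "image" and finds the minimizing labelling by exhaustive search. This is an illustration under stated assumptions only: the unary costs are made up, the pairwise term is a constant Potts penalty weighted by a hypothetical `smooth_weight`, and exhaustive search stands in for the min-cut/max-flow solver used in practice.

```python
from itertools import product


def energy(labels, unary, pairs, smooth_weight):
    """Unary costs plus a Potts penalty for each pair of disagreeing neighbours."""
    data = sum(unary[i][label] for i, label in enumerate(labels))
    smooth = sum(smooth_weight for i, j in pairs if labels[i] != labels[j])
    return data + smooth


def brute_force_segment(unary, pairs, smooth_weight):
    """Exhaustively search all binary labellings (only feasible for toy sizes)."""
    n = len(unary)
    return min(product((0, 1), repeat=n),
               key=lambda labels: energy(labels, unary, pairs, smooth_weight))


# unary[i] = (cost of labelling pixel i as BG, cost of labelling it as FG).
# Pixel 1 weakly prefers FG although its neighbours clearly prefer BG.
unary = [(0.2, 2.0), (1.0, 0.8), (0.3, 1.6), (1.9, 0.2), (1.7, 0.3)]
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]

without_smoothing = brute_force_segment(unary, pairs, smooth_weight=0.0)
with_smoothing = brute_force_segment(unary, pairs, smooth_weight=0.5)
```

Without the pairwise term, the isolated pixel 1 is labelled FG; with it, the labelling becomes spatially coherent with a single FG/BG transition.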
GrabCut [79] extends IGC by replacing the monochrome histogram used for the unary potential in Eq. (1) with Gaussian mixture models (GMMs) computed from the BG and FG regions. Furthermore, GrabCut applies an iterative framework by alternating between optimization and GMM estimation until convergence. Lazy snapping (LS) [59] was proposed to improve both speed and accuracy of IGC by processing image regions instead of pixels, using the watershed algorithm for over-segmentation [141]. GMMRF [142] is an adaptive Gaussian mixture Markov random field model based on a pseudo-likelihood algorithm [143] that aims to learn GC parameters from image data before optimizing the energy functional. PinPoint [89] employed convex continuous optimization and GC to tackle the task of IIS under hard constraints with respect to the tightness of the provided BB. The authors in Ref. [144] applied the same segmentation scheme as GMMRF except that an initial optimization step is included. This initial step combines the standard k-means algorithm [145] with swap moves [146] to find a clustering that leads to histograms which better separate the FG region from the BG. GrabCut in One Cut (One-Cut) [80] proposed a new energy term using the $L_1$ distance between FG and BG appearance models that can be optimized in a single GC step, assuring good segmentation and fast running time. LooseCut [81] followed the MRF used in GrabCut [79] to handle cases where the BB only loosely covers the object of interest. To do so, the authors added a label consistency term to the energy function to give the same label to pixels with similar appearance. Furthermore, a global similarity constraint is imposed in the iterative process to distinguish FG and BG according to the difference between the appearance models. The work of Ref. [61] proposed discriminative Gaussian mixtures (DGMs) to boost the performance of Ref. [49]. It uses automatic feature selection, automatic selection of the number of models in the mixture, and parameter estimation to maximize the discriminant power of the models.
In Ref. [64], a diffusive likelihood based GC method is proposed to perform accurate object segmentation from input seeds using a likelihood diffusion strategy and perceptual learning. First, initial probabilities of both pixels and superpixels are estimated using a GMM. Then, a diffusion technique is applied to explore global similarity relationships. Finally, the segmentation is obtained by optimization of an energy function at pixel and superpixel levels using the GC algorithm. The method of Ref. [147] combines both color and texture information in the GC model and incorporates AC in the segmentation process to address the case of textured images. Using pixel-level and patch-level information, Ref. [148] generates structural features of image patches using a GMM. Then, patch-level and pixel-level information are integrated in the GC framework, allowing preservation of details around boundary regions and improving the accuracy of the segmentation result. Superpixel-guided IIS based on GC is presented in Ref. [31]; it achieves segmentation in two stages. First, an initial segmentation is obtained at the superpixel level using GC, and then a narrow band is constructed along object contours using a morphology operator. Second, pixel-level GC segmentation provides accurate segmentation of pixels around the edges. Super-Cut [83] was proposed to improve GrabCut. It first over-segments the input image into superpixels, and then clusters them using a novel local similarity constraint. Finally, segmentation is finished in one cut using the clustered results.
GC-based methods with geodesic priors exploit the compactness of objects as a spatial constraint and use geodesic distance, instead of Euclidean distance, to compute the lowest cost path between two points. For example, in Ref. [54] the authors integrate the geodesic distance in the data term of the energy function; Ref. [53] extends the same principle to perform soft segmentation. Geos [149] is another geodesic-based method that computes the geodesic distance to obtain a set of restricted image segments, and obtains the final segmentation by finding the solution minimizing the energy cost. Geodesic star convexity proposed in Ref. [55] employs the geodesic distance in addition to shape constraints to perform robust IIS. Recent work in Ref. [150] combines the advantages of Refs. [54] and [80] to develop a geodesic appearance overlap GC framework for IIS.
GC-based methods with shape priors impose shape constraints for IIS. For example, Refs. [55,57,151,152] take into consideration the fact that most objects are convex and use star convexity constraints on IGC to improve the connectivity of segmented FG objects either using single star convexity [140] or multiple stars [55]. To incorporate prior shape knowledge, Ref. [153] integrates graph edge-weights containing information about a level-set function of a template, in addition to the boundary and region terms in the GC formulation. In Ref. [97], an adaptive optimal shape constraint selection system uses a nonrigid shape registration technique and local shape consistency evaluation to optimize the GC model. To handle objects with compact shape, Ref. [154] includes a compact shape prior in the GC framework to achieve robust segmentation with a minimum of user seeds.
GC-based methods with connectivity priors exploit topological properties of the image. Connectivity in digital topology [155] defines adjacency between points and it is usually used to solve the shrinking bias problem in GC: see Refs. [29,57]. Recently, Ref. [8] introduced the concept of neutro-connectedness (NC) from classic logic to generalize the notion of fuzzy connectedness (FC) proposed earlier in Ref. [156], used for medical image segmentation in Refs. [157,158]. NC-Cut models the topology of image regions with indeterminacy, and then segmentation is performed using both pixel-wise appearance models and region-based NC in the GC framework. EISeg [19] improves the effectiveness of user interaction by computing the NC between image regions and image boundary and provides visual cues so that the user can guide the segmentation process with a minimum number of seeds.
GC-based methods are the most popular approaches for IIS. Our list of works discussed is incomplete and many other ideas have been proposed in the literature to extend or improve the efficiency of GC [28,39,51,52,56,63,92,96,159].

RW-based methods
The original random walk model [50] starts by constructing an undirected graph $G = (V, E)$ to represent the input image. $V$ is the set of nodes; node $v_i$ designates pixel $i$. $E$ is the set of edges; each edge connects two neighboring pixels $(i, j)$ and has a pairwise weight $w_{i,j}$ reflecting the probability of a random walk stepping between these two nodes. The principle of RW is to find the set $x$ of probabilities, where $x_i$ is the probability that a random walker starting at node $i$ first reaches a seed carrying a given label. The solution is obtained by minimizing the following function:
$$D(x) = \frac{1}{2} x^{T} L x = \frac{1}{2} \sum_{(i,j) \in E} w_{i,j} (x_i - x_j)^2 \qquad (2)$$
where $L$ is the combinatorial Laplacian matrix given by
$$L_{i,j} = \begin{cases} d_i & \text{if } i = j \\ -w_{i,j} & \text{if } v_i \text{ and } v_j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}$$
with $d_i = \sum_j w_{i,j}$ the degree of node $v_i$. Letting the matrix $L$ be divided into blocks corresponding to the marked (seeded) nodes $M$ and the unmarked nodes $U$:
$$L = \begin{bmatrix} L_M & B \\ B^{T} & L_U \end{bmatrix}$$
the result of the minimization of Eq. (2) can be obtained by solving a sparse linear system:
$$L_U x_U = -B^{T} x_M$$
To handle images with complex texture, RW with restart (RWR) [160] introduces the restart probability that the RW will return to the starting node or walk out to an adjacent node. In Ref. [68] image content is modelled using a directed hypergraph adapted to semi-supervised segmentation using an RW process. Sub-Markov RW (SRW) [161] is based on a traditional RW on the graph but adds some new auxiliary nodes. The proposed sub-RW method with label priors solves the twig segmentation problem (handling long thin objects). Graph-driven diffusion and RW for IIS [66,162] incorporates a degree term into the original RW formulation to take into consideration the centrality of every adjacent node and measures its contribution in the diffusion process. In Ref. [163], a new energy functional is used to generalize the RW to semi-local and nonlocal frameworks. The Laplacian coordinates (LC)-based IIS method [58] uses a simple formulation of RW that improves the diffusion process by keeping pixels with similar attributes close to each other and imposing big jumps on the boundary of image regions. Another RW-based model using constrained Laplacian optimization is proposed in Ref. [164]; it incorporates the constraints provided by user scribbles into the energy function as Laplacian energy and applies an acceleration strategy to reduce the runtime of the proposed algorithm. Iterative boundary RW [165] presented two approaches, iterative RW and boundary RW, for segmentation potential, in order to develop a feedback system and produce an intuitive segmentation with reduced input. In addition to these approaches, other IIS algorithms [32,62,100,108,111,166,167] have proposed to extend the original Grady RW [50].
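The sparse linear system of the original RW model states, row by row, that the probability at each unseeded node equals the weighted average of the probabilities of its neighbors. The Python sketch below exploits this property on a five-node chain, using Gauss-Seidel sweeps in place of a sparse solver. The graph, the edge weights, the number of sweeps, and the function name `random_walk_probabilities` are illustrative assumptions, not part of Ref. [50].

```python
def random_walk_probabilities(n_nodes, edges, seeds, sweeps=500):
    """Probability that a walker from each node first reaches an FG seed.

    edges: {(i, j): weight}; seeds: {node: 1.0 for FG, 0.0 for BG}.
    """
    neighbours = {i: [] for i in range(n_nodes)}
    for (i, j), w in edges.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))
    x = [seeds.get(i, 0.5) for i in range(n_nodes)]
    for _ in range(sweeps):
        for i in range(n_nodes):
            if i in seeds:
                continue  # seeded probabilities stay fixed
            total = sum(w for _, w in neighbours[i])
            # Gauss-Seidel update: weighted average of neighbouring values.
            x[i] = sum(w * x[j] for j, w in neighbours[i]) / total
    return x


# Chain 0-1-2-3-4 with a weak edge (an intensity boundary) between 2 and 3.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.1, (3, 4): 1.0}
x = random_walk_probabilities(5, edges, seeds={0: 1.0, 4: 0.0})
labels = [1 if p > 0.5 else 0 for p in x]
```

The probabilities drop sharply across the weak edge, so thresholding at 0.5 places the segmentation boundary between nodes 2 and 3.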

RG/RM-based methods
The main idea of region growing and region merging methods is to start from provided labels given by user seeds and to merge similar adjacent regions according to a homogeneity criterion. The original seeded RG method is presented in Ref. [16] and many works have been proposed to improve its performance such as Refs. [25,168-172]. Maximal similarity-based RM (MSRM) in Refs. [75,76] applies a pre-processing step by over-segmenting the input image into regions or superpixels, and then the user marks the FG and BG. The RM operation is then performed iteratively until all image regions are labelled according to user intention. Other work in Ref. [93] presented an object extraction method from a BB using a double sparse reconstruction strategy, and then applied the RM process of Ref. [75] to obtain the final segmentation result. In Ref. [98], a fast IIS method is presented using discriminative clustering and RM. After over-segmenting the image using the mean-shift algorithm [173], discriminative clustering-based RM is performed to classify the unmarked regions using color features, omitting the spatial information. Then, a pruning step is applied by considering local neighborhood information and using a connected component algorithm. In addition to these works, other proposed RM techniques for IIS include Refs. [174,175].
Despite the conceptual simplicity of RG/RM-based approaches, the limitation of such methods is that different merging orders can produce different results [16]. Furthermore, accurate FG extraction requires sufficient user input to cover the main feature regions, especially when parts of the FG are very similar to the BG and in the presence of challenges such as shadows, low-contrast edges, and textured objects [75].
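A minimal Python sketch of seeded region growing is given below for illustration. It is a simplification rather than the algorithm of Ref. [16]: the homogeneity criterion (absolute intensity difference to the already labelled neighbor, bounded by a hypothetical `threshold`) and the breadth-first processing order are assumptions chosen for brevity.

```python
from collections import deque


def seeded_region_growing(image, seeds, threshold):
    """Grow labelled regions from user seeds over a 2D grayscale grid.

    seeds: {(row, col): label}. A pixel joins a region when its intensity
    differs from the adjacent labelled pixel by at most `threshold`.
    """
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    queue = deque()
    for (r, c), label in seeds.items():
        labels[r][c] = label
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] is None:
                if abs(image[nr][nc] - image[r][c]) <= threshold:
                    labels[nr][nc] = labels[r][c]
                    queue.append((nr, nc))
    return labels


# Dark background (top-left) and bright object (bottom-right).
image = [
    [10, 12, 11, 90],
    [11, 13, 88, 92],
    [12, 85, 90, 91],
]
labels = seeded_region_growing(image, {(0, 0): "BG", (2, 3): "FG"}, threshold=10)
```

On images with gradual intensity transitions, the first region to reach a pixel claims it, which is one simple way to see why the processing order can change the result.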

Deep learning (DL)-based methods
In recent years, CNNs have achieved great success in many vision applications [10] including image segmentation. A detailed recent survey of image segmentation using DL can be found in Ref. [176].
Recently, many IIS methods using deep architectures have been proposed, achieving higher performance than classical models. Deep Extreme Cut (DEXTR) [17] presents a CNN architecture for IIS using extreme points at the top, bottom, left, and right of the object of interest. The annotated extreme points are given as input to the network in addition to the RGB channels of the input image; ResNet-101 [177] is utilized as the backbone of the proposed architecture. Deep interactive object selection [23] transforms positive and negative clicks provided by the user into separate Euclidean distance maps. These maps are combined with the RGB channels of the input image to create an (image, user interactions) pair used by a fine-tuned fully convolutional network (FCN). Finally, a refinement step is applied using GC optimization. Deep GrabCut for Object Selection [86] is inspired by Ref. [23] and takes a user-provided BB as input instead of user clicks. Furthermore, the proposed model is trained end-to-end and does not need a post-processing step to refine the FCN outputs. Deep interactive region segmentation and captioning [178] uses a hybrid deep architecture, named Lyncean FCN (LFCN), that allows detection, segmentation, and captioning of the user's region of interest from a few clicks. Iteratively trained IIS [22] followed the work of Ref. [23] and uses an iterative model that receives user clicks as input for the CNN; during training, further clicks are added iteratively based on the errors in the current segmentation. Regional IIS networks [179] presented a new deep framework that exploits local information surrounding the provided inputs to capture local region information. Then, multiscale global information is used to improve the feature representation. IIS using a fully convolutional two-stream fusion network [180] has a new deep architecture consisting of two sub-networks: a two-stream late fusion network that estimates the FG at a reduced resolution, and a multi-scale refinement network that refines the FG at full resolution. SeedNet [21] provides an interactive segmentation agent which assists a user to segment an object accurately using an automatic seed generation framework with deep reinforcement learning. The work in Ref. [181] investigated the role of guidance maps in IIS using FCNs. A scale-aware guidance map is generated using hierarchical image information, which leads to a significant reduction in the average number of clicks (NoCs) required to extract a desirable FG mask. IIS with latent diversity [24] presents a composite architecture. The first module is a single FCN trained to take the user's input and the image representation and synthesize a diverse set of solutions. The second module is a network trained to select one of the synthesized segmentations.
The authors of Polygon-RNN [182] and Polygon-RNN++ [183] solve IIS as a polygon prediction problem using a recurrent neural network (RNN) to sequentially predict the vertices of the polygons delimiting the object of interest. Curve-GCN [184], reported to be 100× faster than Polygon-RNN++, alleviates the sequential nature of Polygon-RNN by predicting all vertices simultaneously using a graph convolutional network (GCN). Furthermore, Curve-GCN enables both a polygon and a spline representation of an object contour. DeepCut [85] presents a DL-based extension of GrabCut and formulates object segmentation as an energy optimization problem via a densely connected CRF. It applies an iterative process to update the training models and obtain the final segmentation. The work in Ref. [185] combines powerful CNN models with level set optimization in an end-to-end fashion to tackle the task of IIS. The proposed model employs a multi-branch architecture that learns to predict level set evolution parameters conditioned on a given input image, and evolves a predicted initial contour to extract the object. Both extreme points and motion vectors obtained from annotators dragging and dropping erroneous points have been incorporated in the interactive framework. Recently, Ref. [106] proposed a backpropagation scheme-based IIS algorithm. To segment a target object, an FCN is trained, and in the test phase, the forward pass in the proposed network is performed using an input image and user annotation. A backpropagating refinement scheme (BRS), which constrains user-specified locations to have correct labels and refines the segmentation result of the forward pass, is also developed in this work. Following Ref. [106], the authors of Ref. [186] developed a feature-BRS (f-BRS) that solves the optimization problem with respect to auxiliary variables instead of the network inputs, and requires running forward and backward passes for only a small part of the network.
An IIS method that considers all regions jointly has been proposed in Ref. [107] based on Mask R-CNN [187]. The authors have adapted their architecture to deal with single object segmentation as well as full image segmentation. In Ref. [188], a scale-diverse IIS network based on ResNet-101 is proposed, incorporating a set of two-dimensional scale priors into the model to generate a set of scale-varying proposals that conform to the user input. The work in Ref. [189] has explored and demonstrated the importance of the first click for IIS. The authors have developed a deep framework, named First Click Attention Network (FCA-Net), which adds a simple module to the basic segmentation network to shift more attention to the first click.
With the popularization of DL and the high quality of the segmentation results obtained by DL-based IIS techniques with just a few clicks, research in this field remains very promising. However, a common problem of most DL-based techniques, which some works [24,179,181,186,189] have tried to overcome, is that excellent results are achieved on objects present in the training set, while poor performance is achieved for unseen object classes. Furthermore, some authors [33] found that the large amounts of data needed to train a large number of model parameters make the practicality of applying such models to real applications questionable. In contrast with this statement, and owing to the emergence of increasingly available datasets and the use of data augmentation to boost the performance of the models [176], DL-based approaches have been successfully used in industry and smart factories [190].

Other methods
Other IIS techniques apply different strategies to propagate initial labels provided by the user to the rest of the image in order to extract the object of interest. In this subsection, we present some of these methods.
IIS by matching attributed relational graphs (MARG) [191]: This method solves the segmentation problem by matching two graphs, one which represents the over-segmented image and one representing only the regions labelled by user strokes. The optimization step is performed based on deformed graphs [192].
Kernel propagation cut (KP-Cut) [193]: Based on kernel propagation [194], KP-Cut starts by generating a small-size seed-kernel matrix that is propagated into a full-kernel matrix for the whole image. FG-BG separation is effectively performed during the kernel propagation process.
Multiple instance learning cut (MIL-Cut) [90]: By imposing and exploring the property of tightness of the BB covering the object of interest, MIL-Cut solves the segmentation problem as a multiple instance learning task by generating positive bags from pixels of sweeping lines inside the BB and negative bags from pixels outside the box.
IIS via graph-based manifold ranking [195]: A graph-based semi-supervised learning framework ranks similarities of unlabeled data points to labeled ones by exploiting global and local consistency of all the data. A three-stage strategy is used to generate the segmentation. First, a k-regular graph is built to model spatial relationships. Then, user provided seeds are integrated to enhance the graph structure. Finally, to overcome instability due to sensitivity to the hyper-parameter, a content based locally adaptive kernel width parameter is used to provide graph edge weights.
Robust affinity diffusion (RAD) for IIS [33]: Unlike algorithms that construct segmentation models based on local affinity graphs such as GC and RW extended approaches, this work proposes iterative diffusion of the local affinity graph to explore global affinity across the whole image. The segmentation model is then constructed on the global graph via an energy function obtained by multiplication of an affinity matrix and a prior probability vector estimated from user seeds. Diffusion is performed as in Ref. [196], and the segmentation is obtained by solving a linear system.
Adaptive constraint propagation-cut (ACP-Cut) [43]: To effectively exploit the small quantity of user information, this work utilizes ACP for semi-supervised kernel matrix learning which allows adaptive seed propagation and complexity reduction. Finally, for FG-BG separation a global k-means [197] algorithm is applied.
Parametric pseudo bound cuts (pPBC) [94]: This work proposed a new general pseudo-bound optimization paradigm for approximate iterative minimization of high-order and non-submodular binary energies. The pPBC algorithm improved the state-of-the-art in many energy minimization problems, in particular IIS using GrabCut [79].

IIS using sample reconstruction and Fisher's linear discriminative analysis (SR-FLDA) [46]: This IIS method generates image superpixels in an initial step and builds a dictionary using labelled ones. Then, a classification strategy using a discriminative projection matrix through the FLDA algorithm [198] is employed to group unlabeled superpixels into FG or BG by calculating their minimal norm.
Adaptive figure-ground classification [87,88]: This work extracts the FG region from the BG using a BB in multiple steps. First, an adaptive mean-shift algorithm is applied to over-segment the image into regions. Then, BG and FG priors are explored to gradually refine these regions. Various distance measures and score functions are computed to generate multiple hypotheses and the final segmentation is obtained using a weighted combination or a voting scheme.
Constrained dominant sets for IIS [99,199]: This work is based on the notion of dominant sets [200], a graph-theoretic concept that can be seen as a generalization of a maximal clique to edge-weighted graphs. This algorithm can deal with many types of constraints and input modalities, and solves the IIS problem in a quadratic optimization framework.
IIS via cascaded metric learning [44]: The proposed approach starts by generating image superpixels and represents them in a space of extracted features (color, intensity, and texture). Positive and negative samples are selected using provided seeds and an optimal classification metric is computed to classify unlabeled samples. Then metric learning and the classification process are performed again using new training samples obtained from the previous iteration; this is repeated until convergence.
Click carving for IIS [103]: Whereas conventional IIS methods start from user provided seeds to segment objects, click carving enables accurate segmentation by precomputing possible segmentation hypotheses. Then, the user clicks on the object boundary to carve away erroneous parts, and the segmentation is refined iteratively until the user is satisfied.
IIS using label propagation through complex networks [201]: This provides a simple graph-based method for IIS with two stages. In the first stage, pixels are connected to their k-nearest neighbors to build a complex network with the small-world property to spread labels quickly. In the second stage, a regular network in grid format is used to refine the segmentation.
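Several of the methods in this subsection share the idea of spreading a few user labels over a graph built on the image. As a generic illustration, the sketch below builds a symmetric k-nearest-neighbor graph over feature vectors and propagates signed seed labels with the classical closed-form label-spreading solution f = (I − αS)^(-1) y, where S is the symmetrically normalized affinity matrix. This is a textbook formulation under assumed parameters (`k`, `alpha`, `sigma`), with hypothetical function names; it is not the implementation of Ref. [195], Ref. [201], or any other cited work.

```python
import numpy as np

def propagate_labels(features, seeds, k=5, alpha=0.99, sigma=1.0):
    """Spread seed labels over a k-NN graph.

    features: (n, d) array, one feature vector per pixel or superpixel.
    seeds:    (n,) array with +1 (FG seed), -1 (BG seed), 0 (unlabeled).
    Returns a boolean array, True where a node is assigned to the FG.
    Requires k < n.
    """
    n = features.shape[0]
    # Pairwise squared distances; the diagonal is excluded from the neighbors.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    # Gaussian weights on the k-NN edges, then symmetrize the graph.
    W = np.zeros((n, n))
    r = np.repeat(np.arange(n), k)
    c = nn.ravel()
    W[r, c] = np.exp(-d2[r, c] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))
    # Closed-form solution of the propagation.
    f = np.linalg.solve(np.eye(n) - alpha * S, seeds.astype(float))
    return f > 0

if __name__ == "__main__":
    # Two well-separated 1-D clusters with one seed each.
    feats = np.array([0, .1, .2, .3, .4, 10, 10.1, 10.2, 10.3, 10.4])[:, None]
    y = np.zeros(10)
    y[0], y[9] = 1, -1
    print(propagate_labels(feats, y, k=2))
```

For full-resolution images the dense n × n matrices used here become impractical; a sparse graph or a superpixel-level graph would be the natural substitute.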

Classification of IIS methods based on processing level
With the great success of superpixel over-segmentation algorithms in many computer vision applications, a large number of IIS methods operate on image regions instead of pixels to achieve FG-BG separation while consuming less time and memory. For example, MSRM [75] segments the object of interest by merging homogeneous regions obtained by initial over-segmentation using the meanshift algorithm [173]. Meanshift has also been adopted by other works such as ACP-Cut and Refs. [87,88,98,202] to generate image entities in an initial step. The LS [59] method can be seen as a fast version of GC which processes image superpixels obtained by the watershed algorithm [141] instead of pixels. Similarly, the watershed has been employed in Ref. [191]. The active method in Ref. [113] partitions the input image into superpixels using the SLIC method [203] and uses them as input to the algorithm. Since SLIC over-segmentation is fast to compute, it has been used as a pre-processing step in many IIS methods [44,90,93,104,175,204-206]. Other works such as Loose-Cut use a multiscale superpixel algorithm [207] to generate input regions and reduce the complexity of their systems. The method of Ref. [199] chose the ultrametric contour map algorithm (UCM) [208] to over-segment the input image. SR-FLDA and Ref. [209] followed Ref. [210] by applying two over-segmentation algorithms: meanshift and UCM [208].
In addition to classic IIS methods that operate directly on pixels and those that operate on superpixels (already mentioned above), some hybrid algorithms process both pixels and superpixels. For example, NC-Cut starts by computing connectedness between image regions; the final segmentation is then found via a pixel-level optimization framework. In the same way, recent work in Ref. [211] proposes a coarse-to-fine method from region-level segmentation to pixel-level segmentation. In Ref. [212], an interactive multi-label segmentation approach combines both pixels and superpixels via robust multi-layer graph constraints. Based on graph theory, a hybrid IIS method using pixels and superpixels is presented in Ref. [31]. The authors in Ref. [202] use game theory to optimize the combinational energy functional related to both pixels and superpixels for IIS. Super-Cut starts its algorithm by applying the superpixel over-segmentation of Ref. [213], and then a pixel-level clustering is applied to obtain the final segmentation.
The main goal of using superpixels in IIS methods is to create visually meaningful entities while heavily reducing the number of primitives for subsequent processing steps [214], thereby reducing the time consumed by the segmentation method [59]. The price paid for the speed of most superpixel-based methods compared to pixel-based ones is that these methods are very sensitive to the quality of the initial over-segmentation and may fail, especially in the presence of shadows, low-contrast edges, and similar BG and FG regions [75]. To achieve a balanced compromise between speed and segmentation accuracy, a combination of pixel-level and superpixel-level segmentation has been reported to outperform single-level segmentation [83,202].
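The reduction in the number of primitives obtained by working at the superpixel level can be made concrete with a small sketch. Here a regular grid of square cells stands in for a real over-segmentation (SLIC, meanshift, watershed, or UCM would be used in practice); the helper names and the grid partition are assumptions for illustration only. Given any integer label map, the sketch computes one mean color per region and lifts pixel-level scribbles to region-level seeds.

```python
import numpy as np

def grid_superpixels(h, w, cell):
    """Stand-in over-segmentation: square cells of side `cell` (hypothetical)."""
    rows = np.arange(h)[:, None] // cell
    cols = np.arange(w)[None, :] // cell
    n_cols = -(-w // cell)  # ceiling division
    return rows * n_cols + cols

def region_means(image, labels):
    """Mean value per channel for each region of an integer label map."""
    flat = labels.ravel()
    counts = np.maximum(np.bincount(flat), 1)  # guard against unused labels
    chans = [np.bincount(flat, weights=image[..., c].ravel()) / counts
             for c in range(image.shape[2])]
    return np.stack(chans, axis=1)  # shape (n_regions, n_channels)

def region_seeds(labels, seed_mask):
    """Regions touched by at least one scribbled pixel."""
    return np.unique(labels[seed_mask])

if __name__ == "__main__":
    img = np.random.rand(120, 160, 3)
    lab = grid_superpixels(120, 160, cell=10)
    feats = region_means(img, lab)
    print(img.shape[0] * img.shape[1], "pixels ->", feats.shape[0], "regions")
```

Any subsequent graph-based or classification step then runs on the region features rather than on individual pixels, which is where the time and memory savings discussed above come from, and also why errors in the initial over-segmentation cannot be undone later at the region level.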

Datasets
To evaluate IIS methods, many datasets containing images with associated ground-truth have been proposed in the literature. The most widely used datasets for the task of IIS are presented and described in Table 1. We note that some datasets were designed for other computer vision applications such as object recognition, salient object detection, and image co-segmentation; those used to evaluate IIS are also presented in Table 1.
Berkeley segmentation data set (BSDS) [215]: Large dataset of natural images with human annotation to serve as ground-truth. Benchmark to evaluate different contour detection and image segmentation algorithms. BSDS500 is the latest version and contains 500 natural images with their ground-truth.
Microsoft GrabCut dataset [79]: Dataset for IIS containing 50 color images. Ground-truth is stored as tri-maps identifying FG, BG, and mixed pixels (unknown).
MSRA10K dataset [216]: Dataset consisting of 10,000 images with pixel-accurate salient object labeling. Proposed to evaluate salient object detection and segmentation methods.
Pascal VOC datasets [217]: Datasets from the VOC challenges providing standardized image datasets for object class recognition; includes image annotations.
Alpha matting dataset [218]: High-quality matting database that extends the dataset of Ref. [195] by adding challenging images from natural scenes.
Icoseg dataset [115]: Large challenging dataset with 643 images proposed to evaluate image co-segmentation methods. It contains 38 groups of images from real scenes. Each group consists of instances of similar objects.
Weizmann dataset [219]: Segmentation evaluation dataset containing 200 gray level images with ground-truth segmentations.
Microsoft COCO [220]: Dataset for detecting and segmenting objects containing a vast collection of object instances with a total of 2,500,000 labeled instances in 328,000 images.
Cityscapes [221]: Dataset for semantic urban scene understanding containing 5000 finely annotated images of driving scenes, including 2975 images for training, 500 for validation, and 1525 for testing. Eight object classes are provided with per-instance annotation.
Kitti [222]: Suite of vision components for an autonomous driving platform. The object detection dataset contains 7481 training images annotated with 3D bounding boxes. A full description of the annotations can be found in the readme of the object development kit on the Kitti homepage. Pixel-level annotation of a subset of images from the dataset is provided by Ref. [223].

Evaluation metrics
For objective quantitative evaluation of image segmentation techniques, including IIS ones, the result obtained by a segmentation method must be compared with a human-labeled segmentation (the ground-truth) by measuring their similarity according to some metric [1]. In this section, we report the measures commonly used for such evaluation, as described in various works [1,6,180,198,199]. Table 2 summarizes these measures by describing them, giving their mathematical formulation or a reference to it, and listing alternative names. In Table 2, we use S and G to designate the segmentation result to be evaluated and the ground-truth, respectively.
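As a small illustration of how such measures are computed from S and G, the sketch below implements three standard overlap measures for binary masks: intersection over union (IoU, also known as the Jaccard index), the Dice coefficient, and the pixel error rate. These are the textbook definitions and are given only as an example; the exact set of measures and naming used in Table 2 may differ, and the function names are chosen here for illustration.

```python
import numpy as np

def iou(S, G):
    """Intersection over union (Jaccard index): |S ∩ G| / |S ∪ G|."""
    S, G = np.asarray(S, dtype=bool), np.asarray(G, dtype=bool)
    union = np.logical_or(S, G).sum()
    # Two empty masks are treated as a perfect match by convention.
    return float(np.logical_and(S, G).sum() / union) if union else 1.0

def dice(S, G):
    """Dice coefficient: 2 |S ∩ G| / (|S| + |G|)."""
    S, G = np.asarray(S, dtype=bool), np.asarray(G, dtype=bool)
    total = S.sum() + G.sum()
    return float(2.0 * np.logical_and(S, G).sum() / total) if total else 1.0

def error_rate(S, G):
    """Fraction of pixels whose label differs between S and G."""
    S, G = np.asarray(S, dtype=bool), np.asarray(G, dtype=bool)
    return float(np.mean(S != G))

if __name__ == "__main__":
    S = np.array([[1, 1], [0, 0]])
    G = np.array([[1, 0], [0, 0]])
    print(iou(S, G), dice(S, G), error_rate(S, G))
```

For datasets whose ground-truth is stored as tri-maps, the comparison is usually restricted to a subset of pixels; that masking step is omitted here.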

Evaluation of IIS methods in the literature
Some works in the literature focus on the evaluation of IIS methods. For example, the authors of a survey of graph-based approaches to image segmentation [229] dedicated an experimental section to comparing four selected IIS methods, namely IGC and Refs. [65], [59], and [50], using the segmentation accuracy and the number of interactions required for segmentation. They concluded that IGC outperforms all the other approaches by producing high segmentation accuracy while requiring the least amount of user interaction. Earlier, an evaluation study of four popular algorithms, namely IGC and Refs. [27], [16], and [42], was presented in Ref. [6]. In this study, using two measures to compute object and boundary accuracy, the experiments showed the most effective techniques to be IGC and Ref. [42]. A more extensive evaluation of IIS techniques, including IGC, CAC, MSRM, RWR, and MARG, has been reported in Ref. [7]. Taking into consideration the accuracy and the robustness of segmentation results, the experiments concluded that each method has its own limitations such as sensitivity to user inputs and noise, sensitivity to the initial over-segmentation, etc. For more details of the results of these evaluation studies, the reader can refer to the original works, Refs. [229], [6], and [7].

General comparison of recent works
In this paper, we select some recent works (about 40 papers) published in the last four years, i.e., between 2016 and 2020, and, based on the results reported by their authors, we give a general comparison of these methods grouped by the adopted methodology. GC-based IIS models are very popular in practical applications because of their solid theoretical foundation and good performance. Thus, research into improving and extending GC methods is always in progress. For example, to overcome the drawbacks of GC with ROI-based methods, i.e., GrabCut extensions, Super-Cut [83], NC-Cut [8], Loose-Cut [81], and Ref. [211] have been proposed. Loose-Cut robustly handles images with a loose input BB, a case in which other methods fail because of inaccurate FG appearance model estimation. Super-Cut also tackles the issue of loose user ROI definition while consuming less time than Loose-Cut. NC-Cut computes an NC map to reduce the sensitivity to user input by exploiting topological properties of the image. All these methods achieve good performance against other methods [79,80,90,94] for different datasets using several evaluation metrics. Point-Cut and the works in Refs. [64] and [150] are other GC-based methods, but they are seed-based models. The work in Ref. [64] outperforms many methods including IGC, the original RW, and Refs. [58] and [37], as well as other recent approaches such as Refs. [161] and [202]. Furthermore, the results obtained by Point-Cut are very competitive despite the use of only one input user seed point. The authors in Ref. [150] exploited the advantages of both geodesic GC [54] and One-Cut [80] with little additional runtime and reduced user interaction.
RW-based methods form another category of widely used approaches to perform IIS based on graph theory. Belonging to this category, SRW [161] and Refs. [100] and [165] have been recently proposed to boost the performance of RW models. SRW focuses on improving the segmentation of complex textured images, which other RW-based methods [50,160,230] fail to solve. However, the runtime of SRW is greater than that of the compared methods. The model presented in Ref. [100] achieves good results in spite of the presence of inaccurate initial labels, but some failures occur when initial FG and BG labels have similar color distributions. More recently, Ref. [165] presented a feedback system that allows the computer to exploit and understand the intent of limited user input using an iterative boundary.
To solve IIS for deformable objects in the presence of challenges, some recent contour-based methods have been proposed. For instance, the model in Ref. [122] tackles the case of images with intensity inhomogeneity and a high level of noise, producing accurate segmentation compared to other level set methods. In the same way, the interactive ACM in Ref. [74] generates better segmentation results for objects in heterogeneous and cluttered images than CAC [72] and GrabCut. Other recent work [136] has demonstrated its superiority over other well-known ACM methods with respect to the balance of segmentation accuracy and speed, in the presence of different types of noise and with the use of various initializations.
As described in Section 3.5, recent approaches use CNNs to perform effective IIS. They achieve the state of the art in both accuracy and limited user interaction. The first DL-based IIS method [23] requires only a few positive and negative clicks to mark FG and BG respectively. Its results demonstrate superiority by requiring the fewest NoCs to achieve a certain IoU accuracy when compared to other conventional works on different datasets. Following Ref. [23], many DL-based approaches have been presented to extend its principles and improve its performance using different strategies such as region proposals [179], extreme points [17], iterative training [22], coupled CNN architecture [24,180], backpropagation refinement [106,186], polygon [182,183] and curve-graph [184] based networks, deep extreme level sets [185], and other strategies [21,107,181,188,189]. In general, all DL-based IIS approaches have the ability to perceive complex global and local image features and achieve better performance than conventional methods. Although direct comparison between the different DL-based techniques cannot be performed due to differences in frameworks, datasets, and hardware used, the results reported in two recent works [186,189] reached the state-of-the-art for the datasets used.
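The NoC measure mentioned above (the number of clicks needed to reach a given IoU) can be sketched as follows. The input is assumed to be, for each image, the list of IoU values obtained after the first, second, third click, and so on. The target IoU of 0.85 and the cap of 20 clicks are assumptions used here for illustration; individual papers and datasets use their own thresholds, and the function names are hypothetical.

```python
def clicks_to_reach(ious, threshold=0.85, max_clicks=20):
    """Number of clicks after which the IoU first reaches `threshold`.

    `ious[i]` is the IoU after click i + 1. If the threshold is never
    reached within `max_clicks`, the cap itself is returned.
    """
    for n, value in enumerate(ious[:max_clicks], start=1):
        if value >= threshold:
            return n
    return max_clicks

def mean_noc(per_image_ious, threshold=0.85, max_clicks=20):
    """Average number of clicks over a collection of images."""
    counts = [clicks_to_reach(seq, threshold, max_clicks)
              for seq in per_image_ious]
    return sum(counts) / len(counts)

if __name__ == "__main__":
    runs = [[0.90], [0.50, 0.86], [0.40, 0.60, 0.70]]
    print(mean_noc(runs))
```

Because the cap is counted as a full score for images that never reach the target, the mean NoC depends on the chosen threshold and cap, which is one reason why values reported under different protocols are not directly comparable.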
In addition to the above-mentioned methods, many other works have been published in recent years and have achieved promising results for the task of IIS. Table 3 presents a general comparison of recent IIS methods grouped by the methodology followed. For each method, the key words, the level of processing (pixels (Px) or superpixels (Spx)), the type of user interaction, the evaluation metrics, the advantages, and the limitations (if known) are given.

Available resources
The aim of this section is to provide resources such as links to available code and software for IIS research. Table 4 presents the collected resources grouped by methodology. We note that the links were last accessed at the time of writing this article.

Conclusions
In this paper, we have reviewed the literature covering IIS methods. We have classified them according to different criteria: the type of human interaction, the adopted methodology, and the level of processing. We focused on recent works including DL-based methods. We have given a general comparison of them and collected available resources including datasets, evaluation metrics, and online code sources. The main goal of this work is to present a recent survey to serve as a reference for researchers in the field of IIS.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.