User-Centric Learning and Evaluation of Interactive Segmentation Systems
Kohli, P., Nickisch, H., Rother, C., et al.: Int J Comput Vis (2012) 100: 261. doi:10.1007/s11263-012-0537-4
Abstract
Many successful applications of computer vision to image or video manipulation are interactive by nature. However, parameters of such systems are often trained neglecting the user. Traditionally, interactive systems have been treated in the same manner as their fully automatic counterparts. Their performance is evaluated by computing the accuracy of their solutions under some fixed set of user interactions. In this paper, we study the problem of evaluating and learning interactive segmentation systems which are extensively used in the real world. The key questions in this context are how to measure (1) the effort associated with a user interaction, and (2) the quality of the segmentation result as perceived by the user. We conduct a user study to analyze user behavior and answer these questions. Using the insights obtained from these experiments, we propose a framework to evaluate and learn interactive segmentation systems which brings the user in the loop. The framework is based on the use of an active robot user—a simulated model of a human user. We show how this approach can be used to evaluate and learn parameters of state-of-the-art interactive segmentation systems. We also show how simulated user models can be integrated into the popular max-margin method for parameter learning and propose an algorithm to solve the resulting optimisation problem.
Keywords
Interactive systems · Image segmentation · Learning

1 Introduction
Problems in computer vision are known to be hard, and very few fully automatic vision systems exist which have been shown to be accurate and robust under all sorts of challenging inputs. In the past, these conditions confined most vision algorithms to the laboratory environment. The last decade, however, has seen computer vision finally come out of the research lab and into the real-world consumer market. This sea change has occurred primarily on the back of a number of interactive systems which allow users to give hints that help the vision algorithm reach the correct solution. Interactive systems for generating collages and panoramas of images (Rother et al. 2006) and for object cut-and-paste (image segmentation) (Rother et al. 2004) have become particularly popular among users. Understandably, interest in interactive vision systems has grown in the last few years, which has led to a number of workshops and special sessions at vision, graphics, and user-interface conferences.^{1}
The performance of an interactive system depends on a number of factors, one of the most crucial being the user. This user dependence makes interactive systems quite different from their fully automatic counterparts, especially when it comes to learning and evaluation. Surprisingly, there has been little work in computer vision or machine learning devoted to user-centric learning of interactive systems. This paper tries to bridge this gap.
We choose to study the learning and evaluation problems in the context of interactive segmentation systems, which are extensively used in the real world. Interactive segmentation aims to separate an object of interest from the rest of an image. It is a classification problem where each pixel is assigned one of two labels: foreground (fg) or background (bg). The interaction comes in the form of sets of pixels marked by the user, with the help of brushes, as belonging either to fg or bg.^{2} Most work on learning and evaluating interactive segmentation systems assumes a fixed input, without considering how real-world users interact with the system in practice.
This paper addresses two problems: (1) how to evaluate any given interactive segmentation system, and (2) how to learn the best interactive segmentation system. Observe that an answer to the first question yields an answer to the second: pick the segmentation system with the best evaluation. We conduct a user study to analyze user behavior and answer the key questions of how to measure (a) the effort associated with a user interaction, and (b) the quality of the segmentation result as perceived by the user. Using the insights obtained from these experiments, we propose a framework to evaluate and learn interactive segmentation systems which brings the user in the loop. Although we apply our framework only to interactive segmentation systems, it is applicable to interactive machine intelligence and computer vision problems in general.
We demonstrate the efficacy of our evaluation methods by learning the parameters of the state-of-the-art system for interactive image segmentation. We then extend parameter learning in structured models by including the user effort in the max-margin method. The contributions of this paper are: (1) The study of the problems of evaluating and learning interactive systems. (2) The analysis of the behavior of users of interactive segmentation systems. (3) The use of a user model for evaluating and learning interactive systems. (4) A comparison of state-of-the-art segmentation algorithms under an explicit user model. (5) A new algorithm for max-margin learning with user in the loop. Two recent articles (Gulshan et al. 2010; Blake et al. 2011) already employ our robot user to learn and compare various different segmentation algorithms, which demonstrates the usefulness of our approach.
A preliminary version of this paper appeared as (Nickisch et al. 2010). This extended version describes a new user study which provides us with insights on how users measure accuracy of solutions and interaction effort in the context of interactive segmentation systems.
Organization of the Paper
In Sect. 2, we discuss the problem of interactive system evaluation. In Sect. 3, we give details of our problem setting, and explain the different segmentation systems and datasets considered in our study. In Sect. 4, we describe the artificial user model used in our study. In Sect. 5, we describe the results of our user study which provides us with insights on how users perceive segmentation quality and interaction effort. Section 6 explains the naïve line-search method for learning segmentation system parameters. In Sect. 7, we show how the max-margin framework for structured prediction can be extended to handle interactions, and show some basic results. We conclude by listing some ideas for future work in Sect. 8.
2 Evaluating Interactive Systems
Performance evaluation is one of the most important problems in the development of real world systems. There are two choices to be made: (1) The data sets on which the system will be tested, and (2) the quality measure. Traditional computer vision and machine learning systems are evaluated on preselected training and test data sets. For instance, in automatic object recognition, one minimizes the number of misclassified pixels on datasets such as PASCAL VOC (Everingham et al. 2009).
Table 1: Comparison of evaluation methods

| Method | User in loop | User can learn | Interaction | Effort model | Parameter learning | Time | Price |
|---|---|---|---|---|---|---|---|
| User model | yes | yes | yes | yes | this paper | fast | low |
| Crowdsourcing | yes | yes | yes | yes | conceivable | slow | a bit |
| User study | yes | yes | yes | yes | infeasible | slow | very high |
| Static learning | no | no | no | no | used so far | fast | very low |
2.1 Static Interactions
A fixed set of user-made interactions (brush strokes) associated with each image of the dataset is most commonly used in interactive image segmentation (Blake et al. 2004; Singaraju et al. 2009; Duchenne et al. 2008). These strokes are chosen by the researchers themselves and are encoded as image trimaps: pixel assignments with foreground, background, and unknown labels (see Fig. 2b). The system to be evaluated is given these trimaps as input, and its accuracy is measured by computing the Hamming distance between the obtained result and the ground truth. This scheme of evaluation does not consider how users may change their interaction after observing the current segmentation results. Evaluation and learning methods which work with a fixed set of interactions will be referred to as static in the rest of the paper.
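For concreteness, the static accuracy measure can be sketched as follows. This is a minimal sketch: the function name and the optional restriction to the trimap's unknown region are our own choices, not from the paper.

```python
import numpy as np

def hamming_error(segmentation, ground_truth, unknown_mask=None):
    """Fraction of mislabelled pixels between a binary segmentation and
    the ground truth; optionally restricted to the trimap's 'unknown'
    region, since pixels labelled in the trimap are correct by construction."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    if unknown_mask is not None:
        mask = np.asarray(unknown_mask, dtype=bool)
        return float(np.mean(seg[mask] != gt[mask])) if mask.any() else 0.0
    return float(np.mean(seg != gt))
```

Multiplying the result by 100 gives the percentage error figures used throughout the paper.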
Although the static evaluation method is easy to use, it suffers from a number of problems: (1) The fixed interactions might be very different from the ones made by actual users of the system. (2) Different systems prefer different types of user hints (interaction strokes) and thus a fixed set of hints might not be a good way of comparing two competing segmentation systems. For instance, geodesic distance based approaches (Bai and Sapiro 2007; Grady 2006; Singaraju et al. 2009) prefer brush strokes equidistant from the segmentation boundary as opposed to graph cuts based approaches (Boykov and Jolly 2001; Rother et al. 2004). (3) The evaluation does not take into account how the accuracy of the results improves with more user strokes. For instance, one system might only need a single user interaction to reach the ground truth result, while the other might need many interactions to get the same result. Still, both systems will have equal performance under this scheme. These problems of static evaluation make it a poor tool for judging the performance of newly proposed segmentation systems.
2.2 User Studies
A user study involves giving the system to a group of participants who are required to use it to solve a set of tasks. The system which is easiest to use and yields the correct segmentation in the least amount of time is considered the best. Examples are Mortensen and Barrett (1998) and Li et al. (2004), where full user studies were conducted, and Bai and Sapiro (2007), where an advanced user performed the optimal job with each system on a few images. However, user studies are very impractical to arrange if thousands of parameter settings are to be tested.
While overcoming most of the problems of static evaluation, user studies introduce new ones: (1) They are expensive and need a large number of participants to be statistically significant. (2) Participants need time to familiarize themselves with the system. For instance, an average driver steering a Formula 1 car for the first time might be no faster than in a normal car; after gaining experience with the car, however, one would expect the driver to be much faster. (3) Each system has to be evaluated independently by participants, which makes this scheme infeasible in a learning scenario where we are trying to find the optimal parameters of the segmentation system among thousands or millions of possible ones.
2.3 Evaluation Using Crowdsourcing
Crowdsourcing has attracted a lot of interest in the machine learning and computer vision communities. This is primarily due to the success of a number of incentive schemes for collecting training data from users on the web. These are based on money (Sorokin and Forsyth 2008), reputation (von Ahn and Dabbish 2004), or community efforts (Russell et al. 2008). Crowdsourcing has the potential to be an excellent platform for evaluating interactive vision systems such as those for image segmentation. One could ask Mechanical Turk (amazon.com 2010) users to cut out different objects in images with different systems; the system needing the fewest interactions on average might be considered the best. However, this approach, too, suffers from a number of problems, such as fraud prevention. Furthermore, as with user studies, it cannot be used for learning when thousands or even millions of candidate systems must be compared.
2.4 Evaluation with an Active User Model
In this paper we propose a new evaluation methodology which overcomes most of the problems described above. Instead of using a fixed set of interactions, or an army of human participants, our method only needs a model of user interactions. This model is a simple algorithm which, given the current segmentation and the ground truth, outputs the next user interaction. The user model can use simple rules, such as "place a brush stroke in the middle of the largest wrongly labelled region", or alternatively can be learnt from interaction logs. We will see that a simple user model exhibits behavior similar to that of a novice human user. There are many similarities between the problem of learning a user model and the learning of an agent policy in reinforcement learning, so one may exploit reinforcement learning methods for this task. Pros and cons of the evaluation schemes are summarized in Table 1.
Concurrent with our work, McGuinness and O’Connor (2010, 2011) have also proposed the use of user models to evaluate the performance of interactive image segmentation systems. They have introduced a number of deterministic and stochastic strategies to choose brush strokes. They also reason about more sophisticated models for brush stroke generation. However, unlike our work, which looks at both learning and evaluation of interactive systems, the main focus of their paper is on evaluation of these systems. One of their high-level insights, which matches ours, is that simple strategies, such as placing the brush stroke at the center of the erroneous region, perform reasonably well in obtaining accurate segmentations compared to random or more energy-aware strategies.
Our framework is also loosely related to the recent and interesting work of Vijayanarasimhan and Grauman (2011b, 2011a) on active learning for object recognition, which contains a user-based annotation system within it. In a separate paper, Vijayanarasimhan and Grauman (2009) looked at the problem of predicting the time taken by a user to annotate a given image, which can be seen as the learning of an implicit user model.
3 Interactive Segmentation Systems: Problem Setting
We now describe in detail the segmentation systems and datasets used in our studies on evaluation and learning interactive systems.
3.1 The Segmentation Systems
We use four different interactive segmentation systems in this paper: “GrabCut (GC)”, “GC Simple (GCS)”, “GC Advanced (GCA)”, and “GeodesicDistance (GEO)”.
GEO is a very simple system. We first learn Gaussian Mixture Model (GMM) based color models for fg/bg from user made brush strokes. The shortest path in the likelihood ratio yields a segmentation (Bai and Sapiro 2007).
The unary terms are computed from a probabilistic model for the colors of background (y_{p}=0) and foreground (y_{p}=1) pixels using two different GMMs Pr(x|0) and Pr(x|1). E_{p}(y_{p}) is then computed as E_{p}(y_{p}) = −log Pr(x_{p}|y_{p}), where x_{p} contains the three color channels of pixel p. Importantly, GrabCut (Rother et al. 2004) updates the color models based on the whole segmentation. In practice we use only a few iterations.
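The unary computation can be sketched as follows. Note that this sketch uses a single full-covariance Gaussian per label instead of the GMMs used in the paper, purely to keep the example self-contained; all function names are our own.

```python
import numpy as np

def fit_gaussian(colors):
    """Single-Gaussian colour model (the paper uses GMMs; this is a
    one-component simplification for illustration).  A small ridge keeps
    the covariance invertible for tight colour clusters."""
    mu = colors.mean(axis=0)
    cov = np.cov(colors, rowvar=False) + 1e-6 * np.eye(3)
    return mu, cov

def neg_log_likelihood(pixels, model):
    """-log N(x | mu, cov) for each RGB row in `pixels`."""
    mu, cov = model
    d = pixels - mu
    mahal = np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d)
    return 0.5 * (mahal + np.log(np.linalg.det(cov)) + 3 * np.log(2 * np.pi))

def unary_terms(pixels, fg_strokes, bg_strokes):
    """E_p(y_p) = -log Pr(x_p | y_p); column 0 is bg (y_p=0), column 1 fg."""
    fg = fit_gaussian(fg_strokes)
    bg = fit_gaussian(bg_strokes)
    return np.stack([neg_log_likelihood(pixels, bg),
                     neg_log_likelihood(pixels, fg)], axis=1)
```

A pixel whose colour is well explained by the foreground model gets a low foreground unary, so the subsequent graph-cut step prefers to label it foreground.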
To summarize, the models have two linear free parameters, w_{i} and w_{c}, and a single non-linear one, w_{β}. GC minimizes the energy defined above and is effectively the original GrabCut system (Rother et al. 2004). GCS is a simplified version in which the color models (and unary terms) are fixed up front; they are learnt only from the initial user brush strokes (see Sect. 3.1). GCS will be used in max-margin learning and to check the active user model, but it is not considered a practical system.
Finally, GCA is an advanced GrabCut system performing considerably better than GC. Inspired by recent work (Liu et al. 2009), foreground regions are 4-connected to a user-made brush stroke to avoid isolated foreground islands. Unfortunately, such a notion of connectivity leads to an NP-hard problem, and various solutions have been suggested (Vicente et al. 2008; Nowozin and Lampert 2009). However, these are either very slow and operate on super-pixels (Nowozin and Lampert 2009) or have a very different interaction mechanism (Vicente et al. 2008). We instead remove small disconnected foreground regions in a postprocessing step.
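The post-processing step can be sketched as a flood fill that keeps only foreground pixels 4-connected to a foreground brush pixel. This is our reading of the rule described above; the authors' exact implementation may differ.

```python
import numpy as np
from collections import deque

def remove_disconnected_fg(seg, fg_seeds):
    """Keep only foreground pixels that are 4-connected to a foreground
    brush pixel; everything else is relabelled background (sketch)."""
    seg = np.asarray(seg, dtype=bool)
    keep = np.zeros_like(seg)
    # seed the flood fill at brush pixels that are currently foreground
    q = deque((int(r), int(c))
              for r, c in np.argwhere(np.asarray(fg_seeds, bool) & seg))
    for r, c in q:
        keep[r, c] = True
    while q:                                   # 4-connected flood fill
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < seg.shape[0] and 0 <= nc < seg.shape[1]
                    and seg[nr, nc] and not keep[nr, nc]):
                keep[nr, nc] = True
                q.append((nr, nc))
    return keep
```

Isolated islands far from any brush stroke are dropped, which is why (as Sect. 6 discusses) GCA can afford weaker pairwise terms than GC.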
3.2 Datasets for Evaluation
3.3 The Error Measure
Observe that f encodes two facts: all errors below 1.5 are considered negligible, and large errors never weigh more than c. The first reason for this setting is that visual inspection showed that, for most images, an error below 1.5 % corresponds to a visually pleasing result. Of course this is highly subjective: a missing limb in the segmentation of a cow might amount to an error of only 0.5 % but be visually unpleasing, while an incorrectly segmented low-contrast area with an error of 2 % may not be visually disturbing. A second reason for having a lower limit on the errors considered significant is that for most segmentation problem instances it is hard to define a single precise ground truth segmentation, due to factors such as mixed pixels and shadows. In their user study evaluating the Intelligent Scissors algorithm for image segmentation, Mortensen and Barrett (1998) compared the results of their algorithm to those of a suite of users (including the variation within them) rather than to a fixed “ground truth” that was itself determined by a single user. The reason for having a maximum weight of c is that users do not discriminate between two systems that both give large errors; thus errors of 50 % and 55 % are penalized equally. Note that ideally we would learn f(e) by performing a user study.
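A weighting with both properties can be written in one line. To be clear, the exact functional form and the value of the cap c are our assumptions for illustration; the text only states the two qualitative properties (negligible below 1.5 %, saturating at c).

```python
def weighted_error(e, threshold=1.5, cap=8.0):
    """Saturating error weight (sketch): errors below `threshold` percent
    count as zero, and the weight never exceeds `cap`.  Both constants are
    illustrative assumptions, not values from the paper."""
    return min(max(e - threshold, 0.0), cap)
```

Under this weighting an error of 1 % is treated as a success, while errors of 50 % and 55 % receive the same (maximal) penalty.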
Due to runtime limitations of parameter learning, we want to limit the robot user to a small number of brush strokes (e.g. a maximum of 20). Thus we start by giving an initial set of brush strokes (cyan/magenta in e.g. Fig. 2(c)) which are used to learn the (initial) colour models. At the same time, we want most images to reach a Hamming error level of about 1.5 %. A run of the robot user with the GCA system showed that this is possible (for 68 % of images the error is below 1.5 %, and for 98 % below 2.5 %). We also confirmed that the initial static brush trimap does not considerably affect the learning (see Sect. 6).^{6}
4 Evaluation Using a Robot User
We start the robot user from an initial fixed set of brush strokes (the “static brush trimap”) such as the one shown in Fig. 1(b). The robot user places brushes in the form of dots with a fixed maximum size (here 4 pixel radius). At the boundary, the brush size is scaled down to prevent the brush from straddling the boundary. Figure 2(c) shows an example robot user interaction, where red/blue dots are the robot user interactions and cyan/magenta are the fixed initial brushes.
Given the ground truth segmentation y^{k} and the current segmentation solution y, the robot user model is a policy s:(x^{k},y^{k},u^{k,t},y)↦u^{k,t+1} which specifies which brush stroke to place next. Here, u^{k,t} denotes the user interaction history of image x^{k} up to time t. We have investigated various options for this policy: (1) Brush strokes at random image positions. (2) Brush strokes in the middle of the largest, wrongly labelled region (center). For the second strategy, we find the largest connected region of the binary mask, which is given by the absolute difference between the current segmentation and ground truth. We then mark a brush stroke at the pixel which is inside this region and furthest away from the boundary. This is motivated by the observation that users find it hard to mark pixels at the boundary of an object because they have to be very precise.
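The “center” policy above can be sketched with plain NumPy and breadth-first search; the component labelling and the multi-source BFS distance below are our own stand-ins for standard morphological tools.

```python
import numpy as np
from collections import deque

N4 = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-neighbourhood

def _components(mask):
    """4-connected components of a boolean mask -> (label image, count)."""
    labels = np.zeros(mask.shape, dtype=int)
    cur = 0
    for r, c in np.argwhere(mask):
        if labels[r, c]:
            continue
        cur += 1
        labels[r, c] = cur
        q = deque([(int(r), int(c))])
        while q:
            y, x = q.popleft()
            for dy, dx in N4:
                ny, nx = y + dy, x + dx
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = cur
                    q.append((ny, nx))
    return labels, cur

def center_brush(seg, gt):
    """'Center' robot-user policy (sketch): the pixel of the largest
    wrongly labelled region that is furthest from the region boundary,
    found by a multi-source BFS from the boundary inwards."""
    wrong = np.asarray(seg, bool) != np.asarray(gt, bool)
    labels, n = _components(wrong)
    if n == 0:
        return None                       # segmentation already perfect
    big = max(range(1, n + 1), key=lambda i: int((labels == i).sum()))
    region = labels == big
    dist = np.full(region.shape, -1)
    q = deque()
    for r, c in np.argwhere(region):      # seed BFS at boundary pixels
        for dy, dx in N4:
            ny, nx = r + dy, c + dx
            if not (0 <= ny < region.shape[0] and 0 <= nx < region.shape[1]) \
                    or not region[ny, nx]:
                dist[r, c] = 0
                q.append((int(r), int(c)))
                break
    while q:
        y, x = q.popleft()
        for dy, dx in N4:
            ny, nx = y + dy, x + dx
            if (0 <= ny < region.shape[0] and 0 <= nx < region.shape[1]
                    and region[ny, nx] and dist[ny, nx] < 0):
                dist[ny, nx] = dist[y, x] + 1
                q.append((ny, nx))
    r, c = np.unravel_index(int(np.argmax(dist)), dist.shape)
    # the brush label (fg/bg) is read from the ground truth at that pixel
    return (int(r), int(c)), bool(np.asarray(gt, bool)[r, c])
```

Given an all-background segmentation and a square foreground object, the policy picks a pixel near the object's centre with a foreground brush label.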
We also tested user models which take the segmentation algorithm explicitly into account. This is analogous to users who have learnt how the segmentation algorithm works and thus interact with it accordingly. We consider the user model which marks a circular brush stroke at the pixel (1) with the lowest min-marginal (SENSIT), inspired by Batra et al. (2010); (2) which results in the largest change in labeling, i.e. which maximizes the size of the region of influence (ROI); (3) which decreases the Hamming error by the largest amount (Hamming). For the last strategy, we consider each pixel as the circle center and choose the one where the Hamming error decreases most. This is very expensive, but in some respects the best solution.^{7} “Hamming” acts as a “perfect user”, who knows exactly which interactions (brush strokes) will reduce the error by the largest amount. It is questionable whether a real user could actually find this optimal position.
Figure 2(d) shows the performance of 5 different user models (robot users) over a range of 20 brushes. The Hamming error is used to measure the error (see Sect. 3.3), which is averaged over all 50 images of our database. Here we used the GCS system, since it is computationally infeasible to apply the (SENSIT; ROI; Hamming) user models on other interaction systems. GCS allows for efficient computation of solutions by dynamic graph cuts (Kohli and Torr 2005). In the other systems, this is not possible, since unaries change with every brush stroke, and hence we have to treat the system as a black box.
As expected, the random user performs badly. Interestingly, the robot users which are guided by the energy (ROI, SENSIT) also perform badly. This is in sharp contrast to Batra et al. (2010), who use the system uncertainty to guide the user scribbles. We conjecture that this phenomenon is due to two primary factors. First, a scribble at a position where the labelling is certain but wrong may provide more information to the algorithm than a scribble at a position which is uncertain but wrong. Second, a number of computer vision studies have shown that MRF models used for image segmentation are misspecified, i.e. the most probable solutions under these models are not the ground truth solutions (Szeliski et al. 2006).^{8} In such cases, providing information that reduces the uncertainty of the model might not move it towards the ground truth solution.
Both the “Hamming” and “center” strategies for the robot user are considerably better than the rest. It is interesting to note that “center” is only marginally worse than “Hamming”. For other systems this conclusion might not hold: GEO, for example, is more sensitive to the location of the brush stroke than a system based on graph cut, as Singaraju et al. (2009) have shown.
To summarize, “center” is a user strategy which is motivated from the point of view of a “system-unaware user” (or “novice user”) and is computationally feasible. Indeed, in Sect. 5 we will validate that this strategy correlates quite well with real novice users. We conjecture that the reason is that humans tend to place their strokes in the center of wrongly labeled regions. Also, for GCS, “center” performed nearly the same as the optimal strategy “Hamming”. Hence, for the rest of the paper we use the “center” strategy, which from here onwards we call our robot user. Note that the recent work of Gulshan et al. (2010) has utilized a very similar type of robot user. We refer the interested reader to their webpage,^{9} where code and an extended dataset are available online.
5 Validating the Robot User
We conducted a user study to check our assumption that the robot user is indeed related to a human “novice user” (details of user study are in Nickisch et al. 2009). We designed an interface which exactly corresponds to the robot user interface, i.e. where the only choice for the human user is to select the position of the circular brush.
Our user study had 12 participants, of which 6 were familiar with computer vision but had no background knowledge about the tested image segmentation algorithms. The other 6 participants were computer literate but did not have any expertise in computer vision. We asked the participants to segment 10 randomly selected images from our database with each of our 3 systems (GCA, GC, GCS) using reasonable parameter settings (see Nickisch et al. 2009). For every new image, a system was chosen at random. We also confirmed, by asking for multiple segmentations of the same image, that users did not train up for a particular system in the course of the study.
For refining the object outline, the user could place circular brushes on the image (the radius of the circle was determined as for the robot user). Additionally, we automatically switched between fg and bg (red and blue brush) using the underlying ground truth segmentation, so switching between the two brushes was not penalized. The user could place a maximum of 20 brushes per image. Participants who were satisfied with the result earlier could press the “Next” button to proceed to the next image (see Fig. 3).
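The two interface conveniences above (brush label chosen from the ground truth, radius shrunk near the boundary) can be sketched as follows. The shrink-until-pure rule is our assumption; the study's exact scaling rule is not given in the text.

```python
import numpy as np

def brush_at(pos, gt, max_radius=4):
    """Sketch of the study interface: the brush label is read from the
    ground truth at the clicked pixel, and the radius is shrunk until the
    disc no longer straddles the object boundary (assumed rule)."""
    r0, c0 = pos
    gt = np.asarray(gt, bool)
    label = bool(gt[r0, c0])
    radius = max_radius
    while radius > 1:
        rr, cc = np.ogrid[:gt.shape[0], :gt.shape[1]]
        disc = (rr - r0) ** 2 + (cc - c0) ** 2 <= radius ** 2
        if np.all(gt[disc] == label):   # disc entirely on one side
            break
        radius -= 1
    return label, radius
```

Far from the boundary the full 4-pixel radius is used; right at the boundary the brush collapses to a single-pixel-radius dot.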
Table 2: Optimal parameter values ± stdev for different systems after line-search for each parameter individually

(a) System GCA

| Trimap | w_c | w_i | w_β | Test (Er) |
|---|---|---|---|---|
| dynamic brush | 0.03 ± 0.03 | 4.31 ± 0.17 | 2.21 ± 3.62 | 1.00 |
| static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 9.73 ± 7.92 | 1.04 |
| static brush | 0.22 ± 0.52 | 0.47 ± 8.19 | 3.31 ± 2.13 | 1.19 |

(b) System GC

| Trimap | w_c | w_i | w_β | Test (Er) |
|---|---|---|---|---|
| dynamic brush | 0.24 ± 0.03 | 4.72 ± 1.16 | 1.70 ± 1.11 | 1.38 |
| static trimap | 0.07 ± 0.09 | 4.39 ± 4.40 | 4.85 ± 6.29 | 1.52 |
| static brush | 0.57 ± 0.90 | 5.00 ± 0.17 | 1.10 ± 0.96 | 1.46 |
The final error Er (mean ± std.) averaged over all images and 6 human users from group 1 is 0.442±0.090 (GCA), 0.610±0.113 (GC), 0.896±0.079 (GCS). It shows a clear correlation with the error of our robot user: 0.00 (GCA), 0.112 (GC), 0.476 (GCS). The corresponding numbers for the group 2 experiment for human participants were: 0.422±0.237 (GCA), 0.871±0.132 (GC), 1.055±0.163 (GCS), and for the robot user were: 0.00 (GCA), 0.069 (GC), 0.296 (GCS).
5.1 Perceptual Accuracy Satisfaction Threshold
5.2 Measuring Interaction Effort
There has been little work on analyzing the time taken for segmenting objects in images, or making particular brush strokes. A notable exception is the study conducted by Vijayanarasimhan and Grauman (2009) who tried to predict the time taken by users to label complete images.
6 Learning by Line-Search
We will now address the problem of learning the optimal parameters of the different interactive segmentation systems. Systems with few parameters can be trained by simple line-search. Our systems GC, GCS, and GCA have 3 free parameters: w_{c}, w_{i}, w_{β}. Line-search fixes all but one free parameter w_{ϕ} and simulates the user interaction process for 30 discrete values w_{ϕ,i} of w_{ϕ} over a predefined range. The optimal value \(w_{\phi}^{*}\) from the discrete set is chosen to minimize the leave-one-out (LOO) estimate of the test error.^{10} This not only prevents overfitting but also lets us efficiently compute the Jackknife estimator of the variance (Wasserman 2004, ch. 8.5.1), a measure of how certain the optimal parameter is.

We run this procedure for all three parameters individually, starting from w_{c}=0.1, w_{i}=0, w_{β}=1. These initial settings are not very different from the finally learned values, so we conjecture that initialization is not crucial.^{11} One important observation is that our dataset was big enough (and our parameter set small enough) not to suffer from over-fitting: training and test error rates are virtually the same for all experiments.

In addition to the optimal value, we obtain the variance of this parameter setting. Roughly speaking, this variance tells us how important it is to have this particular value; a high variance means that values different from the selected one would also perform well. Note that since our error function (Eq. (2)) is defined for static and dynamic trimaps, the above procedure can be performed for all three types of trimaps: “static trimap” (e.g. Fig. 1(c)), “static brush” (e.g. Fig. 1(b)), and “dynamic brush”.
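The line-search-with-LOO procedure can be sketched as follows, given a precomputed matrix of final robot-user errors per candidate value and per image. This is our reading of the text, not the authors' code, and the jackknife spread reported here is a simplified stand-in for the full variance estimator.

```python
import numpy as np

def line_search(errors):
    """Pick a parameter value by leave-one-out (LOO) error (sketch).

    errors[i, k]: final robot-user error on image k when the free
    parameter takes its i-th candidate value.  Returns the index of the
    best candidate on the full data, the LOO estimate of the test error,
    and the spread of the per-fold choices (a rough certainty measure)."""
    errors = np.asarray(errors, float)
    n_vals, n_imgs = errors.shape
    picks = np.empty(n_imgs, dtype=int)
    for k in range(n_imgs):
        train = np.delete(errors, k, axis=1).mean(axis=1)
        picks[k] = int(np.argmin(train))      # value chosen without image k
    loo_test = float(errors[picks, np.arange(n_imgs)].mean())
    best = int(np.argmin(errors.mean(axis=1)))
    return best, loo_test, float(picks.std())
```

A spread of zero means every fold picked the same value, i.e. the parameter choice is stable across images.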
More importantly, we see that the test error is lower when trained dynamically in contrast to static training. This validates our conjecture that an interactive system has to be trained in an interactive way.
Let us look closer at some learnt settings. For system GCA and parameter w_{c} (see Table 2(a), first row, and Fig. 8(a)) we observe that the optimal value in the dynamic setting (0.03) is lower than in any of the static settings. This is surprising, since one would have guessed that the true value of w_{c} lies somewhere between the parameters learned with a loose and a very tight trimap. It shows that the procedure in Singaraju et al. (2009), where parameters are learnt by averaging the performance over two static trimaps, is not necessarily correct. Furthermore, neither the static brush nor the static trimap can be used to guess the settings of all parameters of a dynamic model. For instance, the static “tight trimap” is quite useful guidance for setting w_{c} and w_{i}, but less useful for w_{β}.^{12} To summarize, conclusions about the optimal parameter setting of an interactive system should be drawn from a large set of interactions and cannot be made by looking solely at a few (here two) static trimaps.
For completeness, we report the same numbers for the GC system in Table 2(b); the same conclusions hold. One interesting thing to notice is that the pairwise terms (especially w_{c}) are chosen higher than in GCA. This is expected: without post-processing, many isolated islands far from the true boundary may be present, and the post-processing step removes them automatically. The effect is that in GCA the pairwise terms can concentrate on modeling the smoothness of the boundary correctly, whereas in GC the pairwise terms must additionally ensure that the isolated regions are removed (by taking a higher value) to compensate for the missing post-processing step.
It is interesting to note that for the error metric f(er_{b})=er_{b}, we get slightly different values (full results in Nickisch et al. 2009). For instance, we see that w_{c}=0.07±0.07 for GCA with our active user. This is not too surprising, since it says that larger errors are more important (this is what f(er_{b})=er_{b} does). Hence, it is better to choose a larger value of w_{c}.
System Comparison
7 Max-Margin Learning
The line-search method used in Sect. 6 is only suitable for learning models with few parameters. Max-margin methods (Tsochantaridis et al. 2004; Taskar et al. 2004; Szummer et al. 2008) can deal with models containing large numbers of parameters and have been used extensively in computer vision. However, they work with static training data and cannot be used with an active user model. In this section, we show how the traditional max-margin parameter learning algorithm can be extended to incorporate an active user.
The structure of this section is as follows. After reviewing the static case of max-margin learning (Sect. 7.1), we describe the dynamic case (Sect. 7.2) where the user is in the loop. The optimization of the dynamic case is very challenging and we suggest two different heuristic techniques in Sect. 7.3. The latter one, optimization with strategies, is a simple and practical solution which is used in the experimental part in Sect. 7.4.
7.1 Static SVMstruct
7.2 Dynamic SVMstruct with “Cheating”
For simplicity, we choose the amount of user interaction or cheating ι to be the maximal a-reweighted number of labeled pixels \(\iota=\max_{k}\sum_{i}a_{i}|u_{i}^{k}|\), with uniform weights a=a⋅1. In practice we should use different weights for different interactions (as explained in Sect. 5.2).
Other formulations based on the average rather than the maximal amount of interaction proved feasible but less convenient. We denote the set of all user interactions for all K images x^{k} by U=[u^{1},…,u^{K}]. The compatible label set \(\mathcal{Y}|_{\mathbf{u}^{k}}\subseteq\{0,1\}^{n}\) is then given by \(\{\hat{\mathbf{y}}\in\mathcal{Y}\,|\,u_{i}^{k}=1\Rightarrow\hat{y}_{i}=y_{i}^{k}\}\), where y^{k} is the ground truth labeling. Note that o(w,U) is convex in the weights w for all values of U∈{0,1}^{n×K}, hence the global minimiser \(\mathbf{w}^{*}_{\mathbf{U}}=\arg\min_{\mathbf{w}}o(\mathbf{w},\mathbf{U})\) can be computed efficiently by the cutting-planes algorithm. The dependence on u^{k}, however, is extremely difficult: we have to find the smallest set of brush strokes leading to a correct segmentation. Geometrically, setting one \(u_{i}^{k}=1\) halves the number of possible labellings and therefore removes half of the label constraints. The problem (Eq. (4)) can be re-interpreted in different ways:
A modified energy \(\tilde{E}_{\mathbf{w},\mathbf{v}}(\mathbf{y})=E_{\mathbf{w}}(\mathbf {y})+\sum_{i\in\mathcal{V}}u_{i}^{k}\phi_{i}(y_{i},y_{i}^{k})\) with cheating potentials \(\phi_{i}(y_{i},y_{i}^{k}):=\zeta |y_{i}-y_{i}^{k}|\), where the constant ζ is sufficiently large (0≪ζ<∞), allows us to treat the SVMstruct with cheating as an ordinary SVMstruct with a modified energy function \(\tilde{E}_{\mathbf{w},\mathbf{v}}(\mathbf{y})\) and an extended weight vector \(\tilde{\mathbf{w}}=[\mathbf{w};\mathbf{u}^{1};..;\mathbf {u}^{K}]\).
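In implementation terms, the cheating potentials simply add a large penalty ζ to the wrong label of every brushed pixel, so that any minimiser of the modified energy agrees with the ground truth there. A minimal sketch for the unary terms (function name and the value of ζ are ours, not the paper's):

```python
import numpy as np

ZETA = 1e6  # "sufficiently large" cheating constant, 0 << zeta < inf

def cheat_unaries(unary, u, y_true, zeta=ZETA):
    """Add cheating potentials phi_i(y_i, y_i^k) = zeta * |y_i - y_i^k| for brushed pixels.

    unary: (n, 2) array, unary[i, l] = cost of assigning label l to pixel i.
    u:     (n,) binary vector of user interactions.
    """
    out = unary.copy()
    clicked = np.flatnonzero(u)
    out[clicked, 1 - y_true[clicked]] += zeta  # penalise the non-ground-truth label
    return out

# pixel 0's data term prefers label 1, but the user brushed it as label 0
unary = np.array([[2.0, 0.5], [0.0, 3.0]])
y_true = np.array([0, 1])
u = np.array([1, 0])
tilde = cheat_unaries(unary, u, y_true)
print(tilde.argmin(axis=1))  # pixel 0 is now forced to its ground-truth label 0
```

Pairwise terms are unaffected, which is what makes the cheating SVMstruct an ordinary SVMstruct over the extended weight vector.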
A second (but closely related) interpretation starts from the fact that the true labeling y^{k} can be regarded as a feature vector^{15} of the image x^{k}. Cheating then amounts to feature selection in a very particular feature space. There is a direct link to multiple kernel learning, a special kind of feature selection.
7.3 Optimisation—Two Strategies
We explored two approaches to minimise o(w,U): (i) coordinate descent and (ii) relaxation by strategies. Note that we experimentally evaluate (Sect. 7.4) only the latter approach.
The idea of (block) coordinate descent is very simple: minimise one variable (block) at a time; upon convergence, a local minimum is reached. In our case, we interleave running cutting planes \(\mathbf{w} \leftarrow\mathbf{w}^{*}_{\mathbf {U}}\) and label descent^{16} U←U+∂o/∂U using the discrete gradient of the pseudo-boolean map U↦o(w,U). Even though the gradient ∂o/∂U can be evaluated efficiently,^{17} we empirically observed that the coupling between the pixels in U is extremely strong, allowing only small steps.^{18}
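The interleaved scheme can be sketched as follows, with a toy surrogate objective standing in for the SVMstruct objective o(w,U) and a trivial solver standing in for the cutting-planes step (all of this is a hypothetical stand-in, not the paper's implementation):

```python
import numpy as np

def coordinate_descent(o, w_solver, U0, max_sweeps=10):
    """Interleave w <- w*_U (exact convex step) with greedy bit-flip descent on U."""
    U = U0.copy()
    w = w_solver(U)
    for _ in range(max_sweeps):
        changed = False
        for idx in np.ndindex(*U.shape):
            before = o(w, U)
            U[idx] ^= 1                     # tentatively flip one interaction bit
            if o(w, U) < before:
                changed = True              # keep the flip
            else:
                U[idx] ^= 1                 # revert
        if not changed:
            break                           # local minimum in U reached
        w = w_solver(U)                     # re-solve the convex w-subproblem
    return w, U

# Toy surrogate: each click costs 0.5, but fewer than two clicks leave the
# "segmentation" under-constrained (penalty 10); convex in w for fixed U.
def o(w, U):
    clicks = U.sum()
    return (w - 1.0) ** 2 + 0.5 * clicks + (10.0 if clicks < 2 else 0.0)

w_solver = lambda U: 1.0                    # exact minimiser of the quadratic term
w, U = coordinate_descent(o, w_solver, np.ones((2, 3), dtype=int))
print(w, int(U.sum()))                      # descends from 6 clicks to the minimal 2
```

On this decoupled toy the bit-flip sweep removes clicks freely; in the real objective the pixels of U are strongly coupled, which is exactly why the descent takes only small steps.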
The overall computational cost is T times the cost of an individual cutting plane optimisation.
7.4 Experiments
8 Conclusion
This paper addressed the problem of evaluating and learning interactive intelligent systems. We performed a user study evaluating different interactive segmentation systems, which provided us with new insights on how users perceive segmentation accuracy and interaction effort. We showed how these insights can be used to build a robot user for training and evaluating interactive systems.
We showed how a simple line-search algorithm can be used to find good parameters for different interactive segmentation systems under a user interaction model. We also compared the performance of the static and dynamic user interaction models. With more parameters, line-search becomes infeasible, leading naturally to the max-margin framework. To overcome this problem, we introduced an extension to SVMstruct which incorporates user interaction models, and showed how to solve the corresponding optimisation problem. We obtained promising results on a small simulated dataset. The main limitation of the max-margin framework is that crucial parts of state-of-the-art segmentation systems (e.g. GCA) cannot be handled. These parts include (1) non-linear parameters, (2) higher-order potentials (e.g. enforcing connectivity) and (3) iterative updates of the unary potentials.
The evaluation and learning of intelligent interactive systems has been largely ignored by the machine learning and computer vision communities. With this paper, our aim is to inspire discussion and new research on this very important problem.
http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm.
This input is used for both comparison and parameter learning, e.g. (Blake et al. 2004; Singaraju et al. 2009).
We started the learning from no initial brushes and let it run for 60 brush strokes. The learned parameters were similar to those obtained when starting from 20 brushes.
Note that one could do even better by looking at two or more consecutive brushes and then selecting the optimal one. However, the number of possible solutions grows exponentially with the number of look-ahead steps.
This behaviour is also observed in our experiments. Note that after each user interaction we obtain the global optimum of our current energy. Also, note that the energy changes with each user interaction.
However, compared to an exhaustive search over all possible joint settings of the parameters, we are not guaranteed to find the global optimum of the objective function.
Note that the high uncertainty of the “tight trimap” learning indicates that this value cannot be trusted very much.
We write images of size (n_{x}×n_{y}×n_{c}) as vectors \(\in\mathbb{R}^{n},\:n=n_{x} n_{y} n_{c}\) for simplicity. All involved operations respect the 2D grid structure absent in general n-vectors.
To our knowledge, there is no simple graph-cut-like algorithm to do the minimisation in U all at once.
Acknowledgement
Christoph Rhemann was supported by the Vienna Science and Technology Fund (WWTF) under project ICT08-019.