
Multiple instance learning via Gaussian processes


Abstract

Multiple instance learning (MIL) is a binary classification problem with loosely supervised data, where a class label is assigned only to a bag of instances, indicating the presence or absence of positive instances. In this paper we introduce a novel MIL algorithm using Gaussian processes (GP). The bag labeling protocol of MIL can be effectively modeled by the sigmoid likelihood through the max function over GP latent variables. As the non-continuous max function makes exact GP inference and learning infeasible, we propose two approximations: the soft-max approximation and the introduction of witness indicator variables. Compared to state-of-the-art MIL approaches, especially those based on the Support Vector Machine, our model enjoys two crucial benefits: (i) the kernel parameters can be learned in a principled manner, avoiding grid search and making it possible to exploit a variety of kernel families with complex forms, and (ii) the efficient gradient search for kernel parameter learning effectively leads to feature selection, extracting the most relevant features while discarding noise. We demonstrate that our approaches attain superior or comparable performance to existing methods on several real-world MIL datasets, including large-scale content-based image retrieval problems.


Notes

  1. This paper is an extension of our earlier work (conference paper) published in Kim and De la Torre (2010). We extend the previous work broadly in two aspects: (i) more technical details and complete derivations are provided for the Gaussian process (GP) and our approaches based on it, which makes the manuscript comprehensive and self-contained, and (ii) the experimental evaluation includes more extensive MIL datasets, including the SIVAL image retrieval database and the drug activity datasets.

  2. We abuse the notation \(f\) to denote either a function or the response variable evaluated at \(\mathbf{x}\) (i.e., \(f(\mathbf{x})\)) interchangeably.

  3. We provide further insight into our argument here. It is true that any instance can make a bag positive, but it is the instance with the highest confidence score (what we call the most likely positive instance) that solely determines the bag label. In other words, to have an effect on the label of a bag, an instance needs to attain the largest confidence score. In formal probability terms, the instance \(i\) can determine the bag label only for the events that assign the highest confidence to \(i\). It is also important to note that even though \(\max _{i\in B} f_i\) indicates a single instance, the \(\max \) function examines all instances in the bag to find it.

  4. So, it is also possible to have a probit version of (12), namely \(P(Y_b|\mathbf{f}_b) = \varPhi (Y_b \max _{i\in I_b} f_i)\), where \(\varPhi (\cdot )\) is the cumulative normal function.

  5. It is well known that the soft-max provides relatively tight bounds for the \(\max \): \(\max _{i=1}^m z_i \le \log \sum _{i=1}^m \exp (z_i) \le \max _{i=1}^m z_i + \log m\) (a numerical check is sketched after these notes). Another nice property is that the soft-max is a convex function.

  6. Although it is feasible, we do not consider the variational approximation here for simplicity. Unlike standard GP classification, it is also difficult to apply, for instance, the Expectation Propagation (EP) approximation, since moment matching, the core step in EP that minimizes the KL divergence between the marginal posteriors, involves integration over the likelihood function in (13), which requires further elaboration.

  7. This is closely related to the deterministic annealing (DA) approach to the SVM of Gehler and Chapelle (2007). Similar to Gehler and Chapelle (2007), one can also consider a scheduled annealing, where the inverse of the smoothness parameter \(\lambda \) in (29) serves as the annealing temperature. See Sect. 4.1 for further details.

  8. Available at http://www.cs.cmu.edu/~juny/MILL.

  9. http://www.cs.columbia.edu/~andrews/mil/datasets.html.

  10. http://www.cs.wustl.edu/~sg/multi-inst-data/.

  11. We only consider binary classification; the extension to multiclass cases is straightforward.

  12. Although the models in (39) involve no parameters, we can always consider more general cases. For instance, one may use a generalized logistic regression (Zhang and Oles 2000): \(P(y|f,\gamma ) \propto \big (\frac{1}{1+ \exp (\gamma (1-yf))}\big )^\gamma \).

  13. For the sigmoid model, the integration over \(f_*\) cannot be done analytically, and one needs a further approximation.
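As a quick numerical check of the soft-max bound in note 5, a minimal sketch (with hypothetical latent scores, using NumPy/SciPy):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical latent scores f_i for the m instances in one bag.
z = np.array([-1.3, 0.4, 2.1, 0.7])
m = len(z)

hard_max = z.max()        # max_i z_i
soft_max = logsumexp(z)   # log sum_i exp(z_i), the soft-max

# max_i z_i <= log sum_i exp(z_i) <= max_i z_i + log m
assert hard_max <= soft_max <= hard_max + np.log(m)
print(hard_max, soft_max, hard_max + np.log(m))   # ~2.10, ~2.48, ~3.49
```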

References

  • Andrews S, Tsochantaridis I, Hofmann T (2003) Support vector machines for multiple-instance learning. Adv Neural Inf Process Syst 15:561–568


  • Antić B, Ommer B (2012) Robust multiple-instance learning with superbags. In: Proceedings of the 11th Asian conference on computer vision

  • Babenko B, Verma N, Dollár P, Belongie S (2011) Multiple instance learning with manifold bags. In: International conference on machine learning

  • Chen Y, Bi J, Wang JZ (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947


  • Dietterich TG, Lathrop RH, Lozano-Perez T (1997) Solving the multiple-instance problem with axis-parallel rectangles. Artif Intell 89:31–71


  • Dooly DR, Zhang Q, Goldman SA, Amar RA (2002) Multiple-instance learning of real-valued data. J Mach Learn Res 3:651–678


  • Friedman J (1999) Greedy function approximation: a gradient boosting machine. Technical Report, Department of Statistics, Stanford University

  • Fu Z, Robles-Kelly A, Zhou J (2011) MILIS: multiple instance learning with instance selection. IEEE Trans Pattern Anal Mach Intell 33(5):958–977


  • Gärtner T, Flach PA, Kowalczyk A, Smola AJ (2002) Multi-instance kernels. In: International conference on machine learning

  • Gehler PV, Chapelle O (2007) Deterministic annealing for multiple-instance learning. In: AI and Statistics (AISTATS)

  • Kim M, De la Torre F (2010) Gaussian process multiple instance learning. In: International conference on machine learning

  • Lázaro-Gredilla M, Quinonero-Candela J, Rasmussen CE, Figueiras-Vidal AR (2010) Sparse spectrum Gaussian process regression. J Mach Learn Res 11:1865–1881


  • Li YF, Kwok JT, Tsang IW, Zhou ZH (2009) A convex method for locating regions of interest with multi-instance learning. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD’09), Bled, Slovenia

  • Liu G, Wu J, Zhou ZH (2012) Key instance detection in multi-instance learning. In: Proceedings of the 4th Asian conference on machine learning (ACML’12), Singapore

  • Mangasarian OL, Wild EW (2008) Multiple instance classification via successive linear programming. J Optim Theory Appl 137(3):555–568


  • Maron O, Lozano-Perez T (1998) A framework for multiple-instance learning. In: Jordan MI, Kearns MJ, Solla SA (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Rahmani R, Goldman SA (2006) MISSL: multiple-instance semi-supervised learning. In: International conference on machine learning

  • Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge


  • Ray S, Page D (2001) Multiple instance regression. In: International conference on machine learning

  • Raykar VC, Krishnapuram B, Bi J, Dundar M, Rao RB (2008) Bayesian multiple instance learning: automatic feature selection and inductive transfer. In: International conference on machine learning

  • Scott S, Zhang J, Brown J (2003) On generalized multiple instance learning. Technical Report UNL-CSE-2003-5, Department of Computer Science and Engineering, University of Nebraska, Lincoln

  • Snelson E, Ghahramani Z (2006) Sparse Gaussian processes using pseudo-inputs. In: Weiss Y, Scholkopf B, Platt J (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Tao Q, Scott S, Vinodchandran NV, Osugi TT (2004) SVM-based generalized multiple-instance learning via approximate box counting. In: International conference on machine learning

  • Viola PA, Platt JC, Zhang C (2005) Multiple instance boosting for object detection. In: Viola P, Platt JC, Zhang C (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Wang J, Zucker JD (2000) Solving the multiple-instance problem: a lazy learning approach. In: International conference on machine learning

  • Wang H, Huang H, Kamangar F, Nie F, Ding C (2011) A maximum margin multi-instance learning framework for image categorization. In: Advances in neural information processing systems

  • Weidmann N, Frank E, Pfahringer B (2003) A two-level learning method for generalized multi-instance problem. Lecture notes in artificial intelligence, vol 2837. Springer, Berlin

  • Zhang H, Fritts J (2005) Improved hierarchical segmentation. Technical Report, Washington University in St Louis

  • Zhang T, Oles FJ (2000) Text categorization based on regularized linear classification methods. Inf Retr 4:5–31


  • Zhang ML, Zhou ZH (2009) Multi-instance clustering with applications to multi-instance prediction. Appl Intell 31(1):47–68


  • Zhang Q, Goldman SA, Yu W, Fritts JE (2002) Content-based image retrieval using multiple-instance learning. In: International conference on machine learning

  • Zhang D, Wang F, Luo S, Li T (2011) Maximum margin multiple instance clustering with its applications to image and text clustering. IEEE Trans Neural Netw 22(5):739–751


Corresponding author

Correspondence to Minyoung Kim.

Additional information

Responsible editor: Chih-Jen Lin.

Appendix: Gaussian process models and approximate inference

We review the GP regression and classification models as well as the Laplace method for approximate inference.

1.1 Gaussian process regression

When the output variable \(y\) is a real-valued scalar, one natural likelihood model from \(f\) to \(y\) is the additive Gaussian noise model, namely

$$\begin{aligned} P(y|f) = \fancyscript{N}(y;f,\sigma ^2) = \frac{1}{\sqrt{2\pi \sigma ^2}} \exp \bigg (-\frac{(y-f)^2}{2\sigma ^2}\bigg ), \end{aligned}$$
(33)

where \(\sigma ^2\), the output noise variance, is an additional hyperparameter (together with the kernel parameters \(\varvec{\beta }\)). Given the training data \(\{(\mathbf{x}_i,y_i)\}_{i=1}^n\) and the likelihood (33), we can compute the posterior distribution of the training latent variables \(\mathbf{f}\) analytically, namely

$$\begin{aligned} P(\mathbf{f}|\mathbf{y},\mathbf{X}) \propto P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X}) = \fancyscript{N}\big (\mathbf{f};\ \mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y},\ (\mathbf{I} - \mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1})\mathbf{K}\big ). \end{aligned}$$
(34)

Once the posterior of \(\mathbf{f}\) is obtained, we can readily compute the predictive distribution for a test output \(y_*\) at \(\mathbf{x}_*\). We first derive the posterior for \(f_*\), the latent variable at the test point, by marginalizing out \(\mathbf{f}\) as follows:

$$\begin{aligned} P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \int P(f_*|\mathbf{x}_*,\mathbf{f},\mathbf{X})\,P(\mathbf{f}|\mathbf{y},\mathbf{X})\, d\mathbf{f} = \fancyscript{N}\big (f_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{k}(\mathbf{x}_*)\big ). \end{aligned}$$
(35)

Then it is not difficult to see that the posterior for \(y_*\) can be obtained by marginalizing out \(f_*\), namely

$$\begin{aligned} P(y_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \int P(y_*|f_*) P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X})\, d f_* = \fancyscript{N}\big (y_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{k}(\mathbf{x}_*) + \sigma ^2\big ). \end{aligned}$$
(36)
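For concreteness, the predictive equations (35) and (36) translate directly into a few lines of code. The sketch below is only an illustration (not the authors' implementation); the squared-exponential kernel, its parameter values, and the 1-D data are hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_regression_predict(X, y, X_star, sigma2=0.1, **kern):
    # Predictive mean and variance of y_* as in (36).
    K = rbf_kernel(X, X, **kern)                 # n x n
    K_star = rbf_kernel(X, X_star, **kern)       # n x m
    K_ss = rbf_kernel(X_star, X_star, **kern)    # m x m
    A = K + sigma2 * np.eye(len(X))              # K + sigma^2 I
    alpha = np.linalg.solve(A, y)                # (K + sigma^2 I)^{-1} y
    mean = K_star.T @ alpha
    cov = K_ss - K_star.T @ np.linalg.solve(A, K_star) + sigma2 * np.eye(len(X_star))
    return mean, np.diag(cov)

# Hypothetical 1-D example.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
mu, var = gp_regression_predict(X, y, np.array([[2.5]]))
```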

So far, we have assumed that the hyperparameters (denoted by \(\varvec{\theta }=\{{\varvec{\beta }}, \sigma ^2\}\)) of the GP model are known. However, estimating \(\varvec{\theta }\) from the data, a task often known as GP learning, is an important issue. In a fully Bayesian treatment it would be ideal to place a prior on \(\varvec{\theta }\) and compute its posterior distribution as well; however, because the required integration is difficult, the problem is usually handled by evidence maximization, also referred to as empirical Bayes. The evidence in this case is the marginal data likelihood,

$$\begin{aligned} P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) = \int P(\mathbf{y}|\mathbf{f},\sigma ^2) P(\mathbf{f}|\mathbf{X},{\varvec{\beta }}) d\mathbf{f} = \fancyscript{N}(\mathbf{y}; \mathbf{0}, \mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I}). \end{aligned}$$
(37)

We then maximize (37), which is equivalent to solving the following optimization problem:

$$\begin{aligned} {\varvec{\theta }}^* = \arg \max _{\varvec{\theta }} \ \log P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) = \arg \min _{\varvec{\theta }} \ \log |\mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I}| + \mathbf{y}^{\top } (\mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I})^{-1} \mathbf{y}. \end{aligned}$$
(38)

Problem (38) is non-convex in general, and one can find a local minimum using quasi-Newton or conjugate gradient search.
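In practice one typically hands (38) to an off-the-shelf optimizer. A minimal sketch under the same hypothetical setup as above (SciPy's L-BFGS-B with numerically approximated gradients and a log-parameterization of the hyperparameters, reusing rbf_kernel, X, and y from the previous sketch):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_evidence(log_theta, X, y):
    # theta = (lengthscale, kernel variance, noise variance sigma^2), optimized in log space.
    lengthscale, variance, sigma2 = np.exp(log_theta)
    A = rbf_kernel(X, X, lengthscale, variance) + sigma2 * np.eye(len(X)) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * (log|K + sigma^2 I| + y^T (K + sigma^2 I)^{-1} y), constants dropped; cf. (38).
    return np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha

res = minimize(neg_log_evidence, x0=np.zeros(3), args=(X, y), method="L-BFGS-B")
lengthscale, variance, sigma2 = np.exp(res.x)
```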

1.2 Gaussian process classification

In the classification setting where the output variable \(y\) takes a binary value from \(\{+1,-1\}\) (see note 11), we have several choices for the likelihood model \(P(y|f)\). The two most popular ways to link the real-valued variable \(f\) to the binary \(y\) are the sigmoid and the probit:

$$\begin{aligned} P(y|f) = \left\{ \begin{array}{ll} \frac{1}{1+\exp (-yf)} &{} \text {(sigmoid)} \\ \varPhi (yf) &{} \text {(probit)} \end{array} \right. \end{aligned}$$
(39)

where \(\varPhi (\cdot )\) is the cumulative normal function.
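Both link functions in (39) are straightforward to evaluate; a small sketch (SciPy's norm.cdf serves as \(\varPhi \)):

```python
import numpy as np
from scipy.stats import norm

def sigmoid_likelihood(y, f):
    # P(y | f) = 1 / (1 + exp(-y f)),  y in {+1, -1}
    return 1.0 / (1.0 + np.exp(-y * f))

def probit_likelihood(y, f):
    # P(y | f) = Phi(y f)
    return norm.cdf(y * f)
```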

We let \(l(y,f;{\varvec{\gamma }}) = -\log P(y|f,{\varvec{\gamma }})\), where \({\varvec{\gamma }}\) denotes the (hyper)parameters of the likelihood model (see note 12). Likewise, we often drop the dependency on \({\varvec{\gamma }}\) for notational simplicity.

Unlike the regression case, the posterior distribution \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\) unfortunately has no analytic form, as it is proportional to the product of the non-Gaussian \(P(\mathbf{y}|\mathbf{f})\) and the Gaussian \(P(\mathbf{f}|\mathbf{X})\). One can consider three standard approximation schemes within the Bayesian framework: (i) the Laplace approximation, (ii) variational methods, and (iii) sampling-based approaches (e.g., MCMC). The third is often avoided due to its computational overhead (we must sample \(\mathbf{f}\), an \(n\)-dimensional vector, where \(n\) is the number of training samples). Below, we briefly review the Laplace approximation.

1.3 Laplace approximation

The Laplace approximation essentially replaces the product \(P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X})\) by a Gaussian whose mean equals the mode of the product and whose covariance equals the inverse Hessian of its negative logarithm evaluated at the mode. More specifically, we let

$$\begin{aligned} S(\mathbf{f})&= -\log \big ( P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X}) \big ) = \sum _{i=1}^n l(y_i,f_i) + \frac{1}{2} \mathbf{f}^{\top } \mathbf{K}^{-1} \mathbf{f} \nonumber \\&+ \frac{1}{2} \log |\mathbf{K}| + \frac{n}{2}\log 2 \pi . \end{aligned}$$
(40)

Note that (40) is a convex function of \(\mathbf{f}\) since the Hessian of \(S(\mathbf{f})\), \({\varvec{\varLambda }}+\mathbf{K}^{-1}\), is positive definite, where \({\varvec{\varLambda }}\) is the diagonal matrix with entries \([{\varvec{\varLambda }}]_{ii} = \frac{\partial ^2 l(y_i,f_i)}{\partial f_i^2}\). The minimum of \(S(\mathbf{f})\) can be attained by gradient search; we denote the optimum by \(\mathbf{f}^{MAP} = \arg \min _{\mathbf{f}} S(\mathbf{f})\). Letting \({\varvec{\varLambda }}^{MAP}\) be \({\varvec{\varLambda }}\) evaluated at \(\mathbf{f}^{MAP}\), we approximate \(S(\mathbf{f})\) by the following quadratic function (i.e., using a second-order Taylor expansion)

$$\begin{aligned} S(\mathbf{f}) \approx S(\mathbf{f}^{MAP}) + \frac{1}{2} (\mathbf{f}-\mathbf{f}^{MAP})^{\top } ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1}) (\mathbf{f}-\mathbf{f}^{MAP}), \end{aligned}$$
(41)

which essentially leads to a Gaussian approximation for \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\), namely

$$\begin{aligned} P(\mathbf{f}|\mathbf{y},\mathbf{X}) \approx \fancyscript{N}(\mathbf{f}; \mathbf{f}^{MAP}, ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1})^{-1}). \end{aligned}$$
(42)
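The quantities \(\mathbf{f}^{MAP}\) and \({\varvec{\varLambda }}^{MAP}\) in (42) are typically found by Newton iteration on \(S(\mathbf{f})\). The sketch below assumes the sigmoid likelihood and uses the numerically stable update of Rasmussen and Williams (2006); it illustrates the idea rather than the authors' exact procedure.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_mode(K, y, n_iter=100):
    # Newton iteration for f^MAP = argmin_f S(f) with the sigmoid likelihood, y in {+1,-1}.
    # A fixed number of iterations is used for simplicity (no convergence test).
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)                          # P(y_i = +1 | f_i)
        grad = (y + 1) / 2.0 - pi                # d log P(y_i | f_i) / d f_i
        W = pi * (1.0 - pi)                      # [Lambda]_ii = d^2 l(y_i, f_i) / d f_i^2
        sqW = np.sqrt(W)
        B = np.eye(n) + sqW[:, None] * K * sqW[None, :]   # I + W^{1/2} K W^{1/2}
        b = W * f + grad
        a = b - sqW * np.linalg.solve(B, sqW * (K @ b))
        f = K @ a                                # Newton step: f <- (K^{-1} + Lambda)^{-1} b
    return f, np.diag(W)                         # f^MAP and Lambda^MAP
```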

The data likelihood (i.e., the evidence) follows immediately from the same approximation,

$$\begin{aligned} P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) \approx \exp (-S(\mathbf{f}^{MAP})) (2\pi )^{n/2} |{\varvec{\varLambda }}^{MAP} + \mathbf{K}_{\varvec{\beta }}^{-1}|^{-1/2}, \end{aligned}$$
(43)

which can be maximized by gradient search with respect to the hyperparameters \({\varvec{\theta }}=\{{\varvec{\beta }},{\varvec{\gamma }}\}\).
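Under the same sigmoid assumption, the negative logarithm of (43), up to additive constants, can be computed from the output of the hypothetical laplace_mode sketch above and then minimized over the hyperparameters with a generic gradient-based optimizer:

```python
import numpy as np

def neg_log_evidence_laplace(K, y):
    # -log of the approximate evidence (43), dropping additive constants.
    f_map, Lam = laplace_mode(K, y)
    W = np.diag(Lam)
    # S(f^MAP) for the sigmoid likelihood (cf. (40)), without the (n/2) log 2*pi term.
    S = (np.sum(np.log1p(np.exp(-y * f_map)))
         + 0.5 * f_map @ np.linalg.solve(K, f_map)
         + 0.5 * np.linalg.slogdet(K)[1])
    # log|Lambda^MAP + K^{-1}| = log|I + W^{1/2} K W^{1/2}| - log|K|
    B = np.eye(len(y)) + np.sqrt(W)[:, None] * K * np.sqrt(W)[None, :]
    return S + 0.5 * (np.linalg.slogdet(B)[1] - np.linalg.slogdet(K)[1])
```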

Using the Gaussian-approximated posterior, it is easy to derive the predictive distribution for \(f_*\):

$$\begin{aligned} P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \fancyscript{N}\big (f_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } \mathbf{K}^{-1} \mathbf{f}^{MAP},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } \big (\mathbf{K}+({\varvec{\varLambda }}^{MAP})^{-1}\big )^{-1} \mathbf{k}(\mathbf{x}_*)\big ). \end{aligned}$$
(44)

Finally, the predictive distribution for \(y_*\) can be obtained by marginalizing out \(f_*\), which has a closed-form solution for the probit likelihood model (see note 13):

$$\begin{aligned} P(y_*=+1|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \varPhi \Big (\frac{\bar{\mu }}{\sqrt{1+\bar{\sigma }^2}}\Big ), \end{aligned}$$
(45)

where \(\bar{\mu }\) and \(\bar{\sigma }^2\) are the mean and the variance of \(P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X})\), respectively.
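Combining (44) and (45) for a single test point gives the predicted class probability. The sketch below again reuses the hypothetical quantities from the previous sketches, with k_star denoting the vector \(\mathbf{k}(\mathbf{x}_*)\) and k_ss the scalar \(k(\mathbf{x}_*,\mathbf{x}_*)\):

```python
import numpy as np
from scipy.stats import norm

def predict_class_proba(K, k_star, k_ss, f_map, Lam):
    # Mean and variance of f_* as in (44).
    mu = k_star @ np.linalg.solve(K, f_map)
    Lam_inv = np.diag(1.0 / np.clip(np.diag(Lam), 1e-12, None))
    var = k_ss - k_star @ np.linalg.solve(K + Lam_inv, k_star)
    # P(y_* = +1 | x_*, y, X) as in (45), probit likelihood.
    return norm.cdf(mu / np.sqrt(1.0 + var))
```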

Cite this article

Kim, M., De la Torre, F. Multiple instance learning via Gaussian processes. Data Min Knowl Disc 28, 1078–1106 (2014). https://doi.org/10.1007/s10618-013-0333-y
