
Multiple instance learning via Gaussian processes


Abstract

Multiple instance learning (MIL) is a binary classification problem with loosely supervised data, where a class label is assigned only to a bag of instances, indicating the presence or absence of positive instances. In this paper we introduce a novel MIL algorithm using Gaussian processes (GP). The bag labeling protocol of MIL can be effectively modeled by the sigmoid likelihood through the max function over GP latent variables. As the non-continuous max function makes exact GP inference and learning infeasible, we propose two approximations: the soft-max approximation and the introduction of witness indicator variables. Compared to state-of-the-art MIL approaches, especially those based on the Support Vector Machine, our model enjoys two crucial benefits: (i) the kernel parameters can be learned in a principled manner, avoiding grid search and making it possible to exploit a variety of kernel families with complex forms, and (ii) the efficient gradient search for kernel parameter learning effectively leads to feature selection, extracting the most relevant features while discarding noise. We demonstrate that our approaches attain superior or comparable performance to existing methods on several real-world MIL datasets, including large-scale content-based image retrieval problems.


Notes

  1. This paper is an extension of our earlier work (conference paper) published in Kim and De la Torre (2010). We extend the previous work broadly in two aspects: (i) more technical details and complete derivations are provided for the Gaussian process (GP) and our approaches based on it, which makes the manuscript comprehensive and self-contained, and (ii) the experimental evaluation includes more extensive MIL datasets, including the SIVAL image retrieval database and the drug activity datasets.

  2. We abuse the notation \(f\) to denote either a function or the response variable evaluated at \(\mathbf{x}\) (i.e., \(f(\mathbf{x})\)) interchangeably.

  3. We provide further insight into our argument here. It is true that any instance can make a bag positive, but it is the instance with the highest confidence score (what we call the most likely positive instance) that solely determines the bag label. In other words, to have an effect on the label of a bag, an instance needs to attain the largest confidence score. In formal probability terms, the instance \(i\) can determine the bag label only for the events that assign the highest confidence to \(i\). It is also important to note that even though \(\max _{i\in B} f_i\) indicates a single instance, the \(\max \) function examines all instances in the bag to find it.

  4. So, it is also possible to have a probit version of (12), namely \(P(Y_b|\mathbf{f}_b) = \varPhi (Y_b \max _{i\in I_b} f_i)\), where \(\varPhi (\cdot )\) is the cumulative normal function.

  5. It is well known that the soft-max provides relatively tight bounds for the \(\max \): \(\max _{i=1}^m z_i \le \log \sum _{i=1}^m \exp (z_i) \le \max _{i=1}^m z_i + \log m\) (a numerical check is sketched after these notes). Another nice property is that the soft-max is a convex function.

  6. Although it is feasible, we do not consider the variational approximation here for simplicity. Unlike standard GP classification, it is also difficult to apply, for instance, the Expectation Propagation (EP) approximation, since moment matching, the core step in EP that minimizes the KL divergence between the marginal posteriors, involves integration over the likelihood function in (13), which requires further elaboration.

  7. This is closely related to the deterministic annealing (DA) approach to the SVM of Gehler and Chapelle (2007). Similar to Gehler and Chapelle (2007), one can also consider a scheduled annealing, where the inverse of the smoothness parameter \(\lambda \) in (29) serves as the annealing temperature. See Sect. 4.1 for further details.

  8. Available at http://www.cs.cmu.edu/~juny/MILL.

  9. http://www.cs.columbia.edu/~andrews/mil/datasets.html.

  10. http://www.cs.wustl.edu/~sg/multi-inst-data/.

  11. We only consider binary classification; the extension to multiclass cases is straightforward.

  12. Although the models in (39) involve no parameters, we can always consider more general cases. For instance, one may use a generalized logistic regression (Zhang and Oles 2000): \(P(y|f,\gamma ) \propto \big (\frac{1}{1+ \exp (\gamma (1-yf))}\big )^\gamma \).

  13. For the sigmoid model, the integration over \(f_*\) cannot be done analytically, and one needs a further approximation.
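As a quick numerical check of the soft-max bound in note 5, a minimal sketch (with hypothetical latent scores, using NumPy/SciPy):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical latent scores f_i for the m instances in one bag.
z = np.array([-1.3, 0.4, 2.1, 0.7])
m = len(z)

hard_max = z.max()        # max_i z_i
soft_max = logsumexp(z)   # log sum_i exp(z_i), the soft-max

# max_i z_i <= log sum_i exp(z_i) <= max_i z_i + log m
assert hard_max <= soft_max <= hard_max + np.log(m)
print(hard_max, soft_max, hard_max + np.log(m))   # ~2.10, ~2.48, ~3.49
```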

References

  • Andrews S, Tsochantaridis I, Hofmann T (2003) Support vector machines for multiple-instance learning. Adv Neural Inf Process Syst 15:561–568


  • Antić B, Ommer B (2012) Robust multiple-instance learning with superbags. In: Proceedings of the 11th Asian conference on computer vision

  • Babenko B, Verma N, Dollár P, Belongie S (2011) Multiple instance learning with manifold bags. In: International conference on machine learning

  • Chen Y, Bi J, Wang JZ (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947


  • Dietterich TG, Lathrop RH, Lozano-Perez T (1997) Solving the multiple-instance problem with axis-parallel rectangles. Artif Intell 89:31–71


  • Dooly DR, Zhang Q, Goldman SA, Amar RA (2002) Multiple-instance learning of real-valued data. J Mach Learn Res 3:651–678


  • Friedman J (1999) Greedy function approximation: a gradient boosting machine. Technical Report, Department of Statistics, Stanford University

  • Fu Z, Robles-Kelly A, Zhou J (2011) MILIS: multiple instance learning with instance selection. IEEE Trans Pattern Anal Mach Intell 33(5):958–977


  • Gärtner T, Flach PA, Kowalczyk A, Smola AJ (2002) Multi-instance kernels. In: International conference on machine learning

  • Gehler PV, Chapelle O (2007) Deterministic annealing for multiple-instance learning. In: AI and Statistics (AISTATS)

  • Kim M, De la Torre F (2010) Gaussian process multiple instance learning. In: International conference on machine learning

  • Lázaro-Gredilla M, Quinonero-Candela J, Rasmussen CE, Figueiras-Vidal AR (2010) Sparse spectrum Gaussian process regression. J Mach Learn Res 11:1865–1881


  • Li YF, Kwok JT, Tsang IW, Zhou ZH (2009) A convex method for locating regions of interest with multi-instance learning. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD’09), Bled, Slovenia

  • Liu G, Wu J, Zhou ZH (2012) Key instance detection in multi-instance learning. In: Proceedings of the 4th Asian conference on machine learning (ACML’12), Singapore

  • Mangasarian OL, Wild EW (2008) Multiple instance classification via successive linear programming. J Optim Theory Appl 137(3):555–568


  • Maron O, Lozano-Perez T (1998) A framework for multiple-instance learning. In: Jordan MI, Kearns MJ, Solla SA (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Rahmani R, Goldman SA (2006) MISSL: multiple-instance semi-supervised learning. In: International conference on machine learning

  • Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge


  • Ray S, Page D (2001) Multiple instance regression. In: International conference on machine learning

  • Raykar VC, Krishnapuram B, Bi J, Dundar M, Rao RB (2008) Bayesian multiple instance learning: automatic feature selection and inductive transfer. In: International conference on machine learning

  • Scott S, Zhang J, Brown J (2003) On generalized multiple instance learning. Technical Report UNL-CSE-2003-5, Department of Computer Science and Engineering, University of Nebraska, Lincoln

  • Snelson E, Ghahramani Z (2006) Sparse Gaussian processes using pseudo-inputs. In: Weiss Y, Scholkopf B, Platt J (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Tao Q, Scott S, Vinodchandran NV, Osugi TT (2004) SVM-based generalized multiple-instance learning via approximate box counting. In: International conference on machine learning

  • Viola PA, Platt JC, Zhang C (2005) Multiple instance boosting for object detection. In: Viola P, Platt JC, Zhang C (eds) Advances in neural information processing systems. MIT Press, Cambridge

  • Wang J, Zucker JD (2000) Solving the multiple-instance problem: a lazy learning approach. In: International conference on machine learning

  • Wang H, Huang H, Kamangar F, Nie F, Ding C (2011) A maximum margin multi-instance learning framework for image categorization. In: Advances in neural information processing systems

  • Weidmann N, Frank E, Pfahringer B (2003) A two-level learning method for generalized multi-instance problem. Lecture notes in artificial intelligence, vol 2837. Springer, Berlin

  • Zhang H, Fritts J (2005) Improved hierarchical segmentation. Technical Report, Washington University in St Louis

  • Zhang T, Oles FJ (2000) Text categorization based on regularized linear classification methods. Inf Retr 4:5–31


  • Zhang ML, Zhou ZH (2009) Multi-instance clustering with applications to multi-instance prediction. Appl Intell 31(1):47–68


  • Zhang Q, Goldman SA, Yu W, Fritts JE (2002) Content-based image retrieval using multiple-instance learning. In: International conference on machine learning

  • Zhang D, Wang F, Luo S, Li T (2011) Maximum margin multiple instance clustering with its applications to image and text clustering. IEEE Trans Neural Netw 22(5):739–751


Corresponding author

Correspondence to Minyoung Kim.

Additional information

Responsible editor: Chih-Jen Lin.

Appendix: Gaussian process models and approximate inference

We review the GP regression and classification models as well as the Laplace method for approximate inference.

1.1 Gaussian process regression

When the output variable \(y\) is a real-valued scalar, one natural likelihood model from \(f\) to \(y\) is the additive Gaussian noise model, namely

$$\begin{aligned} P(y|f) = \fancyscript{N}(y;f,\sigma ^2) = \frac{1}{\sqrt{2\pi \sigma ^2}} \exp \bigg (-\frac{(y-f)^2}{2\sigma ^2}\bigg ), \end{aligned}$$
(33)

where \(\sigma ^2\), the output noise variance, is an additional hyperparameter (together with the kernel parameters \(\varvec{\beta }\)). Given the training data \(\{(\mathbf{x}_i,y_i)\}_{i=1}^n\) and the likelihood (33), we can compute the posterior distribution of the training latent variables \(\mathbf{f}\) analytically, namely

$$\begin{aligned} P(\mathbf{f}|\mathbf{y},\mathbf{X}) \propto P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X}) = \fancyscript{N}\big (\mathbf{f};\ \mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y},\ (\mathbf{I} - \mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1})\mathbf{K}\big ). \end{aligned}$$
(34)

Once the posterior of \(\mathbf{f}\) is obtained, we can readily compute the predictive distribution for a test output \(y_*\) at \(\mathbf{x}_*\). We first derive the posterior for \(f_*\), the latent variable at the test point, by marginalizing out \(\mathbf{f}\) as follows:

$$\begin{aligned} P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \int P(f_*|\mathbf{x}_*,\mathbf{f},\mathbf{X})\,P(\mathbf{f}|\mathbf{y},\mathbf{X})\, d\mathbf{f} = \fancyscript{N}\big (f_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{k}(\mathbf{x}_*)\big ). \end{aligned}$$
(35)

Then it is not difficult to see that the posterior for \(y_*\) can be obtained by marginalizing out \(f_*\), namely

$$\begin{aligned} P(y_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \int P(y_*|f_*) P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X})\, d f_* = \fancyscript{N}\big (y_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } (\mathbf{K}+\sigma ^2\mathbf{I})^{-1} \mathbf{k}(\mathbf{x}_*) + \sigma ^2\big ). \end{aligned}$$
(36)
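For concreteness, the predictive equations (35) and (36) translate directly into a few lines of code. The sketch below is only an illustration (not the authors' implementation); the squared-exponential kernel, its parameter values, and the 1-D data are hypothetical choices.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_regression_predict(X, y, X_star, sigma2=0.1, **kern):
    # Predictive mean and variance of y_* as in (36).
    K = rbf_kernel(X, X, **kern)                 # n x n
    K_star = rbf_kernel(X, X_star, **kern)       # n x m
    K_ss = rbf_kernel(X_star, X_star, **kern)    # m x m
    A = K + sigma2 * np.eye(len(X))              # K + sigma^2 I
    alpha = np.linalg.solve(A, y)                # (K + sigma^2 I)^{-1} y
    mean = K_star.T @ alpha
    cov = K_ss - K_star.T @ np.linalg.solve(A, K_star) + sigma2 * np.eye(len(X_star))
    return mean, np.diag(cov)

# Hypothetical 1-D example.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
mu, var = gp_regression_predict(X, y, np.array([[2.5]]))
```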

So far, we have assumed that the hyperparameters (denoted by \(\varvec{\theta }=\{{\varvec{\beta }}, \sigma ^2\}\)) of the GP model are known. However, estimating \(\varvec{\theta }\) from the data, a task often known as GP learning, is an important issue. In a fully Bayesian treatment it would be ideal to place a prior on \(\varvec{\theta }\) and compute its posterior distribution as well; however, because the required integration is difficult, the problem is usually handled by evidence maximization, also referred to as empirical Bayes. The evidence in this case is the marginal data likelihood,

$$\begin{aligned} P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) = \int P(\mathbf{y}|\mathbf{f},\sigma ^2) P(\mathbf{f}|\mathbf{X},{\varvec{\beta }}) d\mathbf{f} = \fancyscript{N}(\mathbf{y}; \mathbf{0}, \mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I}). \end{aligned}$$
(37)

We then maximize (37), which is equivalent to solving the following optimization problem:

$$\begin{aligned} {\varvec{\theta }}^* = \arg \max _{\varvec{\theta }} \ \log P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) = \arg \min _{\varvec{\theta }} \ \log |\mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I}| + \mathbf{y}^{\top } (\mathbf{K}_{\varvec{\beta }} + \sigma ^2\mathbf{I})^{-1} \mathbf{y}. \end{aligned}$$
(38)

Problem (38) is non-convex in general, and one can find a local minimum using quasi-Newton or conjugate gradient search.
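In practice one typically hands (38) to an off-the-shelf optimizer. A minimal sketch under the same hypothetical setup as above (SciPy's L-BFGS-B with numerically approximated gradients and a log-parameterization of the hyperparameters, reusing rbf_kernel, X, and y from the previous sketch):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_evidence(log_theta, X, y):
    # theta = (lengthscale, kernel variance, noise variance sigma^2), optimized in log space.
    lengthscale, variance, sigma2 = np.exp(log_theta)
    A = rbf_kernel(X, X, lengthscale, variance) + sigma2 * np.eye(len(X)) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * (log|K + sigma^2 I| + y^T (K + sigma^2 I)^{-1} y), constants dropped; cf. (38).
    return np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha

res = minimize(neg_log_evidence, x0=np.zeros(3), args=(X, y), method="L-BFGS-B")
lengthscale, variance, sigma2 = np.exp(res.x)
```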

1.2 Gaussian process classification

In the classification setting where the output variable \(y\) takes a binary value from \(\{+1,-1\}\) (see note 11), we have several choices for the likelihood model \(P(y|f)\). The two most popular ways to link the real-valued variable \(f\) to the binary \(y\) are the sigmoid and the probit:

$$\begin{aligned} P(y|f) = \left\{ \begin{array}{ll} \frac{1}{1+\exp (-yf)} &{} \text {(sigmoid)} \\ \varPhi (yf) &{} \text {(probit)} \end{array} \right. \end{aligned}$$
(39)

where \(\varPhi (\cdot )\) is the cumulative normal function.
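Both link functions in (39) are straightforward to evaluate; a small sketch (SciPy's norm.cdf serves as \(\varPhi \)):

```python
import numpy as np
from scipy.stats import norm

def sigmoid_likelihood(y, f):
    # P(y | f) = 1 / (1 + exp(-y f)),  y in {+1, -1}
    return 1.0 / (1.0 + np.exp(-y * f))

def probit_likelihood(y, f):
    # P(y | f) = Phi(y f)
    return norm.cdf(y * f)
```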

We let \(l(y,f;{\varvec{\gamma }}) = -\log P(y|f,{\varvec{\gamma }})\), where \({\varvec{\gamma }}\) denotes the (hyper)parameters of the likelihood model (see note 12). Likewise, we often drop the dependency on \({\varvec{\gamma }}\) for notational simplicity.

Unlike the regression case, the posterior distribution \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\) unfortunately has no analytic form, as it is proportional to the product of the non-Gaussian \(P(\mathbf{y}|\mathbf{f})\) and the Gaussian \(P(\mathbf{f}|\mathbf{X})\). One can consider three standard approximation schemes within the Bayesian framework: (i) the Laplace approximation, (ii) variational methods, and (iii) sampling-based approaches (e.g., MCMC). The third is often avoided due to its computational overhead (we must sample \(\mathbf{f}\), an \(n\)-dimensional vector, where \(n\) is the number of training samples). Below, we briefly review the Laplace approximation.

1.3 Laplace approximation

The Laplace approximation essentially replaces the product \(P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X})\) by a Gaussian whose mean equals the mode of the product and whose covariance equals the inverse Hessian of its negative logarithm evaluated at the mode. More specifically, we let

$$\begin{aligned} S(\mathbf{f})&= -\log \big ( P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X}) \big ) = \sum _{i=1}^n l(y_i,f_i) + \frac{1}{2} \mathbf{f}^{\top } \mathbf{K}^{-1} \mathbf{f} \nonumber \\&+ \frac{1}{2} \log |\mathbf{K}| + \frac{n}{2}\log 2 \pi . \end{aligned}$$
(40)

Note that (40) is a convex function of \(\mathbf{f}\) since the Hessian of \(S(\mathbf{f})\), \({\varvec{\varLambda }}+\mathbf{K}^{-1}\), is positive definite, where \({\varvec{\varLambda }}\) is the diagonal matrix with entries \([{\varvec{\varLambda }}]_{ii} = \frac{\partial ^2 l(y_i,f_i)}{\partial f_i^2}\). The minimum of \(S(\mathbf{f})\) can be attained by gradient search; we denote the optimum by \(\mathbf{f}^{MAP} = \arg \min _{\mathbf{f}} S(\mathbf{f})\). Letting \({\varvec{\varLambda }}^{MAP}\) be \({\varvec{\varLambda }}\) evaluated at \(\mathbf{f}^{MAP}\), we approximate \(S(\mathbf{f})\) by the following quadratic function (i.e., using a second-order Taylor expansion)

$$\begin{aligned} S(\mathbf{f}) \approx S(\mathbf{f}^{MAP}) + \frac{1}{2} (\mathbf{f}-\mathbf{f}^{MAP})^{\top } ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1}) (\mathbf{f}-\mathbf{f}^{MAP}), \end{aligned}$$
(41)

which essentially leads to a Gaussian approximation for \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\), namely

$$\begin{aligned} P(\mathbf{f}|\mathbf{y},\mathbf{X}) \approx \fancyscript{N}(\mathbf{f}; \mathbf{f}^{MAP}, ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1})^{-1}). \end{aligned}$$
(42)
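The quantities \(\mathbf{f}^{MAP}\) and \({\varvec{\varLambda }}^{MAP}\) in (42) are typically found by Newton iteration on \(S(\mathbf{f})\). The sketch below assumes the sigmoid likelihood and uses the numerically stable update of Rasmussen and Williams (2006); it illustrates the idea rather than the authors' exact procedure.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_mode(K, y, n_iter=100):
    # Newton iteration for f^MAP = argmin_f S(f) with the sigmoid likelihood, y in {+1,-1}.
    # A fixed number of iterations is used for simplicity (no convergence test).
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = sigmoid(f)                          # P(y_i = +1 | f_i)
        grad = (y + 1) / 2.0 - pi                # d log P(y_i | f_i) / d f_i
        W = pi * (1.0 - pi)                      # [Lambda]_ii = d^2 l(y_i, f_i) / d f_i^2
        sqW = np.sqrt(W)
        B = np.eye(n) + sqW[:, None] * K * sqW[None, :]   # I + W^{1/2} K W^{1/2}
        b = W * f + grad
        a = b - sqW * np.linalg.solve(B, sqW * (K @ b))
        f = K @ a                                # Newton step: f <- (K^{-1} + Lambda)^{-1} b
    return f, np.diag(W)                         # f^MAP and Lambda^MAP
```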

The data likelihood (i.e., the evidence) follows immediately from the same approximation,

$$\begin{aligned} P(\mathbf{y}|\mathbf{X},{\varvec{\theta }}) \approx \exp (-S(\mathbf{f}^{MAP})) (2\pi )^{n/2} |{\varvec{\varLambda }}^{MAP} + \mathbf{K}_{\varvec{\beta }}^{-1}|^{-1/2}, \end{aligned}$$
(43)

which can be maximized by gradient search with respect to the hyperparameters \({\varvec{\theta }}=\{{\varvec{\beta }},{\varvec{\gamma }}\}\).
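Under the same sigmoid assumption, the negative logarithm of (43), up to additive constants, can be computed from the output of the hypothetical laplace_mode sketch above and then minimized over the hyperparameters with a generic gradient-based optimizer:

```python
import numpy as np

def neg_log_evidence_laplace(K, y):
    # -log of the approximate evidence (43), dropping additive constants.
    f_map, Lam = laplace_mode(K, y)
    W = np.diag(Lam)
    # S(f^MAP) for the sigmoid likelihood (cf. (40)), without the (n/2) log 2*pi term.
    S = (np.sum(np.log1p(np.exp(-y * f_map)))
         + 0.5 * f_map @ np.linalg.solve(K, f_map)
         + 0.5 * np.linalg.slogdet(K)[1])
    # log|Lambda^MAP + K^{-1}| = log|I + W^{1/2} K W^{1/2}| - log|K|
    B = np.eye(len(y)) + np.sqrt(W)[:, None] * K * np.sqrt(W)[None, :]
    return S + 0.5 * (np.linalg.slogdet(B)[1] - np.linalg.slogdet(K)[1])
```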

Using the Gaussian-approximated posterior, it is easy to derive the predictive distribution for \(f_*\):

$$\begin{aligned} P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \fancyscript{N}\big (f_*;\ \mathbf{k}(\mathbf{x}_*)^{\top } \mathbf{K}^{-1} \mathbf{f}^{MAP},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^{\top } \big (\mathbf{K}+({\varvec{\varLambda }}^{MAP})^{-1}\big )^{-1} \mathbf{k}(\mathbf{x}_*)\big ). \end{aligned}$$
(44)

Finally, the predictive distribution for \(y_*\) can be obtained by marginalizing out \(f_*\), which has a closed-form solution for the probit likelihood model (see note 13):

$$\begin{aligned} P(y_*=+1|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \varPhi \Big (\frac{\bar{\mu }}{\sqrt{1+\bar{\sigma }^2}}\Big ), \end{aligned}$$
(45)

where \(\bar{\mu }\) and \(\bar{\sigma }^2\) are the mean and the variance of \(P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X})\), respectively.
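Combining (44) and (45) for a single test point gives the predicted class probability. The sketch below again reuses the hypothetical quantities from the previous sketches, with k_star denoting the vector \(\mathbf{k}(\mathbf{x}_*)\) and k_ss the scalar \(k(\mathbf{x}_*,\mathbf{x}_*)\):

```python
import numpy as np
from scipy.stats import norm

def predict_class_proba(K, k_star, k_ss, f_map, Lam):
    # Mean and variance of f_* as in (44).
    mu = k_star @ np.linalg.solve(K, f_map)
    Lam_inv = np.diag(1.0 / np.clip(np.diag(Lam), 1e-12, None))
    var = k_ss - k_star @ np.linalg.solve(K + Lam_inv, k_star)
    # P(y_* = +1 | x_*, y, X) as in (45), probit likelihood.
    return norm.cdf(mu / np.sqrt(1.0 + var))
```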

Cite this article

Kim, M., De la Torre, F. Multiple instance learning via Gaussian processes. Data Min Knowl Disc 28, 1078–1106 (2014). https://doi.org/10.1007/s10618-013-0333-y
