Abstract
Multiple instance learning (MIL) is a binary classification problem with loosely supervised data where a class label is assigned only to a bag of instances, indicating the presence or absence of positive instances. In this paper we introduce a novel MIL algorithm using Gaussian processes (GP). The bag labeling protocol of MIL can be effectively modeled by a sigmoid likelihood through the max function over GP latent variables. As the non-differentiable max function makes exact GP inference and learning infeasible, we propose two approximations: the soft-max approximation and the introduction of witness indicator variables. Compared to state-of-the-art MIL approaches, especially those based on the Support Vector Machine, our model enjoys two crucial benefits: (i) the kernel parameters can be learned in a principled manner, thus avoiding grid search and making it possible to exploit a variety of kernel families with complex forms, and (ii) the efficient gradient search for kernel parameter learning effectively leads to feature selection, extracting the most relevant features while discarding noise. We demonstrate that our approaches attain superior or comparable performance to existing methods on several real-world MIL datasets, including large-scale content-based image retrieval problems.
Notes
This paper is an extension of our earlier work (conference paper) published in Kim and De la Torre (2010). We extend the previous work broadly in two aspects: (i) more technical details and complete derivations are provided for Gaussian processes (GP) and our GP-based approaches, which makes the manuscript comprehensive and self-contained, and (ii) the experimental evaluation includes more extensive MIL datasets, including the SIVAL image retrieval database and the drug activity datasets.
We abuse the notation \(f\) to denote either the function or the response variable evaluated at \(\mathbf{x}\) (i.e., \(f(\mathbf{x})\)) interchangeably.
We elaborate on our argument here. It is true that any instance can make a bag positive, but it is the instance with the highest confidence score (what we call the most likely positive instance) that solely determines the bag label. In other words, to affect the label of a bag, an instance must attain the largest confidence score. In formal probability terms, instance \(i\) can determine the bag label only on the events that assign the highest confidence to \(i\). It is also important to note that even though \(\max _{i\in B} f_i\) picks out a single instance, the \(\max \) function examines all instances in the bag to find it.
So, it is also possible to have a probit version of (12), namely \(P(Y_b|\mathbf{f}_b) = \varPhi (Y_b \max _{i\in I_b} f_i)\), where \(\varPhi (\cdot )\) is the cumulative normal function.
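As a small numerical illustration, the sigmoid-of-max bag likelihood in (12) can be evaluated on a toy bag (a sketch only; the function name and values are illustrative, not from the paper's code):

```python
import math

def bag_likelihood(Y_b, f_bag):
    """Sigmoid bag likelihood P(Y_b | f_b) = sigma(Y_b * max_i f_i),
    with Y_b in {+1, -1} and f_bag the latent GP values of the bag's instances."""
    f_max = max(f_bag)  # the most likely positive instance drives the bag label
    return 1.0 / (1.0 + math.exp(-Y_b * f_max))

# A bag containing one confident positive instance (f = 3.0) gets a high
# probability of being labeled positive, regardless of the other instances.
p_pos = bag_likelihood(+1, [-2.0, 0.5, 3.0])
p_neg = bag_likelihood(-1, [-2.0, 0.5, 3.0])
```

Note that the two label probabilities sum to one by construction, since \(\sigma(z) + \sigma(-z) = 1\).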
It is well known that the soft-max provides relatively tight bounds for the \(\max \), \(\max _{i=1}^m z_i \le \log \sum _{i=1}^m \exp (z_i) \le \max _{i=1}^m z_i + \log m\). Another nice property is that the soft-max is a convex function.
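These bounds can be verified numerically with a stable log-sum-exp (a sketch; the input values are arbitrary):

```python
import math

def softmax_bounds(z):
    """Return (max(z), log-sum-exp(z), max(z) + log(m)) to check
    max_i z_i <= log sum_i exp(z_i) <= max_i z_i + log m."""
    m = len(z)
    z_max = max(z)
    # subtract the max before exponentiating for numerical stability
    lse = z_max + math.log(sum(math.exp(zi - z_max) for zi in z))
    return z_max, lse, z_max + math.log(m)

lo, mid, hi = softmax_bounds([1.2, -0.7, 3.5, 3.4])
```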
Although feasible, we do not consider the variational approximation here for simplicity. Unlike in standard GP classification, it is difficult to perform, for instance, the Expectation Propagation (EP) approximation: the moment matching, the core step in EP that minimizes the KL divergence between the marginal posteriors, requires integration over the likelihood function in (13), which calls for further elaboration.
This has a close relation to Gehler and Chapelle (2007)’s DA approach to SVM. Similar to Gehler and Chapelle (2007), one can also consider a scheduled annealing, where the inverse of the smoothness parameter \(\lambda \) in (29) serves as the annealing temperature. See Sect. 4.1 for further details.
Available at http://www.cs.cmu.edu/~juny/MILL.
We only consider binary classification; the extension to multiclass cases is straightforward.
For the sigmoid model, the integration over \(f_*\) cannot be done analytically, and one needs to do further approximation.
References
Andrews S, Tsochantaridis I, Hofmann T (2003) Support vector machines for multiple-instance learning. Adv Neural Inf Process Syst 15:561–568
Antić B, Ommer B (2012) Robust multiple-instance learning with superbags. In: Proceedings of the 11th Asian conference on computer vision
Babenko B, Verma N, Dollár P, Belongie S (2011) Multiple instance learning with manifold bags. In: International conference on machine learning
Chen Y, Bi J, Wang JZ (2006) MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 28(12):1931–1947
Dietterich TG, Lathrop RH, Lozano-Perez T (1997) Solving the multiple-instance problem with axis-parallel rectangles. Artif Intell 89:31–71
Dooly DR, Zhang Q, Goldman SA, Amar RA (2002) Multiple-instance learning of real-valued data. J Mach Learn Res 3:651–678
Friedman J (1999) Greedy function approximation: a gradient boosting machine. Technical Report, Department of Statistics, Stanford University
Fu Z, Robles-Kelly A, Zhou J (2011) MILIS: multiple instance learning with instance selection. IEEE Trans Pattern Anal Mach Intell 33(5):958–977
Gärtner T, Flach PA, Kowalczyk A, Smola AJ (2002) Multi-instance kernels. In: International conference on machine learning
Gehler PV, Chapelle O (2007) Deterministic annealing for multiple-instance learning. In: AI and Statistics (AISTATS)
Kim M, De la Torre F (2010) Gaussian process multiple instance learning. In: International conference on machine learning
Lázaro-Gredilla M, Quinonero-Candela J, Rasmussen CE, Figueiras-Vidal AR (2010) Sparse spectrum Gaussian process regression. J Mach Learn Res 11:1865–1881
Li YF, Kwok JT, Tsang IW, Zhou ZH (2009) A convex method for locating regions of interest with multi-instance learning. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD’09), Bled, Slovenia
Liu G, Wu J, Zhou ZH (2012) Key instance detection in multi-instance learning. In: Proceedings of the 4th Asian conference on machine learning (ACML’12), Singapore
Mangasarian OL, Wild EW (2008) Multiple instance classification via successive linear programming. J Optim Theory Appl 137(3):555–568
Maron O, Lozano-Perez T (1998) A framework for multiple-instance learning. In: Jordan MI, Kearns MJ, Solla SA (eds) Advances in neural information processing systems. MIT Press, Cambridge
Rahmani R, Goldman SA (2006) MISSL: multiple-instance semi-supervised learning. In: International conference on machine learning
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press, Cambridge
Ray S, Page D (2001) Multiple instance regression. In: International conference on machine learning
Raykar VC, Krishnapuram B, Bi J, Dundar M, Rao RB (2008) Bayesian multiple instance learning: automatic feature selection and inductive transfer. In: International conference on machine learning
Scott S, Zhang J, Brown J (2003) On generalized multiple instance learning. Technical Report UNL-CSE-2003-5, Department of Computer Science and Engineering, University of Nebraska, Lincoln
Snelson E, Ghahramani Z (2006) Sparse Gaussian processes using pseudo-inputs. In: Weiss Y, Scholkopf B, Platt J (eds) Advances in neural information processing systems. MIT Press, Cambridge
Tao Q, Scott S, Vinodchandran NV, Osugi TT (2004) SVM-based generalized multiple-instance learning via approximate box counting. In: International conference on machine learning
Viola PA, Platt JC, Zhang C (2005) Multiple instance boosting for object detection. In: Viola P, Platt JC, Zhang C (eds) Advances in neural information processing systems. MIT Press, Cambridge
Wang J, Zucker JD (2000) Solving the multiple-instance problem: a lazy learning approach. In: International conference on machine learning
Wang H, Huang H, Kamangar F, Nie F, Ding C (2011) A maximum margin multi-instance learning framework for image categorization. In: Advances in neural information processing systems
Weidmann N, Frank E, Pfahringer B (2003) A two-level learning method for generalized multi-instance problem. Lecture notes in artificial intelligence, vol 2837. Springer, Berlin
Zhang H, Fritts J (2005) Improved hierarchical segmentation. Technical Report, Washington University in St Louis
Zhang T, Oles FJ (2000) Text categorization based on regularized linear classification methods. Inf Retr 4:5–31
Zhang ML, Zhou ZH (2009) Multi-instance clustering with applications to multi-instance prediction. Appl Intell 31(1):47–68
Zhang Q, Goldman SA, Yu W, Fritts JE (2002) Content-based image retrieval using multiple-instance learning. In: International conference on machine learning
Zhang D, Wang F, Luo S, Li T (2011) Maximum margin multiple instance clustering with its applications to image and text clustering. IEEE Trans Neural Netw 22(5):739–751
Additional information
Responsible editor: Chih-Jen Lin.
Appendix: Gaussian process models and approximate inference
We review the GP regression and classification models as well as the Laplace method for approximate inference.
1.1 Gaussian process regression
When the output variable \(y\) is a real-valued scalar, one natural likelihood model from \(f\) to \(y\) is the additive Gaussian noise model, namely
\( y = f(\mathbf{x}) + \epsilon , \quad \epsilon \sim \mathcal{N}(0, \sigma ^2), \)  (33)
where \(\sigma ^2\), the output noise variance, is another hyperparameter (together with the kernel parameters \(\varvec{\beta }\)). Given the training data \(\{(\mathbf{x}_i,y_i)\}_{i=1}^n\) and with (33), we can compute the posterior distribution of the training latent variables \(\mathbf{f}\) analytically, namely
\( P(\mathbf{f}|\mathbf{y},\mathbf{X}) = \mathcal{N}\big (\mathbf{f};\ \mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y},\ \mathbf{K}-\mathbf{K}(\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{K}\big ). \)
When the posterior of \(\mathbf{f}\) is obtained, we can readily compute the predictive distribution for a test output \(y_*\) at \(\mathbf{x}_*\). We first derive the posterior for \(f_*\), the latent variable at the test point, by marginalizing out \(\mathbf{f}\) as follows:
\( P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \mathcal{N}\big (f_*;\ \mathbf{k}_*^\top (\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{k}_* \big ), \)
where \(\mathbf{k}_* = [k(\mathbf{x}_*,\mathbf{x}_1),\dots ,k(\mathbf{x}_*,\mathbf{x}_n)]^\top \).
Then it is not difficult to see that the posterior for \(y_*\) can be obtained by marginalizing out \(f_*\), namely
\( P(y_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \mathcal{N}\big (y_*;\ \mathbf{k}_*^\top (\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top (\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{k}_* + \sigma ^2 \big ). \)
So far, we have assumed that the hyperparameters of the GP model (denoted by \(\varvec{\theta }=\{{\varvec{\beta }}, \sigma ^2\}\)) are known. However, estimating \(\varvec{\theta }\) from the data, a task often known as GP learning, is a very important issue. Following a fully Bayesian treatment, it would be ideal to place a prior on \(\varvec{\theta }\) and compute its posterior distribution as well; however, due to the difficulty of the integration, this is usually handled by evidence maximization, also referred to as empirical Bayes. The evidence in this case is the data likelihood,
\( P(\mathbf{y}|\mathbf{X},\varvec{\theta }) = \mathcal{N}(\mathbf{y};\ \mathbf{0},\ \mathbf{K}+\sigma ^2\mathbf{I}). \)  (37)
We then maximize (37), which is equivalent to solving the following optimization problem:
\( \min _{\varvec{\theta }}\ \log |\mathbf{K}+\sigma ^2\mathbf{I}| + \mathbf{y}^\top (\mathbf{K}+\sigma ^2\mathbf{I})^{-1}\mathbf{y}. \)  (38)
(38) is non-convex in general, and one can find a local minimum using quasi-Newton or conjugate gradient search.
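As a concrete illustration of the regression equations and the learning objective (38), here is a minimal numpy sketch (illustrative only, not the paper's implementation; the squared-exponential kernel and its length scale are our assumptions):

```python
import numpy as np

def gp_regression(X, y, X_star, sigma2, length_scale=1.0):
    """GP regression with a squared-exponential kernel.

    Returns the predictive mean and variance at X_star, plus the negative
    log marginal likelihood whose minimization corresponds to (38)."""
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    n = len(y)
    Ky = kernel(X, X) + sigma2 * np.eye(n)               # K + sigma^2 I
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^-1 y

    k_star = kernel(X, X_star)
    mean = k_star.T @ alpha                              # predictive mean
    v = np.linalg.solve(L, k_star)
    var = kernel(X_star, X_star).diagonal() - (v * v).sum(0) + sigma2

    # 0.5 y^T (K+s^2 I)^-1 y + 0.5 log|K+s^2 I| + (n/2) log(2 pi)
    nlml = 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)
    return mean, var, nlml

# toy usage: with tiny noise, the predictions at the training inputs interpolate y
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X[:, 0])
mean, var, nlml = gp_regression(X, y, X, sigma2=1e-6)
```

In practice the gradients of the negative log marginal likelihood with respect to the kernel parameters are passed to a quasi-Newton optimizer, as described above.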
1.2 Gaussian process classification
In the classification setting where the output variable \(y\) takes a binaryFootnote 11 value from \(\{+1,-1\}\), we have several choices for the likelihood model \(P(y|f)\). The two most popular ways to link the real-valued variable \(f\) to the binary \(y\) are the sigmoid and the probit:
\( P(y|f) = \frac{1}{1+\exp (-yf)} \quad \text {(sigmoid)}, \qquad P(y|f) = \varPhi (yf) \quad \text {(probit)}, \)
where \(\varPhi (\cdot )\) is the cumulative normal function.
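Both link functions can be sketched in a few lines (illustrative only; `NormalDist().cdf` from the standard library provides \(\varPhi \)):

```python
import math
from statistics import NormalDist

def sigmoid_lik(y, f):
    """Sigmoid link: P(y|f) = 1 / (1 + exp(-y f)) for y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(-y * f))

def probit_lik(y, f):
    """Probit link: P(y|f) = Phi(y f), with Phi the standard normal CDF."""
    return NormalDist().cdf(y * f)

# Under either link the two class probabilities sum to one for any f,
# and f = 0 yields probability 0.5 for each class.
```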
We let \(l(y,f;{\varvec{\gamma }}) = -\log P(y|f,{\varvec{\gamma }})\), where \({\varvec{\gamma }}\) denotes the (hyper)parameters of the likelihood model.Footnote 12 Likewise, we often drop the dependency on \({\varvec{\gamma }}\) for the notational simplicity.
Unlike the regression case, the posterior distribution \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\) unfortunately has no analytic form, as it is a product of the non-Gaussian \(P(\mathbf{y}|\mathbf{f})\) and the Gaussian \(P(\mathbf{f}|\mathbf{X})\). One can consider three standard approximation schemes within the Bayesian framework: (i) the Laplace approximation, (ii) variational methods, and (iii) sampling-based approaches (e.g., MCMC). The third is often avoided due to its computational overhead (as we sample the \(n\)-dimensional vectors \(\mathbf{f}\), where \(n\) is the number of training samples). Below, we briefly review the Laplace approximation.
1.3 Laplace approximation
The Laplace approximation essentially replaces the product \(P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X})\) by a Gaussian with mean equal to the mode of the product, and covariance equal to the inverse Hessian of its negative logarithm evaluated at the mode. More specifically, we let
\( S(\mathbf{f}) = \sum _{i=1}^n l(y_i,f_i) + \frac{1}{2} \mathbf{f}^\top \mathbf{K}^{-1} \mathbf{f}, \)  (40)
which equals \(-\log \big ( P(\mathbf{y}|\mathbf{f}) P(\mathbf{f}|\mathbf{X}) \big )\) up to an additive constant.
Note that (40) is a convex function of \(\mathbf{f}\) since the Hessian of \(S(\mathbf{f})\), \({\varvec{\varLambda }}+\mathbf{K}^{-1}\), is positive definite, where \({\varvec{\varLambda }}\) is the diagonal matrix with entries \([{\varvec{\varLambda }}]_{ii} = \frac{\partial ^2 l(y_i,f_i)}{\partial f_i^2}\). The minimum of \(S(\mathbf{f})\) can be attained by a gradient search. We denote the optimum by \(\mathbf{f}^{MAP} = \arg \min _{\mathbf{f}} S(\mathbf{f})\). Letting \({\varvec{\varLambda }}^{MAP}\) be \({\varvec{\varLambda }}\) evaluated at \(\mathbf{f}^{MAP}\), we approximate \(S(\mathbf{f})\) by the following quadratic function (i.e., a second-order Taylor expansion):
\( S(\mathbf{f}) \approx S(\mathbf{f}^{MAP}) + \frac{1}{2} (\mathbf{f}-\mathbf{f}^{MAP})^\top ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1}) (\mathbf{f}-\mathbf{f}^{MAP}), \)
which essentially leads to a Gaussian approximation for \(P(\mathbf{f}|\mathbf{y},\mathbf{X})\), namely
\( P(\mathbf{f}|\mathbf{y},\mathbf{X}) \approx \mathcal{N}\big (\mathbf{f};\ \mathbf{f}^{MAP},\ ({\varvec{\varLambda }}^{MAP}+\mathbf{K}^{-1})^{-1}\big ). \)
The data likelihood (i.e., evidence) immediately follows from the same approximation,
\( \log P(\mathbf{y}|\mathbf{X}) \approx -S(\mathbf{f}^{MAP}) - \frac{1}{2} \log |\mathbf{I} + \mathbf{K}{\varvec{\varLambda }}^{MAP}|, \)
which can be maximized by gradient search with respect to the hyperparameters \({\varvec{\theta }}=\{{\varvec{\beta }},{\varvec{\gamma }}\}\).
Using the Gaussian approximated posterior, it is easy to derive the predictive distribution for \(f_*\):
\( P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) \approx \mathcal{N}\big (f_*;\ \mathbf{k}_*^\top \mathbf{K}^{-1} \mathbf{f}^{MAP},\ k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^\top \big (\mathbf{K} + ({\varvec{\varLambda }}^{MAP})^{-1}\big )^{-1} \mathbf{k}_* \big ). \)
Finally, the predictive distribution for \(y_*\) can be obtained by marginalizing out \(f_*\), which has a closed-form solution for the probit likelihood model as followsFootnote 13:
\( P(y_*|\mathbf{x}_*,\mathbf{y},\mathbf{X}) = \varPhi \left ( \frac{y_* \bar{\mu }}{\sqrt{1+\bar{\sigma }^2}} \right ), \)
where \(\bar{\mu }\) and \(\bar{\sigma }^2\) are the mean and the variance of \(P(f_*|\mathbf{x}_*,\mathbf{y},\mathbf{X})\), respectively.
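The mode-finding step of the Laplace approximation can be sketched with Newton iterations for the sigmoid likelihood (an illustrative sketch of the standard recipe, not the paper's implementation; the kernel matrix \(\mathbf{K}\) is assumed given):

```python
import numpy as np

def laplace_mode(K, y, n_iter=50):
    """Newton iterations for f^MAP under the sigmoid likelihood,
    minimizing S(f) = sum_i log(1 + exp(-y_i f_i)) + 0.5 f^T K^-1 f."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))   # sigma(f_i)
        g = (y + 1) / 2 - pi            # gradient of log P(y_i|f_i) w.r.t. f_i
        W = pi * (1 - pi)               # Lambda_ii = -d^2 log P(y_i|f_i) / d f_i^2
        b = W * f + g
        # Newton step: f <- (Lambda + K^-1)^-1 (Lambda f + g) = (I + K Lambda)^-1 K b
        f = np.linalg.solve(np.eye(n) + K * W, K @ b)
    return f

# toy usage on a 3-point problem with a squared-exponential kernel
X = np.array([0.0, 1.0, 2.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(3)
y = np.array([1.0, 1.0, -1.0])
f_map = laplace_mode(K, y)
```

At the optimum, the stationarity condition \(\mathbf{f}^{MAP} = \mathbf{K}\, \nabla \log P(\mathbf{y}|\mathbf{f}^{MAP})\) holds, which is a convenient convergence check.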
Cite this article
Kim, M., De la Torre, F. Multiple instance learning via Gaussian processes. Data Min Knowl Disc 28, 1078–1106 (2014). https://doi.org/10.1007/s10618-013-0333-y