Bayesian Nonlinear Support Vector Machines for Big Data

Wenzel, Florian; Galy-Fajou, Théo; Deutsch, Matthäus; Kloft, Marius

doi:10.1007/978-3-319-71249-9_19

Florian Wenzel¹⁸,
Théo Galy-Fajou¹⁸,
Matthäus Deutsch¹⁹ &
…
Marius Kloft¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10534))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

4733 Accesses
10 Citations
3 Altmetric

Abstract

We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.

Code related to this chapter is available at: https://doi.org/10.6084/m9.figshare.5443627

Data related to this chapter are available at: https://doi.org/10.6084/m9.figshare.5443624 and https://doi.org/10.6084/m9.figshare.5443621

You have full access to this open access chapter, Download conference paper PDF

An Empirical Comparison of Support Vector Machines Versus Nearest Neighbour Methods for Machine Learning Applications

A Note on Parameter Selection for Support Vector Machines

The comparative efficiency of algorithms for the construction of support-vector machines for regression reconstruction tasks

Article 01 November 2015

Keywords

1 Introduction

Statistical machine learning branches into two classic strands of research: Bayesian and frequentist. In the classic supervised learning setting, both paradigms aim to find, based on training data, a function $f_\beta $ that predicts well on yet unseen test data. The difference in the Bayesian and frequentist approach lies in the treatment of the parameter vector $\beta $ of this function. In the frequentist setting, we select the parameter $\beta $ that minimizes a certain loss given the training data, from a restricted set $\mathcal B$ of limited complexity. In the Bayesian school of thinking, we express our prior belief about the parameter, in the form of a probability distribution over the parameter vector. When we observe data, we adapt our belief, resulting in a posterior distribution over $\beta $

Advantages of the Bayesian approach include automatic treatment of hyperparameters and direct quantification of the uncertainty^{Footnote 1} of the prediction in the form of class membership probabilities which can be of tremendous importance in practice. As examples consider the following. (1) We have collected blood samples of cancer patients and controls. The aim is to screen individuals that have increased likelihood of developing cancer. The knowledge of the uncertainty in those predictions is invaluable to clinicians. (2) In the domain of physics it is important to have a sense about the certainty level of predictions since it is mandatory to assert the statistical confidence in any physical variable measurement. (3) In the general context of decision making, it is crucial that the uncertainty of the estimated outcome of an action can be reliably determined.

Recently, it was shown that the support vector machine (SVM) [1]—which is a classic supervised classification algorithm— admits a Bayesian interpretation through the technique of data augmentation [2, 3]. This so-called Bayesian nonlinear SVM combines the best of both worlds: it inherits the geometric interpretation, its robustness against outliers, state-of-the-art accuracy [4], and theoretical error guarantees [5] from the frequentist formulation of the SVM, but like Bayesian methods it also allows for flexible feature modeling, automatic hyperparameter tuning, and predictive uncertainty quantification.

However, existing inference methods for the Bayesian support vector machine (such as the expectation conditional maximization method introduced in [3]) scale rather poorly with the number of samples and are limited in application to datasets with thousands of data points [3]. Based on stochastic variational inference [6] and inducing points [7], we develop in this paper a fast and scalable inference method for the nonlinear Bayesian SVM.

Our experiments show superior performance of our method over competing methods for uncertainty quantification of SVMs such as Platt’s method [8]. Furthermore, we show that our approach is faster (by one to three orders of magnitude) than the following competitors: expectation conditional maximization (ECM) for nonlinear Bayesian SVM by [3], Gaussian process classification [9], and the recently proposed scalable variational Gaussian process classification method [10]. We apply our method to the domain of particle physics, namely on the SUSY dataset [11] (a standard benchmark in particle physics containing 5 million data points) where our method takes only 10 min to train on a single CPU machine.

Our experiments demonstrate that Bayesian inference techniques are mature enough to compete with corresponding frequentist approaches (such as nonlinear SVMs) in terms of scalability to big data, yet they offer additional benefits such as uncertainty estimation and automated hyperparameter search.

Our paper is structured as follows. In Sect. 2 we discuss related work and review the Bayesian nonlinear SVM model in Sect. 3. In Sect. 4 we propose our novel scalable inference algorithm, show how to optimize hyperparameters and obtain an approximate predictive distribution. We discuss also the special case of the linear SVM, for which we propose a specially tailored fast inference algorithm. Section 5 concludes with experimental results.

2 Related Work

There has recently been significant interest in utilizing max-margin based discriminative Bayesian models for various applications. For example, [12] employs a max-margin based Bayesian classification to discover latent semantic structures for topic models, [13] uses a max-margin approach for efficient Bayesian matrix factorization, and [14] develops a new max-margin approach to Hidden Markov models.

All these approaches apply the Bayesian reformulation of the classic SVM introduced by [2]. This model is extended by [3] to the nonlinear case. The authors show improved accuracy compared to standard methods such as (non-Bayesian) SVMs and Gaussian process (GP) classification.

However, the inference methods proposed in [2, 3] have the drawback that they partially rely on point estimates of the latent variables and do not scale well to large datasets. In [15] the authors apply mean field variational inference to the linear case of the model, but their proposed technique does not lead to substantial performance improvements and neglects the nonlinear model.

Uncertainty estimation for SVMs is usually done via Platt’s technique [8], which consists of applying a logistic regression on the function scores produced by the SVM. In contrast, our technique directly yields a sound predictive distribution instead of using a heuristically motivated transformation. We make use of the idea of inducing point GPs to develop a scalable inference method for the Bayesian nonlinear SVM. Sparse GPs using pseudo-inputs were first introduced in [16]. Building on this idea Hensman et al. developed a stochastic variational inference scheme for GP regression and GP classification [7, 10]. We further extend this ideas to the setting of Bayesian nonlinear SVM.

3 The Bayesian SVM Model

Let $\mathcal{D} = \left\{ x_i, y_i\right\} _{i=1}^n$ be n observations where $x_i \in {\mathbb {R}}^d$ is a feature vector with corresponding labels $y_i \in \{-1,1\}$. The SVM aims to find an optimal score function f by solving the following regularized risk minimization objective:

$$\begin{aligned} \arg \min _{f} \; \gamma R\left( f\right) + \sum _{i=1}^n \max \left( 0,1-y_i f(x_i)\right) , \end{aligned}$$

(1)

where R is a regularizer function controlling the complexity of the decision function f, and $\gamma $ is a hyperparameter to adjust the trade-off between training error and the complexity of f. The loss $\max \left( 0,1-yf(x)\right) $ is called hinge loss. The classifier is then defined as $\text {sign}(f(x))$.

For the case of a linear decision function, i.e. $f(x)=x^T\beta $, the SVM optimization problem (1) is equivalent to estimating the mode of a pseudo-posterior

$$\begin{aligned} p(\beta | \mathcal{D}) \propto \prod _{i=1}^n L(y_i | x_i, \beta ) p(\beta ). \end{aligned}$$

Here $p(\beta )$ denotes a prior such that $\log p(\beta ) \propto -2 \gamma R(\beta )$. In the following we use the prior $\beta \sim \mathcal{N}(0,\varSigma )$, where $\varSigma \in {\mathbb {R}} ^{d\times d}$ is a positive definite matrix. From a frequentist SVM view, this choice generalizes the usual $L^2$-regularization to non-isotropic regularizers. Note that our proposed framework can be easily extended to other regularization techniques by adjusting the prior on $\beta $ (e.g. block $\ell _{(2,p)}$-norm regularization which is known as multiple kernel learning [17]). In order to obtain a Bayesian interpretation of the SVM, we need to define a pseudolikelihood L such that the following holds,

$$\begin{aligned} L\left( y | x, f(\cdot ) \right) \propto \exp \left( -2 \max (1-y_if(x_i),0) \right) . \end{aligned}$$

(2)

By introducing latent variables $\lambda := (\lambda _1,\dots ,\lambda _n)^\top $ (data augmentation) and making use of integral identities stemming from function theory, [2] show that the specification of L in terms of the following marginal distribution satisfies (2):

$$\begin{aligned} L(y_i\vert x_i,\beta ) = \int _0^\infty \frac{1 }{\sqrt{2\pi \lambda _i }}\exp \left( -\frac{1 }{2}\frac{\left( 1+\lambda _i -y_ix_i^T\beta \right) ^2 }{\lambda _i}\right) \mathrm {d}\lambda _i. \end{aligned}$$

(3)

Writing $X\in \mathbb {R}^{d\times n}$ for the matrix of data points and $Y = {{\mathrm{diag}}}(y)$, the full conditional distributions of this model are

$$\begin{aligned} \begin{aligned} \beta | \lambda , \varSigma , \mathcal{D}&\sim \mathcal{N}\left( B(\lambda ^{-1}+1),\, B\right) ,\\ \lambda _i | \beta , \mathcal{D}_i&\sim \mathcal {GIG}\left( 1/2, 1, (1-y_i x_i^\top \beta )^2\right) , \end{aligned} \end{aligned}$$

(4)

with $Z=YX$, $B^{-1}=Z\varLambda ^{-1}Z^\top + \varSigma ^{-1}$, $\varLambda = {{\mathrm{diag}}}(\lambda )$ and where $\mathcal {GIG}$ denotes a generalized inverse Gaussian distribution. The n latent variables $\lambda _i$ of the model scale the variance of the full posteriors locally. The model thus constitutes a special case of a normal variance-mean mixture, where we implicitly impose the improper prior $p(\lambda )=\mathbbm {1}_{[0,\infty )}(\lambda )$ on $\lambda $. This could be generalized by using a generalized inverse Gaussian prior on $\lambda _i$, leading to a conjugate model for $\lambda _i$. Henao et al. show that in the case of an exponential prior on $\lambda _i$, this leads to a skewed Laplace full conditional for $\lambda _i$. Note that this, however, destroys the equivalency to the frequentist linear SVM.

By using the ideas of Gaussian processes [9], Henao et al. develop a nonlinear (kernelized) version of this model [3]. They assume a continuous decision function f(x) to be drawn from a zero-mean Gaussian process $\mathrm {GP}(0, k)$, where k is a kernel function. The random Gaussian vector $f = (f_1, \ldots , f_n)^\top $ corresponds to f(x) evaluated at the data points. They substitute the linear function $x_i^\top \beta $ by $f_i$ in (3) and obtain the conditional posteriors

$$\begin{aligned} \begin{aligned} f | \lambda , \mathcal{D}&\sim \mathcal{N}\left( CY(\lambda ^{-1} +1),\, C\right) ,\\ \lambda _i | f_i, \mathcal{D}_i&\sim \mathcal {GIG}\left( 1/2, 1, (1-y_if_i)^2\right) , \end{aligned} \end{aligned}$$

(5)

with $C^{-1} = \varLambda ^{-1} + K^{-1}$. For a test point $x_*$ the conditional predictive distribution for $f_* = f(x_*)$ under this model is

$$\begin{aligned} f_* | \lambda , x_*, \mathcal{D} \sim \mathcal{N}\left( k_*^\top (K + \varLambda )^{-1}Y(1+\lambda ),\, k_{**} - k_*^\top (K + \varLambda )^{-1} k_* \right) , \end{aligned}$$

where $K:= k(X,X)$, $k_{X*}:=k(X,x_*)$, $k_{**}:= k(x_*,x_*)$. The conditional class membership probability is

$$\begin{aligned} p (y_* = 1 | \lambda , x_*, \mathcal{D}) = \varPhi \left( \frac{k_*^T(K+\varLambda )^{-1}Y(1+\lambda )}{1+ k_{**} - k_*^\top (K + \varLambda )^{-1} k_*}\right) , \end{aligned}$$

where $\varPhi (.)$ is the probit link function.

Note that the conditional posteriors as well as the class membership probability still depend on the local latent variables $\lambda _i$. We are interested in the marginal predictive distributions, but unfortunately the latent variables cannot be integrated out analytically. Both [2, 3] propose MCMC-algorithms and stepwise inference schemes similar to EM-algorithms to overcome this problem. These methods do not scale well to big data problems and the probability estimation still relies on point estimates of the n-dimensional $\lambda $. We overcome these problems proposing a scalable inference method and obtaining approximate marginal predictive distributions (that are not conditioned on $\lambda $).

4 Scalable Inference and Automated Hyperparameter Tuning

In the following we develop a fast and reliable inference method for the Bayesian nonlinear SVM. Our method builds on the idea of using inducing points for Gaussian Processes in a stochastic variational inference setting [7] that scales easily to millions of data points. We proceed by first discussing a standard batch variational scheme in Sect. 4.1 and then in Sect. 4.2 we develop our fast and scalable inference method. We show how to automatically tune hyperparameters in Sect. 4.3 and obtain uncertainty estimates for predictions in Sect. 4.4. Finally, we discuss the special case of the Bayesian linear SVM in Sect. 4.5.

4.1 Batch Variational Inference

The idea of variational inference is to approximate the typically intractable posterior of a probabilistic model by a variational (typically factorized) distribution. We find the optimal approximating distribution by maximizing a lower bound on the evidence (the so-called ELBO) with respect to the parameters of the variational distribution, which is equivalent to minimizing the Kullback-Leibler divergence between the variational distribution and the posterior [18, 19].

In this section we first develop a batch variational inference scheme [18, 19], which uses the full dataset in every iteration. We follow the structured mean field approach and choose the variational distributions within the same families as the full conditional distributions $q(f,\lambda ) = q(f) \prod _{i=1}^n q(\lambda _i)$, with $q(f) \equiv \mathcal{N}(\mu , \zeta ) \text {and} q(\lambda _i) \equiv \mathcal {GIG}(1/2,1,\alpha _i)$. The coordinate ascent updates can be computed by the expected natural parameters of the corresponding full conditionals (5) leading to

$$\begin{aligned} \alpha _i&= \mathbb {E}_{q(f)} [(1- y_i f_i)^2] = (1- y_i^\top \mu )^2 + y_i^\top \zeta y_i,\\ \zeta&= \mathbb {E}_{q(\lambda )}[\left( \varLambda ^{-1} + K^{-1}\right) ^{-1}] = \left( A^{-\frac{1}{2}} + K^{-1}\right) ^{-1},\\ \mu&= \zeta \mathbb {E}_{q(\lambda )}[Y(\lambda ^{-1} + 1)] = \zeta Y(\alpha ^{-\frac{1}{2}} + 1). \end{aligned}$$

This concludes the batch variational inference scheme.

The downside of this approach is that it does not scale to big datasets. The covariance matrix of the variational distribution q(f) has dimension $n \times n$ and has to be updated and inverted at every inference step. This operation exhibits the computational complexity $\mathcal{O}(n^3)$, where n is the number of data points. Furthermore, in this setup we cannot apply stochastic gradient descent. We show how to overcome both problems in the next section paving the way to perform inference on big datasets.

4.2 Stochastic Variational Inference Using Inducing Points

We aim to develop a stochastic variational inference (SVI) scheme using only minibatches of the data in each iteration. The Bayesian nonlinear SVM model does not exhibit a set of global variables. Both the number of latent variables $\lambda $ and the observations of the latent GP f grow with number of data points (c.f. Eq. 5), i.e. they are local variables. This hinders us from directly developing a SVI scheme. We make use of the concept of inducing points [7] imposing a sparse GP acting as global variable. This allows us to apply SVI and reduces the complexity to $\mathcal{O}(m^3)$, where m is the number of inducing points, which is independent of the number of data points.

We augment our original model (5) with $m<n$ inducing points. Let $u\in \mathbb {R}^m$ be pseudo observations at inducing locations $\{\hat{x}_1,\ldots ,\hat{x}_m \}$. We employ a prior on the inducing points, $p(u) = \mathcal N(0,K_{mm})$ and connect f and u setting

$$\begin{aligned} p(f|u) = \mathcal N(K_{nm}K_{mm}^{-1}u, \widetilde{K}) \end{aligned}$$

(6)

where $K_{mm}$ is the kernel matrix resulting from evaluating the kernel function between all inducing points locations, $K_{nm}$ is the cross-covariance between the data points and the inducing points and $\widetilde{K}$ is given by $\widetilde{K} = K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}$. The augmented model exhibits the joint distribution

$$\begin{aligned} p(y,u,f,\lambda ) = p(y, \lambda |f) p(f | u) p(u). \end{aligned}$$

Note that we can recover the original joint distribution by marginalizing over u. We now aim to apply the methodology of variational inference to the marginal joint distribution $p(y,u,\lambda )=\int p(y,u,f,\lambda )\mathrm {d}f$. We impose a variational distribution $q(u) = \mathcal N(u|\mu , \zeta )$ on the inducing points u. We follow [7] and apply Jensen’s inequality to obtain a lower bond on the following intractable conditional probability,

$$\begin{aligned} \log p(y,\lambda |u)= & {} \log \mathbb {E}_{p(f|u)}\left[ p(y,\lambda |f ) \right] \\\ge & {} \mathbb {E}_{p(f|u)}\left[ \log p(y,\lambda |f)\right] \\= & {} \sum _{i=1}^n\mathbb {E}_{p(f_i|u)}\left[ \log p(y_i,\lambda _i|f_i)\right] \\= & {} \sum _{i=1}^n\mathbb {E}_{p(f_i|u)}\left[ \log \left( (2\pi \lambda _i)^{-\frac{1}{2}}\exp \left( -\frac{1}{2}\frac{(1+\lambda _i-y_if_i)^2}{\lambda _i}\right) \right) \right] \end{aligned}$$

Plugging the lower bound $\mathcal {L}_1$ into the standard evidence lower bound (ELBO) [18] leads to the new variational objective

$$\begin{aligned} \log p(y)\ge & {} \mathbb {E}_q\left[ \log p(y,\lambda ,u)\right] - \mathbb {E}_q\left[ \log q(\lambda ,u) \right] \nonumber \\= & {} \mathbb {E}_q\left[ \log p(y,\lambda |u) \right] + \mathbb {E}_q\left[ \log p(u)\right] - \mathbb {E}_q\left[ \log q(\lambda ,u) \right] \nonumber \\\ge & {} \mathbb {E}_q\left[ \mathcal {L}_1\right] + \mathbb {E}_q\left[ \log p(u)\right] - \mathbb {E}_q\left[ \log q(\lambda ,u)\right] \\= & {} -\frac{1}{2}\sum _{i=1}^n\mathbb {E}_q\left[ \log \lambda _i + \frac{1}{\lambda _i}\left( \widetilde{K}_{ii} + \left( 1+\lambda _i - y_i K_{im}K_{mm}^{-1}u\right) ^2\right) \right] \nonumber \\&\quad -\,\mathrm {KL}\left( q(u)||p(u)\right) - \mathbb {E}_{q(\lambda )}\left[ \log q(\lambda )\right] \nonumber \\= & {} : \mathcal {L}.\nonumber \end{aligned}$$

(7)

The expectations can be computed analytically (details are given in the appendix) and we obtain $\mathcal{L}$ in closed form,

(8)

where $\kappa =K_{nm}K_{mm}^{-1}$ and $\mathrm {B}_{\frac{1}{2}}(.)$ is the modified Bessel function with parameter $\frac{1}{2}$ [20]. This objective is amenable to stochastic optimization where we subsample from the sum to obtain a noisy gradient estimate. We develop a stochastic variational inference scheme by following noisy natural gradients of the variational objective $\mathcal {L}$. Using the natural gradient over the standard euclidean gradient is often favorable since natural gradients are invariant to reparameterization of the variational family [21, 22] and provide effective second-order optimization updates [6, 23]. The natural gradients of $\mathcal {L}$ w.r.t. the Gaussian natural parameters $\eta _1 = \zeta ^{-1}\mu $, $\eta _2 = -\frac{1}{2}\zeta ^{-1}$ are

$$\begin{aligned} \widetilde{\nabla }_{\eta _1} \mathcal {L}= & {} \kappa ^\top Y(\alpha ^{-\frac{1}{2}}+1) - \eta _1 \end{aligned}$$

(9)

$$\begin{aligned} \widetilde{\nabla }_{\eta _2} \mathcal {L}= & {} -\frac{1}{2}(K_{mm}^{-1} + \kappa ^\top A^{-\frac{1}{2}} \kappa ) - \eta _2, \end{aligned}$$

(10)

with $A = {{\mathrm{diag}}}(\alpha )$. Details can be found in the appendix. The natural gradient updates always lead to a positive definite covariance matrix^{Footnote 2} and in our implementation $\zeta $ has not to be parametrized in any way to ensure positive-definiteness. The derivative of $\mathcal {L}$ w.r.t. $\alpha _i$ is

$$\begin{aligned} \nabla _\alpha \mathcal {L}&= \frac{(1-y_i\kappa _i\mu )^2 + y_i(\kappa _i\zeta \kappa _i^\top + \widetilde{K}_{ii})y_i}{4\sqrt{\alpha _i}^3} - \frac{1}{4\sqrt{\alpha _i}}. \end{aligned}$$

(11)

Setting it to zero gives the coordinate ascent update for $\alpha _i$,

$$\begin{aligned} \alpha _i = (1-y_i\kappa _i\mu )^2 + y_i(\kappa _i\zeta \kappa _i^\top + \widetilde{K}_{ii})y_i. \end{aligned}$$

Details can be found in the appendix. The inducing point locations can be either treated as hyperparameters and optimized while training [24] or can be fixed before optimizing the variational objective. We follow the first approach which is often preferred in a stochastic variational inference setup [7, 10]. The inducing point locations can be either randomly chosen as subset of the training set or via a density estimator. In our experiments we have observed that the k-means clustering algorithm (kMeans) [25] yields the best results. Combining our results, we obtain a fast stochastic variational inference algorithm for the Bayesian nonlinear SVM which is outlined in Algorithm 1. We apply the adaptive learning rate method described in [26].

4.3 Auto Tuning of Hyperparameters

The probabilistic formulation of the SVM lets us directly learn the hyperparameters while training. To this end we maximize the marginal likelihood p(y|X, h), where h denotes the set of hyperparameters (this approach is called empirical Bayes [27]). We follow an approximate approach and optimize the fitted variational lower bound $\mathcal{L}(h)$ over h by alternating between optimization steps w.r.t. the variational parameters and the hyperparameters [28]. We include a gradient ascent step w.r.t. h after multiple variational updates in the SVI scheme, this is commonly known as Type II maximum likelihood (ML-II) [9]

$$\begin{aligned} h^{(t)} = h^{(t-1)} + \widetilde{\rho }_t \nabla _h \mathcal{L}(\alpha ^{(t-1)} , \mu ^{(t-1)}, \zeta ^{(t-1)}, h). \end{aligned}$$

(12)

Since the standard SVM does not exhibit a probabilistic formulation, the hyperparameters have to be tuned via computationally very expensive methods as grid search and cross validation. Our approach allows us to estimate the hyperparameters during training time and lets us follow gradients instead of only evaluating single hyperparameters.

In the appendix we provide the gradient of the variational objective $\mathcal{L}$ w.r.t. to a general kernel and show how to optimize arbitrary differentiable hyperparameters. Our experiments exemplify our automated hyperparameter tuning approach by optimizing the hyper parameter of an RBF kernel.

4.4 Uncertainty Predictions

Besides the advantage of automated hyperparameter tuning, the probabilistic formulation of the SVM leads directly to uncertainty estimates of the predictions. The standard SVM lacks this capability, and only heuristic approaches as e.g. Platt [8] exist. Using the approximate posterior $q(u| \mathcal{D}) = \mathcal{N}(u | \mu , \zeta )$ obtained by our stochastic variational inference method (Algorithm 1) we compute the class membership probability for a test point $x^*$,

$$\begin{aligned} p(f^*|x^*, \mathcal{D})= & {} \int p(y^*|u, x^*)p(u|\mathcal{D})\mathrm {d}u\\\approx & {} \int p(y^*|u, x^*)q(u|\mathcal{D})\mathrm {d}u\\= & {} \mathcal{N}\left( y^* | K_{*m}K_{mm}^{-1}m, \; K_{**} - K_{*m}K_{mm}^{-1}(K_{m*} + \zeta K_{mm}^{-1}K_{m*})\right) \\=: & {} q(f^*|x^*, \mathcal{D}), \end{aligned}$$

where $K_{*m}$ denotes the kernel matrix between test and inducing points and $K_{**}$ the kernel matrix between test points. This leads to the approximate class membership distribution

$$\begin{aligned} q(y^*|x^*, \mathcal{D})&= \varPhi \left( \frac{K_{*m}K_{mm}^{-1}m}{K_{**} - K_{*m}K_{mm}^{-1}(K_{m*} + \zeta K_{mm}^{-1}K_{m*}) + 1} \right) \end{aligned}$$

(13)

where $\varPhi (.)$ is the probit link function. Note that we already computed inverse $K_{mm}^{-1}$ for the training procedure leading to a computational overhead stemming only from simple matrix multiplication. Our experiments show that (13) leads to reasonable uncertainty estimates.

4.5 Special Case of Linear Bayesian SVM

We now consider the special case of using a linear kernel. If we are interested in this case we may consider the Bayesian model for the linear SVM proposed by Polson et al. (c.f. Eq. 4). This can be favorable over using the nonlinear version since this model is formulated in primal space and, therefore, the computational complexity depends on the dimension d and not on the number of data points n. Furthermore, focusing directly on the linear model allows us to optimize the true ELBO, $\mathbb {E}_q\left[ \log p(y,\lambda ,\beta )\right] - \mathbb {E}_q\left[ \log q(\lambda ,\beta ) \right] $, without the need of relying on a lower bound (as in Eq. 7). This typically leads to a better approximate posterior.

We again follow the structured mean field approach and chose our variational distributions to be in the same families as the full conditionals (4),

$$\begin{aligned} q(\lambda _i)&\equiv \mathcal {GIG}(\frac{1}{2}, 1, \alpha _i ) \text { and } q(\beta ) \equiv \mathcal {N}(\mu , \zeta ). \end{aligned}$$

We use again the fact that the coordinate updates of the variational parameters can be obtained by computing the expected natural parameters of the corresponding full conditionals (4) and obtain

$$\begin{aligned} \alpha _i= & {} (1-z_i^T\mu )^2 + z_i^T\zeta z_i\nonumber \\ \zeta= & {} (Z A^{-\frac{1}{2}} Z^T+\varSigma ^{-1} )^{-1}\\ \mu= & {} \zeta Z (\alpha ^{-\frac{1}{2}} + 1),\nonumber \end{aligned}$$

(14)

where $\alpha = (\alpha _i)_{1\le i\le n}$, $A={{\mathrm{diag}}}(\alpha )$ and $Z=YX$. Since the Bayesian Linear SVM model exhibits global and local variables we can directly employ stochastic variational inference by subsampling the data and only updating minibatches of $\alpha $. Note that for the linear case the covariance matrices have size $d \times d$, i.e. being independent of the number of data points. Therefore, the SVI Algorithm (14) for the Bayesian Linear SVM exhibits the computational complexity $\mathcal{O}(d^3)$. Luts et al. develop a batch variational inference scheme for the Bayesian linear SVM but do not scale to big datasets.

The hyperparameter can be tuned analogously to (12). The class membership probabilities are

$$\begin{aligned} p(y_* = 1 | x^*, \mathcal{D}) \approx \int \varPhi (f_*) p(f_*| f, x^*) q(f|\mathcal{D}) \mathrm {d}f \mathrm {d}f_* = \varPhi \left( \frac{x_*^\top \mu }{x_*^\top \zeta x_*+1} \right) , \end{aligned}$$

where $x_*$ are the test points and $q(f|\mathcal{D}) = \mathcal{N}(f | \mu , \zeta )$ the approximate posterior obtained by the above described SVI scheme.

5 Experiments

We compare our approach against the expectation conditional maximization (ECM) method proposed by Henao et al. [3], Gaussian process classification (GPC) [9], its recently proposed scalable stochastic variational inference version (S-GPC) [10], and libSVM with Platt scaling [8, 29] (SVM + Platt). For all experiments we use an RBF kernel^{Footnote 3} with length-scale parameter $\theta $. We perform all experiments using only one CPU core with 2.9 GHz and 386 GB RAM.

Code is available at github.com/theogf/BayesianSVM.

Table 1. Average prediction error and Brier score with one standard deviation.

Full size table

5.1 Prediction Performance and Uncertainty Estimation

We experiment on seven real-world datasets and compare the prediction performance, the quality of the uncertainty estimates and run time of the methods. The results are presented in Table 1. We show that our method (S-BSVM) is up to 22 times faster than the direct competitor ECM and up to 700 times faster than Gaussian process classification^{Footnote 4} while outperforming the competitors in terms of prediction performance and quality of uncertainty estimates in most cases. The non-probabilistic SVM is naturally the fastest method. Combined with the heuristic Platt scaling approach it leads to class membership probabilities but, however, still lacks the advantages of a probabilistic model (as e.g. uncertainty quantification of the learned parameters and automatic hyperparameter tuning).

To evaluate the quality of the uncertainty estimates we compute the Brier score which is considered as a good performance measure for probabilistic predictions [30] being defined as $BS = \frac{1}{n} \sum _{i=1}^N \left( y_i - q(x_i) \right) ^2$, where $y_i \in \{0,1\}$ is the observed output and $q(x_i) \in [0,1]$ is the predicted class membership probability. Note that smaller Brier score indicates better performance.

The datasets are all from the Rätsch benchmark datasets [31] commonly used to test the accuracy of binary nonlinear classifiers. We perform a 10-fold cross-validation and use an RBF kernel with fixed parameters for all methods. For S-BSVM we choose the number of inducing points as $20\%$ of the training set size, except for the datasets Splice, German and Waveform where we use 100 inducing points. For each dataset minibatches of 10 samples are used.

5.2 Big Data Experiments

We demonstrate the scalability of our method on the SUSY dataset [11] containing 5 million points with 17 features. This dataset size is very common in particle physics due to the simplicity of artificially generating new events as well as the quantity of data coming from particle detectors. Since it is important to have a sense of the confidence of the predictions for such datasets the Bayesian SVM is an appropriate choice. We use an RBF kernel^{Footnote 5}, 64 inducing points and minibatches of 100 points. The training of our model takes only 10 min without any parallelization. We use the area under the receiver operating characteristic (ROC) curve (AUC) as performance measure since it is a standard evaluation measure on this dataset [11].

Our method achieves an AUC of 0.84 and a Brier score of 0.22, whereby the state-of-the-art obtains an AUC of 0.88 using a deep neural network (5 layers, 300 hidden units each) [11]. Note that this approach takes much longer to train and does not include uncertainty estimates.

5.3 Run Time

We examine the run time of our methods and the competitors. We include both the batch variational inference method (B-BSVM) described in Sect. 4.1 and our fast and scalable inference method (S-BSVM) described in Sect. 4.2 in the experiments. For each method we iteratively evaluate the prediction performance on a held-out dataset given a certain training time budget. The prediction error as function of the training time is shown in Fig. 1. We experiment on the Waveform dataset from the Rätsch benchmark dataset ($N=5000,\;d=21$). We use an RBF kernel with fixed length-scale parameter $\theta = 5.0$ and for the stochastic variational inference methods, S-BSVM and S-GPC, we use a batch size of 10 and 100 inducing points.

Our scalable method (S-BSVM) is around 10 times faster than the direct competitor ECM while having slightly better prediction performance. The batch variational inference version (B-BSVM) is the slowest of the Bayesian SVM inference methods. The related probabilistic model, Gaussian process classification, is around 5000 times slower than S-BSVM. Its stochastic inducing point version (S-GPC) has comparable run time to S-BSVM but is very unstable leading to bad prediction performance. S-GPC showed these instabilities for multiple settings of the hyperparameters. The classic SVM (libSVM) has a similar run time as our method. The speed and prediction performance of S-BSVM depend on the number of inducing points. See Sect. 5.5 for an empirical study. Note that the run time in Table 1 is determined after the methods have converged.

5.4 Auto Tuning of Hyperparameters

In Sect. 4.3 we show that our inference method possesses the ability of automatic hyperparameter tunning. In this experiment we demonstrate that our method, indeed, finds the optimal length-scale hyperparameter of the RBF kernel. We use the optimizing scheme (12) and alternate between 10 variational parameter updates and one hyperparameter update. We compute the true validation loss of the length-scale parameter $\theta $ by a grid search approach which consists of training our model (S-BSVM) for each $\theta $ and measuring the prediction performance using 10-fold cross validation. In Fig. 2 we plot the validation loss and the length-scale parameter found by our method. We find the true optimum by only using 5 hyperparameter optimization steps. Training and hyperparameter optimization takes only 0.3 s for our method, whereas grid search takes 188 s (with a grid size of 1000 points).

5.5 Inducing Points Selection

The sparse GP model used in our inference scheme builds on a set of inducing points where both the number and the locations of the inducing points are free parameters. We investigate three different inducing point selection methods: random subset selection from the training set, the Gaussian Mixture Model (GMM), and the k-means clustering algorithm with an improved k-means++ seeding (kMeans) [32]. Furthermore we show how the number of inducing points affects the prediction accuracy and the run time. We test the three inducing point selection methods on the USPS dataset [33] which we reduced to a binary problem using only the digits 3 and 5 (N = 1350 and d = 256). For all methods we progressively increase the number of inducing points and compute the prediction error by 10-fold cross validation. We present our results in Fig. 3.

The GMM is unable to fit large numbers of samples and dimensions and fails to converge for almost all datasets tried, therefore, we do not include it in the plot. Using the k-means selection algorithm leads for small numbers of inducing points to much better prediction performance than random subset selection. Furthermore, we show that using only a small fraction of inducing points (around 1% of the original dataset) leads to a nearly optimal prediction performance by simultaneously significantly decreasing the run time. We observe similar results on all datasets we considered.

6 Conclusion

We presented a fast, scalable and reliable approximate inference method for the Bayesian nonlinear SVM. While previous methods were restricted to rather small datasets our method enables the application of the Bayesian nonlinear SVM to large real world datasets containing millions of samples. Our experiments showed that our method is orders of magnitudes faster than the state-of-the-art while still yielding comparable prediction accuracies. We showed how to automatically tune the hyperparameters and obtain prediction uncertainties which is important in many real world scenarios.

In future work we plan to further extend the Bayesian nonlinear SVM model to deal with missing data and account for correlations between data points building on ideas from [34]. Furthermore, we want to develop Bayesian formulations of important variants of the SVM as for instance one-class SVMs [35].

Notes

1.
Note that frequentist approaches can also lead to other forms of uncertainty estimates, e.g. in form of confidence intervals. But since the classic SVM does not exhibit a probabilistic formulation these uncertainty estimates cannot be directly computed.
2.
This follows directly since $K_{mm}$ and $A^{-\frac{1}{2}}$ are positive definite.
3.
The RBF kernel is defined as $k(x_1,x_2,\theta )=\exp \left( -\frac{||x_1-x_2||}{\theta ^2}\right) $, where $\theta $ is the length scale parameter.
4.
For a comparison with the stochastic variational inference version of GPC, see Sect. 5.3.
5.
The length scale parameter tuning is not included in the training time. We found $\theta = 5.0$ by our proposed automatic tuning approach.

References

Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
MATH Google Scholar
Polson, N.G., Scott, S.L.: Data augmentation for support vector machines. Bayesian Anal. 6(1), 1–24 (2011)
Article MathSciNet MATH Google Scholar
Henao, R., Yuan, X., Carin, L.: Bayesian nonlinear support vector machines and discriminative factor modeling. In: NIPS (2014)
Google Scholar
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds of classifiers to solve real world classification problems? JMLR 15(1), 3133–3181 (2014)
MathSciNet MATH Google Scholar
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT press, Cambridge (2012)
MATH Google Scholar
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. JMLR 14, 1303–1347 (2013)
MathSciNet MATH Google Scholar
Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. In: Conference on Uncertainty in Artificial Intellegence (2013)
Google Scholar
Platt, P.J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv. Large Margin Classif. 10(3), 61–74 (1999)
Google Scholar
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press, Cambridge (2005)
Google Scholar
Hensman, J., Matthews, A.: Scalable variational Gaussian process classification. In: AISTATS (2015)
Google Scholar
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nature Commun. 4 (2014). Article no. 4308
Google Scholar
Zhu, J., Chen, N., Perkins, H., Zhang, B.: Gibbs max-margin topic models with data augmentation. JMLR 15(1), 1073–1110 (2014)
MathSciNet MATH Google Scholar
Xu, M., Zhu, J., Zhang, B.: Fast max-margin matrix factorization with data augmentation. In: ICML, pp. 978–986 (2013)
Google Scholar
Zhang, A., Zhu, J., Zhang, B.: Max-margin infinite hidden Markov models. In: ICML (2014)
Google Scholar
Luts, J., Ormerod, J.T.: Mean field variational Bayesian inference for support vector machine classification. Comput. Stat. Data Anal. 73, 163–176 (2014)
Article MathSciNet Google Scholar
Snelson, E., Ghahramani, Z.: Sparse GPs using pseudo-inputs. In: NIPS (2006)
Google Scholar
Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: $lp$-norm multiple kernel learning. JMLR 12, 953–997 (2011)
MATH Google Scholar
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An introduction to variational methods for graphical models. Mach. Learn. 37(2), 183–233 (1999)
Article MATH Google Scholar
Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)
MATH Google Scholar
Jørgensen, B.: Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer, New York (2012). https://doi.org/10.1007/978-1-4612-5698-4
Google Scholar
Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2007)
MATH Google Scholar
Martens, J.: New insights and perspectives on the natural gradient method. Arxiv Preprint (2017)
Google Scholar
Amari, S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)
Article Google Scholar
Titsias, M.K.: Variational learning of inducing variables in sparse Gaussian processes. In: Artificial Intelligence and Statistics, vol. 12, pp. 567–574 (2009)
Google Scholar
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2012)
MATH Google Scholar
Ranganath, R., Wang, C., Blei, D.M., Xing, E.P.: An adaptive learning rate for stochastic variational inference. In: ICML (2013)
Google Scholar
Maritz, J., Lwin, T.: Empirical Bayes Methods with Applications: Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton (1989)
MATH Google Scholar
Mandt, S., Hoffman, M., Blei, D.: A variational analysis of stochastic gradient algorithms. In: ICML (2016)
Google Scholar
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
Article Google Scholar
Brier, G.W.: Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78(1), 1–3 (1950)
Article Google Scholar
Diethe, T.: 13 benchmark datasets derived from the UCI, DELVE and STATLOG repositories (2015)
Google Scholar
Bachem, O., Lucic, M., Hassani, H., Krause, A.: Fast and provably good seedings for k-means. In: NIPS (2016)
Google Scholar
Lichman, M.: UCI machine learning repository (2013)
Google Scholar
Mandt, S., Wenzel, F., Nakajima, S., Cunningham, J.P., Lippert, C., Kloft, M.: Sparse probit linear mixed model. Mach. Learn. 106(9–10), 1621–1642 (2017)
Article MathSciNet Google Scholar
Perdisci, R., Gu, G., Lee, W.: Using an ensemble of one-class SVM classifiers to H. P.-based anomaly detection systems. In: Data Mining (2006)
Google Scholar

Download references

Acknowledgments

We thank Stephan Mandt, Manfred Opper and Patrick Jähnichen for fruitful discussions. This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1.

Author information

Authors and Affiliations

Humboldt University of Berlin, Berlin, Germany
Florian Wenzel, Théo Galy-Fajou & Marius Kloft
G+J Digital Products Hamburg, Hamburg, Germany
Matthäus Deutsch

Authors

Florian Wenzel
View author publications
You can also search for this author in PubMed Google Scholar
Théo Galy-Fajou
View author publications
You can also search for this author in PubMed Google Scholar
Matthäus Deutsch
View author publications
You can also search for this author in PubMed Google Scholar
Marius Kloft
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Florian Wenzel .

Editor information

Editors and Affiliations

Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
Aalto University School of Science, Espoo, Finland
Jaakko Hollmén
University of Ljubljana, Ljubljana, Slovenia
Ljupčo Todorovski
KU Leuven Kulak, Kortrijk, Belgium
Celine Vens
Jožef Stefan Institute, Ljubljana, Slovenia
Sašo Džeroski

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wenzel, F., Galy-Fajou, T., Deutsch, M., Kloft, M. (2017). Bayesian Nonlinear Support Vector Machines for Big Data. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Džeroski, S. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2017. Lecture Notes in Computer Science(), vol 10534. Springer, Cham. https://doi.org/10.1007/978-3-319-71249-9_19

Download citation

DOI: https://doi.org/10.1007/978-3-319-71249-9_19
Published: 30 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71248-2
Online ISBN: 978-3-319-71249-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics