Machine Learning, Volume 108, Issue 8–9, pp 1677–1699

Distribution-free uncertainty quantification for kernel methods by gradient perturbations

  • Balázs Cs. Csáji
  • Krisztián B. Kis
Open Access
Article
Part of the following topical collections:
  1. Special Issue of the ECML PKDD 2019 Journal Track

Abstract

We propose a data-driven approach to quantify the uncertainty of models constructed by kernel methods. Our approach minimizes the needed distributional assumptions, hence, instead of working with, for example, Gaussian processes or exponential families, it only requires knowledge about some mild regularity of the measurement noise, such as being symmetric or exchangeable. We show, by building on recent results from finite-sample system identification, that by perturbing the residuals in the gradient of the objective function, information can be extracted about the amount of uncertainty our model has. Particularly, we provide an algorithm to build exact, non-asymptotically guaranteed, distribution-free confidence regions for ideal, noise-free representations of the function we try to estimate. For the typical convex quadratic problems and symmetric noises, the regions are star convex centered around a given nominal estimate, and have efficient ellipsoidal outer approximations. Finally, we illustrate the ideas on typical kernel methods, such as LS-SVC, KRR, \(\varepsilon \)-SVR and kernelized LASSO.

Keywords

Kernel methods Confidence regions Nonparametric regression Classification Support vector machines Distribution-free methods 

1 Introduction

Kernel methods build on the fundamental concept of Reproducing Kernel Hilbert Spaces (Aronszajn 1950; Giné and Nickl 2015) and are widely used in machine learning (Shawe-Taylor and Cristianini 2004; Hofmann et al. 2008) and related fields, such as system identification (Pillonetto et al. 2014). One of the reasons for their popularity is the representer theorem (Kimeldorf and Wahba 1971; Schölkopf et al. 2001), which shows that finding an estimate in an infinite dimensional space of functions can be traced back to a finite dimensional problem. Support vector machines (Schölkopf and Smola 2001; Steinwart and Christmann 2008), rooted in statistical learning theory (Vapnik 1998), are typical examples of kernel methods.

Besides how to construct efficient models from data, it is also a fundamental question how to quantify the uncertainty of the obtained models. While standard approaches like Gaussian processes (Rasmussen and Williams 2006) or exponential families (Hofmann et al. 2008) offer a nice theoretical framework, making strong statistical assumptions on the system is sometimes unrealistic, since in practice we typically have very limited knowledge about the noise affecting the measurements. Building on asymptotic results, such as limiting distributions, is also widespread (Giné and Nickl 2015), but they usually lack finite sample guarantees.

Here, we propose a non-asymptotic, distribution-free approach to quantify the uncertainty of kernel-based models, which can be used for hypothesis testing and confidence region constructions. We build on recent developments in finite-sample system identification (Campi and Weyer 2005; Carè et al. 2018), more specifically, we build on the Sign-Perturbed Sums (SPS) algorithm (Csáji et al. 2015) and its generalizations, the Data Perturbation (DP) methods (Kolumbán 2016).

We consider the case where there is an underlying “true” function that generates the measurements, but we only have noisy observations of its outputs. Since we want to minimize the needed assumptions (for example, we do not want to assume that the true underlying function belongs to the Hilbert space in which we search for our estimate), we take an “honest” approach (Li 1989) and consider “ideal” representations of the target function from our function space. A representation is ideal w.r.t. the data sample, if its outputs coincide with the corresponding noise-free outputs of the true underlying function for all available inputs.

Although our method is distribution-free, i.e., it does not depend on any parameterized distributions, it has strong finite-sample guarantees. We argue that the constructed confidence region contains the ideal representation exactly with a user-chosen probability. In case the noises are independent and symmetric about zero, and the objective function is convex quadratic, the resulting regions are star convex and have efficient ellipsoidal outer approximations, which can be computed by solving semi-definite optimization problems. Finally, we demonstrate our approach on typical kernel methods, such as KRR, SVMs and kernelized LASSO.

Our approach has some similarities to bootstrap (Efron and Tibshirani 1994) and conformal prediction (Vovk et al. 2005). One of the fundamental differences w.r.t. bootstrap is, e.g., that we avoid building alternative samples and fitting bootstrap estimates to them (since it is computationally challenging), but directly perturb the gradient of the objective function. Key differences w.r.t. conformal prediction are, e.g., that we want to quantify the uncertainty of the model and not necessarily that of the next observation (though the two problems are related), and more importantly, exchangeability is not fundamental for our approach.

2 Preliminaries

A Hilbert space, \(\mathcal {H}\), of functions \(f: \mathcal {X} \rightarrow \mathbb {R}\), with inner product \(\langle \cdot , \cdot \rangle _{\mathcal {H}}\), is called a Reproducing Kernel Hilbert Space (RKHS), if the point evaluation functional
$$\begin{aligned} \delta _z : f \rightarrow f(z), \end{aligned}$$
(1)
is continuous (or equivalently bounded) for all \(z \in \mathcal {X}\), at any \(f \in \mathcal {H}\) (Giné and Nickl 2015). Then, by using the Riesz representation theorem, one can construct a (unique) kernel, \(k : \mathcal {X} \times \mathcal {X} \rightarrow \mathbb {R}\), having the reproducing property, that is
$$\begin{aligned} \langle k(\cdot , z), f\rangle _{\mathcal {H}} \,=\, f(z), \end{aligned}$$
(2)
for all \(z \in \mathcal {X}\) and \(f \in \mathcal {H}\). In particular, the kernel satisfies for all \(z,s \in \mathcal {X}\) that
$$\begin{aligned} k(z, s) \,=\, \langle k(\cdot , z), k(\cdot , s) \rangle _{\mathcal {H}}. \end{aligned}$$
(3)
Hence, the kernel of an RKHS is a symmetric and positive-definite function; moreover, the Moore-Aronszajn theorem states that the converse is also true: for every symmetric, positive-definite function there is a unique RKHS (Aronszajn 1950).

Typical kernels include, e.g., the Gaussian kernel \(k(z, s) = \exp (\nicefrac {-\Vert z-s\Vert ^2}{2 \sigma ^2})\), with \(\sigma > 0\), the polynomial kernel, \(k(z, s) = (\langle z, s \rangle + c)^p\), with \(c \ge 0\) and \(p \in \mathbb {N}\), and the sigmoidal kernel, \(k(z, s) = \tanh (a \langle z, s \rangle + b)\) for some \(a, b \ge 0\), where \(\langle \cdot , \cdot \rangle \) denotes the standard Euclidean inner product (Hofmann et al. 2008).

By a data sample, \(\mathcal {D}_n\), we mean a finite set of input-output measurements,
$$\begin{aligned} (x_1, y_1), \,\dots ,\, (x_n, y_n)\, \in \, \mathcal {X} \times \mathbb {R}, \end{aligned}$$
(4)
with \(\mathcal {X} \ne \emptyset \). We also introduce \(x \doteq (x_1, \dots , x_n)^\mathrm {T}\in \mathcal {X}^n\) and \(y \doteq (y_1, \dots , y_n)^\mathrm {T}\in \mathbb {R}^n\). The Gram matrix of \(k(\cdot , \cdot )\), w.r.t. input x, is denoted by \(\mathrm {K}_x\in \mathbb {R}^{n\times n}\), where
$$\begin{aligned}{}[\,\mathrm {K}_x\,]_{i,j}\, \doteq \,k(x_i, x_j). \end{aligned}$$
(5)
A kernel is called strictly positive definite if its Gram matrix, \(\mathrm {K}_x\), is (strictly) positive definite for distinct inputs \(\{x_i\}\) (Hofmann et al. 2008).
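The definitions (5) and the Gaussian kernel above can be tried out numerically. The following is a minimal sketch, assuming NumPy; the helper names `gaussian_kernel` and `gram_matrix` and the input values are hypothetical and only serve as an illustration.

```python
import numpy as np

def gaussian_kernel(z, s, sigma=1.0):
    """Gaussian kernel k(z, s) = exp(-||z - s||^2 / (2 sigma^2))."""
    diff = np.asarray(z, dtype=float) - np.asarray(s, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma**2)))

def gram_matrix(kernel, xs):
    """Gram matrix [K_x]_{i,j} = k(x_i, x_j), as in (5)."""
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

# Hypothetical one-dimensional inputs, all distinct.
xs = [np.array([0.0]), np.array([0.5]), np.array([1.5]), np.array([3.0])]
K = gram_matrix(gaussian_kernel, xs)

# Smallest eigenvalue of the symmetric matrix K; a positive value
# indicates that this particular Gram matrix is positive definite.
min_eig = np.linalg.eigvalsh(K).min()
```

For distinct inputs and the Gaussian kernel, `min_eig` is expected to be positive, although it can become numerically tiny when inputs lie very close to each other.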

One of the fundamental reasons for the successes of kernel methods is the so-called representer theorem, originally given by Kimeldorf and Wahba (1971), but the generalization presented here is due to Schölkopf et al. (2001).

Theorem 1

Suppose we are given a sample, \(\mathcal {D}_n\), a positive-definite kernel \(k(\cdot , \cdot )\), an associated RKHS with a norm \(\Vert \cdot \Vert _{\mathcal {H}}\) induced by \(\langle \cdot , \cdot \rangle _{\mathcal {H}}\), and a class of functions
$$\begin{aligned} \mathcal {F}\,\, \doteq \,&\,\,\Big \{\, f: \mathcal {X} \rightarrow \mathbb {R}\, \mid \, f(z) = \sum _{i=1}^{\infty } \beta _i k(z, z_i),\, \beta _i \in \mathbb {R},\, z_i \in \mathcal {X},\, \Vert f\Vert _{\mathcal {H}} < \infty \, \Big \}, \end{aligned}$$
(6)
then, for any monotonically increasing regularization function, \({\Lambda }: [0, \infty ) \rightarrow [0, \infty )\), and an arbitrary loss function \(\mathrm {L}: (\mathcal {X} \times \mathbb {R}^2)^n \rightarrow \mathbb {R}\cup \{ \infty \}\), the objective
$$\begin{aligned} g(f, \mathcal {D}_n) \; \doteq \; \mathrm {L}\,\big ( \,(x_1, y_1, f(x_1)), \dots , (x_n, y_n, f(x_n))\, \big ) \,+\, {\Lambda }\,(\,\Vert f\Vert _{\mathcal {H}}\,), \end{aligned}$$
(7)
has a minimizer admitting the following representation
$$\begin{aligned} f_{\alpha }(z) \; = \; \sum _{i=1}^{n} \alpha _i k(z, x_i), \end{aligned}$$
(8)
where \(\alpha \doteq (\alpha _1, \dots , \alpha _n)^\mathrm {T}\in \mathbb {R}^n\) is the vector of coefficients. If \({\Lambda }\) is strictly monotonically increasing, then each minimizer admits a representation having form (8).

The theorem can be extended with a bias term (Schölkopf and Smola 2001), in which case if the solution exists, it also contains a multiple of the bias term. For further generalizations, see Yu et al. (2013) and Argyriou and Dinuzzo (2014).

The power of the representer theorem comes from the fact that it shows that computing the point estimate in a high, typically infinite, dimensional space of models can be reduced to a much simpler (finite dimensional) optimization problem whose dimension does not exceed the size of the data sample we have, that is n.
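As an illustration of this reduction, consider a ridge-type objective with squared loss, \(\frac{1}{n}\Vert y - \mathrm {K}_x\alpha \Vert ^2 + \lambda \, \alpha ^\mathrm {T}\mathrm {K}_x\alpha \). This specific objective, its scaling, and all names below are assumptions made for the sketch. With an invertible Gram matrix, setting the gradient to zero gives the n-dimensional linear system \((\mathrm {K}_x + n\lambda I)\,\alpha = y\).

```python
import numpy as np

def gaussian_gram(x, z, sigma=1.0):
    """Matrix of Gaussian kernel values k(x_i, z_j) for scalar inputs."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2.0 * sigma**2))

def krr_coefficients(K, y, lam):
    """Solve (K + n*lam*I) alpha = y, the finite dimensional problem."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(alpha, x_train, z, sigma=1.0):
    """Evaluate f_alpha(z) = sum_i alpha_i k(z, x_i), i.e. form (8)."""
    return gaussian_gram(z, x_train, sigma) @ alpha

# Hypothetical noisy sample of a sine function.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 8)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

lam = 0.1
K = gaussian_gram(x, x)
alpha_hat = krr_coefficients(K, y, lam)

# Gradient of the assumed objective at alpha_hat; it should vanish.
grad = -(2.0 / x.size) * K @ (y - K @ alpha_hat) + 2.0 * lam * K @ alpha_hat
```

Only the n coefficients in `alpha_hat` are computed, even though the Gaussian RKHS is infinite dimensional.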

If the data is noisy, then of course, the obtained estimate is a random function and it is of natural interest to study the distribution of the resulting function, for example, to evaluate its uncertainty or to test hypotheses about the system.

3 Confidence regions for kernel methods

Now, we turn our attention to a stochastic variant of the problem discussed above. There are several advantages of taking a statistical point of view on kernel methods, including conditional modeling, dealing with structured responses, handling missing measurements and building prediction regions (Hofmann et al. 2008).

Following a standard statistical viewpoint (Davies et al. 2009), we assume that the outputs \(\{y_i\}\) are generated by some noisy observations of an underlying “true” function, denoted by \(f_*\), that is for all \(i = 1, \dots , n\), the outputs can be written as
$$\begin{aligned} y_i \; \doteq \; f_*(x_i)\, +\, \varepsilon _i, \end{aligned}$$
(9)
where \(\{\varepsilon _i\}\) are the noise terms. The entire noise vector is \(\varepsilon \,\doteq \, (\varepsilon _1, \dots , \varepsilon _n)^\mathrm {T}\). The noiseless outputs of function \(f_*\) will be denoted by \(y_i^* \doteq f_*(x_i)\), for \(i = 1,\dots ,n\).

3.1 Ideal representations

We aim at quantifying the uncertainty of our estimated model. A standard way to measure the quality of a point-estimate is to build confidence regions around it. However, it is not obvious what we should aim for with our confidence regions. For example, since all of our models live in our RKHS, \(\mathcal {H}\), we would like to treat the confidence region as a subset of \(\mathcal {H}\). On the other hand, we want to minimize the assumptions, for example, we may not want to assume that \(f_*\) is an element of \(\mathcal {H}\). Furthermore, unless we make strong smoothness assumptions on the underlying unobserved function, we only have information about it at the actual inputs, \(\{x_i\}\). Hence, we aim for an “honest” nonparametric approach (Li 1989) and search for functions which correctly describe the hidden function, \(f_*\), on the given inputs. Then, by the representer theorem, we may restrict ourselves to a finite dimensional subspace of \(\mathcal {H}\). This leads us to the definition of ideal representations:

Definition 1

Let \(\mathcal {H}_{\alpha } \subseteq \mathcal {H}\) denote the subspace of functions that can be represented as (8). A function \(f_0 \in \mathcal {H}_{\alpha }\), having coefficients \(\alpha _0 \in \mathbb {R}^n\), is called an ideal or noise-free representation of the “true” unobserved function \(f_*\), if we have
$$\begin{aligned} f_{0}(x_i) \; =\; y_i^* \; \doteq \; f_*(x_i),\qquad \text{ for } \text{ all } \qquad i \in \{ \, 1, \dots , n \,\}. \end{aligned}$$
(10)
The set of all ideal representations, w.r.t. data sample \(\mathcal {D}_n\), is denoted by \(\mathcal {H}_{0} \subseteq \mathcal {H}_{\alpha }\), and the set of their coefficients, called ideal coefficients, is denoted by \(A_0 \subseteq \mathbb {R}^n\).

An ideal representation does not simply interpolate the observed (noisy) outputs \(\{y_i\}\), but it interpolates the unobserved (noise-free) outputs, that is \(\{y^*_i\}\).

A natural question which arises is: when does such an ideal representation exist? To answer this question, first note that since ideal representations have the form (8), equation system (10) can be rewritten in a matrix form by using the Gram matrix. That is, vector \(\alpha \) is an ideal coefficient vector, if and only if
$$\begin{aligned} \mathrm {K}_x\alpha \; = \; y^*, \end{aligned}$$
(11)
where \(y^* \,\doteq \, (y^*_1, \dots , y^*_n)^\mathrm {T}\). If \(\mathrm {K}_x\) is (strictly) positive definite, which is the case if for example the kernel is Gaussian and all inputs are distinct, then \(\text{ rank }(\mathrm {K}_x) = n\) and every \(f_*: \mathcal {X} \rightarrow \mathbb {R}\) has a unique ideal representation w.r.t. data sample \(\mathcal {D}_n\).

On the other hand, if \(\text{ rank }(\mathrm {K}_x) < n\), then (11) places a restriction on the functions which have ideal representations. For example, if \(\mathcal {X} = \mathbb {R}\) and \(k(z,s) = \langle z,s\rangle = z^\mathrm {T}s\), then \(\text{ rank }(\mathrm {K}_x) = 1\) and in general only functions which are linear on the data sample have ideal representations. This is of course not surprising, as it is well-known that the choice of the kernel encodes our inductive bias on the underlying true function we aim at estimating (Schölkopf and Smola 2001).

If \(\text{ rank }(\mathrm {K}_x) < n\) and there is an \(\alpha \) which satisfies (11), then there are infinitely many ideal representations, as for all \(\nu \in \text{ null }(\mathrm {K}_x)\), the null space of \(\mathrm {K}_x\), we have \( \mathrm {K}_x\, (\alpha + \nu ) \, = \, \mathrm {K}_x\, \alpha \, +\,\mathrm {K}_x\, \nu \, =\, \mathrm {K}_x\, \alpha \,=\, y^*. \) The opposite is also true, if \(\alpha \) and \(\beta \) both satisfy (11), then \( \mathrm {K}_x\, (\alpha - \beta ) \, =\, \mathrm {K}_x\, \alpha \, -\,\mathrm {K}_x\,\beta \,=\,0\), thus, \(\alpha - \beta \in \text{ null }(\mathrm {K}_x)\). Hence, to avoid allowing infinitely many ideal representations, we may form equivalence classes by treating coefficient vectors \(\alpha \) and \(\beta \) equivalent if  \(\mathrm {K}_x\, \alpha \, = \,\mathrm {K}_x\,\beta \). Then, we can work with the resulting quotient space of coefficients to ensure that there is only one ideal representation (i.e., one equivalence class of such representations).

All of our theory goes through if we work with the quotient space of representations, but to simplify the presentation we make the assumption (cf. Sect. 4.2) that \(\mathrm {K}_x\) is full rank, therefore, there always uniquely exists an ideal representation (for any “true” function), whose unique coefficient vector will be denoted by \(\alpha ^*\).
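The two cases discussed in this subsection can be checked on a small example. The sketch below is only an illustration under assumed inputs and assumed “true” functions (none of them come from the text): with a Gaussian kernel the system (11) has a unique solution, while with the linear kernel on \(\mathbb {R}\) the Gram matrix has rank one and (11) is solvable only for outputs that are linear in the inputs.

```python
import numpy as np

# Hypothetical distinct scalar inputs.
x = np.array([-1.0, 0.0, 0.7, 2.0])

# Gaussian kernel (sigma = 1): full-rank Gram matrix.
K_gauss = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

# Assumed "true" function; it need not belong to the RKHS.
y_star = np.abs(x) - 1.0

# Unique ideal coefficient vector: solve K_x alpha = y^*  (11).
alpha_star = np.linalg.solve(K_gauss, y_star)

# Linear kernel k(z, s) = z * s on the real line: rank-one Gram matrix.
K_lin = np.outer(x, x)
rank_lin = np.linalg.matrix_rank(K_lin)

# Noise-free outputs that are linear in x admit an ideal representation ...
alpha_lin = np.linalg.lstsq(K_lin, 2.0 * x, rcond=None)[0]
# ... whereas a quadratic function on this sample does not: the least-squares
# solution leaves a nonzero residual in (11).
alpha_sq = np.linalg.lstsq(K_lin, x**2, rcond=None)[0]
residual_sq = np.linalg.norm(K_lin @ alpha_sq - x**2)
```

In the rank-deficient case `alpha_lin` is just one member of the equivalence class of ideal coefficient vectors; adding any vector from the null space of `K_lin` gives another.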

3.2 Exact and honest confidence regions

Let \((\,{\Omega }, \mathcal {A}, \{ \mathbb {P}_{\theta } \}_{\theta \in {\Theta }})\) be a statistical space, where \({\Theta }\) denotes an arbitrary index set. In other words, for all \(\theta \in {\Theta }\), \(({\Omega }, \mathcal {A}, \mathbb {P}_{\theta } )\) is a probability space, where \({\Omega }\) is the sample space, \(\mathcal {A}\) is the \(\sigma \)-algebra of events, and \(\mathbb {P}_{\theta }\) is a probability measure. Note that it is not assumed that \({\Theta } \subseteq \mathbb {R}^d\), for some d; therefore, this formulation covers nonparametric inference, as well (and that is why we do not call \(\theta \) a “parameter”).

In our case, index \(\theta \) is identified with the underlying true function, therefore, each possible \(f_*\) induces a different probability distribution according to which the observations are generated. Confidence regions constitute a classical form of statistical inference, when we aim at constructing sets which cover with high probability some target function of \(\theta \) (DeGroot and Schervish 2012). These sets are usually random as they are typically built using observations. In our case, we will build confidence regions for the ideal coefficient vector (equivalently, the ideal representation), which itself is a random element, as it depends on the sample.

Let \(\gamma \) be a random element (it corresponds to the available observations), let \(g(\theta , \gamma )\) be some target function of \(\theta \) (which can possibly also depend on the observations) and let \(p \in [\,0,1\,]\) be a target probability, also called confidence level. A confidence region for \(g(\theta , \gamma )\) is a random set, \(C(p, \gamma ) \subseteq \text{ range }(g)\), i.e., the codomain of function g. The following definition formalizes two important types of stochastic guarantees for confidence regions (Davies et al. 2009).

Definition 2

A confidence region \( C(p, \gamma )\) for \(g(\theta , \gamma )\) is called exact, if
$$\begin{aligned} \forall \,\theta \in {\Theta }: \mathbb {P}_{\theta }\! \left( \,g(\theta , \gamma ) \in C(p, \gamma )\,\right) \, = \; p, \end{aligned}$$
(12)
and it is called honest, if it satisfies  \(\forall \,\theta \in {\Theta }: \mathbb {P}_{\theta }\! \left( \,g(\theta , \gamma ) \in C(p, \gamma )\,\right) \, \ge \; p.\)

In our case, \(\gamma \) is basically the sample of input-output pairs, \(\mathcal {D}_n\); and the target object we aim at covering is \(g(\theta , \gamma ) = \alpha ^*_{\theta }\), i.e., the (unique) ideal coefficient vector corresponding to the underlying true function (identified by \(\theta \)) and the sample. Since the ideal coefficient vector uniquely determines the ideal representation (together with the inputs, which however we observe), it is enough to estimate the former. The main question of this paper is how can we construct exact or honest confidence regions for the ideal coefficient vector based on a finite sample without strong distributional assumptions on the statistical space.

Henceforth, we will treat \(\theta \) (the underlying true function) fixed, and omit the \(\theta \) indexes from the notations, to simplify the formulas. Therefore, instead of writing \(\mathbb {P}_{\theta }\) or \(\alpha ^*_{\theta }\), we will simply use \(\mathbb {P}\) or \(\alpha ^*\). The results are of course valid for all \(\theta \).

Standard ways to construct confidence regions for kernel-based estimates typically either make strong distributional assumptions, like assuming Gaussian processes (Rasmussen and Williams 2006), or resort to asymptotic results, such as Donsker-type theorems for Kolmogorov-Smirnov confidence bands. An alternative approach is to build on Rademacher complexities, which can provide non-asymptotic, distribution-free confidence bands (Giné and Nickl 2015). Nevertheless, these regions are conservative (not exact) and are constructed independently of the applied kernel method. In contrast, our approach provides exact, non-asymptotic, distribution-free confidence sets for a user-chosen kernel estimate.

4 Non-asymptotic, distribution-free framework

This section presents the proposed framework to quantify the uncertainty of kernel-based estimates. It is inspired by and builds on recent results from finite-sample system identification, such as the SPS and DP methods (Campi and Weyer 2005; Csáji et al. 2015; Csáji 2016; Kolumbán 2016; Carè et al. 2018). Novelties with respect to these approaches are, e.g., that our framework considers nonparametric regression and does not require the “true” function to be in the model class.

4.1 Distributional invariance

The proposed method is distribution-free in the sense that it does not presuppose any parametric distribution about the noise vector \(\varepsilon \). We only assume some mild regularity about the measurement noises, more precisely that their (joint) distribution is invariant with respect to a known group of transformations.

Definition 3

An \(\mathbb {R}^n\)-valued random vector v is distributionally invariant with respect to a compact group of transformations, \((\mathcal {G}, \circ )\), where “\(\circ \)” is the function composition and each \(G \in \mathcal {G}\) maps \(\mathbb {R}^n\) to itself, if for all transformations \(G \in \mathcal {G}\), random vectors v and G(v) have the same distribution.

The two most important examples of the above definition are as follows.
  • If \(\{\varepsilon _i\}\) are exchangeable random variables, then the (joint) distribution of the noise vector \(\varepsilon \) is invariant w.r.t. multiplications by permutation matrices (which are orthogonal and form a finite, thus compact, group).

  • On the other hand, if \(\{\varepsilon _i\}\) are independent, each having a (possibly different!) symmetric distribution about zero, then the (joint) distribution of \(\varepsilon \) is invariant w.r.t. multiplications by diagonal matrices having \(+1\) or \(-1\) as diagonal elements (which are also orthogonal, and form a finite group).

Both of these examples assume only mild regularities about the measurement noises: for example, it is a standard assumption in statistical learning theory that the sample is independent and identically distributed (i.i.d.) which immediately implies exchangeability (which is a more general concept than i.i.d.). But even this assumption can be omitted if we work with symmetric noises, which are widespread as most standard distributions in statistics are symmetric, such as Gauss, Laplace, Cauchy, Student’s t, uniform, plus a large class of multimodal ones.

Note that for these examples no assumptions about other properties of the (noise) distributions are needed, e.g., they can be heavy-tailed, with even infinite variance, skewed, their expectations need not exist, hence, no moment assumptions are necessary. For the case of symmetric distributions, it is even allowed that the observations are affected by a noise where each \(\varepsilon _i\) has a different distribution.
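The two transformation groups described above are easy to sample from. The following is a minimal sketch, assuming NumPy; the function names are hypothetical, and the Cauchy noise is chosen only to stress that no moment assumptions are involved.

```python
import numpy as np

def random_sign_matrix(n, rng):
    """Diagonal matrix with independent +1 / -1 entries, each with probability 1/2."""
    return np.diag(rng.choice([-1.0, 1.0], size=n))

def random_permutation_matrix(n, rng):
    """Uniformly chosen n x n permutation matrix (rows of the identity, shuffled)."""
    return np.eye(n)[rng.permutation(n)]

rng = np.random.default_rng(0)
n = 6

# Heavy-tailed, symmetric, i.i.d. noise: both invariances apply here.
eps = rng.standard_cauchy(n)

G_sign = random_sign_matrix(n, rng)
G_perm = random_permutation_matrix(n, rng)

eps_flipped = G_sign @ eps    # same magnitudes, randomly flipped signs
eps_shuffled = G_perm @ eps   # same values, randomly reordered
```

Both matrices are orthogonal, so the transformed vectors have the same Euclidean norm as `eps`.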

4.2 Main assumptions

Before the general construction of our method is explained, first, we highlight the core assumptions we apply. We also discuss their relevance and implications.

Assumption 1

The kernel, \(k(\cdot , \cdot )\), is strictly positive definite and all inputs, \(\{x_i\}\), are distinct with probability one (in other words, \(\forall \,i\ne j: \mathbb {P}\,(\,x_i = x_j\,) \,= \,0\)).

As we discussed in Sect. 3.1, this assumption ensures that \(\text{ rank }(\mathrm {K}_x) = n\) (a.s.), hence there uniquely exists an ideal representation (a.s.), whose unique ideal coefficient vector is denoted by \(\alpha ^*\). The primary choices are universal kernels for which \(\mathcal {H}\) is dense in the space of continuous functions on compact domains of \(\mathcal {X}\).

Assumption 2

The input vector x and the noise vector \(\varepsilon \) are independent.

Assumption 2 implies that the measurement noises, \(\{\varepsilon _i\}\), do not affect the inputs, \(\{x_i\}\); for example, the system is not autoregressive. It is possible to extend our approach to dynamical systems, e.g., using similar ideas as in Csáji et al. (2012), Csáji and Weyer (2015), Csáji (2016), but we leave the extension for future research. Note that Assumption 2 allows deterministic inputs, as a special case.

Assumption 3

Noise \(\varepsilon \) is distributionally invariant w.r.t. a known group of transformations, \((\mathcal {G}, \circ )\), where each \(G \in \mathcal {G}\) acts on \(\mathbb {R}^n\) and \(\circ \) is the function composition.

Assumption 3 states that we know transformations that do not change the (joint) distribution of the measurement noises. As it was discussed in Sect. 4.1, symmetry and exchangeability are two standard examples for which we know such a group of transformations. Thus, if the noise vector is either exchangeable (e.g., it is i.i.d.), or symmetric, or both properties hold, then the theory applies. We also note that the suggested methodology is not limited to exchangeable or symmetric noises, e.g., power defined noises constitute another example (Kolumbán 2016).

Assumption 4

The gradient, or a subgradient, of the objective w.r.t. \(\alpha \) exists and it only depends on the output vector, y, through the residuals, i.e., there is \({\bar{g}}\),
$$\begin{aligned} \nabla _{\!\alpha }\, g({f}_{\alpha }, \mathcal {D}_n)\; = \; {\bar{g}}(x, \alpha , {\widehat{\varepsilon }}(x, y, \alpha ) ), \end{aligned}$$
(13)
where the residuals w.r.t. the sample and the coefficients are defined as
$$\begin{aligned} {\widehat{\varepsilon }}(x, y, \alpha ) \;\doteq \; y\, -\, \mathrm {K}_x\, \alpha . \end{aligned}$$
(14)

For Assumption 4, it is enough if a subgradient is defined for each coefficient vector \(\alpha \), hence, e.g., the cases of \(\varepsilon \)-insensitive and Huber loss functions are also covered. Even in such cases (when we work with subderivatives), we still treat \({\bar{g}}\) as a vector-valued function and choose arbitrarily from the set of possible subgradients.

This requirement is also very mild as it is typically the case that the objective function is differentiable or convex and has subgradients (we will present several demonstrative examples in Sect. 5); furthermore, the objective typically only depends on y through the residuals, which immediately implies Assumption 4.

To see this assume that g is differentiable; then clearly, if the objective function can be written as \(g({f}_{\alpha }, \mathcal {D}_n)\, = \, g_0(x, \alpha , {\widehat{\varepsilon }}(x, y, \alpha ) )\) for some function \(g_0\), then
$$\begin{aligned} \nabla _{\!\alpha }\, g({f}_{\alpha }, \mathcal {D}_n)\, =&\,\,\,\nabla _{\!\alpha }\! \left( g_0(x, \alpha , y - \mathrm {K}_x\,\alpha ) \right) \nonumber \\ =&\,\,\, -\mathrm {K}_x\left( \nabla _{\!\alpha }\, g_0 \right) (x, \alpha , y - \mathrm {K}_x\,\alpha )\nonumber \\ =&\,\,\, {\bar{g}}(x, \alpha , {\widehat{\varepsilon }}(x, y, \alpha ) ), \end{aligned}$$
(15)
where during the derivation we applied the chain rule, used the fact that matrix \(\mathrm {K}_x\) is symmetric and the definition of the residuals, \({\widehat{\varepsilon }}(x, y, \alpha )\, =\, y - \mathrm {K}_x\, \alpha \).
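As a concrete check of Assumption 4, take the ridge-type objective \(\frac{1}{n}\Vert {\widehat{\varepsilon }}\Vert ^2 + \lambda \, \alpha ^\mathrm {T}\mathrm {K}_x\alpha \), which is an assumed example here rather than a formula quoted from the text. Its gradient can be written as \({\bar{g}}(x, \alpha , {\widehat{\varepsilon }}) = -\frac{2}{n}\mathrm {K}_x{\widehat{\varepsilon }} + 2\lambda \mathrm {K}_x\alpha \), so it depends on y only through the residuals. The sketch below compares this closed form with a finite-difference gradient; all names and data are hypothetical.

```python
import numpy as np

def residuals(K, y, alpha):
    """Residuals (14): eps_hat = y - K_x alpha."""
    return y - K @ alpha

def objective(K, y, alpha, lam):
    """Assumed ridge-type objective, written through the residuals."""
    r = residuals(K, y, alpha)
    return r @ r / len(y) + lam * alpha @ K @ alpha

def g_bar(K, alpha, eps_hat, lam):
    """Closed-form gradient as a function of (x via K, alpha, residuals)."""
    return -(2.0 / len(eps_hat)) * K @ eps_hat + 2.0 * lam * K @ alpha

def numerical_gradient(f, alpha, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(alpha)
    for j in range(alpha.size):
        e = np.zeros_like(alpha)
        e[j] = h
        grad[j] = (f(alpha + e) - f(alpha - e)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 5)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)   # Gaussian Gram matrix
y = np.cos(x) + 0.1 * rng.standard_normal(x.size)
alpha = rng.standard_normal(x.size)                  # arbitrary test point
lam = 0.05

analytic = g_bar(K, alpha, residuals(K, y, alpha), lam)
numeric = numerical_gradient(lambda a: objective(K, y, a, lam), alpha)
```

The two gradients should agree up to finite-difference error, which is what allows the residual argument of `g_bar` to be replaced by transformed residuals later on.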

4.3 Perturbed gradients

At first, the proposed method can be understood as a hypothesis testing approach. Given coefficient vector \(\alpha \in \mathbb {R}^n\) we test the null hypothesis \(H_0: \alpha \, = \, \alpha ^*\), i.e., it is the ideal coefficient vector; against the alternative hypothesis \(H_1: \alpha \, \ne \, \alpha ^*\). Under \(H_0\), the residuals of \(f_{\alpha }\) coincide with the “true” (unobserved) noise terms, since by definition (for ideal representations), we have
$$\begin{aligned} {\widehat{\varepsilon }}(x, y, \alpha ^*) \,=\,&\,\,\, y \,-\, \mathrm {K}_x\, \alpha ^* \nonumber \\ = \,&\,\,\, [\,f_*(x_1) + \varepsilon _1, \dots , f_*(x_n) + \varepsilon _n\,]^\mathrm {T}\nonumber \\ - \,&\, \, \, [\,f_*(x_1), \dots , f_*(x_n)\,]^\mathrm {T}\,\,=\,\, \varepsilon . \end{aligned}$$
(16)
Consequently, based on the group of invariant transformations, \(\mathcal {G}\), we know that the (joint) distribution of the residuals does not change if we transform them by any \(G \in \mathcal {G}\) (under \(H_0\)). Then, we can generate alternative realizations of the residuals, \({\widehat{\varepsilon }}(x, y, \alpha ^*)\), by applying a random transformation \(G \in \mathcal {G}\), and the resulting alternative realization, \(G({\widehat{\varepsilon }}(x, y, \alpha ^*))\), will behave “similarly” (in the statistical sense) to the original residual vector (i.e., the true noise vector).

However, under \(H_1\), if coefficient vector \(\alpha \) does not define an ideal representation, \({\widehat{\varepsilon }}(x, y, \alpha )\), in general, will not coincide with the true noises. Therefore, the distributions of their randomly transformed variants will be distorted and will statistically not behave “similarly” to the original residuals.

Of course, we need a way to measure “similar behavior”. Since we want to measure the uncertainty of a model constructed by using a certain objective function, we will measure similarity by recalculating (the magnitude of) its gradient (w.r.t. \(\alpha \)) with the transformed residuals and apply a rank test (Good 2005).

Let us define a reference function, \(Z_0 : \mathbb {R}^n \rightarrow \mathbb {R}\), and \(m-1\) perturbed functions, \(\{Z_i\}\), with \(Z_i: \mathbb {R}^n \rightarrow \mathbb {R}\), where m is a user-chosen hyper-parameter, as follows
$$\begin{aligned} Z_0(\alpha ) \,&\doteq \, \Vert \, {\Psi }(x) \, {\bar{g}}(x, \alpha , G_0 ({\widehat{\varepsilon }}(x, y, \alpha )) ) \, \Vert ^2, \end{aligned}$$
(17)
$$\begin{aligned} Z_i(\alpha ) \,&\doteq \, \Vert \, {\Psi }(x) \, {\bar{g}}(x, \alpha , G_i ({\widehat{\varepsilon }}(x, y, \alpha )) ) \, \Vert ^2, \end{aligned}$$
(18)
for \(i= 1, \dots , m-1\), where \({\Psi }(x)\) is some (possibly input dependent) positive definite weighting matrix, \(G_0\) is the identity element of \(\mathcal {G}\) (w.l.o.g. the identity transformation), and \(\{G_i\}\) are i.i.d. random transformations from \(\mathcal {G}\), sampled using the uniform distribution on \(\mathcal {G}\). They are generated independently of the other random elements of the system, such as the input vector x and the noise vector \(\varepsilon \).

For symmetric noises, transformation \(G_i \in \mathcal {G}\) is basically a random \(n \times n\) diagonal matrix whose diagonal elements are \(+1\) or \(-1\), each having \(\nicefrac {1}{2}\) probability to be selected, independently of the other elements of the diagonal.

On the other hand, for the case of exchangeable noise terms, each transformation \(G_i \in \mathcal {G}\) is a randomly (uniformly) chosen \(n \times n\) permutation matrix.

Weighting matrix \({\Psi }(x)\) is included in the construction to allow some additional flexibility, e.g., if we have some a priori information on the measurement noises. We will see an example for the special case of quadratic objectives in Sect. 4.6. In case no such information is available, \({\Psi }(x)\) can be chosen as identity.

We can observe that for the ideal coefficient vector \(\alpha ^*\), we have
$$\begin{aligned} Z_0(\alpha ^*)\, \, =&\,\,\, \Vert \, {\Psi }(x) \, {\bar{g}}(x, \alpha ^*, \varepsilon ) \, \Vert ^2 \nonumber \\ {\buildrel d \over =}&\,\,\, \Vert \, {\Psi }(x) \, {\bar{g}}(x, \alpha ^*, G_i (\varepsilon ) ) \, \Vert ^2 \nonumber \\ =&\,\,\, Z_i(\alpha ^*), \end{aligned}$$
(19)
for \(i= 1, \dots , m-1\), where “\({\buildrel d \over =}\)” denotes equality in distribution. Therefore, the \(\{Z_i(\alpha ^*)\}_{i=0}^{m-1}\) variables have the same (marginal) distribution, though, they are of course not independent. It can be shown, however, that they are conditionally independent, and therefore all of their possible orderings are equally likely, with possible tie-breakings, which can be used to measure similar behavior.

On the other hand, for \(\alpha \ne \alpha ^*\), this distributional equivalence does not hold, and we expect that if  \(\Vert \, \alpha - \alpha ^*\, \Vert \) is large enough, the reference element \(Z_0(\alpha )\) will dominate the perturbed elements, \(\{Z_i(\alpha )\}_{i=1}^{m-1}\), with high probability, from which we can detect (statistically) that coefficient vector \(\alpha \) is not the ideal one, \(\alpha \ne \alpha ^*\).

4.4 Normalized ranks

Now, we make our argument, including possible tie-breakings, more precise by introducing the concept of normalized ranks. Formally, the normalized rank of the reference element, \(Z_0(\alpha )\), among all \(\{Z_i(\alpha )\}_{i=0}^{m-1}\) elements is defined as follows
$$\begin{aligned} \mathcal {R}(\alpha )\, \doteq \,\mathcal {R}_m(\alpha )\, \doteq \, \frac{1}{m} \bigg [\, 1 + \sum _{i=1}^{m-1} \mathbb {I}\left( Z_i(\alpha ) \prec _{\pi } Z_0(\alpha ) \right) \bigg ], \end{aligned}$$
(20)
where \(\mathbb {I}(\cdot )\) is an indicator function, namely, its value is 1 if its argument is true and 0 otherwise; \(m \in \mathbb {N}\) is a user-chosen hyper-parameter; and binary relation “\(\prec _{\pi }\)” is the standard “<” with random tie-breaking (according to a fixed, pre-generated random order). More precisely, let \(\pi \) be a random (uniformly chosen) permutation of the set \(\{ 0, \dots , m-1 \}\). Then, given m arbitrary real numbers, \(Z_0, \dots , Z_{m-1}\), we can construct a strict total order, denoted by “\(\prec _{\pi }\)”, by defining \(Z_k \prec _{\pi } Z_j\) if and only if \(Z_k < Z_j\), or both \(Z_k = Z_j\) and \(\pi (k) < \pi (j)\) hold.
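
A minimal sketch of the normalized rank with random tie-breaking is given below. The function names are hypothetical. The sketch counts the perturbed values that precede the reference value in the order \(\prec _{\pi }\), so that a dominating \(Z_0\) receives a rank close to 1, in line with the detection argument at the end of Sect. 4.3.

```python
import random

def precedes(zk, k, zj, j, pi):
    """Strict total order: smaller value first, ties broken by the order pi."""
    return zk < zj or (zk == zj and pi[k] < pi[j])

def normalized_rank(z, pi):
    """Normalized rank of z[0] among z[0..m-1]; the result lies in {1/m, ..., 1}."""
    m = len(z)
    smaller = sum(1 for i in range(1, m) if precedes(z[i], i, z[0], 0, pi))
    return (1 + smaller) / m

rng = random.Random(1)  # fixed seed, only for reproducibility
m = 5
pi = list(range(m))
rng.shuffle(pi)  # pre-generated random tie-breaking order

rank_low = normalized_rank([0.1, 0.5, 0.9, 0.7, 0.3], pi)   # reference is the smallest
rank_high = normalized_rank([2.0, 0.5, 0.9, 0.7, 0.3], pi)  # reference dominates
rank_tied = normalized_rank([1.0] * m, pi)                  # decided by pi alone
```

When all values are tied, the rank is determined by the position of index 0 in the permutation, which is uniformly distributed; this is why ties do not spoil the exactness of the construction.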

4.5 Exact confidence

Parameter m influences the resolution of the confidence probability we can achieve. Namely, a probability \(p \in (0, 1)\) is admissible if it can be written in the form of \(p = 1 - \nicefrac {q}{m}\), where q is an integer satisfying \(0< q < m\). On the other hand, since both m and q are (hyper) parameters, their values are user-chosen. Hence, every rational probability \(p \in (0, 1)\) is admissible, by choosing m and q appropriately. Then, a confidence set for an admissible probability \(p = p(m,q)\) is
$$\begin{aligned} A_p \, \doteq \, \left\{ \, \alpha : \mathcal {R}(\alpha ) \le p \, \right\} \, = \,\left\{ \, \alpha : \mathcal {R}_m(\alpha ) \le 1 - \nicefrac {q}{m} \, \right\} . \end{aligned}$$
(21)
One of the main questions is: what kind of stochastic guarantees do such confidence regions have? The following theorem states that they are exact.

Theorem 2

Under Assumptions 1, 2, 3 and 4, the coverage probability of the constructed confidence region with respect to the ideal coefficient vector \(\alpha ^*\) is
$$\begin{aligned} \mathbb {P}\,\big (\, \alpha ^* \in A_p \,\big ) \, = \,\, p \,\, = \,\, 1 - \frac{q}{m}, \end{aligned}$$
(22)
for any choice of the integer hyper-parameters satisfying  \(0\,<\, q\, < \,m\).

Proof

Following Csáji et al. (2015), the core idea is to show that variables
$$\begin{aligned} Z_0(\alpha ^*),\; Z_1(\alpha ^*),\; \dots ,\; Z_{m-1}(\alpha ^*) \end{aligned}$$
(23)
are uniformly ordered, which means that each ordering of them, with respect to the strict total order \(\prec _{\pi }\), has the same probability, that is 1 / m!, formally,
$$\begin{aligned} \mathbb {P}\,\big (\, Z_{i_0}(\alpha ^*) \prec _{\pi } Z_{i_1}(\alpha ^*) \prec _{\pi } \dots \prec _{\pi } Z_{i_{m-1}}(\alpha ^*) \,\big ) \, = \,\, \frac{1}{m!}, \end{aligned}$$
(24)
where \((i_0, i_1, \dots , i_{m-1})\) is an arbitrary permutation of \((0,1,\dots , m-1)\). This ordering property is not obvious, since they are not independent, even though we already observed that they are identically distributed (for ideal coefficients).
By definition, \(\alpha ^* \in A_p\) if and only if \(\mathcal {R}(\alpha ^*) \le 1 - \nicefrac {q}{m}\), i.e., if the reference element, \(Z_0(\alpha ^*)\) takes one of the positions \(1, \dots , m-q\) in the ordering of \(\{Z_i(\alpha ^*)\}_{i=0}^{m-1}\) variables, w.r.t. the strict total order \(\prec _{\pi }\). Then, assuming they are uniformly ordered (yet to be shown), we know that \(Z_0(\alpha ^*)\) takes each position in the ordering with probability exactly 1 / m. Therefore, for \(i \in \{1, \dots , m\}\), we have
$$\begin{aligned} \mathbb {P}\Big (\,\mathcal {R}(\alpha ^*) \,=\, \frac{i}{m}\;\Big )\, =\, \frac{1}{m}, \end{aligned}$$
(25)
from which it follows that \(\mathbb {P}\bigl (\alpha ^* \in A_p\bigr )\, =\, 1 - \nicefrac {q}{m}\) by taking into account that events \(\{\,\mathcal {R}(\alpha ^*) = \nicefrac {i}{m}\,\}\) and \(\{\,\mathcal {R}(\alpha ^*) = \nicefrac {j}{m}\,\}\) are disjoint, if \(i \ne j\).
In order to show that \(\{Z_i(\alpha ^*)\}_{i=0}^{m-1}\) are indeed uniformly ordered, we can apply Theorem 2.17 of Kolumbán (2016). Our proposed approach can be interpreted as a variant of a DP method, even though formally the DP “performance measures” can depend on the parameters, \(\alpha \), the inputs, x, and the perturbed outputs, \(y^{(i)}\), but not directly on the perturbed residuals. Nevertheless, in our case, \(y^{(i)}\) is
$$\begin{aligned} y^{(i)} \, \doteq \, f_{\alpha }(x)\, +\, G_i ({\widehat{\varepsilon }}(x, y, \alpha )), \end{aligned}$$
(26)
where \(f_{\alpha }(x) \doteq [\,f_{\alpha }(x_1), \dots , f_{\alpha }(x_n) ]^\mathrm {T}\). Then, obviously we can compute the transformed residuals, \(G_i ({\widehat{\varepsilon }}(x, y, \alpha ))\), from \(\alpha \), x, and \(y^{(i)}\) by using that \(G_i ({\widehat{\varepsilon }}(x, y, \alpha )) = y^{(i)} - f_{\alpha }(x)\). Hence, the DP performance measure in our case is defined as
$$\begin{aligned} Z(\alpha , x, y^{(i)}) \,\doteq \,\Vert \, {\Psi }(x)\, {\bar{g}}(x, \alpha , y^{(i)} - f_{\alpha }(x) ) \, \Vert ^2, \end{aligned}$$
(27)
which now fits the DP framework. Our Assumption 4 ensures that this function is well-defined and, together with Assumption 2, it also guarantees that we do not need to compute \(\{y^{(i)}\}\) to evaluate the perturbed functions. Our Assumption 3 directly states that the noise, \(\varepsilon \), is invariant under a compact group of transformations, which is a requirement of Theorem 2.17, and we already observed that true errors coincide with the residuals of ideal representations, \({\widehat{\varepsilon }}(x, y, \alpha ^*)\, =\, \varepsilon \). \(\square \)

Theorem 2 shows that the confidence region contains the ideal coefficient vector with probability exactly p, and that this statement is non-asymptotically guaranteed, even though the method is distribution-free. Since m and q are user-chosen (hyper-parameters), the confidence probability is under our control. The confidence level does not depend on the weighting matrix, but this matrix influences the shape of the region. Ideally, the weighting matrix should be proportional to the square root of the covariance of the estimate.
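
The exactness claim can be checked empirically with a small Monte Carlo sketch. The setup below is an illustrative assumption, not the experiment of the paper: a scalar model \(y_j = \alpha ^* x_j + \varepsilon _j\) with least-squares objective and identity weighting, for which the perturbed gradient is proportional to \(\sum _j x_j\, \sigma _{i,j}\, (y_j - \alpha x_j)\); Laplace noise is drawn as the difference of two exponential variates; and \(m = 10\), \(q = 2\), \(n = 20\) are arbitrary choices.

```python
import random

def covers(alpha_star, n, m, q, rng):
    """One experiment: is alpha_star inside A_p for freshly drawn data?"""
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Laplace(0, 1/2) noise as a difference of two exponentials with mean 1/2.
    y = [alpha_star * xj + rng.expovariate(2.0) - rng.expovariate(2.0) for xj in x]
    res = [yj - alpha_star * xj for xj, yj in zip(x, y)]  # residuals at alpha_star
    z = [sum(xj * rj for xj, rj in zip(x, res)) ** 2]      # Z_0: no perturbation
    for _ in range(m - 1):
        signs = [rng.choice((1, -1)) for _ in range(n)]
        z.append(sum(xj * s * rj for xj, s, rj in zip(x, signs, res)) ** 2)
    pi = list(range(m))
    rng.shuffle(pi)                                        # tie-breaking order
    smaller = sum(1 for i in range(1, m)
                  if z[i] < z[0] or (z[i] == z[0] and pi[i] < pi[0]))
    # R(alpha_star) <= 1 - q/m, written with integers to avoid rounding issues.
    return 1 + smaller <= m - q

rng = random.Random(2019)  # fixed seed, only for reproducibility
m, q, trials = 10, 2, 3000
coverage = sum(covers(1.5, 20, m, q, rng) for _ in range(trials)) / trials
print(f"empirical coverage {coverage:.3f}, nominal {1 - q / m:.3f}")
```

Under these assumptions the empirical frequency should fluctuate around the nominal \(1 - \nicefrac {q}{m}\) only by the usual Monte Carlo error, for this small sample size and this heavy-tailed noise.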

4.6 Quadratic objectives and symmetric noises

If we work with convex quadratic objectives, which have special importance for kernel methods (Hofmann et al. 2008), and assume independent and symmetric noises, we get the Sign-Perturbed Sums (SPS) method (Csáji et al. 2015) as a special case (using the inverse square root of the Hessian as a weighting matrix).

The SPS method uses the classical least-squares (LS) objective function,
$$\begin{aligned} g(f_{\alpha }, \mathcal {D}_n) \, = \, \Vert \,z \,-\, {\Phi }\, \alpha \, \Vert ^2, \end{aligned}$$
(28)
where z denotes the vector of outputs and \({\Phi }\) is the regressor matrix. Objective (28) can be seen as the canonical form of many quadratic functions (cf. Sect. 5).

When using the SPS method, we make the following assumptions: the noise terms, \(\{\varepsilon _i\}\), are independent and have symmetric distributions about zero; and the regressor matrix, \({\Phi }\), has independent rows, it is skinny and full rank.

For SPS, the reference and the perturbed functions are defined as
$$\begin{aligned} Z_i(\alpha ) \, \doteq \, \Vert \, ({\Phi }^\mathrm {T}{\Phi })^{-\nicefrac {1}{2}} {\Phi }^\mathrm {T}G_i (z - {\Phi }\, \alpha )\, \Vert ^2, \end{aligned}$$
(29)
for \(i= 0, \dots , m-1\), where \(G_i = \text{ diag }(\sigma _{i, 1}, \dots , \sigma _{i, n})\), for \(i \ne 0\), where random variables \(\{ \sigma _{i, j} \}\) are i.i.d. having Rademacher distribution, i.e., they take values \(+1\) and \(-1\) with probability \(\nicefrac {1}{2}\) each; and \(G_0 = I_n\) is the identity matrix.
It is easy to see that (29) is a special case of construction (17)–(18), where z are the outputs and \({\Phi }\) is computed from the inputs. Besides being exact, the confidence regions of SPS have additional important properties, such as they are star convex with the LS estimate, \({\widehat{\alpha }}\), as a star center (Csáji et al. 2015). Moreover, they have ellipsoidal outer approximations, that is there are regions of the form
$$\begin{aligned} A^\circ _p \; \doteq \; \Big \{\, \alpha \in \mathbb {R}^n\, :\, (\alpha -{\widehat{\alpha }})^\mathrm {T}\frac{1}{n}{\Phi }^\mathrm {T}{\Phi }(\alpha -{\widehat{\alpha }})\,\le \, r \, \Big \}, \end{aligned}$$
(30)
where \(A_p \, \subseteq A^\circ _p\) and radius of the ellipsoid, r, can be computed (in polynomial time) by solving semi-definite programming problems (Csáji et al. 2015).

Hence, for quadratic problems, the obtained regions are star convex, thus connected, and they have ellipsoidal outer approximations, thus they are bounded. These properties ensure that it is easy to work with them. For example, using star convexity and boundedness, we can efficiently explore the region by knowing that every point of it can be reached from the given star center by a line segment inside the region. Moreover, the ellipsoidal outer approximation provides a compact representation.
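
The SPS membership test based on (29) can be sketched as follows, in plain Python and with hypothetical helper names. The sketch uses the identity \(\Vert M^{-\nicefrac {1}{2}} v \Vert ^2 = v^\mathrm {T}M^{-1} v\) for \(M = {\Phi }^\mathrm {T}{\Phi }\), so only a linear system is solved and no matrix square root is formed; the data are made up for illustration.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def gram_matrix(phi):
    """Phi^T Phi for a regressor matrix given as a list of rows."""
    d = len(phi[0])
    return [[sum(row[a] * row[b] for row in phi) for b in range(d)] for a in range(d)]

def sps_value(phi, z, alpha, signs, gram):
    """Z_i(alpha) = || (Phi^T Phi)^(-1/2) Phi^T G_i (z - Phi alpha) ||^2."""
    res = [zj - sum(p * a for p, a in zip(row, alpha)) for row, zj in zip(phi, z)]
    v = [sum(row[c] * s * r for row, s, r in zip(phi, signs, res))
         for c in range(len(alpha))]
    return sum(vc * wc for vc, wc in zip(v, solve(gram, v)))  # v^T M^(-1) v

def in_region(phi, z, alpha, sign_list, pi, q):
    """Test alpha in A_p with p = 1 - q/m and m = len(sign_list) + 1."""
    m, gram = len(sign_list) + 1, gram_matrix(phi)
    vals = [sps_value(phi, z, alpha, [1] * len(z), gram)]     # G_0 is the identity
    vals += [sps_value(phi, z, alpha, s, gram) for s in sign_list]
    smaller = sum(1 for i in range(1, m)
                  if vals[i] < vals[0] or (vals[i] == vals[0] and pi[i] < pi[0]))
    return 1 + smaller <= m - q

rng = random.Random(7)  # fixed seed, only for reproducibility
n, m, q = 30, 20, 2
phi = [[1.0, j / 10.0] for j in range(n)]
z = [2.0 + 3.0 * row[1] + rng.uniform(-0.5, 0.5) for row in phi]
sign_list = [[rng.choice((1, -1)) for _ in range(n)] for _ in range(m - 1)]
pi = list(range(m))
rng.shuffle(pi)

rhs = [sum(row[c] * zj for row, zj in zip(phi, z)) for c in range(2)]
alpha_hat = solve(gram_matrix(phi), rhs)                      # LS estimate
inside_ls = in_region(phi, z, alpha_hat, sign_list, pi, q)
inside_far = in_region(phi, z, [50.0, -40.0], sign_list, pi, q)
```

At the LS estimate the unperturbed gradient vanishes, so \(Z_0({\widehat{\alpha }})\) is (numerically) zero and the estimate is in the region, whereas for a distant coefficient vector the reference value dominates the perturbed ones and the vector is excluded.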

5 Applications and experiments

In this section, we show specific applications of the proposed uncertainty quantification (UQ) approach for typical kernel methods, such as LS-SVC, KRR, \(\varepsilon \)-SVR and KLASSO, in order to demonstrate the usage and the power of the framework.

We also present several numerical experiments to illustrate the family of confidence regions we get for various confidence levels. We always set hyper-parameter m to 100 in the experiments. The figures were constructed by Monte Carlo simulations, i.e., evaluating \(1\,000\,000\) random coefficients and drawing the graphs of their induced models with colors indicating their confidence levels.

5.1 Uncertainty quantification for least-squares support vector classification

We start with a classification problem and consider the Least-Squares Support Vector Classification (LS-SVC) method (Suykens and Vandewalle 1999). LS-SVC under the Euclidean distance is known to be equivalent to hard-margin SVC using the Mahalanobis distance (Ye and Xiong 2007). It has the advantage that it can be solved by a system of linear equations, in contrast to a quadratic problem.

We assume that \(x_k \in \mathbb {R}^d\) and \(y_k \in \{+1, -1\}\), for all \(k \in \{1, \dots , n\}\), as well as that the slack variables, i.e., the algebraic (signed) distances of the objects from the corresponding margins, are independent and distributed symmetrically, for the ideal representation; which we will identify with the best possible classifier.

For simplicity, we consider linear classification, that is models of the form
$$\begin{aligned} h_{\alpha } (x_k) \; \doteq \; \text{ sign }(\, w^\mathrm {T}x_k + b\,) \,=\, \text{ sign }(\, \alpha ^\mathrm {T}{\tilde{x}}_k \,), \end{aligned}$$
(31)
where \(x_k\) is an input vector, \(\alpha \, \doteq \, [\,b,\, w^\mathrm {T}\,]^\mathrm {T}\) and \({\tilde{x}}_k\, \doteq \, [\,1,\, x_k^\mathrm {T}\,]^\mathrm {T}\).
The standard (primal) formulation of (soft-margin) LS-SVM classification is
$$\begin{aligned} \text{ minimize }\quad&\frac{1}{2}\, w^\mathrm {T}w \,+\, \lambda \sum _{k=1}^n \xi ^2_k \end{aligned}$$
(32)
$$\begin{aligned} \text{ subject } \text{ to }\quad&y_k (w^\mathrm {T}x_k + b) \,=\, 1 - \xi _k \end{aligned}$$
(33)
for \(k= 1, \dots , n\), where \(\lambda > 0\) is fixed. Variables \(\{\xi _i\}\) are called the slack variables. The convex quadratic problem above can be rewritten as minimizing
$$\begin{aligned} g(f_{\alpha }, \mathcal {D}_n) \;\doteq \; \frac{1}{2}\, \Vert \, B\, \alpha \,\Vert ^2 \,+\, \lambda \, \Vert \,\mathbb {1}_n - y \odot (X \alpha ) \, \Vert ^2, \end{aligned}$$
(34)
where \(\mathbb {1}_n \in \mathbb {R}^n\) is the all-one vector, \(\odot \) denotes the Hadamard (entrywise) product, \(X \doteq [\,{\tilde{x}}_1, \dots , {\tilde{x}}_n\,]^\mathrm {T}\) and the role of matrix B is to remove the bias, b, from \(\alpha \), i.e., \(B \,\doteq \, \text{ diag }(0, 1, \dots , 1)\). Note that the reformulated problem (34) is unconstrained.
Observe that the objective function, \(g(f_{\alpha }, \mathcal {D}_n)\), can be further reformulated to take the canonical form of \(\Vert \,z - {\Phi }\, \alpha \, \Vert ^2\) by using the following \({\Phi }\) and z,
$$\begin{aligned} {\Phi }= & {} \left[ \begin{array} {c} \sqrt{\lambda }\, (y\mathbb {1}_d^\mathrm {T}) \odot X \\ (\nicefrac {1} {\sqrt{2}})\, B \end{array} \right] \!,\\ z= & {} \left[ \begin{array} {c} \sqrt{\lambda }\, \mathbb {1}_n \\ 0_d \end{array} \right] \!, \end{aligned}$$
(35)
where \(0_d \in \mathbb {R}^d\) is the all-zero vector. Then, we can apply SPS to the obtained (ordinary) LS formulation. However, we should be careful with the transformations, as the new problem has some auxiliary output terms, the zero part of z, for which there are no slack variables. The residuals corresponding to that part are not even stochastic, therefore, the last d terms of the residual vector, \(z - {\Phi }\, \alpha \), should not be perturbed. Consequently, the transformation matrices \(\{{G}_i\}\) are defined as
$$\begin{aligned} G_i \,\,\doteq \,\, \left[ \begin{array}{cc} \,{\bar{G}}_i &{}\,\,\, 0\; \\ \,0 &{}\,\,\, I\; \end{array} \right] \!, \end{aligned}$$
(36)
for \(i=0,\dots , m-1\), where \({\bar{G}}_0 = I_n\) is the identity, and \({\bar{G}}_i \doteq \text{ diag }(\sigma _{i,1}, \dots , \sigma _{i,n})\), for \(i\ne 0\), where \(\{\sigma _{i,j}\}\) are i.i.d. Rademacher random variables, as before.

Then, (exact) confidence regions and (honest) ellipsoidal outer approximations can be constructed for the best linear classifier in the domain of coefficients by the SPS method, i.e., (29), with regressor matrix and output vector as defined in (35) and transformations as in (36). The regions will be centered around the LS-SVM classifier, i.e., for all (rational) \(p \in (0, 1)\), the coefficients of LS-SVC are contained in \(A_p\), assuming it is non-empty. As each coefficient vector uniquely identifies a classifier, the obtained region can be mapped to the model space, as well.
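
The canonical reformulation (35) and the block transformations (36) can be sanity-checked numerically. In the sketch below, which uses hypothetical helper names and made-up data, the auxiliary block is taken to have as many rows as \(\alpha \) has entries (bias included); this is an assumed reading of the dimensions in (35).

```python
import math

def lssvc_objective(xs, ys, alpha, lam):
    """Objective (34): 0.5 * ||B alpha||^2 + lam * ||1 - y * (X alpha)||^2."""
    w = alpha[1:]                                    # B removes the bias alpha[0]
    margins = [yk * (alpha[0] + sum(wi * xi for wi, xi in zip(w, xk)))
               for xk, yk in zip(xs, ys)]
    return 0.5 * sum(wi * wi for wi in w) + lam * sum((1 - mk) ** 2 for mk in margins)

def canonical_form(xs, ys, lam):
    """Stack Phi and z so that ||z - Phi alpha||^2 equals objective (34)."""
    p = len(xs[0]) + 1                               # length of alpha = [b, w]
    top = [[math.sqrt(lam) * yk * v for v in [1.0] + list(xk)]
           for xk, yk in zip(xs, ys)]
    bottom = [[1 / math.sqrt(2) if (r == c and r > 0) else 0.0 for c in range(p)]
              for r in range(p)]
    z = [math.sqrt(lam)] * len(xs) + [0.0] * p
    return top + bottom, z

def block_signs(bar_signs, p):
    """Diagonal of G_i in (36): perturb the n data rows, keep the p auxiliary rows."""
    return list(bar_signs) + [1] * p

xs = [[0.2, 0.9], [-0.1, 0.4], [-0.7, 0.1], [-0.3, -0.5]]   # made-up inputs
ys = [1, 1, -1, -1]
alpha, lam = [0.3, 1.1, -0.6], 0.1
phi, z = canonical_form(xs, ys, lam)
canon = sum((zj - sum(a * v for a, v in zip(alpha, row))) ** 2
            for row, zj in zip(phi, z))
direct = lssvc_objective(xs, ys, alpha, lam)
g1 = block_signs([1, -1, -1, 1], p=3)
```

The two objective values agree up to rounding, and the last entries of each transformation stay at \(+1\), so the deterministic auxiliary residuals are never perturbed.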

UQ for LS-SVC is illustrated in Fig. 1. The observations were generated by adding Laplace noises to the coordinates of the corresponding class centers. The constructed confidence regions are shown both in the coefficient and model spaces, without the bias term, for simplicity. The possibility of constructing (honest) ellipsoidal outer approximations of the (exact) regions is also illustrated.
Fig. 1

Exact, non-asymptotic, distribution-free confidence regions for ideal RKHS representations. Parts a and b present UQ for Least-Squares Support Vector Classification (LS-SVC) with \(\lambda = 0.1\) in the model and coefficient spaces, respectively. The ellipsoidal outer approximations of the regions having probabilities \(10\,\%\), \(50\,\%\) and \(90\,\%\) are also presented in the coefficient space. There were \(n = 100\) observations, 50 for each class. The centers of the classes were (0, 0.5) and \((-0.5, 0)\). For each observation i.i.d. Laplace noises were added to the coordinates of the corresponding centers. The parameters of the noises were \(\mu = 0\) (location) and \(b = \nicefrac {1}{2}\) (scale). The confidence level of each color can be interpreted by using the scale bars. The regions are increasing, i.e., \(A_p \subseteq A_q\) if \(p \le q\), thus, only the smallest levels are shown

5.2 Uncertainty quantification for kernel ridge regression

Our next example is Kernel Ridge Regression (KRR) which is a kernelized version of Tikhonov regularized LS (Shawe-Taylor and Cristianini 2004). The KRR estimate minimizes a quadratic loss function with a Hilbert space norm regularizer,
$$\begin{aligned} {\hat{f}}_{\mathrm{KRR}} \in \mathop {\hbox {argmin}}\limits _{f \in \mathcal {H}}\, \frac{1}{n}\,\sum _{i=1}^n w_i (y_i - f(x_i))^2 \,+\, \lambda \, \Vert f \Vert ^2_{\mathcal {H}}, \end{aligned}$$
(37)
where \(\lambda > 0\), \(w_i > 0\), \(i=1, \dots , n\), are some a priori given (constant) weights. After using the representer theorem, the objective function can be rewritten as
$$\begin{aligned} g(f_{\alpha }, \mathcal {D}_n) \, \doteq&\,\,\, \frac{1}{n}\,\sum _{i=1}^n w_i (y_i - f_{\alpha }(x_i))^2 \,+\, \lambda \, \Vert f \Vert ^2_{\mathcal {H}} \, \nonumber \\ =&\,\,\, \frac{1}{n}\,\Vert \, y - f_{\alpha }(x) \,\Vert ^2_{W} \,+\, \lambda \, \Vert f \Vert ^2_{\mathcal {H}} \, \nonumber \\ =&\,\, \, \frac{1}{n}\,(y - \mathrm {K}_x\, \alpha )^\mathrm {T}W (y - \mathrm {K}_x\, \alpha ) \,+\, \lambda \, \alpha ^\mathrm {T}\mathrm {K}_x\, \alpha , \end{aligned}$$
(38)
where \(f_{\alpha }(x) \doteq [\,f_{\alpha }(x_1), \dots , f_{\alpha }(x_n) ]^\mathrm {T}\), \(W \doteq \text{ diag }(w_1,\dots , w_n)\), and we used the reproducing property to replace the Hilbert space norm with a quadratic term.
We can reformulate (38) in the canonical form, \(\Vert \,z \,-\, {\Phi }\, \alpha \,\Vert ^2\), by using
$$\begin{aligned} {\Phi }\; =\; \left[ \begin{array}{c} \;(\nicefrac {1}{\sqrt{n}})\,W^{\frac{1}{2}} \mathrm {K}_x\; \\ \sqrt{\lambda }\, \mathrm {K}_x^{\frac{1}{2}} \end{array} \right] \!,\qquad \text{ and }\qquad z \;=\; \left[ \begin{array}{c}\; (\nicefrac {1}{\sqrt{n}})\, W^{\frac{1}{2}} y\; \\ \;0_n\; \end{array} \right] \!, \end{aligned}$$
(39)
where \(W^{\frac{1}{2}}\) and \(\mathrm {K}_x^{\frac{1}{2}}\) denote the square roots of matrices W and \(\mathrm {K}_x\), respectively. Note that the square roots exist as these matrices are positive semidefinite.

Then, assuming symmetric and independent measurement noises, formula (29), with regressor matrix and output vector defined by (39), can be applied to build confidence regions. As in the case of LS-SVM classifier, the canonical reformulation also contains some auxiliary terms, the zero part of z, for which there are no real noise terms, therefore, they should not be perturbed. Thus, we should again use the transformations defined by (36) to get guaranteed confidence regions.
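
The canonical form (39) can be checked in the same way. The sketch below makes two labeled substitutions: the Gaussian kernel is assumed to be \(k(a, b) = \exp (-(a-b)^2 / (2\sigma ^2))\), and the transposed Cholesky factor \(L^\mathrm {T}\) (with \(L L^\mathrm {T}= \mathrm {K}_x\)) stands in for the symmetric square root \(\mathrm {K}_x^{\frac{1}{2}}\). Helper names and data are hypothetical.

```python
import math

def gaussian_kernel_matrix(xs, sigma):
    """Gram matrix K_x for k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) (assumed form)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def cholesky(k):
    """Lower-triangular L with L L^T = k, for a positive definite k."""
    n = len(k)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = k[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def krr_canonical(kmat, y, w, lam):
    """Phi and z as in (39), with L^T standing in for the square root of K_x."""
    n = len(y)
    low = cholesky(kmat)
    top = [[math.sqrt(wi / n) * v for v in row] for row, wi in zip(kmat, w)]
    bottom = [[math.sqrt(lam) * low[c][r] for c in range(n)] for r in range(n)]
    z = [math.sqrt(wi / n) * yi for wi, yi in zip(w, y)] + [0.0] * n
    return top + bottom, z

def krr_objective(kmat, y, w, lam, alpha):
    """Objective (38): weighted squared loss plus lam * alpha^T K_x alpha."""
    n = len(y)
    k_alpha = [sum(v * a for v, a in zip(row, alpha)) for row in kmat]
    loss = sum(wi * (yi - fi) ** 2 for wi, yi, fi in zip(w, y, k_alpha)) / n
    return loss + lam * sum(a * f for a, f in zip(alpha, k_alpha))

xs = [0.0, 1.0, 2.0, 3.0]            # made-up inputs
y = [0.1, 0.9, 1.7, 0.5]
w = [1.0, 2.0, 1.0, 0.5]
alpha, lam = [0.2, -0.4, 1.0, 0.3], 0.1
kmat = gaussian_kernel_matrix(xs, sigma=0.5)
phi, z = krr_canonical(kmat, y, w, lam)
canon = sum((zj - sum(p * a for p, a in zip(row, alpha))) ** 2
            for row, zj in zip(phi, z))
direct = krr_objective(kmat, y, w, lam, alpha)
```

The substitution does not change the quantities used by (29) with the transformations (36): the auxiliary block enters \({\Phi }^\mathrm {T}{\Phi }\) and the unperturbed part of the gradient only through \(\lambda \, \mathrm {K}_x\), which is the same for any factor R with \(R^\mathrm {T}R = \mathrm {K}_x\).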

Experiments illustrating the family of (exact, non-asymptotic, distribution-free) confidence regions of KRR with Gaussian kernels and Laplacian measurement noises, and comparing the results with that of support vector regression, are shown in Fig. 2. The discussion of the comparison is located in Sect. 5.3.

5.3 Uncertainty quantification for support vector regression

The previous examples were quadratic and therefore, for symmetric noises, their uncertainty could be quantified with SPS. This may no longer be true if we change the applied norms. In this section we study support vector regression, particularly, \(\varepsilon \)-SVR (Hofmann et al. 2008; Schölkopf and Smola 2001; Steinwart and Christmann 2008). A well-known advantage of \(\varepsilon \)-SVR, for example, over KRR, is that it ensures sparse representations through the \(\varepsilon \)-insensitive loss function. In order to avoid confusion with the true noise vector, \(\varepsilon \), we denote the tolerance parameter of the loss function by \({\bar{\varepsilon }}\). The primal objective function of \(\varepsilon \)-SVR is
$$\begin{aligned} h({f}, \mathcal {D}_n) \doteq \frac{1}{2} \Vert f \Vert ^2_{\mathcal {H}} + \frac{c}{n} \sum _{k=1}^{n} \max \{ 0, | \langle f, \phi (x_k) \rangle _{\mathcal {H}} - y_k | -{{\bar{\varepsilon }}}\}, \end{aligned}$$
(40)
where \(f \in \mathcal {H}\), \(c>0\), and \(\phi (z) \doteq k(z, \cdot )\) is the feature map. Function (40) can be reformulated by applying slack variables, then using standard arguments based on the Lagrangian and the Karush–Kuhn–Tucker (KKT) conditions, we arrive at the Wolfe dual of \(\varepsilon \)-SVR (Schölkopf and Smola 2001), where we have to maximize
$$\begin{aligned}&g({f}_{\alpha ^+, \alpha ^-}, \mathcal {D}_n) \,= \, y ^\mathrm {T}(\alpha ^+ - \alpha ^-) \nonumber \\&\quad - \,\frac{1}{2} (\alpha ^+ - \alpha ^-) ^\mathrm {T}\mathrm {K}_x\, (\alpha ^+ - \alpha ^-) - {\bar{\varepsilon }}\, (\alpha ^+ + \alpha ^-)^\mathrm {T}\mathbb {1}, \end{aligned}$$
(41)
subject to the (linear) constraints: \(\alpha ^+, \alpha ^- \, \in \, [\,0, \nicefrac {c}{n}\,]^n\) and \((\alpha ^+ - \alpha ^-) ^\mathrm {T}\mathbb {1} \, = \,0\). One can work directly with the quadratic dual objective, but then the confidence region will be constructed for \(\alpha ^+, \alpha ^-\). Since \(\alpha = \alpha ^+ - \alpha ^-\), the region could be mapped to a confidence region in the space of coefficient vectors. Alternatively, one can reformulate (41) directly for \(\alpha \) as
$$\begin{aligned} g({f}_{\alpha }, \mathcal {D}_n) \;= \; y ^\mathrm {T}\alpha - \frac{1}{2}\,\alpha ^\mathrm {T}\mathrm {K}_x\, \alpha - {\bar{\varepsilon }}\, \Vert \alpha \Vert _1, \end{aligned}$$
(42)
where \(\Vert \cdot \Vert _1\) is the 1-norm. A subgradient of (42) w.r.t. \(\alpha \) is given by
$$\begin{aligned} \nabla _{\!\alpha }\, g({f}_{\alpha }, \mathcal {D}_n) \; = \; y\, - \,\mathrm {K}_x\, \alpha \, -\, {\bar{\varepsilon }}\,\text{ sign }(\alpha ), \end{aligned}$$
(43)
where \(\text{ sign }(\cdot )\) denotes the signum function and it is understood component-wise.
Fig. 2

Exact, non-asymptotic, distribution-free confidence regions for ideal RKHS representations. Parts a and b show UQ for Kernel Ridge Regression (KRR) with \(\lambda = 0.1\) and \(\varepsilon \)-Support Vector Regression (\(\varepsilon \)-SVR) with \(c = 250\) and \({{\bar{\varepsilon }}} = 0.2\), respectively. The same data was used for both regression problems, namely, the true function was \(f_{*}(x) = x\, \sin (x)\), there were \(n = 20\) observations having i.i.d. Laplace noise with parameters \(\mu = 0\) (location) and \(b = \nicefrac {1}{2}\) (scale), and Gaussian kernels were applied with \(\sigma = \nicefrac {1}{2}\). Part a was built by the Sign-Perturbed Sums (SPS) method, (29), and formula (44) was used with sign-change matrices for part b. The confidence level of each color can be interpreted by using the scale bars. The regions are increasing, i.e., \(A_p \subseteq A_q\) if \(p \le q\), thus, only the smallest levels are shown

Then, building on the subgradient of the dual objective, i.e., (43), reference and perturbed evaluation functions can be defined, for \(i=0, \dots , m-1\), as
$$\begin{aligned} Z_i(\alpha ) \, \doteq \, \left\| \, G_i\,(\,y - \mathrm {K}_x\, \alpha \,) \,-\, {\bar{\varepsilon }}\,\text{ sign }(\alpha ) \,\right\| ^2, \end{aligned}$$
(44)
where \(G_0\) is the identity matrix and \(G_i\) is a (uniformly chosen) element of the applied compact transformation group, such as a diagonal matrix with \(\pm 1\) entries, for symmetric noises (or permutation matrices for exchangeable noises, etc.).
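
The evaluation functions (44) can be computed directly from the kernel matrix. The following sketch uses hypothetical names and a tiny made-up example, represents each sign-change matrix by its diagonal, and takes \(\text{ sign }(0) = 0\), which is one admissible subgradient choice at zero.

```python
def sign(v):
    """Signum with sign(0) = 0, one valid subgradient choice for |v| at zero."""
    return (v > 0) - (v < 0)

def svr_values(kmat, y, alpha, eps_bar, sign_list):
    """Return [Z_0, Z_1, ...] of (44); sign_list holds the diagonals of G_1, G_2, ..."""
    res = [yj - sum(k * a for k, a in zip(row, alpha)) for row, yj in zip(kmat, y)]
    shift = [eps_bar * sign(a) for a in alpha]
    values = []
    for g in [[1] * len(y)] + list(sign_list):       # G_0 is the identity
        values.append(sum((s * r - t) ** 2 for s, r, t in zip(g, res, shift)))
    return values

kmat = [[1.0, 0.5], [0.5, 1.0]]      # made-up kernel matrix
y = [1.0, -1.0]
alpha = [0.5, 0.0]
z_vals = svr_values(kmat, y, alpha, eps_bar=0.2, sign_list=[[-1, 1], [1, -1]])
```

Here the residual vector is \((0.5, -1.25)\), so \(Z_0 = 0.3^2 + 1.25^2 = 1.6525\), while flipping the first residual gives \(0.7^2 + 1.25^2 = 2.0525\). The returned values can then be ranked with \(\prec _{\pi }\) exactly as in Sect. 4.4.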

A numerical experiment illustrating the obtained family of confidence regions of the \(\varepsilon \)-SVR estimate for various confidence levels is shown in Fig. 2.

The same data sample was used for all regression models, to allow their comparison. The noise affecting the observations was Laplacian, thus heavy-tailed. Since the coefficient space is high-dimensional, and there is a one-to-one correspondence between coefficient vectors and kernel models, the confidence regions are mapped and shown in the model space, i.e., in the space of RKHS functions.

Note that it is meaningful to plot the confidence regions even for unknown input values, because the confidence regions are built for the ideal representation, which belongs to the chosen RKHS, unlike the underlying true function.

We can observe that the uncertainty of \(\varepsilon \)-SVR was higher than that of KRR, which can be explained as the price of using \(\varepsilon \)-insensitive loss. As the experiments with KLASSO show (cf. Fig. 3), the higher uncertainty of \(\varepsilon \)-SVR is not simply a consequence of sparse representations, as KLASSO also ensures sparsity. Naturally, the confidence regions are also influenced by the specific choice of hyper-parameters which should be taken into account when the confidence regions are compared.

5.4 Uncertainty quantification for kernelized LASSO

Our last example covers the LASSO (least absolute shrinkage and selection operator) method, which ensures sparsity via 1-norm regularization. Let us consider the kernelized version of LASSO with objective (Wang et al. 2007):
$$\begin{aligned} g(f_{\alpha }, \mathcal {D}_n) \; \doteq \; \frac{1}{2} \, \Vert \, y - \mathrm {K}_x\, \alpha \,\Vert ^2 \,+ \,\lambda \, \Vert \,\alpha \,\Vert _1, \end{aligned}$$
(45)
where \(\Vert \,\cdot \,\Vert _1\) is the L1 (or Manhattan) norm. Though function (45) cannot be written as \(\Vert \,z \,-\, {\Phi }\, \alpha \,\Vert ^2\), the proposed framework, i.e., construction (17)–(18), can still be applied. A sub-gradient of the KLASSO objective (45) is given by
$$\begin{aligned} \nabla _{\alpha }\, g(f_{\alpha }, \mathcal {D}_n) \, = \ \mathrm {K}_x(\mathrm {K}_x\, \alpha - y) \,+\, \lambda \,\text{ sign }(\alpha ), \end{aligned}$$
(46)
where the \(\text{ sign }(\cdot )\) function is applied component-wise. Then, using the construction of (17)–(18), the reference and perturbed functions can be defined as
$$\begin{aligned} Z_0(\alpha ) \; \doteq&\, \left\| \, \mathrm {K}_x\,(\mathrm {K}_x\, \alpha - y) \,+\, \lambda \,\text{ sign }(\alpha ) \,\right\| ^2, \end{aligned}$$
(47)
$$\begin{aligned} Z_i(\alpha ) \; \doteq&\, \left\| \, \mathrm {K}_x\, G_i\,(\mathrm {K}_x\, \alpha - y)\, +\, \lambda \,\text{ sign }(\alpha ) \,\right\| ^2, \end{aligned}$$
(48)
where \(\{G_i\}\) are from a suitable transformation group, e.g., diagonal matrices with Rademacher random variables as diagonal elements for symmetric noises.
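
A short sketch of (47) and (48) follows, again with hypothetical names, a made-up \(2 \times 2\) example, sign-change matrices stored as their diagonals, and the convention \(\text{ sign }(0) = 0\).

```python
def sign(v):
    """Signum with sign(0) = 0, one valid subgradient choice for |v| at zero."""
    return (v > 0) - (v < 0)

def klasso_values(kmat, y, alpha, lam, sign_list):
    """Return [Z_0, Z_1, ...] of (47)-(48) for the given sign-change diagonals."""
    diff = [sum(k * a for k, a in zip(row, alpha)) - yj        # K_x alpha - y
            for row, yj in zip(kmat, y)]
    values = []
    for g in [[1] * len(y)] + list(sign_list):                  # G_0 is the identity
        pert = [s * d for s, d in zip(g, diff)]                 # G_i (K_x alpha - y)
        grad = [sum(k * p for k, p in zip(row, pert)) + lam * sign(a)
                for row, a in zip(kmat, alpha)]
        values.append(sum(v * v for v in grad))
    return values

kmat = [[1.0, 0.5], [0.5, 1.0]]      # made-up kernel matrix
y = [1.0, -1.0]
alpha = [0.5, 0.0]
z_vals = klasso_values(kmat, y, alpha, lam=1.0, sign_list=[[-1, 1]])
```

For this example \(\mathrm {K}_x\, \alpha - y = (-0.5, 1.25)\), which gives \(Z_0 = 1.125^2 + 1^2 = 2.265625\) and, after flipping the first entry, \(Z_1 = 2.125^2 + 1.5^2 = 6.765625\).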

Numerical experiments illustrating the confidence regions we get for KLASSO are presented in Fig. 3. The figure also presents the confidence regions constructed by applying the standard Gaussian Process (GP) regression with estimated parameters. Note that the GP confidence regions are only approximate, namely, they do not come with strict finite-sample guarantees unless the noise is indeed Gaussian. Moreover, during our experiment the noise had a Laplace distribution, which has a heavier tail than Gaussians, therefore even if the true covariance of the noise was known, the confidence regions of GP regression would underestimate the uncertainty of the estimate (would be too optimistic), while the confidence regions of our framework are always non-conservative, independently of the particular distribution of the noise, assuming it has the necessary invariance.

Also note that for our method the noises can even have different (marginal) distributions for each input. Therefore, even though the confidence regions generated by GP are smaller than the ones our framework produces, the GP regions are imprecise and underestimate the uncertainty of the model, while ours come with strict finite-sample guarantees for a broad class of noises (e.g., symmetric ones).
Fig. 3

Exact, non-asymptotic, distribution-free confidence regions for ideal RKHS representations obtained using our framework and approximate confidence regions obtained by Gaussian Process (GP) regression (Rasmussen and Williams 2006). Part a shows UQ for Kernelized LASSO with \(\lambda = 1\), and part b shows UQ with GP. The applied transformations were sign-change matrices. The same data was used for both regression problems, namely, the true function was \(f_{*}(x) = x\, \sin (x)\), there were \(n = 20\) observations having i.i.d. Laplace noise with parameters \(\mu = 0\) (location) and \(b = \nicefrac {1}{2}\) (scale), and Gaussian kernels were applied with \(\sigma = 1\). The confidence level of each color can be interpreted by using the scale bars. The confidence regions are increasing, i.e., \(A_p \subseteq A_q\) if \(p \le q\), therefore, only the smallest levels are shown

6 Conclusions

In this paper we addressed the problem of quantifying the uncertainty of kernel estimates by using minimal distributional assumptions. The main aim was to measure the uncertainty of finding the (noise-free) ideal representation of the underlying (hidden) function at the available inputs. By building on recent developments in finite-sample system identification, we proposed a method that delivers exact, distribution-free confidence regions with strong finite-sample guarantees, based on the knowledge of some mild regularity of the measurement noises. The standard examples of such regularities are exchangeable or symmetric noise terms. Note that either of these properties in itself is sufficient for the theory to be applicable.

The needed statistical assumptions are very mild, as for example, no particular (parametric) family of distributions was assumed, no moment assumptions were made (the noises can be heavy-tailed, and may even have infinite variances); moreover, for the case of symmetric noises, it is allowed that each noise term affecting the observations has a different distribution, i.e., the noise can be nonstationary.

The core idea of the approach is to evaluate the uncertainty of the estimate by perturbing the residuals in the gradient of the objective function. The norms of the (potentially weighted) perturbed gradients are then compared to that of the unperturbed one, and a rank test is applied for the construction of the region.

The proposed method was also demonstrated on specific examples of kernel methods. Particularly, we showed how to construct exact, non-asymptotic, distribution-free confidence regions for least-squares support vector classification, kernel ridge regression, support vector regression and kernelized LASSO.

Several numerical experiments were presented, as well, demonstrating that the method provides meaningful regions even for heavy-tailed (e.g., Laplacian) noises. The figures illustrate whole families of confidence regions for various standard kernel estimates. Ellipsoidal outer approximations are also shown for LS-SVC. Additionally, the method was compared to Gaussian Process (GP) regression, and it was found that although the (approximate) GP confidence regions are in general smaller than our (exact) confidence sets, the GP regions are typically imprecise and underestimate the real uncertainty, e.g., if the noises are heavy-tailed.

Our approach to build non-asymptotic, distribution-free, non-conservative confidence regions for kernel methods can be a promising alternative to existing constructions, which archetypically either build on strong distributional assumptions or on asymptotic theories or only bound the error between the true and empirical risks. As our approach explicitly builds on the constructions of the underlying kernel methods, it can provide new insights on how the specific methods influence the uncertainty of the estimates, and therefore, besides being vital for risk management, it also has the potential to inspire refinements or new constructions.

There are several open questions about the framework which can facilitate future research directions. For example, finding efficient outer-approximations for cases when the objective function is not convex quadratic should be addressed. Also the consistency of the method should be studied to see whether the uncertainty decreases as the sample size tends to infinity. Finally, it would be interesting, as well, to extend the method to (stochastic) dynamical systems and to formally analyze the size and shape of the constructed regions in a finite-sample setting.

Footnotes

  1. We used the word “basically”, since there will also be some other random elements in the construction, e.g., for tie-breaking, and those should also constitute part of observation \(\gamma \).

Notes

Acknowledgements

Open access funding provided by MTA Institute for Computer Science and Control (MTA SZTAKI). This research was supported by the National Research, Development and Innovation Office (NKFIH), grant numbers ED_18-2-2018-0006, 2018-1.2.1-NKP-00008 and KH_17 125698. The authors are grateful to Algo Carè for the valuable discussions.

References

  1. Argyriou, A., & Dinuzzo, F. (2014). A unifying view of representer theorems. In International conference on machine learning (ICML), pp. 748–756.
  2. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
  3. Campi, M. C., & Weyer, E. (2005). Guaranteed non-asymptotic confidence regions in system identification. Automatica, 41(10), 1751–1764.
  4. Carè, A., Csáji, B. C., Campi, M., & Weyer, E. (2018). Finite-sample system identification: An overview and a new correlation method. IEEE Control Systems Letters, 2(1), 61–66.
  5. Csáji, B. C. (2016). Score permutation based finite sample inference for generalized autoregressive conditional heteroskedasticity (GARCH) models. In 19th international conference on artificial intelligence and statistics (AISTATS), Cadiz, Spain, pp. 296–304.
  6. Csáji, B. C., Campi, M. C., & Weyer, E. (2012). Sign-perturbed sums (SPS): A method for constructing exact finite-sample confidence regions for general linear systems. In 51st IEEE conference on decision and control, Maui, Hawaii, pp. 7321–7326.
  7. Csáji, B. C., Campi, M. C., & Weyer, E. (2015). Sign-perturbed sums: A new system identification approach for constructing exact non-asymptotic confidence regions in linear regression models. IEEE Transactions on Signal Processing, 63, 169–181.
  8. Csáji, B. C., & Weyer, E. (2015). Closed-loop applicability of the Sign-Perturbed Sums method. In 54th IEEE conference on decision and control (CDC). IEEE, pp. 1441–1446.
  9. Davies, P. L., Kovac, A., & Meise, M. (2009). Nonparametric regression, confidence regions and regularization. The Annals of Statistics, 37(5B), 2597–2625.
  10. DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics (4th ed.). London: Pearson Education.
  11. Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Boca Raton: CRC Press.
  12. Giné, E., & Nickl, R. (2015). Mathematical foundations of infinite-dimensional statistical models (Vol. 40). Cambridge: Cambridge University Press.
  13. Good, P. (2005). Permutation, parametric, and bootstrap tests of hypotheses. Berlin: Springer.
  14. Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36, 1171–1220.
  15. Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1), 82–95.
  16. Kolumbán, S. (2016). System identification in highly non-informative environment. Ph.D. thesis, Budapest University of Technology and Economics, Hungary, and Vrije Universiteit Brussel, Belgium.
  17. Li, K. C. (1989). Honest confidence regions for nonparametric regression. The Annals of Statistics, 17(3), 1001–1008.
  18. Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., & Ljung, L. (2014). Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3), 657–682.
  19. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Adaptive computation and machine learning. Cambridge, MA: MIT Press.
  20. Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In Annual conference on learning theory (COLT), Springer, pp. 416–426.
  21. Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
  22. Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
  23. Steinwart, I., & Christmann, A. (2008). Support vector machines. Berlin: Springer.
  24. Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
  25. Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
  26. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. Berlin: Springer.
  27. Wang, G., Yeung, D. Y., & Lochovsky, F. H. (2007). The kernel path in kernelized LASSO. In Proceedings of the eleventh international conference on artificial intelligence and statistics (AISTATS), pp. 580–587.
  28. Ye, J., & Xiong, T. (2007). SVM versus least squares SVM. In 11th international conference on artificial intelligence and statistics (AISTATS), pp. 644–651.
  29. Yu, Y., Cheng, H., Schuurmans, D., & Szepesvári, C. S. (2013). Characterizing the representer theorem. In International conference on machine learning, pp. 570–578.

Copyright information

© The Author(s) 2019

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. EPIC Centre of Excellence, MTA SZTAKI: Institute for Computer Science and Control, Hungarian Academy of Sciences, Budapest, Hungary
